Web data extraction and integration

Number and Type: 181130 VU WS 2006/07

Lecturer: Robert Baumgartner
Keywords: Information Extraction, Approaches and methods for Wrapper Generation, Web Querying, Integration, XML.
Preliminary Meeting: Thursday 5th of October, 9:00 (s.t.), Zemanek Hörsaal, Favoritenstrasse 11
Registration: until 4th of October via TUWIS (limited number of participants). Please de-register in TUWIS if you decide not to take the course. I will try to accommodate as many participants as possible, but sooner or later I will have to limit the number.
Language: Slides in English; the lecture language depends on whether non-German-speaking students from the Computational Logic program join
Timetable: about every other Thursday 10:15-12:15 (on the first session, 19th of October, we start at 9:15)
Procedure: Lecture coupled with exercises and group work; two exercise evaluation slots on the lecture dates: one at 9:00 (groups 1-8, 16), one at 12:30 (groups 9-15, 17-18); lecture at 10:15 (on the first lecture day at 9:15)
Topics:
  • Information Extraction: Setting, History, IE vs. IR
  • Structured Data Extraction and Wrapping
  • XML Transformation and Query Languages (in particular XPath and XSLT, with a very brief look at XQuery)
  • Web Wrapper Languages
  • Wrapper Generation Approaches
  • Inductive Wrapper Generation: Machine Learning on Strings/Trees, Tree Edit Distances
  • Automatic Data Extraction / Web Data Mining
  • Supervised Wrapper Generation
  • Deep Web Navigation Approaches
  • Data Extraction from PDF documents
  • Mediation and Integration Approaches
  • Web Data Cleaning
  • Lixto Visual Wrapper and Transformation Server
Fields of Study: This VU is a compulsory or compulsory elective course in some bachelor and master studies. It is also part of the re-designed KfK Semantic Web and of the European Master's Program in Computational Logic.


Structure of the lecture and slides

Exercise evaluations for G1-G8 and G16 and for G9-G15, G17-G18 are held in Sem.room 1842 unless noted otherwise.

Preliminary Meeting
  • Lecture time (all groups): 5.10. 9:00-10:00
  • Lecture location: Zemanek HS

Session 1
  • Topics: Motivation/History IE, XPath, XSLT
  • Slides: 2 | 6 | Exercises | Resources
  • Lecture time (all groups): 19.10. 9:15-12:15
  • Lecture location: Vortmann HS

Session 2 + Exercise Evaluation
  • Topics: Approaches, Methods and Tools for Wrapper Generation
  • Slides: 2 | 6 | Exercises | Resources
  • Lecture time (all groups): 9.11. 10:15-12:15
  • Lecture location: Zemanek HS

Session 3 + Exercise Evaluation
  • Topics: Lixto Visual Wrapper and Transformation Server
  • Slides: 2 | 6 | Exercises
  • Lecture time (all groups): 16.11. 10:15-12:15
  • Lecture location: Zemanek HS

Session 4 + Exercise Evaluation
  • Topics: Web Data Cleaning, Mediation and Integration
  • Slides: 2 | 6 | Exercises see below
  • Lecture time (all groups): 30.11. 10:15-12:15
  • Lecture location: Zemanek HS

Session 5 + Exercise Evaluation
  • Topics: Inductive Wrapper Generation, Web Content Mining, Extr. from PDF
  • Slides | Exercises see below
  • Lecture time (all groups): 14.12. 10:00-11:30 (*)
  • Lecture location: Zemanek HS
  • G1-G8 and G16 Exercises: 9:00-10:00, Zemanek room (*)
  • G9-G15, G17-G18 Exercises: Zemanek room (*)

Session 6 + Exercise Evaluation
  • Topics: Extraction Workflows, Meta Search Concepts, Extr. on Visual Rendition
  • Slides: 2 | 6
  • Lecture time (all groups): 11.1. 10:15-12:15
  • Lecture location: Seminar room 1842 (#)

Group Presentations
  • Seminar room 1842

Exercise Sheets 4 and 5 and Group Project Topics

Reference Solutions and Remarks on Exercise Sheets 1 and 2

Group Projects: Presentation and Paper Downloads
Group Projects Unit 1: G1 (Web Mining), G3 (Stylus Studio), G4 (Protege), G5 (XMLDBMS), G7 (Castor), G8 (OpenKapow), G16 (MapForce)
Group Projects Unit 2: G9 (XMLDBMS), G10 (Gate), G11 (Protege), G13 (Castor), G15 (MetaSearch), G17 (Web Mining), G18 (MapForce)

(*) Note: On the 14th of December, due to time constraints of the lecturer, both exercise slots are merged into a single session, which will be held in the Zemanek room from 9:00 to 10:00. Participants of Slot B who cannot attend at this time are excused. Each group will also get e-mail feedback on their brainstorming/content slide set. The lecture has to start a bit earlier than usual, at 10:00. Everyone is encouraged to attend the lecture, which features very interesting talks by PhD students.

(#) Note: Location changed! Everyone is encouraged to attend the lecture, which features very interesting talks by PhD students.

Robert Baumgartner, last modified on 8/2/2007