New! A test version of the GraphWrap prototype can now be downloaded here (ZIP, 20 MB). Instructions are available here in English (PDF) and in German (PDF).
Imagine that you have a large amount of data in one or more PDF files, presented in a consistent format: product specifications, measurements, prices or contact information, for example. To make this data amenable to machine processing, it must first be extracted into a structured format such as XML or a relational database. This is a challenging task, as most PDF files lack the structuring information that would allow us to locate the individual data instances.
GraphWrap, which is currently at prototype stage, allows a non-expert user to create wrappers (programs that perform this kind of extraction) for almost any PDF file in an intuitive and interactive manner. After selecting an example instance in the document, a few clicks on the graph representation to set conditions and choose which data items to extract are usually all that is required. The resulting wrapper can then be run on other pages and documents that exhibit a similar visual structure. A screenshot of the system is shown below.
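To make the idea of a wrapper more concrete, here is a minimal sketch in Java. It is not GraphWrap's actual data model or API: the `WrapperSketch` class, the `Block` type, the `extract` method and the `LINE_TOLERANCE` value are all hypothetical names chosen for illustration. In place of matching on a graph, the sketch uses a single geometric condition, "take the nearest text block to the right of a given label on the same line", and applies it to every matching label on a page.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not GraphWrap's actual data model or API. */
public class WrapperSketch {

    /** A text fragment together with the position of its top-left corner on the page. */
    static final class Block {
        final String text;
        final double x;
        final double y;

        Block(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed vertical tolerance (in page units) for treating two blocks as being on the same line. */
    static final double LINE_TOLERANCE = 2.0;

    /**
     * Hypothetical wrapper condition: for every block whose text equals the label,
     * return the text of the nearest block to its right on the same line.
     */
    static List<String> extract(List<Block> page, String label) {
        List<String> values = new ArrayList<>();
        for (Block labelBlock : page) {
            if (!labelBlock.text.equals(label)) {
                continue;
            }
            Block nearest = null;
            for (Block candidate : page) {
                boolean sameLine = Math.abs(candidate.y - labelBlock.y) <= LINE_TOLERANCE;
                boolean toTheRight = candidate.x > labelBlock.x;
                if (sameLine && toTheRight && (nearest == null || candidate.x < nearest.x)) {
                    nearest = candidate;
                }
            }
            if (nearest != null) {
                values.add(nearest.text);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // A made-up page with two product entries laid out in the same way.
        List<Block> page = new ArrayList<>();
        page.add(new Block("Widget A", 50, 100));
        page.add(new Block("Price:", 50, 120));
        page.add(new Block("19.99", 120, 120));
        page.add(new Block("Widget B", 50, 200));
        page.add(new Block("Price:", 50, 220));
        page.add(new Block("24.50", 120, 221));

        // Emit each extracted value as a simple XML element.
        for (String price : extract(page, "Price:")) {
            System.out.println("<price>" + price + "</price>");
        }
    }
}
```

Running `main` on the made-up page prints one `<price>` element per product entry. Because the condition refers only to relative layout, the same rule carries over to any other page on which prices are placed to the right of a "Price:" label.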
This prototype was presented at CeBIT, at the stand of the Austrian Computer Society, from 3 to 5 March 2009. The handouts that accompanied the presentation, which include instructions for use, can be downloaded here in English (PDF) or German (PDF).
A test version of GraphWrap can be downloaded here (ZIP, 20 MB). As GraphWrap has been developed in Java, it runs on a variety of platforms. Batch files (for Windows) and shell scripts (for Linux/Unix) are provided to help you get started.
This version may be used free of charge for academic and non-commercial use, as well as for evaluation purposes in a commercial environment. To obtain a licence for commercial use, please contact me.
More detailed instructions for the GraphWrap prototype are available in English (PDF) and German (PDF). If you have any further questions, please do not hesitate to send me an e-mail.