This page currently contains information about my third-year project on PDF to HTML conversion, which was written in Summer 2003 as part of my degree in Computer Science at the University of Warwick.
These libraries are available here:
The original aim of this project was to convert PDF files into HTML as accurately as possible. After investigating the existing converters that were available, it was found that they all took this approach, and that the results were often unsatisfactory.
To avoid repeating work that had already been done, and to improve upon the situation, the project's aims moved towards "intelligent" text extraction: extracting textual data from a PDF file, which may include columns, images and other features, and creating a "clean" HTML file with all the text from the PDF, but without the original layout.
The implementation of the project has resulted in a program that can process fairly complex page layouts, including columns, and output the text in HTML with CSS, retaining formatting information where possible.
The program works solely by analysing the text blocks within the document. Suggestions for further improvements, such as dealing with graphics, are given in the final report.
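As a rough illustration of what working purely from text blocks can look like, here is a minimal Java sketch. It is not the code from the project: the class `ColumnOrderSketch`, the `TextBlock` type and the page coordinates (with y growing downwards) are all assumptions made for the example. It groups blocks into columns by merging overlapping horizontal extents, reads each column from top to bottom, and writes each block out as an HTML paragraph.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from the project's source.
class ColumnOrderSketch {

    /** A positioned run of text; y is assumed to grow downwards. */
    static final class TextBlock {
        final double left, top, width;
        final String text;

        TextBlock(double left, double top, double width, String text) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.text = text;
        }

        double right() { return left + width; }
    }

    /** Groups blocks into columns by merging overlapping horizontal extents. */
    static List<List<TextBlock>> groupIntoColumns(List<TextBlock> blocks) {
        List<TextBlock> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator.comparingDouble((TextBlock b) -> b.left));

        List<List<TextBlock>> columns = new ArrayList<>();
        double columnRight = Double.NEGATIVE_INFINITY;
        for (TextBlock b : sorted) {
            if (columns.isEmpty() || b.left >= columnRight) {
                // No horizontal overlap with the current column: start a new one.
                columns.add(new ArrayList<>());
                columnRight = b.right();
            } else {
                columnRight = Math.max(columnRight, b.right());
            }
            columns.get(columns.size() - 1).add(b);
        }
        // Within each column, read from top to bottom.
        for (List<TextBlock> column : columns) {
            column.sort(Comparator.comparingDouble((TextBlock b) -> b.top));
        }
        return columns;
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one paragraph per block, column by column, left to right. */
    static String toHtml(List<TextBlock> blocks) {
        StringBuilder html = new StringBuilder();
        for (List<TextBlock> column : groupIntoColumns(blocks)) {
            for (TextBlock b : column) {
                html.append("<p>").append(escape(b.text)).append("</p>\n");
            }
        }
        return html.toString();
    }

    public static void main(String[] args) {
        List<TextBlock> page = new ArrayList<>();
        page.add(new TextBlock(300, 50, 200, "Right column, first paragraph"));
        page.add(new TextBlock(50, 200, 200, "Left column, second paragraph"));
        page.add(new TextBlock(50, 50, 200, "Left column, first paragraph"));
        page.add(new TextBlock(300, 200, 200, "Right column, second paragraph"));
        System.out.print(toHtml(page));
    }
}
```

A sketch this simple has obvious limits: a full-width heading overlaps every column and would merge them into one, so a real layout analysis needs to treat such blocks separately.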
The final report is available here in PDF format (2.2 MB).
The two program files are available for download:
Ensure that both java and javac are in your path. To compile the program, copy both Java files and both libraries into a directory. From within that directory type:
The following historical documents are also available for download in PDF format:
If you have any further questions or suggestions etc., please feel free to get in touch.