Wrapping PDF Documents

 

HTML and XML are not the only formats used to distribute and exchange textual information and make it accessible to companies and private users. A further, related kind of textual document is written in print-oriented formats, among which Acrobat PDF (Portable Document Format) is the de facto standard. Languages for semistructured text such as HTML describe a document through logically structured chunks of information (e.g. page frames, paragraphs, item lists). Print-oriented languages, in contrast, specify details at the grapheme level and are designed to describe and drive the printing of the document.
In particular, a PDF document consists of a collection of objects spread across its pages. Every page is described by a PDF content stream, which contains text passages, images and graphical objects, and optionally dynamic objects such as hyperlinks, bookmarks, and attachments.
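To illustrate what "print-oriented" means in practice, the following minimal Python sketch models a page as a bag of positioned text tokens with no logical roles attached, and recovers reading order from coordinates alone. The `TextToken` class, the `reading_order` function, the `line_tolerance` parameter and the sample coordinates are all hypothetical names and values chosen for illustration. They are not part of the PDF format or of the tool described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToken:
    """A piece of text as a print-oriented format might expose it (assumed model)."""
    text: str
    x: float          # left edge, in points
    y: float          # baseline, in points measured from the bottom of the page
    font_size: float  # in points


def reading_order(tokens, line_tolerance=2.0):
    """Group tokens into lines by baseline proximity, then order them left to right.

    Lines are returned top to bottom. A token joins the current line when its
    baseline is within `line_tolerance` points of the line's first token.
    """
    lines = []
    for tok in sorted(tokens, key=lambda t: -t.y):
        if lines and abs(lines[-1][0].y - tok.y) <= line_tolerance:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [
        " ".join(t.text for t in sorted(line, key=lambda t: t.x))
        for line in lines
    ]


# Tokens arrive in drawing order, which need not match reading order.
page = [
    TextToken("Sheet", 130.0, 700.0, 18.0),
    TextToken("Balance", 72.0, 700.0, 18.0),
    TextToken("1,250", 300.0, 650.5, 10.0),
    TextToken("Revenue", 72.0, 650.0, 10.0),
]

for line in reading_order(page):
    print(line)
```

Nothing in the tokens says that the first line is a title or that the second is a table row. Only position and font size are available, which is why the logical structure has to be inferred.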

The intrinsically print-oriented nature of PDF documents raises several issues that make information extraction particularly difficult. A first challenge is the lack of explicit information about both the structure and the presentation of the contents of a PDF document. For example, a table of contents is often not available, so no information is given on how the document is organized into chapters, sections and so on. Likewise, the layout role of a text portion (e.g. document title, section header, table or figure caption) cannot be easily recognized. Moreover, the many available PDF content stream generators interpret the PDF format differently. A second challenge is therefore dealing with documents that provide the same kind of information but exhibit different layouts depending on the source that generated them. For example, distinct departments of the same company may produce differently formatted reports. Finally, the subjective formatting styles that characterize even thematically similar PDF documents may lead to uncertainty in specifying the syntactic extraction rules of a wrapper.

 
The problem of extracting information from PDF documents has not been investigated to date. We address the problem of wrapping PDF documents by proposing a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because of the intrinsic uncertainty about the structure and presentation of PDF documents, we express constraints on token groupings as fuzzy logic conditions. We have defined a formal semantics for PDF wrappers and developed a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
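As a rough illustration of what a fuzzy condition on a token grouping could look like, the Python sketch below scores how well a group of tokens fits a hypothetical "title" group type. It is only a sketch under stated assumptions, not the formal semantics or the evaluation algorithm of the wrapper described above. The membership functions, the thresholds, the use of `min` as fuzzy conjunction, and every name in the code are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A text token with the layout features used below (assumed model)."""
    text: str
    y: float          # baseline, in points measured from the bottom of the page
    font_size: float  # in points


def ramp(value, low, high):
    """Piecewise-linear membership: 0 at or below `low`, 1 at or above `high`."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def is_large(tok):
    # Assumed thresholds: 10 pt is body text, 18 pt is clearly a heading.
    return ramp(tok.font_size, 10.0, 18.0)


def is_near_top(tok):
    # Assumed thresholds for a page about 800 pt tall.
    return ramp(tok.y, 400.0, 720.0)


def title_degree(tok):
    # Fuzzy AND taken as the minimum of the two membership degrees.
    return min(is_large(tok), is_near_top(tok))


def group_degree(tokens, condition):
    # A group satisfies a condition only as much as its weakest member does.
    return min(condition(t) for t in tokens)


candidate = [Token("Balance", 760.0, 18.0), Token("Sheet", 760.0, 14.0)]
body = [Token("Revenue", 400.0, 10.0)]

print(group_degree(candidate, title_degree))  # partial match
print(group_degree(body, title_degree))       # no match
```

Graded degrees let a wrapper rank competing groupings instead of rejecting every document whose title is slightly smaller or lower than a crisp rule would require.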
 
A system prototype is in an advanced phase of development and is currently being applied to extract information from company balance sheets.
Get a demo version of the PDF wrapping tool
 
 
Members  
Sergio Flesca DEIS, University of Calabria
Salvatore Garruzzo DIMET, University of Reggio Calabria
Elio Masciari ICAR-CNR, Institute of Italian National Research Council
Andrea Tagarelli DEIS, University of Calabria

 

Publications
S. Flesca, S. Garruzzo, E. Masciari, A. Tagarelli. Wrapping PDF Documents Exploiting Uncertain Knowledge. 18th Conference on Advanced Information Systems Engineering (CAiSE ’06). Luxembourg, June 5-9, 2006. To appear.
S. Flesca, S. Garruzzo, E. Masciari, A. Tagarelli. Wrapping PDF Documents: A Preliminary Study. 13th Italian Symposium on Advanced Database Systems (SEBD ’05), pp. 272-283. Brixen-Bressanone, Italy, June 20-22, 2005.