Wrapping PDF Documents
HTML and XML are not the only formats used to publish and exchange textual
information and make it accessible to companies and private users. A
related class of textual documents uses print-oriented formats, among which
Adobe PDF (Portable Document Format) is the de facto standard. Languages
for semistructured text such as HTML describe a document as an organization
of logically structured chunks of information (e.g. page frames,
paragraphs, item lists). Print-oriented languages, in contrast, specify
details at the grapheme level and are designed to describe and drive the
printing of the document.
In particular, a PDF document consists of a collection of objects spread
across its pages. Every page in a document is described by a PDF content
stream, which contains text passages, images and graphical objects, and
optionally dynamic objects such as hyperlinks, bookmarks, and attachments.
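To make the content-stream description concrete, the following is a minimal Python sketch, not a full PDF parser. It assumes an uncompressed content stream that uses literal strings only, and it interprets just four standard PDF operators (BT, Tf, Td, Tj). The names TextRun, TOKEN, SAMPLE and extract_text_runs are hypothetical and introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class TextRun:
    text: str
    font: str    # font resource name as written in the stream, e.g. "/F1"
    size: float  # font size in points
    x: float     # position of the run in text space
    y: float


# One token per match: a literal string "(...)", a name "/F1",
# a number, or an operator keyword. Simplification: no nested parentheses.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/[^\s()/]+|[-+]?\d*\.?\d+|[A-Za-z*]+")

# Hypothetical, hand-written content stream for a single page.
SAMPLE = """
BT
/F1 24 Tf
72 720 Td
(Wrapping PDF Documents) Tj
/F2 11 Tf
0 -36 Td
(HTML and XML are not the only formats) Tj
ET
"""


def extract_text_runs(stream: str) -> list[TextRun]:
    """Collect the strings shown by Tj together with font and position."""
    runs: list[TextRun] = []
    operands: list[str] = []
    font, size = "", 0.0
    x = y = 0.0
    for tok in TOKEN.findall(stream):
        if tok == "BT":            # begin text object: position resets
            x = y = 0.0
            operands.clear()
        elif tok == "Tf":          # select font and size
            font, size = operands[-2], float(operands[-1])
            operands.clear()
        elif tok == "Td":          # move relative to the start of the current line
            x += float(operands[-2])
            y += float(operands[-1])
            operands.clear()
        elif tok == "Tj":          # show a string at the current position
            runs.append(TextRun(operands[-1][1:-1], font, size, x, y))
            operands.clear()
        elif tok[0] in "(/+-." or tok[0].isdigit():
            operands.append(tok)   # operand: keep it for the next operator
        else:
            operands.clear()       # operator this sketch does not interpret
    return runs


for run in extract_text_runs(SAMPLE):
    print(run)
```

The result is a flat list of positioned strings with font data. Nothing in the stream says that the first run is a title and the second is body text; that role has to be inferred.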
The intrinsically print-oriented nature of PDF documents raises several
issues that make information extraction particularly difficult. A first
challenge is the lack of explicit information about both the structure and
the presentation of the contents of a PDF document. For example, a table of
contents is often not available, so no information is given on how the
document is organized into chapters, sections and so on. Likewise, the
layout role of a text portion (e.g. document title, section header, table
or figure caption) cannot be easily recognized. Moreover, the many
available PDF content stream generators interpret the PDF format
differently. A second challenge is therefore the ability to deal with
documents that provide the same kind of information but exhibit different
layouts depending on their generating sources. For example, distinct
departments of the same company may produce differently formatted reports.
Finally, the subjectivity of the formatting styles that characterize even
thematically similar PDF documents may lead to uncertainty in specifying
the syntactic extraction rules to be defined in a wrapper.
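One way to picture uncertain extraction rules is to attach a degree of confidence to each rule instead of treating it as a hard constraint. The Python sketch below is only an illustration under that assumption and is not the method of the publications listed on this page. The names Run, Rule, RULES, classify and best_label, as well as all thresholds and degrees, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Run:
    text: str
    size: float  # font size in points
    bold: bool


@dataclass(frozen=True)
class Rule:
    label: str
    degree: float                  # hypothetical confidence in [0, 1]
    matches: Callable[[Run], bool]


# Hypothetical soft rules: several may fire for the same text run.
RULES = [
    Rule("title", 0.8, lambda r: r.size >= 18),
    Rule("title", 0.5, lambda r: r.bold and len(r.text.split()) <= 8),
    Rule("section_header", 0.6, lambda r: r.bold and r.size < 18),
    Rule("caption", 0.7, lambda r: r.text.startswith(("Figure", "Table"))),
    Rule("body", 0.3, lambda r: True),
]


def classify(run: Run) -> dict[str, float]:
    """Return, for each label, the highest degree among the rules that fire."""
    scores: dict[str, float] = {}
    for rule in RULES:
        if rule.matches(run):
            scores[rule.label] = max(scores.get(rule.label, 0.0), rule.degree)
    return scores


def best_label(run: Run) -> str:
    """Pick the label with the highest degree."""
    scores = classify(run)
    return max(scores, key=scores.get)


# The same heading as two departments might format it.
large_plain = Run("Annual Report", 24.0, False)
small_bold = Run("Annual Report", 14.0, True)
print(best_label(large_plain), classify(large_plain))
print(best_label(small_bold), classify(small_bold))
```

With these made-up degrees the large, plain heading is labelled a title, while the small, bold variant is labelled a section header with "title" as a close second. Keeping the competing scores, rather than only the winning label, lets a wrapper defer the decision until more context is available.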
| Members | |
| Sergio Flesca | DEIS, University of Calabria |
| Salvatore Garruzzo | DIMET, University of Reggio Calabria |
| Elio Masciari | ICAR-CNR, Institute of Italian National Research Council |
| Andrea Tagarelli | DEIS, University of Calabria |
| Publications |
| S. Flesca, S. Garruzzo, E. Masciari, A. Tagarelli. Wrapping PDF Documents Exploiting Uncertain Knowledge. 18th Conference on Advanced Information Systems Engineering (CAiSE ’06). Luxembourg, June 5-9, 2006. To appear. |
| S. Flesca, S. Garruzzo, E. Masciari, A. Tagarelli. Wrapping PDF Documents: A Preliminary Study. 13th Italian Symposium on Advanced Database Systems (SEBD ’05), pp. 272-283. Brixen-Bressanone, Italy, June 20-22, 2005. |