PDF Parser

Parses a PDF document and returns its text as an XML document that can be queried easily.

Properties

Document

Type: File Input
The PDF document stream as a byte array (byte). If the PDF originates from a file, use the File Node to retrieve the file as a byte array.

XmlText

Type: Xml Output
The extracted PDF text encapsulated in an XML document.

Remarks

Use this Node to extract text from a PDF document. The XmlText Property contains an XML document with a root Document element, which holds one or more Page elements. Each Page element has a p attribute indicating the page number (zero-based). Within each Page element are Row elements, each with a y attribute giving the PDF document Y coordinate. Within each Row element are Text elements, each with an x attribute giving the PDF document X coordinate.
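
As an illustration of the structure described above, the following Python sketch walks a Document/Page/Row/Text tree with the standard library. The sample XML is hypothetical: the element and attribute names come from the description above, but the coordinate values, the text, and the assumption that each Text element carries its string as element content are invented for the example. The helper name iter_text is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical XmlText output. Names follow the Remarks section;
# the values and exact number formatting are assumptions.
SAMPLE_XML = """
<Document>
  <Page p="0">
    <Row y="720">
      <Text x="72">Invoice</Text>
      <Text x="400">No. 1001</Text>
    </Row>
    <Row y="700">
      <Text x="72">Total</Text>
      <Text x="400">42.00</Text>
    </Row>
  </Page>
</Document>
"""

def iter_text(xml_text):
    """Yield (page, y, x, text) tuples in document order."""
    root = ET.fromstring(xml_text)  # root is the Document element
    for page in root.findall("Page"):
        page_no = int(page.get("p"))  # zero-based page number
        for row in page.findall("Row"):
            y = float(row.get("y"))
            for text in row.findall("Text"):
                yield page_no, y, float(text.get("x")), text.text or ""

items = list(iter_text(SAMPLE_XML))
for page_no, y, x, text in items:
    print(page_no, y, x, text)
```

Flattening the tree into tuples like this can make it easier to filter or sort the text before applying any further matching logic.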

PDF Document Coordinates

The origin of a PDF document page is the bottom-left corner, so the Y coordinate increases up the page (i.e. it is inverted relative to reading order). Coordinates are given in PDF units, where 1 PDF unit is 1/72 of an inch. The XmlText document is sorted by Y descending so that it can be scanned top to bottom. To pick specific text elements out of a PDF document, use XPath to match on the X and Y coordinates.
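
A minimal sketch of coordinate matching, assuming the same hypothetical XML shape as described in the Remarks. It uses the limited XPath subset in Python's xml.etree.ElementTree to select rows on a page, then compares coordinates numerically with a small tolerance, since exact string matches on x and y can be brittle. The function names, the tolerance value, and the 792-unit page height (US Letter) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

PDF_UNITS_PER_INCH = 72  # 1 PDF unit is 1/72 of an inch

# Hypothetical XmlText output; rows appear in descending Y order.
SAMPLE_XML = """
<Document>
  <Page p="0">
    <Row y="720">
      <Text x="72">Invoice</Text>
    </Row>
    <Row y="700">
      <Text x="72">Total</Text>
      <Text x="400">42.00</Text>
    </Row>
  </Page>
</Document>
"""

def find_text(xml_text, page, x, y, tolerance=1.0):
    """Return the text values on a page that lie within tolerance of (x, y)."""
    root = ET.fromstring(xml_text)
    matches = []
    # XPath predicate selects the page by its zero-based p attribute.
    for row in root.findall(f"./Page[@p='{page}']/Row"):
        if abs(float(row.get("y")) - y) > tolerance:
            continue
        for text in row.findall("Text"):
            if abs(float(text.get("x")) - x) <= tolerance:
                matches.append(text.text or "")
    return matches

def inches_from_top(y, page_height=792.0):
    """Convert a bottom-origin Y coordinate to inches below the top edge.

    page_height is an assumption (792 units is a US Letter page).
    """
    return (page_height - y) / PDF_UNITS_PER_INCH

total = find_text(SAMPLE_XML, page=0, x=400, y=700)
offset = inches_from_top(720)
print(total, offset)
```

When the coordinates are known exactly as they appear in the output, a pure XPath expression such as ./Page[@p='0']/Row[@y='700']/Text[@x='400'] would also work; the tolerance-based comparison is simply more forgiving of small layout differences between documents.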

Acknowledgements

This Node is based on the fantastic MIT-licensed PDFsharp library (http://www.pdfsharp.net).

If you make frequent use of this Node, please consider donating to Empira at http://www.pdfsharp.net/Donate.ashx