Extracts the text from a PDF document and returns it as an XML document that is easy to query.
Type: Xml Output
The extracted PDF text encapsulated in an XML document.
Use this Node to extract text from a PDF document. The XmlText Property contains an XML document with a root Document element. The root contains one or more Page elements, each with a p attribute giving the zero-based page number. Each Page contains Row elements, each with a y attribute giving the Y coordinate in the PDF document. Each Row contains Text elements, each with an x attribute giving the X coordinate in the PDF document.
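As a rough illustration of the structure described above, the following Python sketch walks a small hand-written XmlText sample. The sample content, the coordinate values and the iter_text helper are assumptions made for this example only; the real output of the Node may format attribute values differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical XmlText sample: element and attribute names follow the
# description above, but the text and coordinate values are made up.
SAMPLE_XML = """\
<Document>
  <Page p="0">
    <Row y="720">
      <Text x="72">Invoice</Text>
      <Text x="400">No. 1001</Text>
    </Row>
    <Row y="700">
      <Text x="72">Total</Text>
      <Text x="400">42.00</Text>
    </Row>
  </Page>
</Document>
"""


def iter_text(xml_text):
    """Yield (page, y, x, text) tuples in document order."""
    root = ET.fromstring(xml_text)  # root is the Document element
    for page in root.findall("Page"):
        page_number = int(page.get("p"))  # zero-based page number
        for row in page.findall("Row"):
            y = float(row.get("y"))  # Y coordinate in PDF units
            for text in row.findall("Text"):
                x = float(text.get("x"))  # X coordinate in PDF units
                yield (page_number, y, x, text.text or "")


rows = list(iter_text(SAMPLE_XML))
for page_number, y, x, value in rows:
    print(page_number, y, x, value)
```

Flattening the document into tuples like this makes it straightforward to filter or group the text by page and position in later steps.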
PDF Document Coordinates
The origin of a PDF document page is the bottom-left corner, so Y values increase up the page (i.e. the Y coordinate is inverted compared with top-down coordinates). Coordinates are given in PDF units, where 1 PDF unit is 1/72 of an inch. The returned XmlText document is sorted by Y descending so that the document can be scanned from top to bottom. To pick specific text elements out of a PDF document, use XPath to match specific X and Y coordinates.
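The coordinate matching described above might look something like the following Python sketch. It uses the limited XPath support in the standard xml.etree.ElementTree module to select the page, then compares coordinates numerically with a small tolerance, since exact string matches on x and y depend on how the values are formatted. The sample XML, the find_text and units_to_inches helpers and the default tolerance are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XmlText sample; the coordinate values are made up.
SAMPLE_XML = """\
<Document>
  <Page p="0">
    <Row y="720">
      <Text x="72">Invoice</Text>
    </Row>
    <Row y="700">
      <Text x="72">Total</Text>
      <Text x="400">42.00</Text>
    </Row>
  </Page>
</Document>
"""


def find_text(xml_text, page, x, y, tolerance=1.0):
    """Return the text values on a page found within `tolerance` PDF units of (x, y)."""
    root = ET.fromstring(xml_text)
    matches = []
    # XPath predicate selects the page by its zero-based p attribute.
    for row in root.findall(f"./Page[@p='{page}']/Row"):
        if abs(float(row.get("y")) - y) > tolerance:
            continue
        for text in row.findall("Text"):
            if abs(float(text.get("x")) - x) <= tolerance:
                matches.append(text.text or "")
    return matches


def units_to_inches(units):
    """Convert PDF units to inches (1 PDF unit is 1/72 of an inch)."""
    return units / 72.0


total = find_text(SAMPLE_XML, page=0, x=400, y=700)
print(total, units_to_inches(72))
```

When the attribute values are known exactly, a pure XPath expression such as ./Page[@p='0']/Row[@y='700']/Text[@x='400'] can be used instead, but the tolerance-based comparison is more forgiving of small differences in layout between documents.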
This Node is based on the fantastic MIT-licensed PDFsharp library (http://www.pdfsharp.net).
If you make frequent use of this Node, please consider donating to Empira at http://www.pdfsharp.net/Donate.ashx