OpenAI Document Parser

Parses documents using OpenAI's API to extract structured data based on a provided JSON schema. This node is useful for converting unstructured or semi-structured text documents into validated, machine-readable JSON.

This parser is best suited for ad hoc processing or low-volume workflows where minimal setup is preferred. It uses OpenAI's general-purpose language model to infer structure, which makes it flexible but typically slower per document than a purpose-trained model.

For high-volume use cases, or scenarios involving recurring document formats (e.g., invoices, forms, or reports), consider using the DI Pdf Reader node instead. The DI Pdf Reader integrates with Azure Document Intelligence, allowing you to train custom models or use prebuilt templates for fast, reliable PDF parsing at scale.

Revision History

1.0.0.0 Initial Release

Properties

Connection

Type: Connection Input
Provides the OpenAI API key and model configuration used for all requests.

Action

Type: Enum Input
Determines the operation to perform. Options include:

  • CreateSchema: Generate a sample JSON schema from the input file.
  • CreateValidationScript: Generate a C# validation script that can be used to enforce additional rules on parsed JSON values.
  • ParseDocument: Parse a document into structured JSON using the provided schema and optional validation rules.

Schema

Type: JSON Input
The JSON schema describing the expected structure of the parsed document. Required for ParseDocument and CreateValidationScript.
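As an illustration of what the Schema input might contain, the sketch below builds a small invoice schema and serializes it to JSON text. The field names (invoice, invoiceNumber, invoiceDate, totalAmount, lineItems) are hypothetical, and the sketch assumes the node accepts standard JSON Schema keywords such as type, properties, and required, which this page does not confirm.

```python
import json

# Hypothetical schema for a simple invoice. None of these field names
# come from the node itself; adjust them to match your documents.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {
                "invoiceNumber": {"type": "string"},
                "invoiceDate": {"type": "string"},
                "totalAmount": {"type": "number"},
                "lineItems": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "amount": {"type": "number"},
                        },
                        "required": ["description", "amount"],
                    },
                },
            },
            "required": ["invoiceNumber", "totalAmount"],
        }
    },
    "required": ["invoice"],
}

# Serialize to JSON text, the form a JSON input field would expect.
schema_text = json.dumps(INVOICE_SCHEMA, indent=2)
print(schema_text)
```

A schema produced by the CreateSchema action could be refined in the same way, for example by adding required lists for fields that must always be present.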

ValidationScript

Type: Multiline Text Input
Optional C# code used to validate the parsed output beyond schema validation. This is run after parsing is complete.
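The script itself is written in C#, and this page does not document its entry point or how the parsed JSON is exposed to it. The sketch below therefore only illustrates the kind of rule that goes beyond schema validation, shown in Python: a cross-field check that the invoice total equals the sum of its line items. The function name and the field names are hypothetical.

```python
from decimal import Decimal


def check_invoice_totals(parsed):
    """Return a list of error messages; an empty list means the rules passed."""
    errors = []
    invoice = parsed.get("invoice", {})

    # Convert through str() so float inputs such as 25.5 become exact decimals.
    total = Decimal(str(invoice.get("totalAmount", 0)))
    line_sum = sum(
        (Decimal(str(item["amount"])) for item in invoice.get("lineItems", [])),
        Decimal("0"),
    )

    if total < 0:
        errors.append("totalAmount must not be negative")
    if line_sum != total:
        errors.append(f"line items sum to {line_sum}, but totalAmount is {total}")
    return errors


sample = {
    "invoice": {
        "totalAmount": 125.5,
        "lineItems": [
            {"description": "Widget", "amount": 100.0},
            {"description": "Shipping", "amount": 25.5},
        ],
    }
}
print(check_invoice_totals(sample))  # prints []
```

A schema can state that totalAmount is a number, but not that it agrees with other fields; rules of this kind are what a validation script is suited for.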

DocumentName

Type: String Input
The name of the document being parsed, including file extension (e.g., invoice.pdf).

Document

Type: File Input
The file content of the document to parse. This must be provided for all actions.

Response

Type: JSON Output
The parsed output returned from OpenAI, formatted as structured JSON.

Remarks

Actions

CreateSchema

Generates a sample JSON schema for the document by requesting an inferred structure from OpenAI. This can be used as a starting point to build a more detailed schema manually.

CreateValidationScript

Generates a C# script that performs basic field validation based on the structure and types defined in your JSON schema. You can edit or extend this script manually if needed.

ParseDocument

Uses the provided Schema and Document to request a structured output from OpenAI.
After parsing, the following validations are applied in order:

  1. The JSON is validated against the provided Schema.
  2. If CustomProperties are supplied, each entry is asserted against the result. Each key is a dot-path (e.g., invoice.totalAmount) that targets a property in the parsed JSON, and the value found at that path must match the supplied value exactly. This is useful for verifying expected values such as invoice numbers, dates, or totals.
  3. If a ValidationScript is provided, it is executed after schema and property validation to enforce additional custom rules.

Example Usage

This node is often used in workflows that process uploaded documents such as invoices, orders, or forms. A typical workflow involves:

  1. Receiving a document from a file input or storage node
  2. Passing the document to OpenAI Document Parser with a defined schema
  3. Validating and using the structured JSON in downstream business logic

Comparison: OpenAI Document Parser vs DI Pdf Reader

Feature / Criteria | OpenAI Document Parser | DI Pdf Reader
------------------ | ---------------------- | -------------
Setup Required | Minimal: schema and validation logic can be auto-generated from the document | Requires Azure setup and either a trained custom model or a prebuilt model
Best For | Ad hoc, flexible, low-volume parsing | High-volume, recurring document types
Performance | Slower per document (runs via OpenAI LLM) | Fast once the model is trained
Accuracy | Good for loosely structured or varied documents | High for consistent, structured documents
Custom Rules | Supports dot-path validation and C# validation scripts | Must be enforced downstream (e.g., in workflow logic)
Supported File Types | Limited: PDF, TXT, CSV, DOCX, JSON, etc. | Primarily PDFs, but also supports image files (JPG, PNG)
Training Required | None | Required for custom models (trained via Azure Document Intelligence); not needed when a prebuilt model is used
Scalability | Lower: slower processing time per document | High: optimized for batch operations and parallel workflows
Pricing Considerations | May consume more tokens per run depending on document size and structure | Predictable usage-based Azure pricing, better for high-volume scenarios
Schema Flexibility | Schema can be edited or swapped at any time | Changes require retraining or a new model

Links