
Document Parser API Documentation

Document Parser Overview

The Document Parser module accepts a PDF file and uses a language model to extract information in two ways:

  1. Unconstrained JSON Output
    Generates a JSON representation of all data detected in the PDF, with keys inferred automatically by the language model.

  2. Schema-Constrained Output (Invoices)
    When provided with a specific schema (currently supporting invoice extraction), the parser produces a structured JSON output that adheres strictly to the predefined invoice format.

This module extracts both free-form and schema-constrained data from PDFs with minimal configuration. Additional fields can also be added to an existing structured output schema (see 'added_field1' in the example output).

EXAMPLE OUTPUT (Invoice)
   {
      'all_info': <KEY/VALUE JSON DICT WITH UNCONSTRAINED KEYS>,
      'document_info': {
            'currency': 'USD',
            'due_date': 'Jul 05, 2023',
            'invoice_date': 'Jun 20, 2023',
            'invoice_id': '2-170-32026',
            'issuer_address': null,
            'issuer_name': 'FedEx',
            'issuer_siren': null,
            'issuer_tax_id': null,
            'line_items': [
               {
                  'description': 'FedEx Express Services',
                  'net_amount': 5304.48,
                  'quantity': 1,
                  'ref': null,
                  'tax_amount': 0,
                  'tax_rate': 0,
                  'total_amount': 5304.48,
                  'unit': null,
                  'unit_price': 5304.48
               }
            ],
            'receiver_address': '900 AMERICAN ROAD UNIT 4\\nMORRIS PLAINS NJ 07950',
            'receiver_name': 'PDP COURIER',
            'receiver_siren': null,
            'receiver_tax_id': null,
            'total_amount': 5304.48,
            'total_net_amount': 5304.48,
            'total_tax_amount': 0,
            'added_field1': 'blabla'
      },
      'markdown': <markdown contained in document>,
      'bucket': <bucket_name_of_the_document>,
      'object': <object_name_in_the_gcp_bucket>,
      'cost': <llm_cumulative_estimate_cost>
   }
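
Because 'document_info' follows a fixed invoice schema, downstream code can run simple sanity checks on it. The following is a minimal sketch, not part of the Document Parser module: it assumes the result has already been obtained and is JSON-serializable, and the helper names check_invoice_totals and missing_fields are hypothetical. It compares the header totals against the sum of the line items and lists the fields the parser returned as null.

```python
import json
import math

# Abridged copy of the example output above, written as JSON text.
# Assumption: the parser result can be serialized to and loaded from JSON.
raw_result = """
{
  "all_info": {},
  "document_info": {
    "currency": "USD",
    "invoice_id": "2-170-32026",
    "issuer_name": "FedEx",
    "issuer_tax_id": null,
    "line_items": [
      {"description": "FedEx Express Services", "net_amount": 5304.48,
       "quantity": 1, "tax_amount": 0, "total_amount": 5304.48}
    ],
    "total_amount": 5304.48,
    "total_net_amount": 5304.48,
    "total_tax_amount": 0
  }
}
"""


def check_invoice_totals(document_info, tolerance=0.01):
    """Return a list of inconsistencies between header totals and line items."""
    problems = []
    items = document_info.get("line_items") or []
    # Each header total is paired with the line-item field it should sum up.
    pairs = [
        ("total_net_amount", "net_amount"),
        ("total_tax_amount", "tax_amount"),
        ("total_amount", "total_amount"),
    ]
    for total_key, item_key in pairs:
        expected = document_info.get(total_key)
        if expected is None:
            continue  # nothing to compare against
        actual = sum((item.get(item_key) or 0) for item in items)
        if not math.isclose(actual, expected, abs_tol=tolerance):
            problems.append(
                f"{total_key}: line items sum to {actual}, header says {expected}"
            )
    return problems


def missing_fields(document_info):
    """Return the sorted names of top-level fields extracted as null."""
    return sorted(key for key, value in document_info.items() if value is None)


result = json.loads(raw_result)
invoice = result["document_info"]
problems = check_invoice_totals(invoice)
missing = missing_fields(invoice)
print("inconsistencies:", problems)  # inconsistencies: []
print("missing fields:", missing)    # missing fields: ['issuer_tax_id']
```

A check like this is useful because the values come from a language model: a mismatch between line items and totals is a cheap signal that a document should be reviewed by hand.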

Modules