Document Parser API Documentation
Document Parser Overview
The Document Parser module accepts a PDF file and uses a language model to extract information in two ways:
- Unconstrained JSON Output
  Generates a JSON representation of all detected data in the PDF, with automatically inferred keys from the language model.
- Schema-Constrained Output (Invoices)
  When provided with a specific schema (currently supporting invoice extraction), the parser produces a structured JSON output that adheres strictly to the predefined invoice format.
This module extracts both free-form and schema-constrained data from PDFs with minimal configuration. You can also add fields to the existing structured output schema.
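To illustrate what "adding fields to the existing schema" can mean in practice, here is a minimal sketch using Python dataclasses. It is not the module's actual schema mechanism, which this document does not show: the class names `LineItem`, `Invoice`, and `ExtendedInvoice` and the helper `build` are hypothetical, and only a subset of the invoice fields from the example output is modeled.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class LineItem:
    # Hypothetical model of one entry in 'line_items' (subset of fields).
    description: Optional[str] = None
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    net_amount: Optional[float] = None
    tax_amount: Optional[float] = None
    total_amount: Optional[float] = None


@dataclass
class Invoice:
    # Hypothetical base schema; every field is optional because the
    # parser may return null for values it cannot find.
    invoice_id: Optional[str] = None
    invoice_date: Optional[str] = None
    due_date: Optional[str] = None
    currency: Optional[str] = None
    issuer_name: Optional[str] = None
    receiver_name: Optional[str] = None
    total_amount: Optional[float] = None
    line_items: list[LineItem] = field(default_factory=list)


@dataclass
class ExtendedInvoice(Invoice):
    # Extra field added on top of the base schema (name taken from the
    # example output; its meaning is an assumption).
    added_field1: Optional[str] = None


def build(cls, data: dict[str, Any]):
    """Build a dataclass instance, ignoring keys the schema does not declare."""
    known = {f.name for f in fields(cls)}
    kwargs = {k: v for k, v in data.items() if k in known}
    if "line_items" in kwargs:
        kwargs["line_items"] = [
            build(LineItem, item) for item in (kwargs["line_items"] or [])
        ]
    return cls(**kwargs)


raw = {
    "invoice_id": "2-170-32026",
    "issuer_name": "FedEx",
    "added_field1": "blabla",
    "line_items": [{"description": "FedEx Express Services", "quantity": 1}],
}
base = build(Invoice, raw)              # drops 'added_field1'
extended = build(ExtendedInvoice, raw)  # keeps 'added_field1'
```

Subclassing keeps the base invoice schema untouched, so the same raw data can be read with or without the extra field.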
EXAMPLE OUTPUT (Invoice)
{
    'all_info': <KEY/VALUE JSON DICT WITH UNCONSTRAINED KEYS>,
    'document_info': {
        'currency': 'USD',
        'due_date': 'Jul 05, 2023',
        'invoice_date': 'Jun 20, 2023',
        'invoice_id': '2-170-32026',
        'issuer_address': null,
        'issuer_name': 'FedEx',
        'issuer_siren': null,
        'issuer_tax_id': null,
        'line_items': [
            {
                'description': 'FedEx Express Services',
                'net_amount': 5304.48,
                'quantity': 1,
                'ref': null,
                'tax_amount': 0,
                'tax_rate': 0,
                'total_amount': 5304.48,
                'unit': null,
                'unit_price': 5304.48
            }
        ],
        'receiver_address': '900 AMERICAN ROAD UNIT 4\\nMORRIS PLAINS NJ 07950',
        'receiver_name': 'PDP COURIER',
        'receiver_siren': null,
        'receiver_tax_id': null,
        'total_amount': 5304.48,
        'total_net_amount': 5304.48,
        'total_tax_amount': 0,
        'added_field1': 'blabla'
    },
    'markdown': <markdown contained in document>,
    'bucket': <bucket_name_of_the_document>,
    'object': <object_name_in_the_gcp_bucket>,
    'cost': <llm_cumulative_estimate_cost>
}
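
Because the values are produced by a language model, it can be worth sanity-checking the `document_info` block before using it downstream. The sketch below assumes the response has already been parsed into a Python dict (so `null` becomes `None`) and that dates follow the `Jul 05, 2023` style shown above. The helper names `parse_invoice_date` and `check_totals` are hypothetical and not part of the module.

```python
import math
from datetime import datetime


def parse_invoice_date(text):
    """Parse a date like 'Jul 05, 2023'; return None when the value is missing.

    Assumes English month abbreviations (the default C locale).
    """
    if text is None:
        return None
    return datetime.strptime(text, "%b %d, %Y").date()


def check_totals(document_info, tolerance=0.01):
    """Return a list of inconsistencies between line items and invoice totals."""
    problems = []
    items = document_info.get("line_items") or []
    line_sum = sum(item.get("total_amount") or 0 for item in items)

    total = document_info.get("total_amount")
    net = document_info.get("total_net_amount")
    tax = document_info.get("total_tax_amount")

    if total is not None and not math.isclose(line_sum, total, abs_tol=tolerance):
        problems.append(f"line items sum to {line_sum}, invoice total is {total}")
    if None not in (total, net, tax) and not math.isclose(
        net + tax, total, abs_tol=tolerance
    ):
        problems.append(f"net {net} + tax {tax} does not match total {total}")
    return problems


# Subset of the example output above, as a parsed Python dict.
document_info = {
    "due_date": "Jul 05, 2023",
    "invoice_date": "Jun 20, 2023",
    "line_items": [{"total_amount": 5304.48}],
    "total_amount": 5304.48,
    "total_net_amount": 5304.48,
    "total_tax_amount": 0,
}

issues = check_totals(document_info)
payment_term = (
    parse_invoice_date(document_info["due_date"])
    - parse_invoice_date(document_info["invoice_date"])
)
```

For the example invoice, `issues` is empty and `payment_term.days` is 15. A non-empty list is a signal to review the extraction manually rather than proof that the invoice itself is wrong.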