Mistral OCR PDF Analyzer

Classes

`MistralDocumentParser`

Functions

`call_gemini_structured_output(prompt, pydantic_model=None, model_name='gemini-2.0-flash')`

Calls the Gemini API to process and structure extracted OCR text.

Parameters:

Name	Type	Description	Default
`prompt`	`str`	The text prompt for the model.	required
`pydantic_model`	`Optional[BaseModel]`	Pydantic schema for structured output.	`None`
`model_name`	`str`	The model to use (e.g., "gemini-2.0-flash").	`'gemini-2.0-flash'`

Returns:

Type	Description
`str`	The structured JSON response from the Gemini model.

`call_mistral_structured_output(prompt, model_name, pydantic_model=None)`

Calls the Mistral API for structured JSON output.

Parameters:

Name	Type	Description	Default
`prompt`	`str`	The prompt text to send to the API.	required
`model_name`	`str`	The model to use.	required
`pydantic_model`	`Optional[BaseModel]`	Pydantic schema to enforce structure.	`None`

Returns:

Type	Description
`str`	The structured JSON output as a string.

`clean_and_structure_markdown_table(image_response_markdown, image_base64)`

Cleans and structures the OCR markdown extracted from an image.

Parameters:

Name	Type	Description	Default
`image_response_markdown`	`str`	The raw OCR markdown extracted from the image.	required
`image_base64`	`str`	The base64-encoded image data.	required

Returns:

Type	Description
`ChatResponse`	The structured markdown response from the Mistral API.

`final_parse(document_ocr_markdown, pydantic_model=None, model='gemini-1.5-flash')`

Parses the final reconstructed OCR markdown output into a structured JSON format.

Parameters:

Name	Type	Description	Default
`document_ocr_markdown`	`str`	The OCR-extracted text in markdown format.	required
`pydantic_model`	`Optional[BaseModel]`	The Pydantic model defining the expected schema.	`None`
`model`	`str`	The model to use for parsing (e.g., "gemini-1.5-flash").	`'gemini-1.5-flash'`

Returns:

Type	Description
`dict`	The structured JSON output with extracted information.

`get_ocr_response(pdf_path, include_image_base64=True)`

Uploads a PDF file to the Mistral API and retrieves the OCR response.

Parameters:

Name	Type	Description	Default
`pdf_path`	`str`	Path to the PDF file.	required
`include_image_base64`	`bool`	Whether to include base64-encoded images in the response.	`True`

Returns:

Type	Description
`OCRResponse`	OCR processing response from the Mistral API.

`ocr_image(image_base64)`

Processes an image using the Mistral OCR API.

Parameters:

Name	Type	Description	Default
`image_base64`	`str`	The base64-encoded image data.	required

Returns:

Type	Description
`OCRResponse`	OCR response from the Mistral API.
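
The `image_base64` argument is a string, so raw image bytes have to be encoded before they are passed in. A minimal sketch using only the standard library follows. Whether the parser expects bare base64 text or a `data:` URI is not stated in this reference, so both forms are shown; `encode_image_bytes` and `to_data_uri` are hypothetical helper names, not part of this module.

```python
import base64


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return bare base64 text for raw image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


def to_data_uri(image_base64: str, mime_type: str = "image/png") -> str:
    """Wrap bare base64 text in a data URI (assumption: some APIs expect this form)."""
    return f"data:{mime_type};base64,{image_base64}"


# The PNG file signature stands in for real image data here.
sample_bytes = b"\x89PNG\r\n\x1a\n"
image_base64 = encode_image_bytes(sample_bytes)
image_data_uri = to_data_uri(image_base64)
```

In practice the bytes would come from reading an image file in binary mode; the encoded string is what `ocr_image` and `ocr_image_async` take as input.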

`ocr_image_async(image_base64)` `async`

Asynchronously processes an image using the Mistral OCR API.

Parameters:

Name	Type	Description	Default
`image_base64`	`str`	The base64-encoded image data.	required

Returns:

Type	Description
`OCRResponse`	OCR response from the Mistral API.

`parse(pdf_path, schema=None, additional_fields=None, final_parser_model='gemini-1.5-flash', max_retries=2, write_output=True)`

Performs full OCR processing and parsing of a given PDF document.

Parameters:

Name	Type	Description	Default
`pdf_path`	`str`	Path to the PDF document.	required
`schema`	`Optional[str]`	Name of the Pydantic model to parse into; it should be one of the schemas listed in `config.ALLOWED_SCHEMA`.	`None`
`final_parser_model`	`str`	The model to use for final parsing.	`'gemini-1.5-flash'`
`max_retries`	`int`	Maximum retry attempts if JSON parsing fails.	`2`
`write_output`	`bool`	Whether to write intermediate outputs to files.	`True`

Returns:

Type	Description
`dict`	The structured JSON output with extracted information.
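
`max_retries` bounds how often parsing is retried when the model output is not valid JSON. The loop below is a minimal sketch of that idea, not the method's actual implementation. It assumes one initial attempt plus up to `max_retries` further attempts; `parse_with_retries` and `generate` are hypothetical names.

```python
import json
from typing import Callable, Optional


def parse_with_retries(generate: Callable[[], str], max_retries: int = 2) -> Optional[dict]:
    """Call generate() until its output parses as a JSON object.

    Assumption: max_retries counts extra attempts after the first one.
    Returns None if every attempt fails.
    """
    for _attempt in range(max_retries + 1):
        raw = generate()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: ask the model again
        if isinstance(parsed, dict):
            return parsed
    return None


# Stub generator: the first answer is truncated, the second is valid JSON.
answers = iter(['{"title": ', '{"title": "Invoice"}'])
result = parse_with_retries(lambda: next(answers), max_retries=2)
```

Here `generate` stands in for whatever call produces the model's text, for example a closure around `final_parse`.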

`second_ocr_on_images_async(pdf_response)` `async`

Performs a second pass of OCR on detected images within the document.

Parameters:

Name	Type	Description	Default
`pdf_response`	`OCRResponse`	The initial OCR response from the Mistral API.	required

Returns:

Type	Description
`dict`	A dictionary mapping image IDs to their OCR-extracted content.
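
The second pass fans out one OCR call per image and collects the results by image ID. A minimal sketch of that pattern with `asyncio.gather` follows. It is an illustration under assumptions, not the method's code: `ocr_all_images` and `fake_ocr` are hypothetical, and the stub returns plain strings rather than `OCRResponse` objects.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def ocr_all_images(
    images: Dict[str, str],
    ocr_one: Callable[[str], Awaitable[str]],
) -> Dict[str, str]:
    """Run ocr_one concurrently on every image and map image ID -> result."""
    image_ids = list(images)
    # gather() returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*(ocr_one(images[i]) for i in image_ids))
    return dict(zip(image_ids, results))


async def fake_ocr(image_base64: str) -> str:
    """Stand-in for a real OCR call (hypothetical)."""
    await asyncio.sleep(0)
    return f"text for {image_base64}"


second_pass = asyncio.run(ocr_all_images({"img-0": "AAAA", "img-1": "BBBB"}, fake_ocr))
```

Running the calls concurrently keeps total latency close to that of the slowest single image rather than the sum of all of them.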

`write_json_output(parsed_dict, json_out_file)`

Writes a dictionary as a JSON file.

Parameters:

Name	Type	Description	Default
`parsed_dict`	`dict`	The parsed dictionary to be written.	required
`json_out_file`	`str`	The output JSON file path.	required

`write_markdown(markdown_str, markdown_out_file)`

Writes a markdown string to a file.

Parameters:

Name	Type	Description	Default
`markdown_str`	`str`	The markdown content to be written.	required
`markdown_out_file`	`str`	The output markdown file path.	required
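
Both writers take in-memory content and a target path. A minimal sketch of equivalent helpers is shown below; the UTF-8 encoding, the two-space indentation, and the names `write_json_sketch` and `write_markdown_sketch` are assumptions, since this reference does not specify the output formatting.

```python
import json
from pathlib import Path


def write_json_sketch(parsed_dict: dict, json_out_file: str) -> None:
    """Write a dictionary to a JSON file, keeping non-ASCII characters readable."""
    with open(json_out_file, "w", encoding="utf-8") as handle:
        json.dump(parsed_dict, handle, ensure_ascii=False, indent=2)


def write_markdown_sketch(markdown_str: str, markdown_out_file: str) -> None:
    """Write a markdown string to a file as UTF-8 text."""
    Path(markdown_out_file).write_text(markdown_str, encoding="utf-8")
```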

Functions

`async_wrapper(func, **kwargs)` `async`

Wrapper that handles both async and sync LLM functions.
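
One common way to build such a wrapper is to await coroutine functions directly and push blocking functions onto a worker thread. The sketch below shows that approach under assumptions: it is not necessarily how `async_wrapper` is implemented, and `async_wrapper_sketch`, `sync_llm`, and `async_llm` are hypothetical names. It relies on `asyncio.to_thread`, which requires Python 3.9 or later.

```python
import asyncio
import inspect
from typing import Any, Callable


async def async_wrapper_sketch(func: Callable[..., Any], **kwargs: Any) -> Any:
    """Await func if it is a coroutine function, else run it in a worker thread."""
    if inspect.iscoroutinefunction(func):
        return await func(**kwargs)
    # A blocking call runs in a thread so the event loop stays responsive.
    return await asyncio.to_thread(func, **kwargs)


def sync_llm(prompt: str) -> str:
    """Stand-in for a blocking LLM call (hypothetical)."""
    return prompt.upper()


async def async_llm(prompt: str) -> str:
    """Stand-in for an async LLM call (hypothetical)."""
    return prompt[::-1]


sync_result = asyncio.run(async_wrapper_sketch(sync_llm, prompt="abc"))
async_result = asyncio.run(async_wrapper_sketch(async_llm, prompt="abc"))
```

With a wrapper of this shape, calling code can `await` any LLM function without knowing whether it is synchronous or asynchronous.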

`fix_json(invalid_json)`

Attempts to fix invalid JSON, focusing on incorrectly escaped backslashes and other common issues.

Parameters:

Name	Type	Description	Default
`invalid_json`	`str`	The potentially invalid JSON string.	required

Returns:

Type	Description
`dict`	The parsed JSON object if the repair succeeds, `None` otherwise.
