Mistral OCR PDF Analyzer

Classes

MistralDocumentParser

Functions

call_gemini_structured_output(prompt, pydantic_model=None, model_name='gemini-2.0-flash')

Calls the Gemini API to process and structure extracted OCR text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The text prompt for the model. | required |
| pydantic_model | Optional[BaseModel] | Pydantic schema for structured output. | None |
| model_name | str | The model to use (e.g., "gemini-2.0-flash"). | 'gemini-2.0-flash' |

Returns:

| Type | Description |
| --- | --- |
| str | The structured JSON response from the Gemini model. |

call_mistral_structured_output(prompt, model_name, pydantic_model=None)

Calls the Mistral API for structured JSON output.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The prompt text to send to the API. | required |
| model_name | str | The model to use. | required |
| pydantic_model | Optional[BaseModel] | Pydantic schema to enforce structure. | None |

Returns:

| Type | Description |
| --- | --- |
| str | The structured JSON output as a string. |

clean_and_structure_markdown_table(image_response_markdown, image_base64)

Cleans and structures the OCR markdown extracted from an image.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| image_response_markdown | str | The raw OCR markdown extracted from the image. | required |
| image_base64 | str | The base64-encoded image data. | required |

Returns:

| Type | Description |
| --- | --- |
| ChatResponse | The structured markdown response from Mistral API. |

final_parse(document_ocr_markdown, pydantic_model=None, model='gemini-1.5-flash')

Parses the final reconstructed OCR markdown output into a structured JSON format.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| document_ocr_markdown | str | The OCR-extracted text in markdown format. | required |
| pydantic_model | Optional[BaseModel] | The Pydantic model defining the expected schema. | None |
| model | str | The model to use for parsing (e.g., "gemini-1.5-flash"). | 'gemini-1.5-flash' |

Returns:

| Type | Description |
| --- | --- |
| dict | The structured JSON output with extracted information. |

get_ocr_response(pdf_path, include_image_base64=True)

Uploads a PDF file to Mistral API and retrieves the OCR response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdf_path | str | Path to the PDF file. | required |
| include_image_base64 | bool | Whether to include base64-encoded images in the response. | True |

Returns:

| Type | Description |
| --- | --- |
| OCRResponse | OCR processing response from Mistral API. |

ocr_image(image_base64)

Processes an image using Mistral OCR API.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| image_base64 | str | The base64-encoded image data. | required |

Returns:

| Type | Description |
| --- | --- |
| OCRResponse | OCR response from Mistral API. |
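
The `image_base64` argument is a plain string, so callers need to encode image bytes before passing them in. The sketch below shows one way to do that with the standard library. The helper names `encode_image_bytes` and `encode_image_file` are hypothetical and not part of this module, and the reference does not say whether `ocr_image` expects raw base64 or a `data:` URI, so treat the exact payload format as an assumption to verify.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Return the base64 text form of raw image bytes (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


def encode_image_file(image_path: str) -> str:
    """Read an image file from disk and base64-encode its contents."""
    return encode_image_bytes(Path(image_path).read_bytes())
```

The resulting string can then be handed to `ocr_image` or `ocr_image_async`, adding whatever prefix the API turns out to require.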

ocr_image_async(image_base64) async

Asynchronously processes an image using Mistral OCR API.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| image_base64 | str | The base64-encoded image data. | required |

Returns:

| Type | Description |
| --- | --- |
| OCRResponse | OCR response from Mistral API. |

parse(pdf_path, schema=None, additional_fields=None, final_parser_model='gemini-1.5-flash', max_retries=2, write_output=True)

Performs full OCR processing and parsing of a given PDF document.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdf_path | str | Path to the PDF document. | required |
| schema | Optional[str] | Name of the Pydantic model to use; it should be one of the entries in config.ALLOWED_SCHEMA. | None |
| final_parser_model | str | The model to use for final parsing. | 'gemini-1.5-flash' |
| max_retries | int | Maximum retry attempts if JSON parsing fails. | 2 |
| write_output | bool | Whether to write intermediate outputs to files. | True |

Returns:

| Type | Description |
| --- | --- |
| dict | The structured JSON output with extracted information. |
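
The `max_retries` parameter implies a retry loop around the final JSON parsing step. The following is a minimal sketch of that idea, not the parser's actual code: `parse_with_retries` and the `generate` callable are hypothetical stand-ins for "ask the model again", and counting one initial attempt plus `max_retries` retries is an assumption.

```python
import json
from typing import Callable, Optional


def parse_with_retries(generate: Callable[[], str], max_retries: int = 2) -> Optional[dict]:
    """Call generate() until its output parses as JSON or the attempts run out."""
    # Assumption: one initial attempt plus up to max_retries further attempts.
    for _ in range(max_retries + 1):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Malformed output: fall through and request a fresh response.
            continue
    return None
```

Returning `None` after the last failed attempt is one possible design; raising an exception would be equally reasonable if callers should not continue silently.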

second_ocr_on_images_async(pdf_response) async

Performs a second pass of OCR on detected images within the document.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdf_response | OCRResponse | The initial OCR response from Mistral API. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | A dictionary mapping image IDs to their OCR-extracted content. |
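
A second OCR pass over many images is a natural fit for concurrent execution. The sketch below illustrates the general pattern of running one coroutine per image and collecting the results into a dictionary keyed by image ID. Both `fake_ocr` and `ocr_all_images` are hypothetical; `fake_ocr` only stands in for a real call such as `ocr_image_async`, and the real method's input and output shapes may differ.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_ocr(image_base64: str) -> str:
    """Stand-in for a real OCR call; it only reports the payload size."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"ocr text for {len(image_base64)} base64 chars"


async def ocr_all_images(images: List[Tuple[str, str]]) -> Dict[str, str]:
    """Run OCR on (image_id, image_base64) pairs concurrently."""
    ids = [image_id for image_id, _ in images]
    # gather() returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*(fake_ocr(data) for _, data in images))
    return dict(zip(ids, results))
```

From synchronous code the coroutine can be driven with `asyncio.run(ocr_all_images(pairs))`.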

write_json_output(parsed_dict, json_out_file)

Writes a dictionary as a JSON file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| parsed_dict | dict | The parsed dictionary to be written. | required |
| json_out_file | str | The output JSON file path. | required |
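
Writing the parsed dictionary to disk can be done with the standard `json` module. This is a sketch of what such a helper might look like, not the module's implementation: the name `write_json_sketch`, the automatic creation of the parent folder, and the `indent` and `ensure_ascii` settings are all assumptions.

```python
import json
from pathlib import Path


def write_json_sketch(parsed_dict: dict, json_out_file: str) -> None:
    """Serialize parsed_dict to json_out_file as UTF-8 JSON."""
    out_path = Path(json_out_file)
    # Assumption: create the parent folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps accented OCR text readable in the file.
        json.dump(parsed_dict, fh, indent=2, ensure_ascii=False)
```
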

write_markdown(markdown_str, markdown_out_file)

Writes a markdown string to a file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| markdown_str | str | The markdown content to be written. | required |
| markdown_out_file | str | The output markdown file path. | required |

Functions

async_wrapper(func, **kwargs) async

Wrapper to handle both async and sync LLM functions.
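
One common way to support both kinds of callable is to check whether the function is a coroutine function and, if not, run it in a worker thread. The sketch below follows the documented signature but is only an illustration under that assumption; the name `async_wrapper_sketch` is hypothetical and the real wrapper may behave differently.

```python
import asyncio
import inspect
from typing import Any, Callable


async def async_wrapper_sketch(func: Callable[..., Any], **kwargs: Any) -> Any:
    """Await func if it is async; otherwise run it without blocking the loop."""
    if inspect.iscoroutinefunction(func):
        return await func(**kwargs)
    # Blocking call: hand it to a worker thread (requires Python 3.9+).
    return await asyncio.to_thread(func, **kwargs)
```

With this shape, the caller can always write `await async_wrapper_sketch(fn, prompt=...)` regardless of how `fn` is defined.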

fix_json(invalid_json)

Attempt to fix invalid JSON, particularly focusing on escaped backslashes and other common issues.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| invalid_json | str | The potentially invalid JSON string | required |

Returns:

| Type | Description |
| --- | --- |
| dict | The parsed JSON object if successful, None otherwise |
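
To make the backslash problem concrete: model output often contains sequences such as `\alpha`, which are not valid JSON escapes. The following sketch shows one possible repair strategy, doubling any backslash that does not begin a valid escape. It is an illustration only; `fix_json_sketch` is a hypothetical name, and the real `fix_json` may handle other issues or use a different approach.

```python
import json
import re
from typing import Optional

# Characters that may legally follow a backslash inside a JSON string.
_VALID_ESCAPES = '"\\/bfnrtu'


def fix_json_sketch(invalid_json: str) -> Optional[dict]:
    """Parse JSON, retrying once after escaping stray backslashes."""
    try:
        return json.loads(invalid_json)
    except json.JSONDecodeError:
        pass

    def _escape(match: re.Match) -> str:
        # Keep valid escape pairs; double the backslash on invalid ones.
        if match.group(1) in _VALID_ESCAPES:
            return match.group(0)
        return "\\" + match.group(0)

    # Consume backslash + next character as a pair so "\\" is left alone.
    repaired = re.sub(r"\\(.)", _escape, invalid_json)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None
```

Input that is broken in other ways, such as a truncated object, still yields `None`, matching the documented return behavior.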