Mistral OCR PDF Analyzer
Classes
MistralDocumentParser
Functions
call_gemini_structured_output(prompt, pydantic_model=None, model_name='gemini-2.0-flash')
Calls the Gemini API to process and structure extracted OCR text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The text prompt for the model. | required |
| pydantic_model | Optional[BaseModel] | Pydantic schema for structured output. | None |
| model_name | str | The model to use (e.g., "gemini-2.0-flash"). | 'gemini-2.0-flash' |
Returns:
| Name | Type | Description |
|---|---|---|
| | str | The structured JSON response from the Gemini model. |
call_mistral_structured_output(prompt, model_name, pydantic_model=None)
Calls the Mistral API for structured JSON output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The prompt text to send to the API. | required |
| model_name | str | The model to use. | required |
| pydantic_model | Optional[BaseModel] | Pydantic schema to enforce structure. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| | str | The structured JSON output as a string. |
clean_and_structure_markdown_table(image_response_markdown, image_base64)
Cleans and structures the OCR markdown extracted from an image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| image_response_markdown | str | The raw OCR markdown extracted from the image. | required |
| image_base64 | str | The base64-encoded image data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| | ChatResponse | The structured markdown response from the Mistral API. |
final_parse(document_ocr_markdown, pydantic_model=None, model='gemini-1.5-flash')
Parses the final reconstructed OCR markdown output into a structured JSON format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_ocr_markdown | str | The OCR-extracted text in markdown format. | required |
| pydantic_model | Optional[BaseModel] | The Pydantic model defining the expected schema. | None |
| model | str | The model to use for parsing (e.g., "gemini-1.5-flash"). | 'gemini-1.5-flash' |
Returns:
| Name | Type | Description |
|---|---|---|
| | dict | The structured JSON output with extracted information. |
get_ocr_response(pdf_path, include_image_base64=True)
Uploads a PDF file to Mistral API and retrieves the OCR response.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdf_path | str | Path to the PDF file. | required |
| include_image_base64 | bool | Whether to include base64-encoded images in the response. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| | OCRResponse | OCR processing response from the Mistral API. |
ocr_image(image_base64)
Processes an image using Mistral OCR API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| image_base64 | str | The base64-encoded image data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| | OCRResponse | OCR response from the Mistral API. |
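Both image OCR functions take the image as a base64 string. The sketch below shows one way to produce such a string from raw bytes with the standard library. The helper name `encode_image_bytes` is hypothetical, and this page does not say whether the parser expects a bare base64 string or a data URI, so the sketch assumes the bare string.

```python
import base64


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return the base64 text form of raw image bytes (hypothetical helper)."""
    return base64.b64encode(image_bytes).decode("ascii")


# The eight-byte PNG file signature, used here as stand-in image data.
png_signature = b"\x89PNG\r\n\x1a\n"
image_base64 = encode_image_bytes(png_signature)
```

In real use the bytes would come from reading an image file in binary mode.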
ocr_image_async(image_base64)
async
Asynchronously processes an image using Mistral OCR API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| image_base64 | str | The base64-encoded image data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| | OCRResponse | OCR response from the Mistral API. |
parse(pdf_path, schema=None, additional_fields=None, final_parser_model='gemini-1.5-flash', max_retries=2, write_output=True)
Performs full OCR processing and parsing of a given PDF document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdf_path | str | Path to the PDF document. | required |
| schema | Optional[str] | Name of the Pydantic model to use; expected to be one of the entries in config.ALLOWED_SCHEMA. | None |
| final_parser_model | str | The model to use for final parsing. | 'gemini-1.5-flash' |
| max_retries | int | Maximum retry attempts if JSON parsing fails. | 2 |
| write_output | bool | Whether to write intermediate outputs to files. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| | dict | The structured JSON output with extracted information. |
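The `max_retries` parameter bounds how often parsing is retried when the model output is not valid JSON. The loop below is only a sketch of that idea, not the parser's actual code: `parse_with_retries` and the `generate` callable are hypothetical stand-ins for the final parsing call, and it assumes one initial attempt plus up to `max_retries` retries.

```python
import json
from typing import Callable, Optional


def parse_with_retries(generate: Callable[[], str], max_retries: int = 2) -> Optional[dict]:
    """Call `generate` until its output parses as a JSON object.

    Assumption: one initial attempt plus up to `max_retries` retries.
    Returns None if every attempt fails.
    """
    for _ in range(max_retries + 1):
        raw = generate()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON, ask again
        if isinstance(parsed, dict):
            return parsed
    return None


# Stand-in for a model call: truncated output first, then valid JSON.
responses = iter(['{"total": ', '{"total": 12.5}'])
result = parse_with_retries(lambda: next(responses))
```

Returning `None` after the last failed attempt is a design choice of this sketch; raising an exception would be an equally reasonable alternative.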
second_ocr_on_images_async(pdf_response)
async
Performs a second pass of OCR on detected images within the document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdf_response | OCRResponse | The initial OCR response from the Mistral API. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | A dictionary mapping image IDs to their OCR-extracted content. |
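A second OCR pass over several images is a natural fit for concurrent execution. The following sketch shows how a mapping from image ID to OCR text could be built with `asyncio.gather`. Whether the method itself uses `gather` is an assumption, and `fake_ocr` and `ocr_all_images` are hypothetical names that stand in for the real API call.

```python
import asyncio
from typing import Dict


async def fake_ocr(image_base64: str) -> str:
    """Stand-in for an async OCR call; returns placeholder text."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"ocr text for {len(image_base64)} base64 chars"


async def ocr_all_images(images: Dict[str, str]) -> Dict[str, str]:
    """Run OCR on every image concurrently and key the results by image ID."""
    ids = list(images)
    # gather returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*(fake_ocr(images[i]) for i in ids))
    return dict(zip(ids, results))


images = {"img-0": "aGVsbG8=", "img-1": "d29ybGQh"}
ocr_by_id = asyncio.run(ocr_all_images(images))
```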
write_json_output(parsed_dict, json_out_file)
Writes a dictionary as a JSON file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| parsed_dict | dict | The parsed dictionary to be written. | required |
| json_out_file | str | The output JSON file path. | required |
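For orientation, writing a dictionary as JSON needs only the standard library. This is an illustrative sketch rather than the method's source: `write_json_sketch` is a hypothetical name, and the indentation and encoding settings are assumptions.

```python
import json
import os
import tempfile


def write_json_sketch(parsed_dict: dict, json_out_file: str) -> None:
    """Serialize a dictionary to a UTF-8 JSON file (illustrative only)."""
    with open(json_out_file, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file.
        json.dump(parsed_dict, f, indent=2, ensure_ascii=False)


out_path = os.path.join(tempfile.mkdtemp(), "parsed.json")
write_json_sketch({"vendor": "Café Nord", "total": 12.5}, out_path)

# Read the file back to confirm the data survives the round trip.
with open(out_path, encoding="utf-8") as f:
    round_tripped = json.load(f)
```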
write_markdown(markdown_str, markdown_out_file)
Writes a markdown string to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| markdown_str | str | The markdown content to be written. | required |
| markdown_out_file | str | The output markdown file path. | required |
Functions
async_wrapper(func, **kwargs)
async
Wrapper that handles both async and sync LLM functions.
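One common way to treat sync and async callables uniformly is to check for a coroutine function and otherwise run the call in a worker thread. The sketch below illustrates that pattern under the assumption that it resembles what the wrapper does. `call_sync_or_async`, `sync_llm`, and `async_llm` are hypothetical names.

```python
import asyncio
import inspect
from typing import Any, Callable


async def call_sync_or_async(func: Callable[..., Any], **kwargs: Any) -> Any:
    """Await `func` if it is a coroutine function, else run it in a thread."""
    if inspect.iscoroutinefunction(func):
        return await func(**kwargs)
    # asyncio.to_thread (Python 3.9+) keeps a blocking call off the event loop.
    return await asyncio.to_thread(func, **kwargs)


def sync_llm(prompt: str) -> str:
    """Stand-in for a blocking LLM client call."""
    return f"sync:{prompt}"


async def async_llm(prompt: str) -> str:
    """Stand-in for an async LLM client call."""
    return f"async:{prompt}"


async def demo() -> list:
    return [
        await call_sync_or_async(sync_llm, prompt="hi"),
        await call_sync_or_async(async_llm, prompt="hi"),
    ]


outputs = asyncio.run(demo())
```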
fix_json(invalid_json)
Attempts to fix invalid JSON, focusing on escaped backslashes and other common issues.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| invalid_json | str | The potentially invalid JSON string. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| | dict | The parsed JSON object if successful, None otherwise. |
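OCR output often contains LaTeX-style backslashes that are not valid JSON escapes. The sketch below shows one possible repair strategy, which is to double any backslash that does not start a valid escape and then parse again. It is not the function's actual implementation, and `fix_json_sketch` is a hypothetical name.

```python
import json
import re
from typing import Optional

# Matches either a valid JSON escape sequence (captured) or a lone backslash.
_ESCAPE_OR_STRAY = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})|\\')


def fix_json_sketch(invalid_json: str) -> Optional[dict]:
    """Parse JSON, doubling stray backslashes on a second attempt if needed."""
    try:
        return json.loads(invalid_json)
    except json.JSONDecodeError:
        pass

    def keep_or_double(match: re.Match) -> str:
        # Valid escapes are kept as written; a stray backslash is doubled.
        return match.group(0) if match.group(1) is not None else "\\\\"

    repaired = _ESCAPE_OR_STRAY.sub(keep_or_double, invalid_json)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None


broken = r'{"formula": "\alpha + \sigma"}'
fixed = fix_json_sketch(broken)
```

A limitation of this approach is that sequences such as `\n` or `\f` are already valid JSON escapes, so a path like `C:\new\file` would parse without error but not as intended.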