Claude OCR Parser (Sonnet PDF Analyzer)

Functions

`count_tokens_from_pdf_request(pdf_base64)`

Counts the input tokens consumed by a request that contains the provided PDF data.

Parameters:

Name	Type	Description	Default
`pdf_base64`	`str`	Base64-encoded PDF data.	required

Returns:

Name	Type	Description
`int`	`int`	Number of input tokens used.

Example

tokens = count_tokens_from_pdf_request(pdf_base64_data)
print(f"Input tokens: {tokens}")
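
The example above assumes `pdf_base64_data` already holds the encoded document. A minimal sketch of producing that string with the standard library might look like the following. The helper name `encode_pdf_to_base64` is hypothetical and is not part of this module.

```python
import base64
from pathlib import Path


def encode_pdf_to_base64(pdf_path: str) -> str:
    """Read a PDF from disk and return its contents as a base64 string.

    Hypothetical helper for illustration; not provided by this module.
    """
    raw_bytes = Path(pdf_path).read_bytes()
    # standard_b64encode returns bytes; decode to str so the value can be
    # embedded in a JSON request body.
    return base64.standard_b64encode(raw_bytes).decode("ascii")
```

The resulting string can then be passed as the `pdf_base64` argument.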

`estimate_price(input_tokens, output_tokens, model_name)`

Estimates the API call cost based on the input and output tokens for the specified model.

Parameters:

Name	Type	Description	Default
`input_tokens`	`int`	Number of input tokens.	required
`output_tokens`	`int`	Number of output tokens.	required
`model_name`	`str`	Name of the model used for the request.	required

Returns:

Name	Type	Description
`float`	`float`	Estimated cost of the API call.

Example

cost = estimate_price(1000, 500, "claude-3-5-sonnet")
print(f"Estimated cost: ${cost:.4f}")
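
The source does not show how `estimate_price` computes its result. One plausible approach is a per-model rate table applied to the two token counts, sketched below. Everything here is an assumption: the function name `estimate_price_sketch`, the table `HYPOTHETICAL_PRICES_PER_MTOK`, the model key, and the rates, which are placeholders rather than real pricing.

```python
# Placeholder rates in USD per million tokens. These are NOT real prices;
# they exist only to make the arithmetic concrete.
HYPOTHETICAL_PRICES_PER_MTOK = {
    "example-model": {"input": 3.0, "output": 15.0},
}


def estimate_price_sketch(input_tokens: int, output_tokens: int, model_name: str) -> float:
    """Estimate a request's cost from token counts and a per-model rate table."""
    try:
        rates = HYPOTHETICAL_PRICES_PER_MTOK[model_name]
    except KeyError:
        # Fail loudly instead of silently returning 0 for an unknown model.
        raise ValueError(f"No pricing configured for model {model_name!r}") from None
    # Input and output tokens are billed at different rates.
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000
```

With these placeholder rates, small requests cost a fraction of a cent, so printing only two decimal places can hide most of the value.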

`extract_data_from_pdf(pdf_path, schema='invoice', prompt=None, model_name=MODEL_NAME, max_tokens=2048, verbose=False)`

Extracts structured information from a PDF file using the Claude OCR API.

Parameters:

Name	Type	Description	Default
`pdf_path`	`str`	Path to the PDF file.	required
`schema`	`str`	Name of the extraction schema. Must be one of the values in `config.ALLOWED_SCHEMA`.	`'invoice'`
`prompt`	`Optional[str]`	Custom prompt for extraction. If None, a default prompt is used.	`None`
`model_name`	`str`	Model to use for extraction. The default, `MODEL_NAME`, is Claude Sonnet.	`MODEL_NAME`
`max_tokens`	`int`	Maximum number of output tokens.	`2048`
`verbose`	`bool`	If True, prints detailed prompt information.	`False`

Returns:

Name	Type	Description
`BaseModel`		Structured response containing extracted information according to the specified schema.

Example

result = extract_data_from_pdf(
    pdf_path="invoice.pdf",
    schema="custom-fedex-shipment",
    verbose=True
)
print(result.model_dump_json(indent=2))
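
The `schema` argument is documented as needing to be in `config.ALLOWED_SCHEMA`. How the library enforces this is not shown here. A caller-side check before making a billable request could be sketched as follows. The tuple contents and the function `validate_schema_name` are assumptions for illustration; the real allowed values live in `config.ALLOWED_SCHEMA`.

```python
# Stand-in for config.ALLOWED_SCHEMA. The two names are taken from the
# examples on this page; the real list may differ.
ASSUMED_ALLOWED_SCHEMA = ("invoice", "custom-fedex-shipment")


def validate_schema_name(schema: str, allowed=ASSUMED_ALLOWED_SCHEMA) -> str:
    """Return the schema name unchanged if it is allowed, else raise ValueError."""
    if schema not in allowed:
        options = ", ".join(sorted(allowed))
        raise ValueError(f"Unknown schema {schema!r}; expected one of: {options}")
    return schema
```

Checking the name up front avoids spending tokens on a request that would be rejected for an unsupported schema.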