Skip to content

Claude OCR Parser (Sonnet PDF Analyzer)

Functions

count_tokens_from_pdf_request(pdf_base64)

Counts tokens used for processing the provided PDF data.

Parameters:

Name Type Description Default
pdf_base64 str

Base64-encoded PDF data.

required

Returns:

Name Type Description
int int

Number of input tokens used.

Example
tokens = count_tokens_from_pdf_request(pdf_base64_data)
print(f"Input tokens: {tokens}")

estimate_price(input_tokens, output_tokens, model_name)

Estimates the API call cost based on the input and output tokens for the specified model.

Parameters:

Name Type Description Default
input_tokens int

Number of input tokens.

required
output_tokens int

Number of output tokens.

required
model_name str

Name of the model used for the request.

required

Returns:

Name Type Description
float float

Estimated cost of the API call.

Example
cost = estimate_price(1000, 500, "claude-3-5-sonnet")
print(f"Estimated cost: ${cost:.2f}")

extract_data_from_pdf(pdf_path, schema='invoice', prompt=None, model_name=MODEL_NAME, max_tokens=2048, verbose=False)

Extract structured information from a PDF file using Claude OCR API.

Parameters:

Name Type Description Default
pdf_path str

Path to the PDF file.

required
schema str

Extraction model. Should be in config.ALLOWED_SCHEMA. Default is "invoice".

'invoice'
prompt Optional[str]

Custom prompt for extraction. If None, a default prompt is used.

None
model_name str

The model name to use for extraction. Default is Claude Sonnet.

MODEL_NAME
max_tokens int

Maximum tokens for the output. Default is 2048.

2048
verbose bool

If True, prints detailed prompt information. Default is False.

False

Returns:

Name Type Description
BaseModel

Structured response containing extracted information according to the specified schema.

Example
result = extract_data_from_pdf(
    pdf_path="invoice.pdf",
    mode="custom-fedex-shipment",
    verbose=True
)
print(result.model_dump_json(indent=2))