We use the standard JSON Schema format to organize all the directives needed to prompt a large language model (LLM) for data extraction. The schema includes the following custom annotations, which give the LLM additional context and guidance:
X-SystemPrompt
A top-level directive that provides general instructions or context for the LLM, ensuring consistent behavior and improving the relevance of its responses during the data extraction process.
X-ReasoningPrompt
This annotation creates an auxiliary field for generating reasoning or explanatory context about a property. It allows the LLM to provide additional insights or justifications for extracted values, which can be helpful in complex or ambiguous scenarios.
These annotations help ensure structured and precise interactions with LLMs while remaining compatible with standard JSON Schema conventions.
Our promptify method can analyze and refine the directives provided in the schema to improve overall performance and response quality. Importantly, this optimization process leaves the extraction parameters, such as the schema’s structure and field definitions, unchanged to ensure consistency in data processing.
Here is a full example of a schema with all the X-Directives:
{
  "X-SystemPrompt": "You are a useful assistant extracting information from documents.",
  "properties": {
    "name": {
      "description": "The name of the calendar event.",
      "title": "Name",
      "type": "string"
    },
    "date": {
      "X-ReasoningPrompt": "The user can mention it in any format, like **next week** or **tomorrow**. Infer the right date format from the user input.",
      "description": "The date of the calendar event in ISO 8601 format.",
      "title": "Date",
      "type": "string"
    }
  },
  "required": [
    "name",
    "date"
  ],
  "title": "CalendarEvent",
  "type": "object"
}
X-SystemPrompt is set at the top level of the schema:
{ "X-SystemPrompt": "You are a useful assistant extracting information from documents.", ...}
X-ReasoningPrompt is set on an individual property and generates a reasoning field alongside that property's data field:
{ "X-ReasoningPrompt": "The user can mention it in any format, like **next week** or **tomorrow**. Infer the right date format from the user input.", ...}
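To make the idea of a reasoning field concrete, here is a minimal sketch of how a schema could be expanded so that each annotated property gets a companion string property. This is an illustration under stated assumptions, not uiform's actual implementation: the function name `add_reasoning_fields` is hypothetical, the `reasoning___` prefix is taken from the example output on this page, and placing the reasoning property before the data property simply mirrors that example.

```python
import copy

# Prefix taken from the example output on this page.
REASONING_PREFIX = "reasoning___"


def add_reasoning_fields(schema: dict) -> dict:
    """Hypothetical helper: return a copy of `schema` in which every property
    annotated with X-ReasoningPrompt is preceded by a string property that
    holds the model's reasoning for that value."""
    expanded = copy.deepcopy(schema)  # leave the caller's schema untouched
    if "properties" not in expanded:
        return expanded

    properties = {}
    for name, prop in expanded["properties"].items():
        prompt = prop.pop("X-ReasoningPrompt", None)
        if prompt is not None:
            # The reasoning prompt becomes the description of the new field.
            properties[REASONING_PREFIX + name] = {
                "type": "string",
                "description": prompt,
            }
        properties[name] = prop
    expanded["properties"] = properties
    return expanded


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "date": {
            "type": "string",
            "X-ReasoningPrompt": "Infer the right date format from the user input.",
        },
    },
    "required": ["name", "date"],
}

expanded = add_reasoning_fields(schema)
print(list(expanded["properties"]))
```

The sketch only handles top-level properties and does not touch the "required" list; nested objects and arrays would need a recursive version.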
This schema should validate objects like this:
{ "name": "Example string value.", "date": "Example string in object."}
However, the LLM will internally produce additional reasoning fields for better extraction, such as:
{ "name": "Example string value.", "reasoning___date": "Reasoning for date.", "date": "Example string in object."}
As you can see, apart from the “reasoning___” fields, the LLM output follows the same structure as your supplied schema.
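Because the reasoning fields are not part of your schema, they need to be dropped before the output is validated against it. The snippet below is a minimal sketch of that step, assuming the reasoning keys always start with the `reasoning___` prefix shown above. The helper name `strip_reasoning_fields` is hypothetical and is not the library's own filtering function.

```python
import json

# Prefix used for the auxiliary fields in the example output above.
REASONING_PREFIX = "reasoning___"


def strip_reasoning_fields(value):
    """Hypothetical helper: recursively drop every key that starts with the
    reasoning prefix, leaving the rest of the structure unchanged."""
    if isinstance(value, dict):
        return {
            key: strip_reasoning_fields(item)
            for key, item in value.items()
            if not key.startswith(REASONING_PREFIX)
        }
    if isinstance(value, list):
        return [strip_reasoning_fields(item) for item in value]
    return value


raw_output = (
    '{"name": "Team sync", '
    '"reasoning___date": "The user said tomorrow.", '
    '"date": "2025-01-15"}'
)
cleaned = strip_reasoning_fields(json.loads(raw_output))
print(cleaned)  # {'name': 'Team sync', 'date': '2025-01-15'}
```

The recursion means reasoning fields nested inside objects or lists are removed as well, so the cleaned result has the same shape as the supplied schema.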
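You can define the custom annotations directly on a Pydantic model. Field-level annotations such as X-ReasoningPrompt go in the json_schema_extra argument of pydantic.Field, and model-level annotations such as X-SystemPrompt go in json_schema_extra inside the model's ConfigDict.
Here is a minimal example with everything you should need: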
from pydantic import BaseModel, Field, ConfigDict

class CalendarEvent(BaseModel):
    model_config = ConfigDict(json_schema_extra={"X-SystemPrompt": "You are a useful assistant."})

    name: str = Field(...,
        description="The name of the calendar event.",
    )
    date: str = Field(...,
        description="The date of the calendar event in ISO 8601 format.",
        json_schema_extra={
            "X-ReasoningPrompt": "The user can mention it in any format, like **next week** or **tomorrow**. Infer the right date format from the user input.",
        }
    )
If you need the JSON Schema itself, call model_json_schema() on the BaseModel subclass. It returns the schema as a dictionary with the custom annotations included.
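As a short illustration of that conversion, the sketch below rebuilds the CalendarEvent model and prints its schema. It assumes Pydantic v2, where `model_json_schema()` and `ConfigDict` are available, and uses a shortened reasoning prompt for brevity.

```python
import json

from pydantic import BaseModel, ConfigDict, Field


class CalendarEvent(BaseModel):
    # Model-level annotation: merged into the top level of the JSON Schema.
    model_config = ConfigDict(
        json_schema_extra={"X-SystemPrompt": "You are a useful assistant."}
    )

    name: str = Field(..., description="The name of the calendar event.")
    date: str = Field(
        ...,
        description="The date of the calendar event in ISO 8601 format.",
        # Field-level annotation: merged into this property's schema.
        json_schema_extra={
            "X-ReasoningPrompt": "Infer the right date format from the user input."
        },
    )


schema = CalendarEvent.model_json_schema()
print(json.dumps(schema, indent=2))
```

The printed dictionary carries X-SystemPrompt at the top level and X-ReasoningPrompt inside the date property, matching the layout of the full schema example earlier on this page.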
You will need to use uiform’s Schema object to leverage the directives in the schema.
from uiform import UiForm, Schema
from openai import OpenAI
from pydantic import BaseModel, Field, ConfigDict

uiclient = UiForm()
doc_msg = uiclient.documents.create_messages(
    document="document_1.xlsx"
)

class CalendarEvent(BaseModel):
    model_config = ConfigDict(json_schema_extra={"X-SystemPrompt": "You are a useful assistant."})

    name: str = Field(...,
        description="The name of the calendar event.",
    )
    date: str = Field(...,
        description="The date of the calendar event in ISO 8601 format.",
        json_schema_extra={
            "X-ReasoningPrompt": "The user can mention it in any format, like **next week** or **tomorrow**. Infer the right date format from the user input.",
        }
    )

schema_obj = Schema(
    pydantic_model=CalendarEvent
)

# Now you can use your favorite model to analyze your document
client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=schema_obj.openai_messages + doc_msg.openai_messages,
    response_format=schema_obj.response_format_pydantic
)

# Validate the response against the original schema if you want to remove the reasoning fields
from uiform._utils.json_schema import filter_auxiliary_fields_json

assert completion.choices[0].message.content is not None
extraction = schema_obj.pydantic_model.model_validate(
    filter_auxiliary_fields_json(completion.choices[0].message.content, schema_obj.pydantic_model)
)

print("Extraction:", extraction)