2026-03-06 12:59:32 +01:00

32 KiB


GLiNER2 Training Dataset Formats

GLiNER2 training data is stored as JSONL: each line is one JSON object with an input field and an output field (or, equivalently, text and schema). The input/text field holds the text to process; the output/schema field holds the schema together with its labels and annotations.

Quick Format Reference

General Structure

Primary Format:

{"input": "text to process", "output": {"schema_definition": "with_annotations"}}

Alternative Format (also supported):

{"text": "text to process", "schema": {"schema_definition": "with_annotations"}}

Both formats are equivalent - use whichever is more convenient for your workflow.
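As a minimal sketch of what "equivalent" means in practice, the snippet below reads a line in either format and returns the same (text, schema) pair. The helper name parse_record is hypothetical and is not part of GLiNER2; only the standard library is used.

```python
import json

def parse_record(line: str) -> tuple[str, dict]:
    """Return (text, schema) from one JSONL line in either supported format."""
    record = json.loads(line)
    if "input" in record and "output" in record:
        return record["input"], record["output"]
    if "text" in record and "schema" in record:
        return record["text"], record["schema"]
    raise ValueError("expected 'input'/'output' or 'text'/'schema' keys")

primary = '{"input": "John works at OpenAI.", "output": {"entities": {"person": ["John"]}}}'
alternative = '{"text": "John works at OpenAI.", "schema": {"entities": {"person": ["John"]}}}'

# Both lines normalize to the same (text, schema) pair.
assert parse_record(primary) == parse_record(alternative)
```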

Valid Output Schema Keys

Key	Type	Required	Description
`entities`	`dict[str, list[str]]`	No	Entity type → list of entity mentions
`entity_descriptions`	`dict[str, str]`	No	Entity type → description
`classifications`	`list[dict]`	No	List of classification tasks
`json_structures`	`list[dict]`	No	List of structured data extractions
`json_descriptions`	`dict[str, dict[str, str]]`	No	Parent → field → description
`relations`	`list[dict]`	No	List of relation extractions

Classification Task Fields

Field	Type	Required	Description
`task`	`str`	Yes	Task identifier
`labels`	`list[str]`	Yes	Available label options
`true_label`	`list[str]` or `str`	Yes	Correct label(s)
`multi_label`	`bool`	No	Enable multi-label classification
`prompt`	`str`	No	Custom prompt for the task
`examples`	`list[list[str]]` or `list[tuple[str, str]]`	No	Few-shot examples as [[input, output], ...] pairs. Internally converted to list of lists.
`label_descriptions`	`dict[str, str]`	No	Label → description mapping

Entity Fields Format

Entities use a simple dictionary where keys are entity types and values are lists of mentions:

Component	Type	Required	Description
Entity type (key)	`str`	Yes	Name of the entity type (e.g., "person", "location")
Entity mentions (value)	`list[str]`	Yes	List of entity text spans found in input

Format: {"entity_type": ["mention1", "mention2", ...]}

JSON Structure Fields Format

Each structure is a dictionary with a parent name as key and field definitions as value:

Component	Type	Required	Description
Parent name (key)	`str`	Yes	Name of the structure (e.g., "product", "contact")
Fields (value)	`dict`	Yes	Field name → field value mappings
Field value	`str` or `list[str]` or `dict`	Yes	String, list of strings, or choice dict
Choice dict	`dict` with `value` and `choices`	No	For classification-style fields

Format: [{"parent": {"field1": "value", "field2": ["list", "values"]}}]

Multiple Instances: When the same parent appears multiple times, each instance is a separate dict in the list:

[{"hotel": {"name": "Hotel A", ...}}, {"hotel": {"name": "Hotel B", ...}}]

Relation Fields Format

Relations use flexible field structures - you can use ANY field names (not just "head" and "tail"):

Component	Type	Required	Description
Relation name (key)	`str`	Yes	Name of the relation type (e.g., "works_for")
Fields (value)	`dict`	Yes	Field name → field value mappings
Field value	`str` or `list[str]`	Yes	String or list of strings

Standard Format: [{"relation_name": {"head": "entity1", "tail": "entity2"}}]

⚠️ Critical Constraint: For a given relation type, the first occurrence defines the field structure:

The first instance of "works_for" determines what fields ALL "works_for" instances must have
All subsequent instances of the same relation type must use the same field names
Different relation types can have different field structures
This consistency is enforced during validation - inconsistent field structures will raise a ValidationError

Example: If first "works_for" has {"head": "...", "tail": "..."}, all other "works_for" instances must also have "head" and "tail" fields.

Validation: The TrainingDataset.validate_relation_consistency() method checks that all relation types have consistent field structures across the entire dataset.
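The first-occurrence rule can be illustrated with a small standalone check. This is only a sketch of the idea, not the code behind validate_relation_consistency(): the function name find_inconsistent_relations is hypothetical, and it assumes records use the primary input/output keys.

```python
def find_inconsistent_relations(records: list[dict]) -> list[str]:
    """Return relation types whose field names differ from their first occurrence."""
    first_fields: dict[str, frozenset] = {}
    inconsistent: list[str] = []
    for record in records:
        for relation in record.get("output", {}).get("relations", []):
            for name, fields in relation.items():
                keys = frozenset(fields)
                # The first time a relation type is seen, its keys become the reference.
                expected = first_fields.setdefault(name, keys)
                if keys != expected and name not in inconsistent:
                    inconsistent.append(name)
    return inconsistent

records = [
    {"input": "John works for Microsoft.",
     "output": {"relations": [{"works_for": {"head": "John", "tail": "Microsoft"}}]}},
    {"input": "Alice works for Google.",
     "output": {"relations": [{"works_for": {"employee": "Alice", "employer": "Google"}}]}},
    {"input": "Alice sent $100 to Bob.",
     "output": {"relations": [{"transaction": {"sender": "Alice", "recipient": "Bob", "amount": "$100"}}]}},
]
print(find_inconsistent_relations(records))  # ['works_for']
```

Here the second "works_for" instance uses employee/employer instead of head/tail, so it is reported, while "transaction" is free to define its own fields.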

Alternative Input Formats

The training data loader supports multiple input formats:

JSONL files: {"input": "...", "output": {...}} or {"text": "...", "schema": {...}}
Python API: Use InputExample and TrainingDataset classes from gliner2.training.data
Dict lists: List of dictionaries in the same format as JSONL

All formats are automatically detected and converted to the internal format. See gliner2.training.data.DataLoader_Factory for details.

1. Classification Tasks

Basic Single-Label Classification

{"input": "This movie is absolutely fantastic! I loved every minute of it.", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}]}}
{"input": "The service at this restaurant was terrible and the food was cold.", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["negative"]}]}}
{"input": "The weather today is okay, nothing special.", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["neutral"]}]}}

Multi-label Classification

{"input": "This smartphone has an amazing camera but the battery life is poor.", "output": {"classifications": [{"task": "product_aspects", "labels": ["camera", "battery", "screen", "performance", "design"], "true_label": ["camera", "battery"], "multi_label": true}]}}
{"input": "Great performance and beautiful design!", "output": {"classifications": [{"task": "product_aspects", "labels": ["camera", "battery", "screen", "performance", "design"], "true_label": ["performance", "design"], "multi_label": true}]}}

Classification with Label Descriptions

{"input": "Breaking: New AI model achieves human-level performance on reasoning tasks.", "output": {"classifications": [{"task": "news_category", "labels": ["technology", "politics", "sports", "entertainment"], "true_label": ["technology"], "label_descriptions": {"technology": "Articles about computers, AI, software, and tech innovations", "politics": "Government, elections, and political news", "sports": "Athletic events, teams, and competitions", "entertainment": "Movies, music, celebrities, and entertainment news"}}]}}

Classification with Custom Prompts

{"input": "The patient shows signs of improvement after treatment.", "output": {"classifications": [{"task": "medical_assessment", "labels": ["improving", "stable", "declining", "critical"], "true_label": ["improving"], "prompt": "Assess the patient's medical condition based on the clinical notes."}]}}

Classification with Few-Shot Examples

Few-shot examples are provided as a list of [input, output] pairs. Each example is a list/tuple with exactly 2 elements:

{"input": "This service exceeded all my expectations!", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"], "examples": [["Great product, highly recommend!", "positive"], ["Terrible experience, very disappointed.", "negative"], ["It's okay, nothing special.", "neutral"]]}]}}

Format: "examples": [[input_text, output_label], [input_text, output_label], ...]

Each example pair must have exactly 2 elements: the input text and the corresponding label.
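A minimal sketch of the pair check and the list-of-lists conversion described above. The helper name normalize_examples is hypothetical; the library's internal conversion may differ in detail.

```python
def normalize_examples(examples):
    """Convert few-shot examples to a list of [input_text, label] lists."""
    normalized = []
    for index, example in enumerate(examples):
        if len(example) != 2:
            raise ValueError(f"example {index} has {len(example)} elements, expected 2")
        text, label = example
        normalized.append([text, label])
    return normalized

# Tuples and lists are both accepted and come out as lists.
pairs = normalize_examples([
    ("Great product, highly recommend!", "positive"),
    ["Terrible experience, very disappointed.", "negative"],
])
print(pairs)
```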

Classification with Both Examples and Descriptions

{"input": "The algorithm demonstrates linear time complexity.", "output": {"classifications": [{"task": "complexity", "labels": ["constant", "linear", "quadratic", "exponential"], "true_label": ["linear"], "examples": [["O(1) lookup time", "constant"], ["O(n) iteration", "linear"]], "label_descriptions": {"constant": "O(1) - fixed time regardless of input size", "linear": "O(n) - time scales linearly with input", "quadratic": "O(n²) - nested iterations", "exponential": "O(2ⁿ) - recursive branching"}}]}}

Multiple Classification Tasks

{"input": "Exciting new smartphone with innovative features!", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}, {"task": "category", "labels": ["technology", "sports", "politics", "entertainment"], "true_label": ["technology"]}]}}

true_label: String vs List Format

Both formats are supported - use list for consistency or string for brevity:

{"input": "Sample text A", "output": {"classifications": [{"task": "label", "labels": ["a", "b"], "true_label": ["a"]}]}}
{"input": "Sample text B", "output": {"classifications": [{"task": "label", "labels": ["a", "b"], "true_label": "b"}]}}
{"input": "This is great!", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": "positive"}]}}

Note:

String format ("true_label": "positive") and list format ("true_label": ["positive"]) are both valid for single-label classification
Internally, string values are automatically converted to lists (["positive"])
For multi-label classification, always use list format: "true_label": ["label1", "label2"]
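The string-to-list conversion can be pictured with the following sketch. The function name normalize_true_label is hypothetical and only mirrors the behavior described in the note above.

```python
def normalize_true_label(true_label):
    """Wrap a single string label in a list; return a list copy otherwise."""
    if isinstance(true_label, str):
        return [true_label]
    return list(true_label)

assert normalize_true_label("positive") == ["positive"]
assert normalize_true_label(["camera", "battery"]) == ["camera", "battery"]
```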

2. Named Entity Recognition (NER)

Basic NER

{"input": "John Smith works at OpenAI in San Francisco and will visit London next month.", "output": {"entities": {"person": ["John Smith"], "organization": ["OpenAI"], "location": ["San Francisco", "London"]}}}
{"input": "Apple Inc. CEO Tim Cook announced the iPhone 15 release date.", "output": {"entities": {"person": ["Tim Cook"], "organization": ["Apple Inc."], "product": ["iPhone 15"]}}}
{"input": "The meeting on January 15, 2024 will be held at Microsoft headquarters.", "output": {"entities": {"date": ["January 15, 2024"], "organization": ["Microsoft"]}}}

NER with Entity Descriptions

{"input": "Dr. Sarah Johnson prescribed Metformin 500mg twice daily for diabetes treatment.", "output": {"entities": {"person": ["Dr. Sarah Johnson"], "medication": ["Metformin"], "dosage": ["500mg"], "condition": ["diabetes"]}, "entity_descriptions": {"person": "Names of people mentioned in the text", "medication": "Names of drugs or pharmaceutical products", "dosage": "Specific amounts or dosages of medications", "condition": "Medical conditions or diseases"}}}

NER with Multiple Instances of Same Entity Type

{"input": "Alice, Bob, and Charlie attended the meeting with David.", "output": {"entities": {"person": ["Alice", "Bob", "Charlie", "David"]}}}

NER with Empty Entity Types

{"input": "The conference will be held next week.", "output": {"entities": {"person": [], "organization": [], "location": []}}}

Partial NER (Some Entity Types Present)

{"input": "Microsoft announced new features.", "output": {"entities": {"organization": ["Microsoft"], "person": []}}}

3. JSON Structure Extraction

Basic Structure with String Fields

{"input": "Contact John Doe at john.doe@email.com or call (555) 123-4567.", "output": {"json_structures": [{"contact": {"name": "John Doe", "email": "john.doe@email.com", "phone": "(555) 123-4567"}}]}}

Structure with List Fields

{"input": "Product features include: wireless charging, water resistance, and face recognition.", "output": {"json_structures": [{"product": {"features": ["wireless charging", "water resistance", "face recognition"]}}]}}

Structure with Mixed String and List Fields

{"input": "iPhone 15 costs $999 and comes in blue, black, and white colors.", "output": {"json_structures": [{"product": {"name": "iPhone 15", "price": "$999", "colors": ["blue", "black", "white"]}}]}}

Multiple Instances of Same Structure Type

When the same structure type (parent name) appears multiple times in the text, each instance is a separate dictionary in the json_structures list:

{"input": "We have two hotels available: Hotel Paradise with 4 stars, pool, and wifi for $150/night, and Budget Inn with 2 stars and parking for $80/night.", "output": {"json_structures": [{"hotel": {"name": "Hotel Paradise", "stars": "4", "amenities": ["pool", "wifi"], "price": "$150/night"}}, {"hotel": {"name": "Budget Inn", "stars": "2", "amenities": ["parking"], "price": "$80/night"}}]}}

Note: Both instances use the same parent key "hotel" but are separate objects in the list. This is how you represent multiple occurrences of the same structure type.

Another example with three products:

{"input": "Available products: iPhone 15 for $999, MacBook Pro for $1999, and AirPods for $199.", "output": {"json_structures": [{"product": {"name": "iPhone 15", "price": "$999"}}, {"product": {"name": "MacBook Pro", "price": "$1999"}}, {"product": {"name": "AirPods", "price": "$199"}}]}}

Structure with Classification Fields (Choices)

{"input": "Book a single room at Grand Hotel for 2 nights with breakfast included.", "output": {"json_structures": [{"booking": {"hotel": "Grand Hotel", "room_type": {"value": "single", "choices": ["single", "double", "suite"]}, "nights": "2", "meal_plan": {"value": "breakfast", "choices": ["none", "breakfast", "half-board", "full-board"]}}}]}}

Structure with Multiple Choice Fields

{"input": "Order a large pepperoni pizza for delivery, extra cheese.", "output": {"json_structures": [{"order": {"size": {"value": "large", "choices": ["small", "medium", "large", "xlarge"]}, "type": {"value": "pepperoni", "choices": ["cheese", "pepperoni", "veggie", "supreme"]}, "method": {"value": "delivery", "choices": ["pickup", "delivery", "dine-in"]}, "extras": ["extra cheese"]}}]}}
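To make the three kinds of field value concrete (string, list of strings, choice dict), here is a small sketch that tells them apart. The helper name field_kind is hypothetical and is not a GLiNER2 function.

```python
def field_kind(value) -> str:
    """Classify a JSON structure field value as 'string', 'list', or 'choice'."""
    if isinstance(value, str):
        return "string"
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return "list"
    if isinstance(value, dict) and {"value", "choices"} <= set(value):
        return "choice"
    raise TypeError(f"unsupported field value: {value!r}")

order = {
    "size": {"value": "large", "choices": ["small", "medium", "large", "xlarge"]},
    "method": {"value": "delivery", "choices": ["pickup", "delivery", "dine-in"]},
    "extras": ["extra cheese"],
}
kinds = {name: field_kind(value) for name, value in order.items()}
print(kinds)  # {'size': 'choice', 'method': 'choice', 'extras': 'list'}
```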

Structure with Field Descriptions

{"input": "Patient: Mary Wilson, Age: 45, diagnosed with hypertension, prescribed Lisinopril 10mg daily.", "output": {"json_structures": [{"medical_record": {"patient_name": "Mary Wilson", "age": "45", "diagnosis": "hypertension", "medication": "Lisinopril", "dosage": "10mg daily"}}], "json_descriptions": {"medical_record": {"patient_name": "Full name of the patient", "age": "Patient's age in years", "diagnosis": "Medical condition diagnosed", "medication": "Prescribed medication name", "dosage": "Medication dosage and frequency"}}}}

Structure with Null/Empty Field Values

{"input": "Product name is Widget X. Price not available.", "output": {"json_structures": [{"product": {"name": "Widget X", "price": "", "description": ""}}]}}

Structure with Some Fields Missing

{"input": "Contact Sarah at sarah@example.com", "output": {"json_structures": [{"contact": {"name": "Sarah", "email": "sarah@example.com", "phone": ""}}]}}

Multiple Different Structure Types

{"input": "John Doe works at TechCorp. Product ABC costs $50 with free shipping.", "output": {"json_structures": [{"employee": {"name": "John Doe", "company": "TechCorp"}}, {"product": {"name": "ABC", "price": "$50", "shipping": "free"}}]}}

Structure with Only List Fields

{"input": "Available colors: red, blue, green. Sizes: S, M, L, XL.", "output": {"json_structures": [{"options": {"colors": ["red", "blue", "green"], "sizes": ["S", "M", "L", "XL"]}}]}}

4. Relation Extraction

Relations use flexible field structures. While "head" and "tail" are common, you can use ANY field names.

⚠️ Important: The first occurrence of each relation type defines the field structure for ALL instances of that type.

Basic Relation (Head and Tail)

{"input": "Alice manages the engineering team.", "output": {"relations": [{"manages": {"head": "Alice", "tail": "engineering team"}}]}}
{"input": "John works for Microsoft.", "output": {"relations": [{"works_for": {"head": "John", "tail": "Microsoft"}}]}}

Multiple Instances - Same Field Structure

All instances of the same relation type MUST have the same fields (determined by first occurrence):

{"input": "Alice works for Google. Bob works for Microsoft. Charlie works for Amazon.", "output": {"relations": [{"works_for": {"head": "Alice", "tail": "Google"}}, {"works_for": {"head": "Bob", "tail": "Microsoft"}}, {"works_for": {"head": "Charlie", "tail": "Amazon"}}]}}

Note: All three "works_for" instances use the same fields (head, tail) as defined by the first occurrence.

Multiple Different Relation Types

Different relation types can have different field structures:

{"input": "John works for Apple Inc. and lives in San Francisco. Apple Inc. is located in Cupertino.", "output": {"relations": [{"works_for": {"head": "John", "tail": "Apple Inc."}}, {"lives_in": {"head": "John", "tail": "San Francisco"}}, {"located_in": {"head": "Apple Inc.", "tail": "Cupertino"}}]}}

Note: Each relation type ("works_for", "lives_in", "located_in") can independently define its own field structure.

Custom Field Names (Beyond Head/Tail)

You can use custom field names - the first occurrence defines what fields to use:

{"input": "Alice sent $100 to Bob. Charlie sent $50 to David.", "output": {"relations": [{"transaction": {"sender": "Alice", "recipient": "Bob", "amount": "$100"}}, {"transaction": {"sender": "Charlie", "recipient": "David", "amount": "$50"}}]}}

Note: First "transaction" uses sender/recipient/amount, so all "transaction" instances must use these same fields.

Relations with Additional Fields

{"input": "John Smith is the CEO of TechCorp which is headquartered in Silicon Valley.", "output": {"relations": [{"employment": {"head": "John Smith", "tail": "TechCorp", "role": "CEO"}}, {"located_in": {"head": "TechCorp", "tail": "Silicon Valley"}}]}}

Relations Combined with Entities

{"input": "Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne.", "output": {"entities": {"person": ["Elon Musk"], "organization": ["SpaceX"], "location": ["Hawthorne"], "date": ["2002"]}, "relations": [{"founded": {"head": "Elon Musk", "tail": "SpaceX"}}, {"located_in": {"head": "SpaceX", "tail": "Hawthorne"}}]}}

Empty Relations (Negative Example)

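{"input": "The weather is nice today.", "output": {"relations": [], "classifications": [{"task": "sentiment", "labels": ["positive", "negative"], "true_label": ["positive"]}]}}

Note: The classification is included because each example must have at least one task; an example whose only content is an empty relations list would fail validation.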

Bidirectional Relations

{"input": "Alice and Bob are colleagues.", "output": {"relations": [{"colleague_of": {"head": "Alice", "tail": "Bob"}}, {"colleague_of": {"head": "Bob", "tail": "Alice"}}]}}

Field Consistency: Relations vs JSON Structures

Key Difference:

Relations: First occurrence defines field structure for ALL instances of that relation type
- All "works_for" relations must have same fields
- Enforced consistency per relation type
JSON Structures: Fields can vary between instances of the same parent type
- Uses union of all fields across instances
- More flexible - instances can have different subsets of fields

Example - Relations (Strict Consistency):

{"input": "Alice works for Google. Bob works for Microsoft.", "output": {"relations": [{"works_for": {"head": "Alice", "tail": "Google"}}, {"works_for": {"head": "Bob", "tail": "Microsoft"}}]}}

✓ Valid: Both "works_for" have same fields (head, tail)

Example - JSON Structures (Flexible Fields):

{"input": "Product A costs $10. Product B costs $20 and weighs 5kg.", "output": {"json_structures": [{"product": {"name": "A", "price": "$10"}}, {"product": {"name": "B", "price": "$20", "weight": "5kg"}}]}}

✓ Valid: Second instance has extra "weight" field - this is allowed for json_structures
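The "union of all fields" idea for JSON structures can be sketched as follows. The function name field_union is hypothetical; it only illustrates how differing instances of the same parent combine into one field set, and is not taken from the library.

```python
def field_union(json_structures: list[dict]) -> dict[str, list[str]]:
    """Collect, per parent name, every field seen across all instances (in first-seen order)."""
    union: dict[str, list[str]] = {}
    for structure in json_structures:
        for parent, fields in structure.items():
            seen = union.setdefault(parent, [])
            for name in fields:
                if name not in seen:
                    seen.append(name)
    return union

structures = [
    {"product": {"name": "A", "price": "$10"}},
    {"product": {"name": "B", "price": "$20", "weight": "5kg"}},
]
print(field_union(structures))  # {'product': ['name', 'price', 'weight']}
```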

5. Combined Multi-Task Examples

Entities + Classifications

{"input": "Apple Inc. announced record profits. This is great news for investors.", "output": {"entities": {"organization": ["Apple Inc."]}, "classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}]}}

Entities + JSON Structures

{"input": "Contact John Doe at john@example.com. He works at TechCorp.", "output": {"entities": {"person": ["John Doe"], "organization": ["TechCorp"]}, "json_structures": [{"contact": {"name": "John Doe", "email": "john@example.com", "company": "TechCorp"}}]}}

Entities + Relations

{"input": "Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne.", "output": {"entities": {"person": ["Elon Musk"], "organization": ["SpaceX"], "location": ["Hawthorne"], "date": ["2002"]}, "relations": [{"founded": {"head": "Elon Musk", "tail": "SpaceX", "year": "2002"}}, {"located_in": {"head": "SpaceX", "tail": "Hawthorne"}}]}}

Classifications + JSON Structures

{"input": "Premium subscription for $99/month includes unlimited access. Great value!", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}], "json_structures": [{"subscription": {"tier": "Premium", "price": "$99/month", "features": ["unlimited access"]}}]}}

Entities + Classifications + JSON Structures

{"input": "Apple CEO Tim Cook unveiled iPhone 15 for $999. Analysts are optimistic.", "output": {"entities": {"person": ["Tim Cook"], "organization": ["Apple"], "product": ["iPhone 15"]}, "classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}], "json_structures": [{"product_announcement": {"company": "Apple", "product": "iPhone 15", "price": "$999", "presenter": "Tim Cook"}}]}}

Entities + Relations + Classifications

{"input": "Sarah founded TechStart in 2020. The company is doing exceptionally well.", "output": {"entities": {"person": ["Sarah"], "organization": ["TechStart"], "date": ["2020"]}, "relations": [{"founded": {"head": "Sarah", "tail": "TechStart", "year": "2020"}}], "classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}]}}

All Four Tasks Combined

{"input": "Breaking: Apple announces new iPhone 15 with improved camera. Analysts are optimistic about sales projections.", "output": {"entities": {"company": ["Apple"], "product": ["iPhone 15"]}, "classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}, {"task": "category", "labels": ["technology", "business", "sports", "entertainment"], "true_label": ["technology"]}], "json_structures": [{"news_article": {"company": "Apple", "product": "iPhone 15", "feature": "improved camera", "analyst_view": "optimistic"}}], "relations": [{"product_of": {"head": "iPhone 15", "tail": "Apple"}}]}}

Multi-Task with Descriptions

{"input": "Dr. Johnson prescribed medication X for condition Y. Patient shows improvement.", "output": {"entities": {"person": ["Dr. Johnson"], "medication": ["medication X"], "condition": ["condition Y"]}, "entity_descriptions": {"person": "Healthcare provider names", "medication": "Prescribed drugs", "condition": "Medical conditions"}, "classifications": [{"task": "patient_status", "labels": ["improving", "stable", "declining"], "true_label": ["improving"], "label_descriptions": {"improving": "Patient condition getting better", "stable": "No change in condition", "declining": "Patient condition worsening"}}], "json_structures": [{"prescription": {"doctor": "Dr. Johnson", "medication": "medication X", "condition": "condition Y"}}], "json_descriptions": {"prescription": {"doctor": "Prescribing physician", "medication": "Prescribed drug name", "condition": "Diagnosed condition"}}}}

Partial Multi-Task (Some Tasks Empty)

Note: While you can include empty dictionaries/lists for some tasks, at least one task must have content.

{"input": "The weather forecast predicts rain tomorrow.", "output": {"entities": {}, "classifications": [{"task": "weather", "labels": ["sunny", "rainy", "cloudy", "snowy"], "true_label": ["rainy"]}], "json_structures": []}}

This is valid because it has a classification task. However, if all tasks were empty, it would fail validation.

6. Format Edge Cases

Completely Empty Output

⚠️ Note: Examples must have at least one task (entities, classifications, structures, or relations). Completely empty outputs are not valid training examples.

{"input": "Random text with no specific information.", "output": {"entities": {}, "classifications": [], "json_structures": [], "relations": []}}

This format will fail validation. Each example must contain at least one annotation.
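A rough way to pre-screen records for this rule is sketched below. The helper has_any_task is hypothetical, and it assumes that any non-empty dict or list under a task key counts as content; the library's own check may be stricter or looser.

```python
TASK_KEYS = ("entities", "classifications", "json_structures", "relations")

def has_any_task(output: dict) -> bool:
    """Return True if at least one task key holds a non-empty value."""
    return any(output.get(key) for key in TASK_KEYS)

empty = {"entities": {}, "classifications": [], "json_structures": [], "relations": []}
partial = {
    "entities": {},
    "classifications": [
        {"task": "sentiment", "labels": ["positive", "negative"], "true_label": ["positive"]}
    ],
}
assert not has_any_task(empty)
assert has_any_task(partial)
```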

Empty Entities Dictionary

⚠️ Note: While an empty entities dictionary is syntactically valid, examples must have at least one task. If you only have empty entities, add at least one other task (classification, structure, or relation).

{"input": "The weather is nice today.", "output": {"entities": {}, "classifications": [{"task": "sentiment", "labels": ["positive", "negative"], "true_label": ["positive"]}]}}

Empty Classifications List

⚠️ Note: While an empty classifications list is syntactically valid, examples must have at least one task. If you only have empty classifications, add at least one other task.

{"input": "Some generic text.", "output": {"classifications": [], "entities": {"location": ["text"]}}}

Very Long Label Lists

{"input": "Sample text for many labels.", "output": {"classifications": [{"task": "topic", "labels": ["label1", "label2", "label3", "label4", "label5", "label6", "label7", "label8", "label9", "label10", "label11", "label12", "label13", "label14", "label15", "label16", "label17", "label18", "label19", "label20"], "true_label": ["label5"]}]}}

Very Short Text

{"input": "Yes.", "output": {"classifications": [{"task": "response", "labels": ["yes", "no", "maybe"], "true_label": ["yes"]}]}}
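{"input": "OK", "output": {"entities": {}, "classifications": [{"task": "response", "labels": ["yes", "no", "maybe"], "true_label": ["yes"]}]}}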

Special Characters in Labels

{"input": "The C++ programming language.", "output": {"entities": {"programming_language": ["C++"]}}}
{"input": "Use the @ symbol for mentions.", "output": {"entities": {"symbol": ["@"]}}}

Special Characters in Values

{"input": "Price is $1,299.99 (including tax).", "output": {"json_structures": [{"pricing": {"amount": "$1,299.99", "note": "(including tax)"}}]}}

Unicode and Non-ASCII Characters

{"input": "Café Münchën serves crème brûlée.", "output": {"entities": {"location": ["Café Münchën"], "food": ["crème brûlée"]}}}
{"input": "东京 Tokyo is the capital.", "output": {"entities": {"location": ["东京", "Tokyo"]}}}

Quotes and Escaping

{"input": "He said \"hello\" to me.", "output": {"entities": {"quote": ["\"hello\""]}}}

Newlines in Text

{"input": "First line.\nSecond line.", "output": {"entities": {"text": ["First line", "Second line"]}}}
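Writing lines with a JSON serializer, rather than by hand, avoids most escaping mistakes with quotes, newlines, and non-ASCII text. A minimal sketch using Python's standard json module (the record content is illustrative only):

```python
import json

record = {
    "input": 'He said "hello" to me.\nThen he left the Café.',
    "output": {"entities": {"quote": ['"hello"'], "location": ["Café"]}},
}

# ensure_ascii=False keeps characters such as "é" readable in the UTF-8 file;
# quotes and the newline are escaped, so the record stays on a single line.
line = json.dumps(record, ensure_ascii=False)

assert "\n" not in line
assert json.loads(line) == record
print(line)
```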

Numbers as Strings vs Entity Names

{"input": "Room 123 on floor 4.", "output": {"json_structures": [{"location": {"room": "123", "floor": "4"}}]}}

Boolean-like Values

{"input": "Status is active, notifications enabled.", "output": {"json_structures": [{"settings": {"status": "active", "notifications": "enabled"}}]}}

Empty String Values

{"input": "Name: John, Age: unknown", "output": {"json_structures": [{"person": {"name": "John", "age": ""}}]}}

Multiple Empty Lines in JSONL

{"input": "First example.", "output": {"entities": {"type": ["example"]}}}
{"input": "Second example.", "output": {"entities": {"type": ["example"]}}}
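The sketch below shows one way to read JSONL text while skipping empty lines between records. It assumes blank lines are meant to be ignored, which the heading suggests but the text does not state; read_jsonl is a hypothetical helper, not the GLiNER2 loader.

```python
import json

def read_jsonl(text: str) -> list[dict]:
    """Parse JSONL text, ignoring blank lines and reporting the failing line number."""
    records = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # skip empty or whitespace-only lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as error:
            raise ValueError(f"line {number}: {error}") from error
    return records

data = (
    '{"input": "First example.", "output": {"entities": {"type": ["example"]}}}\n'
    "\n"
    "\n"
    '{"input": "Second example.", "output": {"entities": {"type": ["example"]}}}\n'
)
records = read_jsonl(data)
assert [r["input"] for r in records] == ["First example.", "Second example."]
```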

Schema Component Reference

entities

Type: dict[str, list[str]]
Format: {"entity_type": ["mention1", "mention2", ...]}
Example: {"person": ["John", "Alice"], "location": ["NYC"]}

entity_descriptions

Type: dict[str, str]
Format: {"entity_type": "description text"}
Example: {"person": "Names of people", "location": "Geographic places"}

classifications

Type: list[dict]
Required fields: task, labels, true_label
Optional fields: multi_label, prompt, examples, label_descriptions
Example: [{"task": "sentiment", "labels": ["pos", "neg"], "true_label": ["pos"]}]

json_structures

Type: list[dict]
Single instance: [{"parent_name": {"field1": "value1", "field2": ["list", "values"]}}]
Multiple instances (same parent): [{"parent": {...}}, {"parent": {...}}] - Same parent key, separate dicts
Multiple types: [{"parent1": {...}}, {"parent2": {...}}] - Different parent keys
Choice format: {"field": {"value": "selected", "choices": ["opt1", "opt2"]}}
Example: [{"product": {"name": "Item", "price": "$10"}}, {"product": {"name": "Item2", "price": "$20"}}]

json_descriptions

Type: dict[str, dict[str, str]]
Format: {"parent": {"field": "description"}}
Example: {"product": {"name": "Product name", "price": "Cost in USD"}}

relations

Type: list[dict]
Standard format: [{"relation_name": {"head": "entity1", "tail": "entity2"}}]
With custom fields: [{"relation_name": {"sender": "A", "recipient": "B", "amount": "$100"}}]
Example: [{"works_for": {"head": "John", "tail": "Company"}}, {"founded": {"head": "Alice", "tail": "StartupX"}}]
⚠️ Field constraint: First occurrence of each relation type defines field structure for ALL instances of that type
Note: While "head" and "tail" are common, you can use ANY field names - just keep them consistent per relation type

Tips for Dataset Creation

Use diverse examples to improve model generalization
Include edge cases - but remember each example must have at least one task
Provide descriptions when possible to improve accuracy
Balance your classes in classification tasks
Use realistic text that matches your target domain
Include multiple instances for JSON structures when applicable
For negative examples, include at least one task (e.g., empty entities but a classification, or empty classifications but entities)
Mix task types to train multi-task capabilities
Use consistent formatting for similar examples
Include special characters to ensure robust handling
Validate your dataset using TrainingDataset.validate(strict=True) to catch annotation errors early
Check relation consistency using validate_relation_consistency() to ensure all relation types have consistent field structures

Validation Checklist

Make sure your JSONL file is valid by checking:

Each line is valid JSON
Required fields (input/output or text/schema) are present
At least one task is present (entities, classifications, structures, or relations)
Schema structure matches the expected format
Entity spans exist in the input text - checked in strict validation mode
Classification labels are from the defined label set
true_label is a list or string (string format is converted to list internally)
For multi-label classification, multi_label is set to true when multiple labels are provided
JSON structure fields may differ between instances of the same parent (flexible - the union of fields is used)
Relation field consistency: All instances of the same relation type use the same field names (determined by first occurrence)
No trailing commas in JSON objects
Special characters are properly escaped
File encoding is UTF-8
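Several of the classification items in this checklist can be checked mechanically before training. The following is a sketch under the rules stated above; check_classification is a hypothetical helper and does not reproduce the library's validator.

```python
def check_classification(task: dict) -> list[str]:
    """Return a list of problems found in one classification task dict."""
    problems = [
        f"missing required field: {field}"
        for field in ("task", "labels", "true_label")
        if field not in task
    ]
    if problems:
        return problems

    true_label = task["true_label"]
    if isinstance(true_label, str):
        true_label = [true_label]  # string form is treated as a one-element list

    unknown = [label for label in true_label if label not in task["labels"]]
    if unknown:
        problems.append(f"true_label not in labels: {unknown}")
    if len(true_label) > 1 and not task.get("multi_label", False):
        problems.append("multiple true labels but multi_label is not true")
    return problems

good = {"task": "sentiment", "labels": ["positive", "negative"], "true_label": "positive"}
bad = {"task": "aspects", "labels": ["camera", "battery"], "true_label": ["camera", "screen"]}
print(check_classification(good))  # []
print(check_classification(bad))
```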

Validation Modes

The implementation supports two validation modes:

Standard validation: Checks format correctness, required fields, label consistency
Strict validation: Additionally checks that entity mentions and relation values exist in the input text (case-insensitive substring matching)

Use strict validation during dataset creation to catch annotation errors early.
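The case-insensitive substring check behind strict mode can be approximated as follows. This is a sketch of the described behavior, not the library's code; the function name missing_spans is hypothetical.

```python
def missing_spans(text: str, output: dict) -> list[str]:
    """Return entity mentions and relation values that do not occur in the text."""
    haystack = text.lower()
    missing = []
    for mentions in output.get("entities", {}).values():
        missing += [m for m in mentions if m.lower() not in haystack]
    for relation in output.get("relations", []):
        for fields in relation.values():
            for value in fields.values():
                # A relation field may hold a single string or a list of strings.
                values = [value] if isinstance(value, str) else value
                missing += [v for v in values if v.lower() not in haystack]
    return missing

text = "Elon Musk founded SpaceX in 2002."
output = {
    "entities": {"person": ["elon musk"], "location": ["Hawthorne"]},
    "relations": [{"founded": {"head": "Elon Musk", "tail": "SpaceX"}}],
}
print(missing_spans(text, output))  # ['Hawthorne']
```

The lowercase "elon musk" passes because matching ignores case, while "Hawthorne" is reported because it never appears in the input.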
