GLiNER2 JSON Structure Extraction Tutorial
Learn how to extract complex structured data from text using GLiNER2's hierarchical extraction capabilities.
Table of Contents
- Quick API with extract_json
- Field Types and Specifications
- Multiple Instances
- Schema Builder (Multi-Task)
- Real-World Examples
- Confidence Scores and Character Positions
- Best Practices
Quick API with extract_json
For structure-only extraction, use the extract_json() method with the simple dictionary format:
Basic Structure Extraction
from gliner2 import GLiNER2
# Load model
extractor = GLiNER2.from_pretrained("your-model-name")
# Simple product extraction
text = "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage."
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    }
)
print(results)
# Output: {
#     'product': [{
#         'name': 'MacBook Pro',
#         'price': ['$1999'],
#         'features': ['M3 chip', '16GB RAM', '512GB storage']
#     }]
# }
Contact Information
text = """
Contact: John Smith
Email: john@example.com
Phones: 555-1234, 555-5678
Address: 123 Main St, NYC
"""
results = extractor.extract_json(
    text,
    {
        "contact": [
            "name::str",
            "email::str",
            "phone::list",
            "address"
        ]
    }
)
# Output: {
#     'contact': [{
#         'name': 'John Smith',
#         'email': 'john@example.com',
#         'phone': ['555-1234', '555-5678'],
#         'address': ['123 Main St, NYC']
#     }]
# }
Field Types and Specifications
Field Specification Format
Fields support flexible specifications using :: separators:
"field_name::type::description"
"field_name::[choice1|choice2|choice3]::type::description"
"field_name::description" # defaults to list type
"field_name" # simple field, defaults to list
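To make the format concrete, here is a minimal sketch of how such a spec string breaks down into its parts. The `parse_field_spec` helper and the `FieldSpec` tuple are hypothetical names used for illustration; this is not GLiNER2's own parser, and it assumes that a choice field with no explicit type also falls back to `list`.

```python
from typing import NamedTuple, Optional


class FieldSpec(NamedTuple):
    """Hypothetical container for the parts of one field spec."""
    name: str
    dtype: str
    choices: Optional[list]
    description: Optional[str]


def parse_field_spec(spec: str) -> FieldSpec:
    """Split a '::'-separated spec into name, type, choices, and description."""
    name, *rest = spec.split("::")
    # "list" is the documented default when no type is given.
    dtype, choices, description = "list", None, None
    for part in rest:
        if part.startswith("[") and part.endswith("]"):
            choices = part[1:-1].split("|")   # "[a|b|c]" -> ["a", "b", "c"]
        elif part in ("str", "list"):
            dtype = part
        else:
            description = part                # anything else is free text
    return FieldSpec(name, dtype, choices, description)


for spec in (
    "price",
    "symptoms::Reported symptoms",
    "name::str::Event or conference name",
    "seating::[indoor|outdoor|bar]::str::Seating preference",
):
    print(parse_field_spec(spec))
```

The sketch does not handle a description that itself contains `::` or that is literally `str` or `list`; treat it as a reading aid rather than a validator.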
String vs List Fields
text = """
Tech Conference 2024 on June 15th in San Francisco.
Topics include AI, Machine Learning, and Cloud Computing.
Registration fee: $299 for early bird tickets.
"""
results = extractor.extract_json(
    text,
    {
        "event": [
            "name::str::Event or conference name",
            "date::str::Event date",
            "location::str",
            "topics::list::Conference topics",
            "registration_fee::str"
        ]
    }
)
# Output: {
#     'event': [{
#         'name': 'Tech Conference 2024',
#         'date': 'June 15th',
#         'location': 'San Francisco',
#         'topics': ['AI', 'Machine Learning', 'Cloud Computing'],
#         'registration_fee': '$299'
#     }]
# }
Choice Fields (Classification within Structure)
text = """
Reservation at Le Bernardin for 4 people on March 15th at 7:30 PM.
We'd prefer outdoor seating. Two guests are vegetarian and one is gluten-free.
"""
results = extractor.extract_json(
    text,
    {
        "reservation": [
            "restaurant::str::Restaurant name",
            "date::str",
            "time::str",
            "party_size::[1|2|3|4|5|6+]::str::Number of guests",
            "seating::[indoor|outdoor|bar]::str::Seating preference",
            "dietary::[vegetarian|vegan|gluten-free|none]::list::Dietary restrictions"
        ]
    }
)
# Output: {
#     'reservation': [{
#         'restaurant': 'Le Bernardin',
#         'date': 'March 15th',
#         'time': '7:30 PM',
#         'party_size': '4',
#         'seating': 'outdoor',
#         'dietary': ['vegetarian', 'gluten-free']
#     }]
# }
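If downstream code depends on choice fields staying within their declared options, a defensive check on the result can be useful. The sketch below is an assumption-laden illustration: `choice_violations` is a hypothetical helper, not part of GLiNER2, and it works only on the dictionary schema format and the plain (no confidence, no spans) output shape shown above.

```python
import re


def choice_violations(schema: dict, results: dict) -> list:
    """Return (structure, field, value) triples whose value is outside the declared choices."""
    violations = []
    for structure, specs in schema.items():
        # Map field name -> allowed values for specs that contain a [a|b|c] block.
        allowed = {}
        for spec in specs:
            match = re.search(r"::\[([^\]]*)\]", spec)
            if match:
                allowed[spec.split("::", 1)[0]] = set(match.group(1).split("|"))
        for instance in results.get(structure, []):
            for field, choices in allowed.items():
                value = instance.get(field)
                # Treat str and list fields uniformly.
                values = value if isinstance(value, list) else [value]
                for v in values:
                    if v is not None and v not in choices:
                        violations.append((structure, field, v))
    return violations


schema = {"reservation": [
    "restaurant::str::Restaurant name",
    "party_size::[1|2|3|4|5|6+]::str::Number of guests",
    "seating::[indoor|outdoor|bar]::str::Seating preference",
    "dietary::[vegetarian|vegan|gluten-free|none]::list::Dietary restrictions",
]}
results = {"reservation": [{
    "restaurant": "Le Bernardin",
    "party_size": "4",
    "seating": "outdoor",
    "dietary": ["vegetarian", "gluten-free"],
}]}
print(choice_violations(schema, results))  # []
```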
Multiple Instances
GLiNER2 automatically extracts ALL instances of a structure found in text:
Multiple Transactions
text = """
Recent transactions:
- Jan 5: Starbucks $5.50 (food)
- Jan 5: Uber $23.00 (transport)
- Jan 6: Amazon $156.99 (shopping)
"""
results = extractor.extract_json(
    text,
    {
        "transaction": [
            "date::str",
            "merchant::str",
            "amount::str",
            "category::[food|transport|shopping|utilities]::str"
        ]
    }
)
# Output: {
#     'transaction': [
#         {'date': 'Jan 5', 'merchant': 'Starbucks', 'amount': '$5.50', 'category': 'food'},
#         {'date': 'Jan 5', 'merchant': 'Uber', 'amount': '$23.00', 'category': 'transport'},
#         {'date': 'Jan 6', 'merchant': 'Amazon', 'amount': '$156.99', 'category': 'shopping'}
#     ]
# }
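Because every instance comes back as a plain dictionary, the list can be aggregated with ordinary Python. The following sketch totals the amounts per category; `parse_amount` and `totals_by_category` are hypothetical helpers, and the amount parsing assumes simple dollar strings like those in the sample output.

```python
from collections import defaultdict
from decimal import Decimal


def parse_amount(amount: str) -> Decimal:
    """Convert a string such as '$1,250.00' to a Decimal (assumes plain dollar formatting)."""
    return Decimal(amount.replace("$", "").replace(",", ""))


def totals_by_category(transactions: list) -> dict:
    """Sum transaction amounts per category."""
    totals = defaultdict(Decimal)  # Decimal() starts each category at 0
    for tx in transactions:
        totals[tx["category"]] += parse_amount(tx["amount"])
    return dict(totals)


# The sample output from above, used as literal input.
results = {"transaction": [
    {"date": "Jan 5", "merchant": "Starbucks", "amount": "$5.50", "category": "food"},
    {"date": "Jan 5", "merchant": "Uber", "amount": "$23.00", "category": "transport"},
    {"date": "Jan 6", "merchant": "Amazon", "amount": "$156.99", "category": "shopping"},
]}
totals = totals_by_category(results["transaction"])
print(totals)
```

`Decimal` is used instead of `float` so that currency sums stay exact.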
Multiple Hotel Bookings
text = """
Alice Brown booked the Hilton Downtown from March 10 to March 12. She selected a double room
for $340 total with breakfast and parking included.
Robert Taylor reserved The Grand Hotel, April 1 to April 5, suite at $1,200 total.
Amenities include breakfast, wifi, gym, and spa access.
"""
results = extractor.extract_json(
    text,
    {
        "booking": [
            "guest::str::Guest name",
            "hotel::str::Hotel name",
            "check_in::str",
            "check_out::str",
            "room_type::[single|double|suite|deluxe]::str",
            "total_price::str",
            "amenities::[breakfast|wifi|parking|gym|spa]::list"
        ]
    }
)
# Output: {
#     'booking': [
#         {
#             'guest': 'Alice Brown',
#             'hotel': 'Hilton Downtown',
#             'check_in': 'March 10',
#             'check_out': 'March 12',
#             'room_type': 'double',
#             'total_price': '$340',
#             'amenities': ['breakfast', 'parking']
#         },
#         {
#             'guest': 'Robert Taylor',
#             'hotel': 'The Grand Hotel',
#             'check_in': 'April 1',
#             'check_out': 'April 5',
#             'room_type': 'suite',
#             'total_price': '$1,200',
#             'amenities': ['breakfast', 'wifi', 'gym', 'spa']
#         }
#     ]
# }
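A list of instances like this maps naturally onto tabular data. As a rough sketch, the hypothetical `instances_to_csv` helper below writes one row per instance with the standard `csv` module and joins list fields with `"; "`; the separator and the abbreviated sample data are illustrative choices, not anything GLiNER2 prescribes.

```python
import csv
import io


def instances_to_csv(instances: list) -> str:
    """Serialize extracted instances to CSV text, joining list fields with '; '."""
    if not instances:
        return ""
    # Collect field names in first-seen order across all instances.
    fieldnames = list(dict.fromkeys(key for inst in instances for key in inst))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for inst in instances:
        writer.writerow({
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in inst.items()
        })
    return buffer.getvalue()


# Abbreviated version of the booking output above.
bookings = [
    {"guest": "Alice Brown", "hotel": "Hilton Downtown",
     "total_price": "$340", "amenities": ["breakfast", "parking"]},
    {"guest": "Robert Taylor", "hotel": "The Grand Hotel",
     "total_price": "$1,200", "amenities": ["breakfast", "wifi", "gym", "spa"]},
]
csv_text = instances_to_csv(bookings)
print(csv_text)
```

Values that contain commas, such as `$1,200`, are quoted automatically by the `csv` writer.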
Schema Builder (Multi-Task)
Use create_schema() only when combining structured extraction with other tasks (entities, classification):
Multi-Task Extraction
# Use schema builder for multi-task scenarios
schema = (extractor.create_schema()
    # Extract entities
    .entities(["person", "company", "location"])
    # Classify sentiment
    .classification("sentiment", ["positive", "negative", "neutral"])
    # Extract structured product info
    .structure("product")
        .field("name", dtype="str")
        .field("price", dtype="str")
        .field("features", dtype="list")
        .field("category", dtype="str", choices=["electronics", "software", "service"])
)
text = "Apple CEO Tim Cook announced iPhone 15 for $999 with amazing new features. This is exciting!"
results = extractor.extract(text, schema)
# Output: {
#     'entities': {'person': ['Tim Cook'], 'company': ['Apple'], 'location': []},
#     'sentiment': 'positive',
#     'product': [{
#         'name': 'iPhone 15',
#         'price': '$999',
#         'features': ['amazing new features'],
#         'category': 'electronics'
#     }]
# }
Advanced Configuration
schema = (extractor.create_schema()
    .classification("urgency", ["low", "medium", "high"])
    .structure("support_ticket")
        .field("ticket_id", dtype="str", threshold=0.9)  # High precision
        .field("customer", dtype="str", description="Customer name")
        .field("issue", dtype="str", description="Problem description")
        .field("priority", dtype="str", choices=["low", "medium", "high", "urgent"])
        .field("tags", dtype="list", choices=["bug", "feature", "support", "billing"])
)
Real-World Examples
Financial Transaction Processing
text = """
Goldman Sachs processed a $2.5M equity trade for Tesla Inc. on March 15, 2024.
Commission: $1,250. Status: Completed.
"""
results = extractor.extract_json(
    text,
    {
        "transaction": [
            "broker::str::Financial institution",
            "amount::str::Transaction amount",
            "security::str::Stock or financial instrument",
            "date::str::Transaction date",
            "commission::str::Fees charged",
            "status::[pending|completed|failed]::str",
            "type::[equity|bond|option|future]::str"
        ]
    }
)
# Output: {
#     'transaction': [{
#         'broker': 'Goldman Sachs',
#         'amount': '$2.5M',
#         'security': 'Tesla Inc.',
#         'date': 'March 15, 2024',
#         'commission': '$1,250',
#         'status': 'completed',
#         'type': 'equity'
#     }]
# }
Medical Prescription Extraction
text = """
Patient: Sarah Johnson, 34, presented with chest pain.
Prescribed: Lisinopril 10mg daily, Metoprolol 25mg twice daily.
Follow-up scheduled for next Tuesday.
"""
results = extractor.extract_json(
    text,
    {
        "patient": [
            "name::str::Patient full name",
            "age::str::Patient age",
            "symptoms::list::Reported symptoms"
        ],
        "prescription": [
            "medication::str::Drug name",
            "dosage::str::Dosage amount",
            "frequency::str::How often to take"
        ]
    }
)
# Output: {
#     'patient': [{
#         'name': 'Sarah Johnson',
#         'age': '34',
#         'symptoms': ['chest pain']
#     }],
#     'prescription': [
#         {'medication': 'Lisinopril', 'dosage': '10mg', 'frequency': 'daily'},
#         {'medication': 'Metoprolol', 'dosage': '25mg', 'frequency': 'twice daily'}
#     ]
# }
E-commerce Order Processing
text = """
Order #ORD-2024-001 for Alexandra Thompson
Items: Laptop Stand (2x $45.99), Wireless Mouse (1x $29.99), USB Hub (3x $35.50)
Subtotal: $228.46, Tax: $18.28, Total: $246.74
Status: Processing
"""
results = extractor.extract_json(
    text,
    {
        "order": [
            "order_id::str::Order number",
            "customer::str::Customer name",
            "items::list::Product names",
            "quantities::list::Item quantities",
            "unit_prices::list::Individual prices",
            "subtotal::str",
            "tax::str",
            "total::str",
            "status::[pending|processing|shipped|delivered]::str"
        ]
    }
)
# Output: {
#     'order': [{
#         'order_id': 'ORD-2024-001',
#         'customer': 'Alexandra Thompson',
#         'items': ['Laptop Stand', 'Wireless Mouse', 'USB Hub'],
#         'quantities': ['2', '1', '3'],
#         'unit_prices': ['$45.99', '$29.99', '$35.50'],
#         'subtotal': '$228.46',
#         'tax': '$18.28',
#         'total': '$246.74',
#         'status': 'processing'
#     }]
# }
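The `items`, `quantities`, and `unit_prices` fields come back as three separate lists, so pairing them up is left to the caller. The sketch below zips them into line items and recomputes the subtotal as a sanity check. `to_line_items` is a hypothetical helper, and it assumes the three lists are index-aligned, which the extraction does not guarantee.

```python
from decimal import Decimal


def to_line_items(order: dict) -> list:
    """Zip parallel list fields into one dict per line item."""
    items, qtys, prices = order["items"], order["quantities"], order["unit_prices"]
    if not (len(items) == len(qtys) == len(prices)):
        # Misaligned lists usually mean the model missed or duplicated a value.
        raise ValueError("items, quantities, and unit_prices have different lengths")
    return [
        {
            "item": item,
            "quantity": int(qty),
            "unit_price": Decimal(price.lstrip("$").replace(",", "")),
        }
        for item, qty, price in zip(items, qtys, prices)
    ]


# Relevant fields from the order output above.
order = {
    "items": ["Laptop Stand", "Wireless Mouse", "USB Hub"],
    "quantities": ["2", "1", "3"],
    "unit_prices": ["$45.99", "$29.99", "$35.50"],
    "subtotal": "$228.46",
}
line_items = to_line_items(order)
computed_subtotal = sum(li["quantity"] * li["unit_price"] for li in line_items)
print(line_items)
print(computed_subtotal)
```

For the sample order the recomputed sum is $228.47, one cent above the stated subtotal of $228.46, which is the kind of mismatch a check like this is meant to surface.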
Confidence Scores and Character Positions
You can include confidence scores and character-level start/end positions for structured extraction:
# Extract with confidence scores
text = "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage."
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'confidence': 0.95},
#         'price': [{'text': '$1999', 'confidence': 0.92}],
#         'features': [
#             {'text': 'M3 chip', 'confidence': 0.88},
#             {'text': '16GB RAM', 'confidence': 0.90},
#             {'text': '512GB storage', 'confidence': 0.87}
#         ]
#     }]
# }
# Extract with character positions (spans)
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price"
        ]
    },
    include_spans=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'start': 4, 'end': 15},
#         'price': [{'text': '$1999', 'start': 22, 'end': 27}]
#     }]
# }
# Extract with both confidence and spans
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True,
    include_spans=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'confidence': 0.95, 'start': 4, 'end': 15},
#         'price': [{'text': '$1999', 'confidence': 0.92, 'start': 22, 'end': 27}],
#         'features': [
#             {'text': 'M3 chip', 'confidence': 0.88, 'start': 41, 'end': 48},
#             {'text': '16GB RAM', 'confidence': 0.90, 'start': 50, 'end': 58},
#             {'text': '512GB storage', 'confidence': 0.87, 'start': 64, 'end': 77}
#         ]
#     }]
# }
Note: When include_spans or include_confidence is True:
- String fields (dtype="str"): Return dicts with {'text': '...', 'confidence': 0.95, 'start': 0, 'end': 5} (or subset)
- List fields (dtype="list"): Return lists of dicts, each with text, confidence, and positions
- Default (both False): Returns simple strings or lists of strings
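Since the return shape changes when these flags are on, downstream code often needs to verify the spans or reduce the result back to plain values. The sketch below shows one way to do both. `check_spans` and `plain_values` are hypothetical helpers, and the code assumes `end` is exclusive, which is consistent with `'MacBook Pro'` at 4 to 15 in the example above.

```python
def check_spans(text: str, instance: dict) -> list:
    """Return (field, value) pairs whose span does not slice back to the reported text."""
    mismatches = []
    for field, value in instance.items():
        # Treat str fields (one dict) and list fields (list of dicts) uniformly.
        for item in (value if isinstance(value, list) else [value]):
            if isinstance(item, dict) and "start" in item:
                if text[item["start"]:item["end"]] != item["text"]:
                    mismatches.append((field, item["text"]))
    return mismatches


def plain_values(instance: dict, min_confidence: float = 0.0) -> dict:
    """Strip metadata back to strings and lists, dropping items below min_confidence."""
    def keep(item):
        return not isinstance(item, dict) or item.get("confidence", 1.0) >= min_confidence

    def text_of(item):
        return item["text"] if isinstance(item, dict) else item

    out = {}
    for field, value in instance.items():
        if isinstance(value, list):
            out[field] = [text_of(v) for v in value if keep(v)]
        else:
            out[field] = text_of(value) if keep(value) else None
    return out


text = "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage."
instance = {
    "name": {"text": "MacBook Pro", "confidence": 0.95, "start": 4, "end": 15},
    "price": [{"text": "$1999", "confidence": 0.92, "start": 22, "end": 27}],
}
print(check_spans(text, instance))               # []
print(plain_values(instance))                    # {'name': 'MacBook Pro', 'price': ['$1999']}
print(plain_values(instance, min_confidence=0.93))  # {'name': 'MacBook Pro', 'price': []}
```

The confidence values here are the illustrative ones from the example output; a sensible cutoff depends on the model and the task.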
Best Practices
Data Types
- Use ::str for single values (IDs, names, amounts)
- Use ::list or default for multiple values (features, items, tags)
- Use choices [opt1|opt2|opt3] for standardized values
- Add descriptions for complex or domain-specific fields
Quick Decision Guide
Use extract_json() for:
- Structure-only extraction
- Quick data parsing
- Single extraction task
Use create_schema() with extract() for:
- Multi-task scenarios, where you need entities or classification alongside structures
- Complex extraction pipelines