# GLiNER2 JSON Structure Extraction Tutorial

Learn how to extract complex structured data from text using GLiNER2's hierarchical extraction capabilities.

## Table of Contents
- [Quick API with extract_json](#quick-api-with-extract_json)
- [Field Types and Specifications](#field-types-and-specifications)
- [Multiple Instances](#multiple-instances)
- [Schema Builder (Multi-Task)](#schema-builder-multi-task)
- [Real-World Examples](#real-world-examples)
- [Confidence Scores and Character Positions](#confidence-scores-and-character-positions)
- [Best Practices](#best-practices)

## Quick API with extract_json

For structure-only extraction, call `extract_json()` with a plain dictionary that maps each structure name to a list of field specifications:

### Basic Structure Extraction

```python
from gliner2 import GLiNER2

# Load model
extractor = GLiNER2.from_pretrained("your-model-name")

# Simple product extraction
text = "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage."
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    }
)
print(results)
# Output: {
#     'product': [{
#         'name': 'MacBook Pro',
#         'price': ['$1999'],
#         'features': ['M3 chip', '16GB RAM', '512GB storage']
#     }]
# }
```

### Contact Information

```python
text = """
Contact: John Smith
Email: john@example.com
Phones: 555-1234, 555-5678
Address: 123 Main St, NYC
"""

results = extractor.extract_json(
    text,
    {
        "contact": [
            "name::str",
            "email::str",
            "phone::list",
            "address"
        ]
    }
)
# Output: {
#     'contact': [{
#         'name': 'John Smith',
#         'email': 'john@example.com',
#         'phone': ['555-1234', '555-5678'],
#         'address': ['123 Main St, NYC']
#     }]
# }
```

## Field Types and Specifications

### Field Specification Format

Fields support flexible specifications using `::` separators:

```
"field_name::type::description"
"field_name::[choice1|choice2|choice3]::type::description"
"field_name::description"  # defaults to list type
"field_name"               # simple field, defaults to list
```
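
To make the format concrete, the sketch below splits a specification string into its parts. It is an illustration only: `FieldSpec` and `parse_field_spec` are hypothetical names, not part of GLiNER2, and the library's own parsing may differ in edge cases (for example, a description that is literally `str` or `list`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldSpec:
    """Hypothetical container for one parsed field specification."""
    name: str
    dtype: str = "list"  # the tutorial's stated default
    choices: List[str] = field(default_factory=list)
    description: Optional[str] = None


def parse_field_spec(spec: str) -> FieldSpec:
    """Illustrative parser; not GLiNER2's own implementation."""
    name, *rest = spec.split("::")
    parsed = FieldSpec(name=name.strip())
    for part in rest:
        part = part.strip()
        if part.startswith("[") and part.endswith("]"):
            # "[a|b|c]" lists the allowed choices
            parsed.choices = [c.strip() for c in part[1:-1].split("|")]
        elif part in ("str", "list"):
            parsed.dtype = part
        else:
            # anything else is treated as the free-text description
            parsed.description = part
    return parsed


for spec in (
    "price",
    "name::str",
    "topics::Conference topics",
    "seating::[indoor|outdoor|bar]::str::Seating preference",
):
    print(parse_field_spec(spec))
```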

### String vs List Fields

```python
text = """
Tech Conference 2024 on June 15th in San Francisco.
Topics include AI, Machine Learning, and Cloud Computing.
Registration fee: $299 for early bird tickets.
"""

results = extractor.extract_json(
    text,
    {
        "event": [
            "name::str::Event or conference name",
            "date::str::Event date",
            "location::str",
            "topics::list::Conference topics",
            "registration_fee::str"
        ]
    }
)
# Output: {
#     'event': [{
#         'name': 'Tech Conference 2024',
#         'date': 'June 15th',
#         'location': 'San Francisco',
#         'topics': ['AI', 'Machine Learning', 'Cloud Computing'],
#         'registration_fee': '$299'
#     }]
# }
```

### Choice Fields (Classification within Structure)

```python
text = """
Reservation at Le Bernardin for 4 people on March 15th at 7:30 PM.
We'd prefer outdoor seating. Two guests are vegetarian and one is gluten-free.
"""

results = extractor.extract_json(
    text,
    {
        "reservation": [
            "restaurant::str::Restaurant name",
            "date::str",
            "time::str",
            "party_size::[1|2|3|4|5|6+]::str::Number of guests",
            "seating::[indoor|outdoor|bar]::str::Seating preference",
            "dietary::[vegetarian|vegan|gluten-free|none]::list::Dietary restrictions"
        ]
    }
)
# Output: {
#     'reservation': [{
#         'restaurant': 'Le Bernardin',
#         'date': 'March 15th',
#         'time': '7:30 PM',
#         'party_size': '4',
#         'seating': 'outdoor',
#         'dietary': ['vegetarian', 'gluten-free']
#     }]
# }
```
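
This tutorial does not state whether choice fields are guaranteed to contain only the declared options, so a defensive check before downstream use can be worthwhile. The sketch below assumes output shaped like the example above; `find_invalid_choices` and the `ALLOWED` mapping are hypothetical helpers, not GLiNER2 API.

```python
# Allowed options, mirroring the choices declared in the schema above.
ALLOWED = {
    "party_size": {"1", "2", "3", "4", "5", "6+"},
    "seating": {"indoor", "outdoor", "bar"},
    "dietary": {"vegetarian", "vegan", "gluten-free", "none"},
}


def find_invalid_choices(record, allowed):
    """Return {field: [unexpected values]} for the choice fields in one record."""
    problems = {}
    for name, options in allowed.items():
        value = record.get(name)
        if value is None:
            continue  # field missing or empty: nothing to validate
        values = value if isinstance(value, list) else [value]
        bad = [v for v in values if v not in options]
        if bad:
            problems[name] = bad
    return problems


# One record copied from the example output above.
reservation = {
    "restaurant": "Le Bernardin",
    "date": "March 15th",
    "time": "7:30 PM",
    "party_size": "4",
    "seating": "outdoor",
    "dietary": ["vegetarian", "gluten-free"],
}

print(find_invalid_choices(reservation, ALLOWED))  # → {}
```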

## Multiple Instances

GLiNER2 extracts every instance of a structure it finds in the text; each instance becomes one dictionary in that structure's list:

### Multiple Transactions

```python
text = """
Recent transactions:
- Jan 5: Starbucks $5.50 (food)
- Jan 5: Uber $23.00 (transport)
- Jan 6: Amazon $156.99 (shopping)
"""

results = extractor.extract_json(
    text,
    {
        "transaction": [
            "date::str",
            "merchant::str",
            "amount::str",
            "category::[food|transport|shopping|utilities]::str"
        ]
    }
)
# Output: {
#     'transaction': [
#         {'date': 'Jan 5', 'merchant': 'Starbucks', 'amount': '$5.50', 'category': 'food'},
#         {'date': 'Jan 5', 'merchant': 'Uber', 'amount': '$23.00', 'category': 'transport'},
#         {'date': 'Jan 6', 'merchant': 'Amazon', 'amount': '$156.99', 'category': 'shopping'}
#     ]
# }
```
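
Because every instance comes back as a dictionary in a list, ordinary Python is enough for aggregation. The sketch below totals the amounts per category, assuming output shaped exactly like the example above. `parse_amount` and `total_by_category` are hypothetical helpers, and the amount parsing only handles a leading `$` and thousands separators.

```python
from collections import defaultdict
from decimal import Decimal

# Output copied from the example above.
results = {
    "transaction": [
        {"date": "Jan 5", "merchant": "Starbucks", "amount": "$5.50", "category": "food"},
        {"date": "Jan 5", "merchant": "Uber", "amount": "$23.00", "category": "transport"},
        {"date": "Jan 6", "merchant": "Amazon", "amount": "$156.99", "category": "shopping"},
    ]
}


def parse_amount(raw: str) -> Decimal:
    """Turn a string such as '$1,250.00' into a Decimal."""
    return Decimal(raw.replace("$", "").replace(",", ""))


def total_by_category(transactions):
    """Sum the extracted amounts for each category."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for tx in transactions:
        totals[tx["category"]] += parse_amount(tx["amount"])
    return dict(totals)


totals = total_by_category(results["transaction"])
print({category: str(amount) for category, amount in totals.items()})
# → {'food': '5.50', 'transport': '23.00', 'shopping': '156.99'}
```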

### Multiple Hotel Bookings

```python
text = """
Alice Brown booked the Hilton Downtown from March 10 to March 12. She selected a double room
for $340 total with breakfast and parking included.

Robert Taylor reserved The Grand Hotel, April 1 to April 5, suite at $1,200 total.
Amenities include breakfast, wifi, gym, and spa access.
"""

results = extractor.extract_json(
    text,
    {
        "booking": [
            "guest::str::Guest name",
            "hotel::str::Hotel name",
            "check_in::str",
            "check_out::str",
            "room_type::[single|double|suite|deluxe]::str",
            "total_price::str",
            "amenities::[breakfast|wifi|parking|gym|spa]::list"
        ]
    }
)
# Output: {
#     'booking': [
#         {
#             'guest': 'Alice Brown',
#             'hotel': 'Hilton Downtown',
#             'check_in': 'March 10',
#             'check_out': 'March 12',
#             'room_type': 'double',
#             'total_price': '$340',
#             'amenities': ['breakfast', 'parking']
#         },
#         {
#             'guest': 'Robert Taylor',
#             'hotel': 'The Grand Hotel',
#             'check_in': 'April 1',
#             'check_out': 'April 5',
#             'room_type': 'suite',
#             'total_price': '$1,200',
#             'amenities': ['breakfast', 'wifi', 'gym', 'spa']
#         }
#     ]
# }
```

## Schema Builder (Multi-Task)

Use `create_schema()` only when combining structured extraction with other tasks (entities, classification):

### Multi-Task Extraction

```python
# Use schema builder for multi-task scenarios
schema = (extractor.create_schema()
    # Extract entities
    .entities(["person", "company", "location"])

    # Classify sentiment
    .classification("sentiment", ["positive", "negative", "neutral"])

    # Extract structured product info
    .structure("product")
        .field("name", dtype="str")
        .field("price", dtype="str")
        .field("features", dtype="list")
        .field("category", dtype="str", choices=["electronics", "software", "service"])
)

text = "Apple CEO Tim Cook announced iPhone 15 for $999 with amazing new features. This is exciting!"
results = extractor.extract(text, schema)
# Output: {
#     'entities': {'person': ['Tim Cook'], 'company': ['Apple'], 'location': []},
#     'sentiment': 'positive',
#     'product': [{
#         'name': 'iPhone 15',
#         'price': '$999',
#         'features': ['amazing new features'],
#         'category': 'electronics'
#     }]
# }
```

### Advanced Configuration

```python
schema = (extractor.create_schema()
    .classification("urgency", ["low", "medium", "high"])

    .structure("support_ticket")
        .field("ticket_id", dtype="str", threshold=0.9)      # High precision
        .field("customer", dtype="str", description="Customer name")
        .field("issue", dtype="str", description="Problem description")
        .field("priority", dtype="str", choices=["low", "medium", "high", "urgent"])
        .field("tags", dtype="list", choices=["bug", "feature", "support", "billing"])
)
```

## Real-World Examples

### Financial Transaction Processing

```python
text = """
Goldman Sachs processed a $2.5M equity trade for Tesla Inc. on March 15, 2024.
Commission: $1,250. Status: Completed.
"""

results = extractor.extract_json(
    text,
    {
        "transaction": [
            "broker::str::Financial institution",
            "amount::str::Transaction amount",
            "security::str::Stock or financial instrument",
            "date::str::Transaction date",
            "commission::str::Fees charged",
            "status::[pending|completed|failed]::str",
            "type::[equity|bond|option|future]::str"
        ]
    }
)
# Output: {
#     'transaction': [{
#         'broker': 'Goldman Sachs',
#         'amount': '$2.5M',
#         'security': 'Tesla Inc.',
#         'date': 'March 15, 2024',
#         'commission': '$1,250',
#         'status': 'completed',
#         'type': 'equity'
#     }]
# }
```

### Medical Prescription Extraction

```python
text = """
Patient: Sarah Johnson, 34, presented with chest pain.
Prescribed: Lisinopril 10mg daily, Metoprolol 25mg twice daily.
Follow-up scheduled for next Tuesday.
"""

results = extractor.extract_json(
    text,
    {
        "patient": [
            "name::str::Patient full name",
            "age::str::Patient age",
            "symptoms::list::Reported symptoms"
        ],
        "prescription": [
            "medication::str::Drug name",
            "dosage::str::Dosage amount",
            "frequency::str::How often to take"
        ]
    }
)
# Output: {
#     'patient': [{
#         'name': 'Sarah Johnson',
#         'age': '34',
#         'symptoms': ['chest pain']
#     }],
#     'prescription': [
#         {'medication': 'Lisinopril', 'dosage': '10mg', 'frequency': 'daily'},
#         {'medication': 'Metoprolol', 'dosage': '25mg', 'frequency': 'twice daily'}
#     ]
# }
```

### E-commerce Order Processing

```python
text = """
Order #ORD-2024-001 for Alexandra Thompson
Items: Laptop Stand (2x $45.99), Wireless Mouse (1x $29.99), USB Hub (3x $35.50)
Subtotal: $228.47, Tax: $18.28, Total: $246.75
Status: Processing
"""

results = extractor.extract_json(
    text,
    {
        "order": [
            "order_id::str::Order number",
            "customer::str::Customer name",
            "items::list::Product names",
            "quantities::list::Item quantities",
            "unit_prices::list::Individual prices",
            "subtotal::str",
            "tax::str",
            "total::str",
            "status::[pending|processing|shipped|delivered]::str"
        ]
    }
)
# Output: {
#     'order': [{
#         'order_id': 'ORD-2024-001',
#         'customer': 'Alexandra Thompson',
#         'items': ['Laptop Stand', 'Wireless Mouse', 'USB Hub'],
#         'quantities': ['2', '1', '3'],
#         'unit_prices': ['$45.99', '$29.99', '$35.50'],
#         'subtotal': '$228.47',
#         'tax': '$18.28',
#         'total': '$246.75',
#         'status': 'processing'
#     }]
# }
```
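
The order schema returns `items`, `quantities`, and `unit_prices` as three parallel lists. Nothing in this tutorial guarantees that they line up position by position, so it is safer to check their lengths before pairing them. The sketch below is one way to do that; `to_line_items` is a hypothetical helper, and it assumes quantities are plain integers and prices carry a leading `$`.

```python
from decimal import Decimal

# The three parallel list fields from the example output above.
order = {
    "items": ["Laptop Stand", "Wireless Mouse", "USB Hub"],
    "quantities": ["2", "1", "3"],
    "unit_prices": ["$45.99", "$29.99", "$35.50"],
}


def to_line_items(order):
    """Pair the parallel lists into one dict per product."""
    items = order["items"]
    quantities = order["quantities"]
    prices = order["unit_prices"]
    if not (len(items) == len(quantities) == len(prices)):
        raise ValueError("parallel lists differ in length; cannot pair them safely")
    line_items = []
    for name, qty, price in zip(items, quantities, prices):
        quantity = int(qty)
        unit_price = Decimal(price.lstrip("$"))
        line_items.append({
            "item": name,
            "quantity": quantity,
            "unit_price": unit_price,
            "line_total": quantity * unit_price,
        })
    return line_items


for line in to_line_items(order):
    print(line["item"], line["quantity"], line["line_total"])
```

If the lengths differ, the helper raises rather than silently pairing values that may belong to different products.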

## Confidence Scores and Character Positions

Pass `include_confidence=True`, `include_spans=True`, or both to `extract_json()` to attach confidence scores and character-level start/end positions to each extracted value:

```python
# Extract with confidence scores
text = "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage."
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'confidence': 0.95},
#         'price': [{'text': '$1999', 'confidence': 0.92}],
#         'features': [
#             {'text': 'M3 chip', 'confidence': 0.88},
#             {'text': '16GB RAM', 'confidence': 0.90},
#             {'text': '512GB storage', 'confidence': 0.87}
#         ]
#     }]
# }

# Extract with character positions (spans)
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price"
        ]
    },
    include_spans=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'start': 4, 'end': 15},
#         'price': [{'text': '$1999', 'start': 22, 'end': 27}]
#     }]
# }

# Extract with both confidence and spans
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True,
    include_spans=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'confidence': 0.95, 'start': 4, 'end': 15},
#         'price': [{'text': '$1999', 'confidence': 0.92, 'start': 22, 'end': 27}],
#         'features': [
#             {'text': 'M3 chip', 'confidence': 0.88, 'start': 41, 'end': 48},
#             {'text': '16GB RAM', 'confidence': 0.90, 'start': 50, 'end': 58},
#             {'text': '512GB storage', 'confidence': 0.87, 'start': 64, 'end': 77}
#         ]
#     }]
# }
```

**Note**: When `include_spans` or `include_confidence` is `True`:
- **String fields** (`::str` or `dtype="str"`): Return a dict such as `{'text': '...', 'confidence': 0.95, 'start': 0, 'end': 5}`, containing only the keys you requested
- **List fields** (`::list` or `dtype="list"`): Return a list of such dicts, one per extracted value
- **Default** (both `False`): Returns plain strings or lists of strings
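
Spans and confidences are easiest to trust once they are checked against the source text. The sketch below assumes that `end` is exclusive, as the `MacBook Pro` example (`start` 4, `end` 15) suggests, so that `text[start:end]` reproduces the extracted string. `iter_values`, `span_mismatches`, and `keep_confident` are hypothetical helpers, not GLiNER2 API.

```python
text = "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage."

# One record shaped like the combined confidence + spans output above.
product = {
    "name": {"text": "MacBook Pro", "confidence": 0.95, "start": 4, "end": 15},
    "price": [{"text": "$1999", "confidence": 0.92, "start": 22, "end": 27}],
}


def iter_values(record):
    """Yield (field_name, value_dict) for both string and list fields."""
    for field_name, value in record.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            yield field_name, item


def span_mismatches(text, record):
    """List values whose span does not slice back to the extracted text."""
    return [
        (field_name, item["text"])
        for field_name, item in iter_values(record)
        if text[item["start"]:item["end"]] != item["text"]
    ]


def keep_confident(record, min_confidence):
    """Drop values below the threshold; list fields keep their list shape."""
    kept = {}
    for field_name, value in record.items():
        if isinstance(value, list):
            kept[field_name] = [v for v in value if v["confidence"] >= min_confidence]
        elif value["confidence"] >= min_confidence:
            kept[field_name] = value
    return kept


print(span_mismatches(text, product))  # → []
print(keep_confident(product, 0.93))
```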

## Best Practices

### Data Types

- Use `::str` for single values (IDs, names, amounts)
- Use `::list` or default for multiple values (features, items, tags)
- Use choices `[opt1|opt2|opt3]` for standardized values
- Add descriptions for complex or domain-specific fields

### Quick Decision Guide

**Use `extract_json()`** for:
- Structure-only extraction
- Quick data parsing
- Single extraction task

**Use `create_schema()` with `extract()`** for:
- Multi-task scenarios (entities or classification alongside structures)
- Complex extraction pipelines