agent-smith/packages/GLiNER2/tutorial/6-relation_extraction.md

# GLiNER2 Relation Extraction Tutorial

Learn how to extract relations between entities from text using GLiNER2's relation extraction capabilities.

## Table of Contents
- [Basic Relation Extraction](#basic-relation-extraction)
- [Multiple Relation Types](#multiple-relation-types)
- [Relation Extraction with Descriptions](#relation-extraction-with-descriptions)
- [Custom Thresholds](#custom-thresholds)
- [Batch Processing](#batch-processing)
- [Combining with Other Tasks](#combining-with-other-tasks)
- [Real-World Examples](#real-world-examples)
- [Best Practices](#best-practices)

## Basic Relation Extraction

### Simple Example

```python
from gliner2 import GLiNER2

# Load model
extractor = GLiNER2.from_pretrained("your-model-name")

# Extract relations
text = "John works for Apple Inc. and lives in San Francisco."
results = extractor.extract_relations(
    text,
    ["works_for", "lives_in"]
)
print(results)
# Output: {
#     'relation_extraction': {
#         'works_for': [('John', 'Apple Inc.')],
#         'lives_in': [('John', 'San Francisco')]
#     }
# }
```

### Using Schema Builder

```python
# Same extraction using schema
schema = extractor.create_schema().relations([
    "works_for", "lives_in"
])
results = extractor.extract(text, schema)
```

### Understanding the Output Format

Relations are returned as tuples `(source, target)` grouped under the `relation_extraction` key. **All requested relation types are included in the output, even if no relations are found** (they appear as empty lists `[]`):

```python
text = "Alice manages the Engineering team. Bob reports to Alice."
results = extractor.extract_relations(
    text,
    ["manages", "reports_to", "founded"]  # Note: "founded" not found in text
)
# Output: {
#     'relation_extraction': {
#         'manages': [('Alice', 'Engineering team')],
#         'reports_to': [('Bob', 'Alice')],
#         'founded': []  # Empty list - relation type requested but not found
#     }
# }
```

This ensures consistent output structure - all requested relation types will always be present in the results, making it easier to process the output programmatically.

## Multiple Relation Types

You can extract multiple relation types in a single call:

```python
text = """
Sarah founded TechCorp in 2020. She is married to Mike,
who works at Google. TechCorp is located in Seattle.
"""

results = extractor.extract_relations(
    text,
    ["founded", "married_to", "works_at", "located_in"]
)
# Output: {
#     'relation_extraction': {
#         'founded': [('Sarah', 'TechCorp')],
#         'married_to': [('Sarah', 'Mike')],
#         'works_at': [('Mike', 'Google')],
#         'located_in': [('TechCorp', 'Seattle')]
#     }
# }
```

### Multiple Instances per Relation Type

GLiNER2 automatically extracts all relation instances found in the text:

```python
text = """
John works for Microsoft. Mary works for Google.
Bob works for Apple. All three live in California.
"""

results = extractor.extract_relations(
    text,
    ["works_for", "lives_in"]
)
# Output: {
#     'relation_extraction': {
#         'works_for': [
#             ('John', 'Microsoft'),
#             ('Mary', 'Google'),
#             ('Bob', 'Apple')
#         ],
#         'lives_in': [
#             ('John', 'California'),
#             ('Mary', 'California'),
#             ('Bob', 'California')
#         ]
#     }
# }
```

## Relation Extraction with Descriptions

Providing descriptions helps improve extraction accuracy by clarifying the relation semantics:

```python
schema = extractor.create_schema().relations({
    "works_for": "Employment relationship where person works at organization",
    "founded": "Founding relationship where person created organization",
    "acquired": "Acquisition relationship where company bought another company",
    "located_in": "Geographic relationship where entity is in a location"
})

text = """
Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne, California.
Tesla acquired SolarCity in 2016. Many engineers work for SpaceX.
"""

results = extractor.extract(text, schema)
```

### Advanced Configuration

```python
schema = extractor.create_schema().relations({
    "works_for": {
        "description": "Employment or professional relationship",
        "threshold": 0.7  # Higher precision for employment relations
    },
    "located_in": {
        "description": "Geographic containment relationship",
        "threshold": 0.6  # Moderate threshold
    },
    "reports_to": {
        "description": "Organizational hierarchy relationship",
        "threshold": 0.8  # Very high precision
    }
})
```

## Custom Thresholds

### Global Threshold

```python
# High-precision relation extraction
results = extractor.extract_relations(
    text,
    ["acquired", "merged_with"],
    threshold=0.8  # High confidence required
)
```

### Per-Relation Thresholds

```python
schema = extractor.create_schema().relations({
    "acquired": {
        "description": "Company acquisition relationship",
        "threshold": 0.9  # Very high precision
    },
    "partnered_with": {
        "description": "Partnership or collaboration relationship",
        "threshold": 0.6  # Moderate threshold
    },
    "competes_with": {
        "description": "Competitive relationship",
        "threshold": 0.5  # Lower threshold for implicit relations
    }
})
```

### With Confidence Scores and Character Positions

You can include confidence scores and character-level start/end positions for relation extractions:

```python
# Extract relations with confidence scores
text = "John works for Apple Inc. and lives in San Francisco."
results = extractor.extract_relations(
    text,
    ["works_for", "lives_in"],
    include_confidence=True
)
print(results)
# Output: {
#     'relation_extraction': {
#         'works_for': [{
#             'head': {'text': 'John', 'confidence': 0.95},
#             'tail': {'text': 'Apple Inc.', 'confidence': 0.92}
#         }],
#         'lives_in': [{
#             'head': {'text': 'John', 'confidence': 0.94},
#             'tail': {'text': 'San Francisco', 'confidence': 0.91}
#         }]
#     }
# }

# Extract with character positions (spans)
results = extractor.extract_relations(
    text,
    ["works_for", "lives_in"],
    include_spans=True
)
print(results)
# Output: {
#     'relation_extraction': {
#         'works_for': [{
#             'head': {'text': 'John', 'start': 0, 'end': 4},
#             'tail': {'text': 'Apple Inc.', 'start': 15, 'end': 25}
#         }],
#         'lives_in': [{
#             'head': {'text': 'John', 'start': 0, 'end': 4},
#             'tail': {'text': 'San Francisco', 'start': 33, 'end': 46}
#         }]
#     }
# }

# Extract with both confidence and spans
results = extractor.extract_relations(
    text,
    ["works_for", "lives_in"],
    include_confidence=True,
    include_spans=True
)
print(results)
# Output: {
#     'relation_extraction': {
#         'works_for': [{
#             'head': {'text': 'John', 'confidence': 0.95, 'start': 0, 'end': 4},
#             'tail': {'text': 'Apple Inc.', 'confidence': 0.92, 'start': 15, 'end': 25}
#         }],
#         'lives_in': [{
#             'head': {'text': 'John', 'confidence': 0.94, 'start': 0, 'end': 4},
#             'tail': {'text': 'San Francisco', 'confidence': 0.91, 'start': 33, 'end': 46}
#         }]
#     }
# }
```

**Note**: When `include_spans` or `include_confidence` is True, relations are returned as dictionaries with `head` and `tail` keys, each containing the extracted text along with optional confidence scores and character positions. When both are False (default), relations are returned as simple tuples `(head, tail)`.

## Batch Processing

Process multiple texts efficiently:

```python
texts = [
    "John works for Microsoft and lives in Seattle.",
    "Sarah founded TechStartup in 2020.",
    "Bob reports to Alice at Google."
]

results = extractor.batch_extract_relations(
    texts,
    ["works_for", "founded", "reports_to", "lives_in"],
    batch_size=8
)
# Output: [
#     {
#         'relation_extraction': {
#             'works_for': [('John', 'Microsoft')],
#             'lives_in': [('John', 'Seattle')],
#             'founded': [],      # Not found in first text
#             'reports_to': []   # Not found in first text
#         }
#     },
#     {
#         'relation_extraction': {
#             'works_for': [],    # Not found in second text
#             'founded': [('Sarah', 'TechStartup')],
#             'reports_to': [],   # Not found in second text
#             'lives_in': []      # Not found in second text
#         }
#     },
#     {
#         'relation_extraction': {
#             'works_for': [('Alice', 'Google')],
#             'reports_to': [('Bob', 'Alice')],
#             'founded': [],      # Not found in third text
#             'lives_in': []      # Not found in third text
#         }
#     }
# ]
```

**Note**: All requested relation types appear in each result, even if empty. This ensures consistent structure across all batch results, making it easier to process programmatically.

## Combining with Other Tasks

Relation extraction can be combined with entity extraction, classification, and structured extraction:

### Relations + Entities

```python
schema = (extractor.create_schema()
    .entities(["person", "organization", "location"])
    .relations(["works_for", "located_in"])
)

text = "Tim Cook works for Apple Inc., which is located in Cupertino, California."
results = extractor.extract(text, schema)
# Output: {
#     'entities': {
#         'person': ['Tim Cook'],
#         'organization': ['Apple Inc.'],
#         'location': ['Cupertino', 'California']
#     },
#     'relation_extraction': {
#         'works_for': [('Tim Cook', 'Apple Inc.')],
#         'located_in': [('Apple Inc.', 'Cupertino')]
#     }
# }
```

### Relations + Classification + Structures

```python
schema = (extractor.create_schema()
    .classification("document_type", ["news", "report", "announcement"])
    .entities(["person", "company"])
    .relations(["works_for", "acquired"])
    .structure("event")
        .field("date", dtype="str")
        .field("description", dtype="str")
)

text = """
BREAKING: Microsoft announced today that it acquired GitHub.
Satya Nadella, CEO of Microsoft, confirmed the deal.
The acquisition was finalized on October 26, 2018.
"""

results = extractor.extract(text, schema)
```

## Real-World Examples

### Organizational Relationships

```python
org_schema = extractor.create_schema().relations({
    "reports_to": "Direct reporting relationship in organizational hierarchy",
    "manages": "Management relationship where person manages team/department",
    "works_for": "Employment relationship",
    "founded": "Founding relationship",
    "acquired": "Company acquisition relationship"
})

text = """
Sundar Pichai is the CEO of Google. He reports to the board of directors.
Google acquired YouTube in 2006. Many engineers work for Google.
"""

results = extractor.extract(text, org_schema)
# Output: {
#     'relation_extraction': {
#         'reports_to': [('Sundar Pichai', 'board of directors')],
#         'works_for': [('engineers', 'Google')],
#         'acquired': [('Google', 'YouTube')]
#     }
# }
```

### Medical Relationships

```python
medical_schema = extractor.create_schema().relations({
    "treats": "Medical treatment relationship between doctor and patient",
    "prescribed_for": "Prescription relationship between medication and condition",
    "causes": "Causal relationship between condition and symptom",
    "located_in": "Anatomical location relationship"
})

text = """
Dr. Smith treats patients with diabetes. Metformin is prescribed for Type 2 Diabetes.
High blood sugar causes frequent urination. The pancreas is located in the abdomen.
"""

results = extractor.extract(text, medical_schema)
```

### Financial Relationships

```python
finance_schema = extractor.create_schema().relations({
    "invested_in": "Investment relationship between investor and company",
    "acquired": "Company acquisition relationship",
    "merged_with": "Merger relationship between companies",
    "owns": "Ownership relationship"
})

text = """
SoftBank invested in Uber in 2018. Microsoft acquired LinkedIn in 2016.
Disney merged with 21st Century Fox. Berkshire Hathaway owns Geico.
"""

results = extractor.extract(text, finance_schema)
```

### Geographic Relationships

```python
geo_schema = extractor.create_schema().relations({
    "located_in": "Geographic containment (city in country, etc.)",
    "borders": "Geographic adjacency relationship",
    "capital_of": "Capital city relationship",
    "flows_through": "River or waterway relationship"
})

text = """
Paris is the capital of France. France borders Germany and Spain.
The Seine flows through Paris. Paris is located in France.
"""

results = extractor.extract(text, geo_schema)
```

### Family Relationships

```python
family_schema = extractor.create_schema().relations({
    "married_to": "Marriage relationship",
    "parent_of": "Parent-child relationship",
    "sibling_of": "Sibling relationship",
    "related_to": "General family relationship"
})

text = """
John is married to Mary. They are parents of two children: Alice and Bob.
Alice and Bob are siblings. Mary is related to her sister Sarah.
"""

results = extractor.extract(text, family_schema)
```

### Academic Relationships

```python
academic_schema = extractor.create_schema().relations({
    "authored": "Publication relationship between author and paper",
    "cited": "Citation relationship between papers",
    "supervised": "Academic supervision relationship",
    "affiliated_with": "Institutional affiliation relationship"
})

text = """
Dr. Johnson authored the paper on machine learning. The paper cited
previous work by Dr. Smith. Dr. Johnson supervises graduate students
at MIT, where she is affiliated with the Computer Science department.
"""

results = extractor.extract(text, academic_schema)
```

## Best Practices

### 1. Use Clear, Specific Relation Names

```python
# Good - Clear and specific
schema.relations(["works_for", "reports_to", "manages"])

# Less ideal - Too generic
schema.relations(["related", "connected", "linked"])
```

### 2. Provide Descriptions for Ambiguous Relations

```python
# Good - Clear descriptions
schema.relations({
    "works_for": "Employment relationship where person works at organization",
    "consulted_for": "Consulting relationship where person provides services to organization"
})

# Less ideal - No context
schema.relations(["works_for", "consulted_for"])
```

### 3. Set Appropriate Thresholds

```python
# High precision for critical relations
schema.relations({
    "acquired": {
        "description": "Company acquisition",
        "threshold": 0.9  # Very high precision
    },
    "partnered_with": {
        "description": "Partnership relationship",
        "threshold": 0.6  # Moderate threshold
    }
})
```

### 4. Combine with Entity Extraction

```python
# Extract both entities and relations for better context
schema = (extractor.create_schema()
    .entities(["person", "organization"])
    .relations(["works_for", "founded"])
)
```

### 5. Use Batch Processing for Multiple Texts

```python
# Efficient batch processing
results = extractor.batch_extract_relations(
    texts,
    relation_types,
    batch_size=8  # Adjust based on your hardware
)
```

### 6. Handle Multiple Instances

```python
# GLiNER2 automatically extracts all instances
text = "John works for Apple. Mary works for Google. Bob works for Microsoft."
results = extractor.extract_relations(text, ["works_for"])
# Returns all three work relationships
```

### 7. Handle Empty Relations

All requested relation types are always included in the output, even if empty:

```python
results = extractor.extract_relations(
    "John works for Microsoft.",
    ["works_for", "founded", "acquired"]
)
# Output: {
#     'relation_extraction': {
#         'works_for': [('John', 'Microsoft')],
#         'founded': [],      # Empty - not found in text
#         'acquired': []      # Empty - not found in text
#     }
# }

# This makes it easy to check for relations programmatically:
for rel_type, rels in results['relation_extraction'].items():
    if rels:  # Non-empty
        print(f"Found {len(rels)} {rel_type} relations")
    else:  # Empty
        print(f"No {rel_type} relations found")
```

### 7. Validate Relation Direction

Relations are directional tuples `(source, target)`:
- `works_for`: (person, organization)
- `located_in`: (entity, location)
- `reports_to`: (subordinate, manager)
- `manages`: (manager, team)

Make sure your relation names match the expected direction.

## Common Use Cases

### Knowledge Graph Construction

```python
# Extract entities and relations for knowledge graph
schema = (extractor.create_schema()
    .entities(["person", "organization", "location", "product"])
    .relations([
        "works_for", "founded", "located_in", "created",
        "acquired", "partnered_with"
    ])
)

# Process documents to build knowledge graph
documents = [...]  # Your documents
all_relations = []
all_entities = []

for doc in documents:
    results = extractor.extract(doc, schema)
    all_relations.append(results.get("relation_extraction", {}))
    all_entities.append(results.get("entities", {}))
```

### Relationship Analysis

```python
# Analyze organizational structures
org_texts = [...]  # Organizational documents
results = extractor.batch_extract_relations(
    org_texts,
    ["reports_to", "manages", "works_for", "collaborates_with"],
    batch_size=8
)

# Analyze relationship patterns
for result in results:
    relations = result.get("relation_extraction", {})
    # Process relations for analysis
```

### Document Understanding

```python
# Comprehensive document understanding
schema = (extractor.create_schema()
    .classification("document_type", ["contract", "report", "email"])
    .entities(["person", "organization", "date", "amount"])
    .relations(["signed_by", "involves", "dated", "worth"])
    .structure("contract_term")
        .field("term", dtype="str")
        .field("value", dtype="str")
)

# Extract all information types in one pass
results = extractor.extract(document_text, schema)
```