agent-smith/packages/GLiNER2/tutorial/7-api.md
2026-03-06 12:59:32 +01:00

# GLiNER2 API Extractor
Use GLiNER2 through a cloud API without loading models locally. Perfect for production deployments, low-memory environments, or when you need instant access without GPU setup.
## Table of Contents
- [Getting Started](#getting-started)
- [Basic Usage](#basic-usage)
- [Entity Extraction](#entity-extraction)
- [Text Classification](#text-classification)
- [Structured Extraction](#structured-extraction)
- [Relation Extraction](#relation-extraction)
- [Combined Schemas](#combined-schemas)
- [Batch Processing](#batch-processing)
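- [Confidence Scores and Character Positions](#confidence-scores-and-character-positions)
- [Error Handling](#error-handling)
- [API vs Local](#api-vs-local)
- [Seamless Switching](#seamless-switching)
- [Limitations](#limitations)
- [Best Practices](#best-practices)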
## Getting Started
### Get Your API Key
1. Visit [gliner.pioneer.ai](https://gliner.pioneer.ai)
2. Sign up or log in to your account
3. Navigate to the API Keys section
4. Generate a new API key
### Installation
```bash
pip install gliner2
```
### Set Your API Key
**Option 1: Environment Variable (Recommended)**
```bash
export PIONEER_API_KEY="your-api-key-here"
```
**Option 2: Pass Directly**
```python
from gliner2 import GLiNER2

extractor = GLiNER2.from_api(api_key="your-api-key-here")
```
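Whichever option you choose, a fail-fast check can make a missing key obvious before any request is sent. The following is a minimal sketch using only the standard library; `load_api_key` is a hypothetical helper for illustration and is not part of `gliner2`.

```python
import os


def load_api_key(env_var: str = "PIONEER_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; it is not part of gliner2.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before calling from_api().")
    return key
```

The returned value could then be passed along as `GLiNER2.from_api(api_key=load_api_key())`.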
## Basic Usage
```python
from gliner2 import GLiNER2

# Load from API (uses the PIONEER_API_KEY environment variable)
extractor = GLiNER2.from_api()

# Use exactly like the local model!
results = extractor.extract_entities(
    "Apple CEO Tim Cook announced the iPhone 15 in Cupertino.",
    ["company", "person", "product", "location"]
)
print(results)
# Output: {
#     'entities': {
#         'company': ['Apple'],
#         'person': ['Tim Cook'],
#         'product': ['iPhone 15'],
#         'location': ['Cupertino']
#     }
# }
```
## Entity Extraction
### Simple Extraction
```python
extractor = GLiNER2.from_api()

text = "Elon Musk founded SpaceX in 2002 and Tesla in 2003."
results = extractor.extract_entities(
    text,
    ["person", "company", "date"]
)
# Output: {
#     'entities': {
#         'person': ['Elon Musk'],
#         'company': ['SpaceX', 'Tesla'],
#         'date': ['2002', '2003']
#     }
# }
```
### With Confidence Scores and Character Positions
You can include confidence scores and character-level start/end positions using `include_confidence` and `include_spans`:
```python
# With confidence only
results = extractor.extract_entities(
    "Microsoft acquired LinkedIn for $26.2 billion.",
    ["company", "price"],
    include_confidence=True
)
# Output: {
#     'entities': {
#         'company': [
#             {'text': 'Microsoft', 'confidence': 0.98},
#             {'text': 'LinkedIn', 'confidence': 0.97}
#         ],
#         'price': [
#             {'text': '$26.2 billion', 'confidence': 0.95}
#         ]
#     }
# }

# With character positions (spans) only
results = extractor.extract_entities(
    "Microsoft acquired LinkedIn.",
    ["company"],
    include_spans=True
)
# Output: {
#     'entities': {
#         'company': [
#             {'text': 'Microsoft', 'start': 0, 'end': 9},
#             {'text': 'LinkedIn', 'start': 19, 'end': 27}
#         ]
#     }
# }

# With both confidence and spans
results = extractor.extract_entities(
    "Microsoft acquired LinkedIn for $26.2 billion.",
    ["company", "price"],
    include_confidence=True,
    include_spans=True
)
# Output: {
#     'entities': {
#         'company': [
#             {'text': 'Microsoft', 'confidence': 0.98, 'start': 0, 'end': 9},
#             {'text': 'LinkedIn', 'confidence': 0.97, 'start': 19, 'end': 27}
#         ],
#         'price': [
#             {'text': '$26.2 billion', 'confidence': 0.95, 'start': 32, 'end': 45}
#         ]
#     }
# }
```
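A quick way to sanity-check spans is to slice the input text and compare the slice with the entity text. The sketch below assumes that `end` is exclusive (as `'Microsoft'` at `0` to `9` suggests) and runs on a hand-built sample rather than a real API response; `check_spans` and `_span` are hypothetical names.

```python
def check_spans(text: str, entities: dict) -> list:
    """Return (label, item) pairs whose span does not reproduce the entity text."""
    mismatches = []
    for label, items in entities.items():
        for item in items:
            # Assumes 'end' is exclusive, like a Python slice bound.
            if text[item["start"]:item["end"]] != item["text"]:
                mismatches.append((label, item))
    return mismatches


text = "Microsoft acquired LinkedIn for $26.2 billion."


def _span(phrase: str) -> dict:
    """Build a sample entity dict; offsets come from str.find, not from the API."""
    start = text.find(phrase)
    return {"text": phrase, "start": start, "end": start + len(phrase)}


sample = {
    "company": [_span("Microsoft"), _span("LinkedIn")],
    "price": [_span("$26.2 billion")],
}

print(check_spans(text, sample))  # [] because every span slices back to its text
```

A non-empty result would point to an offset mismatch, for example when the text was modified between extraction and lookup.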
### Custom Threshold
```python
# Only return high-confidence extractions
results = extractor.extract_entities(
    text,
    ["person", "company"],
    threshold=0.8  # Minimum 80% confidence
)
```
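If results were already fetched with `include_confidence=True`, a stricter cutoff can also be applied afterwards without another request. This is a small sketch on a hand-written sample with made-up scores; `filter_by_confidence` is a hypothetical helper, not a `gliner2` function.

```python
def filter_by_confidence(entities: dict, min_confidence: float) -> dict:
    """Keep only entity dicts whose 'confidence' is at least min_confidence."""
    return {
        label: [item for item in items if item["confidence"] >= min_confidence]
        for label, items in entities.items()
    }


# Hand-written sample in the include_confidence=True shape; the scores are made up.
sample = {
    "company": [
        {"text": "Microsoft", "confidence": 0.98},
        {"text": "LinkedIn", "confidence": 0.72},
    ],
    "price": [{"text": "$26.2 billion", "confidence": 0.95}],
}

strict = filter_by_confidence(sample, 0.8)
print(strict["company"])  # [{'text': 'Microsoft', 'confidence': 0.98}]
```

Note that this can only tighten a cutoff: entities the original call dropped below its own `threshold` cannot be recovered this way.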
## Text Classification
### Single-Label Classification
```python
extractor = GLiNER2.from_api()

text = "I absolutely love this product! It exceeded all my expectations."
results = extractor.classify_text(
    text,
    {"sentiment": ["positive", "negative", "neutral"]}
)
# Output: {'sentiment': 'positive'}
```
### Multi-Task Classification
```python
text = "Breaking: Major earthquake hits coastal city. Rescue teams deployed."
results = extractor.classify_text(
    text,
    {
        "category": ["politics", "sports", "technology", "disaster", "business"],
        "urgency": ["low", "medium", "high"]
    }
)
# Output: {'category': 'disaster', 'urgency': 'high'}
```
## Structured Extraction
### Contact Information
```python
extractor = GLiNER2.from_api()

text = """
Contact John Smith at john.smith@email.com or call +1-555-123-4567.
He works as a Senior Engineer at TechCorp Inc.
"""
results = extractor.extract_json(
    text,
    {
        "contact": [
            "name::str::Full name of the person",
            "email::str::Email address",
            "phone::str::Phone number",
            "job_title::str::Professional title",
            "company::str::Company name"
        ]
    }
)
# Output: {
#     'contact': [{
#         'name': 'John Smith',
#         'email': 'john.smith@email.com',
#         'phone': '+1-555-123-4567',
#         'job_title': 'Senior Engineer',
#         'company': 'TechCorp Inc.'
#     }]
# }
```
### Product Information
```python
text = "iPhone 15 Pro Max - $1199, 256GB storage, Natural Titanium color"
results = extractor.extract_json(
    text,
    {
        "product": [
            "name::str",
            "price::str",
            "storage::str",
            "color::str"
        ]
    }
)
# Output: {
#     'product': [{
#         'name': 'iPhone 15 Pro Max',
#         'price': '$1199',
#         'storage': '256GB',
#         'color': 'Natural Titanium'
#     }]
# }
```
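The field strings in the two examples above appear to follow a `name::dtype::description` layout, with the description being optional. The sketch below only makes that layout explicit; it is not how `gliner2` parses fields internally, and `FieldSpec`, `parse_field_spec`, and the `"str"` default are assumptions made for illustration.

```python
from typing import NamedTuple


class FieldSpec(NamedTuple):
    name: str
    dtype: str
    description: str


def parse_field_spec(spec: str) -> FieldSpec:
    """Split 'name::dtype::description' into parts; the description may be absent."""
    parts = spec.split("::", 2)
    name = parts[0]
    dtype = parts[1] if len(parts) > 1 else "str"  # default dtype is an assumption
    description = parts[2] if len(parts) > 2 else ""
    return FieldSpec(name, dtype, description)


print(parse_field_spec("email::str::Email address"))
# FieldSpec(name='email', dtype='str', description='Email address')
print(parse_field_spec("price::str"))
# FieldSpec(name='price', dtype='str', description='')
```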
## Relation Extraction
Extract relationships between entities as directional tuples (source, target).
### Basic Relation Extraction
```python
extractor = GLiNER2.from_api()

text = "John works for Apple Inc. and lives in San Francisco. Apple Inc. is located in Cupertino."
results = extractor.extract_relations(
    text,
    ["works_for", "lives_in", "located_in"]
)
# Output: {
#     'relation_extraction': {
#         'works_for': [('John', 'Apple Inc.')],
#         'lives_in': [('John', 'San Francisco')],
#         'located_in': [('Apple Inc.', 'Cupertino')]
#     }
# }
```
### With Descriptions
```python
text = "Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne, California."
schema = extractor.create_schema().relations({
    "founded": "Founding relationship where person created organization",
    "located_in": "Geographic relationship where entity is in a location"
})
results = extractor.extract(text, schema)
# Output: {
#     'relation_extraction': {
#         'founded': [('Elon Musk', 'SpaceX')],
#         'located_in': [('SpaceX', 'Hawthorne, California')]
#     }
# }
```
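Because each relation is a directional `(source, target)` tuple, the output is easy to regroup by source entity for downstream use. Here is a minimal sketch on the example output above; `relations_to_graph` is a hypothetical helper, not part of `gliner2`.

```python
from collections import defaultdict


def relations_to_graph(relations: dict) -> dict:
    """Map each source entity to a list of (relation, target) edges."""
    graph = defaultdict(list)
    for relation, pairs in relations.items():
        for source, target in pairs:
            graph[source].append((relation, target))
    return dict(graph)


# Copied from the example output above (the value under 'relation_extraction').
sample = {
    "founded": [("Elon Musk", "SpaceX")],
    "located_in": [("SpaceX", "Hawthorne, California")],
}

graph = relations_to_graph(sample)
print(graph["SpaceX"])  # [('located_in', 'Hawthorne, California')]
```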
### Batch Relation Extraction
```python
texts = [
    "John works for Microsoft and lives in Seattle.",
    "Sarah founded TechStartup in 2020.",
    "Bob reports to Alice at Google."
]
results = extractor.batch_extract_relations(
    texts,
    ["works_for", "founded", "reports_to", "lives_in"]
)
# Returns a list of relation extraction results, one per text
```
## Combined Schemas
Combine entities, classification, relations, and structured extraction in a single call.
```python
extractor = GLiNER2.from_api()

text = """
Tech Review: The new MacBook Pro M3 is absolutely fantastic! Apple has outdone themselves.
I tested it in San Francisco last week. Tim Cook works for Apple, which is located in Cupertino.
Highly recommended for developers. Rating: 5 out of 5 stars.
"""
schema = (extractor.create_schema()
    .entities(["company", "product", "location", "person"])
    .classification("sentiment", ["positive", "negative", "neutral"])
    .relations(["works_for", "located_in"])
    .structure("review")
        .field("product_name", dtype="str")
        .field("rating", dtype="str")
        .field("recommendation", dtype="str")
)
results = extractor.extract(text, schema)
# Output: {
#     'entities': {
#         'company': ['Apple'],
#         'product': ['MacBook Pro M3'],
#         'location': ['San Francisco', 'Cupertino'],
#         'person': ['Tim Cook']
#     },
#     'sentiment': 'positive',
#     'relation_extraction': {
#         'works_for': [('Tim Cook', 'Apple')],
#         'located_in': [('Apple', 'Cupertino')]
#     },
#     'review': [{
#         'product_name': 'MacBook Pro M3',
#         'rating': '5 out of 5 stars',
#         'recommendation': 'Highly recommended for developers'
#     }]
# }
```
## Batch Processing
Process multiple texts efficiently in a single API call.
```python
extractor = GLiNER2.from_api()

texts = [
    "Google's Sundar Pichai unveiled Gemini AI in Mountain View.",
    "Microsoft CEO Satya Nadella announced Copilot at Build 2023.",
    "Amazon's Andy Jassy revealed new AWS services in Seattle."
]
results = extractor.batch_extract_entities(
    texts,
    ["company", "person", "product", "location"]
)
# One result per input text, in the same order
for i, result in enumerate(results):
    print(f"Text {i+1}: {result}")
```
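This tutorial does not state a maximum batch size, so for very long lists it may be safer to send the texts in smaller chunks. The sketch below shows one way to split a list; `chunked` is a hypothetical helper and the chunk size of 3 is an arbitrary choice for the demo.

```python
from typing import Iterator, List


def chunked(texts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


texts = [f"document {i}" for i in range(7)]
batches = list(chunked(texts, 3))
print([len(batch) for batch in batches])  # [3, 3, 1]
```

Each chunk could then be passed to `batch_extract_entities` in turn and the returned lists concatenated, which keeps the results in input order.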
## Confidence Scores and Character Positions
### Entity Extraction with Confidence
```python
# Include confidence scores
results = extractor.extract_entities(
    "Apple released the iPhone 15 in September 2023.",
    ["company", "product", "date"],
    include_confidence=True
)
# Each entity includes: {'text': '...', 'confidence': 0.95}
```
### Entity Extraction with Character Positions
```python
# Include character-level start/end positions
results = extractor.extract_entities(
    "Apple released the iPhone 15.",
    ["company", "product"],
    include_spans=True
)
# Each entity includes: {'text': '...', 'start': 0, 'end': 5}
```
### Both Confidence and Positions
```python
# Include both confidence and character positions
results = extractor.extract_entities(
    "Apple released the iPhone 15 in September 2023.",
    ["company", "product", "date"],
    include_confidence=True,
    include_spans=True
)
# Each entity includes: {'text': '...', 'confidence': 0.95, 'start': 0, 'end': 5}
```
### Raw Results (Advanced)
For full control over the extraction data:
```python
results = extractor.extract_entities(
    "Apple CEO Tim Cook announced new products.",
    ["company", "person"],
    format_results=False,  # Get raw extraction data
    include_confidence=True,
    include_spans=True
)
# Returns tuples: (text, confidence, start_char, end_char)
```
## Error Handling
```python
from gliner2 import GLiNER2, GLiNER2APIError, AuthenticationError, ValidationError

# Sample inputs so the snippet runs on its own
text = "Apple CEO Tim Cook announced the iPhone 15 in Cupertino."
entity_types = ["company", "person", "product", "location"]

try:
    extractor = GLiNER2.from_api()
    results = extractor.extract_entities(text, entity_types)
except AuthenticationError:
    print("Invalid API key. Check your PIONEER_API_KEY.")
except ValidationError as e:
    print(f"Invalid request: {e}")
except GLiNER2APIError as e:
    print(f"API error: {e}")
```
### Connection Settings
```python
extractor = GLiNER2.from_api(
    api_key="your-key",
    timeout=60.0,   # Request timeout (seconds)
    max_retries=5   # Retry failed requests
)
```
## API vs Local
| Feature | API (`from_api()`) | Local (`from_pretrained()`) |
|---------|-------------------|----------------------------|
| Setup | Just API key | GPU/CPU + model download |
| Memory | ~0 MB | 2-8 GB+ |
| Latency | Network dependent | Faster for single texts |
| Batch | Optimized | Optimized |
| Cost | Per request | Free after setup |
| Offline | ❌ | ✅ |
| RegexValidator | ❌ | ✅ |
### When to Use API
- Production deployments without GPU
- Serverless functions (AWS Lambda, etc.)
- Quick prototyping
- Low-memory environments
- Mobile/edge applications
### When to Use Local
- High-volume processing
- Offline requirements
- Sensitive data (no network transfer)
- Need for RegexValidator
- Cost optimization at scale
## Seamless Switching
The API mirrors the local interface for the features it supports (see [Limitations](#limitations)), so switching usually means changing a single line:
```python
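from gliner2 import GLiNER2

# Sample inputs so the snippet runs on its own
text = "Apple CEO Tim Cook announced the iPhone 15 in Cupertino."
entity_types = ["company", "person", "product", "location"]

# Development: Use API for quick iteration
extractor = GLiNER2.from_api()

# Production: Switch to local if needed
# extractor = GLiNER2.from_pretrained("your-model")

# Same code works with both!
results = extractor.extract_entities(text, entity_types)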
```
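Rather than commenting lines in and out, the choice of backend could be driven by configuration. The following is a sketch under stated assumptions: the `GLINER2_BACKEND` environment variable and both helper functions are hypothetical, and `"your-model"` is the same placeholder used above.

```python
import os
from typing import Mapping


def choose_backend(env: Mapping[str, str]) -> str:
    """Return 'api' or 'local' based on a hypothetical GLINER2_BACKEND variable."""
    value = env.get("GLINER2_BACKEND", "api").strip().lower()
    if value not in ("api", "local"):
        raise ValueError(f"Unknown backend: {value!r}")
    return value


def make_extractor(env: Mapping[str, str] = os.environ):
    """Build an extractor for the chosen backend."""
    # Imported lazily so choose_backend can be used without gliner2 installed.
    from gliner2 import GLiNER2

    if choose_backend(env) == "local":
        return GLiNER2.from_pretrained("your-model")  # placeholder model name
    return GLiNER2.from_api()


print(choose_backend({"GLINER2_BACKEND": "local"}))  # local
print(choose_backend({}))                            # api
```

Defaulting to the API keeps a fresh checkout working with only an API key, while a deployment can opt in to the local model by setting the variable.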
## Limitations
The API currently has the following limitations:
1. **RegexValidator** - Not supported; use the local model for regex-based filtering
2. **Multi-schema batch** - Using a different schema per text in a batch works, but is slower
3. **Custom models** - API uses the default GLiNER2 model
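
When switching to the local model is not practical, a rough client-side substitute for the missing RegexValidator is to post-filter the extracted strings with the standard `re` module. This sketch is not equivalent to RegexValidator and makes no claim about how it works; `filter_entities_by_regex` and the email pattern are illustrative assumptions.

```python
import re


def filter_entities_by_regex(entities: dict, patterns: dict) -> dict:
    """Drop entity strings that do not fully match the pattern for their label.

    Labels without a pattern are passed through unchanged.
    """
    filtered = {}
    for label, values in entities.items():
        pattern = patterns.get(label)
        if pattern is None:
            filtered[label] = list(values)
        else:
            compiled = re.compile(pattern)
            filtered[label] = [v for v in values if compiled.fullmatch(v)]
    return filtered


# Hand-written sample in the plain-string shape (the value under 'entities').
sample = {"email": ["john.smith@email.com", "call me"], "person": ["John Smith"]}
patterns = {"email": r"[^@\s]+@[^@\s]+\.[^@\s]+"}  # deliberately loose email check

print(filter_entities_by_regex(sample, patterns))
# {'email': ['john.smith@email.com'], 'person': ['John Smith']}
```
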
## Best Practices
1. **Store API key securely** - Use environment variables, not hardcoded strings
2. **Handle errors gracefully** - Network issues can occur
3. **Use batch processing** - More efficient than individual calls
4. **Set appropriate timeouts** - Increase for large texts
5. **Cache results** - Avoid redundant API calls for same content
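
As an illustration of the caching advice, the sketch below wraps any `extract(text, labels)` callable with an in-memory cache keyed by a hash of the inputs. `CachedExtractor`, `cache_key`, and `fake_extract` are hypothetical names; the stand-in function replaces a real call such as `extractor.extract_entities` so the example runs without network access. It also assumes that label order does not affect the result.

```python
import hashlib
import json
from typing import Callable, Dict, List


def cache_key(text: str, labels: List[str]) -> str:
    """Build a stable key from the text and the (order-insensitive) label list."""
    payload = json.dumps({"text": text, "labels": sorted(labels)}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedExtractor:
    """Wrap an extract(text, labels) callable with an in-memory cache."""

    def __init__(self, extract: Callable[[str, List[str]], dict]):
        self._extract = extract
        self._cache: Dict[str, dict] = {}
        self.calls = 0  # number of times the wrapped callable actually ran

    def __call__(self, text: str, labels: List[str]) -> dict:
        key = cache_key(text, labels)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._extract(text, labels)
        return self._cache[key]


def fake_extract(text: str, labels: List[str]) -> dict:
    """Stand-in for a real API call; returns an empty result per label."""
    return {"entities": {label: [] for label in labels}}


cached = CachedExtractor(fake_extract)
cached("Apple released the iPhone 15.", ["company", "product"])
cached("Apple released the iPhone 15.", ["product", "company"])  # same key, cache hit
print(cached.calls)  # 1
```

The cache returns the same dict object on every hit, so callers that modify results should copy them first. A long-running service would also want a size limit or expiry, which this sketch leaves out.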