515 lines
13 KiB
Markdown
515 lines
13 KiB
Markdown
# GLiNER2 API Extractor
|
|
|
|
Use GLiNER2 through a cloud API without loading models locally. Perfect for production deployments, low-memory environments, or when you need instant access without GPU setup.
|
|
|
|
## Table of Contents
|
|
- [Getting Started](#getting-started)
|
|
- [Basic Usage](#basic-usage)
|
|
- [Entity Extraction](#entity-extraction)
|
|
- [Text Classification](#text-classification)
|
|
- [Structured Extraction](#structured-extraction)
|
|
- [Relation Extraction](#relation-extraction)
|
|
- [Combined Schemas](#combined-schemas)
|
|
- [Batch Processing](#batch-processing)
|
|
- [Confidence Scores](#confidence-scores)
|
|
- [Error Handling](#error-handling)
|
|
- [API vs Local](#api-vs-local)
|
|
|
|
## Getting Started
|
|
|
|
### Get Your API Key
|
|
|
|
1. Visit [gliner.pioneer.ai](https://gliner.pioneer.ai)
|
|
2. Sign up or log in to your account
|
|
3. Navigate to API Keys section
|
|
4. Generate a new API key
|
|
|
|
### Installation
|
|
|
|
```bash
|
|
pip install gliner2
|
|
```
|
|
|
|
### Set Your API Key
|
|
|
|
**Option 1: Environment Variable (Recommended)**
|
|
```bash
|
|
export PIONEER_API_KEY="your-api-key-here"
|
|
```
|
|
|
|
**Option 2: Pass Directly**
|
|
```python
|
|
extractor = GLiNER2.from_api(api_key="your-api-key-here")
|
|
```
|
|
|
|
## Basic Usage
|
|
|
|
```python
|
|
from gliner2 import GLiNER2
|
|
|
|
# Load from API (uses PIONEER_API_KEY environment variable)
|
|
extractor = GLiNER2.from_api()
|
|
|
|
# Use exactly like the local model!
|
|
results = extractor.extract_entities(
|
|
"Apple CEO Tim Cook announced the iPhone 15 in Cupertino.",
|
|
["company", "person", "product", "location"]
|
|
)
|
|
print(results)
|
|
# Output: {
|
|
# 'entities': {
|
|
# 'company': ['Apple'],
|
|
# 'person': ['Tim Cook'],
|
|
# 'product': ['iPhone 15'],
|
|
# 'location': ['Cupertino']
|
|
# }
|
|
# }
|
|
```
|
|
|
|
## Entity Extraction
|
|
|
|
### Simple Extraction
|
|
|
|
```python
|
|
extractor = GLiNER2.from_api()
|
|
|
|
text = "Elon Musk founded SpaceX in 2002 and Tesla in 2003."
|
|
results = extractor.extract_entities(
|
|
text,
|
|
["person", "company", "date"]
|
|
)
|
|
# Output: {
|
|
# 'entities': {
|
|
# 'person': ['Elon Musk'],
|
|
# 'company': ['SpaceX', 'Tesla'],
|
|
# 'date': ['2002', '2003']
|
|
# }
|
|
# }
|
|
```
|
|
|
|
### With Confidence Scores and Character Positions
|
|
|
|
You can include confidence scores and character-level start/end positions using `include_confidence` and `include_spans`:
|
|
|
|
```python
|
|
# With confidence only
|
|
results = extractor.extract_entities(
|
|
"Microsoft acquired LinkedIn for $26.2 billion.",
|
|
["company", "price"],
|
|
include_confidence=True
|
|
)
|
|
# Output: {
|
|
# 'entities': {
|
|
# 'company': [
|
|
# {'text': 'Microsoft', 'confidence': 0.98},
|
|
# {'text': 'LinkedIn', 'confidence': 0.97}
|
|
# ],
|
|
# 'price': [
|
|
# {'text': '$26.2 billion', 'confidence': 0.95}
|
|
# ]
|
|
# }
|
|
# }
|
|
|
|
# With character positions (spans) only
|
|
results = extractor.extract_entities(
|
|
"Microsoft acquired LinkedIn.",
|
|
["company"],
|
|
include_spans=True
|
|
)
|
|
# Output: {
|
|
# 'entities': {
|
|
# 'company': [
|
|
# {'text': 'Microsoft', 'start': 0, 'end': 9},
|
|
# {'text': 'LinkedIn', 'start': 18, 'end': 26}
|
|
# ]
|
|
# }
|
|
# }
|
|
|
|
# With both confidence and spans
|
|
results = extractor.extract_entities(
|
|
"Microsoft acquired LinkedIn for $26.2 billion.",
|
|
["company", "price"],
|
|
include_confidence=True,
|
|
include_spans=True
|
|
)
|
|
# Output: {
|
|
# 'entities': {
|
|
# 'company': [
|
|
# {'text': 'Microsoft', 'confidence': 0.98, 'start': 0, 'end': 9},
|
|
# {'text': 'LinkedIn', 'confidence': 0.97, 'start': 18, 'end': 26}
|
|
# ],
|
|
# 'price': [
|
|
# {'text': '$26.2 billion', 'confidence': 0.95, 'start': 32, 'end': 45}
|
|
# ]
|
|
# }
|
|
# }
|
|
```
|
|
|
|
### Custom Threshold
|
|
|
|
```python
|
|
# Only return high-confidence extractions
|
|
results = extractor.extract_entities(
|
|
text,
|
|
["person", "company"],
|
|
threshold=0.8 # Minimum 80% confidence
|
|
)
|
|
```
|
|
|
|
## Text Classification
|
|
|
|
### Single-Label Classification
|
|
|
|
```python
|
|
extractor = GLiNER2.from_api()
|
|
|
|
text = "I absolutely love this product! It exceeded all my expectations."
|
|
results = extractor.classify_text(
|
|
text,
|
|
{"sentiment": ["positive", "negative", "neutral"]}
|
|
)
|
|
# Output: {'sentiment': {'category': 'positive'}}
|
|
```
|
|
|
|
### Multi-Task Classification
|
|
|
|
```python
|
|
text = "Breaking: Major earthquake hits coastal city. Rescue teams deployed."
|
|
results = extractor.classify_text(
|
|
text,
|
|
{
|
|
"category": ["politics", "sports", "technology", "disaster", "business"],
|
|
"urgency": ["low", "medium", "high"]
|
|
}
|
|
)
|
|
# Output: {'category': 'disaster', 'urgency': 'high'}
|
|
```
|
|
|
|
## Structured Extraction
|
|
|
|
### Contact Information
|
|
|
|
```python
|
|
extractor = GLiNER2.from_api()
|
|
|
|
text = """
|
|
Contact John Smith at john.smith@email.com or call +1-555-123-4567.
|
|
He works as a Senior Engineer at TechCorp Inc.
|
|
"""
|
|
|
|
results = extractor.extract_json(
|
|
text,
|
|
{
|
|
"contact": [
|
|
"name::str::Full name of the person",
|
|
"email::str::Email address",
|
|
"phone::str::Phone number",
|
|
"job_title::str::Professional title",
|
|
"company::str::Company name"
|
|
]
|
|
}
|
|
)
|
|
# Output: {
|
|
# 'contact': [{
|
|
# 'name': 'John Smith',
|
|
# 'email': 'john.smith@email.com',
|
|
# 'phone': '+1-555-123-4567',
|
|
# 'job_title': 'Senior Engineer',
|
|
# 'company': 'TechCorp Inc.'
|
|
# }]
|
|
# }
|
|
```
|
|
|
|
### Product Information
|
|
|
|
```python
|
|
text = "iPhone 15 Pro Max - $1199, 256GB storage, Natural Titanium color"
|
|
|
|
results = extractor.extract_json(
|
|
text,
|
|
{
|
|
"product": [
|
|
"name::str",
|
|
"price::str",
|
|
"storage::str",
|
|
"color::str"
|
|
]
|
|
}
|
|
)
|
|
# Output: {
|
|
# 'product': [{
|
|
# 'name': 'iPhone 15 Pro Max',
|
|
# 'price': '$1199',
|
|
# 'storage': '256GB',
|
|
# 'color': 'Natural Titanium'
|
|
# }]
|
|
# }
|
|
```
|
|
|
|
## Relation Extraction
|
|
|
|
Extract relationships between entities as directional tuples (source, target).
|
|
|
|
### Basic Relation Extraction
|
|
|
|
```python
|
|
extractor = GLiNER2.from_api()
|
|
|
|
text = "John works for Apple Inc. and lives in San Francisco. Apple Inc. is located in Cupertino."
|
|
results = extractor.extract_relations(
|
|
text,
|
|
["works_for", "lives_in", "located_in"]
|
|
)
|
|
# Output: {
|
|
# 'relation_extraction': {
|
|
# 'works_for': [('John', 'Apple Inc.')],
|
|
# 'lives_in': [('John', 'San Francisco')],
|
|
# 'located_in': [('Apple Inc.', 'Cupertino')]
|
|
# }
|
|
# }
|
|
```
|
|
|
|
### With Descriptions
|
|
|
|
```python
|
|
text = "Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne, California."
|
|
|
|
schema = extractor.create_schema().relations({
|
|
"founded": "Founding relationship where person created organization",
|
|
"located_in": "Geographic relationship where entity is in a location"
|
|
})
|
|
|
|
results = extractor.extract(text, schema)
|
|
# Output: {
|
|
# 'relation_extraction': {
|
|
# 'founded': [('Elon Musk', 'SpaceX')],
|
|
# 'located_in': [('SpaceX', 'Hawthorne, California')]
|
|
# }
|
|
# }
|
|
```
|
|
|
|
### Batch Relation Extraction
|
|
|
|
```python
|
|
texts = [
|
|
"John works for Microsoft and lives in Seattle.",
|
|
"Sarah founded TechStartup in 2020.",
|
|
"Bob reports to Alice at Google."
|
|
]
|
|
|
|
results = extractor.batch_extract_relations(
|
|
texts,
|
|
["works_for", "founded", "reports_to", "lives_in"]
|
|
)
|
|
# Returns list of relation extraction results for each text
|
|
```
|
|
|
|
## Combined Schemas
|
|
|
|
Combine entities, classification, relations, and structured extraction in a single call.
|
|
|
|
```python
|
|
extractor = GLiNER2.from_api()
|
|
|
|
text = """
|
|
Tech Review: The new MacBook Pro M3 is absolutely fantastic! Apple has outdone themselves.
|
|
I tested it in San Francisco last week. Tim Cook works for Apple, which is located in Cupertino.
|
|
Highly recommended for developers. Rating: 5 out of 5 stars.
|
|
"""
|
|
|
|
schema = (extractor.create_schema()
|
|
.entities(["company", "product", "location", "person"])
|
|
.classification("sentiment", ["positive", "negative", "neutral"])
|
|
.relations(["works_for", "located_in"])
|
|
.structure("review")
|
|
.field("product_name", dtype="str")
|
|
.field("rating", dtype="str")
|
|
.field("recommendation", dtype="str")
|
|
)
|
|
|
|
results = extractor.extract(text, schema)
|
|
# Output: {
|
|
# 'entities': {
|
|
# 'company': ['Apple'],
|
|
# 'product': ['MacBook Pro M3'],
|
|
# 'location': ['San Francisco', 'Cupertino'],
|
|
# 'person': ['Tim Cook']
|
|
# },
|
|
# 'sentiment': 'positive',
|
|
# 'relation_extraction': {
|
|
# 'works_for': [('Tim Cook', 'Apple')],
|
|
# 'located_in': [('Apple', 'Cupertino')]
|
|
# },
|
|
# 'review': [{
|
|
# 'product_name': 'MacBook Pro M3',
|
|
# 'rating': '5 out of 5 stars',
|
|
# 'recommendation': 'Highly recommended for developers'
|
|
# }]
|
|
# }
|
|
```
|
|
|
|
## Batch Processing
|
|
|
|
Process multiple texts efficiently in a single API call.
|
|
|
|
```python
|
|
extractor = GLiNER2.from_api()
|
|
|
|
texts = [
|
|
"Google's Sundar Pichai unveiled Gemini AI in Mountain View.",
|
|
"Microsoft CEO Satya Nadella announced Copilot at Build 2023.",
|
|
"Amazon's Andy Jassy revealed new AWS services in Seattle."
|
|
]
|
|
|
|
results = extractor.batch_extract_entities(
|
|
texts,
|
|
["company", "person", "product", "location"]
|
|
)
|
|
|
|
for i, result in enumerate(results):
|
|
print(f"Text {i+1}: {result}")
|
|
```
|
|
|
|
## Confidence Scores and Character Positions
|
|
|
|
### Entity Extraction with Confidence
|
|
|
|
```python
|
|
# Include confidence scores
|
|
results = extractor.extract_entities(
|
|
"Apple released the iPhone 15 in September 2023.",
|
|
["company", "product", "date"],
|
|
include_confidence=True
|
|
)
|
|
# Each entity includes: {'text': '...', 'confidence': 0.95}
|
|
```
|
|
|
|
### Entity Extraction with Character Positions
|
|
|
|
```python
|
|
# Include character-level start/end positions
|
|
results = extractor.extract_entities(
|
|
"Apple released the iPhone 15.",
|
|
["company", "product"],
|
|
include_spans=True
|
|
)
|
|
# Each entity includes: {'text': '...', 'start': 0, 'end': 5}
|
|
```
|
|
|
|
### Both Confidence and Positions
|
|
|
|
```python
|
|
# Include both confidence and character positions
|
|
results = extractor.extract_entities(
|
|
"Apple released the iPhone 15 in September 2023.",
|
|
["company", "product", "date"],
|
|
include_confidence=True,
|
|
include_spans=True
|
|
)
|
|
# Each entity includes: {'text': '...', 'confidence': 0.95, 'start': 0, 'end': 5}
|
|
```
|
|
|
|
### Raw Results (Advanced)
|
|
|
|
For full control over the extraction data:
|
|
|
|
```python
|
|
results = extractor.extract_entities(
|
|
"Apple CEO Tim Cook announced new products.",
|
|
["company", "person"],
|
|
format_results=False, # Get raw extraction data
|
|
include_confidence=True,
|
|
include_spans=True
|
|
)
|
|
# Returns tuples: (text, confidence, start_char, end_char)
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
```python
|
|
from gliner2 import GLiNER2, GLiNER2APIError, AuthenticationError, ValidationError
|
|
|
|
try:
|
|
extractor = GLiNER2.from_api()
|
|
results = extractor.extract_entities(text, entity_types)
|
|
|
|
except AuthenticationError:
|
|
print("Invalid API key. Check your PIONEER_API_KEY.")
|
|
|
|
except ValidationError as e:
|
|
print(f"Invalid request: {e}")
|
|
|
|
except GLiNER2APIError as e:
|
|
print(f"API error: {e}")
|
|
```
|
|
|
|
### Connection Settings
|
|
|
|
```python
|
|
extractor = GLiNER2.from_api(
|
|
api_key="your-key",
|
|
timeout=60.0, # Request timeout (seconds)
|
|
max_retries=5 # Retry failed requests
|
|
)
|
|
```
|
|
|
|
## API vs Local
|
|
|
|
| Feature | API (`from_api()`) | Local (`from_pretrained()`) |
|
|
|---------|-------------------|----------------------------|
|
|
| Setup | Just API key | GPU/CPU + model download |
|
|
| Memory | ~0 MB | 2-8 GB+ |
|
|
| Latency | Network dependent | Faster for single texts |
|
|
| Batch | Optimized | Optimized |
|
|
| Cost | Per request | Free after setup |
|
|
| Offline | ❌ | ✅ |
|
|
| RegexValidator | ❌ | ✅ |
|
|
|
|
### When to Use API
|
|
|
|
- Production deployments without GPU
|
|
- Serverless functions (AWS Lambda, etc.)
|
|
- Quick prototyping
|
|
- Low-memory environments
|
|
- Mobile/edge applications
|
|
|
|
### When to Use Local
|
|
|
|
- High-volume processing
|
|
- Offline requirements
|
|
- Sensitive data (no network transfer)
|
|
- Need for RegexValidator
|
|
- Cost optimization at scale
|
|
|
|
## Seamless Switching
|
|
|
|
The API mirrors the local interface exactly, making switching trivial:
|
|
|
|
```python
|
|
# Development: Use API for quick iteration
|
|
extractor = GLiNER2.from_api()
|
|
|
|
# Production: Switch to local if needed
|
|
# extractor = GLiNER2.from_pretrained("your-model")
|
|
|
|
# Same code works with both!
|
|
results = extractor.extract_entities(text, entity_types)
|
|
```
|
|
|
|
## Limitations
|
|
|
|
The API currently does not support:
|
|
|
|
1. **RegexValidator** - Use local model for regex-based filtering
|
|
2. **Multi-schema batch** - Different schemas per text in batch (works but slower)
|
|
3. **Custom models** - API uses the default GLiNER2 model
|
|
|
|
## Best Practices
|
|
|
|
1. **Store API key securely** - Use environment variables, not hardcoded strings
|
|
2. **Handle errors gracefully** - Network issues can occur
|
|
3. **Use batch processing** - More efficient than individual calls
|
|
4. **Set appropriate timeouts** - Increase for large texts
|
|
5. **Cache results** - Avoid redundant API calls for same content
|
|
|