# GLiNER2 Classification Tutorial

This tutorial covers all the ways to perform text classification with GLiNER2, from simple single-label classification to complex multi-label tasks with custom configurations.

## Table of Contents
- [Setup](#setup)
- [Single-Label Classification](#single-label-classification)
- [Multi-Label Classification](#multi-label-classification)
- [Classification with Descriptions](#classification-with-descriptions)
- [Using the Quick API](#using-the-quick-api)
- [Multiple Classification Tasks](#multiple-classification-tasks)
- [Advanced Configurations](#advanced-configurations)
- [Best Practices](#best-practices)
- [Common Use Cases](#common-use-cases)

## Setup

```python
from gliner2 import GLiNER2

# Load the pre-trained model ("your-model-name" is a placeholder)
extractor = GLiNER2.from_pretrained("your-model-name")
```

## Single-Label Classification

The simplest form: classify text into exactly one of several categories.

### Basic Example

```python
# Define the schema
schema = extractor.create_schema().classification(
    "sentiment",
    ["positive", "negative", "neutral"]
)

# Extract
text = "This product exceeded my expectations! Absolutely love it."
results = extractor.extract(text, schema)
print(results)
# Expected output: {'sentiment': 'positive'}
```

### With Confidence Scores

```python
# Same schema as above
schema = extractor.create_schema().classification(
    "sentiment",
    ["positive", "negative", "neutral"]
)

text = "The service was okay, nothing special but not bad either."
results = extractor.extract(text, schema, include_confidence=True)
print(results)
# Expected output: {'sentiment': {'label': 'neutral', 'confidence': 0.82}}
```

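A confidence score is mostly useful for deciding what to do with uncertain predictions. The following is a minimal sketch, assuming the `{'label': ..., 'confidence': ...}` shape shown in the expected output above. The helper `label_or_fallback`, the `0.6` cut-off, and the `"needs_review"` fallback value are illustrative choices, not part of GLiNER2.

```python
def label_or_fallback(prediction, min_confidence=0.6, fallback="needs_review"):
    """Return the predicted label, or a fallback value when confidence is too low."""
    if prediction["confidence"] >= min_confidence:
        return prediction["label"]
    return fallback


# Sample result mirroring the expected output above (illustrative values)
results = {"sentiment": {"label": "neutral", "confidence": 0.82}}
print(label_or_fallback(results["sentiment"]))

# A low-confidence prediction is routed to the fallback instead
uncertain = {"label": "positive", "confidence": 0.41}
print(label_or_fallback(uncertain))
```

The right cut-off depends on your data, so treat the default as a starting point to tune.
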
## Multi-Label Classification

Use multi-label classification when a text can belong to several categories at once.

```python
# Multi-label classification
schema = extractor.create_schema().classification(
    "topics",
    ["technology", "business", "health", "politics", "sports"],
    multi_label=True,
    cls_threshold=0.3  # Lower threshold for multi-label
)

text = "Apple announced new health monitoring features in their latest smartwatch, boosting their stock price."
results = extractor.extract(text, schema)
print(results)
# Expected output: {'topics': ['technology', 'business', 'health']}

# With confidence scores
results = extractor.extract(text, schema, include_confidence=True)
print(results)
# Expected output: {'topics': [
#     {'label': 'technology', 'confidence': 0.92},
#     {'label': 'business', 'confidence': 0.78},
#     {'label': 'health', 'confidence': 0.65}
# ]}
```

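With `include_confidence=True`, the multi-label result above is a list of label/confidence dicts. The following is a minimal sketch, assuming that output shape, of applying a stricter cut-off afterwards without re-running the model. The helper name `filter_labels` and the sample values are illustrative and not part of GLiNER2.

```python
def filter_labels(predictions, min_confidence):
    """Return labels at or above min_confidence, highest confidence first."""
    kept = [p for p in predictions if p["confidence"] >= min_confidence]
    kept.sort(key=lambda p: p["confidence"], reverse=True)
    return [p["label"] for p in kept]


# Sample result copied from the expected output above (illustrative values)
results = {
    "topics": [
        {"label": "technology", "confidence": 0.92},
        {"label": "business", "confidence": 0.78},
        {"label": "health", "confidence": 0.65},
    ]
}

print(filter_labels(results["topics"], 0.7))
# Prints: ['technology', 'business']
```

Extracting once with a permissive `cls_threshold` and filtering afterwards makes it cheap to compare several cut-offs on the same predictions.
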
## Classification with Descriptions

Label descriptions give the model extra context and can improve accuracy, especially when label names alone are ambiguous or similar to each other.

```python
# With label descriptions
schema = extractor.create_schema().classification(
    "document_type",
    {
        "invoice": "A bill for goods or services with payment details",
        "receipt": "Proof of payment for a completed transaction",
        "contract": "Legal agreement between parties with terms and conditions",
        "proposal": "Document outlining suggested plans or services with pricing"
    }
)

text = "Please find attached the itemized bill for consulting services rendered in Q3 2024. Payment is due within 30 days."
results = extractor.extract(text, schema)
print(results)
# Expected output: {'document_type': 'invoice'}

# Another example
text2 = "Thank you for your payment of $500. This confirms your transaction was completed on March 1st, 2024."
results2 = extractor.extract(text2, schema)
print(results2)
# Expected output: {'document_type': 'receipt'}
```

## Using the Quick API

For simple classification tasks, `classify_text` lets you skip building a schema.

### Single Task

```python
text = "The new AI model shows remarkable performance improvements."
results = extractor.classify_text(
    text,
    {"sentiment": ["positive", "negative", "neutral"]}
)
print(results)
# Expected output: {'sentiment': 'positive'}

# Another example
text2 = "The software keeps crashing and customer support is unresponsive."
results2 = extractor.classify_text(
    text2,
    {"sentiment": ["positive", "negative", "neutral"]}
)
print(results2)
# Expected output: {'sentiment': 'negative'}
```

### Multiple Tasks

```python
text = "Breaking: Tech giant announces major layoffs amid market downturn"
results = extractor.classify_text(
    text,
    {
        "sentiment": ["positive", "negative", "neutral"],
        "urgency": ["high", "medium", "low"],
        "category": {
            "labels": ["tech", "finance", "politics", "sports"],
            "multi_label": False
        }
    }
)
print(results)
# Expected output: {
#     'sentiment': 'negative',
#     'urgency': 'high',
#     'category': 'tech'
# }
```

### Multi-Label with Config

```python
text = "The smartphone features an amazing camera but disappointing battery life and overheats frequently."
results = extractor.classify_text(
    text,
    {
        "product_aspects": {
            "labels": ["camera", "battery", "display", "performance", "design", "heating"],
            "multi_label": True,
            "cls_threshold": 0.4
        }
    }
)
print(results)
# Expected output: {'product_aspects': ['camera', 'battery', 'heating']}

# Another example
text2 = "Beautiful design with vibrant display, though the camera could be better."
results2 = extractor.classify_text(
    text2,
    {
        "product_aspects": {
            "labels": ["camera", "battery", "display", "performance", "design", "heating"],
            "multi_label": True,
            "cls_threshold": 0.4
        }
    }
)
print(results2)
# Expected output: {'product_aspects': ['design', 'display', 'camera']}
```

## Multiple Classification Tasks

You can include multiple classification tasks in a single schema for comprehensive text analysis.

### Basic Multiple Classifications

```python
# Multiple independent classifications
schema = (extractor.create_schema()
    .classification("sentiment", ["positive", "negative", "neutral"])
    .classification("language", ["english", "spanish", "french", "german", "other"])
    .classification("formality", ["formal", "informal", "semi-formal"])
    .classification("intent", ["question", "statement", "request", "complaint"])
)

text = "Could you please help me with my order? The service has been disappointing."
results = extractor.extract(text, schema)
print(results)
# Expected output: {
#     'sentiment': 'negative',
#     'language': 'english',
#     'formality': 'formal',
#     'intent': 'question'
# }

# Another example
text2 = "Hey! Just wanted to say your product rocks! 🎉"
results2 = extractor.extract(text2, schema)
print(results2)
# Expected output: {
#     'sentiment': 'positive',
#     'language': 'english',
#     'formality': 'informal',
#     'intent': 'statement'
# }
```

### Mixed Single and Multi-Label Classifications

```python
# Combine different classification types
schema = (extractor.create_schema()
    # Single-label classifications
    .classification("primary_topic", ["tech", "business", "health", "sports", "politics"])
    .classification("urgency", ["immediate", "soon", "later", "not_urgent"])

    # Multi-label classifications
    .classification("emotions",
        ["happy", "sad", "angry", "surprised", "fearful", "disgusted"],
        multi_label=True,
        cls_threshold=0.4
    )
    .classification("content_flags",
        ["inappropriate", "spam", "promotional", "personal_info", "financial_info"],
        multi_label=True,
        cls_threshold=0.3
    )
)

text = "URGENT: I'm thrilled to announce our new product! But concerned about competitor reactions. Please keep confidential."
results = extractor.extract(text, schema)
print(results)
# Expected output: {
#     'primary_topic': 'business',
#     'urgency': 'immediate',
#     'emotions': ['happy', 'fearful'],
#     'content_flags': ['promotional', 'personal_info']
# }

# Another example
text2 = "Just saw the game - absolutely devastated by the loss. Can't believe the referee's terrible decision!"
results2 = extractor.extract(text2, schema)
print(results2)
# Expected output: {
#     'primary_topic': 'sports',
#     'urgency': 'not_urgent',
#     'emotions': ['sad', 'angry'],
#     'content_flags': []
# }
```

### Domain-Specific Multiple Classifications

```python
# Customer support ticket classification
support_schema = (extractor.create_schema()
    .classification("ticket_type",
        ["technical_issue", "billing", "feature_request", "bug_report", "other"])
    .classification("priority",
        ["critical", "high", "medium", "low"],
        cls_threshold=0.7
    )
    .classification("product_area",
        {
            "authentication": "Login, passwords, security",
            "payment": "Payment processing, subscriptions",
            "ui": "User interface, design issues",
            "performance": "Speed, loading, responsiveness",
            "data": "Data loss, corruption, sync issues"
        },
        multi_label=True,
        cls_threshold=0.5
    )
    .classification("customer_sentiment",
        ["very_satisfied", "satisfied", "neutral", "frustrated", "very_frustrated"],
        cls_threshold=0.6
    )
    .classification("requires_action",
        ["immediate_response", "investigation_needed", "waiting_customer", "resolved"],
        multi_label=True
    )
)

ticket_text = """
Subject: Cannot login - Urgent!

I've been trying to login for the past hour but keep getting error messages.
This is critical as I need to process payments for my customers today.
The page just keeps spinning and then times out. I'm extremely frustrated
as this is costing me business. Please fix this immediately!
"""

results = extractor.extract(ticket_text, support_schema)
print(results)
# Expected output: {
#     'ticket_type': 'technical_issue',
#     'priority': 'critical',
#     'product_area': ['authentication', 'payment', 'performance'],
#     'customer_sentiment': 'very_frustrated',
#     'requires_action': ['immediate_response', 'investigation_needed']
# }

# Another support ticket example
ticket_text2 = """
Hi team,

Thanks for the great product! I was wondering if you could add a dark mode feature?
It would really help with eye strain during late night work sessions.

Best regards,
Happy Customer
"""

results2 = extractor.extract(ticket_text2, support_schema)
print(results2)
# Expected output: {
#     'ticket_type': 'feature_request',
#     'priority': 'low',
#     'product_area': ['ui'],
#     'customer_sentiment': 'satisfied',
#     'requires_action': ['waiting_customer']
# }
```

### Sequential Classification with Dependencies

```python
# Email routing and handling classification
email_schema = (extractor.create_schema()
    # Primary classification
    .classification("email_category",
        ["sales", "support", "hr", "legal", "general"],
        cls_threshold=0.6
    )

    # Secondary classifications based on context
    .classification("sales_stage",
        ["lead", "qualified", "proposal", "negotiation", "closed"],
        cls_threshold=0.5
    )
    .classification("support_type",
        ["pre_sales", "technical", "account", "billing"],
        cls_threshold=0.5
    )

    # Action classifications
    .classification("required_action",
        ["reply_needed", "forward_to_team", "schedule_meeting", "no_action"],
        multi_label=True,
        cls_threshold=0.4
    )
    .classification("response_timeframe",
        ["within_1_hour", "within_24_hours", "within_week", "non_urgent"],
        cls_threshold=0.6
    )
)

email = """
Hi Sales Team,

I'm interested in your enterprise solution. We're currently evaluating vendors
for our upcoming project. Could we schedule a demo next week? We need to make
a decision by month end.

Best regards,
John from TechCorp
"""

results = extractor.extract(email, email_schema)
print(results)
# Expected output: {
#     'email_category': 'sales',
#     'sales_stage': 'qualified',
#     'support_type': 'pre_sales',
#     'required_action': ['reply_needed', 'schedule_meeting'],
#     'response_timeframe': 'within_24_hours'
# }

# HR email example
email2 = """
Dear HR Department,

I need to update my tax withholding information. Could someone please send me
the necessary forms? This is somewhat urgent as I need this changed before the
next payroll cycle.

Thank you,
Sarah
"""

results2 = extractor.extract(email2, email_schema)
print(results2)
# Expected output: {
#     'email_category': 'hr',
#     'sales_stage': 'lead',  # May have noise in non-sales emails
#     'support_type': 'account',
#     'required_action': ['reply_needed'],
#     'response_timeframe': 'within_24_hours'
# }
```

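Every task in the schema above is predicted independently, which is why `sales_stage` is filled in even for the HR email. One way to enforce the dependency is to drop secondary fields after extraction. This is a minimal sketch, assuming results are plain label strings (no `include_confidence`). The `DEPENDENT_FIELDS` mapping and the `prune_results` helper are hypothetical names, and the sample dict mirrors the expected output above.

```python
# Which secondary fields are meaningful for each primary category (assumed mapping)
DEPENDENT_FIELDS = {
    "sales": {"sales_stage"},
    "support": {"support_type"},
}
ALL_DEPENDENT = set().union(*DEPENDENT_FIELDS.values())


def prune_results(results, primary_key="email_category"):
    """Drop secondary fields that do not apply to the predicted primary category."""
    allowed = DEPENDENT_FIELDS.get(results.get(primary_key), set())
    return {
        key: value
        for key, value in results.items()
        if key not in ALL_DEPENDENT or key in allowed
    }


hr_results = {
    "email_category": "hr",
    "sales_stage": "lead",
    "support_type": "account",
    "required_action": ["reply_needed"],
    "response_timeframe": "within_24_hours",
}
print(prune_results(hr_results))
# Prints: {'email_category': 'hr', 'required_action': ['reply_needed'], 'response_timeframe': 'within_24_hours'}
```

An alternative is to run two passes: classify `email_category` first, then extract with a second schema chosen for that category. That costs an extra model call but avoids predicting fields that will be discarded.
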
### Complex Analysis with Multiple Classifications

```python
# Content moderation and analysis
content_schema = (extractor.create_schema()
    # Content classifications
    .classification("content_type",
        ["article", "comment", "review", "social_post", "message"])
    .classification("primary_language",
        ["english", "spanish", "french", "other"])

    # Quality assessments
    .classification("quality_score",
        ["excellent", "good", "average", "poor", "spam"],
        cls_threshold=0.7
    )
    .classification("originality",
        ["original", "derivative", "duplicate", "plagiarized"],
        cls_threshold=0.8
    )

    # Safety and compliance
    .classification("safety_flags",
        {
            "hate_speech": "Contains discriminatory or hateful content",
            "violence": "Contains violent or threatening content",
            "adult": "Contains adult or explicit content",
            "misinformation": "Contains potentially false information",
            "personal_info": "Contains personal identifying information"
        },
        multi_label=True,
        cls_threshold=0.3
    )

    # Engagement predictions
    .classification("engagement_potential",
        ["viral", "high", "medium", "low"],
        cls_threshold=0.6
    )
    .classification("audience_fit",
        ["general", "professional", "academic", "youth", "senior"],
        multi_label=True,
        cls_threshold=0.5
    )
)

content_text = """
Just discovered this amazing productivity hack that doubled my output!
Here's what I do: I wake up at 5 AM, meditate for 20 minutes, then work
in 90-minute focused blocks. The results have been incredible. My email
is john.doe@example.com if you want more tips!
"""

results = extractor.extract(content_text, content_schema)
print(results)
# Expected output: {
#     'content_type': 'social_post',
#     'primary_language': 'english',
#     'quality_score': 'good',
#     'originality': 'original',
#     'safety_flags': ['personal_info'],
#     'engagement_potential': 'high',
#     'audience_fit': ['general', 'professional']
# }

# Review example
review_text = """
Worst product ever!!! Total scam! Don't buy this garbage. The company should
be shut down for selling this junk. I'm going to report them to authorities.
"""

results2 = extractor.extract(review_text, content_schema)
print(results2)
# Expected output: {
#     'content_type': 'review',
#     'primary_language': 'english',
#     'quality_score': 'poor',
#     'originality': 'original',
#     'safety_flags': ['violence'],  # Due to aggressive language
#     'engagement_potential': 'low',
#     'audience_fit': ['general']
# }
```

## Advanced Configurations

### Custom Thresholds

```python
# High-precision classification
schema = extractor.create_schema().classification(
    "is_spam",
    ["spam", "not_spam"],
    cls_threshold=0.9  # Very high confidence required
)

text = "Congratulations! You've won $1,000,000! Click here to claim your prize now!"
results = extractor.extract(text, schema)
print(results)
# Expected output: {'is_spam': 'spam'}

# Different thresholds for different tasks
schema = (extractor.create_schema()
    .classification("priority", ["urgent", "high", "normal", "low"], cls_threshold=0.8)
    .classification("department", ["sales", "support", "billing", "other"], cls_threshold=0.5)
)

text = "URGENT: Customer threatening to cancel $50k contract due to billing error"
results = extractor.extract(text, schema)
print(results)
# Expected output: {
#     'priority': 'urgent',
#     'department': 'billing'
# }
```

### Custom Activation Functions

```python
# Force specific activation
schema = extractor.create_schema().classification(
    "category",
    ["A", "B", "C", "D"],
    class_act="softmax"  # Options: "sigmoid", "softmax", "auto"
)

text = "This clearly belongs to category B based on the criteria."
results = extractor.extract(text, schema)
print(results)
# Expected output: {'category': 'B'}
```

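The difference between the two activations is easiest to see on raw scores. Softmax makes the label scores compete and sum to 1, which suits single-label tasks. Sigmoid scores each label independently, which suits multi-label tasks. The sketch below uses plain Python and made-up logits. It illustrates the math only and is not GLiNER2's internal implementation.

```python
import math


def softmax(logits):
    """Normalize scores so they compete and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each label independently in the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 1.5, -1.0]  # made-up scores for three labels

soft = softmax(logits)
sig = sigmoid(logits)

print([round(p, 2) for p in soft])  # [0.6, 0.37, 0.03]
print([round(p, 2) for p in sig])   # [0.88, 0.82, 0.27]
```

With a threshold of 0.5, sigmoid keeps the first two labels while softmax leaves a single winner. This is why the same `cls_threshold` value can behave very differently depending on the activation.
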
### Complex Multi-Label Example

```python
# Email classification system
schema = extractor.create_schema().classification(
    "email_tags",
    {
        "action_required": "Email requires recipient to take action",
        "meeting_request": "Email contains meeting invitation or scheduling",
        "project_update": "Email contains project status or updates",
        "urgent": "Email marked as urgent or time-sensitive",
        "question": "Email contains questions requiring answers",
        "fyi": "Informational email requiring no action"
    },
    multi_label=True,
    cls_threshold=0.35
)

email_text = """
Hi team,

Quick update on Project Alpha: We're ahead of schedule!

However, I need your input on the design mockups by EOD tomorrow.
Can we schedule a 30-min call this week to discuss?

This is quite urgent as the client is waiting.

Best,
Sarah
"""

results = extractor.extract(email_text, schema)
print(results)
# Expected output: {
#     'email_tags': ['action_required', 'meeting_request', 'project_update', 'urgent', 'question']
# }

# FYI email example
email_text2 = """
Team,

Just wanted to let everyone know that I'll be out of office next Monday for a
doctor's appointment. I'll be back Tuesday morning.

Thanks,
Mark
"""

results2 = extractor.extract(email_text2, schema)
print(results2)
# Expected output: {
#     'email_tags': ['fyi']
# }
```

## Best Practices

1. **Use Descriptions**: Provide label descriptions whenever possible
   ```python
   # Good - with descriptions
   schema = extractor.create_schema().classification(
       "intent",
       {
           "purchase": "User wants to buy a product",
           "return": "User wants to return a product",
           "inquiry": "User asking for information"
       }
   )

   # Less effective - no context
   schema = extractor.create_schema().classification(
       "intent",
       ["purchase", "return", "inquiry"]
   )
   ```

2. **Adjust Thresholds**: Start with lower thresholds for multi-label tasks (0.3-0.5) and higher ones for single-label tasks (0.5-0.7), then tune them on your own data

3. **Multi-Label Strategy**: Use multi-label when categories aren't mutually exclusive
   ```python
   # Good use of multi-label
   schema = extractor.create_schema().classification(
       "product_features",
       ["waterproof", "wireless", "rechargeable", "portable"],
       multi_label=True
   )

   # Should be single-label
   schema = extractor.create_schema().classification(
       "size",
       ["small", "medium", "large"],
       multi_label=False  # Sizes are mutually exclusive
   )
   ```

4. **Test with Real Examples**: Always test with actual text samples from your domain

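Threshold tuning and testing on real examples go together: a small labeled sample is enough to see how a threshold trades precision against recall. The sketch below is plain Python with made-up per-label scores. The `precision_recall` helper and the sample data are hypothetical, and how you collect per-label scores from your own pipeline is left as an assumption.

```python
def precision_recall(samples, threshold):
    """Micro-averaged precision and recall for one threshold."""
    tp = fp = fn = 0
    for scores, gold in samples:
        predicted = {label for label, score in scores.items() if score >= threshold}
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical (per-label scores, gold labels) pairs from labeled domain text
samples = [
    ({"camera": 0.9, "battery": 0.45, "display": 0.2}, {"camera", "battery"}),
    ({"camera": 0.35, "battery": 0.1, "display": 0.8}, {"display"}),
]

for threshold in (0.3, 0.5):
    p, r = precision_recall(samples, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.75, recall=1.00
# threshold=0.5: precision=1.00, recall=0.67
```

On this toy data the lower threshold catches every gold label at the cost of one false positive, while the higher one is exact but misses a label. Which side to favor depends on the cost of each kind of error in your application.
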
## Common Use Cases

- **Sentiment Analysis**: Customer feedback, reviews, social media
- **Intent Classification**: Chatbots, customer service routing
- **Document Classification**: Email filtering, document management
- **Content Moderation**: Toxic content, spam detection
- **Topic Classification**: News categorization, content tagging