3.4 KiB
3.4 KiB
GLiNER2 Regex Validators
Regex validators filter extracted spans to ensure they match expected patterns, improving extraction quality and reducing false positives.
Quick Start
from gliner2 import GLiNER2, RegexValidator
extractor = GLiNER2.from_pretrained("your-model")
# Create validator and apply to field
email_validator = RegexValidator(r"^[\w\.-]+@[\w\.-]+\.\w+$")
schema = (extractor.create_schema()
.structure("contact")
.field("email", dtype="str", validators=[email_validator])
)
RegexValidator Parameters
- pattern: Regex pattern (string or compiled Pattern)
- mode:
"full"(exact match) or"partial"(substring match) - exclude:
False(keep matches) orTrue(exclude matches) - flags: Regex flags like
re.IGNORECASE(for string patterns only)
Examples
Email Validation
email_validator = RegexValidator(r"^[\w\.-]+@[\w\.-]+\.\w+$")
text = "Contact: john@company.com, not-an-email, jane@domain.org"
# Output: ['john@company.com', 'jane@domain.org']
Phone Numbers (US Format)
phone_validator = RegexValidator(r"\(\d{3}\)\s\d{3}-\d{4}", mode="partial")
text = "Call (555) 123-4567 or 5551234567"
# Output: ['(555) 123-4567'] # Second number filtered out
URLs Only
url_validator = RegexValidator(r"^https?://", mode="partial")
text = "Visit https://example.com or www.site.com"
# Output: ['https://example.com'] # www.site.com filtered out
Exclude Test Data
no_test_validator = RegexValidator(r"^(test|demo|sample)", exclude=True, flags=re.IGNORECASE)
text = "Products: iPhone, Test Phone, Samsung Galaxy"
# Output: ['iPhone', 'Samsung Galaxy'] # Test Phone excluded
Length Constraints
length_validator = RegexValidator(r"^.{5,50}$") # 5-50 characters
text = "Names: Jo, Alexander, A Very Long Name That Exceeds Fifty Characters"
# Output: ['Alexander'] # Others filtered by length
Multiple Validators
# All validators must pass
username_validators = [
RegexValidator(r"^[a-zA-Z0-9_]+$"), # Alphanumeric + underscore
RegexValidator(r"^.{3,20}$"), # 3-20 characters
RegexValidator(r"^(?!admin)", exclude=True, flags=re.IGNORECASE) # No "admin"
]
schema = (extractor.create_schema()
.structure("user")
.field("username", dtype="str", validators=username_validators)
)
text = "Users: ab, john_doe, user@domain, admin, valid_user123"
# Output: ['john_doe', 'valid_user123']
Common Patterns
| Use Case | Pattern | Mode |
|---|---|---|
r"^[\w\.-]+@[\w\.-]+\.\w+$" |
full | |
| Phone (US) | r"\(\d{3}\)\s\d{3}-\d{4}" |
partial |
| URL | r"^https?://" |
partial |
| Numbers only | r"^\d+$" |
full |
| No spaces | r"^\S+$" |
full |
| Min length | r"^.{5,}$" |
full |
| Alphanumeric | r"^[a-zA-Z0-9]+$" |
full |
Best Practices
- Use specific patterns - More specific = fewer false positives
- Test your regex - Validate patterns before deployment
- Combine validators - Chain multiple simple validators
- Consider case sensitivity - Use
re.IGNORECASEwhen needed - Start simple - Begin with basic patterns, refine as needed
Performance Notes
- Validators run after span extraction but before formatting
- Failed validation simply excludes the span (no errors)
- Multiple validators use short-circuit evaluation (stops at first failure)
- Compiled patterns are cached automatically