# Iterator Documentation

The Iterator module transforms data structures using asynchronous operations and is particularly suited to applying LLM-based transformations to JSON data. This document covers the core functionality, usage patterns, and examples.

## Overview

The Iterator module allows you to:

1. Define mappings between JSON paths and transformations
2. Apply transformations in place or to new fields
3. Customize filtering, error handling, and concurrency
4. Chain multiple transformations together

## Example Implementations

For complete working examples, see:

- [`async-iterator-example.ts`](../src/examples/core/async-iterator-example.ts): Shows basic data transformation with an LLM and `targetPath` usage
- [`iterator-factory-example.ts`](../src/examples/core/iterator-factory-example.ts): Demonstrates the factory pattern with multiple field transformations

These examples transform a sample product dataset using various JSONPath expressions and LLM-powered transformations.

## Core Components

### AsyncTransformer

The fundamental unit that applies transformations:

```typescript
type AsyncTransformer = (input: string, path: string) => Promise<string>
```

Every transformer takes a string input and its path in the JSON structure, then returns a transformed string.

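As a minimal sketch of that contract, the transformer below involves no LLM at all. The type is redeclared locally so the snippet runs on its own, and `capitalize` is a hypothetical name used only for illustration.

```typescript
// Local copy of the type shown above, so the sketch runs without the library.
type AsyncTransformer = (input: string, path: string) => Promise<string>

// Hypothetical transformer: trims the value and upper-cases its first letter.
// The path is accepted to satisfy the contract but is not used here.
const capitalize: AsyncTransformer = async (input, _path) => {
  const trimmed = input.trim()
  if (trimmed.length === 0) return input // leave blank values untouched
  return trimmed.charAt(0).toUpperCase() + trimmed.slice(1)
}
```

Any function with this shape can stand in for an LLM call, which is convenient for tests and dry runs.
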
### Field Mappings

Field mappings define which parts of the data to transform and how:

```typescript
interface FieldMapping {
  jsonPath: string      // JSONPath expression to find values
  targetPath?: string   // Optional target field for transformed values
  options?: IKBotTask   // Options for the transformation
}
```

## Basic Usage

### Creating an Iterator

```typescript
import { createIterator } from '@polymech/kbot'

// Create an iterator instance
const iterator = createIterator(
  data,               // The data to transform
  globalOptionsMixin, // Global options for all transformations
  {
    throttleDelay: 1000,
    concurrentTasks: 1,
    errorCallback: (path, value, error) => console.error(`Error at ${path}: ${error.message}`),
    filterCallback: async () => true,
    transformerFactory: createCustomTransformer
  }
)

// Define field mappings
const mappings = [
  {
    jsonPath: '$.products.*.name',
    targetPath: null, // Transform in place
    options: {
      prompt: 'Make this product name more appealing'
    }
  }
]

// Apply transformations
await iterator.transform(mappings)
```

### JSONPath Patterns

The Iterator uses JSONPath to identify fields for transformation:

- `$..name` - All name fields at any level
- `$.products..name` - All name fields under the products key
- `$.products.*.*.name` - Names of items in product categories
- `$[*].description` - All description fields at the first level

## Advanced Usage

### Custom Transformers

Create custom transformers for specific transformation logic:

```typescript
const createCustomTransformer = (options: IKBotTask): AsyncTransformer => {
  return async (input: string, jsonPath: string): Promise<string> => {
    // Transform the input string based on options.
    // Placeholder: returns the input unchanged; replace with real logic.
    const transformedValue = input
    return transformedValue
  }
}
```

### In-Place vs. Target Field Transformations

#### In-Place Transformation

To transform values in place, set `targetPath` to `null`:

```typescript
{
  jsonPath: '$.products.*.*.description',
  targetPath: null,
  options: {
    prompt: 'Make this description more engaging'
  }
}
```

This will replace the original description with the transformed value.

#### Adding New Fields

To keep the original value and add a transformed version, specify a `targetPath`:

```typescript
{
  jsonPath: '$.products.*.*.name',
  targetPath: 'marketingName',
  options: {
    prompt: 'Generate a marketing name based on this product'
  }
}
```

This keeps the original `name` and adds a new `marketingName` field.

### Structured Output with Format Option

The `format` option lets you define a JSON schema that the LLM output should conform to. This helps produce consistent, structured responses that are easy to parse and use in your application.

#### Basic Format Usage

To request structured output, add a `format` property to your field mapping options:

```typescript
{
  jsonPath: '$.productReview.reviewText',
  targetPath: 'analysis',
  options: {
    prompt: 'Analyze this product review and extract key information',
    format: {
      type: "object",
      properties: {
        sentiment: {
          type: "string",
          enum: ["positive", "neutral", "negative"],
          description: "The overall sentiment of the review"
        },
        pros: {
          type: "array",
          items: {
            type: "string"
          },
          description: "Positive aspects mentioned in the review",
          minItems: 1,
          maxItems: 3
        },
        cons: {
          type: "array",
          items: {
            type: "string"
          },
          description: "Negative aspects mentioned in the review",
          minItems: 0,
          maxItems: 3
        }
      },
      required: ["sentiment", "pros", "cons"]
    }
  }
}
```

#### Processing Structured Responses

The formatted response may be returned as a JSON string rather than a parsed object, so check the type and parse it inside a `try`/`catch`:

```typescript
// After transformation
if (data.productReview && data.productReview.analysis) {
  try {
    // Parse the JSON string if needed
    const analysisJson = typeof data.productReview.analysis === 'string'
      ? JSON.parse(data.productReview.analysis)
      : data.productReview.analysis;

    // Now you can work with the structured data
    console.log(`Sentiment: ${analysisJson.sentiment}`);
    console.log(`Pros: ${analysisJson.pros.join(', ')}`);
    console.log(`Cons: ${analysisJson.cons.join(', ')}`);
  } catch (e) {
    console.error("Error parsing structured output:", e);
  }
}
```

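Parsing alone does not guarantee that the result matches the schema. The sketch below adds a small runtime check for the review schema used in this section. `ReviewAnalysis` and `parseReviewAnalysis` are hypothetical names introduced for illustration and are not part of the library.

```typescript
// Shape described by the `format` schema in this section.
const SENTIMENTS = ['positive', 'neutral', 'negative'] as const
type Sentiment = (typeof SENTIMENTS)[number]

interface ReviewAnalysis {
  sentiment: Sentiment
  pros: string[]
  cons: string[]
}

const isSentiment = (v: unknown): v is Sentiment =>
  SENTIMENTS.some((s) => s === v)

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((item) => typeof item === 'string')

// Hypothetical helper: accepts a JSON string or an already-parsed object and
// returns null (instead of throwing) when the value does not match the shape.
function parseReviewAnalysis(raw: unknown): ReviewAnalysis | null {
  let value: unknown = raw
  if (typeof raw === 'string') {
    try {
      value = JSON.parse(raw)
    } catch {
      return null // not valid JSON
    }
  }
  if (typeof value !== 'object' || value === null) return null
  const { sentiment, pros, cons } = value as Record<string, unknown>
  if (!isSentiment(sentiment) || !isStringArray(pros) || !isStringArray(cons)) return null
  return { sentiment, pros, cons }
}
```

Returning `null` keeps the calling code simple: one check covers both malformed JSON and a response that ignores the schema.
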
#### Best Practices for Formatted Output

1. **Clear Prompt Instructions**: Include explicit instructions in your prompt about the expected format.
2. **Schema Validation**: Use detailed JSON schemas with required fields and appropriate types.
3. **Parsing Handling**: Always include error handling when parsing the output.
4. **Schema Examples**: Consider including examples in your prompt for more complex schemas.

#### Format Option Example

Here's a complete example from the `iterator-factory-example.ts` file:

```typescript
// Define a field mapping with format option
const fieldMappings = [
  {
    jsonPath: '$.productReview.reviewText',
    targetPath: 'analysis',
    options: {
      // Clear and explicit prompt that includes the schema format details
      prompt: `Analyze this product review and extract key information using EXACTLY the schema specified below.

The review: "Great selection of fruits with good prices and quality. Some items were out of stock."

Your response MUST be a valid JSON object following this exact schema:
{
  "sentiment": "positive" | "neutral" | "negative",
  "pros": ["string", "string"...], // 1-3 items
  "cons": ["string"...] // 0-3 items
}

Do not add any extra fields not in the schema, and make sure to use the exact field names as specified.`,
      // Schema validation ensures structured output format
      format: {
        type: "object",
        properties: {
          sentiment: {
            type: "string",
            enum: ["positive", "neutral", "negative"],
            description: "The overall sentiment of the review"
          },
          pros: {
            type: "array",
            items: {
              type: "string"
            },
            description: "Positive aspects mentioned in the review",
            minItems: 1,
            maxItems: 3
          },
          cons: {
            type: "array",
            items: {
              type: "string"
            },
            description: "Negative aspects mentioned in the review",
            minItems: 0,
            maxItems: 3
          }
        },
        required: ["sentiment", "pros", "cons"]
      }
    }
  }
]
```

When run, this produces a structured output like:

```json
{
  "sentiment": "positive",
  "pros": [
    "great selection of fruits",
    "good prices",
    "good quality"
  ],
  "cons": [
    "some items were out of stock"
  ]
}
```

Structured output like this is easier to consume programmatically than free-form text.

## Filtering

Filters determine which values should be transformed:

```typescript
// Default filters that skip numbers, booleans, and empty strings
const defaultFilters = [isNumber, isBoolean, isValidString]

// Custom filter example
const skipFirstItem: FilterCallback = async (input, path) => {
  return !path.includes('[0]')
}
```

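Several filters can be combined into a single callback. The sketch below assumes, following `skipFirstItem` above, that a filter resolves to `true` when the value should be transformed. `FilterCallback` is redeclared locally with an assumed shape, and `allOf` and `nonEmpty` are hypothetical helpers rather than library functions.

```typescript
// Local stand-in for the library's FilterCallback type (assumed shape).
type FilterCallback = (input: string, path: string) => Promise<boolean>

// Hypothetical combinator: a value is transformed only if every filter accepts it.
// Filters run in order and evaluation stops at the first rejection.
const allOf = (...filters: FilterCallback[]): FilterCallback =>
  async (input, path) => {
    for (const filter of filters) {
      if (!(await filter(input, path))) return false
    }
    return true
  }

const skipFirstItem: FilterCallback = async (_input, path) => !path.includes('[0]')
const nonEmpty: FilterCallback = async (input) => input.trim().length > 0

// Could be passed as `filterCallback` when creating an iterator.
const shouldTransform = allOf(skipFirstItem, nonEmpty)
```
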
## Throttling and Concurrency

Control API rate limits and parallel processing:

```typescript
{
  throttleDelay: 1000, // Milliseconds between requests
  concurrentTasks: 2   // Number of parallel transformations
}
```

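To illustrate how the two settings interact, here is a dependency-free sketch. It is not the library's implementation; `mapWithLimits` is a hypothetical helper that runs at most `concurrentTasks` lanes in parallel, with each lane pausing `throttleDelay` milliseconds between its own requests.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

interface Limits {
  throttleDelay: number   // pause between consecutive requests in one lane (ms)
  concurrentTasks: number // number of lanes running in parallel
}

// Hypothetical helper: processes items in order of dispatch, returns results
// in the original item order.
async function mapWithLimits<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  { throttleDelay, concurrentTasks }: Limits
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0 // index of the next item to hand out

  const runLane = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index])
      if (next < items.length) await sleep(throttleDelay)
    }
  }

  const lanes = Math.max(1, Math.min(concurrentTasks, items.length))
  await Promise.all(Array.from({ length: lanes }, () => runLane()))
  return results
}
```

With `concurrentTasks: 1` this degenerates to strictly sequential processing, which is the safest starting point against a rate-limited API.
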
## Retry Mechanism

The Iterator includes a built-in retry mechanism for handling transient errors during transformations, particularly useful for API calls to external services like LLMs.

### Configuration

Retry settings can be configured at both the global and individual field mapping levels:

```typescript
// Global retry configuration
const iterator = createIterator(
  data,
  globalOptionsMixin,
  {
    maxRetries: 3,    // Maximum number of retry attempts
    retryDelay: 2000, // Base delay in milliseconds between retries
    // ... other options
  }
)

// Field-specific retry configuration
const mappings = [
  {
    jsonPath: '$.products.*.description',
    targetPath: null,
    options: { /* ... */ },
    maxRetries: 5,   // Override global setting for this field
    retryDelay: 1000 // Override global setting for this field
  }
]
```

### Behavior

When a transformation fails:

1. The Iterator waits `retryDelay` milliseconds before retrying
2. The delay increases exponentially with each retry attempt (backoff strategy)
3. After `maxRetries` failed attempts, the error is passed to the `errorCallback`

This helps handle temporary issues like API rate limits, network connectivity problems, or service outages without failing the entire operation.

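The steps above can be sketched as follows. The exact backoff formula is not specified in this document, so the sketch assumes the delay doubles on each retry (`retryDelay`, then 2x, then 4x) and that `maxRetries` counts retries after the initial attempt. `withRetry` is a hypothetical helper, not the library's function.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Hypothetical helper: runs `task`, retrying up to `maxRetries` times.
// Assumption: the wait before retry n (0-based) is retryDelay * 2^n.
// `wait` is injectable so the backoff schedule can be observed in tests.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries: number,
  retryDelay: number,
  wait: (ms: number) => Promise<void> = sleep
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task()
    } catch (error) {
      lastError = error
      if (attempt === maxRetries) break // retries exhausted
      await wait(retryDelay * 2 ** attempt)
    }
  }
  // The caller would hand this error to its errorCallback.
  throw lastError
}
```
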
## Caching Mechanism

The Iterator includes a caching system that reduces API costs and speeds up repeated operations.

### Cache Configuration

Configure caching behavior when creating an iterator:

```typescript
import { createIterator, DefaultCache, NoopCache } from '@polymech/kbot'

// Configure caching
const iterator = createIterator(
  data,
  globalOptionsMixin,
  {
    cache: {
      enabled: true,                      // Enable or disable caching
      implementation: new DefaultCache(), // Use default cache or provide custom
      namespace: 'my-custom-namespace'    // Custom namespace for cache entries
    },
    // ... other options
  }
)
```

### Cache Implementations

The Iterator provides multiple cache implementations:

1. **DefaultCache**: Uses the registered cache module to store and retrieve values
2. **NoopCache**: A no-operation cache that doesn't actually cache anything
3. **Custom Implementation**: Implement the `CacheInterface` to connect any caching system

```typescript
// Custom cache implementation example
class RedisCache implements CacheInterface {
  constructor(private redisClient: any) {}

  async get(key: any): Promise<any> {
    const result = await this.redisClient.get(JSON.stringify(key));
    return result ? JSON.parse(result) : null;
  }

  async set(key: any, value: any): Promise<void> {
    await this.redisClient.set(JSON.stringify(key), JSON.stringify(value));
  }

  async delete(key: any): Promise<void> {
    await this.redisClient.del(JSON.stringify(key));
  }
}

// Use the custom cache
const iterator = createIterator(data, options, {
  cache: {
    enabled: true,
    implementation: new RedisCache(redisClient),
    namespace: 'product-transformations'
  }
})
```

### Cache Behavior

1. **Caching Enabled**: Before transformation, the Iterator checks if a result already exists in the cache. If found, it applies the cached result directly without calling the transformer. If not found, it performs the transformation and stores the result.

2. **Caching Disabled**: When caching is explicitly disabled, the system will remove any existing cache entries for the current transformations. This ensures that stale data isn't accidentally used later if caching is re-enabled.

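The two modes described above can be sketched with an in-memory cache. `SimpleCache`, `MemoryCache`, and `transformWithCache` are hypothetical names. The interface only mirrors the `get`/`set`/`delete` methods of the `RedisCache` example and is an assumption about the real `CacheInterface`.

```typescript
// Assumed minimal cache shape (string keys and values for simplicity).
interface SimpleCache {
  get(key: string): Promise<string | null>
  set(key: string, value: string): Promise<void>
  delete(key: string): Promise<void>
}

class MemoryCache implements SimpleCache {
  private store = new Map<string, string>()
  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null
  }
  async set(key: string, value: string): Promise<void> {
    this.store.set(key, value)
  }
  async delete(key: string): Promise<void> {
    this.store.delete(key)
  }
}

// Hypothetical helper showing both modes:
// enabled  -> return a cached result, or transform and store it
// disabled -> remove any existing entry, then always transform
async function transformWithCache(
  cache: SimpleCache,
  enabled: boolean,
  key: string,
  input: string,
  transform: (input: string) => Promise<string>
): Promise<string> {
  if (!enabled) {
    await cache.delete(key) // avoid serving stale data if caching is re-enabled
    return transform(input)
  }
  const cached = await cache.get(key)
  if (cached !== null) return cached
  const result = await transform(input)
  await cache.set(key, result)
  return result
}
```
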
### Registering a Cache Module

To use the DefaultCache implementation, you must register a cache module:

```typescript
import { registerCacheModule } from '@polymech/kbot'

// Your cache module must implement these methods
const myCacheModule = {
  get_cached_object: async (key, namespace) => { /* ... */ },
  set_cached_object: async (key, namespace, value, options) => { /* ... */ },
  rm_cached_object: async (key, namespace) => { /* ... */ }
}

// Register the cache module
registerCacheModule(myCacheModule)
```

### Cache Keys

Cache keys are automatically generated based on:

- JSONPath expression
- Target path (if any)
- Transformation options
- Retry settings

This ensures that different transformations produce different cache entries, while identical transformations reuse cached results.

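How the library actually derives its keys is not shown here. As one way to get that property, the sketch below serializes the listed inputs with sorted object keys, so that logically identical mappings produce the same string. `stableStringify`, `KeyParts`, and `cacheKeyFor` are hypothetical names.

```typescript
// Serialize with object keys sorted, so property order does not affect the key.
function stableStringify(value: unknown): string {
  if (value === undefined) return 'null'
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
    return `{${entries.join(',')}}`
  }
  return JSON.stringify(value)
}

// The inputs listed above, as an assumed shape.
interface KeyParts {
  jsonPath: string
  targetPath?: string | null
  options?: Record<string, unknown>
  maxRetries?: number
  retryDelay?: number
}

const cacheKeyFor = (parts: KeyParts): string => stableStringify(parts)
```

In a real cache the resulting string would typically be hashed to keep keys short.
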
## Combined Example with Caching and Retry

Here's a complete example showcasing both caching and retry mechanisms:

```typescript
// AsyncTransformer is assumed to be exported alongside FieldMapping
import { createIterator, FieldMapping, AsyncTransformer, DefaultCache, registerCacheModule } from '@polymech/kbot'

// Setup a simple mock cache module
const mockCacheModule = {
  storage: new Map(),
  get_cached_object: async (key, namespace) => {
    const cacheKey = `${namespace}:${JSON.stringify(key)}`;
    console.log(`Looking up cache: ${cacheKey}`);
    return mockCacheModule.storage.get(cacheKey);
  },
  set_cached_object: async (key, namespace, value) => {
    const cacheKey = `${namespace}:${JSON.stringify(key)}`;
    console.log(`Storing in cache: ${cacheKey}`);
    mockCacheModule.storage.set(cacheKey, value);
  },
  rm_cached_object: async (key, namespace) => {
    const cacheKey = `${namespace}:${JSON.stringify(key)}`;
    console.log(`Removing from cache: ${cacheKey}`);
    const exists = mockCacheModule.storage.has(cacheKey);
    mockCacheModule.storage.delete(cacheKey);
    return exists;
  }
};

// Register the cache module
registerCacheModule(mockCacheModule);

async function transformProductsWithCaching() {
  // Product data
  const data = {
    products: {
      fruits: [
        {
          id: 'f1',
          name: 'apple',
          description: 'A sweet fruit',
        },
        {
          id: 'f2',
          name: 'banana',
          description: 'A yellow fruit',
        }
      ]
    }
  };

  let requestCount = 0;

  // Create a transformer factory that simulates occasional failures
  const createLLMTransformer = (options): AsyncTransformer => {
    return async (input, path) => {
      requestCount++;

      // Simulate occasional failures to demonstrate retry
      if (requestCount % 3 === 0) {
        throw new Error('API rate limit exceeded');
      }

      console.log(`Transforming ${path}: ${input}`);
      return `Enhanced: ${input}`;
    }
  };

  // Create iterator with caching and retry
  const iterator = createIterator(
    data,
    { model: 'openai/gpt-4' },
    {
      throttleDelay: 1000,
      concurrentTasks: 1,
      transformerFactory: createLLMTransformer,
      maxRetries: 3,
      retryDelay: 1000,
      cache: {
        enabled: true,
        implementation: new DefaultCache(),
        namespace: 'product-transformations'
      },
      errorCallback: (path, value, error) => {
        console.error(`Failed to transform ${path}: ${error.message}`);
      }
    }
  );

  // Define transformations
  const mappings: FieldMapping[] = [
    {
      jsonPath: '$.products.fruits.*.description',
      targetPath: null,
      options: {
        prompt: 'Make this description more detailed'
      },
      maxRetries: 5 // Override global retry setting
    },
    {
      jsonPath: '$.products.fruits.*.name',
      targetPath: 'marketingName',
      options: {
        prompt: 'Generate a marketing name for this product'
      }
    }
  ];

  // First run - will perform transformations and cache results
  console.log("First run - transforming and caching:");
  await iterator.transform(mappings);

  // Second run - will use cached results
  console.log("\nSecond run - using cached results:");
  await iterator.transform(mappings);

  // Output the transformed data
  console.log("\nTransformed data:");
  console.log(JSON.stringify(data, null, 2));
}

// Run the example
transformProductsWithCaching().catch(console.error);
```

## Best Practices

1. **Be specific with JSONPaths**: Use precise JSONPath expressions to target only the fields you want to transform.

2. **Handle errors gracefully**: Provide an error callback to handle failed transformations without breaking the entire process.

3. **Respect rate limits**: Set appropriate throttle delays when working with external APIs.

4. **Test with small datasets first**: Validate your transformations on a smaller subset before processing large datasets.

5. **Prefer targeted transformations**: Transform only what you need to minimize costs and processing time.

6. **Use caching for expensive operations**: Enable caching for transformations that are computationally expensive or costly, such as LLM API calls.

7. **Configure appropriate retry settings**: Set retry limits and delays based on the expected reliability of your transformers. Use more retries with longer delays for less reliable services.

8. **Use targetPath for non-destructive transformations**: When generating new content related to existing fields, use `targetPath` to preserve the original data.

9. **Use format for structured outputs**: When you need consistent, structured data from LLMs, use the `format` option with clear JSON schemas.

10. **Include schema details in prompts**: For complex schemas, include the schema structure in your prompt to guide the LLM.

11. **Handle string parsing**: Always add error handling when parsing structured responses, as they may be returned as JSON strings.

12. **Implement custom cache for production**: For production scenarios, implement a persistent cache solution rather than relying on in-memory caching.

13. **Use appropriate namespaces**: When multiple parts of your application use the same cache implementation, use distinct namespaces to prevent collisions.

## API Reference

### Main Functions

- `createIterator(data, optionsMixin, globalOptions)`: Creates an iterator instance
- `transformObjectWithOptions(obj, transform, options)`: Low-level function to transform objects
- `transformObject(obj, transform, path, ...)`: Transforms matching paths in an object

### Helper Functions

- `testFilters(filters)`: Creates a filter callback from filter functions
- `defaultFilters()`: Returns commonly used filters
- `defaultError`: Default error handler that logs to console

### Types and Interfaces

- `AsyncTransformer`: Function that transforms strings asynchronously
- `FilterCallback`: Function that determines if a value should be transformed
- `ErrorCallback`: Function that handles transformation errors
- `FieldMapping`: Configuration for a transformation
  - `jsonPath`: JSONPath expression to select values
  - `targetPath`: Optional field to store transformed value (null for in-place)
  - `options`: Configuration for the transformation including:
    - `prompt`: The prompt for the LLM
    - `format`: Optional JSON schema for structured output
- `TransformOptions`: Options for the transformation process
- `CacheConfig`: Configuration for the caching mechanism
- `INetworkOptions`: Configuration for throttling and concurrency

## Limitations

1. The Iterator works with string values; objects and arrays are traversed but not directly transformed.
2. Large datasets might require pagination or chunking for efficient processing.
3. External API rate limits might require careful throttling configuration.

## Troubleshooting

Common issues and solutions:

- **No transformations occurring**: Check your JSONPath expressions and filter conditions
- **Unexpected field structure**: Examine the exact structure of your data
- **Rate limiting errors**: Increase the `throttleDelay` between requests
- **Transformation errors**: Implement a custom error callback for detailed logging

## Running the Examples

To run the included examples:

```bash
# Run the basic async iterator example
npm run examples:async-iterator

# Run the iterator factory example
npm run examples:iterator-factory

# Run with debug logging
npm run examples:async-iterator -- --debug

# Run with caching disabled (forces fresh responses)
npm run examples:iterator-factory -- --no-cache
```

The examples will transform sample JSON data and save the results to the `tests/test-data/core/` directory.

## See Also

- [JSON Path Syntax](https://goessner.net/articles/JsonPath/) - Reference for JSONPath expressions
- [p-throttle](https://github.com/sindresorhus/p-throttle) - The throttling library used internally
- [p-map](https://github.com/sindresorhus/p-map) - For concurrent asynchronous mapping