Brief
Image-to-text models are AI models that convert visual data (images) directly into descriptive, informative text. Unlike Optical Character Recognition (OCR), these models generate captions or descriptions reflecting the content, context, and meaning of the image rather than just extracting written characters. Several commercial and open-source models are available for integration into web-based or other software solutions.
Available Models (Commercial and Open-source)
| Model Name | Provider/Platform | Price | Key Features | Language/Library | Links |
|---|---|---|---|---|---|
| BLIP-2 | Hugging Face (Open-source, non-commercial) | Free | Image-captioning, Zero-shot inference, Multimodal | Python, PyTorch | BLIP-2 HuggingFace |
| OpenAI GPT-4V | OpenAI (Commercial) | Usage-based pricing (~$0.03/image) | Vision-enabled, advanced multimodal understanding, high accuracy | REST API (JS/TypeScript via fetch/promise) | OpenAI Vision |
| Microsoft Azure Vision | Microsoft Azure (Commercial) | Pay-as-you-go (~$1.50 per 1,000 transactions) | Advanced image description, object detection, image analysis | REST API (JS/TypeScript via REST fetch or Azure SDK) | Azure Vision Docs |
| Google Cloud Vision API | Google (Commercial) | Pay-as-you-go (~$1.50 per 1,000 requests) | Robust image annotation, label detection, powerful object description | REST API (JS/TypeScript via fetch/promise) | Google Vision |
| Flamingo Mini | Hugging Face (Open-source, non-commercial) | Free | Multimodal capabilities, advanced captioning, lightweight version | Python/PyTorch | Flamingo Mini HuggingFace |
| Salesforce BLIP | Salesforce AI / Hugging Face (Open-source) | Free | State-of-the-art captioning, visual question answering | Python/PyTorch | Salesforce BLIP HF Repo |
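The open-source models above can also be reached from a TypeScript project without running a Python stack, via the hosted Hugging Face Inference API. A minimal sketch, assuming the public `api-inference.huggingface.co` endpoint and the `Salesforce/blip-image-captioning-large` model name (both the URL and the response shape may change; verify against the current Hugging Face docs):

```typescript
// Sketch: calling an open-source captioning model (Salesforce BLIP) through
// the Hugging Face Inference API. Endpoint and response shape are assumptions
// based on the public docs at the time of writing.
type HFCaptionResult = { generated_text: string }[];

// Extract the caption text from an Inference API response.
function extractCaption(result: HFCaptionResult): string {
  if (!result.length) throw new Error("Empty response from Inference API.");
  return result[0].generated_text;
}

async function captionWithBlip(imageBlob: Blob, hfToken: string): Promise<string> {
  const response = await fetch(
    "https://api-inference.huggingface.co/models/Salesforce/blip-image-captioning-large",
    {
      method: "POST",
      headers: { "Authorization": `Bearer ${hfToken}` },
      body: imageBlob, // raw image bytes
    },
  );
  if (!response.ok) throw new Error(`Inference API failed: ${response.status}`);
  return extractCaption((await response.json()) as HFCaptionResult);
}
```

This keeps the free, open-source models usable from the same fetch-based frontend code shown for the commercial APIs below, at the cost of depending on Hugging Face's hosted inference service.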
Recommended Approach/Library (Commercial APIs, ES-module-compatible JS/TS)
For quick, direct integration into a TypeScript-based frontend project (ES-modules compatible, without React, with Tailwind CSS for the UI), commercial API services such as the OpenAI Vision API or the Azure Vision API are recommended. Here is a simple example calling OpenAI's GPT-4V (note: in production, keep the API key on a server rather than in browser code, where anyone can read it):
```typescript
// Example: simple OpenAI Vision API call using fetch.
const API_KEY = "YOUR_OPENAI_API_KEY";

type GPTVisionResponse = {
  choices: {
    message: {
      content: string;
    };
  }[];
};

async function generateImageCaption(imageUrl: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4-vision-preview",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Describe the following image in detail." },
            { type: "image_url", image_url: { url: imageUrl } },
          ],
        },
      ],
      max_tokens: 150,
    }),
  });
  if (!response.ok) throw new Error(`OpenAI API request failed: ${response.status}`);
  const data: GPTVisionResponse = await response.json();
  return data.choices[0].message.content;
}

// Usage example (top-level await, valid in ES modules)
const caption = await generateImageCaption("https://example.com/sample_image.jpg");
console.log("Generated Caption:", caption);
```
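If Azure Vision is preferred, the equivalent call is a single POST to the Analyze Image endpoint. A hedged sketch, assuming the v3.2 REST API and its `description.captions` response shape (verify the API version and host against your own Azure resource; the endpoint and key below are placeholders):

```typescript
// Sketch: image description via the Azure AI Vision "Analyze Image" REST API
// (assumed v3.2). Endpoint host, key, and response shape are assumptions
// based on the public docs at the time of writing.
const AZURE_ENDPOINT = "https://YOUR_RESOURCE.cognitiveservices.azure.com";
const AZURE_KEY = "YOUR_AZURE_VISION_KEY";

type AzureDescribeResponse = {
  description: { captions: { text: string; confidence: number }[] };
};

// Pick the highest-confidence caption from the analysis result.
function bestCaption(data: AzureDescribeResponse): string {
  const captions = data.description.captions;
  if (!captions.length) throw new Error("No caption returned.");
  return captions.reduce((a, b) => (b.confidence > a.confidence ? b : a)).text;
}

async function describeImageAzure(imageUrl: string): Promise<string> {
  const response = await fetch(
    `${AZURE_ENDPOINT}/vision/v3.2/analyze?visualFeatures=Description`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": AZURE_KEY,
      },
      body: JSON.stringify({ url: imageUrl }),
    },
  );
  if (!response.ok) throw new Error(`Azure Vision failed: ${response.status}`);
  return bestCaption((await response.json()) as AzureDescribeResponse);
}
```

Unlike the OpenAI call, Azure returns several candidate captions with confidence scores, so the helper selects the strongest one.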
Integration with Tailwind CSS-Based Interface (without React)
Here's a minimal example using vanilla TypeScript with Tailwind CSS to style a simple interface:
HTML markup:
```html
<div class="max-w-lg mx-auto p-4 bg-white rounded-lg shadow-lg">
  <input id="image-url" type="url"
    placeholder="Enter Image URL"
    class="w-full p-2 border border-gray-300 rounded mb-4"
  />
  <button id="generate-caption-btn"
    class="bg-blue-500 text-white px-4 py-2 rounded hover:bg-blue-600"
  >
    Generate Caption
  </button>
  <div id="caption-output" class="mt-4 p-4 bg-gray-100 rounded"></div>
</div>
```
TypeScript/JS code (frontend):
```typescript
document.getElementById("generate-caption-btn")!.addEventListener("click", async () => {
  const imageUrlInput = document.getElementById("image-url") as HTMLInputElement;
  const outputDiv = document.getElementById("caption-output")!;
  const imageUrl = imageUrlInput.value.trim();
  if (!imageUrl) {
    outputDiv.textContent = "Please enter an image URL.";
    return;
  }
  outputDiv.textContent = "Generating caption...";
  try {
    const caption = await generateImageCaption(imageUrl);
    outputDiv.textContent = caption;
  } catch (error) {
    outputDiv.textContent = "Error generating caption.";
    console.error(error);
  }
});
```
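The snippets above put the API key in browser code, which is fine for prototyping but exposes the key to every visitor in production. A common fix is a thin server-side proxy; here is a minimal sketch using Node's built-in `http` module, where the `/api/caption` route and port 3000 are illustrative assumptions:

```typescript
// Sketch of a minimal server-side proxy (Node, no framework) so the OpenAI
// key stays out of the browser. Route name and port are illustrative.
import { createServer, IncomingMessage, ServerResponse } from "node:http";

const OPENAI_KEY = process.env.OPENAI_API_KEY ?? "";

// Build the JSON body for the Vision chat-completions request.
function buildVisionBody(imageUrl: string): string {
  return JSON.stringify({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Describe the following image in detail." },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
    max_tokens: 150,
  });
}

async function handleCaption(req: IncomingMessage, res: ServerResponse): Promise<void> {
  if (req.method !== "POST" || req.url !== "/api/caption") {
    res.writeHead(404).end();
    return;
  }
  let raw = "";
  for await (const chunk of req) raw += chunk; // collect the request body
  const { imageUrl } = JSON.parse(raw) as { imageUrl: string };
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${OPENAI_KEY}`,
    },
    body: buildVisionBody(imageUrl),
  });
  res.writeHead(upstream.status, { "Content-Type": "application/json" });
  res.end(await upstream.text()); // relay OpenAI's response as-is
}

const server = createServer(handleCaption);
// Start with: server.listen(3000);
```

The frontend would then POST `{ imageUrl }` to `/api/caption` instead of calling OpenAI directly, and no key ever ships to the client.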