This repository has been archived on 2025-12-24.
site-template/docs/image-to-text.md
2025-03-08 12:47:58 +01:00


## Brief

Image-to-text models are AI models that convert visual data (images) directly into descriptive, informative text. Unlike Optical Character Recognition (OCR), these models generate captions or descriptions reflecting the content, context, and meaning of the image rather than just extracting written characters. Several commercial and open-source models are available for integration into web-based or other software solutions.

## Available Models (Commercial and Open-source)

| Model Name | Provider/Platform | Price | Key Features | Language/Library | Links |
| --- | --- | --- | --- | --- | --- |
| BLIP-2 | Hugging Face (open-source, non-commercial) | Free | Image captioning, zero-shot inference, multimodal | Python, PyTorch | BLIP-2 HuggingFace |
| OpenAI GPT-4V | OpenAI (commercial) | Usage-based pricing (~$0.03/image) | Vision-enabled, advanced multimodal understanding, high accuracy | REST API (JS/TypeScript via fetch/Promise) | OpenAI Vision |
| Microsoft Azure Vision | Microsoft Azure (commercial) | Pay-as-you-go (~$1.50 per 1,000 transactions) | Advanced image description, object detection, image analysis | REST API (JS/TypeScript via REST fetch or Azure SDK) | Azure Vision Docs |
| Google Cloud Vision API | Google (commercial) | Pay-as-you-go (~$1.50 per 1,000 requests) | Robust image annotation, label detection, powerful object description | REST API (JS/TypeScript via fetch/Promise) | Google Vision |
| Flamingo Mini | Hugging Face (open-source, non-commercial) | Free | Multimodal capabilities, advanced captioning, lightweight version | Python, PyTorch | Flamingo Mini HuggingFace |
| Salesforce BLIP | Salesforce AI / Hugging Face (open-source) | Free | State-of-the-art captioning, visual question answering | Python, PyTorch | Salesforce BLIP HF Repo |
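As an illustration of how the commercial REST APIs in the table are typically consumed, here is a hedged sketch of calling the Azure Vision "Analyze Image" endpoint from TypeScript. The endpoint host, subscription key, and API version below are placeholders, not values from this document; consult the Azure Vision docs for your resource's actual endpoint and the current API version.

```typescript
// Sketch: requesting an image description from Azure Vision and extracting
// the best caption. Endpoint, key, and API version are illustrative
// placeholders -- verify them against the official Azure documentation.

type AzureVisionResponse = {
  description?: {
    captions: { text: string; confidence: number }[];
  };
};

// Pure helper: pick the highest-confidence caption, or null if none exist.
function extractAzureCaption(data: AzureVisionResponse): string | null {
  const captions = data.description?.captions ?? [];
  if (captions.length === 0) return null;
  return captions.reduce((best, c) => (c.confidence > best.confidence ? c : best))
    .text;
}

async function describeImageWithAzure(imageUrl: string): Promise<string | null> {
  // Hypothetical resource endpoint and key, for illustration only.
  const endpoint = "https://YOUR_RESOURCE.cognitiveservices.azure.com";
  const key = "YOUR_AZURE_VISION_KEY";

  const response = await fetch(
    `${endpoint}/vision/v3.2/analyze?visualFeatures=Description`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": key,
      },
      body: JSON.stringify({ url: imageUrl }),
    },
  );
  if (!response.ok) {
    throw new Error(`Azure Vision request failed: ${response.status}`);
  }
  return extractAzureCaption(await response.json());
}
```

Separating the pure `extractAzureCaption` helper from the network call keeps the response-parsing logic easy to unit-test without hitting the API.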

## Recommended Approach/Library (Commercial APIs - JS ESM compatible)

For quick, direct integration with a TypeScript-based frontend project (ES modules compatible, without React, with Tailwind CSS for the UI), commercial API services such as the OpenAI Vision API or the Azure Vision API are highly recommended. Here's a simple example integrating OpenAI's GPT-4V:

```typescript
// Example: simple OpenAI Vision API call using fetch.

const API_KEY = "YOUR_OPENAI_API_KEY";

type GPTVisionResponse = {
  choices: {
    message: {
      content: string;
    };
  }[];
};

async function generateImageCaption(imageUrl: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4-vision-preview",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Describe the following image in detail." },
            { type: "image_url", image_url: { url: imageUrl } },
          ],
        },
      ],
      max_tokens: 150,
    }),
  });

  if (!response.ok) {
    throw new Error(`OpenAI API request failed: ${response.status}`);
  }

  const data: GPTVisionResponse = await response.json();
  return data.choices[0].message.content;
}

// Usage example (top-level await, ES modules)
const caption = await generateImageCaption("https://example.com/sample_image.jpg");
console.log("Generated Caption:", caption);
```
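Commercial APIs like this one can fail transiently (network hiccups, HTTP 429 rate limits), so it is common to wrap the call in a retry loop with exponential backoff. The sketch below is a generic wrapper with illustrative default delays, not values prescribed by OpenAI.

```typescript
// Sketch: retry wrapper with exponential backoff for transient API failures.
// Base delay, cap, and attempt count are illustrative defaults.

// Pure helper: backoff delay in milliseconds for a 0-based attempt index.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt; no wait after the final failure.
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) =>
          setTimeout(resolve, backoffDelayMs(attempt)),
        );
      }
    }
  }
  throw lastError;
}

// Usage (assuming a caption function like generateImageCaption above):
// const caption = await withRetries(() => generateImageCaption(url));
```

Keeping the delay calculation in a pure function (`backoffDelayMs`) makes the backoff schedule easy to test in isolation.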

## Integration with a Tailwind CSS-Based Interface (without React)

Here's a minimal example using vanilla JavaScript/TypeScript with Tailwind CSS to style a simple interface:

HTML markup:

```html
<div class="max-w-lg mx-auto p-4 bg-white rounded-lg shadow-lg">
  <input id="image-url" type="url"
    placeholder="Enter Image URL"
    class="w-full p-2 border border-gray-300 rounded mb-4"
  />
  <button id="generate-caption-btn"
    class="bg-blue-500 text-white px-4 py-2 rounded hover:bg-blue-600"
  >
    Generate Caption
  </button>
  <div id="caption-output" class="mt-4 p-4 bg-gray-100 rounded"></div>
</div>
```

TypeScript code (frontend):

```typescript
document.getElementById("generate-caption-btn")!.addEventListener("click", async () => {
  const imageUrlInput = document.getElementById("image-url") as HTMLInputElement;
  const outputDiv = document.getElementById("caption-output")!;
  const imageUrl = imageUrlInput.value;

  outputDiv.textContent = "Generating caption...";

  try {
    const caption = await generateImageCaption(imageUrl);
    outputDiv.textContent = caption;
  } catch (error) {
    outputDiv.textContent = "Error generating caption.";
    console.error(error);
  }
});
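Since the click handler sends whatever the user typed straight to the API, it can be worth validating the input first. The helper below is an illustrative sketch (not part of the original example) that uses the standard `URL` constructor to reject obviously malformed or non-HTTP input before any request is made.

```typescript
// Sketch: basic client-side validation of the image URL before calling the
// captioning API. Accepts only http(s) URLs; this checks the URL's shape,
// not whether it actually points at an image.

function isLikelyImageUrl(value: string): boolean {
  let url: URL;
  try {
    url = new URL(value);
  } catch {
    return false; // Not a parseable URL at all.
  }
  return url.protocol === "http:" || url.protocol === "https:";
}
```

In the click handler above, this could gate the API call: if `isLikelyImageUrl(imageUrl)` is false, set the output div to a validation message instead of calling `generateImageCaption`.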

## References