This repository has been archived on 2025-12-24.
site-template/docs/image-to-text.md
2025-03-08 12:47:58 +01:00


## Brief

Image-to-text models are AI models that convert visual data (images) directly into descriptive, informative text. Unlike Optical Character Recognition (OCR), these models generate captions or descriptions reflecting the content, context, and meaning of the image rather than just extracting written characters. Several commercial and open-source models are available for integration into web-based or other software solutions.

## Available Models (Commercial and Open-source)

| Model Name | Provider/Platform | Price | Key Features | Language/Library | Links |
| --- | --- | --- | --- | --- | --- |
| BLIP-2 | Hugging Face (open-source, non-commercial) | Free | Image captioning, zero-shot inference, multimodal | Python, PyTorch | BLIP-2 HuggingFace |
| OpenAI GPT-4V | OpenAI (commercial) | Usage-based pricing (~$0.03/image) | Vision-enabled, advanced multimodal understanding, high accuracy | REST API (JS/TypeScript via fetch/Promise) | OpenAI Vision |
| Microsoft Azure Vision | Microsoft Azure (commercial) | Pay-as-you-go (~$1.50 per 1,000 transactions) | Advanced image description, object detection, image analysis | REST API (JS/TypeScript via REST fetch or Azure SDK) | Azure Vision Docs |
| Google Cloud Vision API | Google (commercial) | Pay-as-you-go (~$1.50 per 1,000 requests) | Robust image annotation, label detection, powerful object description | REST API (JS/TypeScript via fetch/Promise) | Google Vision |
| Flamingo Mini | Hugging Face (open-source, non-commercial) | Free | Multimodal capabilities, advanced captioning, lightweight version | Python, PyTorch | Flamingo Mini HuggingFace |
| Salesforce BLIP | Salesforce AI / Hugging Face (open-source) | Free | State-of-the-art captioning, visual question answering | Python, PyTorch | Salesforce BLIP HF Repo |
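As an illustration of how the commercial REST APIs in the table are typically consumed, here is a hedged sketch of calling the Azure Vision "Analyze Image" endpoint from TypeScript. The endpoint host, subscription key, and API version below are placeholders, not values from this document; consult the Azure Vision docs for your resource's actual endpoint and the current API version.

```typescript
// Sketch: requesting an image description from Azure Vision and extracting
// the best caption. Endpoint, key, and API version are illustrative
// placeholders -- verify them against the official Azure documentation.

type AzureVisionResponse = {
  description?: {
    captions: { text: string; confidence: number }[];
  };
};

// Pure helper: pick the highest-confidence caption, or null if none exist.
function extractAzureCaption(data: AzureVisionResponse): string | null {
  const captions = data.description?.captions ?? [];
  if (captions.length === 0) return null;
  return captions.reduce((best, c) => (c.confidence > best.confidence ? c : best))
    .text;
}

async function describeImageWithAzure(imageUrl: string): Promise<string | null> {
  // Hypothetical resource endpoint and key, for illustration only.
  const endpoint = "https://YOUR_RESOURCE.cognitiveservices.azure.com";
  const key = "YOUR_AZURE_VISION_KEY";

  const response = await fetch(
    `${endpoint}/vision/v3.2/analyze?visualFeatures=Description`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": key,
      },
      body: JSON.stringify({ url: imageUrl }),
    },
  );
  if (!response.ok) {
    throw new Error(`Azure Vision request failed: ${response.status}`);
  }
  return extractAzureCaption(await response.json());
}
```

Separating the pure `extractAzureCaption` helper from the network call keeps the response-parsing logic easy to unit-test without hitting the API.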

## Recommended Approach/Library (Commercial APIs - JS ESM compatible)

For quick, direct integration with a TypeScript-based frontend project (ES modules compatible, without React, with Tailwind CSS for the UI), commercial API services such as the OpenAI Vision API or the Azure Vision API are highly recommended. Here's a simple example integrating OpenAI's GPT-4V:

```typescript
// Example: simple OpenAI Vision API call using fetch.

const API_KEY = "YOUR_OPENAI_API_KEY";

type GPTVisionResponse = {
  choices: {
    message: {
      content: string;
    };
  }[];
};

async function generateImageCaption(imageUrl: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4-vision-preview",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Describe the following image in detail." },
            { type: "image_url", image_url: { url: imageUrl } },
          ],
        },
      ],
      max_tokens: 150,
    }),
  });

  if (!response.ok) {
    throw new Error(`OpenAI API request failed: ${response.status}`);
  }

  const data: GPTVisionResponse = await response.json();
  return data.choices[0].message.content;
}

// Usage example (top-level await, ES modules)
const caption = await generateImageCaption("https://example.com/sample_image.jpg");
console.log("Generated Caption:", caption);
```
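Commercial APIs like this one can fail transiently (network hiccups, HTTP 429 rate limits), so it is common to wrap the call in a retry loop with exponential backoff. The sketch below is a generic wrapper with illustrative default delays, not values prescribed by OpenAI.

```typescript
// Sketch: retry wrapper with exponential backoff for transient API failures.
// Base delay, cap, and attempt count are illustrative defaults.

// Pure helper: backoff delay in milliseconds for a 0-based attempt index.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt; no wait after the final failure.
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) =>
          setTimeout(resolve, backoffDelayMs(attempt)),
        );
      }
    }
  }
  throw lastError;
}

// Usage (assuming a caption function like generateImageCaption above):
// const caption = await withRetries(() => generateImageCaption(url));
```

Keeping the delay calculation in a pure function (`backoffDelayMs`) makes the backoff schedule easy to test in isolation.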

## Integration with a Tailwind CSS-Based Interface (without React)

Here's a minimal example using vanilla JavaScript/TypeScript with Tailwind CSS to style a simple interface:

HTML markup:

```html
<div class="max-w-lg mx-auto p-4 bg-white rounded-lg shadow-lg">
  <input id="image-url" type="url"
    placeholder="Enter Image URL"
    class="w-full p-2 border border-gray-300 rounded mb-4"
  />
  <button id="generate-caption-btn"
    class="bg-blue-500 text-white px-4 py-2 rounded hover:bg-blue-600"
  >
    Generate Caption
  </button>
  <div id="caption-output" class="mt-4 p-4 bg-gray-100 rounded"></div>
</div>
```

TypeScript code (frontend):

```typescript
document.getElementById("generate-caption-btn")!.addEventListener("click", async () => {
  const imageUrlInput = document.getElementById("image-url") as HTMLInputElement;
  const outputDiv = document.getElementById("caption-output")!;
  const imageUrl = imageUrlInput.value;

  outputDiv.textContent = "Generating caption...";

  try {
    const caption = await generateImageCaption(imageUrl);
    outputDiv.textContent = caption;
  } catch (error) {
    outputDiv.textContent = "Error generating caption.";
    console.error(error);
  }
});
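Since the click handler sends whatever the user typed straight to the API, it can be worth validating the input first. The helper below is an illustrative sketch (not part of the original example) that uses the standard `URL` constructor to reject obviously malformed or non-HTTP input before any request is made.

```typescript
// Sketch: basic client-side validation of the image URL before calling the
// captioning API. Accepts only http(s) URLs; this checks the URL's shape,
// not whether it actually points at an image.

function isLikelyImageUrl(value: string): boolean {
  let url: URL;
  try {
    url = new URL(value);
  } catch {
    return false; // Not a parseable URL at all.
  }
  return url.protocol === "http:" || url.protocol === "https:";
}
```

In the click handler above, this could gate the API call: if `isLikelyImageUrl(imageUrl)` is false, set the output div to a validation message instead of calling `generateImageCaption`.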

## References