## Image to Text for Markdown Tables

This document explores options for converting images to text, focusing on extracting structured information and presenting it as Markdown tables. We prioritize offline and free solutions and consider both commercial and open-source models, particularly from Hugging Face.

### Brief

Extracting structured text from images, especially to create Markdown tables with specific information like links, prices, and specifications, is a complex task that goes beyond simple Optical Character Recognition (OCR). While OCR transcribes text, identifying and organizing structured information requires further processing, potentially involving layout analysis, entity recognition, and relationship extraction.
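
As a rough illustration of the entity-recognition step, the sketch below pulls prices and links out of plain OCR text with regular expressions. The sample text, the function name `extract_entities`, and both patterns are assumptions made for illustration; real OCR output is noisier and would likely need more tolerant patterns.

```python
import re

# Hypothetical patterns: a currency symbol followed by an amount, and http(s) URLs.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def extract_entities(ocr_text: str) -> dict[str, list[str]]:
    """Return the prices and links found in a block of OCR text."""
    return {
        "prices": PRICE_RE.findall(ocr_text),
        "links": URL_RE.findall(ocr_text),
    }


# Made-up text standing in for the output of an OCR engine.
sample = "Product A  $10.50  https://example.com/a\nProduct B  $20  https://example.com/b"
entities = extract_entities(sample)
print(entities)
```

Regular expressions only cover entities with a predictable surface form; free-form fields such as specifications usually depend on layout information instead.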

### Offline / Free Options

Offline, free tools are weaker at highly structured extraction than hosted services, but they offer viable options, especially when several are combined.

#### 1. Tesseract OCR

- **Description:** A widely used open-source OCR engine, originally developed at HP, later sponsored by Google, and now maintained as a community open-source project. It supports a large number of languages.
- **Offline:** Yes, Tesseract is designed for offline use.
- **Free:** Yes, it is released under the Apache License 2.0.
- **Markdown Tables:** Tesseract by itself performs OCR, converting image text to plain text. It does *not* inherently output Markdown tables or understand structured data like links, prices, or specs.
- **Limitations:** Default output is plain text, not structured tables (word positions are available through its hOCR and TSV output modes). Requires additional processing to identify and structure data. Accuracy can vary based on image quality and layout complexity.
- **Libraries (TypeScript ESM compatible and others):**
  - **`tesseract.js` (JavaScript/TypeScript, ESM compatible):** A JavaScript port of Tesseract OCR for use in the browser and Node.js.
    ```typescript
    import { createWorker } from 'tesseract.js';

    async function recognizeText(imagePath: string): Promise<void> {
      // createWorker('eng') loads the English language data (tesseract.js v5+ signature)
      const worker = await createWorker('eng');
      try {
        const result = await worker.recognize(imagePath);
        console.log(result.data.text);
      } finally {
        // Always release the worker, even if recognition fails
        await worker.terminate();
      }
    }

    recognizeText('image.png').catch(console.error);
    ```
  - **`node-tesseract-ocr` (Node.js wrapper):** A Node.js wrapper around the Tesseract command-line tool. Requires Tesseract to be installed on your system.
    ```javascript
    const tesseract = require('node-tesseract-ocr');

    const config = {
      lang: 'eng', // Replace 'eng' with your language code
      oem: 1,      // OCR engine mode: 1 = LSTM neural network engine
      psm: 3,      // Page segmentation mode: 3 = fully automatic
    };

    tesseract
      .recognize('image.png', config)
      .then((text) => {
        console.log(text);
      })
      .catch((error) => {
        console.error(error.message);
      });
    ```
  - **`pytesseract` (Python wrapper):** A Python wrapper for Tesseract. Requires Tesseract to be installed on your system.
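
For Python, a minimal sketch of the additional processing mentioned under Limitations might look like the following. It assumes the word-level dictionary that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns and regroups the words into text lines. The helper names `group_words_into_lines` and `ocr_lines` are hypothetical, a single-page image is assumed, and the `pytesseract` import is deferred so the grouping logic can be tried without Tesseract installed.

```python
from collections import defaultdict


def group_words_into_lines(data: dict) -> list[str]:
    """Join word-level OCR results into text lines.

    `data` is assumed to hold the parallel lists that pytesseract's
    image_to_data returns with Output.DICT: 'block_num', 'par_num',
    'line_num', and 'text'.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty entries emitted for blocks, paragraphs, and lines
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]


def ocr_lines(image_path: str) -> list[str]:
    """Run Tesseract on an image and return its text lines (needs Tesseract installed)."""
    import pytesseract
    from pytesseract import Output
    from PIL import Image

    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    return group_words_into_lines(data)


# Synthetic word data in the same shape, so the grouping can be checked offline.
sample = {
    "block_num": [1, 1, 1, 1],
    "par_num": [1, 1, 1, 1],
    "line_num": [1, 1, 2, 2],
    "text": ["Product", "Price", "Widget", "$10"],
}
lines_out = group_words_into_lines(sample)
print(lines_out)
```

Line grouping is only a first step; splitting each line into table columns would additionally need the `left` and `width` values from the same dictionary.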

#### 2. EasyOCR

- **Description:** An OCR library built on PyTorch that uses CRAFT for text detection and a CRNN for text recognition. It is designed to be easy to use and supports multiple languages with pre-trained models.
- **Offline:** Yes, EasyOCR works offline after installing the library and downloading language models.
- **Free:** Yes, EasyOCR is open-source and free to use under the Apache License 2.0.
- **Markdown Tables:** Similar to Tesseract, EasyOCR is primarily an OCR engine. It excels at text detection and recognition but does not natively create Markdown tables or extract structured data.
- **Limitations:** Output is a flat list of text fragments with bounding boxes and confidence scores. Needs further processing for structure. Relies on pre-trained models, so accuracy depends on how well they cover the language and script.
- **Libraries (Python):**
  - **`easyocr` (Python):**
    ```python
    import easyocr

    reader = easyocr.Reader(['en'])  # run only once; loads the model into memory
    result = reader.readtext('image.png')
    for (bbox, text, prob) in result:
        print(f"Text: {text}, Probability: {prob}")
    ```
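
The further processing mentioned under Limitations can start from the bounding boxes. The sketch below assumes results in the `(bbox, text, prob)` shape returned by `readtext`, where `bbox` is a list of four `[x, y]` corner points starting at the top-left, and groups fragments into rows by vertical position. The function `rows_from_ocr` and the `row_tolerance` default are illustrative assumptions, not part of EasyOCR.

```python
def rows_from_ocr(results, row_tolerance=10):
    """Group (bbox, text, prob) fragments into rows of cell strings.

    Fragments whose top edge lies within `row_tolerance` pixels of the
    first fragment in the current row are treated as the same row, and
    cells within a row are ordered left to right.
    """
    # Reduce each fragment to (top, left, text) using its top-left corner.
    items = sorted((bbox[0][1], bbox[0][0], text) for bbox, text, _ in results)
    rows, current, current_top = [], [], None
    for top, left, text in items:
        if current and abs(top - current_top) > row_tolerance:
            rows.append([t for _, t in sorted(current)])
            current = []
        if not current:
            current_top = top
        current.append((left, text))
    if current:
        rows.append([t for _, t in sorted(current)])
    return rows


# Synthetic results standing in for reader.readtext('image.png').
sample = [
    ([[120, 12], [180, 12], [180, 30], [120, 30]], "Price", 0.98),
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "Product", 0.99),
    ([[10, 50], [90, 50], [90, 70], [10, 70]], "Widget", 0.97),
    ([[120, 51], [160, 51], [160, 70], [120, 70]], "$10", 0.95),
]
table_rows = rows_from_ocr(sample)
print(table_rows)
```

A fixed pixel tolerance is the simplest possible heuristic; skewed photos or tables with wrapped cell text would need something more robust.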

#### 3. Hugging Face Hub - Open Source Models (Offline Potential)

- **Description:** Hugging Face Hub hosts a vast collection of pre-trained models, including many for OCR, document layout analysis, and table detection. Some models can be used offline with libraries like `transformers.js` or Python's `transformers` library.
- **Offline:** Potentially. Many models can be downloaded once and then run locally without a network connection. Check model documentation for offline compatibility and library requirements.
- **Free:** Many models on Hugging Face Hub are open-source and free to use, but licenses vary per model.
- **Markdown Tables:** This is where more advanced structured extraction becomes feasible.
  - **OCR Models:** Models like those from PaddleOCR (available on HF Hub) may offer more robust OCR capabilities.
  - **Layout Analysis Models:** Models designed for document layout analysis can identify regions of text, tables, and other elements within an image. This is a crucial step towards structured output. Search the Hub for terms such as "document layout analysis" or "table detection".
  - **Table Detection and Recognition Models:** Specialized models exist for detecting and recognizing tables specifically. Some might even output structured data formats that can be converted to Markdown tables. Explore models related to "table OCR" or "table recognition".

- **Exploration Steps on Hugging Face Hub:**
  1. **Search:** On Hugging Face Hub ([https://huggingface.co/models](https://huggingface.co/models)), search for keywords like "OCR," "document layout analysis," "table detection," "table recognition."
  2. **Model Cards:** Review model cards for details on:
     - **Tasks:** Does it perform OCR, layout analysis, table detection, etc.?
     - **Offline Usage:** Is it compatible with `transformers` for Python or `transformers.js` for browser/Node.js and offline inference?
     - **Input/Output:** What type of input does it expect (image)? What output does it produce (text, bounding boxes, structured table data)?
     - **License:** Is it free for your intended use?
  3. **Example Code (Python with `transformers` - conceptual):**

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# Conceptual example: replace "your-table-detection-model" with an actual
# Hugging Face model identifier after research. Table detection is usually an
# object-detection task, so the object-detection auto classes are assumed here;
# other model types need different classes and post-processing.


def generate_markdown_table(data):
    """Render {"headers": [...], "rows": [[...], ...]} as a Markdown table."""
    markdown = "|" + "|".join(data["headers"]) + "|\n"
    markdown += "|" + "|".join(["---"] * len(data["headers"])) + "|\n"
    for row in data["rows"]:
        markdown += "|" + "|".join(row) + "|\n"
    return markdown


detection_processor = AutoImageProcessor.from_pretrained("your-table-detection-model")
detection_model = AutoModelForObjectDetection.from_pretrained("your-table-detection-model")

image = Image.open("image_with_table.png").convert("RGB")

# 1. Table detection (model dependent)
inputs = detection_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detection_model(**inputs)

# Convert raw outputs to pixel boxes (x_min, y_min, x_max, y_max).
# post_process_object_detection is available on DETR-style image processors.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = detection_processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
table_crops = [image.crop([round(v) for v in box.tolist()]) for box in detections["boxes"]]

# 2. Table extraction (model dependent)
# Run a table-structure or OCR model ("your-table-extraction-model") on each
# crop and map its output to headers and rows. Left as a placeholder here.
table_data = {"headers": [], "rows": []}  # fill from the extraction model's output

# 3. Convert to Markdown table
markdown_table = generate_markdown_table(table_data)
print(markdown_table)
```

**Important Notes for Hugging Face Models:**

- **Model Selection is Key:** The effectiveness of Hugging Face models depends heavily on choosing the right model for the task (OCR, layout analysis, table detection). Careful research and experimentation on Hugging Face Hub are essential.
- **Model and Processor Compatibility:** Ensure that the chosen model has a corresponding processor (`AutoProcessor`) and library support (`transformers`, `transformers.js`) for your desired environment (Python, JavaScript).
- **Input/Output Handling:** Understand the model's expected input format (image, document, text, regions) and output format. Processing the raw model output to extract structured data and format it into Markdown tables will likely require custom code.
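
For the offline case specifically, one possible setup is to download a model once and afterwards load it from a local directory only. The sketch below relies on the `HF_HUB_OFFLINE` environment variable and the `local_files_only` argument of `from_pretrained`. The helper `load_offline` and the directory path are hypothetical, and the auto classes chosen assume an object-detection style model.

```python
import os

# Tell the Hugging Face libraries not to attempt network access.
# This needs to be set before `transformers` is imported.
os.environ["HF_HUB_OFFLINE"] = "1"


def load_offline(model_dir: str):
    """Load a processor and model from a directory saved earlier with save_pretrained."""
    # Imported lazily so the environment variable above is already in effect.
    from transformers import AutoImageProcessor, AutoModelForObjectDetection

    processor = AutoImageProcessor.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForObjectDetection.from_pretrained(model_dir, local_files_only=True)
    return processor, model


# Usage, once a model has been saved locally (the path is a placeholder):
# processor, model = load_offline("./models/your-table-detection-model")
```

With both safeguards in place, a missing local file fails immediately with an error instead of triggering a download.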

### Commercial Models (Consider for Comparison and Potential Online Alternatives)

While the focus is on free/offline, understanding commercial options provides a benchmark and an alternative if constraints allow for online services or paid solutions.

- **Cloud Vision APIs (Google Cloud Vision, Azure Computer Vision, AWS Textract):**
  - **Description:** Powerful cloud-based APIs offering advanced OCR, document analysis, and table extraction capabilities. Often provide structured output, including JSON or other formats that can be easily converted to Markdown tables.
  - **Offline:** No, these are cloud-based services.
  - **Free Tier / Paid:** Typically offer a free tier with limited usage and then become paid based on consumption.
  - **Markdown Tables:** Often provide structured output that can be readily transformed into Markdown tables. Some might even directly output Markdown or formats close to it.
- **Adobe Acrobat Pro DC, ABBYY FineReader:**
  - **Description:** Desktop software with robust OCR and document conversion features, including to structured formats.
  - **Offline:** Yes, desktop software.
  - **Paid:** Commercial software licenses required.
  - **Markdown Tables:** May offer export options to structured formats (like CSV or XML) that can then be converted to Markdown tables. Might require an intermediate step.
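
The intermediate step can be small. As a sketch, assuming a tool exported a CSV file whose first row is the header, the standard-library `csv` module is enough to produce a Markdown table. The helper name `csv_to_markdown` is hypothetical and the sample data is made up.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Convert CSV text (first row = header) into a Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Pad short rows so every line has the same number of cells.
        cells = row + [""] * (len(header) - len(row))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Made-up export; the quoted field shows that commas inside cells survive.
sample_csv = 'Product,Price,Specs\nProduct A,$10,"Spec 1, Spec 2"\n'
md = csv_to_markdown(sample_csv)
print(md)
```

Using `csv.reader` instead of splitting on commas keeps quoted fields such as a comma-separated spec list in a single cell.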

### References

- **Tesseract OCR:** [https://tesseract-ocr.github.io/](https://tesseract-ocr.github.io/)
- **Tesseract.js:** [https://tesseract.projectnaptha.com/](https://tesseract.projectnaptha.com/)
- **node-tesseract-ocr:** [https://www.npmjs.com/package/node-tesseract-ocr](https://www.npmjs.com/package/node-tesseract-ocr)
- **pytesseract:** [https://pypi.org/project/pytesseract/](https://pypi.org/project/pytesseract/)
- **EasyOCR:** [https://www.jaided.ai/easyocr/](https://www.jaided.ai/easyocr/)
- **Hugging Face Hub Models:** [https://huggingface.co/models](https://huggingface.co/models)
- **Transformers Python Library:** [https://huggingface.co/docs/transformers/index](https://huggingface.co/docs/transformers/index)
- **Transformers.js Library:** [https://huggingface.co/docs/transformers.js/index](https://huggingface.co/docs/transformers.js/index)

### Example Code (Markdown Table Generation - TypeScript)

```typescript
function generateMarkdownTable(headers: string[], rows: string[][]): string {
  let markdown = "";
  markdown += "|" + headers.join("|") + "|\n";
  markdown += "|" + headers.map(() => "---").join("|") + "|\n";
  for (const row of rows) {
    markdown += "|" + row.join("|") + "|\n";
  }
  return markdown;
}

// Example Usage:
const tableHeaders = ["Product", "Price", "Link", "Specs"];
const tableData = [
  ["Product A", "$10", "[Link A](https://example.com/a)", "Spec 1, Spec 2"],
  ["Product B", "$20", "[Link B](https://example.com/b)", "Spec 3, Spec 4"],
];

const markdownTable = generateMarkdownTable(tableHeaders, tableData);
console.log(markdownTable);
```
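
One caveat with the generator above is that a cell containing `|` or a line break would corrupt the table, which is likely with raw OCR output. A possible variant, written in Python to match the OCR examples and using the hypothetical names `escape_cell` and `generate_markdown_table`, escapes those characters before joining:

```python
def escape_cell(value: str) -> str:
    """Make a string safe to place inside a Markdown table cell."""
    # Escape pipes so they are not read as column separators, and
    # replace line breaks, which a Markdown table row cannot contain.
    return value.replace("|", "\\|").replace("\r\n", " ").replace("\n", " ").strip()


def generate_markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    """Build a Markdown table, escaping every cell."""
    lines = [
        "|" + "|".join(escape_cell(h) for h in headers) + "|",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("|" + "|".join(escape_cell(c) for c in row) + "|")
    return "\n".join(lines) + "\n"


table = generate_markdown_table(
    ["Product", "Specs"],
    [["Product A", "USB-C | HDMI"], ["Product B", "Line one\nline two"]],
)
print(table)
```

The same two replacements could be applied inside the TypeScript function with `String.prototype.replace` and a global regular expression.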

### Conclusion

Creating Markdown tables with structured information from images offline and for free is a challenging but achievable goal. Starting with OCR engines like Tesseract or EasyOCR is fundamental. For more advanced structured extraction, exploring layout analysis and table detection models on Hugging Face Hub is recommended.

Remember that truly offline, free, and highly accurate structured extraction often requires a combination of tools and potentially some level of custom development to process OCR output, identify entities (links, prices, specs), and format them into Markdown tables. Commercial cloud services generally offer more readily available structured output but at a cost and require internet connectivity.