tutorial 2026-06-14

Image OCR: Extract Text From Any Picture in 100+ Languages (2026)

You snap a photo of a Japanese restaurant menu, a German road sign, a handwritten lecture board, or that one weirdly-formatted receipt your accounting team needs digitised — and now you want the text, in a copy-pasteable form, without typing it out by hand. That's OCR (Optical Character Recognition), and in 2026 it's finally good enough that you can do it on your phone, in your browser, in 100+ languages, without uploading the source image anywhere.

This guide explains how modern OCR works, when to use Ai2Done's Image to Text tool vs the OCR built into your phone vs cloud APIs, and the privacy reasoning behind doing OCR locally for anything you wouldn't post publicly — passports, ID cards, business cards, medical documents, family-tree research, the lot.

TL;DR

Use the Image to Text tool when the source image contains sensitive info (IDs, contracts, medical) — runs 100% in your browser.
Use iOS Live Text or Google Lens for casual one-tap copy from a phone photo. Both are instant; Live Text runs on-device, while Google Lens may process the image in the cloud.
Use cloud OCR (Google Cloud Vision, AWS Textract) when you need very specific features like table structure recognition, handwriting, or 50+ pages per call.
For PDFs, use the Extract Text tool — it auto-detects whether the PDF is text-based (no OCR needed) or scanned (OCR runs).
Tesseract ships trained models for 100+ languages, but quality varies: English, Chinese, Japanese, and Korean are near-perfect on clean print, while low-resource African and Indic scripts are less reliable.

Why this is harder than it looks

Reading text from a photo seems trivial — you do it every day with your eyes. For a computer it requires solving three independent problems that all interact:

Detection: where in the image is there text at all? On a flat document scan, the answer is "everywhere." On a real-world photo (restaurant menu held at an angle under fluorescent lighting), text might be 5% of the pixels, rotated 12°, partially shadowed, and overlapping a colorful background.
Recognition: what character is this glyph? A capital "I", a lowercase "l", and the digit "1" are visually nearly identical in many fonts. Japanese kanji versus Chinese characters is a long-standing hard case: many characters share a Unicode code point and look almost identical, but the expected glyph shape (and sometimes the meaning) depends on the language.
Layout: in what reading order should the characters be assembled into words, lines, paragraphs, columns? A multi-column newspaper page or an invoice with tables is an entirely separate ML problem from the per-character recognition.
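
To make the layout problem concrete, here is a minimal sketch of the simplest possible reading-order pass: group word boxes into lines by vertical position, then read each line left to right. The `WordBox` shape and the half-line-height tolerance are assumptions for illustration, not the output format of any particular engine.

```typescript
// Hypothetical shape for one recognised word: top-left corner and height, in pixels.
interface WordBox {
  text: string;
  x: number;
  y: number;
  height: number;
}

interface TextLine {
  top: number;
  height: number;
  words: WordBox[];
}

// Group words into lines by vertical position, then read each line left to right.
function toReadingOrder(boxes: WordBox[]): string {
  const byTop = [...boxes].sort((a, b) => a.y - b.y);
  const lines: TextLine[] = [];
  for (const box of byTop) {
    const current = lines[lines.length - 1];
    // A word joins the current line when its top edge is within half a line height
    // of the word that started that line (assumed tolerance).
    if (current !== undefined && Math.abs(box.y - current.top) <= current.height / 2) {
      current.words.push(box);
    } else {
      lines.push({ top: box.y, height: box.height, words: [box] });
    }
  }
  return lines
    .map((line) =>
      [...line.words].sort((a, b) => a.x - b.x).map((w) => w.text).join(" "))
    .join("\n");
}
```

This works for a single column of roughly horizontal text. It reads straight across a two-column page or a table, which is exactly why layout analysis is treated as its own problem.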

Naive OCR libraries from 2010 handled recognition reasonably well on clean black-on-white scans and failed at detection and layout on real photos. Modern systems lean on deep learning: some use a single end-to-end neural network that takes the image and emits text in reading order, while others, Tesseract included, pair a neural line recognizer with a separate layout-analysis step.

The most widely used engine for browser-side OCR is Tesseract 5 with LSTM-based recognition. It is open source; Google sponsored its development for many years, and it is now community-maintained. It supports 100+ languages, runs acceptably fast in WebAssembly, and produces accuracy comparable to commercial offerings for the most common 30 languages.

Method 1: Ai2Done Image to Text (browser-side, privacy-first)

The Ai2Done Image to Text tool wraps Tesseract.js (Tesseract 5 compiled to WebAssembly) in a clean UI:

Open /tools/image_to_text in any modern browser.
Pick the language — choose from a dropdown of 100+ options. For multi-language documents (e.g. a Chinese restaurant receipt with English brand names), you can select multiple languages at once.
Upload your image — drag-and-drop a JPG, PNG, HEIC, WebP, or BMP. The tool also accepts a paste from clipboard (handy for screenshots).
Wait 2-15 seconds. Tesseract runs locally on your CPU. First-time use downloads the language model (roughly 5-20 MB per language); subsequent runs skip the download because the model is cached in your browser.
Copy or download — output appears as plain text; you can also export as a searchable PDF where the OCR layer is invisible-but-selectable on top of the original image.
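
For the multi-language step above, Tesseract's convention is to join language codes with a plus sign, for example `chi_sim+eng`. A minimal sketch of turning dropdown selections into that string; the `buildLangString` helper is hypothetical and not part of Tesseract.js or of the tool's actual code:

```typescript
// Hypothetical helper: turn selected language codes into a Tesseract language string.
function buildLangString(selected: string[]): string {
  const cleaned = selected.map((code) => code.trim()).filter((code) => code.length > 0);
  // Drop duplicates while keeping the order the user picked them in.
  const unique = cleaned.filter((code, index) => cleaned.indexOf(code) === index);
  if (unique.length === 0) {
    throw new Error("Select at least one language");
  }
  return unique.join("+");
}
```

For a Chinese receipt with English brand names, `buildLangString(["chi_sim", "eng"])` yields `chi_sim+eng`. Each extra language means another model to download and more work per image, so it pays to select only the languages that actually appear.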

The whole thing runs in your browser. The image and the extracted text never leave your device; the only network traffic is the one-time language model download. For sensitive documents (passports, medical records, bank statements) this is the safest pattern: any OCR-as-a-service offering receives your image and may retain it under its data-retention policy.

Accuracy tips:

For best results, the source image should be at least 300 DPI equivalent (about 2480×3508 px for an A4 page).
Straighten and crop before OCR if you can — Tesseract handles up to ~15° of rotation gracefully but does much better on perfectly aligned text.
For low-contrast scans, the tool has a "binarise" toggle that converts to pure black-and-white using Otsu's method — often a 10-15% accuracy bump on faint or yellowed pages.
Multi-column layouts: enable "detect columns" so Tesseract doesn't read across columns.
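
To show what a binarise step does, here is a minimal sketch of Otsu's method on 8-bit grayscale pixels. It picks the threshold that maximises the variance between the dark and light groups, then maps every pixel to pure black or white. The function names are illustrative; this is not the tool's actual source.

```typescript
// Otsu's method: choose the threshold that best separates dark and light pixels.
function otsuThreshold(gray: Uint8Array): number {
  const hist = new Array<number>(256).fill(0);
  gray.forEach((v) => { hist[v] += 1; });

  const total = gray.length;
  let sumAll = 0;
  hist.forEach((count, level) => { sumAll += level * count; });

  let sumDark = 0;    // intensity sum of pixels at or below the candidate threshold
  let countDark = 0;  // number of those pixels
  let bestVariance = -1;
  let bestThreshold = 0;

  for (let t = 0; t < 256; t++) {
    countDark += hist[t];
    sumDark += t * hist[t];
    if (countDark === 0) continue;       // nothing dark yet
    const countLight = total - countDark;
    if (countLight === 0) break;         // everything is dark from here on

    const meanDark = sumDark / countDark;
    const meanLight = (sumAll - sumDark) / countLight;
    // Between-class variance (up to a constant factor).
    const variance = countDark * countLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}

// Map every pixel to 0 (black) or 255 (white) using the Otsu threshold.
function binarise(gray: Uint8Array): Uint8Array {
  const threshold = otsuThreshold(gray);
  return gray.map((v) => (v > threshold ? 255 : 0));
}
```

A single global threshold works well on evenly lit scans. On phone photos with a shadow across the page, a locally adaptive threshold is usually a better fit.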

Method 2: iOS Live Text / Google Lens (one-tap on phone)

For casual everyday OCR, the OCR built into your phone is genuinely magical:

iOS Live Text (iOS 15+): point the camera at any text, tap the indicator in the bottom-right corner of the viewfinder, and select text exactly as you would on a webpage. Recognition runs on-device and needs an iPhone XS, XR, or newer (A12 Bionic chip or later); older iPhones don't get Live Text at all.
Google Lens (Android, Chrome, Google Photos): same flow, slightly broader language support, runs cloud-side by default but offers an "on-device" preference for sensitive content on Pixel devices.

Both are perfect for "I need to copy this restaurant menu into a translator app" or "I need my friend's phone number from this whiteboard photo." For anything that needs to land in a downloadable text file or a CSV, they're awkward: you still have to manually copy each chunk into a notes app.

Method 3: Cloud OCR APIs (when you need scale or special features)

For automated pipelines processing thousands of documents, or when you need features beyond plain text extraction:

Google Cloud Vision API — excellent multi-language support, exceptional handwriting recognition, $1.50 per 1000 images.
AWS Textract — best-in-class for forms and tables (it returns structured key-value pairs and table cells, not just plain text), $1.50-50 per 1000 pages depending on features.
Azure AI Vision (formerly Azure Computer Vision): solid all-rounder, integrated with Microsoft 365 workflows.
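
For budgeting, the per-1000 prices quoted above turn into a monthly figure with one line of arithmetic. A minimal sketch; the `estimateCostUsd` helper is hypothetical, and it ignores free tiers and volume discounts, so treat the result as an upper-bound estimate and check the provider's current pricing page.

```typescript
// Estimate spend from a per-1000-units price, such as the figures quoted in the list above.
function estimateCostUsd(units: number, pricePer1000Usd: number): number {
  if (units < 0 || pricePer1000Usd < 0) {
    throw new Error("inputs must be non-negative");
  }
  // Round to whole cents to avoid floating-point noise in the result.
  return Math.round((units / 1000) * pricePer1000Usd * 100) / 100;
}
```

At $1.50 per 1000, 25,000 images come to $37.50. The same volume at the $50 end of the Textract range comes to $1,250, which is why it matters which features you actually switch on.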

The trade-off: every image you process is sent to a third-party server and may be retained under the provider's data-retention policy, so check the retention terms before you build on it. For automated business workflows on non-sensitive data this is fine. For passports, medical records, contracts, or anything personal, it's a privacy step you may not want to take.

How we built it (technical deep-dive)

The Ai2Done Image to Text tool is built on:

Tesseract.js 5.1 — Tesseract 5 LSTM compiled to WebAssembly. The core engine is ~1.5 MB gzipped; each language model is 5-20 MB.
Lazy language loading — we don't ship 100 language models; the browser downloads only the languages you select, on demand. Models cache in the browser's HTTP cache so reload is instant.
Web Worker thread pool — for batch OCR of multiple images, we spawn workers up to navigator.hardwareConcurrency - 1 to keep the UI responsive while crunching.
Pre-processing pipeline — before handing the image to Tesseract, we run optional deskew (using Hough transform), binarisation (Otsu's method), and contrast normalisation. These help significantly on phone photos of physical documents.
Searchable PDF export — for the "OCR overlay" output, we use pdf-lib to compose the original image plus an invisible-text layer at the correct character positions. The output is a real PDF that any reader can open, search, and copy from.
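
The worker pool sizing rule above can be sketched as a small pure function. The `workerPoolSize` name and the fallback to one core when `navigator.hardwareConcurrency` is unavailable are assumptions for illustration, not the tool's actual code.

```typescript
// Decide how many OCR workers to spawn for a batch.
// Leaves one core free for the UI thread and never spawns more workers than jobs.
function workerPoolSize(hardwareConcurrency: number | undefined, jobs: number): number {
  // Assumed fallback: treat an unknown or invalid core count as a single core.
  const cores =
    hardwareConcurrency !== undefined && hardwareConcurrency > 0 ? hardwareConcurrency : 1;
  const maxWorkers = Math.max(1, cores - 1);
  return Math.max(1, Math.min(maxWorkers, jobs));
}

// In a browser this would be called as:
//   workerPoolSize(navigator.hardwareConcurrency, files.length)
```

Taking the core count as a parameter rather than reading `navigator` inside the function keeps it easy to test. On an 8-core laptop with 20 images queued, it returns 7.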

For very large images (>4000 px on a side), we down-sample to 2000 px before OCR. Tesseract's accuracy plateaus around that resolution for most fonts, and the extra pixels just slow things down without improving the output.
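
A minimal sketch of that down-sampling rule, assuming "2000 px" refers to the longest side and that the aspect ratio is preserved. The `downsampleSize` helper and its parameter names are hypothetical.

```typescript
interface Size {
  width: number;
  height: number;
}

// If the longest side exceeds `trigger`, scale the image so the longest side equals `target`.
function downsampleSize(width: number, height: number, trigger = 4000, target = 2000): Size {
  const longest = Math.max(width, height);
  if (longest <= trigger) {
    return { width, height }; // small enough: leave untouched
  }
  const scale = target / longest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

A 6000×4000 px photo becomes 2000×1333 px, while a 3000×2000 px image passes through unchanged. The resulting size would then be used when drawing the image onto a canvas before OCR.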

FAQ

Q: Does the tool support handwritten text? A: Limited. Tesseract's stock models are trained on printed text rather than handwriting, so expect 60-80% accuracy at best on real handwriting. For serious handwriting OCR, Google Cloud Vision or Azure AI Document Intelligence (formerly Form Recognizer) are still meaningfully better. We're tracking future Tesseract releases for handwriting improvements.

Q: Can I OCR a PDF directly without screenshotting each page first? A: Yes — use the Extract Text tool. It opens the PDF, detects whether each page is text-based (extracts directly) or image-based (runs OCR), and produces a combined text output.
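
One common heuristic for the text-or-scanned decision is to check how much embedded text a page already carries. A minimal sketch; the `routePage` name and the 20-character cutoff are assumptions for illustration, and the Extract Text tool may decide differently.

```typescript
type PageRoute = "extract" | "ocr";

// Decide per page: use the embedded text layer if it has enough visible characters,
// otherwise treat the page as a scan and run OCR on its rendered image.
function routePage(embeddedText: string, minChars = 20): PageRoute {
  const visibleChars = embeddedText.replace(/\s/g, "").length;
  return visibleChars >= minChars ? "extract" : "ocr";
}
```

The cutoff guards against scanned pages that carry only a stray page number or watermark in their text layer.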

Q: My Chinese / Japanese / Korean OCR has weird character substitutions. Why? A: For CJK languages, picking the correct language model matters more than for European languages. Simplified Chinese and Traditional Chinese share many characters but use different glyph styles for some — picking chi_sim for a Traditional-Chinese document gives subtly wrong output. Pick the specific variant (chi_sim, chi_tra, jpn, jpn_vert for vertical Japanese, kor).

Q: How accurate is it really? A: On clean printed text in well-supported languages (English, French, Spanish, German, Chinese, Japanese, Korean, Russian, Arabic), expect 98-99.5% character accuracy at 300 DPI. On phone photos at typical lighting, 92-97%. On low-resource African or Indic scripts, sometimes lower — Tesseract's training data is uneven.
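
Character accuracy figures like these are typically computed from the edit distance between the OCR output and a hand-checked transcript. A minimal sketch using Levenshtein distance; the function names are illustrative, and it compares UTF-16 code units, which is fine for most text but would need grapheme-aware splitting for scripts that rely on combining marks.

```typescript
// Levenshtein distance: minimum number of insertions, deletions, and substitutions.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy = 1 - (edit distance / length of the ground-truth text).
function characterAccuracy(truth: string, ocr: string): number {
  if (truth.length === 0) return ocr.length === 0 ? 1 : 0;
  return Math.max(0, 1 - levenshtein(truth, ocr) / truth.length);
}
```

For the ground truth `Invoice 1001` read as `Invoice l00l`, two of twelve characters are wrong, giving about 83% character accuracy. It also shows how a single I/l/1 confusion can ruin an otherwise perfect invoice number.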

Q: Can the tool extract text from a video? A: Not directly; it operates on still images. If the words you want are spoken in the video, use the Audio to Text tool instead (it runs Whisper, OpenAI's speech recognition model, also browser-side). If the text is visible in the frame, such as burned-in subtitles or a slide, screenshot that frame and run Image to Text.

Q: Will OCR work on photos taken at an angle? A: Yes, within ~15° of rotation. For more tilted photos (taken from across a table, say), enable the "auto-deskew" option. Beyond ~30° of rotation, results degrade quickly: straighten in your photo app first, then OCR.

Q: What about table-structured data (invoices, spreadsheets)? A: Tesseract returns plain text and approximates layout with whitespace and line breaks. For genuine table-cell structure (key-value pairs, multi-column financial reports), a dedicated document API such as AWS Textract is meaningfully better. We're adding browser-side table detection in 2026 Q4.

Try it now

Pull text out of any image in seconds:

Open the Image to Text tool →

Drag-drop a photo, pick the language(s), get text. No upload, no signup, no watermark.
