Scanned PDFs and OCR: how AttachKit reads image-only documents
Why scanned, image-only PDFs can't be searched, filled, or redacted by text, and how the Make searchable tool adds a selectable text layer with on-device OCR in 24 languages.
A scanned PDF is a stack of photographs of pages. It looks like any other PDF, but there's no text underneath the pixels — so you can't search it, select or copy from it, and tools that work on text have nothing to work with. This article explains how to recognize one and how AttachKit's on-device OCR fixes it.
How to tell your PDF is scanned
- Try selecting text in any PDF viewer. If your cursor can't grab words — or selection drags a box over the whole page — the page is an image.
- In Fill, a scanned form has no detectable form fields (though you can still place text on top of the page manually).
- In Redact, Scan for PII stops and tells you "This PDF has no embedded text — it looks like a scan", offering to run OCR first.
- Converters like PDF to Word or PDF to text come out empty or nearly empty unless OCR is involved.
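The first check above can be approximated in code. The sketch below is a heuristic, not AttachKit's actual detection logic: it assumes some PDF library has already extracted the text of each page into `pageTexts`, and the function name and the 20-character threshold are illustrative choices.

```typescript
// Heuristic sketch: decide whether a PDF is probably image-only.
// `pageTexts` is assumed to hold whatever text a PDF library extracted per page.
// The name and the default threshold are hypothetical, chosen for illustration.
function looksScanned(pageTexts: string[], minCharsPerPage = 20): boolean {
  if (pageTexts.length === 0) return false; // nothing to judge

  // Count pages that carry a meaningful amount of non-whitespace text.
  const pagesWithText = pageTexts.filter(
    (text) => text.replace(/\s+/g, "").length >= minCharsPerPage
  ).length;

  // Treat the file as a scan when no page has real embedded text.
  return pagesWithText === 0;
}
```

A stricter variant could flag a document when only some pages lack text, which is common in PDFs that mix typed pages with scanned inserts.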
The fix: Make searchable
Make searchable runs optical character recognition (OCR) on every page and writes an invisible text layer underneath the page images. The rendered pages stay pixel-identical — the result looks exactly like your scan — but it becomes selectable and searchable in any PDF reader.
The OCR engine is Tesseract compiled to WebAssembly, running inside your browser. Your scan is never uploaded: the engine code is served from AttachKit's own origin, and the document never appears in any network request. You can watch the Network tab while it runs to confirm — see How AttachKit handles your files.
Drop the PDF, pick a language, and click Make searchable. The result downloads with -searchable added to the file name, and the success message reports how many words were indexed.
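As a small illustration of the naming rule, the sketch below adds -searchable to a file name. It assumes the suffix goes just before the extension; the article only says the suffix is added to the name, so the exact placement and the function name are assumptions.

```typescript
// Sketch: derive the download name for the searchable copy.
// Assumption: the "-searchable" suffix is inserted before the file extension.
function searchableName(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  // No extension (or a name that only starts with a dot): append at the end.
  if (dot <= 0) return `${fileName}-searchable`;
  return `${fileName.slice(0, dot)}-searchable${fileName.slice(dot)}`;
}
```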
Choosing the OCR language
The OCR language picker offers 24 languages, from Arabic to Vietnamese — including Chinese (simplified and traditional), Spanish, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Ukrainian. English is the default, and your choice is remembered for next time.
Two things worth knowing:
- The first use of a language downloads a few megabytes of training data to the browser cache. English and Russian ship from AttachKit's own servers; other languages fetch their training data from a public CDN. Either way it's reference data the engine reads — your scan is never part of any request.
- Picking the wrong language is the most common cause of the error "OCR didn't find any text. Either the PDF has no readable content or the wrong language was selected." Match the picker to the language the document is written in and run it again.
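The split between self-hosted and CDN-hosted training data can be pictured as a simple lookup. This is a sketch only: the Tesseract-style language codes ("eng", "rus") and the source labels are assumptions made for illustration, not AttachKit's real identifiers.

```typescript
// Where a language's training data would be fetched from.
// The labels are hypothetical; the article names no URLs.
type TrainingDataSource = "attachkit-origin" | "public-cdn";

// Per the article, English and Russian ship from AttachKit's own servers.
// Using Tesseract-style codes as keys is an assumption.
const SELF_HOSTED_LANGUAGES = new Set<string>(["eng", "rus"]);

function trainingDataSource(langCode: string): TrainingDataSource {
  return SELF_HOSTED_LANGUAGES.has(langCode) ? "attachkit-origin" : "public-cdn";
}
```

In both branches the request carries only the language identifier, which is consistent with the statement that the scan is never part of any request.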
How long it takes, and the page cap
OCR is the heaviest job AttachKit does — roughly 5 to 60 seconds per page depending on page size, scan quality, and your computer. While it runs you get a progress bar with the current page, an overall percentage, and a time estimate.
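A time estimate of this kind is usually derived from the average time of the pages finished so far. The sketch below shows one plausible way to compute the percentage and the remaining time; it is an assumption about the approach, and the function names are hypothetical.

```typescript
// Sketch: overall percentage for a progress bar, clamped to 0..100.
function percentDone(pagesDone: number, totalPages: number): number {
  if (totalPages <= 0) return 0;
  const pct = Math.round((pagesDone / totalPages) * 100);
  return Math.min(100, Math.max(0, pct));
}

// Sketch: remaining seconds, extrapolated from the average per-page time so far.
// Returns null until at least one page has finished, because there is no data yet.
function estimateRemainingSeconds(
  pagesDone: number,
  totalPages: number,
  elapsedSeconds: number
): number | null {
  if (pagesDone >= totalPages) return 0; // finished
  if (pagesDone <= 0) return null; // nothing to extrapolate from
  const secondsPerPage = elapsedSeconds / pagesDone;
  return Math.round(secondsPerPage * (totalPages - pagesDone));
}
```

At the quoted 5 to 60 seconds per page, a 40-page scan would take somewhere between about 3 and 40 minutes, which is why the estimate is worth recomputing after every page.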
One run is capped at 200 pages. A longer document shows: "This PDF has N pages — too many to OCR in one pass. Split it into smaller files with the Pages tool, then run each part here." Use Pages to split, then OCR the parts.
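To plan the split, it helps to work out the page ranges up front. The sketch below divides a document into consecutive ranges that each fit under the 200-page cap; the function and type names are hypothetical, and the ranges are meant to be entered into the Pages tool by hand.

```typescript
// The per-run page cap stated in the article.
const MAX_PAGES_PER_RUN = 200;

// A 1-based, inclusive page range.
interface PageRange {
  first: number;
  last: number;
}

// Sketch: split a document into consecutive ranges of at most `cap` pages.
function splitForOcr(totalPages: number, cap: number = MAX_PAGES_PER_RUN): PageRange[] {
  if (!Number.isInteger(cap) || cap < 1) {
    throw new RangeError("cap must be a positive integer");
  }
  const ranges: PageRange[] = [];
  for (let first = 1; first <= totalPages; first += cap) {
    ranges.push({ first, last: Math.min(first + cap - 1, totalPages) });
  }
  return ranges;
}
```

For a 450-page scan this yields pages 1 to 200, 201 to 400, and 401 to 450, so three OCR runs in total.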
You can click Cancel OCR at any time. Progress is checkpointed after every completed page to encrypted storage on your device, so a cancel — or a crashed tab — leaves a Resume from where you left off entry on the tool's start screen. Resuming skips the already-finished pages entirely.
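The resume behavior can be modeled as a checkpoint that lists the finished pages. The sketch below is a simplified illustration, not AttachKit's implementation: the `OcrCheckpoint` shape and both function names are assumptions, and the encryption and on-device storage steps are left out entirely.

```typescript
// Hypothetical checkpoint shape: which 1-based pages have finished OCR.
interface OcrCheckpoint {
  totalPages: number;
  completedPages: number[];
}

// Record a finished page, returning a new checkpoint (the input is not mutated).
function recordCompletedPage(cp: OcrCheckpoint, page: number): OcrCheckpoint {
  if (cp.completedPages.includes(page)) return cp;
  return { ...cp, completedPages: [...cp.completedPages, page] };
}

// On resume, only pages missing from the checkpoint still need OCR.
function pagesStillToDo(cp: OcrCheckpoint): number[] {
  const done = new Set(cp.completedPages);
  const todo: number[] = [];
  for (let page = 1; page <= cp.totalPages; page++) {
    if (!done.has(page)) todo.push(page);
  }
  return todo;
}
```

Saving after every page means a cancel or crash loses at most the page that was in progress.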
Where OCR shows up in other tools
- Unlock: when removing a password has to fall back to image output, the download message suggests running Make searchable to restore selectable text.
- Redact: once a scan has a text layer (or after Redact's own on-device OCR of image pages), text search and PII detection work on it like any other PDF.
- Fill: after the searchable copy downloads, a one-click Fill in this form next? handoff sends it straight to Fill without re-uploading.
Scans that won't OCR
- Password-protected scans: OCR can't read an encrypted PDF — you'll see "This PDF is encrypted" with the action button disabled. Remove the password first with Unlock; see Encrypted PDFs explained.
- Corrupt or non-PDF files: "This file isn't a valid PDF" means the bytes are damaged or the file was renamed to .pdf; re-export from the source.
- Very poor scans: blurry, skewed, or low-resolution images recognize badly in any OCR engine. Re-scanning at 300 DPI, straight and well-lit, makes a bigger difference than any setting.
Related
- How AttachKit handles your files — why OCR running locally matters for sensitive scans
- Encrypted PDFs explained — unlocking a protected scan before OCR
- The Make searchable how-to guide walks through the tool step by step