How Does the OCR Text Extractor Work?
Max Intel's OCR Text Extractor combines three recognition engines — PaddleOCR, Tesseract, and QR/barcode detection — to extract text from images with confidence scoring and automatic language detection. It processes images entirely in-browser using ONNX Runtime WebAssembly inference, with zero server uploads.
Smart Preprocessing for Better Accuracy
Studies in the document-recognition literature (including work presented in the Document Recognition and Retrieval conference series) report that preprocessing can improve OCR accuracy by 15–40% on degraded documents.
Raw images often produce poor OCR results due to uneven lighting, low contrast, or small text. When preprocessing is enabled (the default), Max Intel applies three techniques before OCR: grayscale conversion with perceptual weighting; Otsu binarization, an automatic thresholding algorithm published by Nobuyuki Otsu in IEEE Transactions on Systems, Man, and Cybernetics (1979) that optimally separates text from background; and upscaling for images below 1500px. These steps can dramatically improve accuracy on screenshots, photos of documents, and low-resolution scans.
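The first two preprocessing steps can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's actual code: it assumes BT.601 perceptual weights for the grayscale conversion and implements Otsu's method as published, picking the threshold that maximizes between-class variance of the intensity histogram.

```typescript
// Convert RGBA pixel data to grayscale using perceptual (BT.601) weights.
function toGrayscale(rgba: Uint8ClampedArray): Uint8Array {
  const gray = new Uint8Array(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[i * 4], g = rgba[i * 4 + 1], b = rgba[i * 4 + 2];
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}

// Otsu's method: choose the threshold that maximizes the
// between-class variance of background vs. foreground pixels.
function otsuThreshold(gray: Uint8Array): number {
  const hist = new Array(256).fill(0);
  for (const v of gray) hist[v]++;
  const total = gray.length;
  let sumAll = 0;
  for (let t = 0; t < 256; t++) sumAll += t * hist[t];

  let sumBg = 0, weightBg = 0, bestVar = -1, bestT = 0;
  for (let t = 0; t < 256; t++) {
    weightBg += hist[t];
    if (weightBg === 0) continue;
    const weightFg = total - weightBg;
    if (weightFg === 0) break;
    sumBg += t * hist[t];
    const meanBg = sumBg / weightBg;
    const meanFg = (sumAll - sumBg) / weightFg;
    const betweenVar = weightBg * weightFg * (meanBg - meanFg) ** 2;
    if (betweenVar > bestVar) { bestVar = betweenVar; bestT = t; }
  }
  return bestT;
}

// Binarize: pixels above the computed threshold become white, the rest black.
function binarize(gray: Uint8Array): Uint8Array {
  const t = otsuThreshold(gray);
  return gray.map(v => (v > t ? 255 : 0));
}
```

The key property of Otsu's algorithm is that it needs no tuning: the threshold adapts automatically to each image's histogram, which is why it works across screenshots, scans, and photos alike.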
Format-Preserving Text Extraction
Unlike basic OCR tools that dump text as a single block, Max Intel analyzes the bounding boxes of each line and word to preserve the original document structure. Paragraph breaks are detected based on vertical spacing between lines. Indentation is preserved based on horizontal offset from the left margin. Column spacing in tabular content is maintained using word gap analysis. The result is extracted text that retains the visual structure of the original document.
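The bounding-box analysis above can be illustrated with a short sketch. The interface, function names, and thresholds here are assumptions for illustration (not the tool's actual schema or values): a vertical gap noticeably larger than one line height is treated as a paragraph break, and horizontal offset from the leftmost margin is converted into leading spaces.

```typescript
// Hypothetical OCR line result: text plus its bounding box.
interface OcrLine {
  text: string;
  x: number;      // left edge of the line's bounding box (px)
  y: number;      // top edge (px)
  height: number; // box height, used as the spacing baseline
}

function assembleText(lines: OcrLine[]): string {
  const sorted = [...lines].sort((a, b) => a.y - b.y);
  const leftMargin = Math.min(...sorted.map(l => l.x));
  const out: string[] = [];
  let prev: OcrLine | null = null;

  for (const line of sorted) {
    if (prev) {
      const gap = line.y - (prev.y + prev.height);
      // Gap much larger than a line height => insert a paragraph break.
      if (gap > prev.height * 0.8) out.push("");
    }
    // Offset from the left margin => indentation (~10px per space, illustrative).
    const indent = " ".repeat(Math.round((line.x - leftMargin) / 10));
    out.push(indent + line.text);
    prev = line;
  }
  return out.join("\n");
}
```

Relative thresholds (a fraction of the previous line's height rather than a fixed pixel count) keep the heuristic stable across different image resolutions.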
OSINT Use Cases
OSINT investigators frequently need to extract text from screenshots of social media posts, photos of physical documents, scanned court records, leaked PDFs, and archived web pages saved as images. This tool converts all of those into searchable, copy-pasteable text. The LLM-ready JSON export is particularly useful — it packages extracted text with metadata (confidence scores, language, timestamps) into a structured format that AI assistants can analyze for entity extraction, timeline reconstruction, or pattern identification.
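To make the export format concrete, here is one plausible shape for the LLM-ready JSON. The field names are assumptions based on the metadata listed above (confidence, language, timestamp), not the tool's exact schema.

```typescript
// Assumed export shape; field names are illustrative.
interface OcrExport {
  source: string;          // original filename
  extractedAt: string;     // ISO-8601 timestamp
  language: string;        // detected language code
  meanConfidence: number;  // 0..1 averaged across recognized words
  text: string;            // structure-preserving extracted text
}

function buildExport(source: string, text: string,
                     language: string, wordConfidences: number[]): string {
  const meanConfidence = wordConfidences.length
    ? wordConfidences.reduce((a, b) => a + b, 0) / wordConfidences.length
    : 0;
  const payload: OcrExport = {
    source,
    extractedAt: new Date().toISOString(),
    language,
    meanConfidence,
    text,
  };
  return JSON.stringify(payload, null, 2);
}
```

Packaging confidence alongside the text lets a downstream AI assistant weight or discard low-quality extractions instead of treating all OCR output as equally reliable.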
PDF Processing
For PDF files, the tool uses a hybrid approach powered by PDF.js. It first attempts to extract native text content from each page. If a page contains little or no native text (indicating it's a scanned image), it automatically renders the page at 2.5x resolution and runs OCR on the rendered image. This means the tool handles both native PDFs and scanned PDFs without any user configuration.
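The per-page decision can be sketched as a pure function over a PDF.js-style text-content object (`{ items: [{ str }] }`, the shape returned by `page.getTextContent()`). The 50-character cutoff is an assumed heuristic, not the tool's actual threshold.

```typescript
// Shape of PDF.js text content, reduced to the fields used here.
interface TextItem { str: string }
interface TextContent { items: TextItem[] }

const MIN_NATIVE_CHARS = 50; // below this, treat the page as a scanned image

// Decide whether a page has too little native text and needs OCR.
function needsOcr(content: TextContent): boolean {
  const nativeChars = content.items
    .map(i => i.str.trim())
    .join("").length;
  return nativeChars < MIN_NATIVE_CHARS;
}

// With real PDF.js, the hybrid flow is roughly:
//   const page = await doc.getPage(n);
//   if (needsOcr(await page.getTextContent())) {
//     const viewport = page.getViewport({ scale: 2.5 }); // 2.5x render for OCR
//     await page.render({ canvasContext: ctx, viewport }).promise;
//     // ...run OCR on the rendered canvas pixels...
//   }
```

Rendering at 2.5x before OCR matters because PDF pages are vector content: rasterizing at display resolution often leaves body text too small for reliable recognition.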
After extracting text, use the ZIP ↔ JSON Converter to package multiple extracted documents for AI analysis. The Document Search tool can help find the original sources of leaked documents online. For broader investigation workflows, the Dork Generator can locate exposed files and PDFs across the web.