Converting a printed page into accurate and editable text has long been one of the great challenges facing computer users. Advances have been made over the years, but the current professional optical character recognition (OCR) applications for the Mac — Abbyy’s FineReader Pro 5 and ScanSoft’s OmniPage Pro X — demonstrate that turning paper into pixels remains an imperfect process.
Both of these programs handle the basics: they accommodate any TWAIN-compliant scanner in Mac OS 9.2 and earlier, they let you fine-tune the OCR process by designating areas of text, graphics, and tables in a document, and they let you save files in a variety of formats. And on the surface, FineReader and OmniPage are very similar. Both programs present a main window containing two panes: an image of the original scan on the left and the software-recognized text on the right. Each program imposes zones (boxes that denote a page’s text, tables, and graphic elements) on the scanned image. The software can render these zones automatically, or you can use the included tools to map out the zones yourself.
FineReader provides greater access to each step in the OCR process than OmniPage does. For example, FineReader includes an Analyze Layout button, a feature missing from OmniPage. When you click on this button, FineReader quickly analyzes the document and maps out the zones. This is helpful if you want to change inappropriately recognized zones before letting the software scan the text. OmniPage’s Perform OCR button creates zones and recognizes text in one step. Once those processes are complete, you can then change zone types and, if necessary, run the OCR process again.
The two programs have similar export options, allowing you to save recognized text and pictures in Rich Text Format (RTF), AppleWorks, Microsoft Word and Excel, PDF, and HTML. Although both programs allow you to “scan” files from standard formats such as TIFF, JPEG, and PICT, only OmniPage can import PDF files. OmniPage is also the only OS X-native option of the two, but because few scanners work with OS X, this is not currently a tremendous advantage.
Of the two programs, OmniPage has the better spelling-checker interface. Its Proofread OCR window lets you move easily through the spelling-checker process, thanks to Ignore and Change buttons activated with the return key. On the other hand, FineReader requires you to click on every button with the mouse and open and close a new window whenever you wish to add a new word to FineReader’s dictionary.
This is an important difference, considering that you’ll spend a lot of time using the spelling checker. If our tests are any indication, you’ll spend more time in FineReader’s spelling checker than in OmniPage’s, as FineReader questioned more words than OmniPage and, of those flagged words, got more of them wrong.
Head to Head
To test the automatic-recognition capabilities of the programs, we scanned two documents with a multifunction HP OfficeJet G85 in black-and-white at 300 dpi, settings designed to produce a clean scan (see “Midrange Flatbed Scanners,” April 2002, for more on getting a good scan). The first document was a simple press release that contained a logo and a large block of text with some italic and boldface words. The second was a Macworld page comprising small type, multiple columns, a large graphic, and a table. We set up the two programs to automatically create zones and recognize the pages. We ran them through the spelling checker and then exported the pages as RTF (opened and read in Word) and PDF files.
For documents that have simple layouts, FineReader does an acceptable job. The program correctly identified the press release’s logo as a graphic, put all the text except the contact information at the bottom in a single text block, and correctly identified the bold and italic text in the exported RTF and PDF files (though it underlined the bold headline).
To get similar results in OmniPage, we had to tell the program that the page contained a single column (this is done via a pop-up menu located in the toolbar). Although the italic text appeared in the Word document, the only bold type that appeared in the RTF file was in the headline, and spaces between paragraphs were exaggerated. OmniPage incorrectly identified the logo as text, but it did correctly recognize that text. In the PDF file, OmniPage again failed to produce the bold formatting in the body of the press release and, in places where the bold formatting should have appeared, often dropped words below the baseline of the surrounding text.
With the more complex Macworld page, OmniPage outperformed FineReader. After we selected the Mixed Pages option in the toolbar, OmniPage created numerous text blocks. The exported OmniPage RTF file closely matched the layout of the original page, though it contained some odd line breaks and spaces and suffered from the same baseline problems. Even after we told FineReader that the page had multiple columns, it lumped a couple of parallel columns together in a single zone, causing the exported text in the RTF file to be jumbled.
PDF files created with the two programs were cleaner. Although FineReader correctly identified the graphic and included it in the PDF file, its overly broad selection of text blocks caused it to leave out lines where the zones bordered each other at the end of some paragraphs in the exported file. Both programs properly produced the magazine page’s table, but where FineReader included a mixed set of font sizes, OmniPage produced more-consistent text formatting.
We were able to rectify many of these problems by manually adjusting the zones before having the programs analyze the documents. Both programs let you create rectangular and polygonal zones, and they let you combine adjacent zones. Converting the text zone around the press release’s logo to a graphic zone allowed OmniPage to correctly export the graphic. Likewise, drawing more zones on the magazine page improved the formatting of the RTF and PDF files exported from FineReader. But the fact that you have to resort to the manual tools calls into question the usefulness of the programs’ batch-processing capabilities and support for scanners with sheet-feeder attachments — features designed to let you scan and convert multiple documents with little user interference.
Macworld’s Buying Advice
Neither OmniPage Pro X nor FineReader Pro 5 is perfect. But OmniPage — with its power to more easily recognize complex documents automatically, ability to import PDF files, and cleaner spelling-checker interface — is closer to the mark, particularly if you plan on scanning a lot of documents that have complex layouts.