Secrets of the paperless office: optimizing OCR
Since I started using a document scanner about seven years ago, I’ve scanned many thousands of pages and used OCR (optical character recognition) software to convert those scans into searchable PDFs. I’ve also written extensively about the paperless office. But when you try to reduce the amount of paper you use, you inevitably increase the amount of hard-drive space you use. I began to wonder what combinations of scanner settings and software would get the best quality scan results while using the least hard-disk space.
What sparked my investigation was a claim that some OCR apps increase the file sizes of scanned images dramatically, whereas others (Acrobat Pro in particular) shrink them. When you plan to store and read scanned documents on an iOS device, compactness is especially important. Unfortunately, Adobe’s $499 Acrobat Pro XI can no longer be driven externally by AppleScript, which means it requires tedious manual clicking to perform OCR. Were other OCR apps really inflating file sizes, and was there any way around this problem without resorting to Acrobat?
Hundreds of experiments later, I came up with some surprising results. Read on for all the details or skip to the “So, where’s the sweet spot?” section for the bottom line.
The ins and outs of OCR
When you initially save a scanned document as a PDF file, you get nothing more than a bitmapped image in a PDF wrapper. Your scanner’s software most likely has settings to determine the resolution of the scans in dpi (dots per inch), the color mode (black and white, grayscale, or color), and the amount of compression applied to the scanned image. All those settings affect not only the appearance of the scan but also the quality of information the OCR engine has to work with. Once OCR software recognizes the text in a PDF, it saves that text in an invisible layer along with the image so you can see what the document originally looked like, but can also search, select, and copy its text.
Besides recognizing the text, OCR software may downsample the image (decrease its resolution, so that it takes up less space) or change the compression used. Sometimes these features are user-configurable; in other cases, they’re hardwired. Acrobat Pro has yet another option—a feature called ClearScan that replaces all the bitmapped text with a custom font (which takes up much less space), and then swaps out the original image for one with a much lower resolution. ClearScan nearly always results in the smallest possible PDF, but it may not be the best choice if you want to be sure your scanned image looks exactly like the original, even when printed. In addition, using ClearScan means settling for Acrobat’s OCR engine, about which I’ll say more in a moment.
I wanted to have some solid statistics to work with, so I scanned a couple of documents dozens of times each, with many combinations of resolution, color mode, and compression. Then I ran various raw scans through four different OCR engines: ABBYY’s $100 FineReader Express, Acrobat Pro X, Smile’s $100 PDFpenPro, and the version of ABBYY FineReader built into Devon Technologies’ DEVONthink Pro Office. The four engines I tested are a small subset of the OCR tools available on the Mac, but they’re among the most popular. I examined the results for file size, OCR accuracy, and image fidelity.
How OCR affects file size
Most desktop document scanners have an optical resolution of 600 dpi, but let you scan at a lower resolution if you prefer. For my tests, I used a Fujitsu ScanSnap iX500, which is 600 dpi natively but offers up to 1200 dpi through software interpolation. Discounting compression, doubling the number of dots per inch quadruples the file size, because the dot count doubles in both the horizontal and vertical directions. Scanning at higher resolutions can also take much longer. So the trick is to find the lowest resolution that will meet your needs.
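As a rough illustration of that quadrupling, here is a minimal Python sketch that computes the uncompressed size of a single scanned page. The US Letter page dimensions and the bits-per-pixel figures (1 for black and white, 8 for grayscale, 24 for color) are assumptions made for the example, not settings taken from any particular scanner. Real scans are compressed, so actual files will be far smaller than these raw numbers.

```python
# Assumed page size: US Letter, in inches (not stated in the article).
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0

# Assumed bits per pixel for each color mode, before any compression.
BITS_PER_PIXEL = {"black_and_white": 1, "grayscale": 8, "color": 24}


def raw_page_bytes(dpi: int, mode: str) -> int:
    """Return the uncompressed size, in bytes, of one scanned page."""
    pixels = round(PAGE_WIDTH_IN * dpi) * round(PAGE_HEIGHT_IN * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


if __name__ == "__main__":
    for mode in BITS_PER_PIXEL:
        for dpi in (150, 300, 600, 1200):
            size_mb = raw_page_bytes(dpi, mode) / 1_000_000
            print(f"{mode:>15} {dpi:>5} dpi: {size_mb:8.1f} MB")
```

Each step up in the printed table (150 to 300, 300 to 600, 600 to 1200 dpi) multiplies the raw size by four, whatever the color mode.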
Although many variables come into play, my results showed that for documents consisting mainly of black text on a white background, a 300-dpi grayscale scan can run anywhere from about 250KB to 1MB per page (depending on the level of compression) before applying OCR. It probably goes without saying that black-and-white images are smallest and color images largest, with grayscale in between. Likewise, increasing resolution always increases file size, while increasing compression decreases file size. (Files with the lightest compression tended to be about three to five times larger than those with the heaviest compression.) None of that is surprising, but what did surprise me was how OCR software changes the original sizes.
In every case, PDFpenPro did exactly what I expected, which was to increase the original file size only slightly. That is, it left the image alone and simply added the text. Acrobat Pro, given its default settings (that is, using neither ClearScan nor downsampling), behaved roughly the same as PDFpenPro with color and grayscale images; for the most part, it increased the sizes by a bit less than PDFpenPro did. But with black-and-white images, Acrobat Pro applied its own compression, which shrank the files, sometimes by as much as 90 percent.
On the other hand, FineReader Express, which also compressed the images again, produced entirely different results. Black-and-white images grew, sometimes profoundly—for example, a 77KB file became 343KB and a 2.7MB file ballooned to 13.2MB. With grayscale and color images, the results were inconsistent; some files grew while others shrank.
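To put those two black-and-white examples on the same scale, the short Python sketch below converts a before-and-after pair of sizes into a growth factor and a percent change. The function name is hypothetical, and the sketch assumes 1MB equals 1000KB for simplicity.

```python
def size_change(before_kb: float, after_kb: float) -> tuple[float, float]:
    """Return (growth factor, percent change) for a file whose size changed."""
    factor = after_kb / before_kb
    percent = (after_kb - before_kb) / before_kb * 100
    return factor, percent


if __name__ == "__main__":
    # Sizes quoted in the article, converted to KB (assuming 1MB = 1000KB).
    examples = {
        "77KB -> 343KB": (77, 343),
        "2.7MB -> 13.2MB": (2700, 13200),
    }
    for label, (before, after) in examples.items():
        factor, percent = size_change(before, after)
        print(f"{label}: {factor:.1f}x the original size ({percent:+.0f}%)")
```

Both files end up between four and five times their original size. By the same measure, a file that Acrobat Pro shrank by 90 percent would have a factor of 0.1.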
Although the stand-alone ABBYY FineReader Express doesn’t let you modify its settings for image recompression, the version built into DEVONthink Pro Office does let you enable downsampling to the resolution of your choice as well as set the level of compression used on graphics. So, with that version of FineReader I was able to get file sizes closer to what PDFpenPro and Acrobat Pro produced.
The settings that affect OCR accuracy
Depending on resolution, color mode, and compression, the one-page scanned letter I tested ranged in size from 77KB to 2.2MB before OCR. But if OCR accuracy suffers with smaller file sizes, that may not be a good tradeoff. So my next question was, which combination of settings and OCR engine produces the most accurate result?
To test accuracy, I opened a PDF in Preview, selected all the text, and copied it into a BBEdit document; then I used BBEdit’s Compare feature to highlight the differences between a given scan and a corrected model document. I counted errors as best I could; in many cases, such as when only spacing was different or when many words were run together, the number of errors was largely a matter of interpretation. Still, the overall trends were clear.
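For readers who would rather automate that comparison, here is a minimal sketch using Python’s standard difflib module. It is not the BBEdit process described above. It is a stand-in that assumes you have already extracted the OCR text and have a corrected reference text as plain strings. The function name and the counting rule (each differing region counts as the larger of its two word counts) are assumptions. Like my manual count, any such rule is a matter of interpretation. Because the texts are split on whitespace, differences in spacing alone are ignored.

```python
import difflib


def count_word_differences(ocr_text: str, reference_text: str) -> int:
    """Count words that differ between OCR output and a corrected reference."""
    ocr_words = ocr_text.split()
    ref_words = reference_text.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)

    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Assumed rule: a differing region counts as the larger of the
            # number of reference words and the number of OCR words in it.
            errors += max(i2 - i1, j2 - j1)
    return errors


if __name__ == "__main__":
    reference = "Thank you for your letter of March 3"
    ocr_output = "Thank you for yourletter of Mareh 3"
    print(count_word_differences(ocr_output, reference))
```

In this usage example, the run-together “yourletter” counts as two errors (two reference words were affected) and “Mareh” as one.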
Resolution: At the lowest resolution I tested (150 dpi for grayscale and color; 300 dpi for black and white), OCR errors were so numerous in all the tested engines that it would have been almost as efficient for me to retype the documents as to correct the errors. Accuracy generally improved with increased resolution, but not linearly. For example, whereas a 300-dpi scan was far more accurate than a 150-dpi scan, the difference between a 300- and 600-dpi scan’s accuracy was tiny.
Color mode: Black-and-white images yielded the worst OCR accuracy by far. Grayscale images were superior to black-and-white ones at every resolution, and 300-dpi grayscale scans yielded much better results than 600- or even 1200-dpi black-and-white scans. Color scans produced roughly the same accuracy, on average, as grayscale scans, except at very low resolutions (in which case, color scans were considerably worse than grayscale).
Compression: The amount of compression applied to the image had relatively little bearing on OCR accuracy, especially at 300 dpi and higher. What I did see at the highest levels of compression was more noise, in the form of fuzzy text and speckled line art. Even the noisiest scans were entirely legible, but I felt a medium level of compression was more pleasant to look at, with only a modest increase in file size.
Engines: Of the tools I tested, FineReader (in either stand-alone or embedded form) was far more accurate than either Acrobat Pro or PDFpenPro, and in most tests, Acrobat Pro was the least accurate. Even though Acrobat Pro was capable of producing the smallest files, I felt the amount of editing required on its output offset the value of the smaller file size.
So, where’s the sweet spot?
All these results boil down to the following: For the best compromise between file size and OCR accuracy, scan at 300 dpi in grayscale at medium compression, unless color plays an essential role in the original document, in which case switch to color but leave everything else the same. Avoid scanning in black and white, even if your documents are plain text on white paper.
Given the choice of OCR engines, avoid Acrobat Pro (especially version XI) despite its smaller file sizes. FineReader offers superior accuracy, an important consideration when you actually use your digitized documents. If you use an embedded version that lets you adjust compression and downsampling, like the one included in DEVONthink Pro Office, you will avoid problems with inconsistent file sizes. With any tool that lets you control the downsampling (remember, this happens after text recognition), adjust the settings to 150 dpi and go for about 50 percent compression quality.
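The recommendations above can be summarized as a small Python sketch. The class, field, and function names are hypothetical and do not correspond to settings in any scanner or OCR app. They simply record the values suggested in this section and show how much of the image survives the post-OCR downsampling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanProfile:
    """Hypothetical container for the settings recommended in this article."""
    scan_dpi: int
    color_mode: str
    scan_compression: str
    post_ocr_dpi: int
    post_ocr_quality_percent: int


def recommended_profile(color_is_essential: bool = False) -> ScanProfile:
    """Return the suggested settings; switch to color only when it matters."""
    return ScanProfile(
        scan_dpi=300,
        color_mode="color" if color_is_essential else "grayscale",
        scan_compression="medium",
        post_ocr_dpi=150,
        post_ocr_quality_percent=50,
    )


def pixel_fraction_after_downsampling(profile: ScanProfile) -> float:
    """Fraction of the scanned pixels kept once the image is downsampled."""
    return (profile.post_ocr_dpi / profile.scan_dpi) ** 2


if __name__ == "__main__":
    profile = recommended_profile()
    print(profile)
    print(f"Pixels kept after downsampling: {pixel_fraction_after_downsampling(profile):.0%}")
```

Because the downsampling happens after text recognition, the OCR engine still works from the full 300-dpi image; only the stored image drops to a quarter of the original pixel count.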