Computers were supposed to make paperless offices possible, but most of us are surrounded by more pages than ever. It’s possible to break free from paper’s grip, however.
I’ve developed a system that has allowed me to eliminate all but a slim handful of papers from my formerly bulging filing cabinets. I scan my documents and convert them to PDF files that include both high-resolution images of the originals (for reference and later printing) and—using optical character recognition (OCR) software—digitized copies of document text (so I can search, select, or copy it).
As a result, I have more free space on my desk and in my file cabinets, and almost every piece of paper in my erstwhile files—old tax forms, letters from my bank, and more, as well as new stuff that still comes in—is now searchable, easy to back up, and accessible from just about anywhere.
Get the right hardware
The first key is a good scanner. Fujitsu’s ScanSnap is my tool of choice (and the one for which I’ll provide specific instructions below). It’s speedy (it can scan up to 18 pages per minute), it can scan both sides of the page at once, it automatically detects page size and type (color or black and white), and it includes a copy of Adobe Acrobat Standard.
While the ScanSnap is my favorite, you can find less-expensive sheet-fed scanners. Just be sure the model you choose has an automatic document feeder, bundled OCR software, a USB 2.0 interface, and a high page-per-minute rating.
As for hard-drive space, don’t worry: With the settings recommended below, expect scanned PDFs to occupy about 250KB per page (or side) for black-and-white documents and about 500KB for color. In other words, you should be able to store up to 4,000 black-and-white pages, or roughly 2,000 color pages, in 1GB of disk space.
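To check those storage figures against your own pile of paper, here is a minimal Python sketch. It uses the per-page sizes quoted above and assumes 1GB means 1,000,000KB (decimal units); the function names are made up for illustration.

```python
# Per-page sizes quoted in the article, in kilobytes.
KB_PER_PAGE = {"black_and_white": 250, "color": 500}

# Assumption: 1GB is treated as 1,000,000KB (decimal units).
KB_PER_GB = 1_000_000


def pages_per_gb(kind: str) -> int:
    """Return how many scanned pages (or sides) of the given kind fit in 1GB."""
    return KB_PER_GB // KB_PER_PAGE[kind]


def disk_space_gb(bw_pages: int, color_pages: int) -> float:
    """Estimate the disk space, in GB, needed for a mix of page types."""
    total_kb = (bw_pages * KB_PER_PAGE["black_and_white"]
                + color_pages * KB_PER_PAGE["color"])
    return total_kb / KB_PER_GB


if __name__ == "__main__":
    print(pages_per_gb("black_and_white"))  # 4000
    print(pages_per_gb("color"))            # 2000
    print(disk_space_gb(3000, 500))         # 1.0
```

Actual file sizes depend on the content of each page, so treat the result as a ballpark figure.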
Get the right software
You’ll also need two pieces of software. First, your scanner should come with some basic software, including OS X drivers and an application that lets you configure resolution, color bit depth, file type, default folders, and other settings; if you don’t have that, check the vendor’s Web site.
You’ll also need an application that can perform OCR on your scanned documents and then combine the text with the original image of the page in a PDF file. The best candidates are Acrobat 8 Professional, Acrobat Standard (version 7 should work), DevonThink Pro Office ($150), or ReadIris Pro 9.
Most of the free or less-expensive versions of these programs won’t work. Adobe Reader won’t work, and neither will DevonThink or DevonThink Pro (the less-expensive versions of DevonThink Pro Office). Bundled versions of ReadIris should be OK. Your scanner may have come with all the software you need; the ScanSnap, for example, includes a copy of Acrobat Standard, and many other scanners come with ReadIris Pro. If your scanner comes with driver software but not a suitable OCR program, I recommend DevonThink Pro Office. It produces fast, accurate OCR results, imports scanned files automatically with minimal configuration, and provides a convenient interface for storing and searching scanned PDFs. Because Adobe Acrobat is so common, I’ll provide specific instructions for it below.
Configure the software
You then need to tweak the software to get the best results. Again, consult your scanner’s documentation to find out how to change these settings.
Resolution. I’ve found that 300 dpi yields the best trade-off between quality and convenience. In ScanSnap Manager, go to the Scanning tab and choose Better (Faster) from the Image Quality pop-up menu to get 300-dpi scans.
File Type. Since documents ultimately end up as PDFs in this system, I save scans as PDFs from the outset. In ScanSnap Manager, go to the File Option tab and choose PDF (*.pdf) from the File Format pop-up menu. If PDF isn’t available in your scanning software, choose TIFF.
File Location. Most scanning software lets you put scanned images wherever you want. I save them to a subfolder named Scans in my Pictures folder, so I can easily find scanned documents later. In ScanSnap Manager, go to the Save tab, click on the Browse button, locate the folder you want to use, and click on Choose.
Other Settings. Turn on automatic duplex scanning and color detection: In ScanSnap Manager, go to the Scanning tab, choose Auto Color Detection from the Color Mode pop-up menu and Duplex Scan (Double-Sided) from the Scanning Side pop-up menu. If you can, tell the software to remove blank pages from scans, adjust crooked images, and automatically rotate images that are upside down or in landscape mode. In ScanSnap Manager, go to the Scanning tab, click on Option, and select the Allow Automatic Blank Page Removal, Correct Skewed Character Strings Automatically, and Allow Automatic Image Rotation options.
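If you use a scanner other than the ScanSnap, it may help to keep the recommended settings in one place as a checklist. The Python sketch below is only an illustration: `ScanSettings` and its field names are hypothetical and do not correspond to any scanner software’s real configuration format. The `pixel_size` helper shows what a given resolution means in pixels for a page of a given size.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class ScanSettings:
    """Hypothetical checklist of the settings recommended in this article."""
    dpi: int = 300
    file_format: str = "PDF"  # choose "TIFF" if PDF is not available
    folder: Path = Path.home() / "Pictures" / "Scans"
    auto_color: bool = True
    duplex: bool = True
    remove_blank_pages: bool = True
    deskew: bool = True
    auto_rotate: bool = True

    def pixel_size(self, width_in: float, height_in: float) -> Tuple[int, int]:
        """Pixel dimensions of one page side scanned at this resolution."""
        return round(width_in * self.dpi), round(height_in * self.dpi)


if __name__ == "__main__":
    settings = ScanSettings()
    # A US Letter page (8.5 x 11 inches) at 300 dpi.
    print(settings.pixel_size(8.5, 11))  # (2550, 3300)
```

Doubling the resolution to 600 dpi would quadruple the pixel count per page, which is one reason 300 dpi is a reasonable middle ground.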
Automate your scans
Next, configure your scanning software so it will send scans directly to your OCR program, save the document with the new text layer, and close the file. Here’s how to do that in Acrobat; you can follow similar procedures for whichever software you’re using.
Start by configuring Acrobat’s OCR settings. Open a PDF file and choose Document: Recognize Text Using OCR: Start. In the dialog box that appears, click on Edit, make sure English (US) is selected in the Primary OCR Language pop-up menu, and choose Searchable Image (Exact) from the PDF Output Style pop-up menu.
Next, set up an AppleScript Folder Action to automate the processing of new scans. You can download AppleScripts for Acrobat and ReadIris Pro. (DevonThink Pro Office takes care of this process automatically. ScanSnap Manager can automatically open scanned files in any of these programs, but can’t automate the rest of the process.)
Once you’ve downloaded the script you need, save it in /Library/Scripts/Folder Action Scripts; I call mine OCR This. Control- or right-click on the folder you’ve designated to hold new scans and, from the contextual menu, choose Enable Folder Actions. Control- or right-click again and choose Attach A Folder Action. In the window that appears, navigate to your new AppleScript file, select it, and click on Choose.
Now try scanning a new document. If everything works correctly, Acrobat should open shortly after your scan finishes, recognize the text in the document, and close the document window when it’s finished.
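To see the idea behind a Folder Action without AppleScript, here is a rough Python sketch. It is not the downloadable script and does not use Folder Actions; it simply polls a folder and passes each new PDF to a function you supply. The `process` argument is a hypothetical hook where you would hand the file to your OCR program. Nothing here assumes any particular program’s interface.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional, Set


def find_new_scans(folder: Path, seen: Set[Path]) -> List[Path]:
    """Return PDFs in the folder that have not been handled yet."""
    pdfs = [p for p in folder.glob("*.pdf") if p not in seen]
    # Oldest first, with the name as a tiebreaker.
    return sorted(pdfs, key=lambda p: (p.stat().st_mtime, p.name))


def watch_folder(folder: Path,
                 process: Callable[[Path], None],
                 interval: float = 5.0,
                 max_polls: Optional[int] = None) -> None:
    """Poll the folder and call process() once for each new PDF.

    process is a hypothetical hook: plug in whatever starts OCR on your system.
    max_polls limits the number of checks; None means run until interrupted.
    """
    seen: Set[Path] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for pdf in find_new_scans(folder, seen):
            process(pdf)
            seen.add(pdf)
        polls += 1
        time.sleep(interval)


if __name__ == "__main__":
    scans = Path.home() / "Pictures" / "Scans"
    if scans.is_dir():
        # Stand-in for the OCR step: just report each new file, checking three times.
        watch_folder(scans, lambda p: print("New scan:", p.name), max_polls=3)
```

A polling loop like this is cruder than a Folder Action, which the system triggers for you when files arrive, but it shows the same sequence: notice a new scan, process it, and move on.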
Set up a workflow
With the technology in place, you can start scanning. But if you have thousands of pages to digitize, you’re looking at a long process. Before you begin, figure out a good strategy.
Narrow the Field. Consider culling documents you don’t really need, rather than scanning everything blindly.
Play the Name Game. Give each document a descriptive label as soon as possible, while the contents are still fresh in your memory.
Devise a Filing System. Make sure you know where you’re going to store each document—and move it there immediately after scanning.
Get Rid of the Paper. After scanning your documents, decide which ones you need to shred, which you can recycle, and which you must keep safe (legal documents such as birth certificates and notarized contracts).
Don’t Overdo It. If you’ve got zillions of papers, don’t try to scan them all at once. Set a goal of, say, 50 sheets a day. Once you’re caught up, get in the habit of regularly scanning and then disposing of (or filing) all new documents.
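The naming and filing steps above are easy to script if you prefer. The following Python sketch shows one possible approach; the date-sender-subject naming pattern, the folder layout, and both function names are assumptions for illustration, not a scheme the article prescribes.

```python
import shutil
from datetime import date
from pathlib import Path


def descriptive_name(doc_date: date, sender: str, subject: str) -> str:
    """Build a sortable, descriptive file name from the document's details."""
    return f"{doc_date.isoformat()} {sender} - {subject}.pdf"


def file_scan(scan: Path, archive: Path, category: str, new_name: str) -> Path:
    """Move a scan into archive/category under its new name; return the new path."""
    target_dir = archive / category
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_name
    if target.exists():
        # Refuse to overwrite an earlier scan that has the same name.
        raise FileExistsError(f"{target} already exists")
    return Path(shutil.move(str(scan), str(target)))


if __name__ == "__main__":
    # Example values only; substitute the details of your own document.
    print(descriptive_name(date(2007, 3, 15), "Bank", "Statement"))
    # 2007-03-15 Bank - Statement.pdf
```

Starting the name with the date in year-month-day order keeps files in chronological order when a folder is sorted by name.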
Joe Kissell is the senior editor of TidBITS and the author of Real World Mac Maintenance and Backups (Peachpit Press, 2006).
[UPDATE: I’ve once again updated the AppleScripts I mentioned in this article, which now support Acrobat Standard and Pro up through version X, as well as Smile Software’s PDFpen and PDFpenPro. Make sure to read the Read Me file included. It includes detailed information about installing and using the scripts.]