Archiving old magazines

LINE ART AND OCR

People want to scan and archive old documents, like magazines. Unless you can use line art mode, this job is a really difficult problem, because having both text and pictures on the same page causes conflicting goals.

• We want high quality crisp text. Text to be printed is scanned best in line art mode at 300 dpi. Then the text prints great, but the pictures are awful.

• We want good image quality. Pictures need to be scanned in color mode at perhaps 150 dpi (for this type of purpose). Moiré problems have a solution; see descreen for magazines, Chapter 12.

• If we scan full page documents in color mode for the pictures, the text is not very crisp, and the file size is huge. One page can be several megabytes, and the file size becomes unacceptable.

• Color mode is not best for text, and JPG compression is even worse for text. JPG compression causes artifacts around the text (dark smudges around the characters), and it can be pretty bad.

• Don't forget that copyright is a big issue. We do not own rights to distribute this material. It only came with rights allowing our personal use.

What can we do? This is a really tough problem, so don't expect any miracles. We have both text and pictures on the same page, a conflict in purpose. Any scan mode is a major compromise.

You must first carefully define your goal and purpose. Do you need to retain the pictures, in color? Or just the text, allowing line art mode? Will you want to print these pages? Printing requires higher resolution, and a full page will be a huge file. If you only want to view the pages on the computer screen, 100 dpi may be large enough, with a much smaller file, but 100 dpi won't print well.

If your goal allows scanning in line art mode for only the text (ignoring any pictures, which will come out badly), the file size can be manageable, under 200 KB per page at 300 dpi line art. The text will print great, and this is the only good answer. Multi-page line art is the subject of most of this section.

But magazines have pictures in them, and if you want to retain those pictures in color, you must scan the full size page in color mode. That creates a HUGE image, maybe 6 MB per page even at only 150 dpi color. Grayscale mode is only one-third the size of color mode, but multiply that size by the number of pages you have, and you still have a big problem.

JPG files can greatly compress those huge images, but even at only 150 dpi, JPG is probably at least 450 KB per full size page (assuming 8% size for 8x10.5 inches color, which is small enough to hurt the text quality). Don't ruin your work with a too-low JPG Quality setting (page 134). Experiment a lot before starting a large project, to know the effects of the choices before committing to a plan.
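The sizes quoted above follow from simple arithmetic, sketched here in Python. The function name and the use of plain bytes are my own choices for illustration, and the 8% JPG ratio is only the assumption stated in the text, not a property of JPG itself.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size of an uncompressed scan.

    Ignores file headers and per-row padding, so real files differ slightly.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# One 8 x 10.5 inch magazine page, as in the text.
color_150 = uncompressed_bytes(8, 10.5, 150, 24)    # 24-bit RGB color
gray_150 = uncompressed_bytes(8, 10.5, 150, 8)      # 8-bit grayscale
lineart_300 = uncompressed_bytes(8, 10.5, 300, 1)   # 1-bit line art

JPG_RATIO = 0.08  # assumption from the text: JPG at 8% of uncompressed size
jpg_150 = round(color_150 * JPG_RATIO)

for label, size in [
    ("150 dpi color", color_150),
    ("150 dpi grayscale", gray_150),
    ("300 dpi line art", lineart_300),
    ("150 dpi color JPG at 8%", jpg_150),
]:
    print(f"{label:26s} {size:>10,d} bytes")
```

The color figure is the "maybe 6 MB" page, grayscale is exactly one-third of it, and the JPG estimate lands near 450 KB. The line art figure is before compression; lossless compression is what brings it under 200 KB.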

Use the scanner's Descreen filter (Chapter 12) to scan magazine pages with pictures. Better scanners can show their stuff here: about 30 seconds for a Microtek 8700 to scan 8x10 inches of 300 dpi color WITH descreen (15 without). Lowering the White Point (Chapter 20) makes the paper white. For documents, I set it near the bottom of the large peak of the white background. Raising the Black Point makes the text darker, but don't overdo it; remember the pictures in the same image. Extra sharpening afterward helps the text in this mode, and you can exclude a picture by selecting the rectangular picture, inverting the selection, and then sharpening.

JPG is not a multi-page format, so file name organization will be awkward. But you could import those JPG images into say Word or Acrobat, so the result is one multi-page file in page order.

In MS Word, use menu Insert - Picture - From File to insert each JPG image at the cursor (you can scan images directly into Word there, but it won't be a JPG file). Then use menu Format - Picture to size it (and the properties Float Over Text and Wrapping if you need to move it around on the page). Set Word's page margins small to hold the large images, and scan slightly inside the original page margins (to be smaller). Your printer cannot print to the edge of its paper. Don't size the image larger than the printed page. See page 76 about scaling images in documents.

CHAPTER 10

Adobe Acrobat (the full version, not the free reader) creates the familiar multi-page PDF files, seen often as manuals on CD. PDF is a proprietary file format, but popular, and everyone has the free reader to view them. You need to know that those PDF manuals we see are NOT scanned pages. The normal and optimum PDF situation uses the original page layout program to format and output the original text source document as PDF. These text characters are vastly smaller than scanned full page images. File sizes like that are NOT possible if scanning. For example, the site has the 1040 tax form online. Two pages are in a 34 KB PDF file. Scanning only the first page in Acrobat at 300 dpi line art creates a 114 KB PDF file. One page of line art image is 3 times larger than two pages as text. Scanning that same one page in 150 dpi color mode creates a 4 MB PDF file.

Acrobat can directly scan pages into PDF files at menu File - Import - Scan, but pages scanned in grayscale or color can be huge files (line art can be manageable). The Acrobat Capture OCR option can reduce the size. OCR is time consuming, but text is always smaller than images, and text is searchable. Acrobat requires at least 200 dpi for OCR, and 300 dpi results are much better, so much better that the resulting file can be much smaller than at 200 dpi, which is counter-intuitive. Scanning the same one page of Newsweek magazine (8x10.5 inches) to a PDF file gave these numbers:

Mode       Resolution   Scanned size   After OCR
Color      100 dpi      2 MB           OCR not applicable
Color      200 dpi      7.8 MB         4 MB
Color      300 dpi      17 MB          1.6 MB
Line art   300 dpi      190 KB         92 KB

100 dpi text may be slightly small on a larger monitor. Acrobat is a page layout program (page 78), so its Actual Size menu (100% size) does not show actual image size in pixels like photo programs do. It tries to show inches of paper on the screen (an impossible job) by simply assuming all screens are 72 dpi size. That may not be very close, and if your screen computes say 92 dpi size (Appendix A), then 92/72 = 128% size will show the scanned image at actual size on your screen in Acrobat.
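The 92/72 adjustment above can be written out as a small calculation. This is only a sketch of the arithmetic in the text: the function names are hypothetical, and the 72 dpi figure is the assumption the text attributes to Acrobat.

```python
ACROBAT_ASSUMED_DPI = 72  # what the text says Acrobat assumes for every screen


def screen_dpi(pixels_across, inches_across):
    """Real pixels per inch of a monitor (hypothetical helper).

    pixels_across is the horizontal screen resolution, inches_across is
    the measured width of the visible picture area.
    """
    return pixels_across / inches_across


def zoom_for_true_paper_size(actual_screen_dpi):
    """Acrobat zoom percentage that shows a page at its real paper size."""
    return 100 * actual_screen_dpi / ACROBAT_ASSUMED_DPI


# The example from the text: a screen that works out to 92 dpi.
print(f"{zoom_for_true_paper_size(92):.0f}%")
```

On a screen that really is 72 dpi the function returns 100, which is the only case where Acrobat's Actual Size setting is right without adjustment.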

So the "easy" way (with a file size that might actually be usable) is to scan at 300 dpi in line art mode. Each page is maybe 100 - 200 KB, which is feasible if no pictures are required. But if you scan in color to retain the pictures, the act of scanning one page is no big deal. However in practice, this quickly becomes a problem, because even though it may be compressed some, the full size color page will be a few megabytes per page in the PDF file. The total file size will quickly become astronomical. But one CD-R disk can hold about 650 MB, maybe 100 pages even if 6 MB each, so it may work for you anyway.
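To plan a project, it can help to turn the per-page sizes into a page count per disk. A minimal sketch follows, assuming 1 MB = 1000 KB for round numbers and ignoring file system overhead; the function name is hypothetical.

```python
CD_R_KB = 650 * 1000  # about 650 MB per CD-R, as stated in the text


def pages_per_disc(page_kb, disc_kb=CD_R_KB):
    """Whole pages of the given size that fit on one disk."""
    return disc_kb // page_kb


print("6 MB color pages:    ", pages_per_disc(6000))
print("200 KB line art pages:", pages_per_disc(200))
print("100 KB line art pages:", pages_per_disc(100))
```

The color case comes out just over 100 pages, matching the estimate above, while the same disk holds thousands of line art pages.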

The "best" way (perhaps a fantasy) to retain the color pictures would be to scan the pages twice, as line art and color scans. Use 300 dpi line art for the text, and perhaps even OCR to convert it to editable (and searchable) text characters. Either way is relatively small, and the text is usually the largest page area. Also scan the picture areas at maybe 100 or 150 dpi color, using descreen for moiré. Then reassemble the pages, selecting the best parts from each of the text and picture portions, in a page layout program, for example MS Word. This recombination would be the best, as far as size and clarity. However, this is of course a major job.

It's history now, but there was an excellent program that pretty much did all of this automatically (less the descreen), and it was a dream, effortlessly producing incredibly small and clear document files. It was ScanSoft Pagis Pro (but it does not work in Windows XP, and there appear to be no plans to upgrade it). Its proprietary XIF file format was designed specifically to do this. Pagis Pro will scan and save most other file types too, including standard multi-page TIF G3 or G4, but it seems to be a dead end street today. It's a real shame; it was the only solution both easy and good.

Multi-page file formats

A multi-page file has all pages of one document in one file, in order, easy to store, catalog, email, print, fax, etc. Photo file formats and photo editor programs are not suitable for multi-page. To create multi-page files for documents, you must use document software instead of photo software.

ScanSoft Pagis Pro and PaperPort are examples of document programs. Both of these have proprietary file formats (XIF and MAX) for multi-page documents of any type (line art, grayscale, color).

The universal standard multi-page file format for line art text documents is 300 dpi in TIF files using compression protocols developed for sending fax over the phone line, called G3 and G4 for CCITT Group 3 (regular analog fax) or Group 4 (digital ISDN fax). G3 size includes CRC error checking bytes in the file data, which otherwise is part of G4 digital transmission protocols. G3 is compressed either one-dimensionally, called 1D (one row of pixels), or two-dimensionally, called 2D (multiple rows of pixels). G4 files can be up to 60% smaller than G3 1D files. These are only for line art, and are multi-page file formats if used in document programs. A few photo programs may provide one of these compression formats in line art mode, but likely only as single page files.

Other than TIF, PDF and XIF, there are:

PaperPort uses a proprietary multi-page format called MAX, difficult to exchange with others, but a free reader program is available. PaperPort has the very important menu Edit - Preferences - Advanced to determine how MAX files are created and stored. The Maximum Quality setting uses lossless compression (G4 if line art) for very small and clear document files. The Smaller File setting uses JPG compression (with a JPG Quality factor). Either way, the result is stored in the MAX multi-page format. Sometimes we want JPG, but users may not realize that their tiny MAX photo files are using JPG compression. You can reset it to the Maximum Quality setting for new scans. PaperPort can Export to other standard file formats, but it can only save scans into its own MAX file format.

OmniPage Pro is an OCR program of course, but it also has an option to save multi-page TIF image files, including G3 and G4. The file can only be saved from its menu File - Save Image. That menu has a Single File or Separate Files option. OmniPage Pro 11 will also read XIF files, a good way to recover Pagis XIF files today in Windows XP.

TextBridge Pro can save multi-page image files too, only as uncompressed multi-page TIF files, or XIF files. XIF can be vastly smaller for mixed content, but compatibility with other programs is an issue. Before you scan, select menu Tools - Scan Only (Defer OCR). Then scan or open all pages, and the regular Save As then offers TIF or XIF file format.

A free multi-page solution is/was the program at Windows 9x menu Start - Programs - Accessories - Imaging (Windows XP does NOT include this now, but previous versions did). Scan with menu File - Scan New, in Line Art mode. Append additional pages with menu Page - Append - Scan Page. See its Help menu for "append". Set the file compression method to G3 or G4 with menu Tools - Scan Options - Custom - Settings. You can simply use menu File - Print to print the multi-page document to your fax driver (saving to a file is not necessary just to scan and send a fax). Note this is NOT a photo program, it's a document program, and photo programs generally cannot open multi-page files. Document programs like Pagis Pro and PaperPort can open its files, but if you send this multi-page TIF file to someone, make sure first that they will have some means to open it. Since XP no longer includes this program, long range plans are seriously in question now.

Reducing the storage size of a text document

Storing scanned documents for posterity suffers from size problems, and you can fill your disk. That's just the nature of the beast. But there are factors that can minimize the problem.

Use line art for text or line drawings if at all possible. Line art is the best quality and the smallest file for this. Line art is 1 bit, and 8 times smaller than 8 bit grayscale, and 24 times smaller than 24 bit RGB color. A 300 dpi line art TIF file prints great, and is a smaller file than 150 dpi color JPG.

Pictures with the text are of course unacceptable as line art, but this may not always be important.

Don't overlook compression for line art; it compresses really well. G3 or G4 compression is smallest and is the standard in document programs for line art TIF files. LZW TIF is more commonly found in photo programs, and LZW is also very effective for line art mode TIF files, much more so than for color images. All of these are lossless compression, no artifacts.

Reduce the scan resolution to the necessary value. Twice the resolution creates four times the image bytes. 300 dpi is normally plenty for line art (200 dpi line art is fax quality). If you only want to view it on the screen, 100 dpi may be enough size. 150 dpi is plenty to print color or grayscale on plain paper. Experiment to know what you are getting. If you may need to print it, then test printing too.

Reduce the area being scanned if it is not necessary to scan the entire page. Crop it as small as is useful, especially if you can omit areas of fine detail (like text). Or for line art, simply blanking out unwanted areas greatly enhances compression too, about as effective as omitting it.

OCR - text characters are always much smaller than any image, but lower accuracy and the much greater effort (including proofing) are major factors.

For graphics, if color or grayscale is necessary, use Indexed color with only 8 or 16 colors if possible. JPG must be 8 bit grayscale or 24 bit color (not small). The JPG artifacts are especially bad on the edges of graphics and text (page 134). If the image is graphics, it is often possible to reduce to 16 or fewer colors using indexed color mode. This normally improves the graphics image (color purity, small, clear, no JPG artifacts) since graphics normally contain only a few solid colors anyway.

For converting to reduced color, an Adaptive or Optimized palette with Nearest Color (no dithering) is usually best (page 130). This reduction in color count produces the smallest files, because there are fewer bits per pixel (16 colors is 4 bit color), and data compression will be more effective too. TIF, GIF, PNG file types can be used with indexed color.
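The link between palette size and bits per pixel can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name is hypothetical, and a given file format may round the stored depth up to the next size it supports.

```python
def bits_needed(color_count):
    """Minimum bits per pixel needed to index a palette of color_count entries."""
    if color_count < 2:
        raise ValueError("a palette needs at least 2 colors")
    return (color_count - 1).bit_length()


for colors in (2, 8, 16, 256):
    bits = bits_needed(colors)
    # Size relative to 24-bit RGB color, before any data compression.
    print(f"{colors:3d} colors: {bits} bits per pixel, "
          f"{24 / bits:.0f}x smaller than 24 bit color")
```

So a 16 color image starts out six times smaller than 24 bit color before compression is even applied, and the large areas of identical pixel values then compress well.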

PNG file format usually compresses a little smaller than GIF or TIF LZW in any mode. It's also lossless, and most programs offer PNG (page 128).

Various file sizes of a representative document are shown below, a line art image written by Pagis Pro with different compressions (with a few exceptions when Pagis Pro does not support the option).

The document is one page, the first page of an 8 1/2 x 11 inch 1040 US tax form scanned at 300 dpi line art, 2539x3294 pixels. I also made a copy using this same page, but with half of the original 8 1/2 x 11 inch page made blank, to show that compression of solid color areas (white or black) is extremely effective, almost infinite. So, you may want to blank out any unwanted material; it will compress better (also true for sending fax) and may look better too. After scanning, simply mark that area with the mouse, and hit Delete. Or, using Edit - Cut to move it to the clipboard will also do it (to be ignored and forgotten there).

Format, 300 dpi line art          Full Page of Data   Half Data / Half Blank
TIF (no compression)              1050 KB             1050 KB
BMP RLE (Run Length Encoded)      526 KB              280 KB
TIF PackBits (Mac format)         245 KB              130 KB
TIF LZW (using Paint Shop Pro)    190 KB              111 KB
GIF                               180 KB              89 KB
TIF Huffman                       140 KB              85 KB
TIF G3 1D                         144 KB              74 KB
PNG                               121 KB              61 KB
TIF G3 2D                         119 KB              61 KB
PDF (Acrobat proprietary)         114 KB              57 KB
XIF 3.0 (Pagis Pro proprietary)   88 KB               45 KB
MAX (PaperPort proprietary)       81 KB               41 KB
TIF G4                            81 KB               36 KB
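The effect of blanking half the page can be reproduced with a rough experiment. This sketch does NOT use G3 or G4; it substitutes Python's standard zlib (Deflate, the same family PNG uses) because it needs no extra software, and it fills the "data" half with random noise, a worst case that real text will beat easily. Only the page dimensions come from the text; the rest is assumption for illustration.

```python
import random
import zlib

WIDTH_PX, HEIGHT_PX = 2539, 3294   # the 300 dpi page size given in the text
ROW_BYTES = (WIDTH_PX + 7) // 8    # 1-bit rows padded to whole bytes


def synthetic_page(data_rows, seed=1):
    """Build a 1-bit page: noisy 'data' rows on top, blank rows below."""
    rng = random.Random(seed)
    noisy = b"".join(
        rng.getrandbits(ROW_BYTES * 8).to_bytes(ROW_BYTES, "big")
        for _ in range(data_rows)
    )
    blank = bytes(ROW_BYTES * (HEIGHT_PX - data_rows))  # all zero bits
    return noisy + blank


full_page = synthetic_page(HEIGHT_PX)          # data on every row
half_blank = synthetic_page(HEIGHT_PX // 2)    # bottom half blanked out

raw_size = len(full_page)
full_size = len(zlib.compress(full_page))
half_size = len(zlib.compress(half_blank))
blank_only = len(zlib.compress(bytes(raw_size // 2)))

print(f"uncompressed page:        {raw_size:>9,d} bytes")
print(f"compressed, full of data: {full_size:>9,d} bytes")
print(f"compressed, half blank:   {half_size:>9,d} bytes")
print(f"blank half by itself:     {blank_only:>9,d} bytes")
```

The uncompressed size agrees with the roughly 1050 KB of the uncompressed TIF, and the blank half adds almost nothing to the compressed file, which is why the half-blank column is close to half of the full-page column for every compressed format.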
