Geek Speak Decoded: The Acronym Series, Part 2
Part 2: Image Formats, and why all PDFs are not the same
I’ll warn you: this will be less of a list you can make flashcards from and more of a discussion about the qualities of the formats. Once you get a scanner, there are a lot of options for how you scan documents and how you store them. One of the choices you will have to make is the format. The format determines how your computer arranges all the 0s and 1s that make up your images. Image files usually take up much more storage space than documents, but if you need to store images, there are ways to compress them so they don’t fill up your hard drive. Documents in their native format can still be “frozen in time” with the use of version control and a database-driven repository. Choosing a format is a bigger decision than simply knowing what the file extensions mean, and you have to start by understanding what will work from a compliance or business process perspective.
One concept that I want to introduce is the idea of “lossless” and “lossy” image formats. A lossless format means that the computer throws nothing away when it compresses your data: every single bit that was originally in the file is still there after the file is uncompressed. Lossy refers to compression in which some of the data from the original file is permanently discarded to make the file smaller. For the majority of business and archival purposes, we want to rely on lossless technology for images.
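To make the distinction concrete, here is a minimal sketch in Python. It uses the standard library's zlib as a stand-in for a lossless image codec, and a simple rounding step as a stand-in for a lossy one. The sample data and the step size of 16 are made up for illustration.

```python
import zlib

# Lossless: compress, decompress, and get back exactly what went in.
original = b"Invoice 1001: total due $250.00\n" * 50
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(compressed) < len(original))  # True: the file got smaller
print(restored == original)             # True: every byte survived

# Lossy (illustrative only): round each pixel value down to a multiple of 16.
# The result packs better, but the exact original values are gone for good.
pixels = [12, 13, 200, 201, 255]
rounded = [p // 16 * 16 for p in pixels]
print(rounded)  # [0, 0, 192, 192, 240]
```

Notice that 12 and 13 both became 0. Once that happens, nothing in the file can tell you which one was which.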
Image Formats and File extensions
LOSSLESS COMPRESSION OPTIONS
TIFF (or TIF): Tagged Image File Format
TIFF is unique: out of the box, and without compression, it’s often used for photography. For our purposes, we are talking about Group IV, LZW, or JPEG compressed TIFFs. Note that Group IV and LZW are lossless, while JPEG compression inside a TIFF is lossy, just like a standalone JPEG.
The type of compression used depends on whether the image is bitonal (black and white) or color: Group IV for black and white images, and LZW or JPEG compression for color.
TIFF CCITT Group IV is based on fax standards and has been around since the late 80s, when scanner manufacturers were pushing to get away from proprietary formats. TIFF images lend themselves to repurposing easily. It’s an open format, which means that it works well with a wide variety of text extraction tools. The Group IV compression itself was initially developed as a black and white fax technology.
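The rule of thumb for picking a TIFF compression can be written down as a few lines of Python. This is only a sketch of the decision described above. The function name and the `allow_lossy` flag are hypothetical, not part of any scanner software, and the sketch assumes you only reach for JPEG (which is lossy even inside a TIFF) when you have decided a smaller color file is worth it.

```python
def choose_tiff_compression(is_bitonal: bool, allow_lossy: bool = False) -> str:
    """Return a compression name for a scanned page (hypothetical helper)."""
    if is_bitonal:
        # Black and white pages: Group IV, the fax-derived lossless scheme.
        return "Group IV"
    # Color pages: LZW keeps every bit; JPEG trades detail for size.
    return "JPEG" if allow_lossy else "LZW"


print(choose_tiff_compression(is_bitonal=True))                     # Group IV
print(choose_tiff_compression(is_bitonal=False))                    # LZW
print(choose_tiff_compression(is_bitonal=False, allow_lossy=True))  # JPEG
```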
PNG: Portable Network Graphics
PNG is a raster graphic file format that supports lossless data compression and was created as an improved, non-patented replacement for GIF. Often, if you hear about “web ready” images, people are referring to PNG images, because they display in any browser and compress screen-resolution graphics well. Non-proprietary formats like TIFF can be converted on the fly to PNG images, which enables better response time for mobile users.
BMP: Bitmap
BMP is a raw image format, generally uncompressed and too large for functional business use. Because nothing is thrown away, a bitmap is lossless by default, and you can convert BMPs to the other compressed formats.
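To see why uncompressed images are too large for everyday use, a little arithmetic helps. The sketch below assumes a US letter page scanned at 300 dpi (both numbers are my example, not a recommendation) and counts only the raw pixel data, ignoring the small file header and row padding a real BMP adds.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Raw pixel data size for a scanned page, with no compression at all."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


color = uncompressed_bytes(8.5, 11, 300, 24)   # 24-bit color
bitonal = uncompressed_bytes(8.5, 11, 300, 1)  # 1 bit per pixel, black and white

print(color)    # 25245000 bytes, roughly 24 MB for a single page
print(bitonal)  # 1051875 bytes, roughly 1 MB before any compression
```

Multiply the color figure by a few thousand pages and the case for compression makes itself.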
Other lossless options exist (like RAW), but because they originate as an unprocessed image from the camera or scanner itself, these images are rarely if ever used for document capture.
LOSSY COMPRESSION OPTIONS
GIF: Graphics Interchange Format
GIF was introduced in 1987 for palette-based images (a predefined set of up to 256 colors) and is now largely superseded by PNG. GIF is a little different because the compression itself (LZW) is lossless, but squeezing a full-color image into the predefined palette throws color information away, so it’s still considered lossy by graphics folks.
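Here is a small sketch of what “predefined colors” costs you. The four-color palette and the sample pixels are invented for illustration (a real GIF palette can hold up to 256 colors), and the nearest-color rule is one simple way to do the matching, not necessarily what any particular tool uses.

```python
# A tiny made-up palette: black, white, red, blue (as red/green/blue values).
PALETTE = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 0, 255)]


def nearest_palette_index(color, palette=PALETTE):
    """Pick the palette entry closest to the given color."""
    def distance(i):
        return sum((a - b) ** 2 for a, b in zip(color, palette[i]))
    return min(range(len(palette)), key=distance)


pixels = [(250, 10, 10), (200, 0, 0), (5, 5, 5)]
indices = [nearest_palette_index(p) for p in pixels]
shown = [PALETTE[i] for i in indices]

print(indices)  # [2, 2, 0]
print(shown)    # [(255, 0, 0), (255, 0, 0), (0, 0, 0)]
```

Two different shades of red went in and one red came out. Storing the list of palette numbers afterwards can be perfectly lossless, but the original shades are already gone.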
JPEG (or JPG): Joint Photographic Experts Group
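With compression ratios that commonly run around 10:1 before the quality loss becomes noticeable, and far higher if you can tolerate visible artifacts, this lossy compression format is still the favorite. Don’t let the “lossy” part concern you too much, though. Part of the savings works like the difference between saying “make a red dot, make a red dot, make a red dot. . .” 200 times and saying “make 200 red dots.” That shorthand by itself loses nothing; the lossy part is that JPEG first smooths away fine detail the eye is unlikely to notice, so that far more of the image can be described in shorthand. The trade-off is that the discarded detail cannot be recovered, and it compounds each time the file is re-saved.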
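The “make 200 red dots” shorthand is commonly called run-length encoding. The sketch below is a minimal version of the idea in Python, not JPEG itself. On its own this step loses nothing, which is worth keeping in mind: in a lossy format, the loss comes from simplifying the image before a packing step like this one.

```python
def rle_encode(values):
    """Collapse repeated neighbors into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def rle_decode(runs):
    """Expand (value, count) pairs back into the full sequence."""
    values = []
    for value, count in runs:
        values.extend([value] * count)
    return values


dots = ["red"] * 200 + ["blue"] * 3
encoded = rle_encode(dots)

print(encoded)                      # [('red', 200), ('blue', 3)]
print(rle_decode(encoded) == dots)  # True: nothing was lost
```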
BUT WHAT ABOUT PDF?!?!?!? – THE ‘NOT-QUITE’ FORMAT
PDF: Portable Document Format
Many people are surprised to find out that PDF is not actually an image format. It’s more of a container that holds an image. Inside the PDF, each image is stored as a separate object that the pages reference. While a PDF looks like one file to you, it’s actually a collection of objects (pages, fonts, images) that the viewer assembles on screen. The binary data for the images lives inside the PDF, in whatever compression the creating tool chose. Your PDF tools matter, because different PDF creation tools may store the same image in very different ways. This is one argument for using PDF/A, below:
PDF/A: Portable Document Format specialized for use in archiving. It is an ISO-standardized subset of PDF designed so that a document looks the same years from now. Ordinary PDF did not meet that requirement because PDF documents can contain elements which are not reliably rendered: their appearance can change based on the viewer, host operating system, or state of the PDF itself.
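To illustrate the “container” idea, here is a rough sketch that looks inside a PDF for image objects and reports how each one is compressed. The sample bytes are a hand-written toy that only imitates the shape of a PDF (it is not a valid file), and the helper name is hypothetical. The filter names are the ones PDF uses: CCITTFaxDecode for fax-style compression such as Group IV, and DCTDecode for JPEG. Real PDFs often pack their objects in ways this simple text search cannot see, so treat it as a picture of the structure rather than a tool.

```python
import re

# Toy stand-in for a two-image PDF; stream contents are placeholders.
SAMPLE_PDF = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Page >>\nendobj\n"
    b"2 0 obj\n<< /Type /XObject /Subtype /Image /Width 2550 /Height 3300"
    b" /Filter /CCITTFaxDecode >>\nstream\n...\nendstream\nendobj\n"
    b"3 0 obj\n<< /Type /XObject /Subtype /Image /Width 2550 /Height 3300"
    b" /Filter /DCTDecode >>\nstream\n...\nendstream\nendobj\n"
)

DICT_RE = re.compile(rb"<<(.*?)>>", re.DOTALL)


def image_filters(pdf_bytes: bytes) -> list:
    """List the compression filter of each image object found (hypothetical helper)."""
    filters = []
    for match in DICT_RE.finditer(pdf_bytes):
        body = match.group(1)
        if re.search(rb"/Subtype\s*/Image\b", body):
            found = re.search(rb"/Filter\s*/(\w+)", body)
            filters.append(found.group(1).decode("ascii") if found else "none")
    return filters


print(image_filters(SAMPLE_PDF))  # ['CCITTFaxDecode', 'DCTDecode']
```

Same page size, same container, two very different ways of storing the picture. That is the difference your choice of PDF tool can make.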
Stay tuned for the next installment as we uncover the terminology around automated image processing: barcodes, OCR, and regex.