DYK February 2012 - "Did You Know" the Three Most Dangerous Myths Regarding Document & Data Capture?

It dawned on me the other day that, in one way or another, I have been capturing data from myriad sources for almost 30 years. I've captured seismic data for global nuclear test monitoring and satellite and weapon system telemetry, and I've even captured, stored and retrieved ground test data from International Space Station components. I've also dabbled in the capture of video and audio intercepts and surveillance data. You would think that, by comparison, capturing paper and digital content in a business environment would seem, well, rather mundane. It isn't.

In this DYK, I will present CDI's Top Three Misconceptions about Capture. This is similar to one of David Letterman's Top Ten lists; however, the consequences of ignoring these misconceptions can mean the difference between a successful capture system and one that is virtually useless and sits idle. While certainly not as complex as designing the replacement for the Space Shuttle, designing and implementing a successful capture system is, in most instances, an extremely customized venture based on the business processes, available personnel, budget, document collection characteristics and a host of other issues. Consequently, very few capture solutions are the same, and each has its own set of challenges. That is why my list is in no particular order: any item could have more or less of an impact depending on the individual organization.

So, let's take a look at some of the misconceptions:

1. "A scanner is a scanner; the cheaper, the better." Scanners come in so many different varieties that you really should do some research before you buy. Consider the method of attachment. Most scanners now connect via Universal Serial Bus (USB) and work with current computers. Be careful, though, as some scanners use a SCSI interface and require a special card or adapter to work. Some scanners can also be connected over a wired or wireless network, but this typically adds to the cost. A BIG misconception is that the scanner comes with the necessary software when you buy it. With few exceptions, scanners come with software known as a 'driver'. A scanner driver basically translates the communication between the scanner and the computer. While a driver set may allow some rudimentary testing of the scanner, it does NOT act as a capture application, which means you have the hardware configured but no software to make the scanner do what you need it to do. On the other hand, some scanners come bundled with basic scanning software that is limited in functionality. For most folks, this will suffice. Another primary issue is resolution. Resolution describes the optical detail of a scanned image and is specified in dots per inch (dpi). For plain text, you can get away with 200 dpi, but 300 dpi is better. If you are looking to scan pictures or photographs, you should consider at least 600 dpi and perhaps up to 1200 dpi. The lower the resolution, the grainier the image looks, and the graininess becomes more apparent as you enlarge the digital image.

Other major scanner features can be addressed by asking some of these questions:

Do I need a scanner with an automatic feeder or can I get along with just a flatbed? Do I need both?

Do I need duplex scanning capability (scanning both sides of a page)?

What is the minimum and maximum page size that the scanner can accommodate? Do these dimensions refer to the feeder or the flatbed?

Do I need to scan in color?

What maintenance, warranty and support options are available to me?
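To put the resolution figures from the first myth in perspective, here is a minimal Python sketch that converts page size and dpi into pixel dimensions and an approximate uncompressed image size. The function names, the US Letter page size (8.5 x 11 inches) and the bit depths (1 bit per pixel for black and white, 24 for color) are illustrative assumptions, not figures from any particular scanner. Real files are normally compressed and will be smaller.

```python
def scan_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_megabytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw image size in megabytes (10^6 bytes), before compression."""
    width_px, height_px = scan_dimensions(width_in, height_in, dpi)
    return width_px * height_px * bits_per_pixel / 8 / 1_000_000


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches.
    for dpi in (200, 300, 600, 1200):
        w, h = scan_dimensions(8.5, 11, dpi)
        bw = uncompressed_megabytes(8.5, 11, dpi, bits_per_pixel=1)
        color = uncompressed_megabytes(8.5, 11, dpi, bits_per_pixel=24)
        print(f"{dpi:>5} dpi: {w} x {h} px, "
              f"about {bw:.1f} MB black and white, {color:.1f} MB color (raw)")
```

Doubling the dpi quadruples the number of pixels, which is worth keeping in mind when deciding whether you really need color or the highest resolution for every page.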

2. "If you get a really fast scanner, your process will go much quicker." Many folks believe a fast scanner will solve their paper problems in a short time. In case you haven't caught on to my sense of irony yet, for all but the simplest capture jobs, actual scanning is the least time- and resource-intensive task in the overall capture process. Other aspects of capture are much more time-consuming. Document preparation is generally the most labor- and time-intensive part of scanning documents. Let's do some calculations. Suppose you have one scanner that does 60 pages per minute. OK, all things being equal, you could assume that in an 8-hour shift you could process 28,800 pages (60 pages x 60 minutes x 8 hours). Pretty impressive, right? Now let's say that the documents are stapled and average about 15 pages each. Removing each staple will take a typical operator about 20 seconds. So now, each 15-page document no longer takes 15 seconds to scan; it takes 35 seconds. You have just cut your daily production by more than half (to roughly 12,300 pages) with one staple. And you haven't addressed how you will separate the documents or index them, or put them back together, if necessary. The point here is that the true aggregate speed of the capture process depends more on factors other than the speed of the scanner. Remember, scanners jam, pages occasionally need to be rescanned and a scanner is only as fast as you can keep it fed!
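The arithmetic in this myth can be sketched in a few lines of Python. This is only a rough model under stated assumptions: a single operator does the preparation and the scanning one after the other, and the function name and parameters are made up for illustration. It ignores jams, rescans, separation and indexing, all of which would lower the numbers further.

```python
def pages_per_shift(scanner_ppm, shift_hours, pages_per_doc=1, prep_seconds_per_doc=0.0):
    """Estimate pages captured in one shift.

    Assumes one operator who prepares each document (for example, removes
    a staple) and then scans it, so prep time and scan time simply add up.
    """
    shift_seconds = shift_hours * 3600
    scan_seconds_per_doc = pages_per_doc * 60 / scanner_ppm
    total_seconds_per_doc = scan_seconds_per_doc + prep_seconds_per_doc
    docs_per_shift = shift_seconds / total_seconds_per_doc
    return docs_per_shift * pages_per_doc


if __name__ == "__main__":
    ideal = pages_per_shift(scanner_ppm=60, shift_hours=8)
    stapled = pages_per_shift(scanner_ppm=60, shift_hours=8,
                              pages_per_doc=15, prep_seconds_per_doc=20)
    print(f"Scanner speed alone:     {ideal:,.0f} pages per shift")
    print(f"With 20 s of prep/doc:   {stapled:,.0f} pages per shift")
    print(f"Share of ideal capacity: {stapled / ideal:.0%}")
```

Try doubling `scanner_ppm` to 120 while keeping the 20 seconds of preparation: the daily total rises far less than the scanner speed does, because the staple, not the scanner, is the bottleneck.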

3. "If you scan it, you can find it." I hear this quite a bit from folks: "We want to scan our documents so that we can find them out on our shared network drive." While the premise is basically sound, in practice it is akin to putting your documents into a black hole. This is particularly true as the volume of the document collection increases. Keep in mind that image files (TIF, JPG, etc.) have no external context or data to indicate what the file is. So the only potential clues you have at your disposal are the file name and the folder hierarchy that contains it. Even files with textual content (MS Office, PDFs, etc.) can be hard to find. Many organizations want to replicate the structure of their paper file cabinets. This is understandable since it reflects the most familiar method that the users have had to get to their documents. However, if a document is misfiled, it may never be found except by accident. The next step up is to convert what you can to text so that the text can be searched. This is known as Full Text Search (FTS). Searching the full text to find a document has multiple problems associated with it. First, it is a SLOW process. And the more content you have, the slower it is. The main problem with FTS, though, lies in the nature of the organization's business processes and the documents it uses in its operation. Many of the documents that an individual organization uses contain similar or identical textual content. As an example, a bank employee searching for a document containing the word Smith will likely get hundreds or even thousands of search results. In addition, if the content was scanned and converted to text, there will be an inherent error rate in the spelling of the words in the document. In summary, I consider file naming and FTS to be insufficient for effective retrieval of your content. After all, when a user searches for a document, the best result is to retrieve just the document or documents that correspond to the search criteria. Doing this requires indexing each of your documents with meaningful metadata so that the search results are culled down considerably. While this can be an overwhelming thought, there are many tricks and tools you can use to make the job much easier.
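To make the difference concrete, here is a small Python sketch that contrasts a naive full text search with a search on index fields. Everything in it is hypothetical: the file names, the sample text, the metadata fields (doc_type, customer_id, last_name) and the two search functions are invented for illustration and do not represent any particular capture product.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    path: str                      # file name on the shared drive
    text: str                      # text recovered from the scan (may contain OCR errors)
    metadata: dict = field(default_factory=dict)  # index fields keyed at capture time


def full_text_search(docs, term):
    """Return every document whose text contains the term (case-insensitive)."""
    term = term.lower()
    return [d for d in docs if term in d.text.lower()]


def metadata_search(docs, **criteria):
    """Return documents whose index fields match every given criterion exactly."""
    return [d for d in docs
            if all(d.metadata.get(key) == value for key, value in criteria.items())]


docs = [
    Document("scan0001.tif", "Loan application for John Smith",
             {"doc_type": "loan_application", "customer_id": "C-1001", "last_name": "Smith"}),
    Document("scan0002.tif", "Account statement, Mary Smith",
             {"doc_type": "statement", "customer_id": "C-2002", "last_name": "Smith"}),
    Document("scan0003.tif", "Signature card for R. Jones, witnessed by teller A. Smith",
             {"doc_type": "signature_card", "customer_id": "C-3003", "last_name": "Jones"}),
    # Simulated OCR error: "Smith" was read as "Sm1th".
    Document("scan0004.tif", "Loan application for Jane Sm1th",
             {"doc_type": "loan_application", "customer_id": "C-4004", "last_name": "Smith"}),
]

if __name__ == "__main__":
    hits = full_text_search(docs, "Smith")
    print("Full text 'Smith':", [d.path for d in hits])

    hits = metadata_search(docs, doc_type="loan_application", customer_id="C-1001")
    print("Indexed search:   ", [d.path for d in hits])
```

In this toy collection the full text search returns three documents, one of which only mentions a teller named Smith, and it misses the document with the OCR error. The indexed search returns exactly one document. The trade-off is that someone, or some tool, has to populate those index fields at capture time.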

Stay tuned! We will address best practices for tagging your content in our July DYK article. If you can't wait or have other questions, feel free to contact me to discuss.

Bill LaPorte - Director of Operations, at 540-659-5157 or 540-842-0358 (cell)


431 Nursery Road Suite A300 Spring, Texas 77380 United States (281) 292-1333
