Document Imaging and Processing Work Together to Capture Enterprise Content

Document imaging and processing go together. Scanning, the first step of document imaging, merely records how much light each point on the page reflects. The scanner's software then processes this raw data, arranging and saving it in a standard graphic format such as JPEG.
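As a rough illustration of this arranging step, the sketch below turns a flat list of reflectance readings into an image file. It makes several assumptions that are not in the text above: the readings are fractions between 0.0 and 1.0, the function name `samples_to_pgm` is hypothetical, and the output is the plain-text PGM format rather than JPEG, because PGM is simple enough to write by hand. Real scanner software is far more involved.

```python
def samples_to_pgm(samples, width, max_value=255):
    """Arrange a flat list of reflectance readings (0.0 to 1.0) into plain PGM text.

    Hypothetical helper for illustration; PGM stands in for JPEG here.
    """
    if width <= 0 or len(samples) % width != 0:
        raise ValueError("sample count must be a whole multiple of width")
    height = len(samples) // width

    # Quantize each reading to an integer gray level, clamping stray values.
    levels = [min(max_value, max(0, round(s * max_value))) for s in samples]

    # Cut the flat list into rows of `width` pixels each.
    rows = [levels[r * width:(r + 1) * width] for r in range(height)]

    # Plain PGM: magic number "P2", then dimensions, then the maximum gray value.
    header = f"P2\n{width} {height}\n{max_value}\n"
    body = "\n".join(" ".join(str(v) for v in row) for row in rows)
    return header + body + "\n"


if __name__ == "__main__":
    # Four readings arranged as a 2x2 image.
    print(samples_to_pgm([0.0, 1.0, 0.2, 0.8], width=2))
```

The point is only that the raw readings carry no layout of their own; the software supplies the width, the gray scale and the file header that make them an image.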

The resulting image is often imperfect. It might show punch-hole marks, black borders, distorted characters and so on. Image cleanup software works on this less-than-perfect image and improves its quality. The result can be an image that is cleaner than the original paper document.

Image cleanup software can straighten skewed images, convert white text on a black background into black text on a white background, and more. Users can control the cleanup process so that the result comes out the way they want.
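One of these cleanup steps, turning white-on-black text into black-on-white, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not how any particular product works: the image is a list of rows of grayscale values from 0 (black) to 255 (white), the function name `normalize_polarity` is hypothetical, and the rule "mostly dark means inverted" is a simple heuristic. The `threshold` parameter stands in for the kind of control a user might have over the process.

```python
def normalize_polarity(pixels, threshold=128):
    """Return a copy of a grayscale image with dark text on a light background.

    `pixels` is a list of rows, each a list of ints from 0 (black) to 255 (white).
    Heuristic (an assumption): text covers a minority of the page, so if most
    pixels are dark, the background is dark and the image should be inverted.
    """
    flat = [p for row in pixels for p in row]
    dark = sum(1 for p in flat if p < threshold)

    if dark * 2 > len(flat):
        # Mostly dark: flip every gray level so the background becomes light.
        return [[255 - p for p in row] for row in pixels]

    # Already light background: return an untouched copy.
    return [row[:] for row in pixels]


if __name__ == "__main__":
    white_on_black = [
        [0, 0, 0],
        [0, 255, 0],
    ]
    print(normalize_polarity(white_on_black))
```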

Character Recognition Processing

Text in graphic format, while readable by humans, is not computer-readable. Only by making the text characters computer-readable can the document image be edited or indexed by the computer. Both capabilities are important.

Workflow processes usually require some editing of the original document image, such as adding comments or removing personal content, before the document is forwarded further in the workflow.

Indexing is essential for making the document searchable (and retrievable). We look at indexing in more detail in the next section.

Because both editing and indexing matter this much, the text characters must first be made computer-readable.

Character recognition programs, using optical character recognition (OCR) or intelligent character recognition (ICR), do the needed processing: they convert the graphic text characters into a machine-readable encoding such as ASCII.

Quality can be a problem in character recognition too. OCR might confuse closely similar characters and misinterpret what it sees. Sophisticated processing algorithms can reduce such errors. There are even programs that can recognize handwriting and convert it into machine-readable text characters.
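To make the confusion problem concrete, here is a minimal sketch of one simple kind of post-processing: when a field is known to hold only digits, letters that look like digits are swapped for the digits they resemble. The lookalike table and the function names are assumptions made for illustration; real recognition engines rely on much richer context.

```python
# Letters that are easily mistaken for digits (an illustrative, incomplete table).
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}


def fix_numeric_field(text):
    """Replace digit lookalikes in a field that is expected to hold only digits."""
    return "".join(DIGIT_LOOKALIKES.get(ch, ch) for ch in text)


def needs_review(text):
    """Flag a corrected field for human review if it still is not all digits."""
    return not fix_numeric_field(text).isdigit()


if __name__ == "__main__":
    for raw in ["1O2S", "l0B", "4A7"]:
        print(raw, fix_numeric_field(raw), needs_review(raw))
```

The correction is only safe because the field type is known in advance. Applied to ordinary text, the same substitutions would damage correct words, which is why a value that still fails the check is flagged for review rather than guessed at.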

Making the Documents Retrievable

Indexing programs then process the documents to link certain identifying words to each document. Once this is done, a document can be retrieved from among the millions of documents in the repository by searching on its linked words.
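The linking described here can be sketched as a lookup table from words to document identifiers. This is a minimal sketch, assuming an in-memory dictionary; the document identifiers, keywords and function names are invented for illustration and do not come from any particular product.

```python
from collections import defaultdict


def build_index(keywords_by_doc):
    """Map each keyword (lowercased) to the set of document ids linked to it."""
    index = defaultdict(set)
    for doc_id, keywords in keywords_by_doc.items():
        for keyword in keywords:
            index[keyword.lower()].add(doc_id)
    return index


def find(index, *keywords):
    """Return the ids of documents linked to every one of the given keywords."""
    if not keywords:
        return set()
    matches = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*matches)


if __name__ == "__main__":
    # Hypothetical repository contents.
    repository = {
        "inv-001": ["invoice", "Acme"],
        "inv-002": ["invoice", "Globex"],
        "hr-001": ["contract", "Acme"],
    }
    index = build_index(repository)
    print(find(index, "invoice", "acme"))
```

Retrieval consults only the table, never the documents themselves, which is what keeps a lookup quick even when the repository is very large.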

Indexing can cover all the words in the document content, or only a few distinctive tags that identify the nature of the content. Full-text indexing takes up a lot of storage space and is not preferred in a business content management context. Hence tag-based indexing is more typical in content management systems. To enable tag-based indexing, the document creators or others provide distinctive tags that describe the document content.
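The space difference between the two approaches can be illustrated by counting index entries. In the sketch below, the sample text, the tags and the function names are all made up for illustration, and counting (term, document) pairs is only a rough proxy for real storage cost.

```python
import re


def full_text_terms(text):
    """Return the distinct words in a document, as a full-text index would store."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def index_entries(terms_by_doc):
    """Count (term, document) pairs, a rough proxy for index size."""
    return sum(len(terms) for terms in terms_by_doc.values())


if __name__ == "__main__":
    # Hypothetical document text and the tags a person might assign to it.
    text = "Invoice 42 from Acme, due on receipt. Invoice total: 300"
    tags = {"invoice", "acme"}

    full_text = {"inv-042": full_text_terms(text)}
    tag_based = {"inv-042": tags}

    print("full-text entries:", index_entries(full_text))
    print("tag-based entries:", index_entries(tag_based))
```

Even for this one-line document the full-text index holds several times as many entries as the tag index, and the gap widens as documents get longer, since the number of tags per document stays small.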

Several kinds of document imaging processing thus work together to move paper documents into the enterprise content management system: creating digital images, converting image text into machine-readable characters, and indexing the documents so that they can be retrieved from among the millions of documents in the enterprise's content repositories.