Wednesday, October 3, 2012

OCR PDF

OCR PDF: a valuable feature scanned documents can’t live without

We are used to opening, sending and creating PDF files at home and at work. The PDF format is a convenient tool: it makes our documents look good, protects them from editing (if needed, and as long as your addressee does not have special software for editing PDF files) and offers a compromise between the document size and its visual quality, or readability.

There are several ways of creating PDF files, and the easiest one would be to simply publish the text I am writing right now in MS Word as a PDF document. In this case I automatically get a fully indexed PDF file, which allows me to copy any text from the document and paste it elsewhere as text fragments, without harming the integrity of the original PDF document. However, there are other ways of creating PDF files, and the most widespread of them is actually scanning to PDF, or creating a PDF document from an image.

I think most of us understand that a document image saved in a file format like .jpeg, .tiff or .png has no text content for us to work with. That is why, when creating a PDF file from such image files, the OCR PDF function has to be performed; to be more precise, the text of the document image has to be recognized and placed into the PDF file as a separate layer, which gives us the copy-paste functionality described earlier. What is more, indexed PDF files are effective for corporate document storage and search: you can not only find a document by its name or sort documents by type and date, but also find the documents whose text actually matches your search query. Of course, when you have 20 documents a day to create as PDF files, this is not an issue for you, but when the number of PDF documents created and used later runs to thousands or hundreds of thousands, this is where you will really feel the value of such small features and details.
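To make the search benefit more tangible, here is a minimal sketch in Python of how finding documents by their text content can work once the text is available. It is only an illustration and not the way any particular product does it: the file names, the sample texts and the helper functions tokenize, build_index and search are all hypothetical, and the sketch assumes the text of each PDF has already been extracted from its OCR layer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of file names whose text contains it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted file names whose text contains every query word."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical archive: file name -> text taken from the PDF's OCR layer.
# A scan saved without OCR has no text layer, so its text is empty.
documents = {
    "scan_0001.pdf": "Invoice No. 4471 for office paper, payable within 30 days",
    "scan_0002.pdf": "Supply contract between the customer and the paper mill",
    "scan_0003.pdf": "",
}

index = build_index(documents)
print(search(index, "paper"))          # ['scan_0001.pdf', 'scan_0002.pdf']
print(search(index, "invoice paper"))  # ['scan_0001.pdf']
print(search(index, "contract"))       # ['scan_0002.pdf']
```

Note that scan_0003.pdf never shows up in any result: without a text layer there is nothing to index, so the only way to find such a file is by its name or date.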

However, in my experience of working with PDF files I have seen many cases where ordinary users save documents as PDF incorrectly. They simply push the “Scan to PDF” button on a scanner, or skip the settings in the scanning software, and in most cases they do not turn on the OCR function at all! This is sad and ineffective, as later the scanned document will have to be converted into a PDF with an OCRed text layer and, frankly speaking, such repeated conversions are not good for the image quality and affect the recognition results. The reason people make such mistakes is that they do not want to configure any settings when using software, ever. I mean it! You can hardly make your employees choose a scanning profile or set the resolution. Yes, there are people who do that, but a person who lacks such technical skills will get stuck and lose time. This is why at Cognitive Technologies we have created a product that does not require any scanning settings! It has a simple, user-friendly interface and allows scanning to PDF and converting documents into PDF files in three clicks! To learn more, search for Compressimo on the Internet!


When someone says “PDF document” or “scan to PDF”, we all now understand what he or she means. In this article we talk about one of the most valuable features of a PDF file: document indexing. We describe the typical mistakes of ordinary end users and tell you about products that make your employees’ lives easier.
