Some scanners ingest pages and output a PDF instead of separate images. When a PDF contains scanned pages like this, splitting out the pages first and then recognizing each one is usually the best approach. If all of the PDF documents are guaranteed to come from an electronic source, such as a Word document converted to a PDF, where the pages are never skewed and are always clean, then direct recognition on the PDF can be an option. Because this is not typical for most applications, it is usually more reliable and flexible to convert to an image first.

If the original PDF is required for archiving or some other use, the PDF is still available in the batch and can be uploaded to a repository, or placed where required, at the end of the process.

The following are general steps to first convert a PDF to images, then perform full page or field level recognition. They show the basic steps, which can be integrated into an application as needed.

When converting, always use a loss-less compression for text that will be recognized. Do not use a lossy compression such as JPEG, as this will reduce the quality of recognition even if a high quality compression rate is chosen. Note: If you decide to retain color for pages that are color, use a loss-less compression such as LZW.

After conversion, use image enhancement to fix rotation, deskew, and enhance the images.

This is an example ruleset, taken as a snapshot from Datacap Studio, that converts a PDF to separate images without performing any recognition in this step. This allows the images to be adjusted so that they recognize well. The actions enable the convPdfIgnoreContent variable. This prevents any extraction or recognition on the PDF during the conversion process, as it is not needed at this step. The conversion to image is set to 18, which is a loss-less fax compression. Always use a loss-less compression for images that will have recognition performed.
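As a rough illustration of the compression guidance for converted pages, the following Python sketch picks a loss-less compression for each page and records the order of the steps. It is not Datacap code: the `Page` class, the function names, the step names, and the compression labels are all hypothetical, and the labels are not Datacap setting values (the value 18 mentioned in the text is not modeled here).

```python
from dataclasses import dataclass

# Hypothetical labels for illustration only; these are NOT Datacap setting values.
LOSSLESS_BITONAL = "ccitt-fax"  # loss-less fax compression for black-and-white pages
LOSSLESS_COLOR = "lzw"          # loss-less compression that retains color
LOSSY = {"jpeg"}                # never used for pages that will be recognized


@dataclass
class Page:
    number: int
    is_color: bool


def choose_compression(page: Page, keep_color: bool) -> str:
    """Pick a loss-less compression for a page that will be recognized."""
    if page.is_color and keep_color:
        return LOSSLESS_COLOR
    return LOSSLESS_BITONAL


def plan_conversion(pages, keep_color=False):
    """Build a per-page plan: convert first, enhance next, recognize last."""
    plan = []
    for page in pages:
        compression = choose_compression(page, keep_color)
        # Guard against a lossy choice slipping in through later changes.
        assert compression not in LOSSY
        plan.append({
            "page": page.number,
            "compression": compression,
            "steps": ["convert_to_image", "enhance_image", "recognize"],
        })
    return plan
```

Calling `plan_conversion([Page(1, False), Page(2, True)], keep_color=True)` would assign the fax label to page 1 and the LZW label to page 2, with the same three steps for both pages.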
The first approach, where the PDF is first converted to an image and recognition is then performed on the image, is usually the best approach. The primary benefit of converting to an image first is that this step allows for image processing prior to recognition. In image processing, the image can be adjusted to improve the quality of the recognized text.

The following are the steps that can only be performed on a separate image:

- Image cleanup and adjustment through border removal, despeckling, line removal, and so on. These can all help achieve better recognition quality.
- When image registration is used, it must be performed on an image prior to recognition.
- Field recognition can only be performed on an image that has an associated template or fingerprint with loaded zones. Field recognition is not possible on a PDF page.
- Recognition can be limited to a subset of pages.

These are very strong reasons to first create an image and then perform recognition, and why it is typically the best path. While something like deskew might seem to be unnecessary because the text is clean, deskew not only improves recognition accuracy, it also helps ensure that all of the text is considered to be on the same line.
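To see why even a small skew matters for keeping text on the same line, a back-of-the-envelope calculation helps: the far end of a skewed line shifts vertically by the line width multiplied by the tangent of the skew angle. The sketch below is only an illustration. The function names, the 300 dpi dimensions, and the half-text-height threshold are assumptions, not values or behavior taken from Datacap or any recognition engine.

```python
import math


def baseline_drift_px(line_width_px: float, skew_degrees: float) -> float:
    """Vertical shift between the two ends of a text line on a skewed page."""
    return line_width_px * math.tan(math.radians(abs(skew_degrees)))


def may_split_line(line_width_px: float, skew_degrees: float,
                   text_height_px: float) -> bool:
    """Heuristic (an assumption): a drift of more than half the text height
    risks the two ends of one line being grouped as different lines."""
    return baseline_drift_px(line_width_px, skew_degrees) > text_height_px / 2


# Assumed example: an 8 inch line at 300 dpi is 2400 px wide,
# and 12 point text at 300 dpi is 50 px tall.
drift = baseline_drift_px(2400, 1.0)
print(f"drift at 1 degree: {drift:.1f} px")
print("risk of split line:", may_split_line(2400, 1.0, 50))
```

Under these assumptions, a 1 degree skew moves the end of the line by about 42 pixels, which is close to the height of the text itself, so deskewing is worthwhile even when the text looks clean.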