I have been asked many times – usually in relation to OCR – "what about accuracy?" My immediate reaction is often frustration, because the people who ask are usually in a state of some confusion. Before launching into technical details, the first point is to understand what accuracy actually is, and the second is to appreciate that the accuracy you need depends directly on what you want to do with the information.
Let's start with a human looking at a screen or a document and reading the content. We usually begin by looking at the overall document to classify it, perhaps considering how it arrived, in deciding whether it is likely to be important. If it is not important, we often decide on a retention schedule immediately: does it get binned, put to one side for possible future access, passed to someone else who might find it interesting, or actioned right away?
If we initially think it may be important, we take the next step of looking more closely, reading it (i.e. extracting meaning from it) and deciding what to do with the information. Depending on the initial quality (and our eyesight), we may misread some of the information, even transposing characters. Normally we correct this as we read the word or the sentence – but sometimes I find I need to read a sentence more than once, and even go back for context, in order to fully understand it. So in this context, what is my accuracy? Am I retrieving the information I need? I may be manually extracting information and even keying it into another document. Mostly I do this accurately, but occasionally I make a mistake, and in some extreme cases I cannot discern characters or words well enough to retrieve them at all – there is a reason the 'small print' terms and conditions are printed in 6-point font! So it is important to realize that humans are not 100% accurate.
Today's recognition software works much as humans do. It often looks first from a high level, using a reduced-size image – sometimes known as a postage stamp – to start the classification process. Sometimes it may use an understanding of the channel or source. Then, if appropriate, it starts to OCR – to understand (or read) the alphanumeric content – before passing it on.
Pattern recognition technologies never provide a 100% accurate read of all information – even barcode recognition, using a laser reader on barcodes with long lines and supplemented by a check digit, will occasionally fail to read. But it is one thing to fail to read; it is another to transpose a character or a word, changing the meaning of a sentence, posting the wrong amount, or misreading someone's name.
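The check digit mentioned above is what lets a barcode reader refuse to emit a misread rather than report a wrong value. As a concrete illustration, here is a small sketch of the standard UPC-A check-digit test (odd positions weighted 3, even positions 1, weighted sum divisible by 10):

```python
def upca_check_digit_ok(code: str) -> bool:
    """Verify a 12-digit UPC-A barcode, including its trailing check digit.

    Digits in odd positions (1st, 3rd, ...) are weighted 3, even positions 1;
    the weighted sum of all 12 digits must be divisible by 10.
    """
    if len(code) != 12 or not code.isdigit():
        return False
    total = sum(int(d) * (3 if i % 2 == 0 else 1) for i, d in enumerate(code))
    return total % 10 == 0

print(upca_check_digit_ok("036000291452"))  # True  - valid code
print(upca_check_digit_ok("036000291453"))  # False - last digit corrupted
```

A single misread digit always changes the weighted sum, so the reader reports "no read" instead of a wrong number – a fail-to-read, not a silent error.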
In some cases it really does not matter – consider a misspelled item name on an invoice. The part number may be cross-referenced, but if the seller's part number differs from the one in your system, you may need the description. You can likely look up the correct part even if the spelling is incorrect. So captured data does not have to be 100% accurate to achieve correct retrievals.
Sometimes it is critical that you capture exactly the information submitted on the form – in this rare case you need 100% accuracy. Humans, as noted above, do not achieve 100% accuracy on the first pass, so we usually use a second person to review and correct the data without reference to the first entry (blind key entry), which gets us to 99.9% or so – but that is still 1 error in 1,000 characters. This is known as double key entry. On a form with an average of 120 characters, which is pretty typical, that means roughly one in every nine forms – about 11% – still contains an error. To approach 100% accuracy you need to triple key.
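The arithmetic behind that per-form figure is worth making explicit. Given a residual character error rate of 1 in 1,000 after double keying, the chance that a 120-character form contains at least one error is:

```python
char_error_rate = 0.001   # 1 error per 1,000 characters after double key entry
chars_per_form = 120

# Probability that a form contains at least one erroneous character
p_form_error = 1 - (1 - char_error_rate) ** chars_per_form

print(f"{p_form_error:.1%}")                        # 11.3%
print(f"roughly 1 form in {1 / p_form_error:.0f}")  # roughly 1 form in 9
```

Even a very good character-level error rate multiplies out to a surprisingly high form-level error rate, which is the real argument for triple keying on critical data.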
OCR can remove the need for the first keying pass if the characters are well defined and the scan is good, but usually some repair is needed. When an integrator sets accuracy requirements (hopefully at the field level), it tells the OCR engine(s) how much confidence is required. All OCR engines produce an accuracy likelihood, or confidence factor. For example, an engine may convert a 'b' and be 99% sure the conversion is accurate – it usually arrives at this by passing the character through a series of steps, with different algorithms voting internally on its confidence. In full-text OCR, it supplements this by checking each word against a dictionary and correcting wrong words. In this example, if the result is 99% certain it will pass – but if it is only 80% certain, the engine will flag the character as suspect and a user must verify it. As a result, if one sets the accuracy requirement too high, one ends up with many false negatives – characters that the OCR process thinks may be wrong but are not. Each of those must be reviewed, even though they are correct. It may therefore seem to many that the OCR process is not 'accurate', but the reality is that it is accurate – just doubtful about it.
What is Confidence?
The alternative is to set the confidence requirement much lower. In this case the integrator or user may set the accuracy requirement at, say, 70%. Let's assume a 'b' is not part of a dictionary word – for instance, part of an alphanumeric part number. The OCR engine(s) may score it as a '6' with 70% confidence and as a 'b' with 80% confidence. It will convert it to a 'b' but, because the runner-up candidate also clears the threshold, flag it as suspect for review. These conversions are known as false negatives: the conversion was correct, but it was flagged as suspect – from an accuracy statistics standpoint, flagged as possibly inaccurate. On the opposite side, suppose the 'b' was damaged (for instance by a fold in the paper), leading the engine(s) to score an 'l' at 71%, a 'b' at 65% and a '6' at 60%. In this case the software will report the character as an 'l' – clearly an error, and one that is not flagged.
So the problem with accuracy measurements is that a user can improve the 'apparent' accuracy merely by setting the confidence requirement very low – the OCR will think everything is right. The converse is that if the user sets the confidence requirement very high, the OCR will think that many correct conversions are suspect.
To further increase accuracy and reduce the need for repair, vendors have incorporated internal 'voting' capability and dictionary lookups, as well as techniques such as recognizing that certain character combinations rarely occur together in a word (e.g. 'qrw'). The voting technique runs different OCR algorithms on each character and 'votes' on the result. It improves the accuracy likelihood, but you still end up with the problems discussed above – machine-printed characters work pretty well, but handprint and handwriting, which are another issue, pose real challenges.
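A minimal sketch of character-level voting, assuming three engines have each produced a reading of the same field (the engine names and tie-breaking rule are illustrative, not any vendor's actual scheme):

```python
from collections import Counter

def vote(readings: list) -> str:
    """Combine per-character results from several OCR engines by majority vote.

    All readings must be the same length; on a tie, the first engine's
    character wins (a simplifying assumption for this sketch).
    """
    assert readings and all(len(r) == len(readings[0]) for r in readings)
    result = []
    for chars in zip(*readings):
        counts = Counter(chars)
        # most_common() is ordered by count, then by first appearance,
        # so ties fall back to the first engine's reading
        result.append(counts.most_common(1)[0][0])
    return "".join(result)

# Three hypothetical engines disagree on a single character
print(vote(["INV0ICE", "INVOICE", "INVOICE"]))  # INVOICE
```

Real voting systems weight each engine's confidence score rather than counting raw votes, but the principle is the same: independent algorithms rarely make the same mistake on the same character.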
To Measure Real Accuracy
The only way to measure different OCR engines for true accuracy is to create a 'truth deck': a set of images that (hopefully) mirrors your data variations, for which you have manually created – and preferably verified – the answers, so that you can compare the OCR results against them.
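Scoring OCR output against a truth deck is commonly done with character accuracy derived from edit distance. A sketch, assuming the truth text and OCR text for one field (the sample strings are illustrative):

```python
def char_accuracy(truth: str, ocr: str) -> float:
    """Character accuracy = 1 - (edit distance / truth length).

    Edit distance is computed with the standard Levenshtein
    dynamic-programming recurrence, using a single rolling row.
    """
    m, n = len(truth), len(ocr)
    dp = list(range(n + 1))          # distances for the empty-truth row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                         # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (truth[i - 1] != ocr[j - 1]))  # substitution
            prev = cur
    return 1 - dp[n] / m

# One substituted character ('O' read as '0') over 12 characters
print(round(char_accuracy("INVOICE 1234", "INV0ICE 1234"), 3))  # 0.917
```

Run per field across the whole deck, this gives a true accuracy number that is independent of where anyone set the confidence threshold – which is exactly what the threshold-tuned 'apparent' accuracy cannot give you.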
To actually achieve the highest data accuracy:
- Make sure the images are created at as high a quality as you can. By this I do not mean high dpi or resolution – most (but not all) OCR works best at 300dpi. Rather, the paper needs to be moved consistently through the transport and kept straight, and the thresholding conversion needs to be optimized for OCR.
- Figure out your post-recognition validation strategies – database lookups, synonym dictionaries, arithmetic calculations, etc. – before resorting to re-keying, which is expensive.
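The validation strategies in the list above can catch OCR errors automatically, before anything is sent for re-keying. A sketch for one invoice line, where the part database, field names and tolerance are all illustrative assumptions:

```python
def validate_line(part_no: str, qty: int, unit_price: float,
                  total: float, parts_db: dict) -> list:
    """Post-recognition checks on one recognized invoice line.

    Returns a list of problems; an empty list means the line passes
    and needs no human review. `parts_db` is a hypothetical lookup
    table of known part numbers.
    """
    problems = []
    if part_no not in parts_db:                      # database lookup
        problems.append(f"unknown part number {part_no!r}")
    if abs(qty * unit_price - total) > 0.005:        # arithmetic cross-check
        problems.append(f"{qty} x {unit_price} != {total}")
    return problems

db = {"A-100": "widget", "B-200": "gasket"}
print(validate_line("A-100", 3, 2.50, 7.50, db))   # []  -> passes, no review
print(validate_line("A-1OO", 3, 2.50, 7.80, db))   # both checks fail -> review
```

Because an invoice line must satisfy its own arithmetic, a misread digit in any one of quantity, price or total is very likely to trip the cross-check – which is why these lookups are so much cheaper than blanket re-keying.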
Accuracy is not an easy discussion and many fields may not need high accuracy, but if you want to discuss it more, please contact Harvey Spencer Associates.
Harvey Spencer – President, HSA Inc.
About HSA, Inc. (Harvey Spencer Associates):
Since 1989, HSA, Inc., based in New York, has specialized in electronic information capture technologies (image-based and electronic transaction). Our services include Market Analysis, Technology Planning Assistance, Product Positioning, Product Management, Client Sponsored Research and Strategic Planning Services. The technologies we cover include high-speed document scanning hardware, image acquisition software, character recognition software (OCR, ICR), optical mark recognition, barcode recognition and other pattern recognition and classification tools.