Auto-classification is one of the hottest buzzwords in the document capture market. Every ISV these days has an auto-classification product that can be used to identify similar groups of documents. But what is the killer app? We’ve seen some success in areas like identifying document types within large mortgage files, but it doesn’t seem like we’ve found that killer app yet. Parascript, the Boulder, CO-based recognition technology ISV, thinks it may be information governance.
“There is no question that auto-classification is an over-used term,” said Greg Council, VP of marketing and product management at Parascript. “The use cases we’ve seen have been basically confined to identifying document types within a pre-defined package of documents. Typically, users know what they are looking for.
“We started down the road of enhancing our document classification technology to address that type of need. We have a software partner in France working on a contract conformity application. They basically have to ensure that a package of documents is complete so it can be considered a valid contract. This involves solving three primary problems: establishing that the right documents are there, identifying document boundaries (first and last pages—basically, document separation), and running rules to look for specific data.
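As a rough illustration of the first two problems Council lists (document presence and document separation), here is a minimal sketch. It is not Parascript's or its partner's implementation: the page labels, the `REQUIRED` set, and both function names are hypothetical, and it assumes an upstream classifier has already tagged each page with a document type and a first-page flag. The third problem, running rules against extracted data, is left out.

```python
# Hypothetical set of document types a complete contract package must contain.
REQUIRED = {"contract", "id_proof", "signature_page"}


def split_documents(pages):
    """Group consecutive pages into documents.

    pages: list of (doc_type, is_first_page) tuples, in scan order.
    Returns a list of (doc_type, page_count) tuples.
    """
    docs = []
    for doc_type, is_first in pages:
        # Start a new document on a first-page flag or a change of type.
        if is_first or not docs or docs[-1][0] != doc_type:
            docs.append([doc_type, 1])
        else:
            docs[-1][1] += 1
    return [tuple(d) for d in docs]


def missing_types(docs, required=REQUIRED):
    """Return the required document types that are absent from the package."""
    return set(required) - {doc_type for doc_type, _ in docs}


pages = [("contract", True), ("contract", False), ("id_proof", True)]
docs = split_documents(pages)
missing = missing_types(docs)
```

In this example the package separates into a two-page contract and a one-page ID proof, and the check reports that the signature page is missing.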
“I’ll admit that we were kind of weak with our auto-classification previously, but when we started addressing that application, we saw the opportunity to go beyond this type of use case. The broader market we think is information governance. Within that area, we fixated on two primary business problems.
“The first is the ability to control and manage documents within a records management (RM) system. Most current RM applications require that end users, people like subject matter experts or even file clerks and records managers, tag documents. However, in today’s IT environment there are so many storage options that users will often bypass their RM requirements. This creates a real problem as organizations don’t even understand what they have and therefore can’t control it. Basically, if it’s not tagged, it’s not recognized by an RM application.
“The second problem is more closely related to ECM, and that is findability. This is related to not having a good taxonomy around documents and ensuring that they are all defined the same way. Everybody has unstructured search engines in their ECM applications, but they still have trouble finding documents.
“Let’s take a credit union that we’ve been talking to. The loan officers and CSRs are having a real problem locating the documents they need to service customers during interactions. When this organization adopted its current document management system, all its documents were merged into it, but they are only classified by account numbers. So, if a customer wants information related to a specific document, the CSR has to page through their entire file.
“They also have a warehouse of documents, and they don’t know what to keep as they transition to a new document management system. For example, they don’t know exactly which documents have value due to their being associated with existing accounts. They would also like to eliminate any duplicates.
“They are looking at employing six staff members to scan and visually look at each document to apply metadata. With auto-classification technology, if you set up proper rules, we think this should be able to be accomplished by a single person.”
Flexible technology and pricing
To address these types of needs, Parascript recently introduced Document Classification 2.0. “A lot of auto-classification technology utilizes just text,” said Council. “Our approach is to utilize text, as well as imagery, handwriting, and really any visual feature that you can take into consideration. These factors can be utilized separately or combined, for which we’ve set up a proprietary voting algorithm.
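Parascript's voting algorithm is proprietary and is not described here. Purely to illustrate the general idea of combining several feature-based classifiers, the following sketch uses a plain weighted vote. The feature names, weights, confidence values, and the `weighted_vote` function are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-feature class scores into a single label.

    predictions: {feature_name: {label: confidence between 0 and 1}}
    weights:     {feature_name: relative weight}; missing features default to 1.0
    Returns (best_label, share_of_total_weighted_score).
    """
    totals = defaultdict(float)
    for feature, scores in predictions.items():
        w = weights.get(feature, 1.0)
        for label, confidence in scores.items():
            totals[label] += w * confidence
    if not totals:
        return None, 0.0
    grand_total = sum(totals.values())
    best = max(totals, key=totals.get)
    share = totals[best] / grand_total if grand_total else 0.0
    return best, share


# Hypothetical outputs from three independent classifiers for one page.
example = {
    "text": {"loan_application": 0.6, "statement": 0.4},
    "layout": {"loan_application": 0.3, "statement": 0.7},
    "handwriting": {"loan_application": 0.9, "statement": 0.1},
}
label, score = weighted_vote(
    example, {"text": 2.0, "layout": 1.0, "handwriting": 1.0}
)
```

Here the layout classifier disagrees with the other two, but the weighted total still favors the loan application. Dropping a feature from `predictions` corresponds to using the factors separately.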
“Also, there are two ways to approach auto-classification. One is through clustering, in which the system automatically creates groups of similar documents. This is best used when you really don’t know what you have, such as when you are trying to organize large file shares. The second is to train the system on samples. Both these approaches can be used in conjunction with extraction technology, such as our FormXtra technology.”
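To make the clustering approach concrete, here is a minimal text-only sketch that groups documents without any training samples. It assumes a simple greedy scheme based on Jaccard word overlap with a hypothetical threshold of 0.3; it stands in for the general technique and is not how Document Classification 2.0 works internally.

```python
def tokens(text):
    """Lowercase a document and reduce it to its set of distinct words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it starts a new cluster. Returns lists of indices.
    """
    clusters = []  # each entry: {"seed": token set, "members": [doc index]}
    for i, text in enumerate(docs):
        t = tokens(text)
        for c in clusters:
            if jaccard(t, c["seed"]) >= threshold:
                c["members"].append(i)
                break
        else:
            clusters.append({"seed": t, "members": [i]})
    return [c["members"] for c in clusters]


docs = [
    "loan application for auto loan applicant name",
    "monthly account statement balance summary",
    "loan application for home loan applicant name",
    "account statement balance summary for june",
]
groups = cluster(docs)
```

The four sample documents fall into two groups, loan applications and account statements, without anyone naming those types in advance. The sample-trained approach would instead start from labeled examples of each type.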
Document Classification 2.0 is being offered both as a standalone technology and as an option within FormXtra Capture 6.1. “We are still figuring out exactly how to package it,” said Council. “We are looking at pricing it through a pay-as-you-go model, based on volumes. Users won’t have to purchase a perpetual license, because many companies want to take on classification on a project basis. We plan to offer this model as both an on-premise solution and a hosted cloud service.
“One of the complaints we’ve heard about current auto-classification is that it has missed the mark because it’s overly complex or too expensive or both. We are going to address this.”
Said Parascript VP of Sales Mark Gallagher in a press release, “We’re placing the power of document auto-classification in the hands of business users. On the one hand, you don’t have your team spending hours manually reviewing and organizing documents in the system. On the other hand, you don’t have to employ a team of programmers to auto-classify your documents.”
Market catching up to technology
Council added that increasing use of file sync and share applications like Dropbox and Box has created the perfect storm to drive demand for auto-classification. “We did an intro of Classification 2.0 at an ARMA event, and I was talking to a number of companies that have ECM applications; they all seem to be having the same problem,” said Council. “The longer these systems are in place, the more they start to drift. Eventually, these organizations lose control and their repositories become a mess.
“We think with Classification 2.0, we can clean this up through an automated process, not just for backfiles, but for new document streams as well. On top of that, in addition to Box and Dropbox, you have SharePoint, which really started this trend of people dumping documents into file shares outside their ECM systems. And it’s not just document images that our technology can be applied to. Classification is classification. In one engagement we are being asked about working with e-mails.”
For more information: http://bit.ly/ParaDocumentClassification2