Master data management initiatives are now being deployed with sizeable data volumes. A few years ago 10 million master data records was considered large, but we now see examples of applications holding 100 million master data records. Merely processing such volumes is a challenge, and then you have to consider how you are going to keep your shiny new data in mint condition. You can put in a data quality "firewall" which, for example, will check for potential duplicate records about to be entered in, say, an order processing system. However, applying clever matching algorithms to large volumes of data while still expecting a sensible response time is problematic.
This background makes the general availability of the DataRush engine from Pervasive Software potentially interesting. Pervasive is a company long established in the field of embeddable databases and data integration. The DataRush technology uses highly parallel techniques to process large amounts of data very quickly. For example, common data quality algorithms such as "edit distance" are delivered with the engine, meaning that common tasks such as name and address checking can be done quickly. Beta applications at companies such as TC3, which processes large volumes of health care claims, have seen some dramatic performance improvements over previous approaches. Another documented example is at PIERS, a company that collects bills of lading in the shipping world and analyses these to help companies understand trends in international trade. Extensive processing is needed to eliminate duplicate data before the data can be turned into meaningful information.
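To make the "edit distance" idea concrete, here is a minimal single-threaded sketch in Python of the classic Levenshtein distance, together with a firewall-style check of one incoming name against existing records. It is purely illustrative and is not DataRush code: the function names, the lower-casing step and the `max_distance` threshold are all assumptions made for this example.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def possible_duplicates(candidate: str, existing: List[str],
                        max_distance: int = 2) -> List[Tuple[str, int]]:
    """Return existing records within max_distance edits of the candidate,
    closest first. max_distance is a hypothetical tuning parameter."""
    normalized = candidate.strip().lower()
    hits = []
    for record in existing:
        distance = edit_distance(normalized, record.strip().lower())
        if distance <= max_distance:
            hits.append((record, distance))
    return sorted(hits, key=lambda hit: hit[1])
```

Each comparison costs time proportional to the product of the two string lengths, and this naive check compares the incoming record against every existing one. With 100 million records that is 100 million comparisons per new entry, which is why doing this work in parallel matters for response times.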
There is no shortage of business use cases where data quality processing has to be applied to large volumes of data (another example is mortgage claims), so there should be a substantial market for something that can make this go much faster. This technology has the potential to be picked up by data quality software providers, and perhaps by MDM vendors (MDM applications have a significant data quality component), to turbo-charge their own products. Given Pervasive's track record of producing reliable embeddable software, the company will be taken seriously. The engine could in principle be applied in other areas, such as analytics, but data quality is the obvious focus at present. In addition to software vendors, there are plenty of systems integrators that custom-build applications with data quality elements in specialist areas, and in some of these cases volume and processing time will be a major issue.
It is early days, but with growing interest in data quality (a market that grew 17% in 2008 according to our latest research) and an increasing need to deal with high data volumes, DataRush could be in the right place at the right time.