IT-Analysis.com
Analysis
The case for a data quality platform
By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 26th October 2007
Copyright Bloor Research © 2007

Here at Bloor Research we have recently been investigating why so many data migration projects (84%) run over time or budget. Over half of the respondents in our survey who had run over budget blamed inadequate scoping (that is, they had not properly or fully profiled their data) and more than two thirds of those that had gone over time put their emphasis in the same place.

I mention this because it is symptomatic of all data integration and data movement projects: data quality work needs to start before you begin your project (so that you can properly budget time and resources), continue right through the course of the project and, where it is not a one-time project like data migration, be maintained on an on-going basis in production. Further, in order to maintain quality you need to be able to monitor it (via dashboards and the like) on an on-going basis as well. This is especially important within the context of data governance.

In other words, data quality follows the lifecycle of your data, and it spans multiple applications and systems as the data is reused and shared. In this article I discuss the need for an integrated data quality platform to support such an environment, with particular reference to the Trillium Software System, whose latest release (version 11) has just been announced.

Version 11 includes some very significant features, not least of which is that this release represents the culmination of the efforts the company has made, over the last few years, to fully integrate its Avellino acquisition. Specifically, this means that there is now a single repository (Metabase) which is shared across the environment (or you can have multiple Metabases), and that there is a single integrated interface across the product. The consequence of this is that data profiling and analysis need not be distinct from data cleansing and matching. In other words, you can now easily swap between one function and the other, as requirements dictate, rather than being forced to use a more waterfall-style approach in which profiling came first and quality came second.

The second major enhancement is the introduction of phrase analysis. This provides the ability to identify unique words and phrases (substrings), and combinations thereof, within a selected attribute. The importance of this is that it provides the ability to parse unstructured and semi-structured data and to then build data quality rules based on these. This is particularly important if you want to apply data quality rules to product data or other descriptive data that comes into the organisation in unstructured formats.
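To make the idea of phrase analysis concrete, here is a minimal Python sketch that counts the unique words and contiguous phrases found in a single attribute. It is not Trillium's implementation: the function name `phrase_counts`, the tokenising rule and the sample product descriptions are all assumptions made purely for illustration.

```python
from collections import Counter
import re


def phrase_counts(values, max_words=2):
    """Count every word and contiguous phrase (up to max_words words) in a column."""
    counts = Counter()
    for value in values:
        # Illustrative tokeniser: lower-case runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", value.lower())
        for n in range(1, max_words + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical product descriptions, invented for this example.
descriptions = [
    "Steel bolt 10 mm",
    "steel bolt 12mm zinc plated",
    "Brass washer 10 mm",
]

counts = phrase_counts(descriptions)
for phrase, count in counts.most_common(5):
    print(phrase, count)
```

Frequent phrases such as "steel bolt" or "10 mm" are candidates for parsing rules, while one-off tokens such as "12mm" point to variants that a data quality rule might need to standardise.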

Shortly, phrase analysis will be extended by the introduction of Universal Data Libraries that will provide standardised taxonomies for things such as colours, units of measure, currencies, sizes and shapes and so on. Although the core libraries will be in English at first, you will be able to customise these libraries so that you can automatically recognise that blau = bleu = blue for example.
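As a rough sketch of how a customised library of this kind might behave, the Python below maps variant terms onto a single canonical value. The library contents, the names `COLOUR_LIBRARY`, `build_lookup` and `standardise`, and the structure itself are hypothetical; nothing here is drawn from the format of Trillium's Universal Data Libraries.

```python
# Hypothetical colour library: canonical value -> recognised variants.
COLOUR_LIBRARY = {
    "blue": {"blue", "blau", "bleu"},
    "red": {"red", "rot", "rouge"},
}


def build_lookup(library):
    """Invert the library so that each variant points at its canonical value."""
    return {
        variant: canonical
        for canonical, variants in library.items()
        for variant in variants
    }


LOOKUP = build_lookup(COLOUR_LIBRARY)


def standardise(value, lookup=LOOKUP):
    """Return the canonical term, or None if the value is not recognised."""
    return lookup.get(value.strip().lower())


print(standardise("Blau"))
print(standardise(" bleu "))
print(standardise("vert"))
```

Returning None for an unrecognised value, rather than guessing, leaves it to a steward to decide whether the term belongs in the library.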

In addition, the company has opened up the Metabase through an API, both so that customers and partners can more easily integrate data quality and profiling processes into their applications and to provide a foundation for interactive report building. Through this API, for example, customers can easily incorporate metadata from Trillium, such as profiling results or business rules, into broader metadata repositories or other applications.

Finally, this release sees the introduction of time series analysis. Previously, you could only take snapshots of quality information but, with time series support, trending becomes possible so that you can monitor data quality and profiling statistics over time.
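The difference between a snapshot and a trend can be sketched in a few lines of Python. The figures below are invented, and the `trend` function is an assumption for illustration only, not a description of how the product stores or reports its statistics.

```python
from datetime import date

# Hypothetical snapshots: (date taken, % of records passing a quality rule).
snapshots = [
    (date(2007, 7, 1), 91.0),
    (date(2007, 8, 1), 93.5),
    (date(2007, 9, 1), 92.0),
]


def trend(snapshots):
    """Return the change in the pass rate between consecutive snapshots."""
    ordered = sorted(snapshots)
    return [
        (later[0], round(later[1] - earlier[1], 2))
        for earlier, later in zip(ordered, ordered[1:])
    ]


for taken, change in trend(snapshots):
    print(taken.isoformat(), change)
```

A single snapshot of 92.0% looks healthy in isolation; only the series shows that quality slipped by 1.5 points over the previous month.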

Anyway, so much for the major new features (there are a number of smaller ones as well) in version 11. Now I want to focus more on strategy.

There are two types of data quality vendor: pure plays (like Trillium) and those integrated with ETL tools.

As far as pure plays in the market are concerned, most of these (apart from Trillium) are smaller companies that specialise in a particular area such as real-time data quality (embedding data quality into call centre applications) or product data quality, for example. In contrast, Trillium is recognised as an enterprise-wide solution, providing the tools and content to support data quality needs across a wide range of business applications, data domains, and implementation types. However, there is more to it than that: by choosing the smaller suppliers of point solutions (whether for price or because they offer deeper capabilities in a specific area) you will end up having multiple data quality tools when it might be better to have a single solution that did everything, even if, in some circumstances, it wasn't quite the best thing since sliced bread.

Now, this may seem like the traditional integrated solution versus best-of-breed argument. Some people like one, some people like the other. However, in the case of data quality it is not quite as simple as that because using a data quality platform such as Trillium's allows you to reuse business rules across both real-time and batch processes, and across different application environments. Further, if we consider the implementation of data governance then one of the precepts involved would be the adoption of common data quality standards across the organisation, which is best facilitated by using a common platform and reusable rules.
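The principle of rule reuse can be illustrated with a small Python sketch in which one rule definition serves both a real-time entry point and a batch job. Everything here is hypothetical: the function names, the simplified UK-style postcode pattern and the sample values are assumptions, and no vendor's rule language is being represented.

```python
import re


def standardise_postcode(raw):
    """Shared rule (simplified, illustrative pattern): return a tidy postcode or None."""
    compact = re.sub(r"\s+", "", raw).upper()
    if not re.fullmatch(r"[A-Z]{1,2}[0-9][A-Z0-9]?[0-9][A-Z]{2}", compact):
        return None
    return compact[:-3] + " " + compact[-3:]


def on_data_capture(form_value):
    """Real-time use: reject a bad value at the point of entry."""
    cleaned = standardise_postcode(form_value)
    if cleaned is None:
        raise ValueError(f"Unrecognised postcode: {form_value!r}")
    return cleaned


def batch_cleanse(rows):
    """Batch use: apply the same rule to every value, leaving None for failures."""
    return [standardise_postcode(row) for row in rows]


print(on_data_capture(" m1 1ae "))
print(batch_cleanse(["sw1a1aa", "bad"]))
```

Because both paths call the same function, a value accepted in the call centre is formatted exactly as it would be in the warehouse load, which is the consistency a common standard is meant to deliver.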

For example, consider the respective requirements for data quality in the data warehouse versus real-time data quality in the call centre. Separate tools for each of these instances would almost certainly lead to different standards, duplicate values and other inconsistencies between the transactional application and the data warehouse. This, in turn, will perpetuate the mis-alignment and lack of understanding that is so common across different business functions, and between those people who are making strategic decisions as opposed to those that are actually executing day-to-day processes. The remedy for such a mismatch is to look for a data quality platform that extends across the enterprise both for different types of applications and different types of implementation.

If you accept the idea that the most desirable approach will be to have a single data quality platform you might then ask whether this should be a part of a larger data integration platform or whether your needs will be better served by a data quality specialist such as Trillium.

Specifically, of course, Trillium aims (and claims) to provide more functionality than its competitors, especially those coming from the ETL space. Hence the introduction of phrase analysis, the range and extent of Trillium's interfaces (it aims to integrate with anything), its new API, its Unicode support, its TS Insight product for trending and reporting, and so on.

Leaving aside these technical considerations, there is a clear argument for investing in an integrated ETL/data quality environment: one vendor, one tool set and so on. Equally, there is a clear argument against investing in this way, because many data quality requirements have nothing to do with data integration or ETL. If you want to embed data matching, say, in a call centre application then this has nothing to do with ETL, so why use an ETL-based tool rather than a pure data quality offering?

To take this further: the fundamental purpose of IT is to provide information to the business in a suitable format and in a timely manner; and data quality is fundamental to realising the value of that information. However, most large organisations have a wide variety of technologies in production and have very rapidly evolving business requirements. Because of this, data quality processes and capabilities must be deployed in a wide variety of contexts: batch and real-time, at the point of data capture and the point of data extraction, in data migrations and in on-going database maintenance operations. A data quality platform that can be directly integrated within these various contexts is likely to be more flexible and scalable (quicker to deploy and cheaper to operate) than data quality components that are wrapped within a specific environment such as a data integration suite. If you buy into this argument then Trillium must be a leading contender for any such platform.

PLEASE NOTE: This is an editorially-independent, sponsored article.

Reader Comments

31st October 2007: 'Vincent McBurney' said:

I don't see why a data quality platform owned by an ETL vendor is different to one owned by a pure play vendor. After all Excel didn't stop being a good spreadsheet when it was added to Microsoft Office. Informatica - Similarity, Business Objects - FirstLogic, SAS - DataFlux and IBM - Vality/QualityStage are all good data quality tools that you can implement as stand alone platforms. They all offer most of the functionality you described in Trillium such as SOA and profiling and metrics.

So congrats to Trillium for raising the bar but it's got nothing to do with them being a pure play vendor. All the leading data integration vendors have improving data quality platforms.

13th November 2007: 'Daragh O Brien' said:

I'd agree with Vincent. Apart from anything else, Informatica are still selling what was the 'Athanor' toolset as a separate product (I think it is called Informatica Data Quality now.. gee that took some thinking from the creatives).

While it is nice to see analysts come off the fence regarding vendor preferences I think that the more important thing that anyone looking to buy a tool needs to do is to figure out what they will need to support a proper framework for governance and control of information quality measurement and improvement.

It is possible that a purchase of Informatica's IDQ tool might be sufficient for the early stages of a programme but the extra 'oomph' of the integration with PowerCenter might be required later on up the maturity model (or if someone decides to bin your current architecture and move to a new platform while fixing all the quality problems).

To say that one size fits all is a bit premature.

[full disclosure - I'm an Informatica customer in my day job but I'm also the VP Publicity for the IAIDQ so I'm not endorsing the Informatica toolset, just flagging that there is more than one tool that might fit a platform need]

20th November 2007: 'Jason S.' said:

The key here would be to look ahead and think about where you might want to install DQ processes in the future. Then look at the vendor’s specs to see if they have the right connectors. For example, from what I surmise, Trillium has the only Siebel 8 connector. That’s only important if you’re planning to connect to that application.
Will Firstlogic, acquired by Business Objects, acquired by SAP ever make an effort to build a Siebel connector? Not likely.

24th January 2008: 'Kumar Sansar' said:

I just read your article and it was as though you sat through our meetings.

We used a very embryonic open-specification solution called JUMP Metamodel that we found on sourceforge.

The final result was some modification to the meta-language that significantly improved our cleansing efforts.

Published by: IT Analysis Communications Ltd.