IT-Analysis.com
The Importance of multi-language support in advanced search and text analytics
By: Dr Fern Halper, Partner, Hurwitz & Associates
Published: 17th March 2010
Copyright Hurwitz & Associates © 2010

I had an interesting briefing with the Basis Technology team the other week. They updated me on the latest release of their technology, Rosette 7. In case you're not familiar with Basis Technology, it provides the multilingual engine that is embedded in some of the biggest Internet search engines out there, including Google, Bing, and Yahoo. Enterprises and the government also use it. But the company is not just about keyword search. Its technology also extracts entities (about 18 different kinds) such as organizations, names, and places. What does this mean? It means that the software can discover these kinds of entities across massive amounts of data and perform context-sensitive discovery in many different languages.

An Example
Here's a simple example. Say you're in the Canadian consulate and you want to understand what is being said about Canada across the world. You type "Canada" into your search engine and get back a listing of documents. How do you make sense of this? Using Basis Technology's entity extraction (an enhancement to search and a basic component of text analytics), you could perform faceted (i.e. guided) navigation across multiple languages. This is illustrated in the figure below.

Here, the user typed "Canada" into the search engine and got back 89 documents. In the main pane in the browser, an arrow highlights the word Canada in a number of different languages, so you know that it is included in these documents. On the left-hand side of the screen is the guided navigation pane. For example, you can see that there are 15 documents that contain a reference to Obama and another 6 that contain a reference to Barack Obama. This is not necessarily a co-occurrence within a sentence, just within the document. So, each of these articles contains a reference to both Obama and Canada. This would help you determine what Obama might have said about Canada, or what the connection is between Canada and the BBC (listed under organizations). This idea is not necessarily new, but the strong multilingual capabilities make it compelling for global organizations.
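To make the guided navigation idea concrete, here is a minimal sketch in Python. It is not Basis Technology's API: the document ids, the entity data, and the function names `facet_counts` and `drill_down` are all hypothetical, and the extraction step is assumed to have already happened. It shows only how document-level facet counts, like the ones in the navigation pane, could be derived from extracted entities.

```python
from collections import Counter

# Hypothetical extraction output (an assumption for illustration): each
# document id maps to the set of (entity type, entity text) pairs found in it.
docs = {
    "doc-en-1": {("LOCATION", "Canada"), ("PERSON", "Obama"), ("ORGANIZATION", "BBC")},
    "doc-fr-1": {("LOCATION", "Canada"), ("PERSON", "Barack Obama")},
    "doc-de-1": {("LOCATION", "Canada"), ("ORGANIZATION", "BBC")},
    "doc-es-1": {("LOCATION", "Canada"), ("PERSON", "Obama")},
}


def facet_counts(documents):
    """Count, for each entity, how many documents mention it at least once."""
    counts = Counter()
    for entities in documents.values():
        # Entities are stored in a set, so each one counts once per document.
        counts.update(entities)
    return counts


def drill_down(documents, facet):
    """Return the sorted ids of documents that contain the chosen facet."""
    return sorted(doc_id for doc_id, entities in documents.items() if facet in entities)


counts = facet_counts(docs)
obama_docs = drill_down(docs, ("PERSON", "Obama"))

for (entity_type, text), n in sorted(counts.items()):
    print(f"{entity_type:12} {text:14} {n} document(s)")
print("Documents mentioning Obama:", obama_docs)
```

Because the counts are per document rather than per sentence, selecting a facet narrows the result list to documents that mention both the query term and the chosen entity somewhere in their text, which is the behavior described above.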

If you have eagle eyes, you will notice that the search on Canada returned 89 documents, but the entity "Canada" only returned 61 documents. This illustrates what entity extraction is all about. When the search for Canada was run on the Rosette Name Indexer tab (see the upper right-hand corner of the screenshot), the query searched for Canada against all automatically extracted "Canada" entities in all of the documents. This includes all persons, locations, and organizations that have similar names, such as "Canada Post" and "Canada Life", which are organizations, not the country itself. Therefore, the 28 other documents (89 minus 61) contain a Canada variant that is an organization or another kind of entity.
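The gap between the two counts can be sketched with a small, made-up data set. Everything here is an assumption for illustration (the five documents, the entity labels, and the helper names `name_search` and `exact_entity`); it is not how the Rosette Name Indexer is implemented. It only shows why a name search over all extracted entities returns more documents than the single country entity does.

```python
# Hypothetical extracted entities per document, as (type, name) pairs.
docs = {
    "d1": [("LOCATION", "Canada")],
    "d2": [("LOCATION", "Canada"), ("ORGANIZATION", "Canada Post")],
    "d3": [("ORGANIZATION", "Canada Post")],
    "d4": [("ORGANIZATION", "Canada Life")],
    "d5": [("LOCATION", "Canada")],
}


def name_search(documents, query):
    """Documents with any extracted entity whose name contains the query."""
    q = query.lower()
    return {
        doc_id
        for doc_id, entities in documents.items()
        if any(q in name.lower() for _, name in entities)
    }


def exact_entity(documents, entity):
    """Documents that contain exactly this (type, name) entity."""
    return {doc_id for doc_id, entities in documents.items() if entity in entities}


all_hits = name_search(docs, "canada")
country_hits = exact_entity(docs, ("LOCATION", "Canada"))
variant_only = all_hits - country_hits

print(len(all_hits), "documents match a Canada name")
print(len(country_hits), "documents mention the country itself")
print(len(variant_only), "documents mention only a variant:", sorted(variant_only))
```

In this toy data the counts are 5, 3, and 2. The article's figures follow the same pattern: 89 documents in total, 61 for the country entity, and 28 that contain only a variant.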

Use Cases
There are obviously a number of different use cases where the ability to extract entities across languages can be important. Here are three:

  • Watch lists. With the ability to extract entities, such as people, in multiple languages, this kind of technology is good for government or financial watch lists. Basis can resolve matches and translate names in 9 different languages. This includes resolving multiple spelling variations of foreign names. It also enables organizations to match names of people, places, and organizations against entries in a multilingual database.
  • Legal discovery. Basis Technology's software can identify entities in 55 different languages. This can help in legal discovery by narrowing down the number of documents that a company, such as a global enterprise, would need to analyze. Additionally, it could process many documents and extract the entities associated with them to find the right set of documents needed in legal discovery.
  • Brand image, competitive intelligence. The technology can be used to extract company names across multiple languages. The software can also be used against disparate data sources, such as internal document management systems as well as external sources such as the Internet. This means that it could trawl the Internet to extract a company name (and variations on the name) in multiple languages. I would expect this technology to be used by "listening posts" and other "Voice of the Customer" services in the near future.
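As a rough illustration of the watch-list use case, the sketch below matches a candidate name against a small list while tolerating accents and spelling variations. It is a stand-in technique, not Basis Technology's method: it uses accent stripping plus the general-purpose string similarity in Python's standard `difflib` module, and the names, the `WATCH_LIST` entries, and the 0.8 threshold are all hypothetical. Real multilingual name matching also has to handle different scripts and transliteration rules, which this does not attempt.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical watch-list entries, invented for this example.
WATCH_LIST = ["Aleksandr Petrov", "José Martínez", "Canada Post"]


def normalize(name):
    """Lowercase the name, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())


def match_watch_list(candidate, watch_list, threshold=0.8):
    """Return entries whose similarity to the candidate meets the threshold.

    The threshold is an assumed value; results are ordered best match first.
    """
    target = normalize(candidate)
    scored = [
        (SequenceMatcher(None, target, normalize(entry)).ratio(), entry)
        for entry in watch_list
    ]
    return [entry for score, entry in sorted(scored, reverse=True) if score >= threshold]


print(match_watch_list("Jose Martinez", WATCH_LIST))
print(match_watch_list("Alexander Petrov", WATCH_LIST))
```

A threshold that is too low produces false matches and one that is too high misses legitimate variants, so in practice the cut-off would need tuning against real data.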

While this technology is not a full text analytics platform, it does provide an important piece of core functionality needed in a global economy. Look for more announcements from the company in 2010 around enhanced search in additional languages.

Published by: IT Analysis Communications Ltd.