I had an interesting briefing with the Basis Technology team the
other week. They updated me on the latest release of their
technology called Rosette 7. In case you're not familiar with
Basis Technology, it provides the multilingual engine embedded
in some of the biggest Internet search engines out there,
including Google, Bing, and Yahoo. Enterprises and
the government also use it. But the company is not just
about keyword search. Its technology also enables the extraction
of entities (about 18 different kinds) such as organizations,
names, and places. What does this mean? It means that the
software can discover these kinds of entities across
massive amounts of data and perform context-sensitive discovery
in many different languages.
An Example
Here's a simple example. Say you're in the Canadian consulate and
you want to understand what is being said about Canada across the
world. You type "Canada" into your search engine and get back a
listing of documents. How do you make sense of this? Using Basis
Technology entity extraction (an enhancement to search and a
basic component of text analytics), you could actually perform
faceted (i.e., guided) navigation across multiple languages. This
is illustrated in the figure below. Here, the user typed "Canada"
into the search engine and got back 89 documents. In the main
pane in the browser, you can see that an arrow highlights the
word Canada in documents written in a number of different
languages, so you know that it is included in these documents.
On the left-hand side of the screen is the guided navigation
pane. For example, you can see
that there are 15 documents that contain a reference to Obama and
another 6 that contain a reference to Barack Obama. This is not
necessarily a co-occurrence in a sentence, just in the document.
So any of these articles would contain a reference to both Obama
and Canada, which would help you determine what Obama might have
said about Canada, or what the connection is between Canada and
the BBC (listed under organizations). This idea is not necessarily new, but
the strong multilingual capabilities make it compelling for
global organizations.
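To make the guided navigation idea concrete, here is a minimal sketch of how the facet counts in the left-hand pane could be computed once entities have been extracted. It is not Basis Technology's implementation or the Rosette API. The documents, the entity labels, and the function names (build_facets, drill_down) are all invented for illustration, and the entities are assumed to have been extracted already by some upstream tool.

```python
from collections import Counter, defaultdict

# Hypothetical search results with entities already extracted upstream.
# Each entity is a (type, name) pair; all of this data is made up.
DOCS = [
    {"id": 1, "lang": "en",
     "entities": {("LOCATION", "Canada"), ("PERSON", "Obama")}},
    {"id": 2, "lang": "fr",
     "entities": {("LOCATION", "Canada"), ("ORGANIZATION", "BBC")}},
    {"id": 3, "lang": "es",
     "entities": {("LOCATION", "Canada"), ("PERSON", "Obama"),
                  ("ORGANIZATION", "BBC")}},
]


def build_facets(docs):
    """Count, per entity type, how many documents mention each entity."""
    facets = defaultdict(Counter)
    for doc in docs:
        # Entities are stored as a set, so each one counts once per
        # document: this is document-level, not sentence-level,
        # co-occurrence.
        for etype, name in set(doc["entities"]):
            facets[etype][name] += 1
    return facets


def drill_down(docs, etype, name):
    """Return the ids of documents that contain the chosen facet value."""
    return [d["id"] for d in docs if (etype, name) in d["entities"]]


facets = build_facets(DOCS)
for etype, counts in sorted(facets.items()):
    print(etype, dict(counts))
print("Documents mentioning the BBC:",
      drill_down(DOCS, "ORGANIZATION", "BBC"))
```

Clicking a facet value in the navigation pane corresponds to the drill_down call: it narrows the result list to the documents that mention that entity, whatever language they are written in.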
If you have eagle eyes, you will notice that the search on Canada
returned 89 documents, but the entity "Canada" only returned 61
documents. This illustrates what entity extraction is all about.
When the search for Canada was run on the Rosette Name Indexer
tab (see the upper right-hand corner of the screenshot), the
query matched "Canada" against every automatically extracted
entity in all of the documents. That includes all persons,
locations, and organizations with similar names, such as
"Canada Post" and "Canada Life," which are organizations, not
the country itself. The other 28 documents (89 minus 61)
therefore mention a Canada variant that is an organization or
some other entity.
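The gap between the two counts can be reproduced on a toy scale. The sketch below assumes each document carries a list of extracted (type, name) entities. The data and the function names (name_search, exact_entity) are hypothetical, and the substring match is only a stand-in for whatever matching the Rosette Name Indexer actually performs.

```python
# Hypothetical extracted entities per document, as (type, name) pairs.
DOCS = {
    "doc-a": [("LOCATION", "Canada")],
    "doc-b": [("ORGANIZATION", "Canada Post")],
    "doc-c": [("ORGANIZATION", "Canada Life")],
    "doc-d": [("LOCATION", "Canada"), ("ORGANIZATION", "Canada Post")],
    "doc-e": [("LOCATION", "France")],
}


def name_search(docs, query):
    """Ids of documents with any entity whose name contains the query."""
    q = query.casefold()
    return {doc_id for doc_id, ents in docs.items()
            if any(q in name.casefold() for _, name in ents)}


def exact_entity(docs, etype, name):
    """Ids of documents that contain exactly this typed entity."""
    return {doc_id for doc_id, ents in docs.items() if (etype, name) in ents}


hits = name_search(DOCS, "Canada")                  # any Canada variant
country = exact_entity(DOCS, "LOCATION", "Canada")  # the country only
other = hits - country                              # variants only

print(len(hits), len(country), len(other))
```

Here the name search finds four documents, the country facet covers two of them, and the remaining two mention only organizations such as Canada Post. That is the same relationship as 89, 61, and 28 in the screenshot.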
Use Cases
There are obviously a number of different use cases where the
ability to extract entities across languages can be important.
Here are three:
- Watch lists. With the ability to extract entities, such as
people, in multiple languages, this kind of technology is good
for government or financial watch lists. Basis can resolve
matches and translate names in 9 different languages. This
includes resolving multiple spelling variations of foreign names.
It also enables organizations to match names of people, places,
and organizations against entries in a multilingual database.
- Legal discovery. Basis Technology can identify entities in 55
different languages. This can help in legal discovery by
narrowing down the number of documents that a company, such as a
global enterprise, would need to analyze. It could process many
documents and extract the entities associated with them to find
the right set of documents needed for discovery.
- Brand image, competitive intelligence. The technology can be
used to extract company names across multiple languages. The
software can also be used against disparate data sources, such as
internal document management systems as well as external sources
such as the Internet. This means that it could comb the Internet
to extract a company name (and variations on the name) in multiple
languages. I would expect this technology to be used by
"listening posts" and other "Voice of the Customer" services in
the near future.
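As a rough illustration of the watch-list use case, the sketch below screens a name against a list while tolerating spelling variations. It is a generic approach built on Python's standard library (accent stripping plus difflib similarity), not the method Basis uses. The function names, the 0.8 threshold, and the sample list are assumptions made for the example, and it only handles names already written in Latin script. Matching across scripts would need transliteration, which this sketch does not attempt.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_marks)
    return " ".join(kept.casefold().split())


def similarity(a, b):
    """Similarity of two names after normalization, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def screen(name, watch_list, threshold=0.8):
    """Return (entry, score) pairs at or above the threshold, best first."""
    scored = [(entry, similarity(name, entry)) for entry in watch_list]
    matches = [(entry, score) for entry, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Made-up watch list for illustration only.
WATCH_LIST = ["Moammar Qaddafi", "John Smith"]
print(screen("Muammar Gaddafi", WATCH_LIST))
```

The threshold trades false positives against missed matches, so in practice it would have to be tuned against real name data.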
While this technology is not a full text analytics platform,
it does provide an important piece of core functionality needed
in a global economy. Look for more announcements from the company
in 2010 around enhanced search in additional languages.