What kinds of applications do we need a semantic web for? Is the
semantic web practical? These questions (among others) were posed
by Jamie Taylor of Metaweb Technologies to a group of panelists
at the Text Analytics Summit last week. The panelists were no
lightweights. They included Vladimir Zelevinsky from Endeca, Ron
Kaplan from Microsoft, and Kathleen Dahlgren from Cognition. I
found this to be one of the most engaging segments of the Summit.
First of all, many people define the semantic web as a "web of
meaning" or a "web of data" that will allow computer applications
to exploit the data directly. Check out the W3C webpage for more
information about definitions. The panelists at the Summit got
into an interesting discussion about parsing data sources for the
semantic web. Here are a few of the highlights. Please note that
I asked some additional questions after the panel itself, so
some of what follows may not have come up during the session.
-
What kind of applications is the Semantic Web good
for? It depends what you want to know. For example,
one of the panelists pointed out that you don't need the
semantic web to find a hardware store in Boston. However, more
unusual queries might require it. Most people have had the
experience of knowing what they are looking for, using a
five- or six-word query, and still not finding it. The panelists
pointed out that entities (people, places, things) are
relatively easy to extract; it is the relationships between the
entities that are harder. Vladimir Zelevinsky explained it like
this in terms of information retrieval need/information
retrieval technologies:
- Known Item Search -> Keyword Search (e.g.,
Google, where you need to find what you know exists);
- Unknown Item Search -> Guided Navigation (e.g.,
faceted search, where you need to explore the data space);
- Unknown Relationship Search -> Semantic Web (where you
are looking not for separate items in the repository, in this
case the web, but for the connection(s) between them).
The semantic web could pay off in applications that require
understanding the relationships between these entities. Ron
Kaplan also noted that semantic web technology provides a
standard way of merging data from different sources, and that
will probably enable some useful new applications.
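To make the "unknown relationship" idea a little more concrete,
here is a minimal sketch of merging statements from two sources
and then searching for a connection between two entities. It is
only an illustration under stated assumptions: the data, the
(subject, predicate, object) tuple layout, and the function names
`merge` and `find_connection` are all invented for this example,
and nothing here was presented by the panelists.

```python
from collections import deque

# Two hypothetical data sources, each a list of
# (subject, predicate, object) statements. All values are made up.
source_a = [("Alice", "works_for", "Acme")]
source_b = [("Acme", "located_in", "Boston")]


def merge(*sources):
    """Combine statements from several sources, dropping duplicates."""
    return set().union(*sources)


def find_connection(triples, start, goal):
    """Breadth-first search for a chain of statements linking two entities.

    Statements are followed in either direction. Returns the list of
    statements on a shortest chain, or None if no chain exists.
    """
    edges = {}
    for triple in sorted(triples):
        subject, _, obj = triple
        edges.setdefault(subject, []).append((obj, triple))
        edges.setdefault(obj, []).append((subject, triple))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, triple in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [triple]))
    return None


merged = merge(source_a, source_b)
path = find_connection(merged, "Alice", "Boston")
print(path)
```

Neither source alone connects "Alice" to "Boston"; the chain only
appears once the two are merged, which is the kind of payoff the
panelists were describing.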
-
Scaling the semantic web. Everyone seemed to
agree that manually tagging documents is a brittle exercise.
Vladimir Zelevinsky, from Endeca, suggested putting a parser on
each machine. He said that since people type more slowly than
one sentence per second, semantics could be injected into a
document at the moment of creation. Of course, it is a bit
more complex than this, but this was an interesting notion.
Kathleen Dahlgren from Cognition said that NLP at scale was the
wave of the future. NLP is complex but deeply distributed.
Computers are getting faster and cheaper, and this can make
NLP fast and scalable.
Is it practical? There is a huge amount of data out there and it
keeps changing. There is also a lot of duplicate information on
the web. Is it economically viable to think about parsing the
web? Ron Kaplan said he had done a back-of-the-envelope
calculation using the following assumptions:
"The simple order-of-magnitude calculation goes as follows: There
are roughly 2.5M seconds in a month, so an 8-core machine gives
you 20M cpu seconds. If it takes 1 second on the average to
process a sentence (an upper bound), then you can do 20M
sentences per month. If a web page has on the average 20
sentences, you get 1M pages per month per machine. So, 1000
machines can do a billion pages per month. More if 1 second over
estimates, less if 20 sentence/document underestimates."
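The arithmetic in the quote is easy to check with a few lines of
code. This is just a sketch that plugs in the quoted figures; the
function name `pages_per_month` and the choice of a 30-day month
are my own assumptions. With an exact 30-day month the result
comes out slightly above the rounded "1M pages per month per
machine."

```python
# Seconds in a 30-day month: 2,592,000, i.e. roughly 2.5M.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def pages_per_month(machines, cores_per_machine=8,
                    seconds_per_sentence=1.0, sentences_per_page=20):
    """Estimate pages parsed per month from the quoted assumptions."""
    cpu_seconds = machines * cores_per_machine * SECONDS_PER_MONTH
    sentences = cpu_seconds / seconds_per_sentence
    return sentences / sentences_per_page


one_machine = pages_per_month(1)
fleet = pages_per_month(1000)
print(f"{one_machine:,.0f}")  # 1,036,800
print(f"{fleet:,.0f}")        # 1,036,800,000
```

Changing `seconds_per_sentence` or `sentences_per_page` shows how
sensitive the estimate is to those two guesses, which is the
caveat at the end of the quote.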
So parsing at this scale looks economically feasible, if there is
a need. And that remains the question: is it economically viable
and necessary to try to find the information in the long tail?