Making Information More Accessible
Thanks to high-speed networks
and the Internet, there's a wealth of information available to anyone with
access to a computer, but it's not always easy to figure out how best to
retrieve that information and use it. This was the impetus behind a KDI-funded
project called Universal Information Access: Translingual Retrieval,
Summarization, Tracking, Detection and Validation. As Dr. Yiming Yang, the
project's principal investigator, puts it, "Once you have data, you're curious
about how people can pull information from it."
Dr. Yang is a professor in the Language Technologies
Institute and the Computer Science Department of the School of Computer Science
at Carnegie Mellon University. Her co-PIs on the project were colleagues at
Carnegie Mellon, Dr. Eric Nyberg, Dr. John Lafferty, and Dr. Jaime Carbonell,
and they were assisted by postdocs and graduate students. The expertise of this
team of researchers spanned the fields of information retrieval, machine
learning, machine translation, computational linguistics, software engineering,
and human-computer interaction.
The team established five challenging objectives. The first
was translingual information retrieval (the accessing of documents
across both language barriers and "jargon barriers"). Second was multi-level
summarization, customized to the user's information needs.
Third came automated text categorization, which is a
way of putting documents into categories to make it easier for the user to
navigate them. Fourth was detection and tracking of new topics and
events of interest to each user as they unfold. (For example, the team built a
system that alerts a user who follows a specific topic whenever a news item
about that topic appears. If the user selects that item, the system
automatically tracks follow-up reports.)
The last area of research was information validation,
which ensures that sources are reliable and that there is consistency among
sources.
During the project, each researcher shared information about
his or her own specialization. In Dr. Yang's case, that specialty is machine
learning algorithms. She explains that there are supervised and unsupervised
approaches to machine learning.
Supervised learning takes place when, for example,
you visit a news site on the Internet and choose two or three documents to
read. The system will learn the keywords, or informative phrases, that were
used in those stories. It will then build a profile of these keywords, and it
can decide, when more news stories appear, whether they relate to a topic of
interest to you.
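To make the idea concrete, here is a minimal sketch in Python of how such a keyword profile might work. It is an illustration only, not the project's actual system: the function names, the small stopword list, the `top_k` cutoff, and the 0.3 overlap threshold are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "at"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_profile(read_stories, top_k=10):
    """Collect the most frequent words across the stories the user chose to read."""
    counts = Counter()
    for story in read_stories:
        counts.update(tokenize(story))
    return {word for word, _ in counts.most_common(top_k)}

def is_of_interest(story, profile, threshold=0.3):
    """Flag a new story when enough of its distinct words appear in the profile."""
    words = set(tokenize(story))
    if not words:
        return False
    overlap = len(words & profile) / len(words)
    return overlap >= threshold
```

The stories a reader selects act as the labeled examples. Each new story is then compared against the resulting profile, and only stories that share a sufficient fraction of their words with it are flagged.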
Unsupervised learning is used to find similarities among
documents. A system can compare the words in documents, and if many similar
words are found, the documents are considered to be similar. In this way, the
system can cluster documents into main events during a certain time period
(such as a plane crash or an economic crisis in a particular country).
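A rough sketch of this kind of clustering, written in Python, appears below. It assumes a simple word-count representation, cosine similarity as the measure of word overlap, and a single-pass grouping strategy with a 0.3 threshold. These choices are illustrative assumptions, not a description of the algorithms the team used.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 when either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster(documents, threshold=0.3):
    """Single-pass clustering: each document joins the most similar existing
    cluster if the similarity reaches the threshold, otherwise it starts a new one."""
    clusters = []  # each cluster holds summed word counts and member indices
    for index, doc in enumerate(documents):
        vec = word_counts(doc)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = cosine(vec, candidate["centroid"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [index]})
        else:
            best["centroid"].update(vec)
            best["members"].append(index)
    return [c["members"] for c in clusters]
```

No labels are supplied here. Documents that share many words end up grouped together, so reports about the same event tend to fall into the same cluster.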
The team's work has many practical applications. For
example, on an Internet search engine, such as Google or Yahoo, documents are
listed in categories. To achieve that, these companies employ people to
manually put Web pages into their corresponding categories. But this is both
costly and time-consuming. "What we wanted," says Dr. Yang, "was to have a
system automatically assign documents to a category in a highly accurate
fashion. If you have a supervised learning algorithm, you need a training set
of labels for documents to build a prototype of each class. If I can take the
current documents listed in Yahoo as a training set, then I can train one of
those algorithms. The result will produce a classifier that has a model for all
the categories. The classifier will be automatically able to predict the
category for new documents."
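The approach Dr. Yang describes can be sketched in a few lines of Python. The version below is a deliberately small, hypothetical illustration: it builds one word-count prototype per category from labeled documents and assigns a new document to the category whose prototype it most resembles. The function names and the use of cosine similarity are assumptions made for the example, and the sketch says nothing about the algorithms the team actually evaluated.

```python
import math
import re
from collections import Counter, defaultdict

def word_counts(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 when either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def train(labeled_docs):
    """Build one prototype per category by summing the word counts
    of that category's training documents."""
    prototypes = defaultdict(Counter)
    for text, category in labeled_docs:
        prototypes[category].update(word_counts(text))
    return dict(prototypes)

def predict(text, prototypes):
    """Return the category whose prototype is most similar to the document.
    If nothing matches, the first category is returned by default."""
    vec = word_counts(text)
    return max(prototypes, key=lambda category: cosine(vec, prototypes[category]))
```

In this sketch the training set plays the role of the manually categorized pages: once the prototypes are built, new documents are categorized without further human labeling.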
"The difficult part is that most algorithms are not on the
scale of Yahoo," says Dr. Yang. "The next challenge is how to produce that kind
of algorithm."