Multi-Disciplinary Research at NSF: Accomplishments of the KDI Initiative

Making Information More Accessible

Thanks to high-speed networks and the Internet, there's a wealth of information available to anyone with access to a computer, but it's not always easy to figure out how best to retrieve that information and use it. This was the impetus behind a KDI-funded project called Universal Information Access: Translingual Retrieval, Summarization, Tracking, Detection and Validation. As Dr. Yiming Yang, the project's principal investigator, puts it, "Once you have data, you're curious about how people can pull information from it."

Dr. Yang is a professor in the Language Technologies Institute and the Computer Science Department at the School of Computer Science of Carnegie Mellon University. Her co-PIs on the project were Carnegie Mellon colleagues Dr. Eric Nyberg, Dr. John Lafferty, and Dr. Jaime Carbonell, and they were assisted by postdocs and graduate students. The team's expertise spanned information retrieval, machine learning, machine translation, computational linguistics, software engineering, and human-computer interaction.

The team established five challenging objectives. The first was translingual information retrieval (accessing documents across both language barriers and "jargon barriers"). The second was multi-level summarization, customized to the user's information needs.

Third came automated text categorization, which sorts documents into categories to make them easier for the user to navigate. Fourth was detection and tracking of new topics and events of interest to each user as they unfold. (For example, the team built a system that alerts a user who is interested in a specific topic when a news item about that topic appears. If the user selects that item, the system automatically tracks follow-up reports.)

The last area of research was information validation, which ensures that sources are reliable and that there is consistency among sources.

During the project, each researcher shared information about his or her own specialization. In Dr. Yang's case, that specialty is machine learning algorithms. She explains that there are supervised and unsupervised approaches to machine learning.

Supervised learning takes place when, for example, you visit a news site on the Internet and choose two or three documents to read. The system learns the keywords, or informative phrases, used in those stories and builds a profile from them. When more news stories appear, it can use that profile to decide whether they relate to a topic of interest to you.
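The keyword-profile idea can be illustrated with a minimal sketch. It is not the project's actual system: the stopword list, the function names (build_profile, relevance, is_of_interest), the top_k cutoff, and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "for", "in", "on", "to", "and",
             "is", "are", "was", "were", "by", "after", "at", "with"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_profile(chosen_docs, top_k=10):
    """Collect the most frequent words from the stories the user chose to read."""
    counts = Counter()
    for doc in chosen_docs:
        counts.update(tokenize(doc))
    return {word for word, _ in counts.most_common(top_k)}

def relevance(profile, doc):
    """Fraction of the document's words that appear in the profile."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in profile) / len(tokens)

def is_of_interest(profile, doc, threshold=0.2):
    """Decide whether a new story matches the user's profile."""
    return relevance(profile, doc) >= threshold

if __name__ == "__main__":
    chosen = [
        "Investigators search for the cause of the plane crash",
        "Plane crash investigators recover the flight recorder",
    ]
    profile = build_profile(chosen)
    for story in ["Flight recorder found by investigators after the crash",
                  "Local team wins the soccer championship"]:
        print(story, "->", is_of_interest(profile, story))
```

A real system would likely weight words by how informative they are rather than by raw frequency; the sketch uses plain counts only to keep the idea visible.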

Unsupervised learning is used to find similarities among documents. A system can compare the words in documents, and if many similar words are found, the documents are considered to be similar. In this way, the system can cluster documents into main events during a certain time period (such as a plane crash or an economic crisis in a particular country).
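As a rough illustration of grouping documents by shared words, the sketch below compares word-count vectors with cosine similarity and assigns each document to the most similar existing cluster, or starts a new one. The single-pass strategy, the function names, and the 0.3 threshold are assumptions made for this example, not a description of the team's method.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster_documents(docs, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one.

    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for i, doc in enumerate(docs):
        vec = word_counts(doc)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine_similarity(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # fold the new document into the cluster
            best["members"].append(i)
    return [c["members"] for c in clusters]

if __name__ == "__main__":
    stories = [
        "plane crash investigators search wreckage",
        "currency crisis deepens as markets fall",
        "investigators recover recorder from plane crash wreckage",
        "markets fall again amid currency crisis",
    ]
    print(cluster_documents(stories))
```

The threshold controls how readily stories merge into one event: a lower value produces fewer, broader clusters, and a higher value produces more, narrower ones.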

The team's work has many practical applications. For example, on an Internet search engine, such as Google or Yahoo, documents are listed in categories. To achieve that, these companies employ people to assign Web pages to their corresponding categories by hand, which is both costly and time-consuming. "What we wanted," says Dr. Yang, "was to have a system automatically assign documents to a category in a highly accurate fashion. If you have a supervised learning algorithm, you need a training set of labels for documents to build a prototype of each class. If I can take the current documents listed in Yahoo as a training set, then I can train one of those algorithms. The result will produce a classifier that has a model for all the categories. The classifier will be automatically able to predict the category for new documents."
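One simple way to picture "a prototype of each class" is a nearest-prototype classifier: sum the word counts of each category's training documents into one prototype, then assign a new document to the category whose prototype it most resembles. The sketch below is an illustrative assumption, not the algorithm the team used; the function names, the category labels, and the tiny training set are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def word_counts(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def train(labeled_docs):
    """Build one prototype (summed word counts) per category from (text, label) pairs."""
    prototypes = defaultdict(Counter)
    for text, category in labeled_docs:
        prototypes[category].update(word_counts(text))
    return dict(prototypes)

def predict(prototypes, text):
    """Return the category whose prototype is most similar to the new document."""
    vec = word_counts(text)
    return max(prototypes, key=lambda c: cosine_similarity(vec, prototypes[c]))

if __name__ == "__main__":
    # Hypothetical hand-labeled training set standing in for a Web directory.
    training_set = [
        ("stock markets rally as investors buy shares", "Business"),
        ("central bank raises interest rates", "Business"),
        ("team wins championship after overtime goal", "Sports"),
        ("star player scores twice in final match", "Sports"),
    ]
    model = train(training_set)
    print(predict(model, "investors watch interest rates and markets"))
    print(predict(model, "player scores late goal in match"))
```

A toy like this compares each new document against every prototype, so its cost grows with the number of categories, which hints at why scaling to a directory with a very large category hierarchy is a separate challenge.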

"The difficult part is that most algorithms are not on the scale of Yahoo," says Dr. Yang. "The next challenge is how to produce that kind of algorithm."


The National Science Foundation
4201 Wilson Boulevard, Arlington, Virginia 22230, USA
Tel: 703-292-5111, FIRS: 800-877-8339 | TDD: 703-292-5090