Data Mining and Homeland Security Applications
January 24, 2003
"Data mining" refers to the automatic extraction of underlying patterns or connections implicitly contained in huge corporate or government databases or data collections accumulated from sources such as Web pages or television broadcasts. Data mining has emerged as a topic of great interest for a variety of homeland security applications.
NSF's Computer and Information Sciences and Engineering directorate has long sponsored research in data mining and related topics. Current programs that support projects studying aspects of data mining include Data and Applications Security, Information and Data Management, Digital Libraries, Digital Government, and Information Technology Research.
In 2002, the government intelligence community provided $6 million to supplement existing NSF research into data mining, with comparable funding likely for the next two years. A workshop was held in December 2001 to identify projects, programs and new research directions. From an initial pool of more than 40 potential projects, 14 were chosen to receive supplementary funding as part of the cooperative Knowledge Discovery and Dissemination (KDD) venture. More on the announcement is available at http://www.nsf.gov/od/lpa/news/02/pr0264.htm, and brief descriptions of the KDD projects are provided below.
Total Data Mining Systems
COPLINK Law Enforcement Testbed
Initiated in 1999, the COPLINK project, led by Hsinchun Chen (firstname.lastname@example.org) at the University of Arizona, is addressing the information sharing problems facing many government agencies, which stem from incompatible data content and formats. In addition, COPLINK is studying the social and cultural changes needed to maximize the impact of an agency's investments in information management. As an extension to COPLINK, Chen is providing data from the Phoenix and Tucson police departments, scrubbed of any features identifying a real person, for intelligence analysis research.
Visual Interaction with Multi-Lingual Data Sources
Gregory Newby (email@example.com), at the University of North Carolina, is building a system for data mining that brings to bear the current best methods for text retrieval and filtering, cross-language information retrieval, data visualization, and interaction environments. The system will integrate a large number of text-based data streams in English, Arabic, and Chinese and provide a visual environment in which human analysts can set up profiles for scanning incoming data and assessing trends, anomalies and new events.
Infrastructure for Data Mining and Analysis
Hector Garcia-Molina (firstname.lastname@example.org) and colleagues at Stanford University are building an integrated system that turns diverse data sources into a unified and understandable information resource. The system will extract data from Web sites, analyze and summarize real-time data streams on the fly, correlate and store the collected data, and notify decision makers when important events are detected. This project builds on efforts at Stanford to help individuals make better use of the information on the Web.
Generalized Data Mining Toolkit
Mohammed J. Zaki (email@example.com), at Rensselaer Polytechnic Institute, is developing a general toolkit for tackling a wide range of data mining tasks. Given massive, complex data sets, Zaki's toolkit will scour the data for patterns that might be worth pursuing. The toolkit will identify statistical correlations, rare or abnormal events and long-range, subtle relationships. The extracted patterns will be relayed to human analysts for follow-up action.
Wading through Data Streams
Tools for Monitoring Online Information Sources
Columbia University's Newsblaster system automatically collects, clusters, categorizes, and summarizes news from several sites on the Web. Kathleen McKeown (firstname.lastname@example.org) and Julia Hirschberg at Columbia will lead efforts to extend Newsblaster to online sources containing informal speech, including e-mail messages, chat rooms, and voicemail. As Newsblaster currently does with news, the tools will track events over time and automatically summarize information on a particular event, highlighting agreed-upon details, conflicting facts, and unexpected information.
Monitoring Message Streams to Detect Events
Fred Roberts (email@example.com), at Rutgers University, is leading an effort to establish the groundwork for a state-of-the-art scheme to classify and profile documents and messages bearing on specific events or themes. Starting with human assistance to relate existing messages and documents to known events or themes, the system will "learn" how to recognize incoming messages associated with these topics. This application builds on previous work in classifying and detecting patterns of messages in news articles and other text databases and is related to Dr. Roberts' project on classifying and detecting patterns of disease outbreaks from epidemiology databases.
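The idea of learning from human-labeled messages and then recognizing new ones can be illustrated with a small sketch. This is not the Rutgers system; it substitutes a textbook technique (multinomial naive Bayes with add-one smoothing), and the class name, topic labels, and sample messages are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenization; a real system would do far more.
    return text.lower().split()


class TopicClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word frequencies
        self.doc_counts = Counter()              # topic -> number of messages
        self.vocab = set()

    def train(self, text, topic):
        # A human analyst supplies the topic label for each training message.
        words = tokenize(text)
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, text):
        # Return the topic with the highest log posterior score.
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic, n_docs in self.doc_counts.items():
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Made-up training messages, labeled by hand.
clf = TopicClassifier()
clf.train("flu cases rise in city hospitals", "outbreak")
clf.train("new fever cluster reported by clinic", "outbreak")
clf.train("port closed after storm damage", "weather")
clf.train("heavy storm floods coastal roads", "weather")

label = clf.classify("clinic reports more flu cases")
print(label)
```

With more labeled examples per topic, the same structure extends to any number of events or themes; the smoothing keeps words unseen in training from zeroing out a topic's score.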
Distributed Mining and Monitoring of High-Speed Data Streams
Data in homeland security and intelligence applications arrives in high-speed data streams, and the data must be processed and mined as it comes in, which presents challenges for existing data-management and data-mining methods. Johannes Gehrke (firstname.lastname@example.org), at Cornell University, is leading an effort to mine real-time data streams by processing incoming data on the fly, identifying potential new "events" continuously and updating currently identified events. These techniques could also be applied to monitoring network data, Web site click-stream data, and biological databases.
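Processing a stream "on the fly" means each item is examined once, without storing the full history. The sketch below is only an illustration of that constraint, not the Cornell project's method: it uses Welford's one-pass algorithm for running mean and variance and flags values that deviate sharply from what has been seen so far. The class name, threshold, warm-up length, and sample counts are assumptions.

```python
import math


class StreamMonitor:
    """One-pass anomaly flagging using Welford's running mean and variance."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # hypothetical z-score cutoff
        self.warmup = warmup        # observations required before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, value):
        # Decide whether the value is unusual relative to the history so far.
        is_event = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                is_event = True
        # Update the running statistics without storing past values.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return is_event


monitor = StreamMonitor()
counts = [10, 12, 11, 9, 10, 11, 48, 10]  # made-up messages per minute
flags = [monitor.observe(c) for c in counts]
print(flags)
```

Note that a flagged value is still folded into the running statistics, which widens the variance afterward; a production system would likely handle detected events separately.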
Data Mining of Event and Text-Based Data Streams
Padhraic Smyth (email@example.com) and colleagues at the University of California, Irvine, are developing data mining algorithms for automatically creating profiles of "entities" from data streams of text and events. Adaptive statistical techniques create models of individual entities such as Web surfers, learning both a user's interests and the user's patterns over time. The work extends previous data mining research by tracking events and text streams over time, extracting information relevant to entities of interest and updating entity models. The goal of this research is to develop a suite of data exploration tools that analysts can use to browse, monitor, query and understand massive event and text-based data sets.
Understanding What Is Said
Statistical Methods for Understanding "Who Did What to Whom"
Data mining applications, and computers generally, have a difficult time understanding what the language in a message actually means. Dan Jurafsky (firstname.lastname@example.org) and colleagues at the University of Colorado are developing statistics-based techniques to produce standardized "who did what to whom, where and when" versions of messages written in English, Chinese and Arabic. Such methods could improve the accuracy of message summaries, the ability to answer questions, and the ability to link messages in different languages or from different sources.
Mining Multilingual Resources using Text Analytics
Finding the "right" information online has become more difficult as the amount of text has grown dramatically. Salim Roukos (email@example.com), at IBM's T.J. Watson Research Center, is leading an effort to improve computers' "common sense" in understanding the meaning of documents written in several languages. This project is developing new technology to automatically extract entities (people, places, and things of interest) and relationships among them and to form "semantic networks" linking these elements, so computers can perform a more sophisticated analysis.
Identifying Patterns for Intelligence Analysis
A central activity of intelligence analysis is identifying patterns in records about interrelated people, places, things and events. David Jensen (firstname.lastname@example.org) and colleagues at the University of Massachusetts, Amherst, are developing a class of methods that can learn more accurate predictive patterns from fewer data points than existing methods can. The new methods will use statistical techniques customized for intelligence data. Accurate statistical inferences will help intelligence analysts interpret more subtle indicators and produce earlier warnings about emerging threats.
Cross-Language Event Detection and Tracking
Yiming Yang (email@example.com) and colleagues at Carnegie Mellon University are combining and enhancing techniques in Web mining, cross-language information retrieval and topic detection and tracking (TDT) in an integrated framework to support automated detection and tracking of events from news stories in Arabic, Chinese, German and English. This system bridges language barriers by acquiring multilingual documents from sources on the Web and using those documents to train statistical TDT algorithms automatically. The system also organizes data by topic, genre and language to support intelligent use of those data in system optimization and for user interaction. Advances in these areas will benefit any discipline that must cope with large volumes of information, such as scientific research, crisis management, and international business.
Audio and Video Sources
Knowledge Discovery from Continuous Video Sources
Vast amounts of surveillance video, broadcast television and on-line multimedia overwhelm the analysts in the worldwide intelligence community who must watch and listen to it all. Howard Wactlar (firstname.lastname@example.org) and colleagues at Carnegie Mellon University are developing technologies to automatically detect and correlate features such as spoken names, pictured faces and referenced locations from the imagery and audio of video sources. This project builds on the Informedia Digital Video Library effort and related research in automated video and sensor analysis, indexing and search.
Talk Prints: Recognizing Speakers by the Way They Talk
Conventional "voice print" techniques identify speakers by acoustic cues related to a person's vocal physiology. Elizabeth Shriberg (email@example.com) and colleagues at SRI International are working to improve conventional voice print techniques by adding cues from the way a person talks, including his or her choice of words, speaking rate, inflections, pauses and even the person's tendency to interrupt others. Such "talk prints" could also help distinguish between casual chats and more formal conversations, or help identify behavior such as changes in a speaker's emotional state.
David Hart, NSF, (703) 292-8070, email: firstname.lastname@example.org
Gary Strong, NSF, (703) 292-8980, email: email@example.com
The National Science Foundation (NSF) is an independent federal agency that supports fundamental research and education across all fields of science and engineering. In fiscal year (FY) 2018, its budget is $7.8 billion. NSF funds reach all 50 states through grants to nearly 2,000 colleges, universities and other institutions. Each year, NSF receives more than 50,000 competitive proposals for funding and makes about 12,000 new funding awards.
Useful NSF Web Sites:
NSF Home Page: https://www.nsf.gov
NSF News: https://www.nsf.gov/news/
For the News Media: https://www.nsf.gov/news/newsroom.jsp
Science and Engineering Statistics: https://www.nsf.gov/statistics/
Awards Searches: https://www.nsf.gov/awardsearch/