Award Abstract # 1447639
BIGDATA: F: DKA: Collaborative Research: Clustering Algorithms for Data Streams

NSF Org: IIS
Div Of Information & Intelligent Systems
Awardee: JOHNS HOPKINS UNIVERSITY, THE
Initial Amendment Date: August 26, 2014
Latest Amendment Date: August 26, 2014
Award Number: 1447639
Award Instrument: Standard Grant
Program Manager: Aidong Zhang
IIS
 Div Of Information & Intelligent Systems
CSE
 Direct For Computer & Info Scie & Enginr
Start Date: September 1, 2014
End Date: August 31, 2018 (Estimated)
Total Intended Award Amount: $1,000,000.00
Total Awarded Amount to Date: $1,000,000.00
Funds Obligated to Date: FY 2014 = $1,000,000.00
History of Investigator:
  • Vladimir  Braverman (Principal Investigator)
    vova@cs.jhu.edu
  • Alexander  Szalay (Co-Principal Investigator)
  • Randal  Burns (Co-Principal Investigator)
  • Tamas  Budavari (Co-Principal Investigator)
  • Benjamin  Van Durme (Co-Principal Investigator)
Awardee Sponsored Research Office: Johns Hopkins University
1101 E 33rd St
Baltimore
MD  US  21218-2686
(443)997-1898
Sponsor Congressional District: 07
Primary Place of Performance: Johns Hopkins University
3400 N CHARLES ST
Baltimore
MD  US  21218-2608
Primary Place of Performance
Congressional District:
07
DUNS ID: 001910777
Parent DUNS ID: 001910777
NSF Program(s): Big Data Science &Engineering
Primary Program Source: 040100 NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7433, 8083
Program Element Code(s): 8083
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This project will develop novel theoretical methods and algorithms for clustering massive datasets with applications to astronomy, neuroscience and natural language processing. Clustering is the process of creating groups of data based on similarities between individual data points. The developed theoretical methods will be used in applications where clustering algorithms are critical and the input data is extremely large. First, new clustering algorithms will be designed to scale and will allow for better cosmological simulations. The simulations involve billions of particles in each snapshot, and existing clustering algorithms based upon a simple friends-of-friends approach do not scale to these cardinalities. Second, this project will advance the computational capabilities in statistical neuroscience by employing clustering algorithms to discover both regular patterns and anomalies in normal and abnormal brain graphs. Finally, this research will explore the important topic of finding anomalies in massive text streams, such as Twitter. In this setting, one is concerned with detecting anomalous bursts in traffic content that share a similar pattern. These bursts might signal an important political event or a natural disaster. This project will support undergraduate and graduate research aimed at developing skills needed for algorithmic work on massive data sets.

There exist numerous heuristics and approximation algorithms for many variants of the clustering problem. However, these methods are often slow or infeasible for applications with massive datasets. This research will improve space and time upper bounds for clustering algorithms in the streaming model. This project will address the k-mean and k-median problems in the dynamic streaming model, extend the results on separable data when the input comes from Euclidian space, improve the bounds in the sliding window model, combine the coresets technique with novel sampling approaches and the method of smooth histograms. The PIs' previous work has already been applied to natural language processing and this project will expand this direction further and explore the important topic of "First Story Detection." Furthermore, this research will explore the similarities and differences between various sampling and sketching techniques, and how they could be used in large multidimensional astronomical databases, like SDSS (Sloan Digital Sky Survey) SkyServer. These novel approaches will provide major speedups for the execution of large statistical aggregate queries. The new streaming algorithms will be used to find substructure in very large cosmological N-body simulations.

For further information see the project web site at: http://www.cs.jhu.edu/~vova

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 31)
Vladimir Braverman, Rafail Ostrovsky and Gregory Vorsanger "Weighted Sampling Without Replacement from Data Streams" Information Processing Letters , v.115 , 2015
Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, Lin F. Yang "Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors" Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016 , 2016 978-1-4503-4191-2
Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, David P. Woodruff "Beating CountSketch for heavy hitters in insertion streams." Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016 , 2016 978-1-4503-4132-5
Vladimir Braverman, Harry Lang, Keith Levin, Morteza Monemizadeh "Clustering on Sliding Windows in Polylogarithmic Space" 35th IARCS Annual Conference on Foundation of Software Technology and Theoretical Computer Science, FSTTCS 2015, , 2015 978-3-939897-97-2
Zaoxing Liu, Gregory Vorsanger, Vladimir Braverman, Vyas Sekar "Enabling a "RISC" Approach for Software-Defined Monitoring using Universal Streaming" HotNets-XIV Proceedings of the 14th ACM Workshop on Hot Topics in Networks , 2015 978-1-4503-4047-2
Vladimir Braverman, Stephen R. Chestnut: "Universal Sketches for the Frequency Negative Moments and Other Decreasing Streaming Sums." Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015 , 2015 978-3-939897-89-7
Jaroslaw Blasiok, Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, Lin F. Yang "Streaming Symmetric Norms via Measure Concentration" The 49th ACM Symposium on Theory of Computing (STOC 2017) , 2017
Vladimir Braverman, Alan Roytman, Gregory Vorsanger "Approximating Subadditive Hadamard Functions on Implicit Matrices" the 20th International Workshop on Randomization and Computation (RANDOM'2016) , 2015
Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, David P. Woodruff "BPTree: an ?2 heavy hitters algorithm using constant memory" The 36th ACM SIGMOD?SIGACT?SIGAI Symposium on PRINCIPLESOFDATABASESYSTEMS(PODS 2017) , 2017
Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, Vyas Sekar Vladimir Braverman "One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon" The ACM SIGCOMM 2016 , 2016
Vladimir Braverman and Elena Grigorescu and Harry Lang and David P. Woodruff and Samson Zhou "Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows" Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2018 , 2018
(Showing: 1 - 10 of 31)

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Summary:

The planned outcome of this research is new theoretical solutions for clustering massive data sets. In particular, we plan to improve clustering and sublinear algorithms in the streaming model of computation. 


Intellectual Merit:

This research effort resulted in new theoretical and practical solutions for handling massive data sets by significantly improving the performance of existing algorithms. Our results have been applied in numerous fields such as computer systems, networking, astronomy, differential privacy and machine learning. These results have been published and presented in top conferences include STOC, ICALP, SODA, ICML, NIPS, OSDI, SIGCOMM, EScience, Astronomy and Computing. 

 

Broader Impacts:

This research has been presented in top academic and industrial venues including MIT, Harvard, Princeton,CMU, UC Berkeley, Google, Facebook and many others. Our results on Software Defined Networking (UnivMon) has been presented in the 50th STOC (2018), also known as the TheoryFest.Our theoretical results for clustering have been presented at the Nexus of Information and Computation Theories Program at The Henri Poincare Institute (IHP), Paris, France in the Spring 2016. Our results for finding frequent elements in massive data have been presented at the "Chaining Methods and their Applications to Computer Science" conference at Harvard University in June 2016. 

 


Last Modified: 11/30/2018
Modified by: Vladimir M Braverman

Please report errors in award information by writing to: awardsearch@nsf.gov.

Print this page

Back to Top of page