
Big Data Regional Innovation Hubs and Spokes Workshop #BDHubs

Held in conjunction with the 31st IEEE International Parallel and Distributed Processing Symposium

Buena Vista Palace Hotel
Orlando, Florida USA
June 2, 2017

The ability to access, analyze, and extract insight from vast amounts of data has driven innovation in several areas of science, medicine, and industry. However, an ongoing challenge has been how to apply insights gained from data-intensive research in one sector (be it academia, industry, government, etc.) to other sectors and wider audiences. In addition, there are several challenges specific to certain regions of the country, which could benefit from frequent face-to-face collaboration.

To address these issues, the National Science Foundation (NSF) recently awarded $6 million to establish four regional Hubs across the nation [1] which focus on data science innovation and partnership building. The consortia are coordinated from: University of Illinois at Urbana-Champaign (Midwest Hub) [2]; Columbia University (Northeast Hub) [3]; Georgia Institute of Technology and the University of North Carolina (South Hub) [4]; and the University of California, San Diego, the University of California, Berkeley, and the University of Washington (West Hub) [5]. These Hubs consist of over 250 partners from all 50 states. Collaborators include researchers from academia, industry, non-profits, and various local, tribal, state, and federal government groups.

During the fall of 2015, NSF released a solicitation [6] calling for Big Data Spoke projects (BD Spokes) and planning grants that will work in concert with the BD Hubs to achieve a specific mission. The activities of the BD Spokes will be guided by the themes of: (1) accelerating progress towards addressing societal grand challenges relevant to regional and national priority areas; (2) helping automate the Big Data lifecycle; and (3) enabling access to and increasing use of important and valuable available data assets. In September 2016, NSF awarded ten BD Spoke projects and ten planning grants through this solicitation, totaling $10 million in funding.

This full-day event will allow interested parties to learn more about the BD Hubs and Spokes, especially how to become involved in their activities. During the opening panel, the executive directors of each BD Hub will introduce the activities and opportunities for their region. The main part of the workshop will consist of presentations from select BD Spoke and planning grant investigators, which will highlight the current research-related activities facilitated by the BD Hubs.


[1] "Establishing A Brain Trust For Data Science (National Science Foundation)", 2016.

[2] "Midwest Big Data Hub", 2016.

[3] "North East Big Data Hub", 2016.

[4] "South Big Data Hub", 2016.

[5] "West Big Data Innovation Hub", 2016.

[6] "Big Data Regional Innovation Hubs: Establishing Spokes To Advance Big Data Applications (BD Spokes)", 2016.

Workshop Co-chairs

Chaitan Baru
Directorate for Computer & Information Science & Engineering (CISE)
National Science Foundation
Arlington, Virginia, USA

Fen Zhao
Directorate for Computer & Information Science & Engineering (CISE)
National Science Foundation
Arlington, Virginia, USA

Joanna Chan
AAAS Science & Technology Policy Fellow
On assignment to the National Science Foundation CISE

Organizing Committee

Melissa Cragin
Midwest Big Data Hub

René Bastón
Northeast Big Data Hub

Renata Rawlings-Goss
South Big Data Hub

Lea Shanley
South Big Data Hub

Meredith Lee
West Big Data Hub

About this Workshop

Each BD Hub has fostered stakeholder engagement in a variety of themes, allowing for broad appeal to IPDPS attendees. Each Hub supports multiple projects via Letters of Collaboration on thematic areas. These projects (BD Spokes and Planning projects) are listed below:

Midwest BD Hub

SPOKE: Advanced Computational Neuroscience Network (ACNN)
Richard Gonzalez, University of Michigan at Ann Arbor
Dhabaleswar Panda, Ohio State University
Satya Sahoo, Case Western Reserve University
Franco Pestilli, Indiana University
SPOKE: Digital Agriculture - Unmanned Aircraft Systems, Plant Sciences and Education
Grant McGimpsey, University of North Dakota Main Campus
PLANNING: Big Data Innovations for Bridge Health
Robin Gandhi, University of Nebraska at Omaha
PLANNING: Cyberinfrastructure to Enhance Data Quality and Support Reproducible Results in Sensor Originated Big Data
Elisa Bertino, Purdue University
PLANNING: Networked Resilience of Communities Facing Natural and Social Emergencies
Marshall Poole, University of Illinois at Urbana-Champaign

Northeast BD Hub

SPOKE: A Licensing Model and Ecosystem for Data Sharing
Samuel Madden, Massachusetts Institute of Technology
Tim Kraska, Brown University
Jane Greenberg, Drexel University
SPOKE: Grand Challenges for Data-Driven Education
Beverly Woolf, University of Massachusetts at Amherst
Ivon Arroyo, Worcester Polytechnic Institute
Ryan Baker, Columbia University Teachers College
SPOKE: Integration of environmental factors and causal reasoning approaches for large-scale observational health research
Chirag Patel, Harvard University
Gregory Cooper, University of Pittsburgh
Vasant Honavar, Pennsylvania State University at University Park
Noemie Elhadad, Columbia University
PLANNING: Big Data Literacy: Building Capacity for Regional Collaboration in Closing the Big Data Divide
Stephen Uzzo, New York Hall of Science
PLANNING: Cross-organization Big Data Cyber Attack Awareness
John Yen, Pennsylvania State University at University Park
PLANNING: Planning for Privacy and Security in Big Data
Rebecca Wright, Rutgers University
Adam Smith, Pennsylvania State University
PLANNING: Partnerships for Energy Cycle Innovation through Big Data (PPEID)
Abani Patra, State University of New York at Buffalo

South BD Hub

SPOKE: Large-scale Medical Informatics for Patient Care Coordination and Engagement
Gari Clifford, Emory University
SPOKE: Smart Grids Big Data
Mladen Kezunovic, Texas A&M Engineering Experiment Station
Zoran Obradovic, Temple University
Santiago Grijalva, Georgia Tech Research Corporation
SPOKE: Using Big Data for Environmental Sustainability: Big Data + AI Technology = Accessible, Usable, Useful Knowledge!
Ashok Goel, Georgia Tech Research Corporation
Jennifer Hammock, Smithsonian Institution
PLANNING: Rare Disease Observatory
Rada Chirkova, North Carolina State University
Bruce Cairns, University of North Carolina at Chapel Hill

West BD Hub

SPOKE: Accelerating and Catalyzing Reproducibility in Scientific Computation and Data Synthesis
Michael Barton, Arizona State University
SPOKE: MetroInsight: Knowledge Discovery and Real-time Interventions from Sensory Data Flows in Urban Spaces
Rajesh Gupta, University of California at San Diego
Mani Srivastava, University of California at Los Angeles
Shade Shutters, Arizona State University
PLANNING: BD for Policing in the western United States
Eric Lindquist, Boise State University
PLANNING: Increasing collaborations in proteogenomics applications of genetic data
Eric Deutsch, Institute for Systems Biology
Andreas Prlic, University of California at San Diego

Meeting Agenda

8:00 AM Welcome Remarks
8:10 AM Introductory Remarks
Big Data Spokes 2017-2018
Fen Zhao, National Science Foundation
Harnessing the Data Revolution
Chaitan Baru, National Science Foundation
8:30 AM Keynote Address

Predictive Analytics using AWS
Sanjay Padhi, AWS Research

Dr. Sanjay Padhi leads the AWS Research Initiatives, including AWS’s federal initiatives with the National Science Foundation. Dr. Padhi has more than 15 years of experience in large-scale distributed computing, data analytics, and machine learning. He is the co-creator of the Workload Management System, currently used for all the data processing and simulation activities by CMS, one of the largest experiments in the world at CERN, consisting of more than 180 institutions across 40 countries. He also co-founded the ZEUS Computing Grid project at Deutsches Elektronen-Synchrotron (DESY), Germany, before joining CERN. Sanjay obtained his Ph.D. in High Energy Physics from McGill University and is also currently appointed by the Dean of Faculty as an Adjunct Associate Professor of Physics at Brown University.

Using Cloud to harness the Fourth Paradigm
Vani Mandava, Microsoft Research

Vani Mandava is Director of Data Science Outreach at Microsoft Research. As a Principal Program Manager, she has over a decade of experience designing and shipping software projects and features that are in use by millions of users across the world. Her role in Microsoft Research is to enable academic researchers and institutions to develop technologies that fuel data-intensive scientific research using advanced techniques in data management and data mining, especially by leveraging Microsoft’s cloud platform through the Azure for Research program. She has enabled the adoption of data mining best practices in various v1 products across Microsoft client, server and services in the MS-Office, SharePoint and Online Services (Bing Ads) organizations. She co-authored the book ‘Developing Solutions with InfoPath’ and co-chaired the ACM KDD Cup in 2013.

Scientific Computing with the Google Compute Platform
Karan Bhatia, Google

Karan Bhatia, Ph.D., is a Cloud Specialist for Google Cloud Platform and leads the efforts around Scientific Computing. He is an expert in distributed systems with a Ph.D. and M.S. from the University of California, San Diego, and a B.S. in EECS from the University of California, Berkeley. He currently resides in New Jersey.

10:00 AM Coffee Break
10:30 AM Panel of BD Hub Executive Directors (brief presentations followed by Q&A)
René Bastón, Columbia University & Northeast Big Data Hub
Melissa Cragin, University of Illinois at Urbana-Champaign & Midwest Big Data Hub
Meredith Lee, University of California at Berkeley & West Big Data Hub
Renata Rawlings-Goss, Georgia Institute of Technology & South Big Data Hub
Lea Shanley, University of North Carolina at Chapel Hill & South Big Data Hub
11:45 AM A Sociotechnical Investigation of BDHubs
Steve Slota, University of California at Irvine

Stephen Slota is a Post-Doctoral Scholar from the University of California at Irvine. He studies infrastructures of knowledge production with a special focus on science studies and policy.

12:15 PM Lunch
1:30 PM BD Spoke and Planning Grant Presentations – Part 1 (four @ 20 min each, 10 for talk, 10 for Q&A)
  1. Large-scale Medical Informatics for Patient Care Coordination and Engagement
    Gari Clifford, Emory University
  2. Integration of Environmental Factors and Causal Reasoning Approaches for Large-scale Observational Health Research
    Vasant Honavar, Pennsylvania State University
    Chirag Patel, Harvard Medical School
  3. Big Data Innovations for Bridge Health
    Robin Gandhi, University of Nebraska at Omaha

    Bridges across the U.S. continue to deteriorate at an alarming rate, and the American Society of Civil Engineers estimates a cost of over $76 billion to improve the country's functionally obsolete or structurally deficient bridges. This indicates a significant demand for innovative bridge health monitoring solutions that can strategically guide management, maintenance and replacement programs without risking public safety. Unfortunately, the need to improve how our bridges are managed and repaired or replaced faces the same issues and demands as the rest of the U.S. transportation network: continuously shrinking resources and governing bodies who do not have the necessary insights from bridge health data to find a workable solution. To discuss how to address these critical problems, researchers, practitioners, and individuals representing public and private sectors (transportation infrastructure and built environment owners, operators, designers and maintainers) convened with Big Data technology and analytics experts at the inaugural BRIDGE-ing Big Data Workshop hosted by the University of Nebraska in October 2015. During this workshop, it became clear that Big Data technology could assist with providing a timely solution. It was also apparent that past efforts to use bridge health monitoring and big data techniques as part of the management and maintenance/replacement processes are fragmented, and that the resulting datasets are not deemed trustworthy and are under-utilized.

    Since this inaugural workshop, this project has hosted another workshop in 2016, conducted surveys of bridge asset owners, and engaged in the cataloging of datasets including sources, copyrights, license, collection procedures, and expected access controls from private sector, academia, and government agencies. We are currently in the process of obtaining commitments from stakeholders and hosting collaboration workshops with small working groups to discuss, import/export, and share bridge structural health monitoring data. We also have pilot projects underway to ease the transition of bridge health data into big data pipelines using collaborations with NSF DIBBs and SAVI projects. Finally, we welcome proposals from businesses/researchers to develop innovative applications that integrate disparate and voluminous data sources. We are collaborating with the Midwest Big Data Hub transportation spoke to potentially inform similar activities for highways, buildings, power distribution networks and other civil infrastructure entities.

  4. Grand Challenges for Data-Driven Education
    Jaclyn Ocumpaugh, Teacher’s College, University of Pennsylvania

    Computers have been in classrooms for decades and yet educators have not identified the most effective ways of using them. Despite advances in evaluation methods to measure human learning, most teachers and researchers still use evaluation measures first described 50-100 years ago. This NSF Big Data Spoke project helps teachers, administrators and researchers collaborate around extensive online education resources and big data. We leverage and extend state-of-the-art big databases and technologies to measure online learning, especially features of student engagement and learning associated with improved student outcomes. As more refined data becomes available from online instructional systems and the use of data mining techniques, NSF Big Data Spoke participants will learn to search for patterns and associations from online data and to draw conclusions about student knowledge, performance and behavior. This research addresses several grand challenges in education: 1) Predict future student events, e.g., college attendance, college major, from existing large-scale longitudinal educational data sets involving the same thousands of students. 2) Help teachers to make sense of dense online data to influence their teaching, e.g., what should they say or do in response to student activity. 3) Provide personal instruction to each student based on using big data that represents student skills and behavior and infers students’ cognitive, motivational, and metacognitive factors in learning. We will improve the capacity of data-driven education by sharing educational databases, managing yearly data competitions, and conducting educational data science workshops. Key outcomes include introducing many researchers to educational big data, learning analytics and models of teaching interventions. The project will improve classroom learning and leverage the unique types of data available from digital education to better understand students, groups and the settings in which they learn.

2:50 PM Coffee Break
3:20 PM BD Spoke and Planning Grant Presentations – Part 2 (five @ 20 min each, 10 for talk, 10 for Q&A)
  1. Digital Agriculture - Unmanned Aircraft Systems, Plant Sciences and Education
    Travis Desell, University of North Dakota
  2. Advanced Computational Neuroscience Network (ACNN) – Accelerated Processing of Big Neuroscience Data: An Update from the Midwest ACNN Spoke
    Dhabaleswar K (DK) Panda, Ohio State University

    We will present an overview of the two major computational approaches being explored within the Spoke to accelerate processing of Big Neuroscience Data. The first approach focuses on accelerated processing of a single brain dataset on multi-core platforms. The second approach focuses on a distributed workflow paradigm to accelerate processing of multiple brain datasets. These approaches and their benefits on different brain datasets and HPC platforms will be presented. The talk will conclude with an overview of the other activities taking place within the Spoke and our future plans.

  3. Rare Disease Observatory
    Michael Kowolenko, North Carolina State University

    The current challenge facing the patient with a rare disease is the establishment of the proper diagnosis. The lower the prevalence of the disease, the more difficult it is for the physician to recognize that a rare disease is present if the symptoms presented are mimicked in more common disease states. The objective of the project is to develop an Intelligence Augmentation classification system based on accepted diagnostic criteria for the rare disease Primary Ciliary Dyskinesia (PCD). This pilot will focus on the use of both structured and unstructured data analytic systems that combine quantitative and qualitative analytics to augment decision making by the healthcare provider. A series of rules-based and machine-learning systems are deployed against a collection of medical records. The returns are collated and expressed as vector values for each patient. Scores are calculated and returned to the physician for determining whether patients should undergo testing. Our current activities are focused on training the machine to differentiate between several common respiratory conditions and PCD. The systems deployed make use of open-source software programs, a distributed computer infrastructure and various aspects of both rules-based and machine-learning systems. The system is being designed to be data agnostic and to perform with a high degree of efficiency, utilizing distributed compute models. The system is readily expandable, with the ability to adjust to the size of the data aggregated.

  4. MetroInsight: Knowledge Discovery and Real-time Interventions from Sensory Data Flows in Urban Spaces
    Ilkay Altintas, University of California at San Diego
    Shade Shutters, Arizona State University
  5. Cross-organization Big Data Cyber Attack Awareness
    Peng Liu, Pennsylvania State University

    Today, network security analysts dive into big data (e.g., NetFlow data) hoping to gain near-real-time attack awareness. Since the same cyberattack campaign (e.g., WannaCry ransomware) often infects computers in multiple organizations, gaining cyberattack awareness could be substantially facilitated by cross-organization data sharing. Accordingly, "Cross-organization Big Data Cyber Attack Awareness" is the title of this NSF BD Spoke planning project.

    In this presentation, we first present the progress we have made in this planning project. Second, we present what we learned through the planning activities. Finally, we present the current planning activities.

5:00 PM Closing Remarks
5:10 PM Free-form discussion and networking