In the last 15 years, science has experienced a revolution. The emergence of sophisticated sensor networks, digital imagery, Internet search and social media posts, and the fact that pretty much everyone is walking around with a smartphone in their pocket has enabled data collection on unprecedented scales. New supercomputers with petabytes of storage, gigabytes of memory, tens of thousands of processors, and the ability to transfer data over high speed networks permit scientists to understand that data like never before.
Research conducted under this new Big Data paradigm (aka eScience) falls into two categories – simulation and correlation. In simulations, scientists assume a model for how a system operates. By perturbing the model’s parameters and initial conditions, it becomes possible to predict outcomes under a variety of conditions. This technique has been used to study climate models, turbulent flows, nuclear science, and much more.
The second approach – correlation – involves gathering massive amount of real data from a system, then studying it to discover hidden relationships (i.e. correlations) between measured values. One example would be studying which combination of factors like drought, temperature, per capita GDP, cell phone usage, local violence, food prices, and more affect the migratory behavior of human populations.
At Johns Hopkins University (JHU) I work within a research collective known the Institute for Data Intensive Engineering and Science (IDIES). Our group specializes in using Big Data to solve problems in engineering and the physical and biological sciences. I attended the IDIES annual symposium on October 16, 2015 and heard presentations from researchers across a range of fields. In this article, I share some of their cutting edge research.
The United States spends a staggering $3.1 trillion in health care costs per year, or about 17% of GDP. Yet approximately 30% of that amount is wasted on unnecessary tests and diagnostic costs. Scientists are currently using Big Data to find new solutions that will maximize health returns while minimizing expense.
The costs of health care are more than just financial. They also include staff time and wait periods to process test results, often in environments where every minute matters. Dr. Daniel Robinson of JHU’s Department of Applied Mathematics & Statistics is working on processing vast quanties of hospital data through novel cost-reduction models in order to ultimately suggest a set of best practices.
On a more personal level, regular medical check-ups can be time consuming, expensive, and for some patients physically impossible. Without regular monitoring, it is difficult to detect warning signs of potentially fatal diseases. For example, Dr. Robinson has studied septic shock, a critical complication of sepsis that is the 13th leading cause of death in the United States, and the #1 cause within intensive care units. A better understanding of how symptoms like altered speech, elevated pain levels, and tiredness link to the risk of septic shock could say many lives.
Realizing this potential has two components. The first is data acquisition. New wearable devices like the Apple Watch, Fitbit, BodyGuardian, wearable textiles, and many others in development will enable real-time monitoring of a person’s vital statistics. These include heart rate, circadian rhythms, steps taken per day, energy expenditure, light exposure, vocal tone, and many more. These devices can also issue app-based surveys on a regular basis to check in on one’s condition.
Second, once scientists are able to determine which health statistics are indicative of which conditions, these monitors can suggest an appropriate course of action. This kind of individualized health care has been referred to as “precision medicine.” President Obama even promoted it in his 2015 State of the Union Address, and earned a bipartisan ovation in the process. A similar system is already working in Denmark where data culled from their electronic health network is helping predict when a person’s condition is about to worsen.
Dr. Jung Hee Seo (JHU – Mechanical Engineering) is using Big Data to predict when somebody is about to suffer an aneurysm. Because of the vast variety of aneurysm classifications, large data sets are critical for robust predictions. Dr. Seo intends to use his results to build an automated aneurysm hemodynamics simulation and risk data hub. Dr. Hong Kai Ji (JHU – Biostatistics) is doing similar research to predict genome-wide regulatory element activities.
The development of new materials is critical to the advancement of technology. Yet one might be surprised to learn just how little we know about our materials. For example, of the 50,000 to 70,000 known inorganic compounds, we only have elastic constants for about 200, dielectric constrants for 300-400, and superconductivity properties for about 1000.
This lack of knowledge almost guarantees that there are better materials out there for numerous applications, e.g. a compound that would help batteries be less corrosive while having higher energy densities. In the past, we’ve lost years simply because we didn’t know what our materials were capable of. For example, lithium iron phosphate was first synthesized in 1977, but we only learned it was useful in cathodes in 1997. Magnesium diboride was synthesized in 1952, but was only recognized as a superconductor in 2001.
Dr. Kristin Persson (UC Berkeley) and her team have been using Big Data to solve this problem in an new way. They create quantum mechanical models of a material’s structure, then probe their properties using computationally expensive simulations on supercomputers. Their work has resulted in The Materials Project. Through an online interface, researchers now have unprecendented access to the properties of tens of thousands of materials. They are also provided open analysis tools that can inspire the design of novel materials.
Another area where Big Data is playing a large role is in climate prediction. The challenge is using a combination of data points to generate forecasts for weather data across the world. For example, by measuring properties like temperature, wind speed, and humidity across the planet as a function of time, can we predict the weather in, say, Jordan?
Answering this question can be done either by using preconstructed models of climate behavior or by using statistical regression techniques. Dr. Ben Zaitchik (JHU – Earth & Planetary Sciences) and his team have attempted to answer that question by developing a web platform that allows the user to select both climate predictors and a statistical learning method (e.g. artificial neural networks, random forests, etc.) to generate a climate forecast. The application, which is fed by a massive spatial and temporal climate database, is slated to be released to the public in December.
Because local climate is driven by global factors, simulations at high resolution with numerous climate properties for both oceans and atmospheres can be absolutely gigantic. These are especially important since the cost of anchoring sensors to collect real ocean data can exceed tens of thousands of dollars per location.
Housing vacancy lies at the heart of Baltimore City’s problems. JHU assistant professor Tamas Budavári (Applied Mathematics & Statistics) has teamed up with the city to better understand the causes of the vacancy phenomenon. By utilizing over a hundred publicly available datasets, they have developed an amazing system of “blacklight maps” that allow users to visually inspect all aspects of the problem. By incorporating information like water, gas, and electricity consumption, postal records, parking violations, crime reports, and cell phone usage (are calls being made at 2pm or 2am?) we can begin to learn which factors correlate with vacancy, then take cost effective actions to alleviate the problem.
As Big Data proliferates, the potential for collaborative science increases in extraordinary ways. To this end, agencies like the National Institutes of Health (NIH) are pushing for data to become just as large a part of the citation network as journal articles. Their new initiative, Big Data to Knowledge (BD2K), is designed to enable biomedical research to be treated as a data-intensive digital research enterprise. If data from different research teams can be integrated, indexed, and standardized, it offers the opportunity for the entire research enterprise to become more efficient and less expensive, ultimately creating opportunities for more scientists to launch research initiatives.
My personal research uses Big Data to solve a problem caused by Big Data. In a world in which researchers have more data as their fingertips than ever before, the uncertainty caused by small sample sizes has decreased. As this so-called statistical noise drops, the dominant source of error is systematic noise. Like a scale that is improperly calibrated, systematic noise inhibits scientists from obtaining results that are both precise and accurate, regardless of how many measurements are taken.
In my dissertation, I developed a method to minimize noise in large data sets provided we have some knowledge about the distributions from which the signal and noise were drawn. By understanding the signal and noise correlations between different points in space, we can draw statistical conclusions about the most likely value of the signal given the data. The more correlations (i.e. points) that are used, the better our answer will be. However, large numbers of points require powerful computational resources. To get my answers, I needed to parallelize my operations over multiple processors in an environment with massive amounts (e.g. ~ 1TB) of memory.
Fortunately, our ability to process Big Data has recently taken a big step forward. Thanks to a $30 million grant from the state of Maryland, a new system called the Maryland Advanced Research Computing Center (MARCC) has just come online. This joint venture between JHU and the University of Maryland at College Park has created a collaborative research center that allows users to remotely access over 19,000 processors, 50 1TB RAM nodes with 48 cores, and 17 petabytes of storage capacity. By hosting the system under one roof, users share savings in facility costs and management, and work within a standardized environment. Turnaround time for researchers accustomed to smaller clusters will be drastically reduced. Scientists also have the option of colocating their own computing systems within the facility to reduce network transmission costs.
The era of Big Data in science, which started with the Sloan Digital Sky Survey in 2000, is now in full force. These are exciting times, and I cannot wait to see the fruits this new paradigm will bear for all of us.