The saying goes, “History doesn’t repeat itself, but it does rhyme.”  Two researchers have ventured to learn the cadence.  Armed with a 22-year archive of the New York Times and readily accessible online sources (e.g., Wikipedia, OpenCyc, FreeBase, GeoNames), researchers from the Technion-Israel Institute of Technology and Microsoft Research have developed a prototype predictive model to forecast real-world events (e.g., disease outbreaks).  The model generalizes from specific sets of event sequences and “patterns of evidence in near-term newsfeeds” to predict the likelihood of future outcomes.  Better predictions of human and natural events will hopefully lead to better preparation or, better yet, prevention.

The dramatic reduction in the cost of data storage, the increasing availability of analytic tools, and the explosion of data shared by individuals, businesses, and open government initiatives have resulted in a myriad of complex databases, fittingly referred to as “Big Data.”  Big Data claims to have Big Potential.  The potential benefits include the ability to identify current trends, such as tracking seasonal flu outbreaks, presidential polls, and box office hype.  But the real potential lies in the algorithmic promise of predictive power: Big Data as soothsayer.  Recorded Future is one such data-mining initiative, offering a system that analyzes information from multiple web sources, uses tools to identify historical developments, and applies “temporal analytics” to offer clients hypotheses about future events.

Of course, statisticians, sociologists, and scientists have long used datasets and past events for predictive purposes.  The difference with predictive analytic models based on broad, web-based datasets is the sheer size of the sample.  The challenge is not collecting the data, but rather developing a systematic and efficient way of making sense of it.  Kira Radinsky of the Technion-Israel Institute of Technology and Eric Horvitz of Microsoft Research discuss their predictive system in a recently released paper, “Mining the Web to Predict Future Events.”  To demonstrate the relative benefits of Internet data-mining, they considered the ability of news data to predict cholera outbreaks in Angola.  A system based on computational analytics and relational models could have issued alerts about a downstream risk of cholera “nearly a year in advance,” whereas human experts, relying on fewer samples aimed at generating “predictions for guiding near-term action,” may “overlook such long-term interactions.”  Radinsky and Horvitz outline a number of general advantages of this predictive system over its human counterparts:

Real-time Processing:  “[A] computational system has the ability to learn from patterns from large amounts of data, can monitor numerous information sources, can learn new probabilistic associations over time, and can continue to do real-time monitoring, prediction, and alerting on increases in the likelihoods of forthcoming events.”

Long-term Analysis:  “Beyond knowledge that is easily discovered in studies or available from experts, new relationships and context-sensitive probabilities of outcome can be discovered by a computational system with long tentacles into historical corpora . . . .”

Unbiased: “It can be valuable to identify situations where there is a significantly lower likelihood of an event than expected by experts based on the large set of observations and feeds being considered in automated manner.”

Larger Datasets:  A predictive system “typically will have faster and more comprehensive access to news stories that may seem less important on the surface (e.g., story about a funeral . . . in a local newspaper . . .), but that might provide valuable evidence in the evolution of large, more important stories (e.g., massive riots).”

While there is no immediate plan to commercialize the prototype, the potential power of such predictive analytics is clear.  As Horvitz indicated to MIT Technology Review, “I truly view this as a foreshadowing of what’s to come.”  Since Horvitz develops predictive models geared to analyze foreshadowing, I’ll take him at his word.

The growing availability of datasets from both public and private sources has created a market for new analytical tools to aid the business, government, health, and entertainment sectors.  While we should encourage the development of Big Data applications, we must balance the benefits of Big Data against fears of data misuse or Big Brother tendencies.  What types of government policies will aid the responsible development and use of new databases?  What are the implications for open government data policies, law enforcement access, education, and privacy regulations?

Elizabeth Maratea


One Response to Big Data Promises Big Predictions

  1. Michael Dearington says:

    Great and very informative post, Lizzie! I bet the results will serve as a good control in proving certain biases, especially the availability bias, which in part causes people to have an inflated sense of frequency of certain events that affect them in particular, are particularly recent, or are particularly vivid.