“Big data is a bit of a buzzword,” says Ryan Barrett. “People in data science can use it to scare off those that aren’t.” As the Director of Credit Risk & Data Science at Sandy-based fintech firm Acima Credit, he works with data every day. “People who don’t know about it are like, ‘Oh my god, they’re talking about big data.’”
Big data is one of those awe-inducing concepts, like artificial intelligence, that we non-initiates tend to take at face value. It’s a technological curtain we never think to peer behind: we nod in respect and feign deep familiarity, each of us certain we’re the only one left scratching our heads. It’s big, it’s data, and it’s going to change the world. Or so we’re told. For data scientists, however, the reality is much more mundane.
Data That Is… Big?
Data is information. A dataset is a collection of data points; a data point is merely a small bit of information about something. You could have a small dataset (the names and birthdates of everyone in your family, say). If you make it bigger—expanding it until it eventually contains the names and birthdates of everyone on earth—does it, at some unseen point, cross a barrier and become capital B capital D Big Data?
Not necessarily, says University of Utah Associate Professor of Information Systems Rohit Aggarwal. “Big data,” he says, “is much more than just big datasets.” Acknowledging that “there will always be hype cycles,” he emphasizes that big data “is the development of [an] open-sourced ecosystem to store and process data.” By Mr. Aggarwal’s definition, is big data a technological framework for processing information? The information itself? Or a methodology for rendering the information useful? Mr. Barrett says, “Usually, when we’re talking about big data we’re talking about data that requires some sort of parallel processing. It’s just a way to do multiple calculations, much quicker.”
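To make the idea concrete, here is a minimal Python sketch of the “multiple calculations, much quicker” pattern Mr. Barrett describes: split a dataset into chunks, run the same calculation on each chunk concurrently, then combine the partial results. (The function names and the thread-based pool are illustrative choices, not anything from Acima’s actual stack.)

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    # each worker handles one slice of the data independently
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # split the dataset into roughly equal chunks
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # map: run the same calculation on every chunk at once
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(chunk_sum, chunks)
    # reduce: combine the partial results into one answer
    return sum(partials)
```

The same split-compute-combine shape scales from a handful of threads on a laptop to thousands of machines in a cluster; only the machinery underneath changes.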
The Challenges Of Big Data
If the world’s largest dataset is packed away in musty boxes and never processed, is it big data? As it turns out, this is the exact challenge facing many industries. Not the musty boxes—actually, probably the musty boxes, too—but the issue of sequestered data. Mr. Barrett speaks from his earlier experience as a data scientist in the healthcare sector.
“These companies are sitting on some of the world’s largest reservoirs of valuable data, but many of them aren’t doing anything with it,” he says. Turns out, it’s pretty hard to “do anything with it” if the data storage method doesn’t allow one dataset to talk to another. Or if tools like Hadoop and Spark—computational systems that, in the words of Mr. Aggarwal, “process large amounts of data in a distributed fashion using commodity hardware”—can’t readily access said data to work their magic on it.
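The “distributed fashion” Mr. Aggarwal mentions follows the MapReduce pattern that Hadoop popularized and Spark generalizes: map the data into key-value pairs, shuffle them together by key, then reduce each group to a result. A toy single-machine sketch of those three phases follows; the real frameworks run these same steps across many commodity servers, and the function names here are illustrative.

```python
from collections import defaultdict

def map_phase(records):
    # emit (key, 1) pairs, as a Hadoop/Spark mapper would
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # group values by key; in a real cluster the framework
    # does this across machines over the network
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # combine each key's values into a final count
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big ideas"])))
# counts == {"big": 2, "data": 1, "ideas": 1}
```

The catch for legacy institutions is step zero: none of this works if the data sits in formats and silos the frameworks cannot read.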
Legacy institutions, Mr. Barrett says, suffer from infrastructures predating the so-called information age. Facebook, Google, and other technologically forward companies have no trouble harnessing their data, because they’re built to leverage, share, and process it. In many ways, data is their main product. Legacy institutions, in contrast, often seem to approach data as some unwanted appendage of dubious utility and immense hassle. Yet these siloed institutions, says Mr. Barrett, are the ones with the most to gain from the efficiencies and insights that data can bring.
Use Cases For Big Data
In the healthcare sector, research tends to be the bright spot for data leverage. Where other parts of the medical apparatus struggle against decades—or even centuries—of inertia, many medical research disciplines have developed within living memory. And they have data as their very backbone.
Human beings generate some of the largest datasets in existence. “Genomic big data is big data taken to the next level,” says Gabor Marth, Professor of Human Genetics at the University of Utah. “All the genetic data of humanity could very well exceed any other dataset in the known universe.”
Mr. Marth says that even with tremendous advances in computing, genomic data is so vast that “to process the genomic data of all humankind would require us to scale up our processing capabilities multiple orders [of magnitude].” While current computational capacity is “enough to process the data we’re able to collect to date,” he says, it is unknown whether it can keep up with the frontiers of research. Gary Stone, Executive Director of Precision Genomics at Intermountain Healthcare, is succinct in his assessment: “Big data is key to unlocking the full potential of genomics.”
Recursion Pharmaceuticals “generates many terabytes of biological experiments per week, creating a data set that currently has tens of millions of images plus associated information,” says Ron Alfa, Recursion’s VP, Discovery & Product. (The , by comparison, churns out about eight terabytes of data per year.)
“We have built a first-of-its-kind biological dataset that is designed from the ground up specifically to do powerful machine learning analytics,” says Lina Nilsson, Sr. Director, Data Science Product at Recursion. “We shy away from buzzwords like big data or artificial intelligence.”
Nevertheless, big data truly comes into its own—data harnessed to some groundbreaking purpose—when paired with AI/machine learning/algorithmic intelligence. Or, as Ms. Nilsson says, “it is the intersection of powerful data sets and modern machine learning that is truly revolutionary.” That’s because large datasets and AI reinforce one another in a virtuous cycle: the more experiments are run and the more data is generated, the better the models get, allowing an iterative process through which computers learn from the data.
This allows for the generation of more relevant data, which further trains the machine, and so on. The cycle “also allows for emergent properties to emerge: the model can discover things that the human data scientist did not know to look for in the data,” she says.
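The cycle she describes can be sketched in a few lines of Python: each round of hypothetical experiments adds data, and the model is refit on everything collected so far, so its estimate tightens as the dataset grows. (The “experiment” and its true signal are invented for illustration; nothing here reflects Recursion’s actual pipeline.)

```python
import random

random.seed(0)

def run_experiment(x):
    # hypothetical assay: true signal is 2*x, plus measurement noise
    return 2.0 * x + random.gauss(0, 0.5)

def fit_slope(data):
    # simple least-squares slope through the origin
    num = sum(x * y for x, y in data)
    den = sum(x * x for x, _ in data)
    return num / den

data = []
errors = []
for round_num in range(3):
    # each round: run a new batch of experiments...
    data += [(x, run_experiment(x)) for x in range(1, 51)]
    # ...then refit the model on everything collected so far
    errors.append(abs(fit_slope(data) - 2.0))
```

With three rounds of fifty noisy measurements, the fitted slope lands very close to the true value of 2.0; a real pipeline would also use the current model to choose which experiments to run next.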
Epidemiologists try to model how diseases and other health hazards spread. “We get a lot of our insights from trying to figure out causal effects directly from data,” says Damon Toth, an epidemiology research professor at the U of U. He says “there are so many dependencies and confounding variables” that it’s difficult to determine what is cause and what is coincidence in the tangle.
“A naïve look might show you a correlation that seems to be a cause,” he says. “With a big enough dataset [you can make a much more] reliable conclusion on cause and effect.” Mr. Toth’s recent work includes antibiotic-resistant bacterial infections and the transmission vectors of the Ebola virus.
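A classic illustration of why confounders make this hard is Simpson’s paradox, sketched below with invented kidney-stone-style numbers (not data from Mr. Toth’s research): a treatment can look better within every subgroup yet worse in the pooled totals, because the confounder (here, case severity) is unevenly distributed between treatments.

```python
# hypothetical counts of (recovered, total) for treatments A and B,
# split by how severe the case was
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(recovered, total):
    return recovered / total

# within every severity group, A outperforms B...
for g in groups.values():
    assert rate(*g["A"]) > rate(*g["B"])

# ...yet pooled together, B looks better: severity, the lurking
# confounder, flips the apparent direction of the effect
a_rec = sum(g["A"][0] for g in groups.values())
a_tot = sum(g["A"][1] for g in groups.values())
b_rec = sum(g["B"][0] for g in groups.values())
b_tot = sum(g["B"][1] for g in groups.values())
assert rate(a_rec, a_tot) < rate(b_rec, b_tot)
```

A naïve look at the pooled rates would crown the wrong treatment; larger, richer datasets let analysts stratify by such variables instead of averaging over them.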
Quantity And Quality
So, the dataset needs to be sufficiently large—the larger the better—but it also needs to be deep. Ms. Nilsson says both the size and the quality of the dataset are vitally important to the usefulness of the information it yields. “It is important to emphasize the distinction between a very large dataset with one or a small number of measurements and a dataset that contains hundreds of independent dimensions of measurements, or features,” she says. “Consider for example, a simple dataset containing the current zip code of every US resident. That is indeed a very large data set, but there are a limited number of questions that could be answered from it.”
Such a dataset would be broad, but one-dimensional. Ms. Nilsson contrasts that with a data set that is “more complex and that contains historical data for each zip code an individual has lived in, alongside the dates they lived there. With just a few more dimensions to the data, the second data set could be used to answer more complex demographical questions.”
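Ms. Nilsson’s contrast can be sketched in a few lines of Python with invented toy records: the one-dimensional dataset supports only head-counts, while the historical one supports questions about movement over time.

```python
# one-dimensional: current zip code per resident (toy data)
current = {"alice": "84101", "bob": "84604", "carol": "84101"}

# multi-dimensional: every residence, with the years lived there
history = {
    "alice": [("84604", 2010, 2015), ("84101", 2015, 2020)],
    "bob":   [("84604", 2012, 2020)],
    "carol": [("84101", 2008, 2020)],
}

# the flat dataset answers only head-count questions...
in_84101_now = sum(1 for z in current.values() if z == "84101")

# ...the deeper dataset answers questions the flat one cannot,
# e.g. who has moved away from a given zip code
moved_from_84604 = [person for person, stays in history.items()
                    if any(z == "84604" for z, *_ in stays[:-1])]
```

Adding just two dimensions (prior zip codes and dates) turns a static census into a record of migration, which is exactly the kind of demographic question the richer dataset unlocks.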