Harvard Business Review
There is a lot of hype surrounding data and analytics. Firms are constantly exhorted to set strategies in place to collect and analyze big data, and warned about the potential negative consequences of not doing so. For example, the Wall Street Journal recently suggested that companies sit on a treasure trove of customer data but for the most part do not know how to use it. In this article we explore why. Based on our work with companies that are trying to find concrete and usable insights from petabytes of data, we have identified four common mistakes managers make when it comes to data.
The first challenge limiting the value of big data to firms is compatibility and integration. One of the key characteristics of big data is that it comes from a variety of sources. However, if this data is not naturally congruent or easy to integrate, the variety of sources can make it difficult for firms to actually save money or create value for customers. For example, in one of our projects we worked with a firm that had beautiful data on customer purchases and loyalty and a separate database on online browsing behavior, but little way of cross-referencing the two sources to understand whether certain browsing behavior was predictive of sales. Firms can respond to the challenge by creating “data lakes” that hold vast amounts of data in unstructured form. However, the very fact that the vast swathes of data now available to firms are often unstructured, such as strings of text, means it is very difficult to store them in as structured a way as was possible when data was merely binary. And that often makes it extremely difficult to integrate data across sources.
The second challenge to making big data valuable is its unstructured nature. Specialized advances are being made in mining text-based data, where context and technique can lead to insights similar to those drawn from structured data, but other forms such as video data are still not easily analyzed. One example is that, despite state-of-the-art facial recognition software, authorities were unable to identify the two Boston Marathon bombing suspects from a multitude of video data, as the software struggled to cope with photos of their faces taken from a variety of angles.
Given the challenges of gaining insights from unstructured data, firms have been most successful with it when they use it to initially augment the speed and accuracy of existing data analysis practices. For example, in oil and gas exploration, big data is used to enhance existing operations and data analysis surrounding seismic drilling. Though the data they use may have increased in velocity, variety, and volume, ultimately it is still being used for the same purpose. In general, starting out with the hope of using unstructured data to try to generate new hypotheses is problematic until the firms have “practiced” and gained expertise in using unstructured data to enhance their answers to an existing question.
The third challenge, and in our opinion the most important factor limiting how valuable big data is to firms, is the difficulty of establishing causal relationships within large pools of overlapping observational data. Very large data sets usually contain a number of very similar or virtually identical observations that can lead to spurious correlations and, as a result, mislead managers in their decision-making. The Economist recently pointed out that ‘in a world of big data the correlations surface almost by themselves’, and a Sloan Management Review blog post emphasized that while many firms have access to big data, such data is not ‘objective’, since the difficulty lies in distilling ‘true’ actionable insights from it. Similarly, typical machine learning algorithms used to analyze big data identify correlations that may not necessarily offer causal, and therefore actionable, managerial insights. In other words, the skill in making big data valuable lies in moving from mere observational correlations to correctly identifying which correlations indicate a causal pattern and should form the basis for strategic action. Doing so often requires looking beyond big data.
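How easily correlations “surface by themselves” can be illustrated with a toy simulation. The sketch below is purely hypothetical and is not drawn from any firm's data: it generates a made-up sales series and 1,000 made-up metrics that are all independent random noise, then reports the strongest correlation found. The names `sales`, `metrics`, `N_OBS` and `N_METRICS` are assumptions introduced for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(7)       # fixed seed so the run is repeatable
N_OBS, N_METRICS = 20, 1000  # hypothetical: 20 periods, 1,000 candidate metrics

# Every series is independent noise, so no metric truly drives sales.
sales = [rng.gauss(0, 1) for _ in range(N_OBS)]
metrics = [[rng.gauss(0, 1) for _ in range(N_OBS)] for _ in range(N_METRICS)]

# Search all candidate metrics for the one most correlated with sales.
strongest = max(abs(pearson(m, sales)) for m in metrics)
print(f"strongest correlation among {N_METRICS} unrelated metrics: {strongest:.2f}")
```

Even though nothing here is related to anything else, searching enough candidate metrics will typically turn up a correlation well above 0.5. A manager shown only that one metric could easily mistake it for a lever worth pulling.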
One well-known example of big data is Google Trends, which uses Google’s records of aggregate search queries. However, it is also a case where the fact that the data is merely correlational limits its usefulness. Initially researchers argued that this data could be used to project the spread of flu. However, later researchers found that because the data was backward-looking, using search data only very marginally improved performance relative to a very simple model based on past time patterns.
To take a more specific example, imagine a shoe retailer that advertises to consumers across the web who have previously visited their website. Raw data analysis would suggest that customers exposed to these ads are more likely to purchase shoes. However, these consumers who have previously visited the website have already demonstrated their interest in the specific retailer even prior to viewing the ad, and so are more likely than the average consumer to purchase. Was the ad effective? It is hard to say. Indeed, big data here does not allow any causal inference about marketing communication effectiveness. To understand whether such ads are effective, the retailer needs to run a randomized test or experiment, where one subset of consumers is randomly not exposed to the ad. By comparing the purchase probabilities across consumers who were exposed to the ad and those who were not, the company can then determine whether exposing consumers to an ad made them more likely to buy. Value is delivered in such instances not primarily by the access to data, but by the ability to design, implement and interpret meaningful experiments.
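The shoe retailer's problem can be sketched with a small simulation. Everything in it is an assumption made for illustration: the purchase rates (`BASE_VISITOR`, `BASE_OTHER`), the true effect of the ad (`AD_LIFT`), the 30% share of past visitors and the 50/50 holdout split are invented numbers, not figures from any real campaign.

```python
import random

# Hypothetical purchase probabilities, chosen only for illustration.
BASE_VISITOR = 0.10  # past website visitors who see no ad
BASE_OTHER = 0.02    # all other consumers, who see no ad
AD_LIFT = 0.01       # assumed true causal effect of the ad on a past visitor

def purchase_rate(outcomes):
    """Share of consumers in a group who purchased."""
    return sum(outcomes) / len(outcomes)

def observational_lift(n, rng):
    """Raw data: every past visitor is retargeted and nobody else sees the ad."""
    exposed, unexposed = [], []
    for _ in range(n):
        if rng.random() < 0.3:  # assumed share of past visitors
            exposed.append(rng.random() < BASE_VISITOR + AD_LIFT)
        else:
            unexposed.append(rng.random() < BASE_OTHER)
    return purchase_rate(exposed) - purchase_rate(unexposed)

def experimental_lift(n, rng):
    """Experiment: past visitors are randomly split into ad and holdout groups."""
    exposed, holdout = [], []
    for _ in range(n):
        if rng.random() < 0.5:  # coin flip decides who sees the ad
            exposed.append(rng.random() < BASE_VISITOR + AD_LIFT)
        else:
            holdout.append(rng.random() < BASE_VISITOR)
    return purchase_rate(exposed) - purchase_rate(holdout)

rng = random.Random(42)  # fixed seed so the run is repeatable
naive = observational_lift(200_000, rng)
causal = experimental_lift(200_000, rng)
print(f"raw exposed-vs-unexposed gap: {naive:.3f}")
print(f"randomized holdout estimate:  {causal:.3f}")
```

Under these assumed rates, the raw comparison should land near 0.09, because it mostly measures the prior interest of past visitors, while the randomized comparison should land near the assumed true effect of 0.01. The two groups in the experiment differ only by the coin flip, which is what allows the difference to be read as the effect of the ad.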
It is experimentation, not the analysis of big observational datasets, that allows a firm to understand whether a relationship is merely correlational or might be reliably predictive because it reflects an underlying causal mechanism. While it may be challenging for a manager to improve profitability using even one petabyte of observational data describing customer behavior, comparing the behavior of a customer who was exposed to a marketing activity with that of a customer who was by chance unexposed (that is, the results of an experiment) can help a marketer conclude whether the activity was profitable.
Implementing field experiments, drawing the right conclusions, and taking appropriate action is not necessarily easy. But successful companies have developed the ability to design, implement, evaluate and then act upon meaningful field experiments. It is this “test and learn” environment, coupled with the skill to act on the insights and to understand whether they can be generalized, that can make big data valuable.
However, because of diminishing returns to increasingly large data samples, such experimentation does not necessarily require big data. For example, Google reports that it typically uses random samples of 0.1% of available data to perform analyses. Indeed, a recent article suggested that the size of big data can actually be detrimental as “the bigger the database, the easier it is to get support for any hypothesis you put forward.” In other words, because big data often offers overlapping insights, a firm can get similar insight from one-thousandth of the full dataset as from the entire dataset.
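The diminishing returns to sample size follow from basic sampling arithmetic. As a rough sketch, assume the goal is to estimate a purchase rate from a simple random sample of independent observations; the one-billion-record dataset (`FULL`) and the 5% rate (`p`) are hypothetical values chosen for illustration. The standard error of the estimate shrinks only with the square root of the sample size.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent observations."""
    return math.sqrt(p * (1 - p) / n)

FULL = 1_000_000_000  # hypothetical size of the full dataset
p = 0.05              # hypothetical true purchase rate

# Compare a 0.1% sample, a 1% sample, a 10% sample and the full dataset.
for divisor in (1000, 100, 10, 1):
    n = FULL // divisor
    se = standard_error(p, n)
    print(f"n = {n:>13,}  standard error = {se:.6f}")
```

Using all the data instead of a 0.1% sample multiplies the number of records by 1,000 but tightens the estimate only by a factor of about 32, the square root of 1,000. Under these assumptions the 0.1% sample already pins the rate down to within a few hundredths of a percentage point, which is usually more precision than a managerial decision requires.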
Experimentation is not the only method companies can use to infer valuable insights from big data. Another potential skill firms can develop is the ability to build better algorithms to deal with big data. One example of such algorithms is recommender systems. Recommender systems rely on algorithms trained on correlational data to recommend the most relevant products to a customer. Yet what matters is not the size of the underlying data but the ability to identify the critical pieces of information that best predict a customer’s preferences. Indeed, it is often the machine learning algorithm used, rather than the size of the data, that determines the quality of the results. While predictive power may increase with the size of the data available, in many instances the improvements in predictions show diminishing returns to scale as data sets increase in size. But building better algorithms requires better data scientists. Companies that assume large volumes of data can be translated into insights without hiring employees with the ability to trace causal effects in that data are likely to be disappointed.
By itself, big data is unlikely to be valuable. It is only when combined with managerial, engineering, and analytic skill in determining the experiment or algorithm to apply to such data that it proves valuable to firms. This is clear when you compare the price of data to the price of data processing skills. The many contexts in which data is cheap relative to the cost of retaining the talent to process it suggest that processing skills are more important than the data itself in creating value for a firm.