Analytical approaches to mining big datasets are just tools in the hands of the people implementing them. Sometimes they make the work easier; other times they are misused and lead to poor results. And sometimes an approach takes on a life of its own and is held up as a panacea for any big data challenge. Machine learning is one of those mystified tools that often falls into this last category, and applying it without a fundamental understanding of the underlying structural relationships can be costly and ineffective.
The issue is tied to the same principle behind causality versus correlation. It is a basic maxim that correlation does not imply causation; just because the crime rate in NYC has dropped over the last 10 years while the price of a candy bar has increased does not mean that higher candy prices reduce crime. We know that because the likelihood of some underlying fundamental, structural reason connecting those two trends is tenuous at best.
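To see how easily this happens, here is a toy sketch with entirely made-up numbers: two unrelated series, one drifting down and one drifting up, will correlate strongly simply because both move with time.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Made-up numbers purely for illustration: a steadily falling crime rate
# and a steadily rising candy-bar price over the same ten years.
years = np.arange(2000, 2010)
crime_rate = 100 - 4 * (years - 2000) + rng.normal(0, 2, size=len(years))
candy_price = 1.00 + 0.10 * (years - 2000) + rng.normal(0, 0.05, size=len(years))

r, p = pearsonr(crime_rate, candy_price)
print(f"correlation = {r:.2f}")
# Two unrelated series that both drift with time will show a strong
# (here negative) correlation -- no causal link required.
```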
Yet many people think they can just naively apply machine learning to a dataset and mine it for interesting results. When left unconstrained, machine learning techniques will often find far stronger signals in nonsensical relationships than in truly causally connected events. And even after refining the input to your models, i.e. your feature space, the results are still based on correlations.
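As a rough illustration of that point, the sketch below uses purely synthetic noise: an unconstrained least-squares fit will happily "explain" most of a target that contains no signal at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a pure-noise target and 90 pure-noise features for 100
# observations -- there is no real relationship anywhere in this dataset.
n_obs, n_features = 100, 90
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_features))

# An unconstrained least-squares fit "explains" most of the noise anyway.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
r_squared = 1 - residuals.var() / y.var()
print(f"in-sample R^2 = {r_squared:.2f}")  # typically around 0.9 despite zero signal
```

The impressive in-sample fit is an artifact of giving the model enough freedom to memorize noise, not evidence of any structural relationship.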
Take it from an industry where analytics and quantitative analysis play a huge role – finance. In 2008, when the industry was permeated with systematic investment players executing economic models built on machine learning techniques, something unexpected hit – a massive deleveraging in the market. By and large, the winners in that situation were those who understood this structural dynamic and what it implied, not the players naively applying their statistical models to the past few years of data.
What can we learn from this? Not that machine learning is bad, but that it is a tool, and that to avoid being caught with an empty bank account you have to apply it while also developing a fundamental understanding. And if you are going to start somewhere, spending the time to carefully and logically construct a basic set of fundamental assumptions will often yield less risky performance at the outset.
Ultimately, separating signal from noise in a dataset takes a variety of approaches. It takes inductive approaches like machine learning and deductive approaches like hypothesis testing. Being able to do only one piece means you have only part of the solution, and if you rely on that alone you are likely setting yourself up for failure.
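As a closing sketch, again on synthetic noise, the snippet below pairs the two: an inductive search that mines the "best" feature in-sample, followed by a deductive hypothesis test on held-out data that exposes the discovery as spurious.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n_obs, n_features = 100, 500

# Inductive step: mine 500 noise features for the one that correlates
# best with a noise target over an in-sample period.
y_in = rng.normal(size=n_obs)
X_in = rng.normal(size=(n_obs, n_features))
best = max(range(n_features),
           key=lambda j: abs(np.corrcoef(X_in[:, j], y_in)[0, 1]))

# Deductive step: treat "feature `best` drives y" as an explicit hypothesis
# and test it on a later period the search never touched.
y_out = rng.normal(size=n_obs)
X_out = rng.normal(size=(n_obs, n_features))
r, p = pearsonr(X_out[:, best], y_out)
print(f"mined feature #{best}: out-of-sample r = {r:+.2f}, p = {p:.2f}")
# The correlation that won the in-sample search will usually show nothing
# out of sample -- the hypothesis test exposes the spurious discovery.
```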