In my last post I tried to define the concept of insight. In this post I discuss insight generation. Insights are generated by systematically and exhaustively examining a) the output of various analytic models (including predictive, benchmarking, outlier-detection models, etc.) generated from a body of data, and b) the content and structure of the models themselves. Insight generation is a process that takes place together with model generation, but is separate from the decisioning process, during which the generated models, as well as the insights and their associated action plans, are applied to new data.
Insight generation depends on our ability to a) collect, organize and retain data, b) generate a variety of analytic models from that data, and c) analyze the generated models themselves. Therefore, in order to generate insights, we must have the ability to generate models. And in order to do that we must have data. Insights can be generated from collected data, data derived from the collected data, as well as the metadata of the collected data. This means that we need to be thinking not only about the data collection, management and archiving processes, but also about how to post-process the collected data; what attributes to derive, what metadata to collect.
In some cases data is collected by conducting reproducible experiments or simulations (synthetic data). In other cases there is only one shot at collecting a particular data set. Regardless, insight generation is highly dependent on how an environment is “instrumented.” For example, consumer marketers have gone from measuring a few attributes per consumer, think of the early consumer panels run by companies such as Nielsen, to measuring thousands of attributes, including consumer web behavior, and most recently, consumer interactions in social networks. The “right” instrumentation is not always immediately obvious, i.e., it is not obvious which of the data that can be captured needs to be captured. Oftentimes, it may not even be immediately possible to capture particular types of data. For example, it took some time between the advent of the web and our ability to capture browsing activity through cookies. But obviously, the better the instrumentation the better the analytic models, and thus the higher the likelihood that insights can be generated. Knowing how to instrument an environment and ultimately how to use the instrumentation to measure and gather data can be thought of as an experiment-design process and frequently requires domain knowledge.
Insight generation also involves the ability to organize murky data, which is typical of environments involving big data, and to focus on the data that makes “sense,” given a specific context and state of domain knowledge. Focusing on specific data given a particular context doesn’t mean that the rest of the collected data is unimportant. It’s just that one cannot make sense of it at that point in time.
It is important not only to collect and organize data, but also to archive it properly, since insight generation may only become possible when a body of archived data is combined with newly collected data under a particular context. Alternatively, combining archived data with new data may lead to insights beyond those generated in the past. As the body of domain knowledge increases and new data is collected, it may be possible to extract new insights even from data collected in the past. Inexpensive and scalable big data infrastructures make this capability practical.
Insight generation is serendipitous in nature. For this reason, insights are more likely to be generated from the examination of several analytic models created from the same body of data, because each model-creation approach considers different characteristics of the data to identify relations. We maintain that model analysis, and therefore insight generation, is facilitated when models can be expressed declaratively. A good example of the advocated approach is IBM’s Watson system. This system uses ensemble learning to create many expert analytic models. Each created model provides a different perspective on a specific topic. Watson’s ensemble learning approach uses techniques such as optimization, outlier identification and analysis, and benchmarking in the process of trying to generate insights.
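To make the idea of examining several models built from the same body of data more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the customer data, the two toy model types (a z-score outlier detector and a median-based benchmark), the thresholds, and all function names are hypothetical, and none of it reflects how Watson or any other real system works. It only shows how each model looks at a different characteristic of the data and how the results can be examined side by side.

```python
from statistics import mean, median, pstdev

# Hypothetical data: customer id -> (monthly visits, monthly spend).
CUSTOMERS = {
    "c1": (2, 20.0),
    "c2": (3, 25.0),
    "c3": (4, 30.0),
    "c4": (5, 28.0),
    "c5": (4, 400.0),  # unusually high spend
    "c6": (20, 35.0),  # unusually frequent visitor
}


def zscore_outliers(values, threshold=2.0):
    """Flag keys whose value is more than `threshold` std devs from the mean."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return set()
    return {k for k, v in values.items() if abs(v - mu) / sigma > threshold}


def benchmark_outliers(values, factor=3.0):
    """Flag keys whose value is more than `factor` times above or below the median."""
    mid = median(values.values())
    return {k for k, v in values.items() if v > mid * factor or v < mid / factor}


# Each model considers a different characteristic of the same data.
visits = {k: n for k, (n, _) in CUSTOMERS.items()}
spend = {k: s for k, (_, s) in CUSTOMERS.items()}
spend_per_visit = {k: s / n for k, (n, s) in CUSTOMERS.items()}

MODELS = {
    "visit_outlier": zscore_outliers(visits),
    "spend_outlier": zscore_outliers(spend),
    "spend_per_visit_benchmark": benchmark_outliers(spend_per_visit),
}


def flagged_by(models):
    """Invert model outputs: customer id -> sorted names of models that flagged it."""
    result = {}
    for name, flagged in models.items():
        for key in flagged:
            result.setdefault(key, []).append(name)
    return {k: sorted(names) for k, names in result.items()}


# Candidates for manual examination, not insights in themselves.
CANDIDATES = flagged_by(MODELS)
```

In this toy data set, two customers are flagged by more than one model, each for a different reason. Deciding whether either observation is an insight, rather than just a pattern in the data, is still left to a person.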
While we are able to describe data collection and model creation in considerable detail, and have been able to largely automate them, this is still not the case with insight generation. This is in fact the most compelling reason for offering insight as a service: we have not been able to broadly automate the generation of insights. What we have characterized as insight today has to be generated manually by analyzing each analytic model derived from a body of data, even though there is academic research that is starting to point to approaches for the automatic generation of insights. The analysis of the derived analytic models will enable us to understand which of the relations comprising a model are simply correlations supported by the analyzed data set (but don’t constitute insights because they don’t satisfy the other characteristics an insight must possess), and which are actually meaningful, important and satisfy all the characteristics we outlined before.
As I mentioned, in most cases today utilizing insights that are generated manually by experts and offered in the form of a service may be the only alternative organizations have to fully benefit from the big data they collect. The best examples are companies like FICO, Exelate, Opera Solutions, Gaininsight and a few others. However, there are additional advantages to offering insights as a service:
- Certain types of insights, e.g., benchmarking, can only be offered as a service because the provider needs to compare data from a variety of organizations being benchmarked.
- Offering insight as a service could lower the overall cost of generating and reasoning over insights. This means that even organizations that can generate insights on their own may ultimately decide to outsource the insight generation and reasoning processes because specialized organizations may be able to perform them more efficiently and cost-effectively.
- Offering insight as a service enables organizations to benefit from the expertise the insight generator develops by offering insights to multiple organizations of the same type. For example, FICO has now developed tremendous credit insight expertise which no single financial services organization can replicate.
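The benchmarking case in the first bullet can be sketched in a few lines of Python. This is a hypothetical illustration only: the organization names, the churn metric, its values, and the `percentile_rank` helper are all assumptions. The point is that the calculation needs data pooled from several organizations, which is what a service provider has and a single organization does not.

```python
from bisect import bisect_left

# Hypothetical metric (customer churn rate, in percent) reported to the
# service provider by each client organization.
CHURN_BY_ORG = {
    "org_a": 4.0,
    "org_b": 6.5,
    "org_c": 9.0,
    "org_d": 12.0,
    "org_e": 5.0,
}


def percentile_rank(org, metric_by_org):
    """Return the percentage of the other organizations with a strictly lower value."""
    peers = sorted(v for k, v in metric_by_org.items() if k != org)
    if not peers:
        raise ValueError("benchmarking needs at least one peer organization")
    # bisect_left counts how many peer values are strictly below this org's value.
    below = bisect_left(peers, metric_by_org[org])
    return 100.0 * below / len(peers)


# Each organization sees only its own position relative to the pooled peers.
RANKS = {org: percentile_rank(org, CHURN_BY_ORG) for org in CHURN_BY_ORG}
```

Each client gets back only its own rank, so the provider can deliver the benchmark without exposing any other organization's raw numbers.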
I want to close by making the following point. I have argued that for an insight to be valid it must have an action associated with it. This action is applied during a decisioning process. The characteristics of a particular decisioning process also need to be taken into consideration during the insight/action-generation process, because the time (and maybe even other costs) allocated to applying a particular action during the decisioning process is very important. Watson’s Jeopardy play provided a great illustration of this point, as the system had a limited amount of time to come up with the correct response to beat its opponents. Below I provide an initial, rudimentary illustration of how quickly specific actions must be applied in particular domains.
We are starting to make progress in understanding the difference between insights and the patterns and correlations derived from a data set. This is becoming particularly important because we are dealing more frequently with big data and because we need to use insights to gain a competitive advantage. Offering manual insight-generation services provides us with some short-term reprieve, but ultimately we need to develop automated systems because the data is getting bigger and our ability to act on it is not improving proportionately.