Across every industry and business function, data has become the key asset for creating machine intelligence and driving digital strategy. For machine intelligence to work, one needs historical data to build the intelligence and current data (data as of the present moment) to apply it.
Integrating data across multiple sources and resolving data quality issues is the single biggest challenge in leveraging machine intelligence. It remains the biggest challenge even when one only wants to derive simple rules or insights from data.
An architecture built around an enterprise data lake, where data from the various source systems is stored in its most granular form, is increasingly becoming the most effective data integration strategy. Once data is aggregated in the data lake, the granular data is used to build data marts that aggregate data around specific contexts (e.g. customer, supplier, store, dealer). The data marts are then used for generating insights through dashboards, building machine learning models, or providing data to downstream applications (e.g. a customer 360 application). The data in the lake and in the subsequent data marts is refreshed at the frequency each use case requires (e.g. a near-real-time update, an end-of-day batch update).
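The lake-to-mart flow described above can be sketched in a few lines. This is a minimal illustration using pandas; the table and column names (orders, customers, customer_id, amount) are hypothetical placeholders, not taken from any specific platform.

```python
import pandas as pd

# Granular data as it might sit in the data lake.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100.0, 50.0, 75.0, 20.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# A customer-centric data mart: granular orders aggregated per customer,
# enriched with customer attributes.
customer_mart = (
    orders.groupby("customer_id", as_index=False)
          .agg(total_spend=("amount", "sum"), order_count=("amount", "count"))
          .merge(customers, on="customer_id")
)
print(customer_mart)
```

In a real platform the same aggregation would run on the lake's query engine and be refreshed on the schedule the use case requires, but the shape of the transformation is the same.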
Creating the data lake and developing the data marts usually take a significant amount of time and development effort. Once the data marts are ready, additional time and effort is required to develop insights and models. The same is true whether the solution is built using best-of-breed tools or using the cloud services provided by the major cloud providers.
Given the need for faster time to market and the lack of large development teams, many organizations are adopting data platforms that enable an agile data journey. Such platforms are most effective for organizations that seek a high return on investment and faster time to market but do not have a large technology team.
An agile data platform is an end-to-end data and analytics platform. Such a platform provides:
These platforms enable the agile data journey, which involves:
The advantage of such a platform is that once the data lake is created, typically within weeks, a few use-case-specific data marts can be built very quickly and can start serving the data needs of the identified use cases. In this way, tangible business value is created early. An equally important advantage of agile data platforms is that they democratise the creation of data marts, insights and machine learning models.
Whenever a table is loaded into the data lake, the platform should create a metadata object for the table. As soon as the table is loaded, the editable metadata object is populated with the following information:
Once a data lake is complete, the initial few data marts are usually created by users who have an extensive understanding of the metadata. An intelligent metadata capture methodology involves capturing the metadata as these users create data marts, without them explicitly documenting any metadata information. Below are a few indicative examples of how this is accomplished:
The above are only a few examples. The idea is to observe how a user creates a data mart and learn the metadata from those actions, without the user consciously documenting any metadata information.
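One way to realise this implicit capture is for the platform to intercept each transformation the user applies while building a mart and silently record the lineage of every derived field. The sketch below assumes a hypothetical `MetadataRecorder` wrapper around pandas; the class and its API are illustrative, not part of any named product.

```python
import pandas as pd

class MetadataRecorder:
    """Hypothetical wrapper that learns field lineage as a mart is built."""

    def __init__(self):
        self.lineage = {}

    def aggregate(self, df, by, **named_aggs):
        # Perform the aggregation the user asked for...
        out = df.groupby(by, as_index=False).agg(**named_aggs)
        # ...and silently record how each derived field was computed,
        # without the user documenting anything.
        for field_name, (source_col, func) in named_aggs.items():
            self.lineage[field_name] = f"{func}({source_col}) grouped by {by}"
        return out

recorder = MetadataRecorder()
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.0]})
mart = recorder.aggregate(orders, "customer_id", total_spend=("amount", "sum"))
print(recorder.lineage)  # {'total_spend': 'sum(amount) grouped by customer_id'}
```

The user simply builds the mart; the definition of `total_spend` is captured as a by-product of the operation itself.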
Every time a data mart is created, a metadata object is created along with it. For every field derived in the workflow, the metadata object records the calculation for that field. If a field in the data mart is created using aggregation logic, the definition of the field is captured accordingly in the data mart's metadata object. The set of IDs in the data mart is also identified and defined in the metadata object.
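Such a per-mart metadata object might look like the following. This is only a sketch of one possible structure, assuming Python dataclasses; the class and attribute names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class FieldDefinition:
    name: str
    derivation: str          # e.g. "sum(amount) grouped by customer_id"
    is_identifier: bool = False

@dataclass
class MartMetadata:
    mart_name: str
    fields: dict = field(default_factory=dict)

    def add_field(self, name, derivation, is_identifier=False):
        # Record how the field was derived, and whether it is one of
        # the mart's identifying (ID) columns.
        self.fields[name] = FieldDefinition(name, derivation, is_identifier)

    def identifiers(self):
        # The set of IDs identified and defined for this mart.
        return [f.name for f in self.fields.values() if f.is_identifier]

meta = MartMetadata("customer_mart")
meta.add_field("customer_id", "source: customers.customer_id", is_identifier=True)
meta.add_field("total_spend", "sum(orders.amount) grouped by customer_id")
print(meta.identifiers())  # ['customer_id']
```

Because the derivation strings come from the captured workflow rather than from manual documentation, the object stays accurate as the mart evolves.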
For every field, whether in the base tables or in the data marts, the metadata object contains a profile of the field. The profile is created as follows: