Intelligent metadata management – addressing the Achilles’ heel of agile data platforms
Background: the power of agile end-to-end data platforms
Across every industry and business function, data has become the key asset for creating machine intelligence and driving the digital strategy. For machine intelligence to work, one needs historical data to build the intelligence and current data (or data as of the present moment) to apply the intelligence.
Integrating data across multiple sources and resolving data quality issues is the single most important challenge in leveraging machine intelligence. This is also the biggest challenge even if one desires to derive simple rules or insights from data.
An architecture involving an enterprise data lake, where data from various source systems is stored in its most granular form, is increasingly becoming the most effective data integration strategy. Once data is aggregated in the data lake, the granular data is used to build data marts that aggregate data around specific contexts (e.g. customer, supplier, store, dealer etc.). The data marts are used for generating insights through dashboards, building machine learning models, or providing data to various downstream applications (e.g. a customer 360 application). The data in the data lake and the subsequent data marts is updated at the frequency needed for the specific use case (e.g. a near-real-time update, an end-of-day batch update etc.).
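To make this flow concrete, here is a minimal sketch (with hypothetical table and field names) of granular lake data being aggregated into a customer-context data mart, using pandas as a stand-in for the platform's wrangling components:

```python
import pandas as pd

# Granular, source-level records as they would land in the data lake
# (table and field names are hypothetical).
lake_transactions = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2", "C2"],
    "store_id":    ["S1", "S2", "S1", "S1", "S3"],
    "amount":      [10.0, 25.0, 5.0, 15.0, 30.0],
})

# A customer-context data mart: the same data aggregated around one entity.
customer_mart = (
    lake_transactions
    .groupby("customer_id", as_index=False)
    .agg(total_transactions=("amount", "count"),
         total_amount=("amount", "sum"),
         stores_visited=("store_id", "nunique"))
)
```

The resulting mart can then feed a dashboard, a model, or a downstream application such as a customer 360 view.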
The creation of the data lake and the development of data marts usually take a significant amount of time and development effort. Once the data marts are ready, additional time and effort are required to develop insights and models. The same is true whether the solution is built using best-of-breed tools or cloud services from major cloud providers.
Given the need for faster time to market and the lack of large development teams, many organizations are adopting data platforms that enable an agile data journey. Such platforms are most effective for organizations that are looking for a high return on investment and faster time to market, and that do not have a large technology team.
An agile data platform is an end-to-end data and analytics platform. Such a platform provides:
- Pre-built connectors to all common databases and enterprise applications
- An extensive set of drag-and-drop data wrangling components that business users can use to create mini data marts without writing a single line of code
- Visualization capabilities, including the ability to create “PPT-like” dashboards through drag and drop
- Connectors to common visualization tools (e.g. Tableau) where such tools need to be integrated with the platform
- Ability to build machine learning models using drag-and-drop components
- Ability to deploy machine learning models from the platform as web services
- Ability to support the data needs of downstream applications through web services
- Data quality management and master data management utilities that can be used as drag-and-drop components
These platforms enable the agile data journey, which involves:
- Creating a data lake in a few weeks
- Building data marts using simple drag-and-drop workflows
- Building dashboards and machine learning models using drag-and-drop workflows
The advantage of such a platform is that once the data lake is created within weeks, a few use-case-specific data marts can be created very quickly, and those data marts can start serving the data needs of the identified use cases. This way, tangible business value is delivered early. An important advantage of such agile data platforms is their power of democratising the creation of data marts, insights and machine learning models.
The Achilles’ heel of an agile data platform
The same democratisation that makes these platforms powerful also creates two problems:
- Difficulty of sharing the tacit knowledge of the metadata
- Lack of control over users creating too many redundant data marts
Addressing the Achilles’ heel – machine intelligence for metadata management
Automated metadata creation for each table during first-time data load
Whenever a table is read into the data lake, the platform should create a metadata object for the table. As soon as the table is loaded into the data lake, the editable metadata object is populated with the following information:
- Number of unique records
- Possible unique ID fields (e.g. customer number, supplier number, part number etc.) – customer ID in a customer master table
- Possible repeated ID fields – customer number in a transaction table
- Possible demographic fields (phone, email, ZIP etc.)
- Maximum, minimum, average and number of missing values for each numeric field – the logical name of the field is left for the user to create
- Date fields with the most recent and oldest values – the logical name of the field needs to be added by the user
- In the case of change data capture, the name of the base field
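A minimal sketch of how such a first-load metadata object might be populated, using pandas; the object layout, field names and the uniqueness thresholds are illustrative assumptions:

```python
import pandas as pd

def build_table_metadata(df: pd.DataFrame) -> dict:
    """Illustrative first-load metadata object for a table (structure is hypothetical)."""
    meta = {
        "unique_records": int(len(df.drop_duplicates())),
        "possible_unique_ids": [],    # e.g. customer ID in a customer master table
        "possible_repeated_ids": [],  # e.g. customer number in a transaction table
        "numeric_profile": {},
        "date_profile": {},
    }
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            meta["numeric_profile"][col] = {
                "max": s.max(), "min": s.min(),
                "average": s.mean(), "missing": int(s.isna().sum()),
            }
        elif pd.api.types.is_datetime64_any_dtype(s):
            meta["date_profile"][col] = {
                "most_recent": s.max(), "oldest": s.min(),
                "missing": int(s.isna().sum()),
            }
        else:
            # A column whose values are all distinct is a candidate unique ID;
            # one with heavy repetition may be a repeated ID (threshold assumed).
            ratio = s.nunique(dropna=True) / max(len(s), 1)
            if ratio == 1.0:
                meta["possible_unique_ids"].append(col)
            elif ratio < 0.5:
                meta["possible_repeated_ids"].append(col)
    return meta
```

The object is editable by design: the automated profile is a starting point that users refine with logical names.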
Automated knowledge capture as users start using the base tables in the data lake
Once a data lake is completed, the initial few data marts are usually created by users who have an extensive understanding of the metadata. An intelligent metadata capture methodology involves capturing this knowledge as these users create data marts, without them explicitly documenting any metadata information. Below are a few examples of how this is accomplished (these are only indicative in nature):
- Wherever a Sum, Distinct or Average aggregation is used in a workflow, the system automatically uses the name given to the aggregated field to create metadata for the field (e.g. if the aggregated name is “total_transactions” or “total transactions”, then the base field is given the name “transaction”).
- If a join is defined in a workflow, and the name of one of the keys is defined in the metadata object of table1, then the system automatically ascribes the same name to the key in table2.
- If an input field is renamed, the new name is used to create a logical name for the field.
- Whenever a workflow is created to build a data mart, the system identifies up to three “un-named” fields that are most important and prompts the user to add an “English” name for those fields before the workflow can be executed.
- A workflow for building a data mart usually comprises multiple tables. Before a workflow is executed, an automated table comparison utility is run to identify fields that are similar with respect to their range of values, similarity of values, uniqueness of values, number of records etc. For similar fields, the system prompts users to confirm whether the same “English” name can be used for the fields across tables.
- When a data mart is used to create a visualization, the name used to label the visualization is used to create an “English” name for the field. For example, if a field in the data mart is plotted on the Y axis of a line graph and the Y axis is named “monthly transaction”, then the base field is named accordingly (i.e. “transaction”).
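The first heuristic (deriving a base-field name from the name an aggregated field is given) could be sketched as follows; the prefix list and the naive singularisation are assumptions for illustration only:

```python
import re
from typing import Optional

# Hypothetical aggregation prefixes a platform might recognise in user-given names.
AGG_PREFIXES = ("total", "sum", "avg", "average", "distinct", "count", "mean")

def infer_base_field_name(aggregated_name: str) -> Optional[str]:
    """Derive a base-field name from an aggregated field's user-given name.

    e.g. "total_transactions" or "total transactions" -> "transaction".
    Returns None when no aggregation prefix is recognised.
    """
    tokens = re.split(r"[\s_]+", aggregated_name.strip().lower())
    if len(tokens) >= 2 and tokens[0] in AGG_PREFIXES:
        base = "_".join(tokens[1:])
        return base.rstrip("s")  # naive singularisation, for illustration only
    return None
```

A production implementation would need a richer vocabulary and language handling, but the principle is the same: the user's own naming carries the metadata.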
The above are only a few examples. The idea is to capture the way a user creates a data mart and learn the metadata without the user consciously documenting any metadata information.
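The table-comparison utility mentioned above could flag candidate field pairs with a heuristic along these lines; the comparison criteria and thresholds are illustrative assumptions:

```python
import pandas as pd

def fields_look_similar(a: pd.Series, b: pd.Series,
                        overlap_threshold: float = 0.8) -> bool:
    """Heuristic check that two fields from different tables may carry the
    same business meaning and deserve one shared "English" name (illustrative)."""
    # Compare uniqueness profiles: both near-unique, or both repetitive.
    ua = a.nunique(dropna=True) / max(a.count(), 1)
    ub = b.nunique(dropna=True) / max(b.count(), 1)
    if abs(ua - ub) > 0.2:
        return False
    # Compare value overlap: share of the smaller value set seen in the other.
    va, vb = set(a.dropna()), set(b.dropna())
    if not va or not vb:
        return False
    overlap = len(va & vb) / min(len(va), len(vb))
    return overlap >= overlap_threshold
```

When a pair passes the check, the system would prompt the user to confirm a shared name rather than assign one silently.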
Every time a data mart is created, a metadata object is created along with it. For all fields derived in the workflow, the metadata object is populated with the calculation for the field. If a field in the data mart is created using any aggregation logic, then the definition of the field is recorded appropriately in the metadata object of the data mart. The set of IDs in the data mart is also identified and defined in the metadata object.
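A data mart's metadata object could record such field definitions along the following lines; the class, method and field names are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataMartMetadata:
    """Illustrative metadata object created alongside a data mart."""
    mart_name: str
    id_fields: list = field(default_factory=list)          # IDs identified in the mart
    field_definitions: dict = field(default_factory=dict)  # field -> calculation

    def record_derived_field(self, name: str, source_table: str,
                             source_field: str,
                             aggregation: Optional[str] = None) -> None:
        # Store the calculation so each mart field is self-documenting.
        if aggregation:
            calc = f"{aggregation}({source_table}.{source_field})"
        else:
            calc = f"{source_table}.{source_field}"
        self.field_definitions[name] = calc
```

For example, recording `record_derived_field("total_transactions", "transactions", "amount", "sum")` would capture the lineage of the aggregated field without any manual documentation.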
Automated data mart management
A metadata bot
Data profiler for data quality
For every field in the base tables or in the data marts, the metadata object contains a profile of the field. The profiles created are as follows:
- Character: number of unique values, number of missing values
- Numeric: missing, max, min, average, median
- Date: oldest, latest, missing
- If it is determined that a field could be a possible ID, then the uniqueness of the ID field is also provided
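A sketch of such a per-field profiler, assuming pandas and the profile keys listed above; the function name and the uniqueness measure are assumptions for illustration:

```python
import pandas as pd

def profile_field(s: pd.Series, is_possible_id: bool = False) -> dict:
    """Build the per-field profile held in the metadata object (illustrative)."""
    missing = int(s.isna().sum())
    if pd.api.types.is_numeric_dtype(s):
        profile = {"type": "numeric", "missing": missing,
                   "max": s.max(), "min": s.min(),
                   "average": s.mean(), "median": s.median()}
    elif pd.api.types.is_datetime64_any_dtype(s):
        profile = {"type": "date", "missing": missing,
                   "oldest": s.min(), "latest": s.max()}
    else:
        profile = {"type": "character", "missing": missing,
                   "unique_values": int(s.nunique(dropna=True))}
    if is_possible_id:
        # Share of non-missing values that are distinct: 1.0 means a clean ID.
        profile["id_uniqueness"] = s.nunique(dropna=True) / max(s.count(), 1)
    return profile
```

Running such a profiler on every load keeps the data quality picture current without any manual profiling effort.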