Intelligent metadata management – addressing the Achilles’ heel of agile data platforms

Introduction
__

Agile end-to-end data platforms provide a combination of pre-built connectors for building a data lake, easy utilities for building data marts, and various downstream uses of data, including the development of machine learning models, visualization, and the provision of data or decisions to other systems through web services (APIs). Such platforms can create a data lake and data marts very quickly and promise the democratization of data usage across the organization. However, sharing the tacit knowledge of data experts and making it available to a wider audience poses a significant challenge to the democratization impact of such platforms. Organizations are also wary of balancing the need for security and control while democratizing the use of data. Intelligent metadata management approaches aim at capturing the tacit knowledge about the data by observing the activities of users. Such methodologies, coupled with data governance utilities, can address these limitations.

Background: the power of agile end-to-end data platforms
__

Across every industry and business function, data has become the key asset for creating machine intelligence and driving the digital strategy. For machine intelligence to work, one needs historical data to build the intelligence and current data (or data as of the present moment) to apply the intelligence.

Integrating data across multiple sources and resolving data quality issues is the single most important challenge in leveraging machine intelligence. This is also the biggest challenge even if one desires to derive simple rules or insights from data.

An architecture involving an enterprise data lake, where data from various source systems is stored in its most granular form, is increasingly becoming the most effective data integration strategy. Once data is aggregated in the data lake, the granular data is used to build data marts that aggregate data around specific contexts (e.g. customer, supplier, store, dealer). The data marts are used for generating insights through dashboards, building machine learning models, or providing data to various downstream applications (e.g. a customer 360 application). The data in the data lake and the subsequent data marts is updated at the frequency needed for the specific use case (e.g. a near-real-time update or an end-of-day batch update).

The creation of the data lake and the development of data marts usually take a significant amount of time and development effort. Once the data marts are ready, additional time and effort are required to develop insights and models. The same is true whether the solution is built using best-of-breed tools or using cloud services from major cloud providers.

Given the need for faster time to market and the lack of large development teams, many organizations are adopting data platforms that enable an agile data journey. Such platforms are most effective for organizations that are looking for a high return on investment and faster time to market, and that do not have a large technology team.

An agile data platform is an end-to-end data and analytics platform. Such a platform provides:

  • Pre-built connectors to all common databases and enterprise applications
  • An extensive set of drag-and-drop data-wrangling components that business users can use to create mini data marts without writing a single line of code
  • Visualization capabilities, including the ability to create presentation-style (“ppt-like”) dashboards through drag and drop
  • Connectors to common visualization tools (e.g. Tableau) where these need to be integrated with the platform
  • The ability to build machine learning models using drag-and-drop components
  • The ability to deploy machine learning models from the platform as web services
  • The ability to support the data needs of downstream applications through web services
  • Data quality management and master data management utilities that can be used as drag-and-drop components

These platforms enable an agile data journey, which involves:

  • Creating a data lake in a few weeks
  • Building data marts using simple drag-and-drop workflows
  • Building dashboards and machine learning models using drag-and-drop workflows

The advantage of such a platform is that once the data lake is created within weeks, a few use-case-specific data marts can be created very quickly, and these marts can start serving the data needs of the identified use cases. In this way, tangible business value is created very quickly. An equally important advantage of such agile data platforms is their power to democratise the creation of data marts, insights and machine learning models.

The Achilles’ heel of an agile data platform
__

While an agile data platform is perfectly suited to delivering results very quickly, it has three important limitations. The first is the difficulty of sharing tacit knowledge about the organization's metadata. The second is a possible lack of control, because users may end up creating too many, often redundant, data marts without a well-thought-out strategy. The third is the risk of losing a strong data governance mechanism.

Difficulty of sharing the tacit knowledge of the metadata
__

A key advantage of an agile data platform is that it democratises the power of data mart creation and makes it available to a wider audience. But to create meaningful data marts, one needs a clear understanding of the data fields within the various tables in the source systems. While the platform can provide the tools for creating data marts, it is very difficult to democratise the tacit knowledge about the metadata to a wider audience. Therefore, in many cases, though users may have the power to build data marts, they do not build them, because they lack an understanding of the metadata.

Lack of control over users creating too many redundant data marts
__

Once the agile data platform empowers users to build data marts and democratises the power of data mart creation, nothing stops users from creating too many data marts, each too specific in nature. This can result in a serious lack of control and considerable confusion. On the other hand, strict control over the right to create data marts may defeat the important objective of democratisation.

Data governance
__

A democratic agile platform also runs the risk of exposing data to users without appropriate control. It may also result in unauthorised changes to data and possible errors in the creation of metrics within data marts. Regulations such as the EU GDPR impose many governance requirements – a few key aspects being the management of personally identifiable information, the right to be forgotten, consent management, and the provision of reason codes for analytically driven actions. Ensuring that all governance requirements are met within an agile, democratic platform is a key obligation.

Addressing the Achilles’ heel – machine intelligence for metadata management
__

Automated metadata creation for each table during the first-time data load
__

Whenever a table is read into the data lake, the platform should create a metadata object for the table. As soon as the table is loaded into the data lake, the editable metadata object is populated with the following information (a minimal sketch follows the list):

  • Number of unique records
  • Possible unique ID fields (e.g. customer number, supplier number, part number) – for example, customer ID in a customer master table
  • Possible repeated ID fields – for example, customer number in a transaction table
  • Possible demographic fields (phone, email, ZIP code etc.)
  • Maximum, minimum, average and missing-value count for each numeric field – the logical name of the field is left for the user to create
  • Date fields, with the most recent and oldest values – the logical name of the field needs to be added by the user
  • In the case of change data capture, the name of the base field
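
As an illustration, here is a minimal sketch of such first-load profiling using pandas. The metadata structure, the naming heuristic for spotting ID fields, and the choice of statistics are assumptions made for illustration, not the specification of any particular platform.

```python
import pandas as pd

def build_metadata_object(table_name: str, df: pd.DataFrame) -> dict:
    """Populate an editable metadata object when a table first lands in the lake.
    The structure and heuristics here are illustrative, not a fixed schema."""
    meta = {
        "table": table_name,
        "unique_records": int(len(df.drop_duplicates())),
        "possible_unique_ids": [],    # e.g. customer ID in a customer master table
        "possible_repeated_ids": [],  # e.g. customer number in a transaction table
        "fields": {},
    }
    for col in df.columns:
        s = df[col]
        # Naming heuristic (assumed): columns ending in id/number/code look like IDs.
        looks_like_id = col.lower().endswith(("id", "number", "code"))
        if looks_like_id and s.nunique() == len(s):
            meta["possible_unique_ids"].append(col)
        elif looks_like_id:
            meta["possible_repeated_ids"].append(col)
        if pd.api.types.is_numeric_dtype(s):
            meta["fields"][col] = {
                "type": "numeric",
                "max": float(s.max()), "min": float(s.min()),
                "average": float(s.mean()), "missing": int(s.isna().sum()),
                "logical_name": None,  # left for the user to create
            }
        elif pd.api.types.is_datetime64_any_dtype(s):
            meta["fields"][col] = {
                "type": "date",
                "most_recent": str(s.max()), "oldest": str(s.min()),
                "missing": int(s.isna().sum()),
                "logical_name": None,  # to be added by the user
            }
    return meta
```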

Automated knowledge capture as users start using the base tables in the data lake
__

Once a data lake is complete, the initial few data marts are usually created by users who have an extensive understanding of the metadata. An intelligent metadata capture methodology involves capturing the metadata as these users create data marts, without them explicitly documenting any metadata information. Below are a few examples of how this is accomplished (these are only indicative in nature; a sketch of the first two follows the list):

  • In a workflow, wherever a Sum, Distinct or Average aggregation is used, the system automatically uses the name given to the aggregated field to create metadata for the base field (e.g. if the aggregated name is “total_transactions” or “total transactions”, the base field is given the name “transaction”).
  • In a workflow, if a join is defined and the name of one of the keys is defined in the metadata object of table 1, the system automatically ascribes the same name to the key in table 2.
  • If an input field is renamed, the new name is used to create a logical name for the field.
  • Whenever a workflow is created to build a data mart, the system identifies up to three “un-named” fields that are most important and prompts the user to add an “English” name for those fields before the workflow can be executed.
  • A workflow for building a data mart usually comprises multiple tables. Before a workflow is executed, an automated table comparison utility is run to identify fields that are similar with respect to their range of values, similarity of values, uniqueness of values, number of records etc. Where fields are similar, the system prompts users to confirm whether the same “English” name can be used for those fields across tables.
  • When a data mart is used for creating a visualization, the name used for labelling the visualization is used to create an “English” name for the field. For example, if a field in a data mart is plotted on the Y axis of a line graph and the Y axis is named “monthly transaction”, then the base field is named accordingly (i.e. “transaction”).
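
A minimal sketch of the first two heuristics follows, reusing the metadata-object shape from the earlier sketch. The prefix list and the naive singularisation are purely illustrative assumptions, not a prescribed algorithm.

```python
import re
from typing import Optional

# Assumed naming conventions for aggregated fields.
AGG_PREFIXES = ("total_", "sum_", "avg_", "average_", "distinct_", "count_")

def infer_base_name(aggregated_name: str) -> Optional[str]:
    """Infer a logical base-field name from the name a user gives an aggregate,
    e.g. 'total_transactions' or 'total transactions' -> 'transaction'."""
    name = re.sub(r"[\s\-]+", "_", aggregated_name.strip().lower())
    for prefix in AGG_PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):].rstrip("s")  # naive singularisation, illustrative only
    return None

def propagate_join_key_name(meta1: dict, meta2: dict, key1: str, key2: str) -> None:
    """If a join key already has a logical name in table 1's metadata object,
    ascribe the same name to the matching key in table 2."""
    name = meta1["fields"].get(key1, {}).get("logical_name")
    if name and not meta2["fields"].get(key2, {}).get("logical_name"):
        meta2["fields"].setdefault(key2, {})["logical_name"] = name
```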

The above are only a few examples. The idea is to capture the way a user creates a data mart and learn the metadata without the user consciously documenting any metadata information.

Every time a data mart is created, a metadata object is created along with it. For all fields derived in the workflow, the metadata object is populated with the calculation for the field. If a field in the data mart is created based on any aggregation logic, the definition of the field is recorded appropriately in the metadata object of the data mart. The set of IDs in the data mart is also identified and defined in the metadata object.
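
Such an object might look roughly as follows (all names and values are hypothetical):

```python
mart_meta = {
    "data_mart": "customer_monthly_summary",  # hypothetical mart name
    "ids": ["customer_id"],
    "records": 1_250_000,                     # illustrative count
    "update_frequency": "end-of-day batch",
    "fields": {
        "monthly_spend": {
            "calculation": "SUM(transactions.amount) GROUP BY customer_id, month",
            "derived_from": ["transactions.amount"],
            "logical_name": "monthly spend",
        },
    },
}
```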

Automated data mart management
__

Whenever a user creates a new data mart, it is matched against all existing data marts in terms of unique IDs, number of records, fields, level of aggregation of base fields and update frequency. All data marts that closely resemble the new data mart are listed, and their metadata is provided to the user. Only if the user still feels that the new data mart is necessary does the platform keep it.
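
A minimal sketch of such matching, assuming metadata objects of the shape shown above; the similarity measure, the equal weights and the 0.7 threshold are illustrative choices, not a prescribed algorithm.

```python
def mart_similarity(meta_a: dict, meta_b: dict) -> float:
    """Score the resemblance of two data marts on IDs, fields, record counts
    and update frequency. Equal weights are an illustrative assumption."""
    ids_a, ids_b = set(meta_a["ids"]), set(meta_b["ids"])
    fields_a, fields_b = set(meta_a["fields"]), set(meta_b["fields"])
    id_overlap = len(ids_a & ids_b) / max(len(ids_a | ids_b), 1)
    field_overlap = len(fields_a & fields_b) / max(len(fields_a | fields_b), 1)
    size_ratio = min(meta_a["records"], meta_b["records"]) / max(meta_a["records"], meta_b["records"], 1)
    freq_match = 1.0 if meta_a["update_frequency"] == meta_b["update_frequency"] else 0.0
    return (id_overlap + field_overlap + size_ratio + freq_match) / 4

def find_similar_marts(new_meta: dict, existing: list, threshold: float = 0.7) -> list:
    """Return existing marts that closely resemble a proposed new one, so their
    metadata can be shown to the user before the new mart is kept."""
    return [m for m in existing if mart_similarity(new_meta, m) >= threshold]
```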

A metadata bot
__

A metadata bot helps users locate a given field within source tables and data marts. The bot works as an AI assistant, akin to a data expert who guides users by providing key metadata information. The bot also helps answer queries about the range of values of a field, the distinct values of a categorical field, and so on.
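
A real bot would sit on top of a language model or a search index; the sketch below reduces it to a keyword lookup over the metadata objects, simply to show where the answers come from.

```python
def locate_field(query: str, catalogue: list) -> list:
    """Return (object name, column, logical name) hits for a search term across
    the metadata objects of all tables and data marts."""
    q = query.lower()
    hits = []
    for meta in catalogue:
        source = meta.get("table") or meta.get("data_mart")
        for col, info in meta["fields"].items():
            logical = (info.get("logical_name") or "").lower()
            if q in col.lower() or q in logical:
                hits.append((source, col, info.get("logical_name")))
    return hits
```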

Data lineage
__

Each workflow should have a data lineage view that helps users understand how records flow across the workflow. Though not directly related to metadata management, the lineage view helps users understand a workflow created by another user.
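
A minimal sketch of the underlying bookkeeping, assuming each workflow step materialises a dataframe; a real platform would track this inside its execution engine.

```python
def annotate_lineage(workflow_steps: list) -> list:
    """Attach a record count to each step of a workflow, so another user can
    see how rows flow (and where they drop) across the workflow.
    `workflow_steps` is an assumed list of (step_name, dataframe) pairs."""
    return [{"step": name, "records": len(df)} for name, df in workflow_steps]
```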

Data profiler for data quality
__

For every field, in the base tables and in the data marts, the metadata object contains a profile of the field. The profiles are created as follows (a minimal sketch follows the list):

  • Character: number of unique values, missing-value count
  • Numeric: missing-value count, maximum, minimum, average, median
  • Date: oldest value, latest value, missing-value count
  • If a field is determined to be a possible ID, the uniqueness of the ID field is also provided
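
A minimal sketch of such a profiler, dispatching on field type; the shape of the profile dictionary is an assumption for illustration.

```python
import pandas as pd

def profile_field(series: pd.Series, is_possible_id: bool = False) -> dict:
    """Build the per-field profile stored in the metadata object."""
    profile = {"missing": int(series.isna().sum())}
    if pd.api.types.is_numeric_dtype(series):
        profile.update(max=float(series.max()), min=float(series.min()),
                       average=float(series.mean()), median=float(series.median()))
    elif pd.api.types.is_datetime64_any_dtype(series):
        profile.update(oldest=str(series.min()), latest=str(series.max()))
    else:  # character fields
        profile["unique_values"] = int(series.nunique())
    if is_possible_id:
        # Share of non-missing values that are unique.
        profile["id_uniqueness"] = series.nunique() / max(int(series.notna().sum()), 1)
    return profile
```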

Addressing the Achilles’ heel – data governance
__

In addition to the usual usage rights management, an agile data platform should give the administrator the ability to identify fields that should be masked or encrypted, so that a user can use the field for joining multiple tables but cannot view its exact content. Similarly, a field can be defined as regulated by a consent management indicator; once such a field is included in any data mart, the system should automatically prompt the user to add the consent management field to the data mart. The same should apply to communication management (e.g. do-not-call, do-not-mail lists). For handling the right to be forgotten, a specific component should be designed to remove all records for a given customer from all data marts.
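
As one illustration, masking can be done with a keyed deterministic hash, which keeps the field joinable across tables while hiding the raw value; the key handling and the right-to-be-forgotten helper below are simplified assumptions, not a reference implementation.

```python
import hashlib
import hmac

SECRET_KEY = b"platform-held-secret"  # hypothetical key managed by the administrator

def mask_for_join(value: str) -> str:
    """Deterministically mask a sensitive field: the same input always yields
    the same token, so masked columns still join across tables, but the
    original value cannot be read back by the user."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

def forget_customer(customer_id: str, marts: dict) -> None:
    """Right-to-be-forgotten sketch: remove all records for a given customer
    from every data mart; assumes each mart is a dataframe with a
    'customer_id' column (an illustrative schema)."""
    for name, df in marts.items():
        marts[name] = df[df["customer_id"] != customer_id]
```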

Conclusion
__

As the art and science of leveraging data become widely available to organizations, the need for a high return on investment and faster time to market is becoming ever more important. In addition, business users who are not traditional programmers are becoming critical users of new-age data platforms. While tools and platforms are one key element for addressing this need, the most difficult challenge involves documenting and sharing the tacit knowledge about the data. A comprehensive solution to this problem is extremely difficult given the tacit nature of this knowledge, but intelligent metadata management is a step that can help the process significantly.
Bijoy Khandelwal

Head of Engineering & Products at Actify Data Labs