Resolving data challenges for transformational AI

Artificial Intelligence (AI) is possibly the most widely discussed technology trend. Views on its applications and benefits diverge sharply: some feel it will bring about an apocalyptic day of machines controlling human beings, while others believe it will be extremely beneficial for humanity. In the near term, AI and machine learning are becoming critical to solving important business and social problems. In many cases, machine intelligence is being used to supplement people, thereby “augmenting” human intelligence.

It is commonly believed that “mathematically complex” algorithms are the biggest challenge in leveraging the power of machine intelligence. However, a detailed discussion with any practitioner will reveal that this is mostly not the case. Integrating data across multiple sources and resolving data quality issues is often the most important challenge in leveraging machine intelligence. It is also the biggest challenge even when one simply wants to derive basic rules or insights from data.

Due to data integration and data quality challenges, most data-driven initiatives within organisations take a long time to demonstrate tangible results. This causes budget escalation and frustration among senior leadership.

Traditionally, organisations adopted a linear approach to data integration, in which detailed Extract, Transform and Load (ETL) processes were created to obtain data from various systems and to establish relationships among the hundreds of data tables belonging to those source systems. These data warehousing initiatives used to take at least 18 to 24 months to complete. In addition, it was often difficult to incorporate unstructured data such as images, text files and streaming data from sensors.

Data from the warehouse is rarely consumed directly by front-end applications. Because the warehouse is a single store of the entire data set, any front-end application (e.g. a customer-facing app or a distributor portal) that fetches data from such a large store typically experiences high response times, which is unacceptable for most front-end applications.

An agile approach to data integration attempts to change this paradigm and brings in a use-case-driven way of integrating data. The key element of this approach is the creation of a data lake. As opposed to a data warehouse, a data lake stores data in its as-is format without performing any transformation. One performs only Extraction and Loading (EL) of the data, in the as-is format, from the source systems into a single repository. The data lake also stores unstructured data such as text and images. Because there is no transformation or summarization of the source data, there is no loss of information. The dramatic reduction in storage costs has made it possible to store data in this as-is form.
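
As a minimal sketch of this Extract-and-Load (EL) pattern, assuming a relational source reachable through SQLAlchemy and pandas (the connection string, table names and lake path below are purely illustrative), source tables are simply copied into the lake without transformation:

    # Extract-and-Load (EL): copy source tables into the data lake unchanged.
    # Connection string, table names and lake path are illustrative assumptions.
    import pandas as pd
    from sqlalchemy import create_engine

    source = create_engine("postgresql://user:password@source-db:5432/sales")
    lake_path = "/data-lake/raw"  # could equally be an object-store location

    for table in ["customers", "orders", "order_lines"]:
        df = pd.read_sql_table(table, source)                       # Extract as-is
        df.to_parquet(f"{lake_path}/{table}.parquet", index=False)  # Load untransformed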

Using the right platform, a data lake can be created very quickly (within 4-6 weeks). Once the data lake has been created, an agile approach can be used to combine the few tables required for a particular use case into a mini data mart. For example, one may need to identify the relationships among only 10 to 15 tables to create the data mart required to analyse the effectiveness of salespersons. Each data mart supports specific insight generation or machine learning model development needs. This approach ensures that organisations see initial results very quickly.
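
A minimal sketch of such a mini data mart, assuming the as-is tables have landed in the lake as Parquet files (the table names, join keys and paths are illustrative only), could look like this:

    # Combine a handful of lake tables into a use-case-specific mini data mart.
    # In practice a sales-effectiveness mart might join 10 to 15 such tables.
    import pandas as pd

    lake = "/data-lake/raw"
    orders = pd.read_parquet(f"{lake}/orders.parquet")
    salespersons = pd.read_parquet(f"{lake}/salespersons.parquet")
    targets = pd.read_parquet(f"{lake}/targets.parquet")

    mart = (orders
            .merge(salespersons, on="salesperson_id", how="left")
            .merge(targets, on=["salesperson_id", "month"], how="left"))

    # Persist the mart separately so it can serve analytics, models and APIs
    mart.to_parquet("/data-lake/marts/sales_effectiveness.parquet", index=False)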

Each data mart contains relatively few data elements compared with a warehouse; hence it can serve data to front-end applications within prescribed response times. Data from the data marts, along with the results of machine learning models, are exposed as APIs (application programming interfaces) that can be consumed by various front-end applications.
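
As a hypothetical illustration (not the author's actual implementation), the sketch below uses the FastAPI library to expose a summary from the assumed sales-effectiveness mart as an API endpoint; the endpoint path, file location and column names are assumptions:

    # Expose data-mart records as an API that front-end applications can consume.
    import pandas as pd
    from fastapi import FastAPI

    app = FastAPI()
    mart = pd.read_parquet("/data-lake/marts/sales_effectiveness.parquet")

    @app.get("/salespersons/{salesperson_id}/summary")
    def salesperson_summary(salesperson_id: int):
        rows = mart[mart["salesperson_id"] == salesperson_id]
        return {
            "salesperson_id": salesperson_id,
            "orders": int(len(rows)),
            "total_sales": float(rows["order_value"].sum()),
        }

A service of this kind can sit behind a standard web server and be called by the customer-facing app or distributor portal mentioned earlier.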

A key aspect of using data from the data marts in front-end applications is the frequency at which data from the source systems is refreshed into the data lake. An ideal solution performs a near-real-time refresh and captures changes where the source data is overwritten. Specific capabilities such as change data capture and the ability to read updates from database logs are critical for this purpose.
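
True change data capture reads changes from the database transaction log; as a simplified, hypothetical stand-in, the sketch below performs an incremental refresh using an assumed updated_at column as a watermark (the table, column and paths are illustrative):

    # Incremental refresh into the lake using a timestamp watermark.
    # Real change data capture would read the database log instead.
    import pandas as pd
    from sqlalchemy import create_engine, text

    source = create_engine("postgresql://user:password@source-db:5432/sales")
    last_refresh = "2024-01-01 00:00:00"  # in practice, read from refresh metadata

    changes = pd.read_sql(
        text("SELECT * FROM orders WHERE updated_at > :ts"),
        source, params={"ts": last_refresh})

    # Append the changed rows; downstream jobs de-duplicate on the primary key
    changes.to_parquet("/data-lake/raw/orders_changes.parquet", index=False)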

The data lake approach allows different data marts to obtain different types of summarization of the same base data. Because the granular as-is data is present in the data lake, one can derive different summarizations from the same data. Within a data lake framework, it is also critical to have utilities that can make sense of unstructured data, such as images and scanned PDFs, and combine it with traditional structured data.
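
As a small illustration of this point (with purely hypothetical column names), two different summaries can be derived from the same granular orders data in the lake:

    # Two different summarizations of the same as-is data in the lake.
    import pandas as pd

    orders = pd.read_parquet("/data-lake/raw/orders.parquet")

    # One mart may need monthly sales by region...
    by_region = orders.groupby(["region", "month"])["order_value"].sum().reset_index()

    # ...while another needs the same base data summarized by salesperson.
    by_rep = orders.groupby(["salesperson_id", "month"])["order_value"].sum().reset_index()
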
Once a data mart has been created for one use case, the next set of data marts can be created as and when needed: the agile data journey. The data lake can also use an existing data warehouse as a data source, thereby reusing existing investments in the warehouse.

The author is Head of Product and Engineering at Actify Data Labs.
