New Tool Helps Clients Improve AI Models with High-Quality Data

Artificial intelligence (AI) increasingly depends on high-quality data for accurate results. Models built on unreliable or stale data can’t produce outputs as good as those of models built on reliable, up-to-date data. This is one of the lessons that Matillion, a provider of software and services for data extraction, transformation, and loading (ETL), will deliver during its one-day virtual Data Unlocked conference on November 15.

Matillion is best known for helping companies efficiently move transactional data from operational systems into cloud data warehouses such as Amazon Redshift and Snowflake. While big data analytics remains popular, the surge of interest in AI, especially since the launch of ChatGPT nearly a year ago, has prompted Matillion to develop features that help its clients make use of the latest AI technology. These features will be showcased at the virtual Data Unlocked conference on Wednesday.

At the summit, Matillion will introduce new capabilities for working with unstructured data for AI purposes, according to Laura Malins, VP of Product at Matillion. The goal is to help users achieve better results from AI models by providing them with improved data.

“We offer components that convert semi-structured data into tabular format, but we believe there are many other possibilities and expect wider application for processing unstructured data such as video data, call data, and even scraped web data,” she says. “Then, you use AI to obtain some sort of summary, notification, or feedback from that data, and feed that data into a warehouse and use it to enhance your structured data.”
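As a rough illustration of what converting semi-structured data into tabular format can mean, the sketch below flattens a nested JSON record into a single row with dotted column names. The record and the `flatten` helper are invented for this example and do not represent Matillion's components.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their columns.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical call-center record, made up for illustration.
raw = '{"call_id": 17, "customer": {"id": "C-204", "region": "EMEA"}, "duration_s": 312}'
row = flatten(json.loads(raw))
print(row)
```

Each key of `row` can then serve as a column name in a warehouse table.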

Use cases like predicting customer churn and sentiment analysis are not new in data science. However, the emergence of powerful large language models like GPT-4 has significantly reduced the effort required to achieve good results in these projects. As an established provider of ETL/ELT pipelines, Matillion is well positioned to deliver high-quality, reliable data for AI models, says Malins.

For example, Matillion could assist a call center client by using a large language model to process unstructured call records and return a summary or feedback about a particular customer, she says.

“Give me an indication based on this data whether this user is happy, sad, or neutral, and why,” says Malins. “And then we can become smarter about the data we have and how we use it. We can obtain valuable and quantitative data from very unstructured and large data sources in ways that were never before possible.”
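The pattern Malins describes, turning free-form call text into a label and a reason that sit next to structured customer data, might look roughly like the sketch below. The `classify_sentiment` function is a keyword-based stand-in for a language model call so that the example runs offline; the customer table, transcripts, and field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SentimentResult:
    label: str   # "happy", "sad", or "neutral"
    reason: str  # short justification for the label


def classify_sentiment(transcript: str) -> SentimentResult:
    """Stand-in for an LLM call: a simple keyword rule, not a real model."""
    text = transcript.lower()
    negative = [w for w in ("cancel", "frustrated", "refund") if w in text]
    positive = [w for w in ("thanks", "great", "resolved") if w in text]
    if len(negative) > len(positive):
        return SentimentResult("sad", "mentions: " + ", ".join(negative))
    if len(positive) > len(negative):
        return SentimentResult("happy", "mentions: " + ", ".join(positive))
    return SentimentResult("neutral", "no clear signal")


# Hypothetical structured customer table and unstructured call transcripts.
customers = {"C-204": {"plan": "pro"}, "C-311": {"plan": "basic"}}
transcripts = {
    "C-204": "I am frustrated and want to cancel my plan.",
    "C-311": "Thanks, the issue was resolved quickly.",
}

# Enrich each structured row with the label derived from unstructured text.
enriched = []
for customer_id, attrs in customers.items():
    result = classify_sentiment(transcripts[customer_id])
    enriched.append({"customer_id": customer_id, **attrs,
                     "sentiment": result.label, "reason": result.reason})

for row in enriched:
    print(row)
```

In a real pipeline the stand-in function would be replaced by a model call, and the enriched rows would be loaded into the warehouse alongside the existing customer data.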

The aim is to build on Matillion’s role as a trusted provider of structured data and help clients take their unstructured data further into the AI realm, Malins adds. Currently, there is a lack of trust in the data.

“One thing we see is a fear with external models and who has access to those models and whether they can lose their data, etc.,” says the VP of Product. “Matillion would provide you that traceability, that data reliability. So, you have repeatability of process. You understand where the information is coming from, so if something is wrong, you can adjust it for next time.”
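One simple way to picture this kind of traceability is a lineage record stored with every model output, noting where the input came from, a fingerprint of that input, and which model produced the result. The sketch below is an assumption about what such a record could contain, not a description of Matillion's product; the source path, model name, and field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def lineage_record(source: str, payload: str, model: str, output: str) -> dict:
    """Build an audit entry linking a model output to its exact input."""
    return {
        "source": source,
        # The hash lets you later verify which input produced this output.
        "input_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "model": model,
        "output": output,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical values for illustration only.
record = lineage_record(
    source="calls/2023-11-15/C-204.txt",
    payload="I am frustrated and want to cancel my plan.",
    model="example-llm-v1",
    output="sad",
)
print(record)
```

Because the same input always yields the same hash, a questionable output can be traced back to its source and the run can be repeated or corrected.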

Companies are experimenting heavily with generative AI and large language models. One approach is to create ensembles of models, where the outputs of one model become the inputs of another. This carries certain risks, though Matillion can help clients here, says Malins.

“There’s a quote in the data industry that has always been true: garbage in, garbage out,” she says. “This just adds more garbage. The data is not validated and verified. AI speeds up the process, so it just multiplies that garbage and makes it harder to distinguish between what’s actually good and what’s not.”

Taken to an extreme, this can lead to AI model collapse, where a model’s output becomes practically worthless. That possibility will increase the importance of traceability and data governance, says Malins.

“A key trend in the industry that will follow is traceability: who inputs what into the models and who owns the outputs,” she says. “I think 2023 was quite chaotic in terms of AI, and I think some elements of that will remain in 2024 as well. Companies have a lot of work ahead to tackle this technology and learn more about what can be gained and what can be invested and gained in return.”

The Data Unlocked event will feature speakers including Laura Malins, Matthew Scullion (CEO), and Ciaran Dynes (Product Director) of Matillion, as well as Frank Slootman (CEO of Snowflake), Mo Gawdat (former Chief Business Officer at Google), and David Coulthard, a former Formula 1 Grand Prix driver. Participation in the virtual event is free.

FAQ

What is ETL?
ETL stands for “Extract, Transform, Load” and refers to the process used to transfer data from operational sources to data warehouses.
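A minimal sketch of the three steps, using only the Python standard library: the CSV content, the table name, and the in-memory SQLite database (standing in for a data warehouse) are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Extract: read rows from a hypothetical CSV export of an operational system.
source_csv = "order_id,amount,currency\n1,19.99,usd\n2,5.00,eur\n3,,usd\n"
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: drop rows with a missing amount, cast types, normalize currency.
clean = [
    (int(r["order_id"]), float(r["amount"]), r["currency"].upper())
    for r in rows
    if r["amount"]
]

# Load: write the cleaned rows into a table (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)
conn.commit()

loaded = conn.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

The row with the missing amount is filtered out during the transform step, so only two rows reach the table.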

What is unstructured data?
Unstructured data is data that lacks a predefined schema or format, such as free text, video footage, and audio recordings.

What are large language models (LLMs)?
Large language models are AI models trained on massive amounts of text to learn language patterns and generate output that resembles human writing.

What is AI model collapse?
AI model collapse refers to the situation where the outputs of an AI model become essentially worthless because the model relies on unreliable, unvalidated data, which can include the outputs of other AI models.