Designing Data Quality for AI Use Cases
Mona Rakibe, CEO and Co-Founder of Telmai, an AI-based data observability platform built for Open Architecture. A veteran in the Data Infrastructure landscape, Mona Rakibe worked in engineering and product leadership positions that fueled product innovation and growth strategies in the dynamic landscape of startups and enterprises like Reltio, EMC, Oracle, and BEA, where AI-driven solutions have played a pivotal role. She is consistently recognized for her work in Data and AI. She has been a finalist in the VentureBeat “AI Entrepreneur” category in 2023 and the "Silicon Valley Women Data Leader Award" in 2019.
Often, the data stored for AI workloads is in its raw formats and stored in data lakes with open formats. This talk will focus on designing a data quality strategy for these raw formats.
Slide deck: https://docs.google.com/presentation/d/1b9T0qJqrXCUlVBpOSyT1X6I7C6CeRn_B/edit?usp=drive_link&ouid=103073328804852071493&rtpof=true&sd=true
Mona Rakibe [00:00:09]: Okay. My name is Mona Rakibe. I'm the co-founder and CEO of Telmai. Telmai is a data observability company, a data quality automation framework, which literally accelerates your data reliability foundation. It automates a lot of stuff that we typically don't like to do, which is validating your data. And today I'm going to talk about how you design data quality for AI use cases. So, first things first: I believe it's very well established by now that the foundation of AI, the success of AI, is hugely dependent on the quality of the input data, the quality of the data that's driving it.
Mona Rakibe [00:00:59]: So a third, and this is a Gartner statistic, a third of GenAI projects will be abandoned by 2025 because of things like foundational capabilities, including data quality. Now that's a really, really shocking number. And we are already seeing that companies are abandoning their AI initiatives because they're not getting the ROI. 93% of teams fail to detect data issues before there is a downstream impact. That means pretty much most companies are seeing the impact of that data directly through their consumers. Now, as I mentioned, this is a well-established problem statement. The fact that we are here today at an AI quality conference means that we are all interested in it, and we all know that this problem is established. A year back, I gave a talk on some research we had done on how controlling the quality of input data significantly improves the precision of your model output.
Mona Rakibe [00:01:59]: So that research can be found on the Telmai website, in the Telmai blogs. But today I'm not here to talk about that. Today I'm really here to talk about this: if that problem statement is established, and day in and day out we are seeing that there is an impact on business, an impact on credibility, bad media and all of these things because of something as petty as an extra zero added to a column value, then why, with such smart data infrastructure and engineering teams, have we not solved this problem that well? So today I'm going to spend a little bit of my time focusing on what we learned and what we do at Telmai to solve this problem at scale, and specifically for AI workloads. I just have ten minutes, so I'm not going to spend a lot of time. I know everybody's juiced out with a lot of things to learn today. But if you think about data, I think about three primary use cases of data.
Mona Rakibe [00:03:01]: First is machine learning and AI use cases. The second is BI and analytics. The third is where data itself is a product, like you're processing the data, making it usable, and selling the data itself, right? At a very broad level, if you think about these use cases, the way we look at data quality, the way we look at data accuracy, might be different in each of them. The source from which we consume this data and the way we query that data are very different. And hence how we design for AI is very different. And one of the reasons why data quality is still unsolved is the tools and technology, the others being the processes, the people, and the priority, which are probably less in our hands.
Mona Rakibe [00:03:45]: Let's talk about the technology piece here. So if you think about BI and analytics, most of the data is in a warehouse. It has already gone through some processing, you have controlled some amount of what's there, and the output is more deterministic. So even if there is a failure, you can go back and do a root cause analysis. AI, on the other hand, is much tougher. It's more probabilistic. Once the data goes into your models, if you see there are problems, you have to backtrack. It's much harder to fix it afterwards. So the best bet we have is controlling the input that goes into your AI model. And now, why is it hard? Most of your AI data is in data lakes.
Mona Rakibe [00:04:28]: Full disclosure, I'm more on the data engineering, data infrastructure side. So my skill set is more in data engineering. And hence the question is: what can we do at the data architecture and engineering stage that can help build better AI models? First things first, most of this data is in a data lake. Most of this data is in formats that are not SQL queryable. These formats are JSON, CSV, Parquet, open formats, and hence the whole thing about Delta Lake and Iceberg and all of those things. The data that's being used by AI is typically not in the best queryable format. So how do you design a data quality strategy for open architecture? And this is only going to go on and on, because we are planning for the scale of the future. So first, the volume is very high and the variety of data is extremely complex.
Mona Rakibe [00:05:24]: So if you look at this, these are the different formats. That's very, very typical in an enterprise. The second thing is we are preparing ourselves for a very high scale. So if you think about data quality, you need to look at nested structures, JSON structures, semi-structured data, and understand: how complete is that data? How fresh is this data? How do I set up rules for this? And we are talking about a volume which is petabyte scale, with data changing every single day. So how do we design a strategy that scales and is performant? But every CFO is worried about cloud cost. How do we do all of this with a low cloud cost? This is the number two factor in why it is such an unsolved problem. Because just throwing a bunch of validation rules at an ingestion layer, at different layers, is not solving it for the scale and cost that we are seeing today. The third thing, specifically: if you look at lightweight metrics, like if you're trying to understand whether the data is fresh, whether the data has the right volume through a checksum, those are slightly easier to solve by metadata scanning.
Mona Rakibe [00:06:30]: But if you're truly looking at data quality, if you're trying to control the noise level, if you're trying to understand anomalies in the data, you need to look at the full fidelity of the data. And this is an extremely hard problem to solve because everybody's hungry for more data. And if you want to look at the full data and solve it for data quality, you have to do it at low cost, but also at scale. And scanning the full data is extremely important, in my opinion, because when you look at data quality, if somebody tells you, "I know you have a problem," and doesn't tell you how widespread this problem is, it almost creates a sense of anxiety without knowing what to do. So imagine going to a doctor who says, "I can tell you have an infection, but I can't tell you which parts are affected." So doing data quality at the full fidelity of the data, without sampling, becomes a prerequisite to achieve a truly scalable and usable data quality platform. Now, when I say data quality, how I measure data quality is outside of the metrics of observability, like pipeline health and stuff. Data quality goes beyond that to add data accuracy, where we can find anomalies and drifts in the data, but also: can we systematically scan the entire data and tell you that 20% of this data doesn't meet the SLAs that your AI system needs, and hence that 20% has to be parked separately and not ingested into your AI workload?
Mona Rakibe [00:08:01]: That to me is truly powerful: having an automated system where you can check the data, including putting in data contracts and anomaly detection, and then exclude the data that's suspicious or can impact AI, right at the ingestion into the AI models. So when we talk about data quality, those are the considerations that will make a truly scalable, AI-grade data quality. And these are the considerations that everybody needs to think about when they think about getting control over the quality of the data that's being ingested. So I'll talk a little bit about how this happens. Oh wow. This slide is a little bit confused. The first thing is that eliminating problems in data at ingestion will really improve your cost overall.
Mona Rakibe [00:08:52]: So there are a couple of things. I've spoken to many companies. Some of them are using only 10% of the data, but they're querying 100% of the data. So that impacts the cloud cost. What if you had a way to understand which data is actually being used? If you think about the medallion architecture, which is very well adopted now, why do you need to move the bad data into your gold layer? If you know that this data can be scanned, processed, and quality checked, then just move, consume, and query the data that has already passed the quality checks. This has a huge impact on your cloud cost. So improve the model performance, not just the outcome of the model but also the cloud cost, by having the right foundation. And all of this can be automated.
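As a rough illustration of the idea above (check every record, promote only what passes, and park the rest separately), here is a minimal Python sketch. It is not Telmai's implementation. The field names `customer_id` and `revenue`, the two check functions, and `split_by_quality` are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Check = Callable[[Record], bool]


# Hypothetical checks: real ones would come from your own data contracts.
def has_customer_id(rec: Record) -> bool:
    """The record must carry a non-empty customer_id."""
    return bool(rec.get("customer_id"))


def revenue_is_non_negative(rec: Record) -> bool:
    """The revenue field must be a number that is zero or greater."""
    value = rec.get("revenue")
    return isinstance(value, (int, float)) and value >= 0


def split_by_quality(
    records: List[Record], checks: List[Check]
) -> Tuple[List[Record], List[Dict[str, object]]]:
    """Scan every record (no sampling) and separate passes from failures.

    Returns (passed, quarantined). Each quarantined entry keeps the
    original record plus the names of the checks it failed, so the
    problem can be sized and remediated later.
    """
    passed: List[Record] = []
    quarantined: List[Dict[str, object]] = []
    for rec in records:
        failed = [check.__name__ for check in checks if not check(rec)]
        if failed:
            quarantined.append({"record": rec, "failed_checks": failed})
        else:
            passed.append(rec)
    return passed, quarantined
```

Only the `passed` list would be written on to the next layer. The size of `quarantined` relative to the whole batch answers the "how widespread is the problem" question.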
Mona Rakibe [00:09:41]: There are tools today like Telmai that can do that. The second thing is automatic data profiling. When your data scientists are using data, each of them is checking this data: how complete is that data? Which data sets do they need? You're already loading the data and making distributed checks across it. But if you had a tool that automatically scanned raw data in its raw format, JSON, Parquet, whatever format, in your cloud data lakes, then you could get insights like completeness, accuracy, and correctness very quickly and then use them in your AI workload. Now for me, data quality monitoring is very important. It's an ongoing process and it has to be automated as a part of your pipeline. So it's not just checking data quality: if your data quality is bad, then apply a circuit breaker and don't let that data flow and corrupt your downstream systems.
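To make the profiling and circuit-breaker ideas concrete, here is a small Python sketch under stated assumptions. It profiles completeness of nested JSON records by dotted path and raises an exception to stop the pipeline step when a path falls below a threshold. The names `get_path`, `completeness`, `circuit_breaker`, and `DataQualityError`, and the sample fields, are hypothetical and not taken from any specific tool.

```python
import json
from typing import Dict, List


class DataQualityError(Exception):
    """Raised to stop a pipeline step when a batch fails its threshold."""


def get_path(record: object, path: str) -> object:
    """Follow a dotted path such as 'address.city' through nested dicts.

    Returns None when any key along the path is missing.
    """
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def completeness(records: List[dict], paths: List[str]) -> Dict[str, float]:
    """Fraction of records holding a non-empty value at each path."""
    total = len(records)
    empty_values = (None, "", [], {})
    profile = {}
    for path in paths:
        filled = sum(1 for r in records if get_path(r, path) not in empty_values)
        profile[path] = filled / total if total else 0.0
    return profile


def circuit_breaker(records: List[dict], thresholds: Dict[str, float]) -> Dict[str, float]:
    """Return the profile, or raise if any path is below its minimum."""
    profile = completeness(records, list(thresholds))
    failures = {p: score for p, score in profile.items() if score < thresholds[p]}
    if failures:
        raise DataQualityError(f"completeness below threshold: {failures}")
    return profile


# Newline-delimited JSON, as it might sit in a data lake (made-up sample).
sample_lines = [
    '{"id": 1, "address": {"city": "Pune"}}',
    '{"id": 2, "address": {"city": ""}}',
    '{"id": 3, "address": {"city": "Oslo"}}',
]
sample_records = [json.loads(line) for line in sample_lines]
sample_profile = completeness(sample_records, ["id", "address.city"])
```

In an orchestrated pipeline, calling `circuit_breaker` before the load step means a bad batch fails loudly at ingestion instead of flowing downstream.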
Mona Rakibe [00:10:33]: So those are the types of data quality monitoring checks that need to be orchestrated as a part of your pipeline. Almost like guardrails, having your safety nets before the data enters your AI workloads. And then oftentimes the problems we see in enterprises, the examples that I gave about the bad reporting on earnings and stuff, these are not problems that we can catch through validations. These are the unknown unknowns that come in the data cycle, which are very hard to eyeball and figure out at the scale that we operate at today, or to catch through rules or contracts and so on. These are typically business anomalies or drifts that are much better solved through machine-learning-based models and time series analysis. So when you do a data quality check, it's not just looking at the known unknowns, but truly applying your machine-learning-based models to find whether there are out-of-range values. If the revenue of this company has always been in a certain range, can we predict what the year-over-year growth could be and make sure that it's not out of range? And even if it's out of range, can we know why? Is it an accidental addition of a zero? So those are the anomalies and outliers from machine learning models. And last but not least, data travels. Data is transforming.
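The out-of-range and extra-zero checks described above can be sketched with a simple statistical stand-in. The talk refers to machine-learning-based models and time series analysis. The code below is deliberately simpler: it uses a median and median-absolute-deviation band on a log10 scale, and it is not Telmai's method. The function names, the factor `k`, and the minimum spread of 0.01 are illustrative assumptions.

```python
import math
import statistics
from typing import List, Tuple


def robust_bounds(history: List[float], k: float = 5.0) -> Tuple[float, float]:
    """Plausible range on a log10 scale: median +/- k * MAD.

    Assumes every value in history is positive. The spread is floored
    at 0.01 so a perfectly flat history still allows small variation.
    """
    logs = [math.log10(v) for v in history]
    med = statistics.median(logs)
    mad = statistics.median([abs(x - med) for x in logs])
    spread = max(mad, 0.01)
    return med - k * spread, med + k * spread


def classify(value: float, history: List[float], k: float = 5.0) -> str:
    """Label a new value as 'ok', 'possible_extra_zero', or 'out_of_range'."""
    if value <= 0:
        return "out_of_range"
    low, high = robust_bounds(history, k)
    log_value = math.log10(value)
    if low <= log_value <= high:
        return "ok"
    # Dividing by 10 is the same as subtracting 1 on the log10 scale.
    # If that brings the value back in range, an extra zero is likely.
    if low <= log_value - 1 <= high:
        return "possible_extra_zero"
    return "out_of_range"
```

With a history of values around 100, a new value of 1070 would be flagged as a possible extra zero, because 107 fits the historical band while 1070 does not. A production system would also account for trend and seasonality, which this sketch ignores.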
Mona Rakibe [00:11:55]: It goes from one stage to another, through different layers, from bronze to silver to gold, depending on how you have architected it. When this data travels, can we have an automated way to know that the data has actually made its way? There could be a zillion things that can go wrong, from APIs to many other issues. So you want a systematic metric monitoring system that can check that the data has landed properly and that the transformation jobs have not messed anything up. These are the guardrails that can actually help. And Telmai does a lot of this and much more. The biggest advantage of this is: how can we automate this? So the data teams are focusing on building the models, and data quality is just something that should naturally work and be checked at ingestion, at the data lake level. So at a high level, these things will help you set the foundation. I'll summarize all of them. So whether you go with a homegrown tool or you go with a tool that's available in the market like Telmai, when you're designing for AI workloads, I suggest you look for three main things when it comes to data quality.
Mona Rakibe [00:13:02]: First, it has to support any data type, any system, because your data is never going to be all in a warehouse or in a queryable format. So make sure you design for open architecture and that your tools support open architecture and open formats. The second thing is that sampling works for simple metrics. But if you're truly doing data quality, excluding data and finding where the problems are to remediate them, then you need to do it with no sampling. But of course, all of this cannot be done unless you adopt an architecture that can do this at scale and with high performance, but at extremely low cost. So it almost sounds like a unicorn world, but it's possible. Trust me, I've spent four years on this.
Mona Rakibe [00:13:43]: You can do all of this, but keep this in mind, because then you're designing for the future and you're designing for AI workloads. That's all I had to share. I don't have a fancy slide with a barcode which says how to find Telmai. I'm very easy to find on LinkedIn: Mona Rakibe. Telmai is easy to find. It's telm.ai, Telmai. So telm.ai. And I'm very, very reachable on LinkedIn, especially if you want to talk about this problem.
Mona Rakibe [00:14:12]: Thank you so much.