Talk to Your Data: The SQL Data Analyst

Paul van der Boor is a Senior Director of Data Science at Prosus and a member of its internal AI group.

He is focused on building AI-powered products that give companies the tools and expertise needed to harness the power of AI in their respective fields.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
In Agents in Production [Podcast Limited Series] - Episode Four, Donné Stevenson and Paul van der Boor break down the deployment of the Toqan Data Analyst agent at Prosus—why, how, and what worked. They discuss the challenges of productionizing the agent, from architecture to mitigating LLM overconfidence, key design choices, the role of pre-checks for clarity, and why they opted for simpler text-based processes over complex recursive methods.
Demetrios [00:00:00]: We are back for another edition of the Prosus MLOps community limited series all around using agents in production. And this episode we break down the Toqan Data Analyst, which is one of the agents that actually made it to production. So Paul and I talk about the inspiration behind it first and then we bring in Donné to talk about the technical aspects and where they hit challenges. Again, it's very rare for me to see companies that are this transparent about the technical difficulties of putting something into production. And so I gotta give a huge shout-out to Prosus, and thank you for making this happen and being this transparent. Let's get into this episode.
Paul van der Boor [00:00:47]: Welcome back to another episode of the Prosus AI ML Ops Agent production podcast.
Demetrios [00:00:53]: The miniseries I like.
Paul van der Boor [00:00:54]: What are we going to talk about today?
Demetrios [00:00:56]: Data Analyst and everything that that means, because I'm sure when we say it, nobody understands it. It is an internal project that made it through the great filter of your projects that have had success. So break it down. What is it?
Paul van der Boor [00:01:11]: Well, what we talked about last time is that we're moving down this sort of evolution of agents and getting them into production in ways that can add value. And you spoke about the graveyard of AI agents. The data analyst is one of those that is not at the graveyard. It's live and kicking. It's in production, answering a ton of questions. There's a lot of experimentation going on. And the data analyst is the agent that essentially helps anybody answer questions about data that sits in any of the companies in the Prosus group, using an agentic framework. And it uses all of the agentic tools that we talked about earlier.
Paul van der Boor [00:01:51]: So the ability to go out and fetch data, because it can talk to the database; it understands the fields and the columns. And in essence, it's like talking to a real data analyst on Slack. That's what it is.
Demetrios [00:02:03]: And we're going to bring Donné on later to talk about the technical ins and outs of this. But before we do that, let's talk more about the inspiration, the actual tool, what it means on the business side to have this, and a little less technical. If somebody wants to just jump to the technical part, they can skip ahead right now, but we're going to set the scene for what this actually means and who's using it. What successes have you found from it?
Paul van der Boor [00:02:32]: Yeah, the inspiration at the end comes back to, you know, who we are and what we do as the group at Prosus. So we're about 30,000 employees spread across various tech companies, where people continuously try to innovate, make decisions in the product, try to understand, you know, what the customers are asking, where their orders went, and so on. And that often requires going into a database and fetching the right information to be able to make a decision, respond to a customer, dispatch an order, and so on. And so we've got thousands of people whose full-time job as data analysts is to help others in the organization and enable them to make decisions. And as we found our Toqan data analyst was actually able to answer those questions with high fidelity, high precision, we wanted to roll that out to basically beef up everybody's ability to talk to data, ask any question, and talk to the data.
Demetrios [00:03:30]: And you say, so there's thousands of data analysts, I presume they know how to go in and ask questions of the data already. So is it only them that are getting use of this data analyst agent or are other people able to now get more familiar with the data and answer questions?
Paul van der Boor [00:03:51]: No, we see it as an addition to any data analyst team. So you've got existing teams of data analysts, and the Toqan data analyst, this agent, joins a team of data analysts and is able to answer a whole bunch of questions with them for end users; it's part of the team. So for example, when people go in and ask a question to the data analyst, it's in a Slack channel where the other data analysts are also there and would maybe typically be the first point of contact. Now the Toqan data analyst is the first point of contact. It can answer the question. There's other data analysts in these teams that can see the answer and make sure that if there's an issue with the answer, it can't retrieve it, or it's incorrect, they can flag that to the end users. So it's very much an addition to the existing data analyst teams.
Paul van der Boor [00:04:38]: Of course, the Toqan data analyst agent is always available, so it answers very quickly. We've seen that that's a huge benefit. It can't always answer everything, but because it's fast and it's got high accuracy, it's something that has been able to give people a lot more speed in finding the right information they need.
Demetrios [00:04:56]: Yeah, it's almost like the support burden of people coming to data analysts, and maybe it's other data analysts coming to another data analyst to say, hey, I'm trying to get deeper on something that you're very familiar with, can you give me these answers? Right. That support burden has now been almost filtered out a bit by the data analyst, because you do not need to go to another human as your first point of contact. You can see if the agent can answer it, and if it can, cool, I got what I need, I'm going to go and do my thing. But otherwise, if it can't, then you can escalate it up to a human. And I think that's one of the pieces that we're seeing time and time again in these use cases: the support use case for agents, being that first line of contact. And if they can't answer or get the ticket closed, then you can escalate it up to a human. And so it relieves a bit of that first line, those simple questions, right?
Paul van der Boor [00:06:05]: Yeah. Let me give you a real example of how it's being used. So let's take iFood, right? It is a leading global food delivery player based out of Brazil, and they do more than a billion orders every year. And there's a lot of people in the organization who are trying to understand what's happening. So how many orders has this restaurant done, or what are we seeing in this particular zip code? Or a customer operator that's trying to figure out what happened to a specific order with a certain order ID. To get the answer to those questions, you need to be able to interact with the data. But these are often repeat questions. Hey, tell me, what's the status of this order? Or, like I said, I'm going to visit this partner, this restaurant.
Paul van der Boor [00:06:56]: How have they been doing over the last three months? Those are the kinds of things that the Toqan data analyst is able to do really, really well because it understands the data. They're fairly simple queries. There's still a lot of very complex questions that the human data analysts obviously need to help with. But it's sort of, as you said, the first layer of high-frequency questions that we know we can answer well, so that the data analysts can support these people that aren't able to write SQL queries themselves or don't even have access to the databases to do so.
Demetrios [00:07:32]: Yeah, that's a great point. And you talk about iFood being one of the companies that you released this to. Did you release it to everyone in the Prosus group, so that anyone who wants to can now use the data analyst, or is it something that you rolled out to specific companies in the portfolio?
Paul van der Boor [00:07:54]: Yeah, we have. It's not sort of a flip of the switch where you just connect it and it works. There's a process that takes a couple of days to onboard a Toqan data analyst on a specific use case. So it's available to onboard for everybody, and we've made it fairly seamless to do that onboarding. It's very quick and easy to do, but it still requires effort because of course there are some integrations. You know, we can connect with Snowflake and with Databricks and with BigQuery and many of the other commonly used data stores. But that still needs to happen, right? That's an integration that needs to take place.
Demetrios [00:08:31]: Then it needs to be a specific instance, like you have the iFood data analyst instance, so that you're not grabbing data analyst stuff from another company.
Paul van der Boor [00:08:40]: This is all, I mean, compartmentalized, separated, and that is sort of by design how Toqan and its agents work. So it's per company, but then even within a company. So when it goes to OLX, which is one of our global classifieds players, or Glovo, which is also in the food delivery space, and we're all using this data analyst, they need to then go and figure out, okay, this is a marketing use case and the data analyst needs to be connected to marketing data, or it needs to be connected to the logistics data set, which actually tracks the real-time status of orders. And so that integration needs to take place. And then once that's done, anybody can ask questions about that data set. And we see that when people do that, there's two things that happen.
Paul van der Boor [00:09:25]: One is the time to get the answer reduces. We measured about a 74% reduction, which.
Demetrios [00:09:33]: 74% reduction.
Paul van der Boor [00:09:35]: Yeah. So now that's huge. That's huge. I mean for folks who are.
Demetrios [00:09:40]: I thought it was like 20%. Yeah, even 20% would be a success. Right. But 74%, first of all, how did you measure that?
Paul van der Boor [00:09:48]: Right.
Demetrios [00:09:48]: We got to go deeper into these numbers because that is huge.
Paul van der Boor [00:09:51]: Yeah. I mean, so when somebody has a question, they typically post it to one of these Slack channels, and then they need to wait for somebody who's available, and then that person needs to go spend some time, write the query, check it, and return and respond. Yeah, we're only measuring here the actual time it takes once somebody is already working on it, basically.
Demetrios [00:10:11]: Okay, yeah. So query submitted.
Paul van der Boor [00:10:14]: Yeah.
Demetrios [00:10:14]: And then response submitted type of thing. Okay, yeah.
Paul van der Boor [00:10:17]: So that's one piece. The other piece is the fact that we see many more questions being answered as a whole on these data sets. So we saw very quickly after connecting the Toqan data analyst to, for example, the marketing data set or the logistics data set, those became the most used data sets in the company, because now anybody can ask a question of them. So we have a doubling or tripling of the number of insights, we call that, that we can actually generate based on this data, because there's no bottleneck or limit to the number of humans who can run queries. In fact, now anybody can go into that channel and ask a question. So those are the two interesting effects. Time reduction, but also net new insights generated as a total for a company. A lot of people, almost everybody in an organization, want to be able to quickly get information.
Paul van der Boor [00:11:15]: And some organizations have really spent a lot of time structuring how to make sure that they democratize data access, right? Sort of the popular term. But to give you an example, at iFood they have a term called BADHU. It's an acronym for Business Analyst Data Heavy Users. And these are the people that basically need to interact with data very regularly, continuously, daily, to be able to do their job. Often it's in a supporting role, but not always. Maybe they're part of a sales team, a logistics team. And they realized that this was an important skill they needed to have as an organization to be able to move fast.
Paul van der Boor [00:11:55]: And so they created a whole program. They gave this name to this cohort of people, and they gave them access to a certain set of tools. And it's exactly that constituency that would benefit a lot from this kind of a tool, because they are interfacing with the data all the time, they know it intimately well, and they've got a lot of repeat questions. And a lot of organizations aren't as structured around this as, for example, iFood, but we see that almost every organization in the group has that need and has some sort of group of people who are continuously leaning on data and asking questions.
Demetrios [00:12:32]: And if it wasn't with Toqan, they would probably be pissing off the data team, yeah, if they were asking things of the data teams. It reminds me of a conversation I had with a big media company back in the day, right. And they said that they were inundated with data questions continuously, so much so that eventually they couldn't get their work done. What they did to fix this problem, because it was before the agent world and any notion of being able to have self-serve, they said: we're going to send out data advocates into the company, and we are going to pair them with each person and make sure that that person knows, when they ask a question of the last marketing campaign data, they have to have a next action. So they can't just ask these random questions that later they're not going to action, because those were what was taking up more time for the data team to answer and service.
Demetrios [00:13:36]: And if it didn't actually do anything, because, oh, the marketing team now knows that this was better than that... These small insights that you recognize, they are very powerful, because they might just percolate in your mind for a while and then, boom, you have a stroke of insight in the shower, and later it's like, it's all because I had those random queries. But for this team, they didn't have the resources to do that.
Paul van der Boor [00:14:02]: So isn't that silly? Because you want everybody to have a good intuition of where the business is going and how we're doing. And for us, often we're managing by certain types of North Star metrics. And there's not always a next action immediately. You just want to understand, where are we? Are we on track? How does the stuff I do contribute to that? And sure, in some cases there's a very direct action, but this feels like constraining anybody's ability to understand.
Demetrios [00:14:27]: Exactly.
Paul van der Boor [00:14:29]: But I think we're here to talk about the technical aspects.
Demetrios [00:14:35]: Might as well get into it.
Paul van der Boor [00:14:36]: Yeah, why don't we bring in Donné, who's our expert.
Demetrios [00:14:39]: Our resident expert.
Paul van der Boor [00:14:41]: Exactly.
Demetrios [00:14:41]: We just heard the breakdown on like, the why, the inspiration. But you get to tell us what's behind the scenes, what happened to make this actually go into production and what were some of these headaches and what would you do differently? Hopefully, if you knew then what you know now. But maybe we can just start with like, what does it look like behind the scenes when someone asks a question of the data? Then what?
Donné Stevenson [00:15:06]: Yeah, so the SQL analyst, I mean, it is an agent. So if you're familiar with agent frameworks, it's running on those same fundamental concepts: you have the LLM, which is the brain, you've got the tools, which are doing the actual work, and then you've got the history, which is maintaining context, which allows the agent to continue a conversation with context and an understanding of its history. To get more into the details: because it is a specialized agent, it also has specialized tools. So I think when we talk about chatbot agents, often it's, you know, web search, there's some file handling. And these are not necessarily what the SQL agent is going to need to do its job. So it's got specialized tools which are specifically for it to understand context on the tables: metadata, schema, data types, what tables are used for, and where they fit into the broader business context. And then I think probably the most fundamental one is the ability to execute SQL queries.
Demetrios [00:16:08]: But it doesn't have specific databases to call as a tool or how does it know where the data lives and which database to go to?
Donné Stevenson [00:16:18]: So we've built it in a way that, from the agent's perspective, is a generalized ability to execute SQL, an execute-SQL tool. And then under the hood, we're building integrations one at a time for the different databases you could connect to. So when you set up your agent, because it is based on a company's data, they do set up the agent with their context. And part of that context is what SQL connector you are using and what dialect you are using. And that's part of what the agent knows as part of its context. And it will then execute SQL and let it know: okay, I need you to use the Redshift one, I need you to use the Snowflake one, and this is the dialect and the version that I'm using.
Demetrios [00:17:04]: Oh, interesting. So basically there's almost like a knowledge base that says if it is a question around marketing, you go to this database and this is the kind of database. So here's the SQL that you should be using.
Donné Stevenson [00:17:16]: Yeah, it's not at this point; it's a bit more fixed, a bit more deterministic. So you set up a SQL agent for your marketing database. And when you set that up, you say, my marketing database is using Postgres and it's using this dialect of SQL. And then anytime a question is asked on that use case, the agent only gets that information. And that's where the Slack channels come in, because that's how we're managing the agent knowing: are you on the marketing database, or are you on the logistics database, or are you on the groceries database?
Demetrios [00:17:48]: Because it's a specific Slack channel for each. Oh, okay. So then you don't even have to have the agent calling different databases. It's just that it knows, I'm a marketing database agent, or it doesn't necessarily know that, but that is how you architected it. That makes sense.
Donné Stevenson [00:18:06]: So from the agent's perspective, all it knows is: I write SQL in T-SQL and this is the database I have access to. And under the hood we take those pieces of information: okay, you've got to go to this particular executor in order to actually run the query.
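To make that setup concrete, here's a minimal Python sketch of the idea with made-up names, not Prosus's actual implementation: each use case carries a small config (Slack channel, connector, dialect, schema docs), and a generic execute-SQL tool routes to the right warehouse integration under the hood, so the agent itself only ever sees one tool.

```python
# Hypothetical sketch: per-use-case config tells a generic "execute SQL" tool
# which warehouse connector and dialect to use. Names are illustrative only.
from dataclasses import dataclass

@dataclass
class UseCaseConfig:
    slack_channel: str   # each use case lives in its own Slack channel
    connector: str       # e.g. "snowflake", "redshift", "bigquery"
    dialect: str         # the SQL dialect the agent is told to write
    schema_doc: str      # table/column metadata curated by the analysts

def run_snowflake(query: str) -> list[dict]:
    # placeholder for a real Snowflake client call
    print(f"[snowflake] {query}")
    return []

EXECUTORS = {"snowflake": run_snowflake}

def execute_sql(query: str, config: UseCaseConfig) -> list[dict]:
    # The agent only calls "execute_sql"; routing happens under the hood.
    executor = EXECUTORS[config.connector]
    return executor(query)

if __name__ == "__main__":
    marketing = UseCaseConfig(
        slack_channel="#marketing-data",
        connector="snowflake",
        dialect="Snowflake SQL",
        schema_doc="campaigns(campaign_id, spend_usd, ...)",
    )
    execute_sql("SELECT COUNT(*) FROM campaigns", marketing)
```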
Demetrios [00:18:24]: One thing that I wanted to know about was: are you using specific text-to-SQL models, fine-tuned in certain ways, or is it just throwing it at whatever OpenAI model?
Donné Stevenson [00:18:40]: Yeah. So right now it's purely on the OpenAI model. When we POC'd it, the effort around hosting your own model wasn't necessarily worth it, as well as the additional cost, because we were working with OpenAI generally for this kind of work. So for the POC it made sense to use the quickest solution. And because what we found is even a generalized model like that, they're really good at generating valid SQL. You don't need a specialist at generating SQL. They're very, very good at that. What they're not very good at is reasoning over the context around the table metadata.
Donné Stevenson [00:19:16]: And you don't need a specialized SQL agent for that. You need a good reasoner. So it made sense to kind of stick with models that were good at reasoning rather than models that were specialized at SQL.
Paul van der Boor [00:19:26]: Also because I think we do a lot of model fine-tuning and so on for other use cases, we had a pretty good idea what it would take. But looking at the Toqan data analyst, initially we wanted to make sure that it was usable, and it needed to have certain accuracy and reliability from a user perspective. And generating the SQL was not the bottleneck there. There's a lot of other reasons. Sometimes it wouldn't answer.
Demetrios [00:19:48]: Yes, but that's what we need to talk about.
Paul van der Boor [00:19:50]: It wasn't the text to SQL capabilities.
Demetrios [00:19:52]: Yeah. What were the reasons that it wouldn't answer if it wasn't the SQL generation?
Donné Stevenson [00:19:57]: Ironically, it wasn't so much that it wouldn't answer, it's that it would continue to answer when it shouldn't. Too much confidence, so much confidence in these large language models. I think it's maybe one of the first places where we really had to start thinking about prompt fine-tuning, you know, trying to set up the persona, trying to instruct this thing away from answering questions that it didn't know the answer to. Because these agents get a bit of business context and they get the user question, and from that they are then expected to evaluate the user question for clarity: do I have enough information to answer the question? And very often, especially at the beginning, it was almost impossible for it to say no.
Demetrios [00:20:43]: Because it always thought it had enough context.
Donné Stevenson [00:20:45]: It would always sort of assume, right, that's what it was doing. And I think this is a large language model problem in general: where there is ambiguity or there is space for assumption, they will do that. And so it becomes quite unpredictable in these cases to know what it will do. And generally what it will do is.
Demetrios [00:21:02]: It wouldn't ask, even if you were prompting it like crazy to say, make sure you ask about this or that?
Donné Stevenson [00:21:08]: We had some use cases when we were building where we had requirements on the amount of information it needed to have, and it would still try.
Demetrios [00:21:16]: And answer, it would hallucinate the requirements. Of course I got that requirement.
Paul van der Boor [00:21:19]: I mean, it's actually one of the scariest failure cases, because if the model is powerful enough in its reasoning capabilities and it has enough examples, it will generate an answer that's plausible. So it's like, yeah, this took 15 minutes to deliver, and it's like, yeah, that actually sounds about right, but there was actually no query run, right? And so it was a very difficult failure case to spot.
Donné Stevenson [00:21:44]: Forcing it to do it was very, very hard. And this is one of those places where we had to find another way to solve the problem, because LLMs were not sufficient, prompting was insufficient. And so we've introduced this idea of a pre-check, where you ask in isolation: does this question have enough? So the agent's job at this point is not to generate SQL. Its job is to say, do I have enough information? And you isolate that task, and then it starts to behave a bit better. And for that purpose we've also built a clarity data set, which is a data set where we give the question and the context, and the test is: is there enough information here for you to answer the question? And we test the models, per LLM, on how good this model is at doubting itself. And that's really been a big part of building the trust in the SQL agent, because the answers are plausible, and if it's not a user who's into the data enough to see.
Demetrios [00:22:42]: That problem, it's a perfect storm.
Donné Stevenson [00:22:44]: They could quote numbers that aren't necessarily true, and then the analysts who do know, they lose trust in it, because it's kind of their assistant. If they don't trust it, they won't use it and they won't give it to people to use. And they own it; they're the ones that have to add people to those Slack channels that let them query the actual agent.
Demetrios [00:23:02]: At the sake of going on a little tangent, not related to the data analyst, I want to talk about how you are breaking down the queries that are coming through or the agents on the back end that are modular in this sense. Because what you talked about there is this verification agent or I don't know if you have a specific name for it.
Donné Stevenson [00:23:26]: So I think in the original version of the Toqan analyst we just called it a clarity check, and it was part of the original flow as the user question comes in. So it's not really part of the agentic framework in a typical sense. It's part of how the architecture of the system is built. And the architecture of the system has this pre-step before it's fed to the agent. So it goes into a single task which is not agent specific. It's just an LLM call, and then that comes back, and based on that decision, either the agent is kicked off or it goes back to the user.
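A minimal sketch of what such a pre-step might look like, assuming a generic call_llm stand-in rather than any specific provider API: the clarity check is a single, isolated LLM call whose verdict decides whether the agent runs at all or the question bounces back to the user.

```python
# Hypothetical sketch of the "clarity pre-check": one LLM call, separate from
# the agent, whose only job is to decide whether the question + business
# context is answerable without guessing.
import json

CLARITY_PROMPT = """You are reviewing a data question before it goes to a SQL agent.
Business context:
{context}

User question:
{question}

Is there enough information to write a correct query, without guessing?
Respond as JSON: {{"answerable": true or false, "missing": "<what is unclear>"}}"""

def call_llm(prompt: str) -> str:
    # stand-in: replace with a real model call; here it always asks for clarification
    return '{"answerable": false, "missing": "which time period the question covers"}'

def clarity_check(question: str, context: str) -> dict:
    raw = call_llm(CLARITY_PROMPT.format(context=context, question=question))
    return json.loads(raw)

def run_sql_agent(question: str, context: str) -> str:
    return "agent answer (placeholder for the normal agent flow)"

def handle_question(question: str, context: str) -> str:
    verdict = clarity_check(question, context)
    if not verdict["answerable"]:
        # go back to the user instead of kicking off the agent
        return f"Before I can answer, I need clarification: {verdict['missing']}"
    return run_sql_agent(question, context)

print(handle_question("How many orders did we do?", "orders table, one row per order"))
```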
Demetrios [00:24:02]: Yeah, because it feels like that can be very generalized across all use cases. Anytime you're interacting with an agent, you should have that as your first step, just to make sure, do we have enough information? You have to give it the context that it needs. But then if you build that once, you can now have that as a modular piece that you stick into all agents.
Donné Stevenson [00:24:25]: Yeah, and I think that's one of the cool things that the SQL agent has taught us about general agents. And I think as we go into more specific agents, because they have their task, and when we build them we understand what we want the agent to do, building the step that goes, okay, let's just double check, let's set up the question correctly for this task, that's a super useful step to implement for any kind of agent, to kind of enforce this uncertainty until LLMs get a little bit more human, a...
Demetrios [00:24:57]: Little less confident now. So that was one big piece. What were other parts of how it fell down? SQL generation isn't the hard part, as you said; clarifying if it has enough information is. I think you mentioned before that you had to create context for the business, and that was really difficult. Because if I'm in marketing and I say an MQL, what does that mean? Or an SQL, what does that mean? And we know, yeah, okay, MQL is a marketing qualified lead. But what is and isn't a marketing qualified lead?
Donné Stevenson [00:25:33]: Yeah, so that was, and I think remains, the biggest bottleneck to getting new use cases set up. And it is the biggest task that any analyst who wants to use this has to go through: setting up the context. So when we talk about context, generally we're talking about the schema. And the schema includes table descriptions and columns, column types, and column descriptions. There's a lot of information that's needed, and many companies that have quite good data platforms, iFood was one of them, have a lot of this available. They have very sophisticated data on their actual tables. And even once we could plug into that, we found that when it's written for humans, it doesn't mean it's written for agents.
Demetrios [00:26:19]: That's an insight. Wait, pull on that thread a little bit more. What does that mean for humans versus agents?
Donné Stevenson [00:26:26]: Well, because the metadata for these tables is still written by someone in marketing for other marketing data analysts. So there's still this kind of expected understanding of the abbreviations. There's an expected understanding of how the data connects to each other. I think we had one funny issue where someone was asking for location data, or based on locations, and the instructions still said, you have to use the country, you have to use the country for this. And someone asked for the city.
Paul van der Boor [00:27:01]: Other examples: financial metrics. Somebody would ask, what's the contribution margin, or what's GM1, gross margin one? Which are all very, very strictly defined terms in a certain context. But the agent wouldn't know without that actually being said: okay, gross margin 1 equals this revenue minus this cost, and so on. And so having that available to the agent at the time of interpreting a user's question is important, because it isn't enough that there's a data catalog that describes gross margin one as this; you actually need a formula, right? And so those are the kinds of things that were typically missing in the data catalogs. There was always a lot of implicit company knowledge already embedded in the way these things were described at the column level or table schema level that the agent wouldn't automatically be able to know or understand.
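As an illustration of that "written for humans versus written for agents" gap, here's a hypothetical example of the kind of metadata block an analyst might write for the agent, with the formulas and disambiguations spelled out explicitly. The table, columns, and metric definitions below are invented, not iFood's or Prosus's actual schema.

```python
# Hypothetical agent-readable context: the same metadata a human catalog holds,
# plus the explicit definitions a human analyst usually carries in their head.
ORDERS_TABLE_CONTEXT = """
Table: orders  -- one row per customer order
Columns:
  order_id           STRING    unique order identifier, e.g. 'ORD-123'
  city               STRING    delivery city; use this for location questions, NOT country
  gmv_usd            FLOAT     gross merchandise value of the order, in USD
  delivery_cost_usd  FLOAT     cost paid to the courier, in USD
  created_at         TIMESTAMP order creation time, UTC

Metric definitions (always use these formulas, do not improvise):
  GM1 (gross margin 1)      = SUM(gmv_usd) - SUM(delivery_cost_usd)
  AOV (average order value) = SUM(gmv_usd) / COUNT(DISTINCT order_id)

Disambiguation:
  'last month' means the previous full calendar month, not the last 30 days.
"""
# This text is what would get injected into the agent's prompt as business context.
```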
Demetrios [00:27:55]: Continuing on the idea of agent versus human descriptions, how do you now think about what needs to happen so that the analyst gets all the data it needs, or is it more agentic in everything that you're building out?
Donné Stevenson [00:28:13]: So, because we tended to build these, especially at the beginning, quite closely with the use cases, so with the teams that were actually using the analyst, we learned a lot about how you need to phrase this information so that it is clear and there's very little space for ambiguity. And so when we have new use cases, we have recommendations on how you structure and instruct it. So I think the big thing we've learned is that we have to learn how to phrase statements without ambiguity. And there are things where you're like, oh, but it's obvious, but it's stupid things like, you know, "if you have sufficient information": to us that's clear, but what does sufficient information actually mean? And you run into these kinds of barriers all the time, especially when you're describing what this is useful for. And so that's what we've had to learn, and the information we try to give as we're setting up new ones.
Demetrios [00:29:11]: So basically anything subjective, you got to get that out of there.
Donné Stevenson [00:29:13]: Yeah, yeah. Anything where especially. Yeah, if it's subjective or if there's an assumption of knowledge, an assumption of context or history, you have to get that out or you have to explicitly put it in.
Paul van der Boor [00:29:26]: I mean, and when we did these onboarding use cases, it was often like, well, imagine you hired a new data analyst and this is day one. They also wouldn't know a lot of these things. So what are the kinds of information you need to provide to them? And, you know, making sure that this stuff is embedded into the system, that it doesn't behave like a day one data analyst, but a hopefully year one data analyst or maybe even more down the road.
Demetrios [00:29:49]: Yeah. Well, now let's talk about how the architecture looks, because is it calling the database right away? Like, what does it actually go into? You said that you have the verification step, and then does it generate the SQL, or does it call a tool that has a function where the SQL's already generated? And this is a hot topic of debate that I love asking people about, because it's like, well, do you have thousands of tools or thousands of function calls, or do you let the LLM generate that SQL, which may or may not be the best SQL?
Donné Stevenson [00:30:26]: Yeah. So we went through a couple of cycles with this. When we were POCing, of course, you're working with one kind of understanding of the problem, and as you release it, that changes. So the big thing is, where do you generate SQL? Initially we did generate it as a separate tool. So the tool was sort of "generate SQL", and it would be given the context generally, like the business rules and the user question, and it would come back with the SQL, and then that would be sent over to an additional tool which would actually run the execution. But the problem, as the context and the problems we were working on got harder and harder, was sharing enough information to the tool that generates the SQL for it to generate SQL that is valid. So at the moment we generate SQL as part of the main agent's responsibility, and that then is passed through as a parameter. But it's one of those things where you are having to weigh up the consequences of that.
Donné Stevenson [00:31:27]: Because what we then found was, if it generates SQL as part of its main flow, which is used in a function, and if it breaks, you have to go back to the main agent flow and kind of start from square one. And often that's not necessary. You talk about an analyst on day one, and I think that's a very good way to think of this analyst, as a trainee. And I think even very practiced SQL analysts are going to struggle with date functions. You always have to look them up. It doesn't matter how long you've been doing it. If you've had to swap between date functions across dialects once, you're going to confuse them. And that's one of those things that this agent frequently makes mistakes on: I don't know which date function to use, even though I know what dialect I'm in. And it's a waste.
Donné Stevenson [00:32:12]: Right. So you go to the execute SQL function, it executes, it breaks, and I've got to go back to the main flow to fix what is probably like three characters that are wrong. So we've also, as we've learned, made these tools more sophisticated and allowed the tool that's getting the SQL to also update the SQL. So there again, where is the SQL being generated? Well, actually, it's being generated in two places. And that's helpful because it's really about how much context I need to do this particular task. When you're generating the initial query, you need all of it. But when you're fixing a date function or a small syntax error, you don't need the full context, you just need the error message.
Demetrios [00:32:53]: It's almost like you checkpoint it and you don't have to go back to square one, you just come back to where there's the problem.
Donné Stevenson [00:32:59]: Yeah, exactly.
Paul van der Boor [00:33:00]: Yeah.
Demetrios [00:33:01]: And I want to hear a bit more about the evolution and the design decisions that you didn't take and why.
Donné Stevenson [00:33:09]: I think one of the big things we get asked about very often is: are we using RAG, and are we using embeddings for the task of selecting relevant schema? Getting relevant schema for the SQL analyst is the biggest step. If you think about going through your table documentation, finding the correct table, and then finding the relevant columns for your question, that's the biggest part you're going to have to do in this whole task. And again, for the agent, that's going to be the biggest thing it has to do correctly. And documentation can get quite big. We've got use cases that have 18-plus tables, and we can't feed all of that information to the agent in one go. Especially when we first started building, the context windows just weren't long enough for us to do that, which is why the tool was built initially. And that process we do in plain text right now. So we store the metadata in a structured way, and that gets fed into an LLM call along with the question.
Donné Stevenson [00:34:10]: And through a series of steps, we filter down to the relevant tables and the relevant columns and the relevant information, and then share as little as possible back to the agent. There is, of course, the option that we could have embedded this information and embedded the user question and done it as that sort of process. Part of the reason we didn't was just the complexity of a solution like that, because we would then have to continuously be updating those embeddings, because they change quite frequently, especially at the beginning when the analysts are setting it up, because they're making context, testing it, changing context, testing it. And if you're doing this via embeddings, it just gets incredibly high maintenance. We didn't want that to start with. And I think what we found when we've tried other knowledge-based projects is, when you're using embeddings, it's quite difficult to embed the content in a way that will get your question to retrieve the correct information. And that comes back to this idea of clarity: questions are short, they don't have a lot of context.
Donné Stevenson [00:35:09]: And in a table, what's being embedded, the gist of it, is quite different to what's in a question. And so it just created such a big problem that didn't need to be solved for us at the time. Maybe as projects get bigger, if we start looking at better ways to structure questions so that they more accurately match what an embedding for the table would look like, that would work. But at least initially, it's so much easier to reason over plain text than it is over embeddings, especially when it starts making mistakes.
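Here's a minimal sketch of that plain-text filtering step, again assuming a generic call_llm stand-in: the question plus a compact table catalog goes through an LLM call that picks the relevant tables, and only those tables' docs are passed on to the agent. The table names and docs are invented for illustration.

```python
# Hypothetical sketch of schema selection over plain text (no embeddings):
# ask an LLM which tables a question needs, then forward only those docs.
import json

def call_llm(prompt: str) -> str:
    return json.dumps(["orders"])  # stand-in: pretend the model picked one table

def select_relevant_tables(question: str, schema_docs: dict[str, str]) -> dict[str, str]:
    catalog = "\n".join(f"- {name}: {doc.splitlines()[0]}" for name, doc in schema_docs.items())
    prompt = (
        f"Question: {question}\n"
        f"Available tables:\n{catalog}\n"
        "Return a JSON list of the table names needed to answer the question."
    )
    chosen = json.loads(call_llm(prompt))
    # share as little as possible back to the agent: only the chosen tables' docs
    return {name: schema_docs[name] for name in chosen if name in schema_docs}

schema_docs = {
    "orders": "one row per order\norder_id, city, gmv_usd, created_at",
    "couriers": "one row per courier\ncourier_id, vehicle_type, hired_at",
}
print(select_relevant_tables("How many orders did we get in São Paulo last week?", schema_docs))
```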
Paul van der Boor [00:35:42]: Yeah.
Demetrios [00:35:42]: And you don't have to figure out the pains of RAG: where are we chunking, how are we chunking this? There's so many things where RAG can fall over, especially if you're doing naive RAG. And then you're like, do I need to make my RAG more advanced here, or is that where the problem is? And so you're adding this complexity that it sounds like you didn't need.
Donné Stevenson [00:36:04]: Yeah. And it's just another thing you have to evaluate, right? So now, besides evaluating whether I can retrieve relevant context, I'm evaluating whether the embedding is correct or not, or appropriate for the project. And it just created so much more complexity than I think is needed for this kind of project.
Demetrios [00:36:22]: Yeah, it starts to sprawl, really.
Donné Stevenson [00:36:24]: Yeah. Because you're trying to solve one problem. Right. So what's the best tool for that? Not the most sophisticated tool for that.
Demetrios [00:36:30]: Yeah.
Donné Stevenson [00:36:32]: So accuracy is a really hard thing with a project like this, or I think just with SQL in general. There's a lot of ways the same query can be written, so you can't really go, okay, well, let's just make sure that they match. Because of things like date factors, you also can't just check that the outputs are exactly the same, because if the test set's not appropriately set up, you can't just use the numbers. So it's very difficult to measure accuracy, and we're very reliant on the analysts who are setting this up to evaluate whether it's appropriate. So initially when we were building, we tried to push really hard to have fixed test sets with a defined question, the expected SQL query, and the expected output. It's quite difficult to get people to commit to those kinds of hard requirements.
Donné Stevenson [00:37:22]: But it was a good place to start, also just when we were making changes, to check that everything was working. But I think part of what we learned when we were doing those kinds of evaluations was, again, refining the problem down to as small a space as possible. So when analysts set up the project, they set up what I think we call use cases. So it's like the use case for collecting orders for the last month based on, you know, said parameter. And they'll say, okay, we're 100% sure that the agent can answer these types of questions. And so they release that as being a question that people can ask regularly.
Donné Stevenson [00:38:00]: And kind of in that field, it can do that use case. And then use cases have levels of complexity as well. So the harder the use case, the harder it is to get the agent to reliably answer it. And so we've classed the use cases into difficulty, and then we've classed them also into: is this a releasable use case for users, or is this something that's still under development? And so every analyst has two versions of their agent available. They've got what we call the prod and the dev, and the dev one is where they're constantly adding new tables, they're changing context, and they're running many, many queries to test it. And only when they're happy with the context do they release it to prod. And prod is where they're adding the users that they would normally be serving, or the other analysts that would need the agent, because that's where they're trusting those use cases to be correct. We don't get a lot of hard numbers from the analysts in terms of, you know, we've got six use cases and 80% of them are at 100% accuracy.
Donné Stevenson [00:38:54]: We don't generally talk in metrics. I think, again, if you think about it as an analyst, they use it, they talk to it, and when they're like, I trust the agent understands, they kind of go, okay, you can now be released.
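For readers who want the shape of those fixed test sets, here's a hypothetical sketch: each case pairs a question with an expected SQL query (kept for human review rather than string-matched) and an expected result, and the check compares executed results, since the same query can be written many ways. The case contents and helper names are invented.

```python
# Hypothetical sketch of a fixed test set for one use case.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_sql: str      # kept for human review, not string-matched
    expected_result: list  # what running the correct query should return

def agent_answer(question: str) -> list:
    return [42]  # stand-in for "run the agent, execute its SQL, return rows"

def run_eval(cases: list[EvalCase]) -> float:
    passed = sum(1 for c in cases if agent_answer(c.question) == c.expected_result)
    return passed / len(cases)

cases = [
    EvalCase(
        question="How many orders did restaurant 123 deliver last month?",
        expected_sql="SELECT COUNT(*) FROM orders WHERE restaurant_id = 123 AND ...",
        expected_result=[42],
    ),
]
print(f"pass rate: {run_eval(cases):.0%}")
```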
Demetrios [00:39:05]: Well, you're saying something that's pretty cool there is how closely integrated the development is with the analysts.
Donné Stevenson [00:39:12]: Yeah, and I think that's part of why this has worked as well as it has: because we didn't build a general SQL analyst and say, okay, there's the analyst, go forth and build context. We worked very closely with the people who were going to use the analyst, and we kind of built it with them, for them. And because they had built it specifically for themselves, they knew it and they trusted it and they were invested in it, and so they pushed it. And that's a huge part, I think, of why it succeeded the way it did. And yeah, it's a very important process, because again, they have that knowledge. They have the domain information that we don't have and the agent can't infer, and they're the only ones who know how to share it in a way that is clear. And where we helped was sort of this.
Donné Stevenson [00:40:03]: How do we get rid of the ambiguity in your descriptions? You know, where do we deal with the fact that there are assumptions here, and where do we put those assumptions in explicit text so the agent has access to them? But we did have to work very closely with them, and now it's a bit more mature. And because the teams we work with share information with each other, like at iFood, where there are loads of data teams that share with each other, they can kind of run on their own. But that first step, the analysts had to be very, very involved in how they build the context.
Paul van der Boor [00:40:34]: I think this approach highlights, to your point, why this is so successful. We've seen, and of course thought ourselves about, other things that we've seen fail, right? So I think there are two common ones that others have tried that haven't seen this uptake. One is the really boiled-down version of: we've got a workflow that can generate a SQL query on our data, for our data, right? So it's basically an LLM, a prompt which has some access to the business context, and its output, given a human-readable question, is a SQL query. A lot of teams have built these in-house solutions, but that didn't really solve the problem for the thousands of other people who can't verify whether that answer is correct or run it. Plus, an actual data analyst doesn't need help with those simple queries; they need help with the complex joins and the ones they actually spend a lot of time on. So that's one avenue where these homegrown, build-your-own SQL query writers don't really solve the big problem. The other one is, there are solutions out there that claim they can entirely substitute a data analyst, right? They basically say, well, listen, we are just a completely no-code talk-to-my-data solution and there's no data analyst involved. And our solution, the Toqan data analyst, is sort of a mix in between the two, because you're interacting with the Toqan data analyst in a Slack channel, a public Slack channel, or at least a shared Slack channel with data analysts.
Paul van der Boor [00:42:04]: And that only happens after the data analyst has curated, honed, and co-developed this system to actually give sufficiently reliable answers, and the data analyst really becomes the kind of orchestrator and supervisor of this actual agent. And I think that's probably why we're at a point now where there's real usage. Because in these other two, you don't really solve for how this thing actually gets used in production, and we're talking about agents in production. If you don't think about the users and expectations and who else needs to contribute to this thing to be reliable, then there's less likelihood for it to be really adopted.
Demetrios [00:42:41]: What about the complexity? Have you seen there's a red line where it just tops out on questions that are super complex and the agents can't get to it?
Donné Stevenson [00:42:51]: Yeah, we've definitely seen that, and I mean, this is something we're still talking to some teams about, because the complexity has kind of generated this.
Paul van der Boor [00:43:00]: There's a ceiling.
Donné Stevenson [00:43:01]: Yeah, there is a ceiling. So maybe I give it too much character, but I kind of think of the agent as always trying to solve it in the simplest way possible. And so it's quite difficult to convince it to do very complex queries, where it's doing very complicated joins, there are CTEs and multiple-step queries.
Paul van der Boor [00:43:23]: Yeah, it basically always goes to the simplest solution, and if you go to the complex things, there's sort of a natural ceiling. I think a substantial constituency of our users actually think it's still not good enough, because they want to ask it many more complex things and it defaults to the simplest route. If you look at the pyramid, it's a healthy share of the questions coming in, but there's still a lot of other, more complex questions that it could, and probably soon will, be able to answer, but doesn't yet. And users, I think, find that disappointing or frustrating sometimes.
Donné Stevenson [00:44:04]: But I think that's why we also position this as: it's not going to replace a data analyst. A data analyst is going to be able to do these very complex queries. And it's also not necessarily for someone who knows nothing about the data and has zero insight into what information they need or what the technical requirements are. It's really that middle ground: people who need some assistance with the data, who have a little bit of understanding of what they're looking for, and it can do that, right? And where the value is, is that the analyst who can do the complicated queries doesn't have to do it anymore. So there's this offloading, or kind of first line, where we can say, okay, try the analyst and you can get your answer from there. And then, yeah, the analysts who would normally be doing that can focus on those complicated questions and also their actual work, right?
Donné Stevenson [00:44:57]: So the insights they're trying to generate for business users, for campaigns or conferences and that sort of thing. Whereas the sort of in-between users, they can use this to catch whatever insights they want, and that means they can ask many more questions and they can make more data-driven decisions than they could before, just by virtue of this thing always being online.
Paul van der Boor [00:45:18]: Yeah. There might be a couple of other things that are, I think, interesting on the technical front that we learned at the beginning. For example, because it's got a validation step and it can sort of self-correct, and sometimes it can go back and fetch other columns, at the beginning one of the simple things we had to solve for is that it gets stuck in loops, right? So that's one thing: how do you get this thing to, at some point, just say, I'm done, right? Like, I don't know, or I do know.
Paul van der Boor [00:45:44]: Another thing was, how do we make sure that we have that integration set up in a way that's agent compatible? Because you're actually running queries on a database, and you don't want to do SELECT *, right? And because, like I said, iFood does a billion orders a year. So if you just SELECT * on the order database, you should have some kind of guardrails on what queries it should and shouldn't run.
Demetrios [00:46:07]: Yeah, the agent could get really expensive.
Paul van der Boor [00:46:09]: Really fast. Or DROP TABLE. Obviously there's no write access; that kind of stuff should never be possible. So those things we had to solve, right? Then you had to make sure that some of the queries, because when people started using it a lot, wouldn't require so much memory that they would shut down or overload the database and nothing would work anymore. We had to deal with the fact that at some point people wanted to visualize the data, right? So okay, well, we had a code execution tool, so we connected that. But it needed to meet certain standards. So how do you make sure that the charts that are produced are correct? We had issues on... Again, I'm sharing problems here and talking about real hard-learned lessons and real problems.
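A simplified sketch of the kind of application-side guardrails being described here, not Prosus's actual checks: block write/DDL statements, refuse unbounded SELECT *, and cap returned rows. In practice this would sit on top of read-only database permissions rather than replace them.

```python
# Hypothetical guardrail for a read-only SQL agent: reject writes/DDL,
# reject unbounded SELECT *, and cap the rows any query can return.
import re

MAX_ROWS = 1000
FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE)\b", re.I)

def guard_query(query: str) -> str:
    if FORBIDDEN.search(query):
        raise ValueError("write/DDL statements are not allowed")
    if re.search(r"SELECT\s+\*", query, re.I) and "LIMIT" not in query.upper():
        raise ValueError("SELECT * without a LIMIT is not allowed on large tables")
    if "LIMIT" not in query.upper():
        query = f"{query.rstrip(';')} LIMIT {MAX_ROWS}"  # cap memory/runtime
    return query

print(guard_query("SELECT order_id, status FROM orders WHERE order_id = 'ORD-1'"))
```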
Paul van der Boor [00:46:54]: Other things that we struggled with: well, what data should it actually return to the user? Sometimes users wanted a set of orders: give me a sample of orders that they want to see sent back as a file. So how do you sort of extract what is allowable? How do you return that in Slack or in other environments to the user? These are other things. We went through generations of models, right? So we've gone through various generations of models that were doing parts of the flow, and as we were upgrading frequently to a new generation, a new version, the whole thing would inexplicably start to behave very differently. Yeah, it would stop working, but also sometimes it would just be very different. And so you had just convinced a cohort of analysts how to phrase their questions, or the tool would look for the column names in a certain way, and then that wouldn't work anymore.
Demetrios [00:47:51]: Are you running prompt evals to solve that as you were going to upgrade? Because you have all these questions, I imagine, from the analysts. Did you run them through? How did you know? Or did you just plug in the new model and see what it does?
Paul van der Boor [00:48:07]: A little bit of both.
Donné Stevenson [00:48:09]: Yeah, well, it is a little bit of both. So when we did the initial transfer to try and move from the existing model to one of the newer ones, we had some test sets from the original use cases, and we just saw massive changes in behavior. The prompt had been truly optimized for the model that it had been built on, and the context that analysts had been building and changing and updating for the agent had all been optimized for this prompt. And that was one of the first times I had seen that: we don't make machine learning models anymore, but you can still overfit the prompt. Yeah, you can overfit the prompt. That is the new risk when you're working with these kinds of things. So yeah, when we moved models, we ran some of the tests we had and we saw huge changes in behavior.
Donné Stevenson [00:49:02]: And so we sort of had to roll back to an old model. And it's kind of one of those problems we're still trying to figure out: how to make these things more agnostic to the model that they're being given. I think as models get smarter, maybe that'll be a little bit easier. But at the moment, yeah, the model is the model.
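A minimal sketch of the model-swap check implied here, with placeholder model names and stand-in functions: rerun the same fixed test set with the current and the candidate model and compare pass rates before switching, since prompts and analyst-written context tend to be overfit to the model they were tuned on.

```python
# Hypothetical model-upgrade regression check over a fixed test set.
def answer_with(model: str, question: str) -> list:
    return [42]  # stand-in: run the full agent pipeline with the given model

def pass_rate(model: str, cases: list[tuple[str, list]]) -> float:
    return sum(answer_with(model, q) == expected for q, expected in cases) / len(cases)

cases = [("How many orders yesterday?", [42])]
current, candidate = "model-v1", "model-v2"   # placeholder model names
old, new = pass_rate(current, cases), pass_rate(candidate, cases)
print(f"{current}: {old:.0%}  {candidate}: {new:.0%}")
if new < old:
    print("behavior regressed; keep the old model or re-tune prompts and context")
```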
Paul van der Boor [00:49:21]: I think this is also a space, and we're recording this just as we're seeing this next wave of reasoning models come in, where it feels like there's a big opportunity, especially on these kinds of fairly mechanical tasks that are very rational. We're not saying, write me a creative marketing piece. No, no. We're actually asking it to go very logically through a series of actions, pulling certain sets of data to get to an answer. And I think these reasoning models will actually offer a ton of potential for this particular agent to do that well, because of this intermediate rationale generation and the inference compute that you can scale. And I think we'll soon be able to fine-tune on these kinds of tasks.
Paul van der Boor [00:50:12]: We'll probably see a leap in performance and probably a lifting of the ceiling of what these models can do to actually be able to answer questions using data sets, running queries and so on. That's my prediction on the agent, that's the next one.
Demetrios [00:50:29]: So where's this all going? Where do you see this heading in the future?
Donné Stevenson [00:50:35]: Yeah, I think that's one of the things I'm excited to see, especially if we're thinking about this pre-validation step, because the reasoning will help it to understand, to reason: do I have enough information? And also, and this is one of the things I'm excited about, can we use this pre-step to generate the question in a more structured way, which would again remove the assumptions, remove the gaps in knowledge, remove the ambiguities? And that would then help down the line as well, because you would only need to have that version of the question, not the original one.
Paul van der Boor [00:51:05]: Yeah.
Donné Stevenson [00:51:05]: So the reasoning one, I think that's going to be where we finally see some big jumps.
Demetrios [00:51:09]: And I wonder, because it's funny how when you put the context in, you don't know if you've given it enough context until you run it and you see that it didn't work. But if you could just give it to the reasoning model and say, is this enough context for you, that would be easier than running the whole thing and seeing if it fails or passes.
Donné Stevenson [00:51:31]: And we could also see to some degree what it knows and what it's assuming. And so we can diagnose more easily where the disconnect is between what it's being given and what it's expected to do. Because right now we're really reliant on going through it and kind of guessing, or, you know, they're educated guesses after this long, but it's still kind of a guess: why has the 70-billion-parameter model done this? It's a bit of a.
Paul van der Boor [00:51:58]: And we see the ability for us to take these... That's what DeepSeek R1 is doing. It basically distills its reasoning into a smaller Qwen or Llama model. I think that's a perfect application here for SQL generation. You can look at question-and-SQL-answer pairs that we've generated in the past, stitch rationales together, figure out: why did you do that? Obviously that's the labeled training data set you need to then use to fine-tune this model, and then it should be able to apply that reasoning about the data sets and the kinds of questions coming in, to do the downstream future task much better. So I think that's certainly something that we'll be looking into.
Demetrios [00:52:40]: As long as the data schema doesn't change and everything stays perfect, right?
Paul van der Boor [00:52:45]: Sure. But it should also get a lot more business context. And so maybe the data schema will change, but it will be able to handle it much better than the current version of these data analysts, which definitely, when the schema changes now, it...
Donné Stevenson [00:53:00]: Makes a false assumption. And they can act as assistants, right? So one of the things we want to work on but haven't quite managed to do in a scalable way yet is to let it do some kind of evaluation of the schema as it exists: point out the ambiguities, point out where there's not enough information, where there's terminology that's not clear. And if you could give that to a reasoning agent, again with the context of question-answer pairs, it could suggest, okay, you need to change this. Which, for the onboarding process, would also ease things significantly for the analysts.
Demetrios [00:53:44]: The Prosus AI team is hiring, and you can find all the links to everything you need to know in the show notes below.