MLOps Community

AI Data Engineers - Data Engineering After AI

Posted Apr 25, 2025
# Agentic Approaches
# Data Engineering
# Ardent AI

SPEAKERS

Vikram Chennai
Founder/CEO @ Ardent AI

Second-time founder. Five years building deep learning models. Currently building AI data engineers.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

A discussion of agentic approaches to data engineering, exploring the benefits and pitfalls of AI solutions and how to design product-grade AI agents, especially in data.


TRANSCRIPT

Vikram Chennai [00:00:00]: My name is Vikram. I'm building Ardent AI. I'm the founder and CEO, and I like drinking lattes.

Demetrios [00:00:10]: Welcome back to the MLOps Community Podcast. I'm your host, Demetrios, and today my man Vikram is doing some wild shit around bringing LLMs to the data engineering pipeline use case. I love this because it is merging these two worlds and hopefully saving some time for those data engineers that, you know, are constantly under the gun. Let's talk with him about how he's doing it and what challenges he has had building his product. And as always, if you enjoy this episode, share it with a friend. Or if you really want to do something helpful, go ahead and leave us a review. Give us some stars on Spotify.

Demetrios [00:00:51]: It doesn't even have to be five of them, you know, I'm okay with four, but you can also comment now on Spotify itself, which will trigger the algorithm. Give us more engagement. Let's jump into this conversation. Where you at, by the way? Are you in San Francisco right now?

Vikram Chennai [00:01:15]: In San Francisco. Been doing meetups? Not too much. I actually moved here like six months ago, and so I started working out of some co-working spaces and then sort of got networked into the community. So there's not as many meetups anymore. I did a lot when I got here. I went to a lot of hackathons. Great place to meet people. But after that, once you've sort of got, you know, your group of people, then it sort of naturally expands who you meet.

Vikram Chennai [00:01:44]: And then it's not as much go out, meet people yourself. It's more just like they just enter orbit.

Demetrios [00:01:51]: Enter orbit. I like it. The other thing that I was thinking about, who are the folks that you talk to as like, users of your product?

Vikram Chennai [00:02:05]: So for us, data engineers are the primary users, which might seem a little counterintuitive, especially since we're building AI data engineers. But what we found is data engineers are really the people that understand how to get things done in their stack, so they know which pipelining tool to use, the little things that can go right or wrong. And so when they're in control of a tool that can do data engineering for them, then you just get way more done way faster. So they've been our best users so far, and we've seen them just do incredible things.

Demetrios [00:02:47]: So I want to look under the hood at how you are creating what you're creating, but you kind of glossed over something right there, and we need to dig into it. What is an AI data engineer?

Vikram Chennai [00:03:03]: Yeah, so it's an AI agent that's connected to your stack that can perform data engineering tasks like building pipelines or doing schema migrations with a sentence. So it's very similar to something like Devin, except we're verticalized and focused on data engineering only and really integrating deeply with that stack. So an example of that would be you have Airflow set up with GitHub, and that's how you control the code for your data pipelines. And you want to build a new data pipeline. Normally you'd have to go clone that repo, you know, find wherever you're storing your DAGs, your data pipelines, and then go write all the code and then check the databases and then figure out what to write, and, you know, the structure of everything, the schemas and all that mapping. Or if you're calling from an API endpoint, you have to understand what that endpoint will drop.

Vikram Chennai [00:03:56]: So you have to do all this work and then you can push a pipeline up and then test if it works. For us, our agent has a terminal, so it has a computer attached, and it can do all of that for you. So you just tell it, hey, I want a pipeline that does this. It'll go look at the web, look at that API endpoint, figure out, oh, how does this thing output? And then build the entire pipeline for you, push it to your GitHub repo as a PR, so you don't have to do any of the work. You just say, hey, go do this thing, and it'll do it for you.
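
To make the workflow Vikram describes concrete, here is a minimal sketch of the kind of Airflow DAG such an agent might generate for an "ingest from an API into an existing table" request. The endpoint URL, table name, and connection ID are hypothetical placeholders, not Ardent's actual output.

```python
# Minimal sketch of an agent-generated ingestion DAG; endpoint, table,
# and connection id are hypothetical placeholders.
from datetime import datetime

import requests
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def ingest_orders():
    @task
    def extract() -> list[dict]:
        # Call the API endpoint the user pointed the agent at.
        resp = requests.get("https://api.example.com/orders")  # hypothetical
        resp.raise_for_status()
        return resp.json()

    @task
    def load(rows: list[dict]) -> None:
        # Write into the existing table the agent found in the schema.
        hook = PostgresHook(postgres_conn_id="analytics_db")  # hypothetical conn id
        hook.insert_rows(
            table="orders",
            rows=[(r["id"], r["total"]) for r in rows],
            target_fields=["id", "total"],
        )

    load(extract())


ingest_orders()
```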

Demetrios [00:04:27]: And this is primarily with data pipelines?

Vikram Chennai [00:04:31]: Yes.

Demetrios [00:04:33]: Are you also doing stuff within databases? Like, the fun of trying to query the databases? Or maybe you're transforming, or you're doing... like, how complex have you seen this get?

Vikram Chennai [00:04:48]: Yeah, so we have the context into databases. So when we have users set up, they set up with whatever pipelining service. So if it's Airflow, if it's Dagster or Prefect or whatever the hell they're using, but then you also have context over all of their databases, too. So you can connect Postgres or connect Snowflake or Databricks SQL or MongoDB or whatever you use. And so our agent actually has context over what that looks like. We're focusing specifically on pipelines, but when it writes that code, then it'll know the schema. So if there's a table you want to drop things into, it's not trying to guess or make up a new table. It knows you already have these three tables, and it'll make the most sense to drop the new information into Table 1.

Vikram Chennai [00:05:37]: And here's how we're going to transform it to do that.

Demetrios [00:05:41]: One thing that I've heard a lot of folks talk about is this type of coding with AI, or dare I say Vibe coding, is very useful when things are quite simple, but then it kind of falls flat once there is a lot of complexity. And when you're at the enterprise level and you're looking at a stack that goes back 20 years, it gets very, very hard to have AI be able to really help out in the coding and code generation side of the house. Have you seen something similar where if you're trying to do one, two, three step pipelines, it's really good at that. But as soon as you start bringing in this very messy data from all over the place, and it's being transformed 10 times before, or you're asking it to be transformed 10 times so that you can get the exact data you're looking for, it falls flat.

Vikram Chennai [00:06:54]: So we've seen a little bit of that for sure. But I think there are two main ways we've tried to combat it, and I think it's kind of a general principle too. One is managing context. So for example, we were talking to a few companies, some enterprise companies, and they had like 15,000 pipelines. Now obviously you can't feed that all into a context window. There's no way: one, the context window doesn't scale to that, and two, even if it did, it would get incredibly confused about what you're asking for. So there's a portion of generating really, really, really good context, which pretty much is trying to simplify the problem down for the agent. Yeah, so that's a huge portion of what you're trying to do.

Vikram Chennai [00:07:37]: And I think the other part is specified training. So for more generalized coding agents, I think one of the things they struggle with is they're not designed for specific task flows. They're trying to do everything at once. And so because of how language models work, they're probability engines. So if your probability distribution is literally everything, you're gonna struggle to do the specific stuff. But if you design products around specific verticals, specific context and specific outcomes, then you'll get a much, much higher quality result. But I think that the problem of like Vibe coding and getting errors at that scale is something that actually won't ever disappear. But I think that's where you have to build specialized infrastructure to make sure that you can solve those problems.

Vikram Chennai [00:08:26]: So one thing that we're exploring right now in our next build is creating staging environments for everything that they connect by default and trying to make them really lightweight and ephemeral. So let's say you make a change in your database instead of it going directly to whatever database you have connected, assuming that maybe you don't have a staging environment, or if you do have a staging environment, it'll just say, here's the change that we're going to make. This is what it looks like. But that also allows the agent to go reflect on that and say, hey, this is how the system would change. Is this correct? And so even if it screws it up on time one or time two or time three or whatever, even 10 times, it's not getting committed to your actual code base or to your database or whatever, and it has the ability to reflect on that and so it can then correct itself. And the accuracy over a bunch of different trials goes up a lot.
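
The ephemeral staging idea can be illustrated with a small sketch: apply the proposed change to a throwaway copy, inspect the result, and only promote it after review. Here an in-memory SQLite database stands in for whatever database is actually connected; the table and proposed change are made up for illustration.

```python
# Sketch of an ephemeral staging check, using in-memory SQLite as a
# stand-in for the connected database.
import sqlite3

proposed_change = "ALTER TABLE orders ADD COLUMN currency TEXT"  # the agent's proposal

staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE orders (id INTEGER, total REAL)")  # mirror of the prod schema

staging.execute(proposed_change)  # a bad change fails here, not in production

columns = [row[1] for row in staging.execute("PRAGMA table_info(orders)")]
print("schema after change:", columns)  # shown to the user and to the agent for reflection
```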

Demetrios [00:09:21]: Wow. Okay. Now are you also creating some kind of an evaluation feedback loop so that you understand what is successful and what is not?

Vikram Chennai [00:09:36]: So we've actually created a benchmark because there really doesn't exist a good one for data engineering.

Demetrios [00:09:42]: Well, I was going to even say it may have to be on a case-by-case basis. Or have you found it can be a bit more generalized, with just a data pipeline evaluation benchmark?

Vikram Chennai [00:09:54]: Yeah, it, I think it actually can be a lot more generalized, but I think that's specifically because of how data engineering works. It's not exactly general. It's because the tools that people use are pretty standard. Like they're using Airflow or Prefect or Dagster or Databricks or whatever. You can actually name every single one of the popular tools. And because of how important data is to companies, it's very, very unlikely that people are going to use some kind of unknown tool that no one's really battle tested. Because if you have an error with your database and for some reason they've, you know, coded it really, really badly and it deletes everything, you're screwed. Like, you're never going to take that risk.

Vikram Chennai [00:10:35]: So people always, you know, tend to go with the standards, and that creates a nice framework for us to be able to test in, where we can say, as long as it can operate, you know, these tools really well, and we'll give it, you know, hundreds and hundreds of tasks to optimize over, and it's getting better at that, then we're pretty confident that the majority of people will be served well with it.

Demetrios [00:10:58]: Okay. And you mentioned the staging environment that allows the agent to almost game out what would happen if we made the change the first agent is recommending. It's almost like seeing into the future in a way, and then reflecting on whether that is the right move or not. Can you talk a little bit more about what type of validation checks you have? And is it all through agents, or is there also some determinism involved in that too?

Vikram Chennai [00:11:32]: So it's mostly through the agent itself, but we can also pull in, like, data quality checks and other tests that you have running. So our goal is to replicate your actual environment as much as possible. So it writes a pipeline, then it runs whatever tests; if you have them, we'll bring them in. The agent itself will be able to see into error logs and all that kind of stuff to understand, hey, did I write code with bad syntax or hallucinate something? Or does the output look the way that's expected? And it takes all of that and then it can just sort of loop until it gets things right.
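
The "loop until it gets things right" behavior is essentially a reflect-and-retry loop. A minimal sketch, under assumptions: `propose_fix` stands in for a model call, and running pytest against the generated code stands in for "whatever tests you have".

```python
# Sketch of a reflect-and-retry loop: run tests, feed failures back,
# retry until green or out of attempts. propose_fix is a hypothetical LLM call.
import subprocess


def propose_fix(code: str, error_log: str) -> str:
    """Hypothetical model call: return a revised version of the pipeline code."""
    raise NotImplementedError


def write_and_test(code: str, path: str = "generated_dag.py") -> subprocess.CompletedProcess:
    with open(path, "w") as f:
        f.write(code)
    # Stand-in for the user's own test suite / data quality checks.
    return subprocess.run(["pytest", "tests/"], capture_output=True, text=True)


def loop_until_green(code: str, max_attempts: int = 5) -> str:
    for _ in range(max_attempts):
        result = write_and_test(code)
        if result.returncode == 0:
            return code  # tests pass, safe to open a PR
        code = propose_fix(code, result.stdout + result.stderr)
    raise RuntimeError("agent could not produce a passing pipeline")
```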

Demetrios [00:12:06]: That's really cool. A lot of times folks talk about how hard it is to get right when they are talking with agents: the ability for the agent to ask questions if it does not have enough context. How have you gone about fixing that problem? Because I could say, like, yeah, set me up an Airflow pipeline and use my database. And the agent might go and just set up some random Airflow pipeline with some random database and say, here's what you asked for. Or it could hallucinate something when it doesn't have the right context and the right ideas. Right. So what are you doing to make sure that right off the bat the agent doesn't go start working unless it has the right amount of information?

Vikram Chennai [00:12:56]: Yeah. So one thing we actually added in our task flow, and actually specifically train on, is whether tasks are possible. So one of our test cases, like a very simple one, was we're going to not feed it Postgres credentials and we're just going to tell it, go build something in Postgres, like add this table, and see what it does. And at the beginning it, like, tried to do it exactly like you said. Um, and we, you know, essentially trained that behavior out. So we added a check step in sort of the planning phase of it, where it'll try to determine if the task is possible or impossible. And so we just tried to train that out, where it will tell you, that's not possible, you should probably not do that.

Demetrios [00:13:39]: And that's another LLM call.

Vikram Chennai [00:13:41]: It's part of the original one. So the flow that we have set up is you make a request, the agent will gather context, search the web, all this stuff, and then it'll give you a plan of like, here's what I'm going to do, are you okay with this? And you can go back and forth and just make edits. And so if it's made a mistake, that's your opportunity to look in detail about everything it's going to do and say like that step is wrong or that step is wrong or actually I wanted this table or something like that and then let it go off on its own. But in that phase there's a check there that tries to say is it impossible or not? And if it is, then, well, it'll say please revise that.
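
A feasibility check like the one described can be pictured as a guard in the planning step: before committing to a plan, confirm the connections the task needs actually exist, and refuse rather than improvise. A minimal sketch; the connection registry and task fields are illustrative, not Ardent's API.

```python
# Sketch of a planning-phase feasibility check: refuse tasks whose required
# services have no configured connection. Field names are illustrative.
def check_feasible(task: dict, connections: dict) -> tuple[bool, str]:
    missing = [svc for svc in task["required_services"] if svc not in connections]
    if missing:
        return False, f"Task needs credentials for: {', '.join(missing)}. Please connect them first."
    return True, "ok"


ok, reason = check_feasible(
    {"request": "add a table in Postgres", "required_services": ["postgres"]},
    connections={"github": "...", "airflow": "..."},  # no Postgres configured
)
print(ok, reason)  # False: asks the user to connect Postgres instead of guessing
```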

Demetrios [00:14:23]: Yeah, how do you think about really big tasks versus trying to break up the tasks and make them smaller? And when you're designing your agents trying to ideally have smaller tasks so that you have more dependability.

Vikram Chennai [00:14:42]: Uh, I think breaking it down is definitely the way to go. Um, I just don't think you can get enough from a larger task. And in us breaking it down in the planning step, it won't just say, hey, I'm going to build you a pipeline. It'll say, here are the steps that I'm going to execute. Is this right? It allows a lot more fine-grained control over what you're going to do. And yeah, you get a lot more accuracy out of it. But it also helps on the user side, because they can see exactly what little steps are going on, and so they also have control. And especially if our users are data engineers, they very much understand what needs to be done, and so they can really make the edits, like, hey, step three is wrong, or it's a little bit off, or you inferred the wrong table. Let's correct that.

Vikram Chennai [00:15:29]: And so it doesn't put all of the onus on the agent, at least not yet. And there's always going to be a little bit of a mismatch, right? Because there's no way you can perfectly replicate everything in your brain and just dump that into the context.

Demetrios [00:15:45]: Are you using the GitHub repos of the organizations as context also to help when creating the different pipelines?

Vikram Chennai [00:15:57]: Yeah, so the primary way people tend to store pipelines is through some sort of version control software, and that's usually GitHub. And so, yeah, the agent will connect in, sort of scan and understand that repo, look for specific folders that are important, such as if you have a DAGs folder, which is Airflow's name for data pipelines, and all of the files in there, so it'll know, that's where I write my pipeline code or that's where I, you know, extract from.
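
The repo scan he describes can be as simple as walking a cloned checkout and recording where the DAG files live. A minimal sketch; the folder names are just common conventions, not how Ardent necessarily does it.

```python
# Sketch: find where pipeline (DAG) code lives in a cloned repo so the agent
# knows where new pipeline code belongs. Folder names are heuristics.
from pathlib import Path


def find_dag_files(repo_root: str) -> list[Path]:
    root = Path(repo_root)
    candidates: list[Path] = []
    for folder in ("dags", "airflow/dags", "pipelines"):  # common conventions
        if (root / folder).is_dir():
            candidates.extend((root / folder).rglob("*.py"))
    return candidates


# for path in find_dag_files("./cloned-repo"):
#     print(path)  # fed into the agent's context as "this is where pipelines live"
```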

Demetrios [00:16:27]: But you're not specifically trying to vectorize all of this stuff that's in there. You're just taking it at face value. Like, you don't need to overcomplicate things by saying, well, let's vectorize everything that we have, let's throw it into a vector DB, and then hopefully it can give us better retrieval and also it will know better semantically, like, what is being asked. None of that actually matters.

Vikram Chennai [00:17:01]: Are you talking about for the code?

Demetrios [00:17:03]: Yeah. Or any of the stuff that you're giving it as context.

Vikram Chennai [00:17:08]: So we do actually do a bit of RAG and retrieval, and the main place we do that is pretty much on the context layer. So if you ask about Postgres and MongoDB, what we're going to do is try to rip out as much unnecessary context as possible. So it's actually less of a problem of giving it the right context. What you want to do, at least for data engineering agents, or I think coding agents in general, is give it as much as possible, because we have this policy of, like, achievable outcomes. So if you don't give something like database credentials, if I don't give you my Postgres credentials, there's no way you're going to ever get into that database and understand what's there in any conceivable universe.

Demetrios [00:17:50]: Yeah.

Vikram Chennai [00:17:51]: And so what we try to do is give the agent just enough, but remove all the unnecessary stuff. So, for example, if it does a retrieval search and says, okay, we don't need Databricks, we don't need any of this other stuff, but we give it just enough so that if it did need Databricks, then it can start pulling that context or make that part of the planning steps, of, hey, I need to pull this thing, I need to pull that thing, then it can still do it. So we do add sort of a search layer on top, but it's more to get rid of unnecessary information. Like, you think 15,000 pipelines, yeah, there's no way, right? So you do need to index all of that and understand, for a user query, what to pull in and what not to.
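
Pruning context by retrieval looks roughly like this: index one short summary per pipeline, then keep only the top matches for the user's request. This sketch uses TF-IDF as a stand-in for whatever embedding index is actually used; the summaries and query are made up.

```python
# Sketch of retrieval-based context pruning. TF-IDF stands in for a real
# embedding index; summaries and query are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pipeline_summaries = [
    "pipeline_1: pulls orders from the billing API into Postgres",
    "pipeline_2: nightly Databricks job aggregating clickstream events",
    "pipeline_3: syncs MongoDB users collection into Snowflake",
]  # imagine 15,000 of these

query = "add a currency column to the orders ingestion pipeline"

vectorizer = TfidfVectorizer().fit(pipeline_summaries + [query])
scores = cosine_similarity(
    vectorizer.transform([query]), vectorizer.transform(pipeline_summaries)
)[0]

top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
relevant_context = [pipeline_summaries[i] for i in top_k]  # only this goes to the agent
print(relevant_context)
```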

Demetrios [00:18:30]: So if I'm understanding this correctly, it's more about how do I tell the agent where to look if a certain question is asked. I have that information there in the context and it knows, oh cool, Airflow. I go and look in this area and then it gives me more information about how to deal with Airflow.

Vikram Chennai [00:18:54]: Yeah, exactly. So it's a mix. It'll pull stuff directly. So for the pipelines and the code, we do index that so that it can pull in, okay, Pipeline 3, and it looks like this, and here's all the code, and go edit that code, and here's how it's stored in GitHub. Go do that. But let's say it wanted information that we've decided from that vector search is not relevant.

Vikram Chennai [00:19:17]: We still have enough information so then it can go and find its way to it.

Demetrios [00:19:21]: I see. So that in case it comes up later on down the line.

Vikram Chennai [00:19:25]: Yep.

Demetrios [00:19:26]: Then it's not just like, what is Databricks?

Vikram Chennai [00:19:29]: Yes, exactly.

Demetrios [00:19:32]: Then you're in a bit of trouble in that scenario. Oh, that makes a ton of sense. Now, one thing that we talked about, I think probably two months ago, with a team that was creating like a data analyst agent: they made sure that the data analyst agents were connected directly to different databases. And if the agent was for marketing, it was only given scope for the specific marketing databases, and all of that analysis takes place in almost this walled garden. Have you seen that work, or is it something different?

Vikram Chennai [00:20:22]: I think for certain companies that makes sense, especially for larger enterprises. I think there's a good chance a lot of that comes from security. If I had it my way and I had access to everything I wanted, I would just be like, okay, give me as much context as possible. Index literally everything you own. And, you know, I'd design some sort of custom embeddings on top of that, which is some of what we do to make sure it's really accurate on everything. Um, so I think there's like a trade-off there, where I think more context is generally better actually in this case. But again, you want to keep it pretty thin, right? Like, you don't want everything about everything, you want just enough about everything. Especially if you have sort of these cross-context flows. Now, with pipelines, it's something like sort of the call-out structure, where you have a pipeline that'll trigger a data processing service to go do all the heavy lifting.

Vikram Chennai [00:21:24]: So you're not processing millions or billions of rows of data in your Airflow instance, which will blow up. You're putting it off to Databricks or someone else. So you kind of want that context, right? So it makes a lot of sense, especially at the enterprise scale, to have it sort of guarded off like that. And it may improve the agent too. But I tend to go for, bring in lots of context and just keep that understanding, so that when you have those flows that sort of peek out, you know what's happening along the…

Demetrios [00:21:57]: Lines of processing data in the wrong place or with the wrong tool that's going to make something blow up or be really expensive after the fact. Do you have alerts set up or do you have some kind of way to estimate? As you said, we have that staging environment. We can estimate if this is going to work, which is one vector. But then I imagine another vector is how much will this cost?

Vikram Chennai [00:22:21]: Yeah, so what we've actually found, especially for existing customers, is they actually kind of know what their cost will be, because a lot of times the work is not, I don't know what I'm doing, tell me how to do it better. It's, I know what I need to do, but I wish there were 10 of me. Like, I just don't want it to take so long to…

Demetrios [00:22:43]: Yeah, yeah. It's like cumbersome. The work is very cumbersome.

Vikram Chennai [00:22:47]: Exactly. And so they offboard that work to our agent. Like, hey, this pipeline is taking 10 minutes and it needs to be a minute. Now I know how to do that probably. I really don't want to do that. So can this thing just auto scale the pipeline out? Okay, cool. It will do that and they can give it specifics on how to do that. So we haven't seen that as much, but yeah, you will be able to pull that stuff from the staging environment.

Vikram Chennai [00:23:10]: And somewhere we're looking to go more down the line is, you know, being able to auto optimize sort of at that level of like, hey, we can save you 20% on your bill if you just changed everything like this. And our agent knows how your costs are set up and all this stuff.

Demetrios [00:23:28]: The other thing that I was thinking about on this is do you primarily interact with the agents through Slack or is it via web gui? What is it?

Vikram Chennai [00:23:40]: So it's mainly through a web app, and they can, yeah, they can just push whatever changes to the agent. So it's just a chat interface, and then we've got sort of the terminal style, so you can see what it's doing as it's doing it. And then you have a bunch of options on, do you want to make it more copilot or full agent, that kind of thing. So we've seen most of the interaction go through there. But we also built out an SDK and API, so you can actually build the agent reactively into your flows. So, for example, a really good example is, like, at 3 a.m. you have a data pipeline fail and you don't want to get up for that, because that's ridiculous. And so instead of having to wake up and then figure out what's going wrong and, you know, write some more code, or if it's really simple, just restart the pipeline. Like, why are you doing that at 3 a.m.? I want to sleep.

Vikram Chennai [00:24:27]: You can put code directly into your, like, error handler or something and it will auto-trigger the agent to run, and you can put in whatever text you want. So it's the same thing you would do in the web UI, but now you can do it in your code, and that allows those flows to be, hey, there's an error at 3:00 a.m., the agent is already going while you're still trying to get up and shake off the sleep. Like, it's going and doing all the work. And it says, hey, I found the fix. Here it is. Like, here's a PR to change, you know, whatever bug has gone on, or you just need to restart the pipeline. I'm ready to do that. Would you like to do that? Just a yes or no.
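
Wiring the agent into an error handler could look like the sketch below: Airflow's on_failure_callback hands the failure context to the agent over HTTP. The endpoint URL and payload shape are hypothetical, not Ardent's actual SDK.

```python
# Sketch: trigger an agent from an Airflow failure callback.
# The agent endpoint and payload shape are hypothetical.
import requests


def page_the_agent(context: dict) -> None:
    ti = context["task_instance"]
    requests.post(
        "https://agent.example.com/api/tasks",  # hypothetical agent endpoint
        json={
            "instruction": "Diagnose this failed pipeline run and propose a fix or a restart.",
            "dag_id": ti.dag_id,
            "task_id": ti.task_id,
            "log_url": ti.log_url,
        },
        timeout=10,
    )


# Attach it to a DAG so failures trigger the agent automatically, e.g.:
# @dag(..., on_failure_callback=page_the_agent)
```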

Demetrios [00:25:06]: So it's not actually having full autonomy like you were saying before, you're trying to get to that point, but at this moment in time it still gets that human intervention or the green light from some kind of a user.

Vikram Chennai [00:25:21]: Yeah, I honestly don't know if we'll ever get to full autonomy. And that's mainly because, like, even if you placed it in the role of, okay, we're hiring a junior dev, would you really want them to push to prod? Just, like, it's fine, push to prod, just go. There are some situations where, fine, yeah, you probably would want that, right? But, you know, maybe you might have some breaking changes that are a little bigger than the initial one and you don't really want those pushed. So I think allowing people to retain control is pretty important, especially with agents where they are now. I think we're at the point where they're useful, but they're far, far from perfect. And so it's not really a good idea, in my opinion, to just say, do everything for me and I'm just going to close my eyes and let you do everything.

Demetrios [00:26:10]: Yeah. The idea of cognitive load with the agents, and if at the end of the day they help you take off certain amounts of cognitive load, then that's awesome. The better question, I think, would probably be around the idea of cognitive load and how you are looking at not adding extra burden to that end user. Because a lot of times, we've probably both felt it, we've had interactions with AI that are not great, and we come away from it thinking, damn, that would have been faster if I just did it myself.

Vikram Chennai [00:26:54]: Yeah, I think the best thing we can do in that scenario is just make the product better, train it better, get it more accurate over scenarios. I think one of the other benefits of building sort of workflow oriented products is you get a lot of feedback from your customers. So when something screws up for them, you get a trace of like, here are the steps and here's everything that blew up along the way. And so as more people use it, the better you can make it for everyone. Because now you just have, you know, it goes from thousands to hundreds of thousands to millions of flows coming in of, okay, this happened and here's the evaluation. Yes or no, yes or no. And you can use that to just make it great. So I think the best thing to do honestly is just make it better and then you know, building those sort of staging bits and the context bits that allows the agent to actually do good work.

Vikram Chennai [00:27:46]: Because I think that's the only real solution to it, right? Like you can band aid fix it and say, well, it won't affect anything real, but the real thing is you want it to do good work. That's why you're buying it, that's why you're doing anything with any of these tools. And so I think that's like the only real solution.

Demetrios [00:28:04]: It is nice that you play in a field where the suggestions or the actions that the agent takes have very clear evaluation metrics. It works, it runs, data is flowing, or it's not. And on other agentic tasks, I think what I've heard from some really awesome people is the closer you can get to, like, runs / does not run, the better it's going to be for an agent, because you can clearly decide if it was a success or if it wasn't.

Vikram Chennai [00:28:43]: I totally agree with that. And that's actually part of how we train in sort of the benchmark training set, whatever you want to call it. So it's like an eval and train set. There, we actually do have deterministic outcomes. So we have everything from super simple stuff, like just testing if the agent even understands the database, right? Like, put something in the database, add a name and an age as a record into, you know, our Postgres database. Super easy, go do that, or make this table. And then you validate.

Vikram Chennai [00:29:16]: Did those things actually exist? Are they there? And you add layers on top of that of like, how long did it take? And you pull all of that back. And so when you're optimizing, you're not just optimizing on the simplest, like, you know, maybe an LLM to evaluate the code. Did this look right? Are there errors? It's like we actually saw the data end up where it needs to go. We saw how long it took, we saw how fast the actual query ran. Like, this is too slow, this is too fast. And especially when you're training, that's really useful feedback, right?
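
A deterministic eval case like the one he describes can be sketched as: ask the agent to insert a record, then verify the row actually landed and measure how long the attempt took. Here SQLite stands in for the Postgres database used in the real benchmark, and the expected record is made up.

```python
# Sketch of a deterministic eval case: did the data end up where it should,
# and how long did it take? SQLite stands in for Postgres.
import sqlite3
import time


def run_eval(agent_action) -> dict:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE people (name TEXT, age INTEGER)")

    start = time.perf_counter()
    agent_action(db)  # the agent's attempt at "insert a name and an age as a record"
    elapsed = time.perf_counter() - start

    row = db.execute("SELECT name, age FROM people WHERE name = 'Ada'").fetchone()
    return {"success": row == ("Ada", 36), "seconds": elapsed}


# Reference action the agent is expected to reproduce:
print(run_eval(lambda db: db.execute("INSERT INTO people VALUES ('Ada', 36)")))
```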

Demetrios [00:29:48]: Yeah. We had my buddy Willem on here, probably, I don't know when, a couple months ago, and he was talking about how for them they're doing something very similar, but for SREs and root causing problems and triaging things. And they set up a knowledge graph with Slack messages that are happening. They also set it up with code and they set it up with, I think Jira so that you could have a bit more of a picture when something goes wrong for that product. It makes a lot of sense. I wonder if you have thought about trying to do something similar or if you're thinking that the way and the ability for you to get the context that needs to be in that agent's context window is enough right now with how you're doing it.

Vikram Chennai [00:30:49]: I think right now where we've focused is all the, like, this is what it needs to work. So all of the coding pieces, you know, pulling in GitHub, pulling in database context, that's what makes it work. I think that next level of, and here's all your Slack and Jira and this and that, is just, it gets better, because essentially what people are doing right now is they're taking that context that's out there already, putting it into their head and then putting it down into the prompt. And so people are kind of transferring it themselves, but it would probably be something that we're looking at. So we had a few customers ask for a doc attach, where they could just put in, like, documentation, because they'd apparently been storing all their context and all their practices in one giant document. And we heard multiple people do that, where they're like, this is our master document on how to write everything.

Vikram Chennai [00:31:45]: I think they were using it for, like, a combination of training new hires, but also just to document, like, this is how we do things at this company, especially for data engineering. And so they wanted to be able to attach that as, like, a permanent fixture of, this is how you're supposed to do everything. So we've gone down that route a little bit, but I definitely think it's the right approach of trying to pull in even more. Again, you get into, like, the context balance issue. But I think if done right, you know, the gold standard there is if it's done perfectly, then you do gain a lot more than I think you'll ever lose.

Demetrios [00:32:21]: Speaking of that doc attach, it's not the first time I've heard it, and it's in a little bit different of a scenario. But the whole idea is how can we create some kind of a glossary for the agent to understand us when we talk about terms that are native to our company, or when we want things done the way that our company does them. Because maybe it's different in every company, or maybe what we mean when we say I need to go get MQLs from the database. What is an MQL? It's not necessarily labeled as an MQL in the column, so it's not a clear cut-and-dry thing, and the agent needs to know either how to create that SQL statement to figure out what an MQL is, or they need to understand what that actually means and which column that relates to.

Vikram Chennai [00:33:21]: Mm, no, absolutely agree with that. Actually, I think one of our core principles is that we're not going to make you migrate anything. And that's not just about, like, databases or, you know, sort of that code-level stuff. It's, like, we want to work the way you work. We don't want to tell you how to do things. And I think there have been products, especially in the past, before the whole AI wave, that kind of tried to make decisions for you, because there was a real trade-off, right? And they were like, here's a better way of doing things, you'll save a ton of time.

Vikram Chennai [00:33:56]: But now we're in this place where, you know, maybe the switching might be a little bit easier, or you can sort of have cross-context tools. But still, migrations, changing the way you do things, is, like, the worst thing you can ask someone to do, in my opinion. If you say, hey, switch your database, they're going to look at you like, are you crazy? Like, your AI is great, but we're not switching our database for that. Like, please leave. So we very much found that, yeah, working the way people do is how you do it, right? And you're essentially allowing them, you know, even if you try to improve it a little bit, you're saying, while you're doing this, you could do it better in maybe this slight way.

Vikram Chennai [00:34:39]: But it's not like we're going to come in and tell you what to do, and actually you have to, you know, rip out your pipeline service and use our custom pipeline tool that we've built with AI in it. Like, no, just use your stuff. This thing will drop in and it'll solve your problems.

Demetrios [00:34:53]: And along those lines, do you have something that is like common asks or common patterns, common requests of the agents that you've codified, and you figured out, okay, this pipeline is being requested like twice a day or twice an hour, and so maybe we can just make that a one-button click instead of having to have someone prompt it every time?

Vikram Chennai [00:35:24]: We haven't seen as much of that directly. I actually think that's a little bit different from the philosophy we're going after, because usually pipelines are set up once and then they're managed. So it's rarely, recreate this over and over, and more of, we've created it and now we have to make sure it doesn't break and make sure everything else doesn't break at the same time. So a lot of that. And then I think a lot of our approach is also, the thing that LLMs are great at is being non-deterministic and solving a wide array of problems. And so even if we see a pattern in there that might be easier as, like, a one click, just do this, I don't think that's actually a good practice to add to the agent, because you're trying to direct it yourself versus train on data. So what should be happening is, if you see that happen a lot and you keep adding that as training feedback, it should get really good at it. So, you know, they might have to prompt and ask for it, though usually they're not going to ask to recreate a pipeline six times.

Vikram Chennai [00:36:29]: Um, but it'll just get good at those kinds of tasks, right? Or if you want, you can just use the API and if you really want to recreate the pipeline six times in a row, you have an API, you have an SDK, just write a for loop and it'll do it six times in a row.

Demetrios [00:36:44]: I'm thinking about the Uber prompt engineering toolkit, because we just had a talk on it last week for the AI in Production conference that we did, and they were talking about how they will surface good prompts, or quote-unquote good prompts. Like, people can create prompt templates, we could say, and so maybe it's not exactly the same thing that's being asked, but you have the meat and potatoes of your prompt already ready. You click on that and then it's there for you and you change a few things. Or, extrapolating that out, I was also thinking about another talk that we had from Linus Lee, who was working at Notion, and his whole thing was, with Notion AI, we just want you to be able to click and get what you need done through clicks, without having to have that cognitive load of trying to figure out what it is exactly that I need. Because there are like six things, when it comes to Notion at least, and I understand it's a completely different scenario for you. In Notion, maybe you want to elaborate on something, you want to summarize, you want to write better, clean up the grammar, and so they give you that type of AI feature just with a few clicks.

Vikram Chennai [00:38:10]: I think our version of that would probably be, like, user-specific prompt suggestions or, like, chat suggestions. So if you are requesting a pipeline or working with XYZ pipeline a lot, then it will be able to learn from that and give you sort of almost like search suggestions, the same way as in Google or any service. It'll say, hey, were you thinking about asking this or this or this? And then, you know, sort of go that way. That'd probably be the best version of it for us. But it would definitely help. You know, people don't have to ask for the same thing, yeah, like six times. It'll start to learn. Like, maybe you do want to talk about pipelines, because that's all you've been talking about.

Demetrios [00:38:50]: Simple humans. What are you doing? No, talk to me about pricing, because I know this can be a headache for founders in this space specifically, because, like, the traditional way of doing it, seat-based pricing, can get really not useful or not profitable for a company. If everything is on usage-based pricing, then you run the risk of the end user thinking twice before they use the product. It's like, oh, if this is going to cost me, like, a buck or two bucks, maybe I should do it myself. I mean, hopefully people value their time much more than $2. But I know that I've been in that situation, and I think, do I want to spend the $2 right now? I don't know. So how do you look at pricing? How are you currently doing it, and what have you learned from customers and talking to customers?

Vikram Chennai [00:39:54]: So usually when we sign new customers, it's usually like a flat subscription fee that we give them, and they get a credit allocation for it. So usually we evaluate what their needs are and then we essentially come up with, like, you know, we have a scaling of, here's how many credits is, you know, whatever dollar amount. And then we usually offer them a subscription on that, especially because, you know, usually they want to either build new pipelines or maintain existing ones, and so they want to remove that sort of work.

Demetrios [00:40:25]: And a credit is like a token, or a credit is…

Vikram Chennai [00:40:27]: It's a mixture of tokens and compute. Um, and so it's all the resources that the agents are using to solve the tasks, and then they get billed onto your account. Um, but with the way we price, you know, you'll have like 2,000 or 3,000 credits or something on your subscription for your company. And so, like, spend away. Those are yours to spend. Like, please drop it to zero, you know. And if you want more on top of that, then we offer sort of a token basis on top of that.

Demetrios [00:41:01]: So it's like, hey, you've used up all your credits for this month. If you want, re-up here: send Bitcoin to this wallet address, then you're good.

Vikram Chennai [00:41:12]: Yeah. So not. Yes. Not exactly the Bitcoin. Not that far, but yes. Oh, yeah.

Demetrios [00:41:19]: Wouldn't it be nice though, I wonder. That's hilarious.

Demetrios [00:41:23]: Yeah, that makes sense. And that is one of the pricing patterns that I've seen, because it will help folks. So you're kind of estimating as you're looking at how nasty their data pipelines are or how many they have. If it's that 15,000-pipeline one, you're going to give them a bigger quota, or think that they're going to use a bigger quota. So you're going to give them an estimate that's bigger than if it's just one data engineer with a few pipelines they need to run.

Vikram Chennai [00:41:57]: Exactly. And our primary goal, like, whenever we're on those calls, is not to say, you know, you have three seats and there you go. It's, okay, what's your problem? How do we solve it? Right? And it's making sure you have the resource allocation to actually get your job done. What we don't want is the hassle of someone coming in and saying, well, okay, let's say we had a different model, a hundred credits, and it didn't really do the thing. It's like, yeah, that's because the task you asked for was not suited for that exact, you know, scenario. Like, everything isn't set up right for you to be able to do that the way you wanted.

Vikram Chennai [00:42:36]: Or we've set up a run that runs every three days and does data quality checks or just checks for new changes and stuff. And, you know, is there anything wrong? Like, whatever you want the agent to do, right. We've set it up in the SDK and like, it just stopped running. It's like, well, yeah, because that's not what your needs actually were. So I think it helps a lot on the customer side where they just have to worry about getting their job done. And that's it. That's the only thing you have to think of. And then as things start to scale, you know, then you're like, okay, well, you can re up credits or maybe we can change to a, you know, a different sort of subscription or something like that.

Vikram Chennai [00:43:14]: But at least for now, plus a little bit in the future, they just don't have to worry.

Demetrios [00:43:19]: Yeah, you mentioned also compute. Why is compute involved in there? Are you abstracting away the pipelines themselves too? Are you like, adding the compute or are you doing the compute yourself?

Vikram Chennai [00:43:34]: So the compute comes from the agent actually running the thing, like running code to get your job done. So the fundamental way it works is it is a coding agent. And our thesis, which seems to be working, is that coding is a language that we've already created to interact with every service out there. So why would we try to rewrite it? Like, it's already there in a box. You want to interact with Databricks? They have an SDK and an API. You want to interact with Postgres? They have it for you. Why would you try to rewrite that? And so the way the agent works is it will actually just code the way that a human would to get the job done.

Vikram Chennai [00:44:16]: So if you have a GitHub repo set up, it'll clone the repo and make the changes in the right place, for DAG files, for Airflow, and then push you a PR for that. Or if it's just, make a change in my MongoDB database, it'll write the code to get that done and then push that. And so all of that requires compute to run that VM that the agent is given, and so that gets passed on.
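
The terminal workflow he describes (clone, change a DAG file, open a PR) is, at its core, a handful of shell steps. A minimal sketch under assumptions: the repo URL, branch name, and file path are placeholders, and the `gh` CLI is just one of several ways to open the PR from a terminal.

```python
# Sketch of the clone -> edit -> PR workflow. Repo URL, branch, and file
# path are placeholders; `gh pr create` is one way to open the PR.
import subprocess

REPO = "git@github.com:acme/data-pipelines.git"  # hypothetical
BRANCH = "agent/add-currency-column"


def sh(*cmd, cwd=None):
    subprocess.run(cmd, cwd=cwd, check=True)


sh("git", "clone", REPO, "repo")
sh("git", "checkout", "-b", BRANCH, cwd="repo")

# Stand-in for the agent's actual code change to a DAG file.
with open("repo/dags/ingest_orders.py", "a") as f:
    f.write("\n# change written by the agent\n")

sh("git", "add", "-A", cwd="repo")
sh("git", "commit", "-m", "Add currency column to orders ingestion", cwd="repo")
sh("git", "push", "-u", "origin", BRANCH, cwd="repo")
sh("gh", "pr", "create", "--fill", cwd="repo")  # opens the PR for human review
```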

Demetrios [00:44:39]: Oh, I see. So it's not only the LLM calls, it is also the VM and everything that's happening in that sandbox. Awesome. Now, before we go, man, I feel like I want to make sure that I get to ask you everything, because this stuff is super fascinating to me. And I love the fact that you're taking, like, this AI-first approach for the data engineers, because Lord knows every day is hug-your-data-engineer day. They go through so much crap and get so much thrown upon them that this is a tool that I imagine they welcome with open arms. And I guess the other piece of it is, like, I imagine you probably, for fun or out of passion for building this product, have looked at a lot of logs or looked at a lot of stuff that the agents have been able to complete. What is one run or something that an agent did that surprised you that it actually was able to pull off?

Vikram Chennai [00:45:52]: We did a bunch of testing to have it write Spark code in Databricks and to call that orchestrator pattern of, have something in Airflow trigger something out of Databricks. And, like, a lot of companies do this, where they use Airflow purely as an orchestrator, which is generally a good pattern. And we just had it, it was, like, a very simple data frame calculation that we wanted to do, but it was the fact that it was able to use multiple services at the same time, one-shot the code properly, like, process the data elsewhere and then sort of pull that all together in, like, an Airflow pipeline that didn't error out. I was like, there's no way that thing just works, right? Like, that's insane. So I think that's probably the biggest thing, where, you know, it wasn't just the complexity of, write the pipeline.

Vikram Chennai [00:46:39]: It was, like, and you have to call out to a different data processing service, which means you need to understand what are the clusters that you're running, and, like, all this stuff that needs to go in. Then you also need to write the Spark code properly to make sure that doesn't error. And it, like, did it all. And I was like, oh, okay, guess we're onto something. Like, this is... wow, this is impressive.
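
The orchestrator pattern he's describing (Airflow only triggering, Databricks doing the heavy lifting) looks roughly like the sketch below. The cluster spec, notebook path, and connection ID are placeholders; this illustrates the pattern, not the exact code the agent produced.

```python
# Sketch of the Airflow-as-orchestrator pattern: a small DAG whose only job
# is to hand the Spark work to Databricks. Cluster spec and notebook path
# are placeholders.
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def orchestrate_spark_job():
    DatabricksSubmitRunOperator(
        task_id="run_dataframe_calculation",
        databricks_conn_id="databricks_default",  # hypothetical connection id
        json={
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/Repos/analytics/daily_aggregation"},
        },
    )


orchestrate_spark_job()
```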

Demetrios [00:47:02]: It worked once. And you're like, don't touch anything. Nobody move. We need to pray to the LLM gods right now that it will work again.

Vikram Chennai [00:47:12]: Yeah, I remember I immediately just texted my friend, like, there's no way this thing just worked. Like, this is insane. And then, you know, it's pretty cool when you have those moments where you're running through training loops and it's just crashing, crashing, crashing, and then it just starts working, and then it works more and more and more, and it's just not failing anymore. And you're just looking at that thing like, holy crap. Like, wow, this is working. Like, this is unreal. Like, you think three years ago you'd have bots, pretty much, that you could just say, go do this super complex task.

Vikram Chennai [00:47:43]: You also need to code, you also need to understand, like, an entire Databricks environment. Oh, and, you know, you should get it right in one or two tries. Like, that's just unreal that it's happening.

Demetrios [00:48:04]: With me.

