MLOps Community

Open Standards Make MLOps Easier and Silos Harder

Posted May 21, 2024 | Views 422
# MLOps
# Silos
# Voltrondata.com
# ibis-project.org
speakers
Cody Peterson
Senior Technical Product Manager @ Voltron Data

Cody is a Senior Technical Product Manager at Voltron Data, a next-generation data systems builder that recently launched an accelerator-native GPU query engine for petabyte-scale ETL called Theseus. While Theseus is proprietary, Voltron Data takes an open periphery approach -- it is built on and interfaces through open standards like Apache Arrow, Substrait, and Ibis. Cody focuses on the Ibis project, a portable Python dataframe library that aims to be the standard Python interface for any data system, including Theseus and over 20 other backends.

Prior to Voltron Data, Cody was a product manager at dbt Labs focusing on the open source dbt Core and launching Python models (note: models is a confusing term here). Later, he led the Cloud Runtime team and drastically improved the efficiency of engineering execution and product outcomes.

Cody started his career as a Product Manager at Microsoft working on Azure ML. He spent about 2 years on the dedicated MLOps product team, and 2 more years on various teams across the ML lifecycle including data, training, and inferencing.

He is now passionate about using open source standards to break down the silos and challenges facing real world engineering teams, where engineering increasingly involves data and machine learning.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

MLOps is fundamentally a discipline of people working together on a system with data and machine learning models. These systems are already built on open standards we may not notice -- Linux, git, scikit-learn, etc. -- but are increasingly hitting walls with respect to the size and velocity of data.

Pandas, for instance, is the tool of choice for many Python data scientists -- but its scalability is a known issue. Many tools make the assumption of data that fits in memory, but most organizations have data that will never fit in a laptop. What approaches can we take?

One emerging approach with the Ibis project (created by the creator of pandas, Wes McKinney) is to leverage existing "big" data systems to do the heavy lifting on a lightweight Python data frame interface. Alongside other open source standards like Apache Arrow, this can allow data systems to communicate with each other and users of these systems to learn a single data frame API that works across any of them.

Open standards like Apache Arrow, Ibis, and more in the MLOps tech stack enable freedom for composable data systems, where components can be swapped out, allowing engineers to use the right tool for the job to be done. They also help avoid vendor lock-in and keep costs low.

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Cody Peterson [00:00:00]: My name is Cody Peterson. I'm a senior technical product manager at Voltron Data and I generally do not drink coffee. Yeah, actually Diet Coke is usually my go to caffeine.

Demetrios [00:00:15]: What is up, MLOps Community? Welcome back to another episode. I'm your host Demetrios, and today we're talking with Cody, who broke down the Ibis standards. He talked to me about why open standards are a good idea and what exactly Ibis is trying to do. We got down and dirty on the data layer. So if you're looking for one of those LLM talks, this isn't the one for you. If you are a little more data engineering inclined, this is right up your alley. Let's get into it. And for those who are just joining us for the first time, we would love it if you leave some feedback, drop a star on Spotify, subscribe on YouTube, or you know what else is amazing? If you share this with a friend.

Demetrios [00:00:58]: Alright dude, so you started cutting your teeth at Azure ML. Can you explain to me what you were doing on that product?

Cody Peterson [00:01:25]: Yeah, sure. So I joined Microsoft, working on Azure ML, out of college as a product manager. I worked there for just under four years and worked on a bunch of different teams: I started on the data team, at some point went over to the ML training team, was on the inferencing team, and was actually on a dedicated MLOps team for about two years. So yeah, working with customers, helping them deploy their end-to-end machine learning systems on Azure ML, and I really learned a ton through the experience.

Demetrios [00:01:56]: Do you feel like the MLOps or ML lifecycle (let's just go with the old-school ML, predictive ML as they call it) is standardized? Is there a standard way of doing that? Because someone literally just told me online the other day, they were trying to pick a fight with me, saying all this MLOps stuff, it's all standardized now and there's best practices for that. And I had to kind of ask, is there though?

Cody Peterson [00:02:26]: I'm not sure, I don't think so. I think there are definitely some common threads, you know, version control your code. There's still ongoing debates about whether you should use notebooks in production, but it's such a broad discipline, just like software engineering, that it's hard to give a dogmatic response of, you need to do this in this situation. I think the answer is often, it depends. You can have a lot of best practices for certain verticals if you're doing recommender systems or, I don't know, linear regression or whatever. There's a lot of existing content you can go pull from. But yeah, I think it's very hard to say that it's standard. It's going to depend a lot on what you're trying to do.

Demetrios [00:03:09]: That is a great point, and I think it doesn't get made enough: depending on your use case, you have more history and more companies that have been able to leverage that type of machine learning. So you can see more content, and it feels more mature. A recommender system is a perfect example of that. Are there things that you learned from your days at Azure ML about MLOps that you feel like you still carry with you today?

Cody Peterson [00:03:42]: A ton of it. It very much taught me about the role of data in kind of modern software systems and just data systems.

Demetrios [00:03:53]: It taught me the data layer that you gotta have. And what exactly do you mean, the role of data?

Cody Peterson [00:04:02]: Yeah, in these ML systems, you typically need to update them over time. You're getting new data in. How are you managing that data? How are you versioning things? If you're in certain industries, you have to be able to go through audits and say, why did we deny someone's application for something? And you need to be able to point back to the data and show that it was fair. Lots and lots of places where data is very important in these systems.

Demetrios [00:04:31]: It's all of that: the transforming, the cleaning, the access, even. All of that is so messy. And I think a lot of times now, when people are dealing with LLMs, all of that kind of gets abstracted away. Sometimes there still is the need for a data engineer, and more and more, as you want to go and do more advanced use cases, you recognize that what you thought you could get away with, not having to learn that data stuff, all those really hairy data problems, you can't get away with as much as you thought. But in the traditional ML systems, like you said, it was so much at the forefront, and it is so much at the forefront. Do you think that the LLM world loses that, or does it make it so that you don't have to think about that as much?

Cody Peterson [00:05:25]: I think it's still important, maybe as important, and it's still going to depend. What are you doing with your LLMs? What kind of data are you running them on? One of the big learnings I've had toward the tail end of Azure ML and since is really the role that database technology has to play. It's interesting seeing these vector databases. The founder of LanceDB gave a talk recently, I think at Data Council, that was like, we probably will see these things converge. Databases are adding vector indices or vector search natively. Vector databases are adding more traditional database data types and things. And yeah, there's a lot I think the MLOps community can learn from the database community, and maybe vice versa, as LLMs and GenAI are kind of going everywhere.

Demetrios [00:06:11]: Yeah. And so now you've been spending a lot of time on the Ibis project, as I was calling it earlier, and you mentioned that's not what it's called. Can you explain what exactly Ibis is and what you have been doing with it?

Cody Peterson [00:06:28]: Yep. So these days I work as a product manager at Voltron Data, focused exclusively on the Ibis project, named after the bird.

Demetrios [00:06:37]: Not the hotel chain?

Cody Peterson [00:06:39]: Not the hotel chain. There's actually a good list somewhere that the lead maintainer put together of all the other different Ibis things that are out there. The hotel is one of them. But yeah, the Ibis project was started a long time ago, actually in 2015, by Wes McKinney, who was the creator of pandas, co-creator of Apache Arrow, and also one of the co-founders of Voltron Data. And there's a famous or infamous article he wrote called, I think, Apache Arrow and the ten things I hate about pandas. It kind of listed some of the issues and longstanding gripes with pandas that he, as the creator, had. A lot of those have been solved, though, and pandas is still very much alive and well. But Ibis kind of came along to take a different approach and say: what if, instead of tightly coupling our dataframe API to the execution engine (pandas, for instance, is very tightly coupled to NumPy and kind of its own way of doing single-threaded in-memory execution), we compiled down to, say, SQL and sent that off to a database backend or some big distributed system and let that handle the execution? And Ibis just handles the API.

Cody Peterson [00:07:51]: So that was the kind of origin of the project back then.
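The decoupling Cody describes, where the dataframe-style API only records operations and a database executes the compiled SQL, can be illustrated with a toy sketch. This is not Ibis and does not use its API: the `Query` class and its methods are hypothetical, predicates are passed as plain SQL strings for brevity, and SQLite from the Python standard library stands in for the backend.

```python
import sqlite3
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical deferred query: records operations, executes nothing."""
    table: str
    predicates: Tuple[str, ...] = ()
    group_col: Optional[str] = None

    def filter(self, predicate: str) -> "Query":
        # Return a new Query with one more recorded predicate.
        return replace(self, predicates=self.predicates + (predicate,))

    def count_by(self, column: str) -> "Query":
        # Record a group-by-and-count over the given column.
        return replace(self, group_col=column)

    def to_sql(self) -> str:
        """Compile the recorded operations into a SQL string."""
        where = f" WHERE {' AND '.join(self.predicates)}" if self.predicates else ""
        if self.group_col is None:
            return f"SELECT * FROM {self.table}{where}"
        return (
            f"SELECT {self.group_col}, COUNT(*) AS n FROM {self.table}{where} "
            f"GROUP BY {self.group_col} ORDER BY {self.group_col}"
        )


# SQLite plays the role of the execution engine that owns the data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, mass_g REAL)")
con.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 3700.0), ("Adelie", 4100.0), ("Gentoo", 5200.0), ("Gentoo", 3900.0)],
)

# Building the expression touches no data; only the backend runs the SQL.
expr = Query("penguins").filter("mass_g > 4000").count_by("species")
sql = expr.to_sql()
rows = con.execute(sql).fetchall()
print(sql)
print(rows)
```

The point of the design is that `Query` never holds data: swapping the engine means changing where the SQL string is sent, not how the analysis is written.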

Demetrios [00:07:55]: Wait, so say that again, because I'm not sure I fully understood. So basically the versatility of Ibis is because it can just be boiled down to SQL and then you can send it to wherever you need it to go.

Cody Peterson [00:08:11]: Yep. And not just SQL. So today the Ibis project supports at least 20 backends. These are things like DuckDB and Polars and Snowflake and BigQuery and ClickHouse and PySpark and Dask and actually pandas itself as well. And the way it does that is by, you know, it's a dataframe API. You write your dataframe code, there's some sort of internal intermediate representation, and then that gets converted into backend-specific code. So for most of these systems, it's actually just SQL. For pandas, it's pandas code, for Polars, it's Polars code, and for Dask, it's Dask code.

Cody Peterson [00:08:51]: Pandas, Polars, and Dask are the three dataframe ones, but the rest are all SQL backends. It compiles your code to backend-native code, sends it off, and lets the backend system deal with all the computation and data management and all that good stuff.
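To make the "one intermediate representation, many backends" idea concrete, here is a minimal sketch in which a single operation is handed to two different executors. Everything in it is an assumption made for illustration: `MeanBy`, `run_on_sqlite`, and `run_in_memory` are hypothetical names, SQLite stands in for a SQL backend, and a list of dictionaries stands in for a dataframe backend. Ibis's real internals are considerably richer.

```python
import sqlite3
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanBy:
    """Hypothetical intermediate representation: mean of `value` per `key`."""
    table: str
    key: str
    value: str


def run_on_sqlite(op, con):
    """SQL backend: translate the IR to a SQL string and let SQLite execute it."""
    sql = (
        f"SELECT {op.key}, AVG({op.value}) FROM {op.table} "
        f"GROUP BY {op.key} ORDER BY {op.key}"
    )
    return con.execute(sql).fetchall()


def run_in_memory(op, tables):
    """Dataframe-style backend: interpret the same IR over in-memory rows."""
    groups = defaultdict(list)
    for row in tables[op.table]:
        groups[row[op.key]].append(row[op.value])
    return [(k, sum(v) / len(v)) for k, v in sorted(groups.items())]


rows = [
    {"city": "Oslo", "temp": 4.0},
    {"city": "Oslo", "temp": 6.0},
    {"city": "Lima", "temp": 20.0},
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE weather (city TEXT, temp REAL)")
con.executemany("INSERT INTO weather VALUES (:city, :temp)", rows)

# The user-facing description of the work is written once...
op = MeanBy(table="weather", key="city", value="temp")

# ...and each backend decides how to carry it out.
sql_result = run_on_sqlite(op, con)
memory_result = run_in_memory(op, {"weather": rows})
print(sql_result == memory_result)
```

Because the operation is plain data, adding a third backend means writing one more function that understands `MeanBy`, without touching the code that built it.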

Demetrios [00:09:07]: So what was the life of the data scientist, or analytics engineer, I guess, is another person that would probably be playing in this, before Ibis, or not using Ibis, compared with using Ibis?

Cody Peterson [00:09:25]: Yep. So the thing I saw a lot going back to Azure ML was this problem of data scientists throwing their code over the wall to data engineers or ML engineers. So they might write some often pretty bad pandas code or inefficient pandas code, and they might be pulling, say they're working with Snowflake, they might pull like a 1% sample down to their local computer, iterate on their pandas code, and then say, hey, this works great, go make it run efficiently on Snowflake.

Demetrios [00:09:54]: Then you get the "worked on my machine" memes.

Cody Peterson [00:09:57]: Yep, exactly. And with Ibis, instead of doing that, you actually go into Ibis, you connect to Snowflake, for instance, and you just work with your data there. With your API, you could still pull your data locally and work with it in DuckDB or Polars if you wanted to do that to save on costs or whatever. But regardless of which backend you use to scale it up, it's the same code. So instead of throwing that code over the wall to the engineer, you just hand it off and say, here's my Ibis code. It already runs on Snowflake natively, so you can just go put this in production or make it more efficient if you need to.

Demetrios [00:10:38]: One thing that I wanted to talk about was this idea of SQL versus data frames and how you look at that. What are some insights that you've gained from almost living in both worlds, I feel?

Cody Peterson [00:10:52]: Yeah, I definitely started in the completely Python data scientist data frame world. In Azure ML, there was pretty much never any SQL. My first interaction with SQL was actually trying to look at a database of metrics that we had and trying to do product analytics. And I got frustrated pretty quickly with SQL. I did the select star, download as CSV, and I just used pandas to look at the data instead. So that's kind of how I started. Then I joined dbt Labs for a bit under a year. I really saw this other side of people who would never think to use Python and data frames and really just prefer writing SQL code in a query editor or in dbt, of course.

Cody Peterson [00:11:37]: And it was very interesting. So there's definitely this two world problem in data of people working completely in SQL and people working completely in Python. And I think there's a lot that both sides can learn from each other. There's a lot of overlap. Of course people use both. Ibis itself, one of our taglines is it provides the performance and scale of modern SQL with the flexibility of Python. So really bringing those best practices and performance from SQL databases to Python data frames.

Demetrios [00:12:11]: Yeah, Ibis is a completely open source project, right? And it allows you to almost abstract a layer above, so you can choose whatever engine you want off the back of it. You have that flexibility to say what your system is going to look like, probably inheriting a lot of what your company already is using and what you're used to using, as you mentioned. How have you seen Ibis evolve over these couple of years that you've been working on it?

Cody Peterson [00:12:48]: Let's see, I've been at Voltron Data for one year, and I became aware of Ibis about a year before that and kind of watched it evolve at Voltron Data. Despite being around for about nine years, it was kind of contributed to on and off, just as an open source project. And really in the last two and a half to three years, under Voltron Data's stewardship, it has really taken off and become very, very useful. So it added DuckDB as a default backend, which just gave you an out-of-the-box, really good local option. It's added all the things you would expect in a dataframe library, like reading CSV files, reading Parquet. We actually support Delta Lake now, and it's just become really rock solid as well. The engineering team did a very large refactor for the recent 9.0 release that really stabilized a lot of the internals. Yeah, it's cool to see.

Cody Peterson [00:13:40]: And as you mentioned, it's a completely open source project. It has its own governance. It's not owned by Voltron Data or any one company. And that's really what we aim to be: an open standard where, ideally, instead of creating your own dataframe library that only works on your engine, if you have a SQL engine, you can just come in pretty easily, adopt Ibis and make it work, and give your users a delightful dataframe API on top of your database.

Demetrios [00:14:09]: So there's one thing that you said before we hit record, which was open standards are a good idea. What about working on this project has made you see things that way?

Cody Peterson [00:14:23]: Yeah, they're a good idea really on both sides of the equation. So they're good for customers or users of these open standards. And the reason there is, number one, the efficiencies that you get. So in the past, well, there weren't as many options, but if you wanted to go from pandas to PySpark, or between various other platforms, there was a pretty heavy cost in moving that data and also learning two different dataframe abstractions. Today though, with Apache Arrow, for instance, you can seamlessly convert between pandas and Polars and DuckDB and just choose the interface that you prefer.

Demetrios [00:15:03]: And Arrow, can you break that down for me? Because I've heard it thrown around a lot and I want to know exactly what it does.

Cody Peterson [00:15:10]: Yeah. So Apache Arrow, as I mentioned, was co-created by Wes McKinney and a bunch of other people. A lot of the maintainers work at Voltron Data as well. But again, it's a fully open source project under the Apache foundation, has contributors from a bunch of different organizations, and it's basically a spec for an in-memory data format that's used by a lot of data systems now. So DuckDB uses it as their data format, Polars uses it as their data format. Pandas is increasingly taking a dependency on it, has some amount of dependency on it. And basically what it allows you to do is, if you're building a new data system, instead of inventing your own in-memory data format, you just use Apache Arrow. And all of a sudden you can communicate with all these other data systems pretty seamlessly.

Cody Peterson [00:16:01]: So if you're going from DuckDB to Polars, you don't need to convert how your floats and ints and all the memory layouts are, they're just the same. You just go through Apache Arrow, it's like a zero-copy conversion.
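The benefit of a shared in-memory layout can be sketched with nothing but the Python standard library. The example below is an analogy and does not use Arrow: it relies on Python's buffer protocol, where an `array` of 64-bit floats plays the role of a producer's column and a `memoryview` plays the role of a consumer that understands the same layout. The variable names are illustrative only.

```python
from array import array

# A "producer" stores a column of 64-bit floats in one contiguous buffer.
column = array("d", [1.5, 2.5, 3.5])

# A "consumer" that understands the same layout views the buffer directly.
view = memoryview(column)

# No bytes were copied: a write through the producer is visible to the consumer.
column[0] = 10.0
shared_value = view[0]

# Converting to a different layout (a list of Python objects) is a real copy,
# so later writes to the original buffer do not show up in it.
copied = view.tolist()
column[1] = 20.0

print(shared_value, view[1], copied[1])
```

When two systems agree on the layout up front, handing data over looks like the `memoryview` case; when they do not, every handoff looks like the `tolist()` case, and that cost grows with the data.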

Demetrios [00:16:16]: Very cool. Okay, sorry, I totally cut you off on the answer that you were giving. I sadly can't remember the question that I asked when I cut you off because I got distracted on the Apache Arrow side. It was something along the lines of open standards being a good idea.

Cody Peterson [00:16:34]: Right, so they're good for users because they tend to provide more efficiencies. Rather than being locked into, you know, one specific vendor with their own memory format and their own data frame API, if we have an open standard that has buy-in from a bunch of different vendors and different products out there, it allows you to easily switch between them and choose the right tool for the job. And this kind of dovetails into the idea of a modern composable data system. Wes McKinney gave a talk about that recently at, I believe, Data Council, which was very good. So shout out to that. Yeah, the idea is, kind of going back to our MLOps thing, you don't necessarily want to build a very rigid system for data or for ML. You want to be able to swap out components when the need arises. So if you want to switch from a batch processing or a batch updating ML model to a model that's updating in real time with a streaming data system, you don't want to necessarily re-architect everything.

Cody Peterson [00:17:38]: And open standards bring us closer there where you can just use the same APIs, use the same interfaces and data formats, and seamlessly switch between whichever product works best for your scenario.

Demetrios [00:17:53]: So I've been thinking about standards a bunch, especially because we're doing the AI Quality Conference, and one thing that we plan to do at the conference is create standards for AI quality. And specifically, you have your industries and then you have your use-case-specific standards. Have you seen things that have worked as far as making sure that the standard you are creating doesn't just become one more standard? So that now a project, if it wants to plug into this little area over here on the map or these different tools, cool, it has to abide by this standard, like the Ibis standard. And then if it wants to go and play over here on this side of the map, it has to abide by a different standard.

Cody Peterson [00:18:44]: Yeah, it's a very difficult problem. People, when they hear about Ibis, often bring up the XKCD of: you have 14 standards, and somebody says, let me introduce another one. We actually use that in most of our slides because it comes up all the time, and I don't have a great answer. I would say that part of it is just making sure you get buy-in from a diverse set of organizations. I don't think Apache Arrow would have worked if it were (Voltron Data didn't actually exist at the time, but just as an example) only Voltron Data pushing on Apache Arrow and trying to get it adopted everywhere. That wouldn't have worked. But it has buy-in from a bunch of different companies and a bunch of different organizations, and Wes and a bunch of other people who were kind of facing the same problems at the time got together and decided on this standard.

Cody Peterson [00:19:33]: And a lot of it's just getting that buy-in across the ecosystem and across organizations, and popularity to some extent. Like even looking at OpenAI, I think OpenAI has almost created a standard in their REST API for LLMs today. And then there are slight differences if you try to use Anthropic or Mistral or whatever. And yeah, I don't know, is someone gonna come along and create a standard API for these things? I know there are some open source projects trying to do that, and it's just hard. You really have to get buy-in and just gain popularity over time, I think, and eventually it becomes a standard.

Demetrios [00:20:09]: Yeah, that is brutal because it is constant work getting that buy in, recruiting people to come on, and you have to have that network effect, otherwise it's just not going to work, is it? Yeah, yeah. And especially because like if we're talking about the Ibis project, if it only had out of the box two or three different things that you could use with it, it wouldn't be as attractive as the 20 plus that you have now.

Cody Peterson [00:20:44]: Yep, definitely. And it then becomes partially a maintainability and scalability issue. Yeah, we do have contributions from outside organizations. Alibaba has been contributing stuff. Polars actually contributed their backend. Back in the day, SingleStore contributed a backend. So it's growing. But to maintain that kind of momentum, we need a bunch of contributors from a bunch of different places.

Cody Peterson [00:21:11]: And going from 20 backends to, say, 100 backends is quite the challenge, unless it really does become that well-adopted open standard. That's where we're trying to get to next: get buy-in from different organizations, get them to maintain their own backends, and just use Ibis as the front end for their data systems.

Demetrios [00:21:34]: Yeah, that could get so messy, especially if there's one or two people that are trying to maintain all of that.

Cody Peterson [00:21:41]: Yes. So we have a great group of, what are we at? Like six or seven full-time engineers working on it at Voltron Data, and a bunch of people in the community as well. So it's fairly well funded. But yeah, to get to that next level we need... contributors welcome, if anyone wants to get involved in open source.

Demetrios [00:21:57]: PRs open. Totally. So you spoke about the front end, and I think Voltron Data is one of the backends, right? And we were having a nice conversation about this beforehand, that Voltron Data has an interesting open source play. It's not like open core. What did you call it?

Cody Peterson [00:22:18]: We've been calling it open periphery. I don't know if that's a term that we're going to try to really push, but yeah. Voltron Data is partially able to be in such a unique position because its founders have such deep open source backgrounds. So I think I mentioned Wes McKinney was one of the co-founders, and he was the CTO for a while. Josh Patterson, the CEO, had worked on RAPIDS AI and cuDF and all that stuff while at NVIDIA, so really pushing their open source software. And most of the engineers are pretty deep in open source; the Ibis engineers had worked on Dask and Apache Arrow, of course, and pandas and a bunch of different projects. And so we have a very heavy open source culture. But part of what allows us to do this is kind of the business model.

Cody Peterson [00:23:05]: So we are selling, we are a vendor as well. We are selling a distributed GPU database engine for petabyte-scale, very, very big data problems. And part of the, I guess, problem and opportunity with that is it's not for most people. 99.9% of people are never going to need an engine that runs on many different GPUs and does your analytics for you. The founders, and Theseus as a product, are faced with the problem of, okay, how do we get an interface to this thing? But we don't want to just invent our own data frame API that 0.01% of people are going to use. So instead we use Ibis. And if you're working with small local data, you can use DuckDB and Polars. If you're working with medium-sized data, you can use ClickHouse and Snowflake and PySpark.

Cody Peterson [00:24:02]: And if you do need Theseus, if you are in that kind of top tier of data needs, then you can use Theseus as well. So yeah, Ibis is the interface for Theseus, but it's also attempting to be an open standard. And it's also why it's important that it is a self-governed open source project. It's not owned by Voltron Data. It just kind of happens to be the interface for Theseus, and Voltron Data happens to be where a lot of the Ibis engineers are employed.

Demetrios [00:24:28]: All right, let's take a minute to thank our sponsors of this episode, Weights & Biases. Elevate your machine learning skills with Weights & Biases' free courses, including the latest additions, Enterprise Model Management and LLM Engineering: Structured Outputs. Choose your own adventure with a wide offering of free courses ranging from traditional MLOps, LLMs, and CI/CD to data management and cutting-edge tools.

Demetrios [00:24:59]: That's not all.

Demetrios [00:25:00]: You get to learn from industry experts like Jason Liu, Jonathan Frankle, Shreya Shankar and more. All those people I will 100% vouch for, they are incredible friends of the pod. Enroll now to build better models, better and faster and better and faster. And just get your education game on. Check the link in the description to start your journey. Now let's get back into the show.

Demetrios [00:25:28]: Who are a lot of the people that you are seeing using Voltron Data? Because as you mentioned, that scale is gigantic. And so I'm guessing it is for very big companies.

Cody Peterson [00:25:45]: It is. We do also sell basically enterprise support for open source, and that was what Voltron Data had before Theseus launched. So if you need help with Apache Arrow or Ibis or Substrait, that's something that Voltron Data offers. I don't know if we have... Well, one of the companies that we work with is DuckDB Labs, actually. So we collaborate with them pretty closely on their Apache Arrow usage and things. For Ibis as well, as I mentioned, DuckDB is the default backend.

Cody Peterson [00:26:16]: But then, yeah, for Theseus we see very large customers with terabytes or petabytes of data who also have a lot of GPUs that they're able to configure. You can't just have a bunch of GPUs. You also have to have the networking set up, you have to have really fast storage. It's almost an HPC-style problem of how you configure all this stuff. We also largely partner with companies. I think the big public one right now is HPE. HPE has their own cloud product, and you're able to embed Theseus in your own product and sell it, sharing with Voltron Data effectively.

Demetrios [00:27:03]: Excellent. So when you look at how to choose a data system, what are some key considerations that you think about?

Cody Peterson [00:27:11]: Yep. One of the first ones is whether you're doing batch or streaming. I think it's been a bit of a joke that this year will be the year of streaming data, but it really is kind of on the rise, slowly but surely. But yeah, what are your latency requirements? Does this data need to be happening in real time, or is once a day, once an hour, every ten minutes fine? Then I look at data size. Are you dealing with something that you can work on on your laptop or a modest-size VM? And if you're in that kind of less than 1 TB range, given how big cloud VMs can get today, I would recommend something like DuckDB or Polars, just a good single-node OLAP engine.

Cody Peterson [00:27:58]: You definitely want to avoid doing anything with distributed systems unless you really have to. Once you get past that kind of 1 TB threshold, you might want to start looking at PySpark and Trino and ClickHouse and those kinds of systems. So I'd go: batch or streaming, then what is your data size, and then questions around governance and, you know, what does your organization use? Often you might not be choosing that. You might just be locked into, well, we use Snowflake or we use Databricks, and that's what it is.
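Cody's order of questions (latency first, then data size) can be written down as a small rule of thumb. The function below is only a sketch of the heuristic as stated in the conversation: the name `suggest_engine_class` and its parameters are hypothetical, the 1 TB cutoff is the approximate threshold he mentions, and organizational constraints such as governance or existing vendor commitments are deliberately left out.

```python
def suggest_engine_class(needs_real_time: bool, data_size_tb: float) -> str:
    """Rule of thumb from the discussion above; the threshold is approximate."""
    # First question: does the data need to be processed as it arrives?
    if needs_real_time:
        return "streaming system"
    # Second question: does the data fit on a laptop or one large VM?
    if data_size_tb < 1.0:
        return "single-node OLAP engine (e.g. DuckDB or Polars)"
    # Only past that point is a distributed system worth its complexity.
    return "distributed engine (e.g. PySpark, Trino, or ClickHouse)"


print(suggest_engine_class(needs_real_time=False, data_size_tb=0.05))
print(suggest_engine_class(needs_real_time=False, data_size_tb=40.0))
print(suggest_engine_class(needs_real_time=True, data_size_tb=0.05))
```

In practice the last consideration he raises often dominates: if the organization already runs on one platform, that choice may be made before any of these questions are asked.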

Demetrios [00:28:33]: Have you seen use cases where people do have a certain set of lock-in, but they're able to almost slide into something else quite easily because of Ibis, no?

Cody Peterson [00:28:50]: Yeah, Ibis lets you kind of try out different things. I find the use case of, yeah, taking some of your data and just working with it locally in DuckDB very appealing. So yeah, if you're kind of locked into Snowflake, but you want to try out something more elaborate and just do it with a local sample of data, you can do that with DuckDB and then go back and deploy it. Yeah, it's definitely something that we see, and it's also something back to our point earlier on SQL versus data frames. We heard this about Druid recently. So one user has Druid and they don't want to think about Druid SQL and all the nuances there. So they just use Ibis on top of Druid. They're using a dataframe.

Cody Peterson [00:29:35]: And so to some extent it comes down to preference. If you have a SQL-only system and you don't want to use SQL, you want to use a data frame, Ibis is a really good choice for that.

Demetrios [00:29:44]: Oh, so this is fascinating. So basically it's like I have this SQL-only setup, but actually I feel more comfortable in Python. Can I just slap Ibis on top of my SQL-only setup and get the same results?

Cody Peterson [00:30:01]: Yep. And there are a lot of cases of this repeating story where some new database vendor comes around, and maybe they have a niche or, you know, they're just really fast or they have good governance models or whatever, but they're a SQL-only interface. That's fine for a long time, but as you get more and more customers and more and more users, you're going to get asked, hey, I want a data frame interface on top of this. And then they have to ask the question: do we develop this in house? Is this something we have the expertise for? How much time and effort does this take? In a lot of cases, Ibis might just already support it or be very easy to add support for. So yeah, if you have a SQL-only interface, there's a very good chance Ibis already works with it. And if it doesn't, there's a very good chance it's pretty easy to add. And then, yeah, you just have a pretty fully featured data frame interface on top of what was a SQL-only engine.

Demetrios [00:30:55]: So where do you want to take Ibis next? Where are you seeing things going?

Cody Peterson [00:31:00]: We are doing a bit of boring stuff, so just working on stability. Ibis has an interesting story: it started as a monorepo and then it kind of split out into having repos for backends. We're actually back in a monorepo right now. So those 20-plus backends are all in the Ibis monorepo, and we probably want to split that out again, getting to the point where, I don't know, Snowflake or ClickHouse could just own their repository and their backend in their own GitHub organization or wherever they want to store that. Theseus actually is in a private Voltron Data repo. So that's a kind of external backend right now that we might try to use as the model for doing that. But to get there, we have to do a bunch of setting up testing.

Cody Peterson [00:31:46]: How will these external backends run the gigantic test suite that we have for Ibis? How will they have a stable interface that they can use? How will we version them together? There's a lot of pretty technical questions there. So that's kind of the boring stuff. More exciting: we're actually working on an Ibis ML package. So that kind of takes the idea of Ibis and brings it to machine learning data preprocessing pipelines. Oh, so, yeah, so if you, because there's, oh, interesting.

Demetrios [00:32:17]: Since it's so dispersed and you have so many different ways to preprocess data, it's like Ibis is going to create the standard around that too.

Cody Peterson [00:32:28]: Yeah. So the idea behind Ibis ML is kind of the same idea behind Ibis: a lot of scikit-learn and pandas basically make the assumption that your data fits in memory and is running in some single-threaded thing. But what if I want to run my data preprocessing pipeline right on Snowflake or right on PySpark? Of course there's Spark ML and things, but can Ibis just provide a standard preprocessing interface on top of these 20 backends, so then you can run it on DuckDB locally to do your experimentation and run the exact same pipeline on whatever: Snowflake, Theseus, ClickHouse, Trino. So yeah, we're launching that soon, which will be exciting. It's kind of already public and out there, but not something we've hyped up too much. You can think of it as kind of a scikit-learn pipeline, but again, one that will run on any of these backends.

Demetrios [00:33:25]: But you're not playing in the pipelining.

Cody Peterson [00:33:29]: Abstraction layer, not the orchestration, dbt, Dagster, Prefect level.

Demetrios [00:33:35]: Yeah, Airflow, just the.

Cody Peterson [00:33:40]: You can already use Ibis for your preprocessing code. You could write a lot of this yourself, but part of it is reuse. If you run a standard scaler or something, you need to save that standard deviation and whatever the other number is to be able to reuse it while you're doing retraining. There's some ML-specific stuff. One-hot encoding, of course, is something you can do in Ibis natively, but it's a little tedious to write. Ibis ML just wraps it for you and has a nice little one-hot encoder, stuff like that.
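
The point about saving fitted statistics can be sketched in a few lines of plain Python. For a standard scaler, the two numbers to keep are the mean and the standard deviation learned from the training data, so that new data is scaled the same way later. The class names below mirror common scikit-learn naming but are written from scratch for this illustration; they are not the Ibis ML API, and the in-memory lists are an assumption made to keep the sketch self-contained.

```python
import statistics


class StandardScaler:
    """Scale values to zero mean and unit variance, keeping the fitted stats."""

    def fit(self, values):
        # The two numbers that must be saved for reuse: mean and standard deviation.
        self.mean_ = statistics.fmean(values)
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.std_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        # Always uses the statistics learned in fit(), never those of `values`.
        return [(v - self.mean_) / self.std_ for v in values]


class OneHotEncoder:
    """Turn each category into a 0/1 indicator vector."""

    def fit(self, values):
        # The learned state is the ordered list of categories seen in training.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # Categories not seen during fit() become all-zero rows.
        return [[int(v == c) for c in self.categories_] for v in values]


# Fit once on training data...
scaler = StandardScaler().fit([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
encoder = OneHotEncoder().fit(["red", "blue", "red"])

# ...then reuse the saved statistics on new data.
scaled = scaler.transform([5.0, 9.0])
encoded = encoder.transform(["red", "green"])
print(scaler.mean_, scaler.std_, scaled)
print(encoder.categories_, encoded)
```

The tedious part is less the arithmetic than keeping the fitted state (`mean_`, `std_`, `categories_`) attached to the pipeline, which is the bookkeeping a preprocessing library takes over.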

Demetrios [00:34:14]: Are there things that the community has asked for that you really want to get into production or you really want Ibis to be doing? It's not quite there yet, but you have it in mind and it's going to be something that you're working on, besides Ibis ML, which I think is probably along those lines. Is there other stuff that you're thinking about?

Cody Peterson [00:34:38]: There's a few things. We get a lot of requests for backends, and, to our conversation earlier, it's hard for us to just say yes and put in the effort to add every new backend that gets requested. That's where scaling as an open source project would be very helpful, and contributing a backend is a very good way for people to get started. So there's kind of a backlog of backend requests there. Another one that comes up all the time, and that's very technically challenging, is cross-backend joins, or working with data across different backends. It might not be something that we ever do. It is something that we've had a prototype for, which our lead engineer doesn't like to talk about because he doesn't want to support it.

Cody Peterson [00:35:24]: But, you know, if you have data in Snowflake and data in DuckDB, I just want to join these two tables and have Ibis go figure out how to efficiently do this. Yeah. It's not really what Ibis is for, but it is a cool.

Demetrios [00:35:36]: Sounds magical.

Cody Peterson [00:35:37]: It makes a cool demo. Yeah. And it is magical. That keeps coming up with people.

Demetrios [00:35:43]: Yeah, maybe one day when you say it like that, it sounds ridiculous that that's not possible.

Cody Peterson [00:35:49]: Okay.

Demetrios [00:35:50]: Again, though, as you just mentioned, that's not what Ibis is for. Potentially there's scope for another tool to come in there and say we can do that for you. And that's a whole other open source project because I can imagine there are some sticky situations that you can get into when you're dealing with that. And so I don't know, you'd have to make the case as to why that should be done by Ibis and not some other tool.

Cody Peterson [00:36:19]: Yeah. And that is something that Trino kind of does with its federated queries. You can add different data sources and then Trino kind of figures out how to go and do that. Yeah, it does make a cool demo. It does come up a lot. I don't know. We'll see. It might not be something we ever really do.
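
A rough sketch of why a cross-backend join is harder than it sounds: when two tables live in separate engines, neither engine can see the other's data, so something has to move rows. The example below is a naive illustration under stated assumptions, not how Ibis or Trino do it. Two independent in-memory SQLite databases stand in for two backends, and the hypothetical helper `cross_backend_join` pulls both result sets to the client and hash-joins them in Python.

```python
import sqlite3

# Two independent in-memory databases stand in for two separate backends.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (2, 35.0), (1, 5.0), (3, 12.0)],
)

local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
local.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])


def cross_backend_join(left_con, left_sql, right_con, right_sql):
    """Naive inner join on the first column of each query (hypothetical helper).

    Neither engine can see the other's table, so rows from both sides are
    pulled to the client and hash-joined in Python.
    """
    # Build a hash table from the right side, keyed on the join column.
    lookup = {}
    for key, *rest in right_con.execute(right_sql):
        lookup.setdefault(key, []).append(tuple(rest))

    # Stream the left side and probe the hash table.
    joined = []
    for key, *rest in left_con.execute(left_sql):
        for match in lookup.get(key, []):
            joined.append((key, *rest, *match))
    return joined


rows = cross_backend_join(
    warehouse, "SELECT customer_id, amount FROM orders ORDER BY rowid",
    local, "SELECT customer_id, name FROM customers",
)
print(rows)
```

This works for a demo, but every row crosses the network and the join runs on one client machine. Doing it efficiently means deciding which side to move, how much to filter first, and where the join should execute, which is the part that is hard to support.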

Demetrios [00:36:35]: Yeah, not at the moment. Well, I know you have a hot take on LLMs. I want to finish with that. And what your thoughts are, what your feelings are on the whole AI movement.

Cody Peterson [00:36:47]: Yeah, so I joined Azure ML in 2018, which I feel like was at the tail end of the last AI hype cycle with computer vision and things. And my hot take is, and maybe I shouldn't say this, that I feel like we're at the peak of AI hype right now in this cycle. OpenAI just did their announcement. It really feels like the forefront of LLMs is saturating at the performance level. And my hope is that we see kind of the hype side of things die down and see the actual usefulness pick up. Because in the years that I worked at Azure ML, I saw a lot of real ML applications. And the ML side of things wasn't necessarily that complicated. It was, you know, XGBoost or maybe a neural network, but pretty standard stuff.

Cody Peterson [00:37:40]: The hard part was figuring out how to apply that in a real business scenario. So I'm hoping we see fewer shiny, kind of look-at-this-AI-tool-changing-the-world products and more things that actually improve our lives. And yeah, I think LLMs are very important and going to be used in real scenarios. But I'm personally a little sick of the hype that we've seen over the past year or more.

Demetrios [00:38:11]: Well, I think that 100% shows that you are very much a product person because you're instantly thinking, okay, how can we incorporate this into the business to make the business more valuable or make it better for the end user? And a lot of the stuff is hype without that basis. It's just like, wow, look at the cool stuff that we can do with this now. And you have to go back and say, all right, but how can a company incorporate that into their product to make something that is more useful for that end user?

Cody Peterson [00:38:49]: Yep, it's very cool. One of the first things I did with neural networks was detect cats versus dogs, and that's very cool. I can put in an image and say whether it's a cat or a dog, but what do I do with that? Do I put it in security cameras and it's now telling me whether there's a cat or a dog at my door? I don't know exactly. Yeah, I think LLMs are heading in that way. We've seen all the cats versus dogs demos. Now we have to see what these actually get used for in production. Yeah, I'm excited.

Demetrios [00:39:20]: I think that a lot of times what you get is that the hello world of the LLM use cases is chat with your data. Now you're never going to have data siloed, or if you need to ask questions, you don't have to bother your manager, or your manager doesn't have to bother you. And that sounds great. It's just a little bit harder in practice. And I don't know if you've found that, but just chatting with your docs, AI, RAG, whatever it is, internal docs bot, I haven't found it that useful yet. Maybe it's because the RAG is not set up properly, maybe it's because it hits a limit, or maybe it's just that asking a question is not the best way for me to understand this information: I have to be very pointed with the question that I ask and know what question I need to ask in order to get the information, and that probably is on me. But thinking about the announcement from OpenAI and how now we've got voice and vision all in an LLM, I saw something saying, all right, well, this is great, but how can you incorporate it into your company? What are some tangible ways that you can make this part of your product? Starting tomorrow, if you can hit an API and now you've got all these capabilities, what are you going to do? Are you going to create a voice bot, so when someone calls and complains to your customer service, you throw an OpenAI agent on them? And I'm really curious to see how that all works out and what happens there.

Cody Peterson [00:41:15]: Yeah, I think it's very difficult. I did play around a lot with chatting with your database and things like that, and doing RAG. We actually have a project called Ibis Birdbrain for doing just this. It's not something that we're contributing to, but it is out there if anyone wants to take a look. And it's very hard. I don't think that adding a voice layer necessarily helps with some of the fundamental problems that you touched on, which is that the LLM just gets things wrong or doesn't quite understand at a human level what you're trying to ask. And the biggest issue I've seen is really in chaining.

Cody Peterson [00:41:51]: So if you're trying to make multiple LLM calls and do something that's more agentic or whatever, the second you get one thing wrong, you kind of start going down the wrong path, and it's very hard to correct that. And that's where I'm a little skeptical that we're going to get to a level of model performance that solves that problem. I don't know if that problem can be solved at the system level, with better RAG or better guardrails and prompting. It will be very interesting to see. But I hope that, as I said, the hype kind of dies down and we start to see, okay, how can we really make this work in practice? It'll probably take a while.
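
One way to see why chaining is fragile is a back-of-the-envelope calculation. If each call in a chain is right with some probability and one wrong step derails the rest, then, assuming the steps are independent, the chance that the whole chain succeeds is that probability raised to the number of steps. The 95% per-step figure below is an illustrative assumption, not a number from the conversation, and real steps are rarely fully independent.

```python
def chain_success_probability(per_step_accuracy, steps):
    """Chance that every step in a chain is right, assuming independent steps."""
    return per_step_accuracy ** steps


# Illustrative only: 95% per-step accuracy is an assumed figure.
for steps in (1, 5, 10, 20):
    print(steps, round(chain_success_probability(0.95, steps), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Under these assumptions a step that looks reliable in isolation gives a ten-step chain that fails about four times in ten, which is why per-step guardrails and checks matter as much as raw model quality.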

Demetrios [00:42:32]: So hard to think about the hype dying down. Every other week, it's just like, oh, this new model was released. Oh, now you've got this, now that. And it just gives the Internet and social media a whole lot of fodder to hype it up again. And it's a little bit tiring. I know that as soon as OpenAI announces something or Google announces something, that's basically gonna be my newsfeed. So I try and stay off of social media for a few days and let it settle and die down a bit. And ideally I don't have to deal with it as much.

Cody Peterson [00:43:11]: Yeah, we'll see. But it's definitely my newsfeed as well. And it does get old, I guess.

Demetrios [00:43:17]: Yeah. And there is another piece that you mentioned there that I guess could be used a lot in the world that you live in. Have you played around much with the text-to-SQL models?

Cody Peterson [00:43:34]: Yes. So that is largely what Ibis Birdbrain did. At Voltron Data, we get asked by customers, what are you doing with this AI stuff? Bunch of questions. And I was partially tasked with going and figuring out our answers, and we saw basically three use cases for LLMs with data. One is generating synthetic data, which is actually very promising and cool. You kind of just get into cost issues if you want to generate a million or a billion data points; that's costly depending on how you're doing it. The second one is LLMs writing analytical code. And so that could be Ibis code or pandas code, which it does okay with.

Cody Peterson [00:44:15]: But it turns out LLMs are actually pretty good at writing SQL. So if you're just asking basic questions and you're feeding in the schema and maybe a description of your data, GPT-4-level LLMs do a pretty good job of text-to-SQL. The problem is as soon as you start getting more specific or asking vague questions. An example we used to use a lot was, what's your revenue for selling shirts at some apparel company? And it's like, what do you consider a shirt? Do you consider tank tops a shirt? Do you consider dress shirts within the shirt category? A bunch of different stuff. And an LLM is not going to figure that out for you. The third use case, which is actually pretty interesting too, is using LLMs within subroutines. So that's like if you have a bunch of reviews for your product and you want to have an LLM turn that into a rating from zero to ten, basically sentiment analysis. But LLMs are so general. You can also do translation from English to Spanish. You can do structured output.

Cody Peterson [00:45:21]: So like, turn this blob of text into some structured JSON. I think we're going to see a lot of use cases there. I think text-to-SQL will be more of a gimmick, or just something that makes sense to have on basically any product, where you can type something in English and it'll give you the SQL. But I think going deeper than some very surface-level questions, it breaks down. There's definitely going to be use cases with LLMs and data, though, and it'll be interesting to see.
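
The "LLM within a subroutine" use case can be sketched as an ordinary function that asks for structured output and validates it before anything downstream relies on it. Everything here is an assumption made for illustration: `rate_review` is a hypothetical helper, `llm` is any callable that takes a prompt string and returns a string, and `fake_llm` is a canned stand-in so the sketch runs without a model. No real model client or API is implied.

```python
import json


def rate_review(review, llm):
    """Ask an LLM for a 0-10 rating as JSON and validate the reply.

    `llm` is any callable taking a prompt string and returning a string;
    it is a hypothetical stand-in for a real model client.
    """
    prompt = (
        "Rate the sentiment of this product review from 0 to 10. "
        'Reply only with JSON like {"rating": 7}.\n\nReview: ' + review
    )
    reply = llm(prompt)
    try:
        rating = json.loads(reply)["rating"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # Malformed output: let the caller retry or skip the row.
    # Accept only whole numbers in range (bool is excluded explicitly).
    if isinstance(rating, int) and not isinstance(rating, bool) and 0 <= rating <= 10:
        return rating
    return None


def fake_llm(prompt):
    # Canned reply so the sketch runs without a model; not a real API.
    return '{"rating": 9}' if "love" in prompt else "I think it is fine."


print(rate_review("I love this shirt", fake_llm))
print(rate_review("It is okay", fake_llm))
```

Treating the model's reply as untrusted input and returning `None` on anything unexpected keeps one bad generation from silently corrupting a column of ratings.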

Demetrios [00:45:51]: Excellent. Cody. Well, I appreciate you breaking down Ibis for me, walking me through what that looks like and all the work you're doing on it, and this has been great, man.

Cody Peterson [00:46:03]: Yeah. Thank you so much for having me. It's been awesome and great to meet you.
