MLOps Community

Engineering Your AI Platform // Panel // DE4AI

Posted Sep 18, 2024 | Views 525
SPEAKERS
Tobias Macey
Associate Director of Platform and DevOps Engineering @ Massachusetts Institute of Technology (MIT)
Daniel Svonava
CEO & Co-founder @ Superlinked

Daniel is a Co-founder and CEO of Superlinked - an open-source vector compute framework for building RAG, RecSys, Search & Analytics systems with complex semi-structured data. Superlinked works with Vector Database providers to make it easier to build vector-powered software.

Previously, Daniel was an ML Tech Lead in YouTube Ads infrastructure, building systems that enable the purchase of $10B / year of YouTube Ads.

Colleen Tartow
Field CTO @ VAST Data

Colleen Tartow, Ph.D. has 20+ years of experience in data, analytics, and engineering. She is an author, speaker, startup advisor, mentor, and DEI advocate. Her demonstrated excellence in data and engineering leadership makes her a trusted advisor among executives.

Skylar Payne
Machine Learning Engineer @ HealthRhythms

Data is a superpower, and Skylar has been passionate about applying it to solve important problems across society. For several years, Skylar worked on large-scale, personalized search and recommendation at LinkedIn, leading teams to make step-function improvements in machine learning systems that help people find the best-fit role. Since then, he has shifted his focus to applying machine learning to mental health care to ensure the best access and quality for all. To decompress from his workaholism, Skylar loves lifting weights, writing music, and hanging out at the beach!

SUMMARY

To build a solid AI platform, it’s important to zero in on what really matters. This panel dives into the key lessons from the evolution of data engineering and MLOps, including how the industry shifted from niche tools like feature stores to broader platforms. The panelists discuss whether separate data and ML platforms are necessary, or whether they are more effective when integrated, particularly for companies with smaller data teams. Taking a step back to look at what has actually worked in MLOps and at the recent buzz around LLMs, they also examine the merging roles of data engineering, analytics, and MLOps, and whether the distinct ML engineer role is still relevant. Finally, they share insights on designing an AI platform that’s practical, future-proof, and free from unnecessary complexity.

TRANSCRIPT

Skylar [00:00:05]: Awesome. I'm gonna go ahead and welcome our panelists up. So we have our moderator, Tobias. We have Colleen and Daniel. I'll let the three of you take it away, and I'll drop into the background. But welcome. Super excited to have you.

Tobias Macey [00:00:22]: All right, well, hello, everybody, and thanks for joining. I'm happy to be able to be here and host a conversation with Colleen and Daniel. So just to get us started, why don't you go ahead and give us a brief introduction. Colleen?

Colleen Tartow [00:00:38]: Hi. Thanks, Tobias. I am Colleen Tartow. I am Field CTO and Head of Strategy at VAST Data. We are an AI data platform, and I've been doing data for a long time. I was a data engineer way back in the day and have focused on data, really the lifecycle of data, the supply chain of data, what it means to engineer data and get value out of it. So that's been my focus for a long time, and I'm excited to be here with you and Daniel.

Tobias Macey [00:01:07]: All right, and Daniel, how about yourself?

Daniel Svonava [00:01:10]: Hey, everybody, I'm Daniel, one of the co-founders of Superlinked. We are working on helping people turn data into vector embeddings and then do interesting things with the data. I guess people have heard about RAG over the last couple of months or years, but there is a lot more that you can do with vectors. We try to advocate for the broader use cases, and we have an open-source framework for building some of those. In terms of my background, I was an ML tech lead at YouTube before this, working on the ad systems and trying to marry the data initiatives to the world of modeling and creating value for the end users. Excited to chat.

Tobias Macey [00:02:03]: And for myself, most folks probably know me from running the Data Engineering Podcast, and a little while ago I also launched the AI Engineering Podcast, focusing on this space in particular: how do we actually design and build AI applications, and the infrastructure to support them, so that we don't all go crazy trying to figure out this new landscape? So with that, I'm wondering if we can kick off the conversation: as we all start to get our hands on these AI applications and understand their complexities and requirements, what new demands does that actually place on the people responsible for the underlying data systems and infrastructure? What is actually net new, and what is just business as usual? We still need to make sure that we have good data so that we don't get the garbage-in, garbage-out problem.

Colleen Tartow [00:03:05]: Yeah, I mean, I'm sure we all have lots of thoughts about this. It's a really good question, and it's one I hear a lot. We've been building these data platforms for many, many years, right? Like 40-something, 50-something years, maybe. And the focus has always been on that trifecta of performance, scale, and cost, and balancing those three in a way that makes sense, so you can get the ROI out of your data that you need but still keep it flexible enough to address new use cases, new technologies, et cetera. Over time, that turned into the modern data stack, and then it's kind of come away from the modern data stack; I think we're seeing a resurgence of folks repatriating and thinking about new ways to do things. Getting back to that balance of performance, cost, and scale: the scale is now changing a lot.

Colleen Tartow [00:04:06]: Right. We're talking about unstructured data, which is a lot bigger and a lot more complex than structured data. Things in tables, SQL. We've got that. We can do that. And the question is now how do we get value out of unstructured data? And that's, I think, incredible. And I think there's so much we can do with that. But it means that the platforms we're building and that we've built in the past won't necessarily work for these new use cases, for this new scale.

Colleen Tartow [00:04:37]: And even if you did scale them out, the performance wouldn't be there, or it wouldn't be there at a cost that is reasonable for the ROI we're looking for. So I think organizations have really tried to understand what that ROI is going to look like. It's less of an exploratory phase and more of: let's think through what we can build out with AI before we build the infrastructure, because the cost is so much higher. I could say more, but I want to pause and hear what Daniel has to say about this, too.

Tobias Macey [00:05:16]: Yeah.

Daniel Svonava [00:05:16]: Maybe just jumping off one thing you mentioned there: the focus on unstructured data. So we now have these binary blobs in our systems, and who knows what's in them; we're trying to figure out how to get value from them. One thing we see is that those projects are run almost a little bit separately from all the other data engineering, and I think that's not good. I see the different types of data as more of a spectrum of more or less structured. If you have a CDN log somewhere, the individual events maybe are kind of structured, right? But with terabytes of data per day, you almost have to treat the whole thing as a binary blob from that zoomed-out level. So not running the data-for-AI project standalone, but somehow making it part of the overall effort, I think pays off, because the more context the models have, the better. Not just having a document over here on the side and trying to understand what's in it, but also bringing in the information of: is anybody in our organization actually accessing the document, and maybe which parts of it are they looking at? There is a whole bunch of context there that you miss out on if you just focus on language models eating some PDFs. And that's also where the edge of your organization can come in, because you have all this other data you can integrate, and, you know, the value of first-party data and all that.

Daniel Svonava [00:06:56]: So integrating those efforts together, I think, is beneficial, and we can talk about ways to do that. The other interesting aspect is that in the modern data stack, with the semantic layer, we were trying to put some kind of structure over the data. We were trying to define: okay, what does this column actually mean, what are the values in it? And even then, when you actually look at what's in the column, it's almost like a black box, and you have to look at the specific values. This black-boxing is much worse in the world of AI, because now we have transformations running on our data where we cannot fully inspect what they will do. We have to design systems that can deal with that uncertainty and unexplainability. And so putting that much more effort into observability and assertions and checks, and always verifying what's going on, is that much more important.

Tobias Macey [00:07:58]: I have lots of thoughts on this as well, but I don't want to get us too bogged down. Another interesting piece to bring up before we explore this thoroughly is the past few years of investment that have gone into all of the different systems, platforms, and technologies focused on MLOps: being able to train, build, and deploy ML models using things like linear regression, random forests, et cetera, the so-called traditional ML, as distinguished from generative AI. A lot of those efforts seem to have been built in a discrete way, separate from the analytics stack, and I think that caused a lot of challenges for integrating effectively into existing data platforms. That also gave rise to distinct roles: the ML engineer working over in MLOps land, and the data engineer working in analytics land. As we enter this new epoch of generative AI being the all-consuming topic that everybody needs to integrate with, maybe that gives us an opportunity to reconsider those architectural decisions and understand the eventual and potential consolidations of those technology stacks. I'm wondering if we can talk to some of the opportunities we have, as we design and architect these new systems to support generative AI use cases, to build a more cohesive and holistically designed experience that actually takes you end to end, from data engineering through MLOps to AI engineering.

Daniel Svonava [00:09:49]: Well, I would say big question.

Colleen Tartow [00:09:51]: Yeah, that's a really good question. I would just start by saying the data engineering part is still 90% of the work in any of these pipelines, right? Taking data from a source and curating it so that it actually makes sense in a business context and can be used for BI, AI, traditional machine learning, whatever it is: that hasn't changed. The thing that changes, again, is the scale. But the core function of data engineering, curating data into a semantic layer, data products, or just golden datasets you can use: that data management and data engineering hasn't changed. And I do think we have to focus on that and streamline it so that you can use it for these different use cases.

Daniel Svonava [00:10:43]: Maybe there are, you know, new types of tools. The goal of having good, clean data to, let's say, train models or build products on top of hasn't changed. Perhaps what is a little bit different, and what people should be tracking, are the tools to do that. And maybe the data this is being done to is a bit different, right? There is a lot of talk about synthetic data. So how do we think about that and the curation of it, making sure it makes it to the right places and is consistent across different projects?

Daniel Svonava [00:11:23]: If you can make up a bunch of data, then you run the risk of each team doing that by themselves, and then you no longer have a shared foundation for the different projects. So, more than before, it's important for these people to actually come together, talk at the platform level, and align. It's also interesting: when we say that 90% of the job is data engineering, does it follow that the ML engineer basically becomes a data engineer over time, because the ML part standardizes a little bit and the evolution goes that way? We had the analytics engineer, right, evolving from analysts: people that build more at scale, more reliable systems. Does that role now also start to contain some of these more black-boxy, less predictable transformations, which are basically the ML models that we apply to data to generate downstream effects? So maybe that's the way to think about the role progression.

Colleen Tartow [00:12:41]: Well, I think there's always been an ops component to data engineering, and you can try to farm it out to a DataOps team or an MLOps team or whatever, but you're always going to have that ops component, the same way software engineering has an ops component. One thing you said, though, when you mentioned synthetic data: I also think about the fact that traditional BI-type pipelines were really source to finish, start to finish. It was end to end, literally a pipeline of data from source, through transformations, into some curation layer, into the use case you're looking at, the consumption layer. And AI is not like that, right? There are feedback loops that you don't necessarily get in those traditional pipelines. So I do think that means the data engineer needs to be involved more at both ends of the pipeline, in a way they might not in a more traditional use case.

Daniel Svonava [00:13:40]: So it's the end of DAGs, because a DAG, by definition, doesn't have a feedback loop. Yeah, you heard it here first, folks.

Colleen Tartow [00:13:49]: DAGs are dead. We just made a lot of enemies.

Daniel Svonava [00:13:54]: I think the new DAG, you know, with a little loop in there.

Tobias Macey [00:14:01]: From a platform architecture perspective, Colleen, to your point: I'm going to go out on a limb and say the old-school BI workflow of a continuous pipeline, where you have a start and a finish and nothing really cares about the in-between, has been dead for a little while, but people who haven't made that transition are going to need to make it to support AI. I also think the warehouse-first focus is fading; even the warehouse vendors have moved beyond it a bit, with things like Snowflake's integration with Iceberg tables, et cetera. The hybrid warehouse-and-lake architecture is a requirement for this new age of AI, because we need to work through a single control plane with both the structured tabular data and the unstructured data, and feed that data through our orchestration workflows, both into the data lake and, by extracting the semantic and semi-structured elements, into these vector stores, with a cohesive view of where all those pieces are flowing. And we need to invest more thoroughly in the intermediate states at these different transition points, because each of those intermediate states has value for different AI applications. We need to be able to tap into those when and where we want to, without having to re-architect the entire pipeline to make those intermediate states consumable.
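
A minimal Python sketch of this "materialize the intermediate states" idea, assuming a toy document pipeline; the stage names, lake path, and transformations are illustrative, not any specific orchestrator's API:

```python
import json
from pathlib import Path

LAKE = Path("/tmp/lake")  # stand-in for an object store / lakehouse prefix

def stage(name):
    """Decorator: materialize each stage's output under the lake path,
    so any downstream AI application can consume that intermediate
    state directly instead of re-running the whole pipeline."""
    def wrap(fn):
        def run(payload):
            out = fn(payload)
            target = LAKE / name
            target.mkdir(parents=True, exist_ok=True)
            (target / "part-0.json").write_text(json.dumps(out))
            return out
        return run
    return wrap

@stage("raw_docs")
def ingest(_):
    # Pretend this pulls unstructured documents from a source system.
    return [{"doc": "quarterly-report", "body": "Revenue grew. Margins held."}]

@stage("chunks")
def chunk(docs):
    # Split each document into chunks suitable for embedding.
    return [{"doc": d["doc"], "text": s.strip()}
            for d in docs for s in d["body"].split(".") if s.strip()]

@stage("enriched_chunks")
def enrich(chunks):
    # Attach structured metadata before the chunks head to a vector store.
    return [{**c, "tokens": len(c["text"].split())} for c in chunks]

enrich(chunk(ingest(None)))  # each intermediate state now lives in the lake
```

One application can read `chunks` for retrieval while another reads `enriched_chunks` for analytics, without either re-running upstream stages.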

Colleen Tartow [00:15:40]: Yeah, I'd agree with that. And I'd say that also changes what a data product is. There's this idea of data products that's been popular over the last few years, right? A data product is not just something consumable by an end user; it's something that can be consumed along that pathway as well.

Daniel Svonava [00:15:59]: Of the intermediate steps. Previously I mentioned that I think it's important to integrate data about entities together. So in the kind of semantic layer, let's say you have a customer and you try to the entity of customer and you try to collect all the different signals that help us understand what the customer is and their history. And the adoption of ML in enterprise is kind of like, ok, you need to clarify what the customer is and then collect all the data and then train some kind of predictive model or whatever it is. I think one of the problems is that those projects always go end to end, that I have to, as an ML engineer, go into the logs and kind of start from scratch. Hopefully the data engineers kind of figured out how to map everything. So I know which logs to go to and what they point, what they reference and so on. But I still have to go from the start.

Daniel Svonava [00:17:01]: And that's the thing that takes very long. So one of the things we are looking at is to take the first half of that ML project, call it feature engineering, call it different things, and basically have a way to not only have a clear representation of a customer but also a customer embedding. Take all the signals that we have about the customer and encode them into a numerical representation that compresses everything the customer did, everything they bought from us, their website that we scraped for contextual information; include everything in that embedding, in that vector, and then ML projects can happen downstream of that embedding. So not only do you collect features like, hey, how many times they clicked on something, which is a feature for a model I still have to engineer and train; you take it a step further and have a customer vector. Then every time I want to personalize a search the customer makes, or group customers by behavior, or train any model that's downstream from a customer, I start from the embedding. I don't start from the logs or the individual data points. I start from the representation. And I think that's also how you can integrate the documents from the knowledge base.

Daniel Svonava [00:18:20]: And has anybody ever clicked on those? Or did they say that this particular part of it is actually correct or not? So we see embeddings as a kind of place where the signals can get integrated and then fed into all kinds of other downstream projects, if that makes sense.
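
A minimal sketch of the customer-embedding pattern Daniel describes, with hand-rolled toy encoders standing in for real embedding models; the signal names and the hashing encoder are illustrative assumptions, not Superlinked's actual API:

```python
import numpy as np

def embed_text(text: str, dim: int = 64) -> np.ndarray:
    """Toy text encoder: hash tokens into a fixed-size vector.
    Stands in for a real sentence-embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def embed_counts(values) -> np.ndarray:
    """Compress behavioral counts into a small numeric block."""
    return np.log1p(np.asarray(values, dtype=float))

def customer_embedding(customer: dict) -> np.ndarray:
    """Concatenate per-signal blocks into one customer vector, so that
    downstream models (personalized search, clustering, churn) start
    from this representation instead of re-deriving it from raw logs."""
    return np.concatenate([
        embed_text(customer["scraped_site_text"]),    # scraped context
        embed_text(" ".join(customer["purchases"])),  # purchase history
        embed_counts([customer["clicks"], customer["sessions"]]),
    ])

profile = {
    "scraped_site_text": "industrial sensors and iot hardware",
    "purchases": ["gateway", "temperature-probe"],
    "clicks": 42,
    "sessions": 7,
}
print(customer_embedding(profile).shape)  # (130,) = 64 + 64 + 2
```

In this sketch, every downstream project consumes `customer_embedding(...)` rather than going back to the raw logs, which is the time-saving half of the ML project Daniel is pointing at.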

Tobias Macey [00:18:40]: And on the point of embeddings as well: from the work I've been doing in my own job, an important element of the embeddings and the vector store is the ability to associate useful metadata with the vector, because the vector in and of itself is not terribly meaningful if you don't know what it represents. Being able to add tags and additional metadata lets you do things like filtering, or use that metadata as enrichment to the context you provide to the LLM. It's also valuable from a reusability perspective: metadata that provides semantic context to the vector lets you use the same embeddings across multiple AI applications, without having to compute application-specific embeddings and thus explode the storage requirements of your vector store.

Daniel Svonava [00:19:40]: That's right. Sorry, I'll just jump in there and say: you need metadata with the embedding, yes, but to take full advantage of the technology, you need to encode the metadata into the embedding. Because what we see out there in a lot of projects is that they have embeddings for a part of the data and then do a bunch of filters on top. And that basically defeats the purpose, because the Venn diagram of a bunch of filters narrows things down too much, and then the vector search within that set is no longer that useful. Back at YouTube, we didn't have an embedding of the video description and everything else as filters. No, everything else made it into the embedding.

Daniel Svonava [00:20:23]: That embedding then reflects how popular the thing is, with which kinds of users, what its actual content is, moderation scores, and other models contributing signals. Everything is actually packed into the floating-point numbers. And that's where it gets really powerful.
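
A toy sketch of this "encode the metadata into the embedding" idea, not how YouTube's system actually worked: extra signals are appended as weighted dimensions so similarity search trades them off against content similarity, instead of applying them as post-hoc filters. The weights, dimensions, and random content vectors are all illustrative assumptions:

```python
import numpy as np

def pack_item_vector(content_vec, popularity, moderation_score,
                     w_pop=0.3, w_mod=0.2):
    """Append weighted signal dimensions to the content embedding, so
    nearest-neighbor search accounts for popularity and moderation
    rather than hard-filtering on them afterwards."""
    content = content_vec / np.linalg.norm(content_vec)
    extras = np.array([w_pop * np.log1p(popularity),
                       w_mod * moderation_score])
    return np.concatenate([content, extras])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Three catalog items with (popularity, moderation score) metadata.
catalog = [pack_item_vector(rng.normal(size=64), pop, mod)
           for pop, mod in [(1_000_000, 0.9), (500, 0.95), (80_000, 0.2)]]
query = pack_item_vector(rng.normal(size=64),
                         popularity=10_000, moderation_score=1.0)

# Ranking now reflects content similarity *and* the packed-in signals.
ranked = sorted(range(len(catalog)), key=lambda i: -cosine(query, catalog[i]))
print(ranked)
```

The weights act as knobs for how much each signal should influence retrieval, which is exactly what a hard filter cannot express.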

Tobias Macey [00:20:44]: One of the other aspects of this space is that for people who haven't been working in AI research for the past decade, or anybody who's relatively new to this space and trying to figure out how they can actually make use of it and get involved, it's very daunting, because it's so fast-moving, there are so many options, and there hasn't been any real determination of who the quote-unquote winners in the space are. And if you're a small team, you don't have a lot of spare cycles to put into research, investigation, and experimentation. One of the other interesting trends in the data ecosystem over the past decade or so is that so-called big data was for a long time the purview of big tech companies with huge teams and huge capital budgets, and the ecosystem has since brought it into the mainstream and made it accessible to more people. What are the opportunities for small teams to actually take advantage of this new set of capabilities, and what should they look to in order to build effective applications without getting mired in tech debt, cost overruns, et cetera?

Colleen Tartow [00:22:00]: Yeah, I mean, I think the eye always has to be on the prize, right? You need to understand the business value of what you're going to do, because it's fun to research AI, right? We all love it. We all love to listen to podcasts and go to conferences like this one and learn as much as we can. But at the end of the day, you need to be focusing on things like revenue and growth, and how this AI is going to affect them. Having a very narrow focus on that is essential in a smaller team; big enterprises can afford to do research projects, small companies can't. So I think that's step one. Step two: just as the modern data stack has made it easy to spin up a data environment from soup to nuts, in two hours or something like that, there are tools that will let you do that for more complex AI-driven cases.

Colleen Tartow [00:23:00]: There are open-source and paid frameworks out there that can do that. I know small teams always sweat the build-versus-buy decision, and I think that's another case of seeing what's out there in open source and what's out there to buy. I mean, you don't have to own a GPU, right? There are GPU clouds, and the hyperscalers have them too. So you can find GPUs, you can find all these resources, but focusing on what the ROI will be is really essential for a smaller team. And then I go back to: 90% of the problem is data engineering, right? So that hasn't changed.

Tobias Macey [00:23:43]: Yeah.

Daniel Svonava [00:23:43]: I recently gave a talk at the AI Quality Conference in SF, and I said that I think the best metric to evaluate an AI system on, you know, there are all these sophisticated ways, but our favorite one is USD, right?

Colleen Tartow [00:24:01]: Yeah.

Daniel Svonava [00:24:02]: So, you know, that's how you get promoted as a data scientist or ML engineer: you either save some money or you make some money. And with AI especially, you really have to be in tune with the downstream use case, because you can't just throw a dashboard together and hope that somebody gets value from it. It's that much more important to be friends with the product manager who is trying to deploy the system and actually understands what it will take to make it valuable. Being closer to the use case is one thing you're almost forced to do when you do AI, because the choice of models and their parameterization is literally dictated by what they're supposed to optimize for the business. So maybe you have to be a bit more full-stack, a bit more in tune with the downstream effects: carve out a small piece, deliver that, actually get it launched, and then iterate, as always.

Tobias Macey [00:25:11]: And I think for small teams in particular, the thing to optimize for is time to learning, more so than flexibility, because until you have put something out there and seen what it actually provides, you don't know what knobs you need to twist to get optimal performance for the thing you're trying to do. So maybe it makes more sense to invest in a paid platform that lets you get up and running quickly, and then, as you discover your actual needs and the customizations you need to make, start to bring those in-house and peel them off of the paid platform as you grow your capacity and capabilities.

Colleen Tartow [00:25:50]: I think we're also at this weird inflection point where people are frantic to use AI. Organizations' boards are saying, how are you using AI? And there's the obvious use case that's probably not great, which is: implement a chatbot for support and fire all your support people, and there's your ROI. And it's like, okay, but let's do this for real; let's not do that. I think there's an element of creativity that goes into AI that folks aren't used to, that you don't necessarily need with BI. Because there, yeah, you can hire analysts and they'll make beautiful dashboards, done right. Maybe you've got some machine learning improving your site search capabilities or something, and that's great.

Colleen Tartow [00:26:36]: But with AI, I feel like we haven't even scratched the surface of what we can do. And so that does require thought, research, and understanding the business in a way that traditional use cases haven't. So I do think that's very important.

Daniel Svonava [00:26:56]: Yeah, I would suggest not making up new problems. That's what we see a lot: I think the creativity should be applied to reframing known problems in the business, to, okay, how can we solve this with these new tools, and not to coming up with a new thing that we just bolt onto the existing offering. That's one pattern if you want to go the path of least resistance: we'll say, okay, well, we have this product here and it has known issues, but also, chatbot. As opposed to: we have a user retention problem. What goes into that? What does the user see throughout their journey? What is the most effective point of intervention, where if we help them out in certain ways, they'll stick around, or whatever it is? Because we played with the AI tools a bit and did some POCs, we now know roughly what the capabilities are and can actually bring them into the user journey, or the system interaction journey, or wherever we are in the organization. It's also much easier to get buy-in from your manager and from people with budget, because you don't have to sell them on some crazy new thing. You're just solving the old problems in a new way, there's probably already a budget established for that, and you can just go for it.

Daniel Svonava [00:28:35]: Yeah.

Tobias Macey [00:28:36]: All right, well, being conscious of time, we're getting near the end, so I just want to quickly bring us full circle to the impact this AI era is having on the overall engineering team. With the growth of agent frameworks, I think this is one of the few things that is going to force us all to realign the way engineering works. For a long time we had the silos between software engineering and operations that brought on the DevOps revolution, and we've had this long-term siloing of data engineers, which gave rise to DataOps, MLOps, and analytics engineering. That's all starting to merge back together. I think now we're in a spot where the data teams and the application and product teams are getting pushed together and will need to collaborate more closely and effectively, because this is a complete end-to-end requirement of integrating the data with the product, with the user experience, with the operations. Everybody needs to be on the same page, working together and understanding the impact AI has on their capabilities.

Daniel Svonava [00:29:51]: Amen.

Colleen Tartow [00:29:52]: Yeah, absolutely. Data is a product. I think that's the key.

Skylar [00:29:58]: Awesome.

Daniel Svonava [00:29:59]: Thank you.

Skylar [00:30:00]: Thanks to all of our panelists. This was super fun.

