Goal Oriented Retrieval Agents // Zoe Weil // Agents in Production

Posted Nov 15, 2024 | Views 853
# Retrieval Agents
# Faber Labs
# Agents in Production
SPEAKERS
Zoe Weil
Co-Founder/CEO @ Faber Labs

Zoe is an experienced AI leader and angel investor with over 12 years in AI/ML and a deep focus on building innovative search and discovery systems. She has invested in AI-first startups like Tough Day, Starcycle, and Eden Labs. As SVP of AI at Citi, Zoe spearheaded the adoption of agentic AI across the firm, and during her time as Staff Applied Scientist at Etsy, she led the development of personalized search and discovery. With three patents in large language models and a background in AI at NYU, she continues to be a leading voice in the AI revolution.

Skylar Payne
Machine Learning Engineer @ HealthRhythms

Data is a superpower, and Skylar has been passionate about applying it to solve important problems across society. For several years, Skylar worked on large-scale, personalized search and recommendation at LinkedIn, leading teams to make step-function improvements in its machine learning systems to help people find the best-fit role. Since then, he has shifted his focus to applying machine learning to mental health care to ensure the best access and quality for all. To decompress from his workaholism, Skylar loves lifting weights, writing music, and hanging out at the beach!

SUMMARY

At Faber Labs, we have built and patented first-of-their-kind scalable, specialized agents that autonomously seek to maximize Conversion Rate (CR), Average Order Value (AOV), and Return on Ad Spend (ROAS). With rapidly evolving variations on agent-based RAG systems, the demand for highly efficient, goal-oriented retrieval systems is increasing. GORA (Goal-Oriented Retrieval Agents) is the pioneering system Faber Labs designed to build these agents. We will explore how GORA distinguishes itself by embedding the overall system goal into its pipeline, ensuring that each component not only performs optimally but also works cohesively toward a unified business objective. Key aspects of GORA include its interactive feedback mechanism, which transitions from traditional static queries to dynamic conversational interactions, allowing for more natural and effective user engagement. We will also discuss the system's ability to maintain low latency despite its complex architecture, emphasizing how efficient integration of components is crucial for user efficacy.

TRANSCRIPT

Skylar Payne [00:00:00]: All right. All right, we're gonna go ahead and bring up our next speaker. Please welcome Zoe. Zoe loves to scuba dive, apparently. So I'd love to hear from her about what her best dive was. But before I do, I just want to give a fair warning that if you Google her, you may find a psychiatrist who is not her. Do not make the same mistake I did and make sure you are looking at the right one.

Zoe Weil [00:00:32]: Yeah. Thank you so much. I guess I can take over now, right? I haven't done this before, so. All right, let's get started. So again, my name is Zoe. Zoe Weil. Not the psychiatrist, but the co-founder and CEO of Faber Labs, where we have designed GORA to transform subjective relevance ranking using cutting-edge technology. Specifically, GORA, or Goal-Oriented Retrieval Agents, are the first of their kind.

Zoe Weil [00:01:06]: Specialized agents that autonomously seek to maximize any KPI. From maximizing conversion rates or average order value for retailers in e-commerce to minimizing surgical engagements for value-based care platforms. Our focus today will be on understanding and digging into three key themes: relevance, performance, and impact. And over the next 20 minutes or so, I'll guide you through our journey and our innovations. But honestly, also through some of the pitfalls in the process and some of the challenges that we faced in designing GORA from the outset, and how we overcame those challenges, because I think it's also applicable to the domain of agents in production more broadly as well. But before I do, I'm going to first talk a little bit about how Faber Labs specifically uses GORA. Now, the most immediate or obvious application of this is actually in retail e-commerce.

Zoe Weil [00:02:18]: For example, Amazon's recommendation engine, which I'm sure a lot of us use daily. Although if you're in Europe, you might be using the German version, Reva, which my co-founder worked at. But in any case, they use the recommendation engine to personalize user experiences, and this actually contributes a big chunk, about 35%, of its revenue. What GORA and Faber Labs are trying to do specifically is to take that underlying process and scale it for many different marketplaces and retailers and even neobanks, acting as an embedded KPI optimization layer for any consumer-facing business. And specifically in the e-commerce context, we're able to provide sub-second interactive results that are able to adapt in real time to user preferences and the most minuscule feedback that users give to our system, to optimize for conversion rate and average order value. Now, while the e-commerce application is the most obvious, maybe followed by applications in travel, hotel bookings, etc., we've actually found applications of GORA in the medical space as well, where we are able to provide sub-second results and interactive rankings to clinicians that are trying to find alternatives to ineffective surgical procedures. Here we're optimizing for an entirely different metric of lowering readmission rates to optimize for value-based care.

Zoe Weil [00:04:03]: So what GORA has really unlocked is this potential to serve subjective relevance at scale for any business that is trying to ultimately change end-user behavior in some way to optimize for very specific, measurable outcomes. In this way, when you really think about it, subjective relevance has become a cornerstone of modern human-technology interactions. And so we designed GORA with that in mind. And to do so we focused on three key pillars: user behavior, contextual insights, and real-time adaptation. We obviously want to take into account historical user behavior, but the key thing here for us is real-time, in-session user feedback. And when you think about the importance of the user at the center of the system, you begin to realize that one of the key challenges we ended up facing pretty early on was balancing personalization and real-time feedback with privacy, particularly in the medical space, a crucial aspect in today's data-driven, and I would even say data-obsessed, world. And so to that end, we developed what we call large event models to be able to generalize to user event data, and by extension also to be able to understand event sequences that they hadn't seen before, much in the way that large language models were able to generalize to text that they hadn't seen before. And finally, what's probably even more important than that very critical piece of innovation is that we're able to do this both at ultra-low latency and at scale. So let's dive into these three pillars and how we went about architecting GORA with these three major components in mind.
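To make the language-model analogy concrete, here is a minimal sketch of the input side of such an event model: user events are mapped to discrete tokens the way text is, so a sequence model can generalize over sessions. The event types, fields, and tokenization below are illustrative assumptions, not Faber Labs' actual schema.

```rust
// Hypothetical event vocabulary; fields are bucketed so unseen raw values
// still map onto known token types, which is what lets the model generalize.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Event {
    View { item_id: u32 },
    AddToCart { item_id: u32 },
    Purchase { item_id: u32 },
    Dwell { seconds_bucket: u8 },
}

/// Map a raw session to a sequence of discrete token ids, the same way a
/// language model maps text to a token sequence before modeling it.
fn tokenize_session(session: &[Event], vocab: &mut HashMap<Event, u32>) -> Vec<u32> {
    session
        .iter()
        .map(|e| {
            let next_id = vocab.len() as u32; // assign ids incrementally
            *vocab.entry(e.clone()).or_insert(next_id)
        })
        .collect()
}

fn main() {
    let mut vocab = HashMap::new();
    let session = vec![
        Event::View { item_id: 42 },
        Event::Dwell { seconds_bucket: 3 },
        Event::AddToCart { item_id: 42 },
    ];
    // The resulting token sequence is what a sequence model would train on.
    println!("{:?}", tokenize_session(&session, &mut vocab));
}
```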

Zoe Weil [00:06:15]: The first key aspect is obviously having a unified set of goals, which is actually fairly uncommon in designing relevance ranking systems. Our success hinges on these unified goals across different models. Goals are defined at the client level. As I said early on, one client's objective might be to optimize for conversion rate or average order value. Another one might be to reduce surgical admissions. Another would be to reduce call center volume. So once we have that goal clearly defined and tied to a measurable outcome, a measurable set of data, we then optimize our models end to end around these very specific goals, holistically and not just in a piecemeal fashion. And a key thing to keep in mind that ties directly into this unified set of goals is that GORA actually centers around conversations and very quick, interactive feedback loops.
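As one way to picture it, a client-level goal could be expressed as configuration that collapses several KPIs into a single scalar reward that every downstream component is trained against. The KPI names and weighting scheme below are hypothetical; this is a sketch of the pattern, not Faber Labs' implementation.

```rust
// Illustrative client-level goal definition: a weighted combination of KPIs.
#[derive(Debug)]
enum Kpi {
    ConversionRate,
    AverageOrderValue,
    ReadmissionRate, // a minimized KPI would carry a negative weight
}

struct ClientGoal {
    client_id: &'static str,
    terms: Vec<(Kpi, f64)>, // (KPI, weight) pairs combined into one scalar
}

/// Collapse the per-KPI outcomes of a session into the single scalar the
/// whole pipeline is optimized against, so no component chases its own metric.
fn reward(goal: &ClientGoal, outcomes: &[f64]) -> f64 {
    goal.terms
        .iter()
        .zip(outcomes)
        .map(|((_kpi, w), x)| w * x)
        .sum()
}

fn main() {
    let retailer = ClientGoal {
        client_id: "retailer-1",
        terms: vec![(Kpi::ConversionRate, 1.0), (Kpi::AverageOrderValue, 0.5)],
    };
    // Outcomes observed for one session: CVR signal = 1.0, normalized AOV = 0.8.
    println!("{}: reward = {}", retailer.client_id, reward(&retailer, &[1.0, 0.8]));
}
```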

Zoe Weil [00:07:26]: And so we implemented a real time feedback processing system to integrate it with agent decision making. And this feedback influences context selection, obviously also has to modify agent behavior and optimizes the system overall. So I'll discuss a little bit about the feedback loops and their impact on the overall system performance in a little bit. But the core approach here is to train models that can be trained end to end in a very, very holistic manner through reinforcement learning and policy models. And these models include embedding generators, re ranking models and agent models. And since they're optimized jointly using the final reinforcement learning step, you know, we're able to avoid a lot of the common pitfalls, particularly of larger relevance ranking systems that use stacked models that are optimized for different objectives. Especially because a lot of the times these objectives can actually run counter to each other. A common one is that, you know, for retailers, conversion rate doesn't always mean an increase in gross merchandising sales because more users can buy more, but they might also buy at cheaper price points.

Zoe Weil [00:08:52]: So this is something that having this unified goal system with user feedback has really, really helped our downstream customers. And the last pillar, as we talked about it a little bit, is latency. You know, speed is a super, super important factor in the user experience. There's actually extensive data on this, the most prominent one being that about 53% of mobile users abandon sites that take longer than three seconds to load. Which is, I think, fairly understandable. I think we can all, you know, empathize with that, having been on the other end of it. So we set component level latency budgets and optimize intercomponent communication through parallel processing. In this way we're able to improve response times, manage the system complexity, and enhance this end user experience.

Zoe Weil [00:09:50]: Critical to that, by the way, this overall design has been our choice of backend language and libraries. I know this might be a controversial comment to make, particularly to a very AI data science heavy community, but one of our best and most important decisions has been to run our back end almost entirely in Rust. There was a huge, huge debate internally among the team as to whether or not we should go with something that had more support in the data science community. Although I have to say that rust is taking off. We're seeing a lot of adaptation of Rust, particularly on the back end, and that's also nice. Anyway, we made the leap and it's actually played a pivotal role in our performance strategy. For those of you who might be less familiar, Rust is actually known for its memory safety and concurrent processing capabilities and it's allowed us to achieve zero cost abstractions. One of my co founders, who actually strongly advocated for this, for this move to rest, was inspired by how Discord handled their migration to the language.

Zoe Weil [00:11:11]: And in any case, we've seen very significant improvements in our benchmarks. And even though I have to be honest, the transition was super painful, especially myself, I come from a Python, I have a lot of experience in Python and Scala. So it was definitely a mind bender, as a lot of the concepts in Rust are counterintuitive, at least to me. But you know what, it's been really, really, really worthwhile, the effort because it's really enhanced our privacy and security and also efficiency and I would even argue significantly reduced our costs, which is really important to us because we started out this company, we were bootstrapped. And so, you know, keeping costs down was really, really important to us. And in terms of where specifically we use it, we leverage Rust as our, as I said, code, our back end layer on the orchestration side for both user requests, but also for managing aspects of our models like key value caching. This low level integration has allowed us to quickly and efficiently move data around, which ultimately allows for the lowest possible latency. Cool.

Zoe Weil [00:12:32]: Continuing on this low latency theme, as you can see, it's really, really key for us because as I said, we don't just have a single client like Amazon, they do it internally just for themselves. We have many different clients from different industries, so latency is really, really key. And having a system that scales is really, really key. One of the biggest challenges we had to address that's actually specific to LLMs is allowing for obviously follow up prompts in a conversation and then having that whole conversation also be represented somehow. It needs to be run into a model as conversations. As I'm sure you guys noticed, they can often flow rather quickly. You end up with this larger, larger stack that can get quite unwieldy from an operations perspective to manage and to produce effective results. Particularly given that we're not primarily a knowledge discovery platform.

Zoe Weil [00:13:38]: We're trying to affect the end user behavior in some way to achieve specific outcomes. So again, that's like this is something that's really, really important to us for that aspect of it to be able to keep track of the long kind of back and forth in the conversations. We leverage intelligent caching on the GPU to prevent the need to generate the key value cache for follow up prompts. And this has been another thing that we've tried that has been really, really helpful in helping us manage our infrastructure and latency at scale. So what do these numbers look like? This is how, this is how the results are. I promise. This is the last, this is the last slide on latency. I just wanted you guys to see the actual impact that we're able to have in, you know, taking some of the, making some of the decisions that we, that we have.

Zoe Weil [00:14:38]: So let's dive into the numbers. We've actually significantly improved our load times and a lot of these enhancements, I have to say, haven't just boosted user engagement, but also positioned us against well ahead, I would even say of industry benchmarks. The you have to keep in mind that these are the response times for a conversation aware and context aware model and not just a simple single pass query embedding retrieval system. Obviously these response times would not be so great when compared to less than 2010 millisecond budgets that you have for single pass query embedding systems. But when you compare it to other conversational and I would say adaptive models and agent based systems, these numbers are actually fairly, fairly, fairly good. And if you. The thing to also keep in mind is that we've built a system that is able to improve as newer, faster, better kind of models and frameworks come out. So we're able to take advantage of innovations down the line as well.

Zoe Weil [00:15:54]: But we're not like just the lab. We're not just like an operations lab or an infrastructure lab that tries to just improve numbers around latency. The thing that we care about the most in terms of our technical advancements is do they translate into tangible business impact for our clients? And thankfully, the answer has been a resounding yes. We've actually seen a significant return on our investment, including user growth. Our clients are obviously very, very happy with us and this has translated again to wins for them in terms of not just conversion rate and average order value, but this kind of joint optimization for these metrics that are often at loggerheads with each other. The other thing to keep in mind is that the impact, the scale of the impact has not, is also pretty, pretty large. So compared to baselines, it's not just that we're seeing a 5, 10, 20% improvement. We're actually seeing consistently above 200% increases jointly in CBR and in conversion Rate and average order value.

Zoe Weil [00:17:08]: I know I'm almost at time, so I'm going to, I'm actually almost done, I promise. So, you know, aside from these specific metrics, you know, it's really, really refreshing. Again, we're a business to see our clients come and be so, so happy with us and you know, because we were able to drive very, very meaningful change in their businesses. And a lot of the times, you know, they had been rather jaded by the promises of these agent based solutions. And so to be able to see that we're able to actually stand out and some of the tough, tough, tough, tough decisions we've had to make on the architecture side have really paid off for them is very, very meaningful and rewarding to us. With that being said, I also want to thank Marcin Mayeron, our chief scientist and one of my co founders for helping out with putting this, this deck together. Thank you guys so much. And I'm going to take some questions, I'm going to go back and see if there are any questions.

Skylar Payne [00:18:18]: Awesome. Yeah, I can go through and, you know, read out the questions. So again, for anybody listening, if you have questions, please throw them in the chat in the Q&A section. But yeah, I'll kind of roll through them. So Dileep asks: for medical applications, how do you tackle the enhanced privacy challenges due to the sensitive nature of the data?

Zoe Weil [00:18:45]: So we actually have an on prem option. Again, like what this comes out to, comes down to is like, because having on prem solutions is actually really, really challenging. Like it's a, it's a, it's an additional engineering challenge that you have. So, and you know, it can be a killer, especially early on for businesses like ours. But having this super, super blazing fast, kind of lightweight solution that, you know, honestly would have been impossible if you had used literally any other tool other than rust. I keep saying it because like I was very, very skeptical about it and I'm a full, full on convert here. But yeah, using, you know, having this very blazing, fast system has made it like very easy for us to offer this on prem solution for people in the medical space. And actually as we're now working with Neo banks as well, that's another kind of place that has benefited from having an on prem solution.

Zoe Weil [00:19:51]: And I think earlier today one of the folks at Prosis, I think it was at Euro, was actually talking about you guys and how they have actually used an on prem solution for their, for their application of agents as well. Again, privacy is key. So, and we've known this for A very long time.

Skylar Payne [00:20:12]: So awesome. Dalip also asks, I understand you train these large event models rather than use LLMs. So do you train them from scratch and how much data does it require?

Zoe Weil [00:20:28]: Very good question. So actually we do train them from scratch, the large event models, but they're not like the only. We actually also do use LLMs. Right, but they're not like the core kind of novel offering that we have. We actually rely on open source LLMs for the LLM components to actually glue everything back together and present it to the user that we use LLMs for. But for the actual meat of the ranking, it is correct that we use what we call large event models. Yes, we do train them from scratch. The data that we get is from our, from our clients.

Zoe Weil [00:21:09]: They provide us with this data. And one of the key things that again, this is another kind of novel way that we've been able to kind of innovate in this space is we can just use their messy data. We've been on the other end, we know how messy client data can be and we knew that if we didn't have a very, very kind of scalable solution to quickly be able to take advantage of that messy data as is, we'd be dead in the water. Again, goes back to bootstrapping and being able to scale and things like that. We use client data to train our large event models and the specific alignment function that we've designed in the reward layer allows us to actually take benefit, take advantage of the network effect of having many different clients and learning from that data without compromising privacy. So we don't leak any information, direct information between clients, but we're able to learn from the findings and feedback, user feedback overall, from one client to another without leaking any kind of private information.

Skylar Payne [00:22:22]: Yeah, that's good. So you talked a lot about leveraging Rust and how much benefit that gave you. So for other practitioners here that are listening, that might be thinking about leveraging Rust, but maybe have more of a Python or their background, do you have any tips on making that sort of change successful?

Zoe Weil [00:22:41]: Yeah, I think one of the benefits I first learned to code before there was a lot of open communities, communities online, and there's so, so much tooling out there around like ed tech, honestly, that you can, you can, you can take and don't like feel embarrassed to just hop in, dive in and really, really, you know, be a student against for some of us, you know, I've been out of graduate school now for a long time and even though I'm a researcher Like I tend to do research in areas that I'm already well versed in. So taking that stuff step to do something totally different because Rust isn't just a new programming language. It's like you have to rethink how you think about code in a lot of ways as well. So I would just encourage you to just dive in and take advantage of a lot of decentralized education that's available right now. I would also say that if you actually want to make the switch, you're in luck because Rust has a very vibrant, kind, helpful community. People are like really willing to respond to your questions on GitHub. And if you have a data science and ML background, I would actually encourage you to look at some of the hugging face kind of back end, you know, open source available code out there that you can look at that's kind of related to your domain. So that was really helpful to me because I already understand what the code is supposed to do.

Zoe Weil [00:24:09]: I just don't know Rust. Right. So having some of that context already available to me was really, really helpful in being able to learn.

Skylar Payne [00:24:18]: Awesome. Cool. We're running right at time, so maybe one more fun question. What's your favorite dive you've ever done?

Zoe Weil [00:24:26]: Actually, it was in the Cayman Islands. This is going to sound really, really. But it was a very specific experience. There was a nursing shark with, with her babies and it, it was a really, really intimate and adorable moment. It was just like myself and my husband and so it was a very, very adorable, cute moment. Yeah, it wasn't in the Barrier Reef or anything. It's very beautiful there. But it was that it was a very personal experience that I very much enjoyed.

Skylar Payne [00:24:56]: Awesome. Well, thank you so much for coming and sharing your insights and your work.
