Building Robust and Trustworthy Gen AI Products: A Playbook
Faizaan is a Product lead working on Personalization and Generative AI use cases for Creators and Content Understanding at LinkedIn. He's been in the field of machine learning for 8+ years now and recently launched some of LinkedIn’s first user-facing Gen AI products in early 2023.
A practitioner's take on how you can consistently build robust, performant, and trustworthy Gen AI products at scale. The talk will touch on different parts of the Gen AI product development cycle, covering the must-haves, the gotchas, and insights from existing products in the market.
Slide deck: https://docs.google.com/presentation/d/1STE9e5ByZVK5CKrzIVqjs4XKHa5q2kxdY0iXFDRpIIM/edit?usp=drive_link
Faizaan Charania [00:00:10]: Everyone these days wants to build Gen AI: either new products or a company that's focused on Gen AI, or established companies and established products that want to add Gen AI to their experiences. Good. Great. But with time, this whole experience is getting commoditized. A lot of people are using established, much bigger foundation models and building on top of them, and there's nothing wrong with that. But how do you differentiate your product from everything that's out there? The first-mover advantage only lasts so long. And that is exactly where building with trust comes in. This is what's going to help you differentiate your product offering from something a competitor might bring or build two weeks from now, two months from now, or that they're already working on anyway.
Faizaan Charania [00:00:57]: So before I jump in, who am I? I know who I am. Okay, cool. The slides are up now. My name is Faizaan. Right now I'm a senior PM at LinkedIn; I've been at LinkedIn for two and a half years now. Lately I've been working on building Gen AI experiences for consumers on LinkedIn. Other parts of my charter, the things I started with before Gen AI took up a lot of my scope, were video understanding, content understanding, and AI for creators. That includes recommendation systems and other experiences we built for creators as well.
Faizaan Charania [00:01:33]: Before that, I was at Yahoo, focused on large-scale machine learning problems and building ML platforms. I was a machine learning engineer, a lot more focused on solving the core technical machine learning problems, and that's where I also transitioned into being a product manager. Before that, miscellaneous ML research roles: an internship over there, a research lab here and there. And fun fact: when I was in grad school, I was working on that time's gen AI. We were just working on GANs, "Attention Is All You Need," and all of those other papers. I remember seeing some of my work from back then.
Faizaan Charania [00:02:09]: And we were creating images based on captions, which is something very similar to what we do with dalitude today. But the images that we were generating were thumbnail size, very pixelated, not good quality. But then in the past year or two, the technology has reached a scale where we can productionalize this and where we can commercialize this. And that's where we have to start thinking about, okay, now that the technology is ready ish, how do we commercialize this and how do we build good products? Okay, cool. So I said trustworthiness is one of, is going to be one of the core aspects of how you differentiate your product from your competitors. This is the rough structure of how we'll go about it first, even understand what is trustworthiness? What do I mean when I say trustworthiness? Then two aspects. One from the AI perspective, how do we build the AI model or the AI system pipeline a lot more robust? And then on the other side is when we are building the user experience, how do we make that absolutely awesome? And how do we win the user over? We look at some real world examples, a conclusion, and then I'm also here for questions. Okay, cool.
Faizaan Charania [00:03:19]: Let's jump in. So, first of all, Gen AI, like I said, until very recently was just creating thumbnails, and now it's creating full-fledged images, articles, what have you. But overall, when it comes to being commercializable, Gen AI as a technology is at a very nascent stage right now, and there are hallucinations. Everyone knows about hallucinations. But when you're building products, users want something that's a lot more reliable, and hallucinations can make users skittish. So we have to watch out for that when building new products. Trustworthiness is crucial for AI adoption and success.
Faizaan Charania [00:03:56]: There's a mini flywheel here. Once you build trust, you get user acceptance, and people will start using more of your products. Before I even go forward: there have been many instances, in the early days at least, when even ChatGPT or Gemini stumbled; Gemini had a recent slip-up with AI Overviews. When users see that the model is not working well, they lose trust. And in this very competitive field of Gen AI right now, where everyone's building products, losing that trust is very costly. That's why my whole focus is on building with trust first.
Faizaan Charania [00:04:34]: But, okay, so once you build with trust first, you get user acceptance. Users see your product use it, they expect it to work fine. Once they know that your product will work as expected, it moves to users, engaging with it more often, deeper engagement, using it for other use cases as well. And then obviously that leads to downstream user growth. They will come back more often. There's word of mouth publicity, there's network effect, and overall, that is how you grow your product while delivering and solving the user's core problems. Okay, so what do I even mean by trustworthiness? The two ways I'm going to divide this is one, is the very technical performance. This is where all of the model eval comes in, which is accuracy, reliability, on topic, response rate, hallucination rate, and these are some of the common ones because we are not looking at any particular use case right now.
Faizaan Charania [00:05:30]: But when you define your use case, there are going to be specific metrics for your particular use case as well. So those are the technical metrics, something that we are very actively aware of and think about when we are building models. But then the second set is the overall user experience. So when you build your product, how does it influence user satisfaction? How is it influencing user loyalty, user retention? Is it actually solving for the user retention problem? Sorry, the job to be done that the user is coming to your particular product for those are the things that we should be measuring. There's a lot of conversation around, just like building Gen AI for the sake of adding genai to the product and that can reach the prototype stage, but it won't last for that long. Okay, so now that we've defined trustworthiness, let's look into how we actually build robustness over here. First and foremost, high quality data is the backbone of any AI system. I've given other talks where I just focus on high quality data.
Faizaan Charania [00:06:33]: Eval, auto eval. How do you even structure your evaluation? Prompt engineering. There's so much more over there, I could just spend a total hour on that. So I'm not going to do that right now. I'm just going to skip over that. But think about it. And I'm sure there were other speakers who also touched upon this topic, so happy to talk about it in the Q and a section afterwards if you're interested. But anyway, so high quality data.
Faizaan Charania [00:06:57]: To take a very simple example, let's look at a particular use case. When annotating customer service requests, say you're trying to build a customer service bot or agent. The training and eval data should include a variety of sources. The default choice we might take is to just randomly sample previous conversations and train the model on that, or evaluate the model based on that. And that gives you the first bucket, frequently asked questions, because a random sample is going to surface the most common questions that come up. But that's just the first bucket, and it won't sustainably solve your problem. You shouldn't solve just for today, today's distribution, or how users use your product today. You have to build for the future and account for potential data drift.
Faizaan Charania [00:07:46]: And that's why if we have to keep maintaining users trust, we have to look at some other examples as well. And some other examples that I've listed over here are conversations with people who speak English as a second language. Like is your model robust enough, or is your solution doesn't have to be just a model, but the prompt engineering or like how well you're evaluating your system overall, is it robust enough for people who speak English as a second language? Maybe they don't interact with the customer agents right now, or there's a lot of lead time or close time. The close time is much longer for people who are not as fluent in English, and that's why you don't see it as well represented in your current data set. But once you launch, it's going to be one of the big use cases. So that's that. Some corner case, complex questions. And the last one is like some examples of users trying to jailbreak the system.
Faizaan Charania [00:08:36]: Because red teaming is extremely important. You don't want your product or your company to become a meme on the Internet. There was some airline company and the chatbot offered free tickets, and then the person was trying to claim, because the chatbot is representing the company, the company should. It's just not good publicity for the company. So don't add Genii for the sake of Genai. Don't just prototype with just half assed efforts. Let's make it comprehensive and build a reliable AI product. Okay, so this is just a primer on the things that you can do on the AI side.
Faizaan Charania [00:09:12]: Well, let's jump over on the UX user experience and UI side as well. So a few things to keep in mind is you want to make your users life easy. So three things over your first, demystify AI. You can use tooltips or feature explanations within the product interface. One thing that I really like, that notion did very early on. We had a speaker from notion earlier today as well. But notion did very early on is they integrated a lot of their AI features right into the user flow. So where people make notes, the AI features were then and there, it wasn't a separate chatbot that you then have to go and talk to and then come back and try to add it to your notes.
Faizaan Charania [00:09:51]: It's just not efficient. So, and what the AI does and what those GenaI features were doing was also very well listed. So that's demystifying AIH set expectations. Share the model's limitations and performance metrics with the users. This, the most common or most basic way of doing this is you would see in all of these chatbots, like chat, GPD might hallucinate. Gemini creates AI generated content. Please review things like that. But there's a lot more that can be added depending on your particular use case and the third is give users control over the AI's actions and the data it accesses.
Faizaan Charania [00:10:30]: And users, once they provide consent and then they're using their product, they feel a lot more involved as well. And that leads to long term, consistent usage, because now that they have provided that consent, they know that the AI has access to, I don't know, my emails or my work files or something else. So the next time I'll ask more complex questions, I'll ask better questions, more informed questions that will lead to your AI model giving me better answers, which just makes everyone's life easier. Okay, so that's for solving the problems in the, in the now when the user is using your particular product, but in the long term, you also have to think about the ecosystem and how you're building the whole system. Okay, so this is where another aspect of gaining the user's trust comes in. First, user privacy. The last time, in the previous slide, I mentioned user control. Look at what kind of data people, what kind of data your AI is going to access.
Faizaan Charania [00:11:33]: But once you do access that data, make sure that it is encrypted, it is stored in a particular way that is safe, robust, will not be leaked, hopefully, but also at the same time, it's only being used for its intended purposes and not for other pieces. That's one. Fallback mechanisms, a lot of time. Like I said, jenny, the technology is nascent. It's not always going to be foolproof. So what you should be doing is build in default fallback mechanisms where at some point you would add human in the loop. Depending on what your flow looks like, there could be certain aspects where you say, hey, you should review this. Now, if it's an agent trying to take actions for the user, it creates a lot of trash.
Faizaan Charania [00:12:15]: If the agent, let's say, if it's a LinkedIn example, I'm just making these examples up. But if the agent applies for me, it's trying to help me, and then it finds a job, but it applies for me with the wrong resume, not good. Creates a bad experience for me, creates a bad experience for the recruiter. The next time, if I actually want to apply with my correct resume, it's just going to make me look bad, because now there are two applications with my name. That's one, if you're editing a video somewhere, if you're editing an image in Photoshop or any of these video editing software, if the agent goes ahead and tries to edit or add some layer or add some aspect to your video, spends a lot of cycles, uses your gpu or uses your cloud credits, anything like that, and creates something that is nowhere close to what you wanted. Not a good experience again. So this is where fallback mechanisms are going to be very important. Bring the user along with you in the journey.
Faizaan Charania [00:13:12]: And the third, very obvious but still very important is bias assessment. Regularly evaluate your system for biases. Do audits create specific data sets that make sure that your product is robust, inclusive, and is not just reducing bias, but also promoting equity? So these are just high level areas that I am trying to focus on right now. Let's go over that and depending on your particular use case, we can discuss more details as well afterwards. I'm not going to any of the LinkedIn use cases as well. Just trying to keep it short and sweet over here. Okay, cool. So real world example, just a few ones up over here.
Faizaan Charania [00:13:53]: Just trying to see what products are out in the wild and are using the principles that I am talking about over your demystify AI, set expectations, give users more control or thinking about user privacy, fallback mechanisms, removing bias, things like that. So the first one that we look at is co counsel by Casetext. This is a particular product that's used for, that is available for anyone in the legal industry. This can be used by lawyers, paralegals, anyone. This helps do a lot of research and analysis and prepare documents that can help with creating legal documents downstream, creating contracts, making it robust, making sure that contracts don't have some particular clause that could cause other issues or have caused other issues. And there are precedents for anything like that. So the first piece is they demystify their AI product. They're very actively showing they're using rag and they're actively showing what sources were used for each and every analysis or each and every takeaway that the product is generating for them.
Faizaan Charania [00:14:52]: And the fallback mechanism that they've added is once all of the analysis is done and the final contract is supposed to be generated, there's an active call out on, hey, please review these sections because these are parts that were picked up or these were parts that were generated by the AI and then it goes into the legal document, the contract, or anything that you're supposed to be building with co consul. The second example is Microsoft Copilot for work. One thing that I really loved is initially I wasn't using Microsoft Copilot a lot because I was on the consumer product and I was afraid, oh, I can't ask questions about my work because then I leak my work data and I don't want that. But the enterprise version of it very conveniently has the user privacy feature which explicitly says that no conversations will be used for model training. That's good. Becoming an industry standard now which is awesome. And user control it only accesses publicly shared documents and files to generate answers. Like even within work it only access public files.
Faizaan Charania [00:15:49]: There's always confidential files and we don't want that to be mixed in the general population. Google do it for AI similar set expectations. They started with Gmail and there were very particular features and now they're slowly expanding to the copilot version and now that I trust that they are going slow and they're building the right products right features I am more open to using it and bias mitigation. We know that Google works with a lot of partners as well. Anyway that said, in conclusion, trust is the cornerstone of building long lasting brand and product loyalty and lost trust is difficult to earn back so keep building. With that in mind. Thank you.