Databricks Assistant Through RAG
Neha Sharma, based in San Francisco, CA, is an Engineering Manager at Databricks, with previous roles at Slack and Premise Data. She holds a Bachelor of Applied Science in Computer Engineering (Co-op, 2007-2012) from the University of Waterloo, and her skill set includes Agile methodologies, XML, AJAX, JavaScript, Java, and more.
Neha Sharma, leader of the notebooks and code authoring teams at Databricks, discusses their journey with the Databricks Assistant powered by large language models (LLMs), showcasing its ability to interpret queries, generate code, and debug errors. She explains how the integration of code, user data, network signals, and the development environment enables the assistant to provide context-aware responses. Neha outlines the three evaluation methods used: helpfulness benchmarks, side-by-side evaluations, and tracking user interactions. She highlights ongoing improvements through model tuning and prompt adjustments and discusses future plans to fine-tune models with Databricks-specific knowledge for personalized user experiences.
Neha Sharma [00:00:00]: Hi everyone. I'm Neha, and I lead our notebooks and code authoring teams at Databricks. Today I want to talk to you about our experience building the Databricks Assistant using LLMs. I'm going to talk through three parts. First, I'll talk about how we've leveraged large language models for our products. Then I'll talk about how we built these and how we think about evaluation. And finally, I want to talk a little bit about what we're thinking about for the future and what's upcoming for us. We have a tremendous opportunity to make our product better by leveraging large language models.
Neha Sharma [00:00:37]: Even before LLMs, there were hundreds of thousands of users using Databricks on a daily basis for their work. But what we found is that by leveraging LLMs, we're able to make them even more productive, so they get even more out of their day-to-day workstreams. I'm going to show you some examples of how this works. The Databricks Assistant sits on pretty much every page at Databricks, so you're able to converse with it, and it has all the contextual information about Databricks. Here, for example, I'm asking what DLT is, and it knows that by DLT I mean Delta Live Tables, which is a Databricks-specific concept. The assistant can also help you author code or write SQL queries. So I have a dataset here that represents some Spotify top songs.
Neha Sharma [00:01:27]: In English, I'm asking it to list the top ten artists based on the number of songs, and the assistant is actually spitting out a SQL query, which you can then run. If that looks good to you, you can import it into your Databricks notebook. So without even knowing SQL, I can just ask and it'll generate a result for me. It can also diagnose and fix errors. In this code snippet, I have an error. It's not obvious what the resolution is, and traditionally what I would have done is go line by line, try to understand what's happening, maybe use print statements, perhaps the debugger. But now I'm just going to invoke the assistant, and within seconds it's able to tell me that I'm missing a parenthesis in my toPandas call.
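For illustration, here is a minimal sketch of the kind of SQL the assistant might generate for that request, run from a notebook through Spark. The table name `spotify_top_songs` and the `artist` column are assumptions for this example, not details from the talk.

```python
# Sketch: the sort of query the assistant might produce for
# "list the top ten artists based on the number of songs".
# Table and column names (spotify_top_songs, artist) are assumed for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

top_artists = spark.sql("""
    SELECT artist, COUNT(*) AS song_count
    FROM spotify_top_songs
    GROUP BY artist
    ORDER BY song_count DESC
    LIMIT 10
""")
top_artists.show()
```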
Neha Sharma [00:02:10]: So without writing code, in a matter of seconds, I'm able to fix my error. Now, let's say I had a snippet of code that I wrote a really long time ago, or perhaps it's code somebody shared with me and I want to use in my work. I'd like to verify what it does before I actually use it again. Traditionally, I would walk through it line by line and try to piece together what's happening, but instead I'm going to ask the assistant to explain it, and the assistant very helpfully breaks it down for me so I can quickly understand what this code is doing and whether it'll work for my workflow. We've also launched Databricks Assistant Autocomplete. This gives you really fast code autocomplete suggestions as you're typing. Over here, in the first line, it autocompleted that I wanted to read the CSV into a data frame, and just from a comment saying "plot a bar graph," it understood what I wanted and very quickly gave me the code to do so. So again, it's something that can really be a helpful pair programmer, if you will.
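As a rough sketch of the autocomplete behavior described here, assuming a pandas workflow; the file path, DataFrame, and column names are hypothetical.

```python
# Sketch of the kind of completion described above; path and column names are assumed.
import pandas as pd
import matplotlib.pyplot as plt

# First line: read the CSV into a DataFrame (autocompleted from the start of the line).
df = pd.read_csv("/dbfs/FileStore/spotify_top_songs.csv")

# plot a bar graph  <- from a comment like this alone, the suggested code might be:
top_artists = df["artist"].value_counts().head(10)
top_artists.plot(kind="bar")
plt.xlabel("Artist")
plt.ylabel("Number of songs")
plt.title("Top 10 artists by song count")
plt.show()
```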
Neha Sharma [00:03:08]: And finally, we also have the ability to rename your data assets. You can rename your notebook to be more contextual, and you can also rename cells. And you can imagine, if you have multiple cells in your notebook at a time, as I always do, it's always helpful to name them so you can come back and understand, okay, this is what I was trying to do. So this is also a really nice quality-of-life improvement. To recap, these are the use cases available for the Databricks Assistant: conversational, which is akin to a typical chatbot experience we've seen in the industry; code generation, both Python and SQL queries; explaining your code; fixing code; fast autocomplete suggestions; and renaming your data assets. We're seeing a lot of traction on this. We already have several hundred thousand weekly assistant users, and we're seeing millions of assistant interactions on a weekly basis.
Neha Sharma [00:04:04]: So, so far it's working. Now let's talk about how we built this and how we think about testing and evaluating it. One unique thing about Databricks is that we have four key ingredients that make this sort of contextual awareness possible. We have the code; we have your data, which is sitting in your Databricks platform, so your tables, your schemas, and any other data you may have; we have what I call network signals, which you can think of as the relationships around your data, things like the popularity of your tables, usage, ranking, and recents; and we also have your dev environment: are you using Python, are you using particular libraries? We have all that information.
Neha Sharma [00:04:44]: Together it becomes a really powerful combination for the LLM to give you a very specific response. This is roughly the flow that happens when a user makes a request. The user types in a question for the assistant, or invokes fix, or whatever other action, and we collect two types of context, which I call remote context and local context. The remote context can be thought of as a retrieval step, where we're making calls to Databricks services to get information like tables the user has access to, or external docs that may be helpful for this particular query. We also fetch local context. This includes the user input, the system context, and any relevant context that's on the page.
Neha Sharma [00:05:37]: For example, in a notebook, we'll send the cell you're focused on, the cell before it and the cell after it, the language of the notebook, any error messages you may have received, in-memory variables, and anything else that might be useful. One note, by the way: we never send row-level information, so we're never sending results to the LLM. We combine all of this and make the call to the LLM, which then streams the response back to the client. Here's an example prompt. The first part is the system prompt, then we have a bunch of context, some of it local, such as error messages, and then our retrieval piece for external docs. Okay, so does it work? How do we know that it works? All of this sounds great in theory, but evaluation is a key thing we need to think about. For us, success looks like users actually finding assistant responses helpful in their workflows, users actually using the assistant, and, of course, users giving us positive feedback on their experience. Keeping these in mind, we have three different evaluation strategies, and we actually use them in combination.
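A simplified sketch of how such a prompt might be assembled from local and remote context; the `build_prompt` helper and all field names are hypothetical, not Databricks' actual implementation.

```python
# Sketch: combining system prompt, local context, and retrieved (remote) context
# into chat messages. Every field name and helper here is an assumption for illustration.

def build_prompt(user_input, local_ctx, remote_ctx):
    """Assemble messages from the pieces described in the talk."""
    system_prompt = (
        "You are the Databricks Assistant. Answer using the provided notebook "
        "context and documentation snippets."
    )
    context_block = "\n".join([
        f"Notebook language: {local_ctx['language']}",
        f"Focused cell:\n{local_ctx['focused_cell']}",
        f"Previous cell:\n{local_ctx['prev_cell']}",
        f"Error message:\n{local_ctx.get('error', 'none')}",
        f"In-memory variables: {', '.join(local_ctx['variables'])}",
        # Remote context: table schemas the user has access to and relevant docs.
        # Schemas only -- never row-level results.
        f"Accessible tables:\n{remote_ctx['table_schemas']}",
        f"Retrieved docs:\n{remote_ctx['doc_snippets']}",
    ])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{context_block}\n\nQuestion: {user_input}"},
    ]
```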
Neha Sharma [00:06:50]: These are complementary to each other and give us a holistic picture of whether the assistant is helpful or useful. I'm going to walk through each of them now. The helpfulness benchmark is a quantitative measure of how helpful our assistant responses are. It's an evaluation set that we've curated, with questions and context, and we curate it from internal usage: our Databricks usage logs, meaning internal employee logs, and our field workspaces. Field workspaces are essentially our customer success workspaces, and they're a really good proxy for how customers would actually use Databricks.
Neha Sharma [00:07:33]: We use those logs to create question-and-context pairs, and we also create a response guideline for each. This is a human-expert-curated guideline that says, this is what good looks like, and this is what you should respond with. And by the way, the human experts are us, the engineers and internal employees. We package all of this up and send it to an LLM judge, and the judge then spits out a quantitative percentage of how helpful the response was. So this is what it looks like: there's a question and context, which gets sent to the assistant and generates an assistant response. We then send the same question and context, along with the response and the response guideline, to our LLM judge, and we essentially ask it: given this question, context, and response, was the response helpful according to the response guideline? Then we get a final evaluation. There's one big challenge with this approach, and that's that it does not scale. First, judging correctness requires domain expertise and domain knowledge, even for us. Understanding what kind of data pipelines a user is trying to set up, or what they're trying to explore in their data, I often just don't know.
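Here is a hedged sketch of what a helpfulness-benchmark loop like this could look like; the `assistant` and `judge` callables, the judge template, and the evaluation-set fields are assumptions, not the internal Databricks pipeline.

```python
# Sketch: score an evaluation set with an LLM judge using human-written response guidelines.

JUDGE_TEMPLATE = """Question: {question}
Context: {context}
Assistant response: {response}
Response guideline (what a good answer looks like): {guideline}

Given the question and context, was the response helpful according to the
guideline? Answer strictly with HELPFUL or NOT_HELPFUL."""

def run_helpfulness_benchmark(eval_set, assistant, judge):
    """eval_set: list of dicts with question, context, and guideline (assumed format)."""
    helpful = 0
    for example in eval_set:
        response = assistant(question=example["question"], context=example["context"])
        verdict = judge(JUDGE_TEMPLATE.format(
            question=example["question"],
            context=example["context"],
            response=response,
            guideline=example["guideline"],
        ))
        helpful += verdict.strip().upper().startswith("HELPFUL")
    return helpful / len(eval_set)  # quantitative helpfulness percentage
```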
Neha Sharma [00:08:58]: So this is a very tough problem and does not scale, even for people who are technical. Second, it's hard to automate this type of workflow, because an unsupervised LLM judge doesn't work, at least not yet. For example, we found that GPT really tends to favor GPT answers: in our test cases, a GPT-4 judge found that GPT-3.5 got the answers 96% right and that all of GPT-4's answers were correct, which did not reflect reality. So to augment that, we also employ a strategy we call side-by-side. This is an online eval strategy where we show users two options for a response and let them pick which one they prefer. Here we're really just leveraging the millions of assistant interactions that are already happening to give us a direct signal of what's useful. Let me show you an example of what this looks like.
Neha Sharma [00:09:56]: Here I'm asking, what is the most danceable album by Taylor Swift? I get two responses, and in this particular case I picked the second one because it was faster. That's an example of how we get a direct signal from users. This scales really well because we're leveraging our users and that kind of scale. We get a direct signal: users directly telling us what they prefer. We're also able to test attributes like latency and verbosity, which are not always clear in an offline benchmark. And the other aspect we can scale here is the actual experiments that we can run.
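A minimal sketch of such a side-by-side comparison, assuming two variant callables (different models or prompt configurations), a hypothetical UI hook for collecting the user's choice, and a hypothetical logging sink.

```python
# Sketch: serve two candidate responses, record the user's preference and each latency.
import time
import random

def side_by_side(question, context, variant_a, variant_b, log):
    """variant_a / variant_b are callables wrapping two different models or prompt configs."""
    variants = [("A", variant_a), ("B", variant_b)]
    random.shuffle(variants)  # randomize display order to avoid position bias
    candidates = []
    for name, generate in variants:
        start = time.time()
        response = generate(question=question, context=context)
        candidates.append({"variant": name, "response": response,
                           "latency_s": round(time.time() - start, 2)})
    # Show both responses and record which one the user prefers.
    chosen = present_and_get_choice(candidates)  # hypothetical UI hook, not a real API
    log.record({"question": question, "chosen": chosen, "candidates": candidates})
```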
Neha Sharma [00:10:41]: We're constantly making a lot of tweaks to the model itself. We use several different models, some that we've built in house, some open-source models that we fine-tune, so just by swapping those, we're able to use this kind of experimentation to test them out. We also constantly make prompt adjustments, adding context, subtracting context, and adjusting our RAG as well. So this has been a really helpful framework for testing all of these different config changes. And third, we track accept rates, and there are roughly three categories. The first one is thumbs up and thumbs down.
Neha Sharma [00:11:16]: On any assistant response, you can give a thumbs up or thumbs down, which we track. Also, within a cell, when you invoke the assistant and ask questions, it will actually generate a diff, which you can then accept or reject, and that lets us track whether this is something the user finds useful and whether they're actually adopting it. And finally, we also have a metric where we track code generated by AI. This is perhaps not a direct measure of usefulness, but it is a really good measure of engagement. It covers both the autocomplete cases and general assistant cases: how much of your workflow has actually been generated by AI. I'll walk through a few examples of some quality improvements we've made and how eval has helped us make and verify them. I think I mentioned that we actually employ a lot of large language models, some of them our own.
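As a back-of-the-envelope illustration, here is how these three signals could be computed from an interaction log; the event schema and field names are assumptions, not the actual telemetry.

```python
# Sketch: thumbs-up rate, diff-accept rate, and fraction of code generated by AI,
# from a hypothetical interaction log.

def engagement_metrics(events):
    """events: list of dicts, each with a "type" field (assumed schema)."""
    feedback = [e for e in events if e["type"] == "feedback"]      # thumbs up / thumbs down
    diffs = [e for e in events if e["type"] == "diff"]             # accept/reject of a suggested diff
    code_lines = [e for e in events if e["type"] == "code_line"]   # authored lines, tagged by origin

    return {
        "thumbs_up_rate": sum(e["value"] == "up" for e in feedback) / max(len(feedback), 1),
        "diff_accept_rate": sum(e["accepted"] for e in diffs) / max(len(diffs), 1),
        "ai_generated_code_fraction":
            sum(e["origin"] == "assistant" for e in code_lines) / max(len(code_lines), 1),
    }
```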
Neha Sharma [00:12:16]: The helpfulness benchmark I mentioned is a really good starting point to help us gauge where on the spectrum we think a model will perform, so it's a really good starting point when we do model comparisons. For example, we recently ran a test between GPT-3.5 and GPT-4 Turbo, and what we found was that GPT-4's answers were actually higher quality; however, GPT-3.5 was largely preferred because of latency. That's something we would not have been able to catch if we were just using one evaluation strategy. Similarly, we make a lot of prompt tweaks, so we play around with verbosity: do people just want to see code, or do they want an explanation? The way we're able to balance reducing verbosity without sacrificing quality is, again, by using this holistic view of both online and offline evaluations.
Neha Sharma [00:13:15]: Same thing with adding more context. Generally, adding more context will make things more relevant, for example in-memory variables, and we're able to capture that in our metrics. One thing we found about conditional prompts is that there's a certain class of interactions that can be improved with additional info; however, if you apply it generally, adding too much context can sometimes confuse the assistant, if you will, and that degrades the quality of the response itself. So being able to see this kind of balance in our evals has also been really critical. I also mentioned assistant autocomplete, where we measure usage and engagement as part of our metrics. Now I want to talk about retrieval for a second, because this has been a really big piece of quality improvements for us.
Neha Sharma [00:14:02]: We've done the following things. First, we're expanding our knowledge source. There's been some recent news about OpenAI and Google partnering with Stack Overflow to expand their knowledge sources as well. We're doing something similar with the Databricks Community, which is a Q&A forum, so we're including that, and being able to see what kind of lift it provides is something we've verified with our eval strategies. We've also tuned our parsing logic to retain useful information from docs, particularly code blocks, which are especially useful.
Neha Sharma [00:14:41]: We're making refinements to our chunking strategy, and we're testing out different embedding models to see which ones work best. We're also tweaking prompts to ignore docs that may not be relevant, because again, if you include too much irrelevant context, your quality suffers. All of this, the helpfulness benchmark and side-by-side, has been crucial in making sure we're trending the right way and iterating in the right way. I also wanted to show an anecdotal example of the improvements we've made. Here, there's a question: how do I create a markdown cell in a notebook? On the left is the response before all of our retrieval improvements.
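A rough sketch of one such refinement: a doc chunker that avoids splitting fenced code blocks, since code blocks from docs are particularly useful context. The chunk size and splitting rules are illustrative assumptions, not the actual Databricks chunking logic.

```python
# Sketch: split documentation into chunks for embedding while keeping fenced
# code blocks intact (a chunk may run over the limit rather than cut a block in half).
import re

def chunk_doc(text, max_chars=1500):
    # Capture fenced code blocks as whole parts so they are never split.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    chunks, current = [], ""
    for part in parts:
        if len(current) + len(part) <= max_chars or part.startswith("```"):
            current += part
        else:
            if current.strip():
                chunks.append(current.strip())
            current = part
    if current.strip():
        chunks.append(current.strip())
    return chunks
```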
Neha Sharma [00:15:19]: And this is actually a hallucination. It's not true; there's no Ctrl+M in notebooks. But on the right, it captures it perfectly. We actually have a new cell UI that we're launching, so some users see the original UI and some see the new one, and the response captures that: if you're on the original one, this is how you do it, and on the new one, this is how you do it.
Neha Sharma [00:15:40]: So it's very contextual and very, very useful. Anecdotally, we're seeing the improvements, and we're also seeing them in our eval metrics. Okay, I need to hurry up, so let me skip the challenges slide and talk about what we're thinking about doing beyond RAG. I talked about RAG, but one big area of quality improvements for us is fine-tuning. We roughly think about this as two types of fine-tuning.
Neha Sharma [00:16:04]: One is injecting more Databricks knowledge into the model, and we leverage our internal Mosaic research team's APIs for this. But there's also general instruction fine-tuning, and here's where we want to avoid overfitting: we don't want the model to spit out Databricks syntax just for the sake of spitting out Databricks syntax. So we employ these two together. One application of that has been the assistant autocomplete I talked about. That experience has actually been developed entirely on Databricks, all the way from the model to fine-tuning to model serving. We use a smaller model for this.
Neha Sharma [00:16:43]: We fine-tune it ourselves, and the model is served through Databricks model serving APIs. I just want to mention that low latency was a key requirement for us here. You can see here, as you're typing, it really understands your schema. It's giving you really specific information about what you need: it knows about your tables, it knows about your syntax, it knows what you want. This is the result of fine-tuning. And lastly, this is the direction we're going. It's a bit of a bold vision, but imagine a Databricks that's catered to you, that's tailored to your experience, per customer.
Neha Sharma [00:17:20]: So it knows your business lingo, it knows all of your abbreviations, and it knows your data itself. And everything is built on the Databricks platform without having to leave the platform, so privacy is taken care of and security is taken care of. That's what we're working on next. All right, thank you. [Audience] Very early on, a question stuck in my mind: when you're collecting the context, like the code or an error, is there a limit? [Neha] Yeah, there is a limit. We do a lot of preprocessing before we send all the context to our services to make the call. The limit really depends on the model.
Neha Sharma [00:18:05]: So there's no single number; it depends on which model we're routing to, and we do a lot of processing on that. [Audience] But I mean the limit on lines of code, like how many lines? [Neha] Oh, sorry, I thought you were talking about the context limit. [Audience] The limit on lines of code is one thing, also important. But I mean literally reading the error, finding the error, helping me fix it.
Neha Sharma [00:18:23]: Yeah, we limit it. The way we do it is we have ranks of what's important: the currently focused cell, for example, is probably very important. Once we put the context together, we start the trimming process if we need to, and then we trim based on what's less relevant or less important. [Audience] For a hundred lines of code, or a thousand lines of code, or a million? [Neha] Oh, for that, we send the whole thing. There is a limit on the number of cells itself, but we send the whole context there.
Neha Sharma [00:18:54]: We send the whole cell there for sure.
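A simplified sketch of rank-based trimming along the lines described in this Q&A exchange: rank the context pieces by importance (the focused cell first) and drop the least important ones until the payload fits the target model's limit. The importance ranks and the token counter are illustrative assumptions.

```python
# Sketch: keep the most important context pieces within a model-dependent budget.

def trim_context(pieces, token_limit, count_tokens):
    """pieces: list of dicts like {"kind": "focused_cell", "text": "..."} (assumed format)."""
    importance = {"focused_cell": 0, "error_message": 1, "prev_cell": 2,
                  "next_cell": 3, "variables": 4, "retrieved_docs": 5}
    ranked = sorted(pieces, key=lambda p: importance.get(p["kind"], 99))
    kept, used = [], 0
    for piece in ranked:  # most important pieces are considered first
        cost = count_tokens(piece["text"])
        if used + cost <= token_limit:
            kept.append(piece)
            used += cost
    return kept
```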