Unleashing Code Completion with LLMs
Monmayuri is an advisor, data scientist, and researcher specializing in MLOps/DevOps at GitLab in Sydney. She builds creative products to solve challenges for companies in industries as diverse as financial services, healthcare, and human capital.
Along the way, Mon has built expertise in Natural Language Processing, scalable feature engineering, MLOps transformation and digitization, and the humanization of technology. With a background in applied mathematics in biomedical engineering, she likes to describe the essence of AI as “low-cost prediction” and MLOps as “low-cost transaction,” and believes the world needs the collaboration of poets, historians, artists, psychoanalysts, scientists, and engineers to unlock the potential of these emerging technologies, which aim to make machines think like humans and act as efficient, automated fortune tellers.
At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
This is a talk on the learnings my team has taken from building Code Suggestions, in reference to the model, ML infrastructure, evaluation, compute, and cost.
Rolling on to our final talk. And the best part about this talk, well, maybe not the best part, but one nice little trick about this talk is that our guest, our presenter today, is actually in tomorrow. What's up, Mon? How are you doing? Ooh, hi. Sorry, that's my dog in the back. Your dog wants to go out?
Yes, it's tomorrow for me. So it's the 16th there, then? It is. It's late for you too. Yeah. Wait, what time is it? You're in Australia, we're just going to say that right now. You're going to take us home, starting your day while I'm finishing mine. There are a lot of people in California who are just finishing up lunch, I believe.
But what time is it where you are? It is 8:10 AM. So it's a decent time. Yeah, that's not too bad. Eight, that's respectable. I'm glad I didn't give you a time that was 3:00 AM or anything like that, so we did all right. Anyway, you've come to talk to us about some really awesome stuff.
I just saw your screen, but now I don't see it anymore. What happened? You might have to share it again. Let me try. That's the fun part. Yeah. Ooh, where'd it go? If it makes you feel better, I'll tell you a story: I have these big lights right next to me, and I didn't realize that,
for the last hour or so (I see your screen now), I've had my window open. So now I've got these gigantic bugs just flying around. In case you see me swatting while I'm talking to you, that's because I've got these gigantic bugs flying in my ears. Anyway, I'm going to leave it to you.
You're going to take us home. Bring it on strong, and I'll be back in like 10 minutes. Thanks. Thanks, and nice to see you all again. So yeah, thanks for having me. All right, I will be going through a little bit of our journey of building code completion tools within GitLab. I'm an engineering manager who looks after a group called ModelOps, and we started this journey of building code completions through a tool called Code Suggestions,
pretty much over the last six or seven months. We've learned a lot through it, and I just want to share all the learnings together. So, before we go into exactly what completion tools do or don't do, or the architecture of it: code completion tools, what they are, they write, complete, and recommend the code you want.
It is fundamentally very, very close to the heart of any developer: using AI as assistance in decision making for developers, to help provide that judgment. There's a lot of talk about how you use it, what you consider usable, and what you demand of it. But in general, when you are deciding on the outcomes of these LLMs for code completion, there are basically three areas.
Will the code that an AI completion assistant or recommendation agent produces be honest, which means consistent with facts? This goes beyond just evaluating whether the code is correct: is it honest to the developer who's developing? Is it harmless? And is it actually helpful, accomplishing exactly the goal of the coder, so the coder can code faster and, instead of writing 500 lines, can actually write a thousand lines of code for a usable feature?
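As a rough illustration of those three criteria, a per-suggestion evaluation record might look something like this; the field names and the 0-to-1 scale are illustrative assumptions, not GitLab's actual scoring:

from dataclasses import dataclass

@dataclass
class SuggestionEvaluation:
    """Scores a single code suggestion on the three outcome areas (illustrative only)."""
    honest: float    # consistent with facts and the codebase, no hallucinated APIs
    harmless: float  # no insecure patterns, leaked secrets, or license problems
    helpful: float   # accomplishes the coder's goal, e.g. the suggestion was accepted

    def overall(self) -> float:
        # Simple average; real weighting would come from your own benchmarks.
        return (self.honest + self.harmless + self.helpful) / 3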
So that's sort of what a code completion tool is, and how we decide on LLMs based on the outcomes of those three metrics. Now I want to focus on the last part: does it actually help get to the goal of the coder? To do that, I want to talk about a framework for how you can evaluate it,
a building-and-decision-time framework, to make these LLMs useful in production. And keep in mind that for a lot of the LLMs, other than something like ChatGPT, whether third party or open source, we don't necessarily constantly feed them the training data to judge the quality of the good, the bad, and the ugly.
So how do you then take these third-party and open-source LLMs and make them write code that is much better than what an average coder can do? That's the goal we want to achieve together. Okay, so some fundamentals of choosing the right raw LLMs, as we probably call them.
The objective, obviously, is good completions: code generation and recommendation of code for developers. Everybody by now probably has a better sense of this. We want to look into the parameters and the training data. This is really, really important, just to get a sense, on the raw side, of how much data, how much of the internet, has gone into building this model and getting some smarts into this prediction machine.
When you go for a completion tool, we want to look into what kind of evaluation benchmarks already exist for the LLMs you've chosen. If it is open source, we want to understand the model weights, how flexible they are, and what kind of tuning frameworks we can apply, and the same even for third party.
Cost is obviously another thing, and latency. I see a lot of posts saying we also want to assess quality, but I do think this is a really hard one: how can you actually assess quality without evaluating at scale, and how do you then use that prompt engineering effort at scale and do something almost like continuous training, continuous prompt engineering?
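As a hedged illustration of weighing those factors (every model name and number below is a placeholder, not a real benchmark result), the selection criteria can be written down in code so the decision is explicit and repeatable:

# Hypothetical candidates; parameters, scores, cost, and latency are made-up placeholders.
candidates = [
    {"name": "open-source-code-llm", "params_b": 15, "benchmark_pass_at_1": 0.30,
     "weights_available": True, "cost_per_1k_tokens": 0.0004, "p50_latency_ms": 180},
    {"name": "third-party-code-api", "params_b": None, "benchmark_pass_at_1": 0.45,
     "weights_available": False, "cost_per_1k_tokens": 0.0020, "p50_latency_ms": 350},
]

def shortlist(models, max_latency_ms=300, max_cost_per_1k=0.001):
    """Keep models within hard latency/cost budgets, ranked by benchmark score."""
    ok = [m for m in models
          if m["p50_latency_ms"] <= max_latency_ms
          and m["cost_per_1k_tokens"] <= max_cost_per_1k]
    return sorted(ok, key=lambda m: m["benchmark_pass_at_1"], reverse=True)

print([m["name"] for m in shortlist(candidates)])  # only the open-source model passes here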
To get to that, I first want to look at what an LLM architecture can look like from that perspective. In this world of LLMs you're basically not choosing just one, you're choosing many, based on the factors we discussed before. Let's say you have an open-source, pre-trained LLM that you can also feed further data to tune,
and then you also have certain third-party LLMs as well. What you're then building is a full architecture. Starting from the left, you're taking additional data to enhance your pre-trained LLMs: downloading full raw datasets, from Hugging Face or wherever, then publishing, pre-processing, and tokenizing them, all the way into an environment where you're training and tuning, with those layers of checkpoints.
Then it goes further. There are two engines running in parallel. One is the prompt engine, where, instantly, with every piece of code a person has written, you go through the same layer of what we call a prompt library or prompt DB: looking into how you deconstruct that code into tokens and understand what was finally committed,
and then getting a better sense of a scoring mechanism, a ranking: is this a good or an average developer? That feeds into some sort of gateway, with a prompt engine, pre-processing, and a validator, calling the models based on the user's input, whether through a third-party API or through your own pre-trained model.
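A very rough sketch of that gateway layer follows, with the caveat that it is a simplification and not GitLab's actual service; route_completion, the validator rule, and the model callables are all assumptions:

from dataclasses import dataclass
from typing import Callable

@dataclass
class CompletionRequest:
    prefix: str    # code before the cursor
    language: str
    user_id: str

def validate(req: CompletionRequest) -> bool:
    # Placeholder validator: reject empty or oversized prompts before spending a model call.
    return 0 < len(req.prefix) <= 8_000

def route_completion(req: CompletionRequest,
                     own_model: Callable[[str], str],
                     third_party_model: Callable[[str], str]) -> str:
    """Pre-process, validate, then choose which model family serves the request."""
    if not validate(req):
        return ""
    prompt = f"# language: {req.language}\n{req.prefix}"
    # Illustrative routing rule: keep well-covered languages on the tuned in-house model,
    # fall back to the third-party API for everything else.
    if req.language in {"python", "ruby", "go"}:
        return own_model(prompt)
    return third_party_model(prompt)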
Now, the orange boxes are something we will come back to in the next few slides: how you actually make continuous evaluation and inference part of this architecture. So, focusing on the evaluation: how do you do it at scale, beyond the human-eval benchmarks?
Starting with that, let's say you've also done a few more few-shot examples and a little bit of user-based testing. Then you want to understand how to version-control it, scale it, and do it continuously: have CI/CD, continuous training, and continuous feedback to consistently keep dialing up the accuracy for your use case.
Now, to do that, those orange boxes on the left are the mechanism to continuously loop it, which includes the prompt input, tokenizing, understanding the similarity to what a developer would do, storing it, evaluating it, and then using that as the smarts behind the output. For example, let's say that on a historic database we've dissected a large body of code to understand a token of completion,
to understand how you build an XGBoost model. Now, we've seen historically in a company's code base that many developers, dev one, dev two, dev three, and so on, have different ways of writing the same code. They have different parameters, different evaluations, everything.
Then what we've done is run these algorithms based on the code, the commits, and similarity, and agreed on what an actual, agreeable developer output is. That is then evaluated against what the LLMs, open source, third party, or tuned, are outputting, through a technique chosen based on the objective; we've used cosine similarity.
You can use Spearman or whatever fits, based on the prompt as well as the code, through this engine. So the prompt is here, say an XGBoost model, and we know what an actual, agreeable developer output should look like, so we understand how to map that to a quality. Then we know how each of these models is outputting against it, match that, and send the right output to the right user.
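As a small, hedged illustration of that comparison step (the tokenizer and scoring here are simplified assumptions; the real engine is certainly more involved), an LLM suggestion can be scored against the agreed developer output with cosine similarity over token counts:

import math
import re
from collections import Counter

def tokenize(code: str):
    # Naive word-level tokenizer, good enough for a sketch.
    return re.findall(r"\w+", code)

def cosine_similarity(a_tokens, b_tokens) -> float:
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Agreed developer output" distilled from historic commits, vs. an LLM suggestion.
reference  = "model = xgb.XGBClassifier(max_depth=6, n_estimators=200)"
suggestion = "model = xgb.XGBClassifier(n_estimators=200, max_depth=6)"

score = cosine_similarity(tokenize(reference), tokenize(suggestion))
print(round(score, 3))  # ~1.0 here, since only the argument order differs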
Now, this probably makes sense for just one line, and a lot of the time I see people keeping manual spreadsheets to look at prompts and how to evaluate them. But imagine having this at scale, as engines. That's the beauty of adding this sort of microservice engine to your full architecture: you can run it in an infinite loop, on an instant basis, to keep dialing up, say, your code completion's usefulness
and harmlessness, all the benchmarks, and instantaneously keep helping with the evaluation part of it. This is something that helps you do it at scale, a thousand times over, in that continuous fashion.
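A minimal sketch of that loop running as an always-on engine, assuming logged completion events can be pulled in batches; fetch_recent_events, reference_lookup, and store_metrics are hypothetical hooks, not named services:

import time

def evaluate_batch(events, reference_lookup, scorer):
    """Score each logged completion against its agreed developer reference."""
    scores = []
    for event in events:
        reference = reference_lookup(event["prompt"])
        if reference is None:
            continue  # no historic consensus yet for this kind of prompt
        scores.append(scorer(reference, event["suggestion"]))
    return sum(scores) / len(scores) if scores else None

def run_forever(fetch_recent_events, reference_lookup, scorer, store_metrics, interval_s=60):
    # Infinite evaluation loop: pull recent suggestions, score them, persist the metric.
    while True:
        batch = fetch_recent_events()
        average = evaluate_batch(batch, reference_lookup, scorer)
        if average is not None:
            store_metrics({"avg_similarity": average, "n_events": len(batch)})
        time.sleep(interval_s)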
Now, the serving side is basically what we'd call the reinforcement-learning part of it. Imagine that from that evaluation we then know: yes, we know what the actual developer output is, and we know what the LLMs are giving. Now we want to know whether there are prompt templates or prompt tunings we can apply on an instant basis, with a prompt validator and rate limiter on the output. If that can't be done, we string it into fine-tuning of the LLMs.
Again, it's a continuous loop, version-based and controlled. Each of these can be a microservice connected through CI, and having that prompt validator and rate limiter becomes key to really understanding the user input as well as the output, to put it all into that full continuous loop. You may start from the point where,
at a certain level, a combination of raw LLMs gives you a 10% acceptance rate from coders. Then you take this evaluation benchmark and this continuous prompting, through reinforcement learning, in a loop, and just like an Amazon recommendation engine you're constantly dialing up the accuracy and usability to get to that output.
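A hedged sketch of that serving-side decision follows: a rate-limited path that tries instant prompt templates first and queues the case for fine-tuning when templating alone does not score well enough. The template format, the scoring hook, and fine_tune_queue are assumptions, not the production design:

import time
from collections import deque

class RateLimiter:
    """Allows at most max_calls per rolling window, per user."""
    def __init__(self, max_calls=30, window_s=60):
        self.max_calls, self.window_s = max_calls, window_s
        self.calls = {}

    def allow(self, user_id):
        now = time.time()
        recent = self.calls.setdefault(user_id, deque())
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_calls:
            return False
        recent.append(now)
        return True

def serve(request, templates, model, scorer, limiter, fine_tune_queue, min_score=0.6):
    """Try prompt templates on the fly; fall back to the fine-tuning queue if none score well."""
    if not limiter.allow(request["user_id"]):
        return None
    best_score, best_suggestion = -1.0, None
    for template in templates:
        prompt = template.format(code=request["prefix"])  # e.g. "Complete this code:\n{code}"
        suggestion = model(prompt)
        score = scorer(request, suggestion)
        if score > best_score:
            best_score, best_suggestion = score, suggestion
    if best_score < min_score:
        fine_tune_queue.append(request)  # prompting alone was not enough for this case
    return best_suggestion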
So then let's go back again. Now, putting that inference and prompting in, this whole thing becomes a loop of training data and completion data, continuously. And with code completion you will always have code; there will always be a coder writing code, so you can continuously keep adding to this data for evaluation, for prompt inference, for ranking, for understanding
those three decisions I framed, and how that impacts every coder in getting better at using LLMs for code completion. That's all I actually have today. I do want to end by saying, and it's one of my lines, that in this day and age of LLMs, data is still the oil. It is very difficult to imagine that, with sufficient data, there will remain things that only humans can do.
So in the journey of LLMs, this is still the key. That has not changed. Thank you so much for having me. I haven't put my details here, but I can put them in the chat. If anyone has questions, feel free to reach out to me in the MLOps Community Slack channel as well as on LinkedIn. There we go.
All right, Mon, thank you so much. That is so cool and so valuable. I loved hearing this journey, and I appreciate you sharing it with us and being very transparent. Now you get to go off and enjoy your day. Hopefully it's sunny where you are. It's winter there, right? It is.
So maybe you're getting a nice winter day, and we are going to close the party for now, but we'll be back in action tomorrow, same place, same time, with a whole other lineup. In case anyone would like to know, I will be singing more songs. Mon, I know you probably want to know: I will have more songs.
You missed it because you were probably sound asleep, but I created a few improvised songs. Hopefully I'll sing us all a lullaby, put us to sleep, and we'll get out of here. Actually, no lullaby. I'm ready to go to sleep right now. To be honest, my brain has stopped functioning. I'm calling it quits.
I'll see everybody tomorrow. And Mon, thanks again. Thank you.