Linguistically-informed LLMs Perform Better
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
It’s silly to think of training and using large LANGUAGE models without any input from the study of language itself. Linguistics is not the only field of knowledge that can improve LLMs, since they sit at the intersection of several fields, but it can help us not only improve current model performance but also see clearly where future improvements will come from.
I'm going to just bring on our next performer. Where are you at, Chris? Where are you at? Ah, there he is, as the others are. Yo, on their way off. Here I am off stage. There he is. Dude, I gotta. See what you do. What is this beard routine you have? Give me some of that magic. I need some of this beard, man. Wait, can we bring this up?
Full screen real fast? Dude, I'm in Murray, Utah. Salt Lake City, Utah. Salt Lake City. That is awesome. That's what it is. You just go there and the beard grows. That's how it works. That's right, dude. Aw, epic, man. Well, I'm gonna let you get cracking with your, uh, with your presentation.
Thanks. I know that you've got all kinds of good stuff for us, and, uh, I'll see you in a little bit. Maybe, maybe I might not jump back on here, but, um, thank you for doing this. Yeah, bro. Whatever works for me. Maybe I'll see you in 20 minutes. All right. See ya. Cool. So, uh, awesome. I, I'm trying to figure this out.
Uh, welcome. So I'm Chris Brousseau. I'm a lead data scientist at MasterCard. We're gonna be talking about linguistically informed LLMs. And, um, I haven't even said this once yet: this is meant to be a help to teams. It's not meant to make anyone feel bad or anything like that.
Um, all of this insight is coming out of a book that I'm co-authoring right now with Matthew Sharp. He's the clean-code-for-data-scientists guy. He was recently on the MLOps Community podcast, and if you haven't listened to that episode, I'd really recommend it. And yeah, our book is about to hit the Manning Early Access Program, so look out for it.
If this stuff appeals to you, we're gonna be deep diving into all of it in the book and going even further into LLMOps and MLOps and how they mesh. So let's get right into it. Um, Demetrios mentioned the beard, so I wanna address that real quick. My beard has been a lot of different lengths, and it has been in a lot of different states of neatness. Basically, everybody's got their own personal preference for their beard.
I'm talking about my beard 'cause my hairline's receding a little bit, but if you want to think about it as your hair, you can as well. The idea, though, is that whether it's long or short, there's always a growth period between where you are and where you want to be. And if only somebody existed who has experience with hair, with lots of different types of hair, who would be able to help you.
Not only to set your goal, to figure out what looks good on you with your type of hair and your color of hair and all of that, but to help you get there in as painless a way as possible. Because as you can see in that top right photo, I hate that photo of me. I hate it so much, but I was just in a growth stage and my beard wasn't neat.
And now we're gonna get into some fun stuff. Let's keep going. So this is a quote that many of you are probably familiar with. This is from Fred Jelinek at IBM in 1988: "Every time I fire a linguist, the performance of our speech recognition system goes up." I think this is true, but considering they were working on Shoebox at the time, well, I have never used Shoebox and I don't know anyone else who used Shoebox for ASR on digits.
Um, I don't know, I think they probably ran outta linguists to fire. The idea, though, is that just like with a beard, just like with a hairstyle, you can do whatever you'd like. You can figure this out on your own. The barber, though, is just gonna help you get there quicker and more painlessly.
Um, you can even trim your beard on your own. You can fire your barber at any point. They're just there to help, you know. Uh, one of the mistakes that I've seen when people are growing beards, in this case when they're training LLMs, is they don't have a goal in mind. And so they end up solving for metrics, right?
They're solving for precision, they're solving for recall, they're solving for an F1 score. The problem that I see with that is you aren't hitting your goals when you're measuring for KPIs. If your KPI is your goal, you're either, number one, late stage, which is fine, that's really good.
Or number two, you don't know where you're going. You're just growing out your beard and you're probably going to have to trim it back at some point. And with LLMs as opposed to hairstyles, this is a lot harder to do. So, uh, let's talk about LLMs real quick then. This is the focus of the conference, right?
Um, what are LLMs solving for? I just wanna hear that in the chat. So give it a guess. What do you think LLMs are solving for?
Looks like there's a bit of a delay, so I'm just gonna keep going. Um, LLMs are solving for language. They're large language models. They're not large mathematical models, they're not large statistical models. I think they, they are, they're fundamentally mathematical, they're fundamentally statistical, but they're solving for language, you know?
So let's talk about the features that we solve for to get to our goals, instead of getting hung up on metrics, right? I see answers in the chat: "exactly the next word," "chicken or egg." First, um, features of language. We have, number one, syntax. We have, number two, morphology. Then semantics, pragmatics, phonetics. And just in case you're unfamiliar with these, syntax is gonna be your grammar.
That's your structure of language. Morphology is your structure of words. You know, how do you get from nothing to a word? It's your morphemes, your smallest units of meaning. Uh, semantics is your literal definitions of words. For example, a quick one: if I say I'm married to my ex-wife, right?
"Married" and "ex-wife," their literal definitions contradict each other, but there could be pragmatic explanations as to how that could work. I'm not, really, but there could be pragmatic explanations for that. Pragmatics is the meaning surrounding the words, so it's not the words themselves. And then phonetics is the actual production of speech. It's your places of articulation, it's assigning meaning to sounds, it's the relationship between orthography and your actual sounds. Um, yeah, these are really complex problems. I don't wanna pretend like I'm giving you something novel here that nobody's ever thought of before.
And in fact, as soon as I go to the next slide, you're going to see there's been a lot of thought put into this. And so we're gonna talk about some problems, and we're especially gonna talk about some solutions, right? So what have we already done to solve for these? You know, for syntax, I'm of the opinion that LLMs solve it.
You know, when you're looking at transformational generative grammar, Chomsky, LLMs do it. They allow you to generate infinitely, and they allow you to generate infinite combinations. It's very, very cool. I think we might be done with syntax. We're not done with the rest of them, though. For morphology,
we have tokenization. If you're looking at this from a computer vision perspective or any other classical ML perspective, your morphology is vectorization. It's making sure that the model has something that it can view, as opposed to trying to parse your words. And then on the other side of that, related, is embeddings, where you take your vectors and you turn them into something that has inherent meaning to the model.
Um, I think that for morphology and embeddings, we are like maybe 75, 80% of the way solved there, but we're gonna talk about 'em a little bit more. The bottom two are the ones that I think we've gotten the least solved, but we're working on it. And a lot of people are seeing that these two sections are the places where we can really complete our star.
You know, with pragmatics being, uh, giving the model context, whether that's in training or in inference, we've found that it works really, really well. And I'm actually gonna demo that for you a little bit, whether that's through chain of thought or through instruct fine-tuning, or, you know, just giving the model rules that it can work with, so that it has a way of interpreting the real world and all the stuff that's around the words, around the semantics and the composition.
Someone in the chat says morphemes can be part of words. Morphemes are exactly part of words. Um, so yeah, I'm sorry, there's a huge delay here for me reading the chat. But anyway, the last bit: phonetics. I love phonetics so much. This is one of my favorite parts, and so I'm saving that one for last.
But let's dive right into it. We're gonna talk about some problems and some solutions. This is the hardest one, the dictionary problem. If you've ever seen or heard of a dictionary before, you know a lot of people end up using dictionaries as some sort of authority for semantics.
Uh, an authority on what is the literal definition of a word. I don't think we need to use them that way, because dictionaries don't intend themselves that way. They're more like snapshots in time of popular usage. And this is backed up by the actual dictionaries themselves. You know, you have Dictionary.com, Merriam-Webster.com, and the Oxford English Dictionary.
All three of these dictionaries have weekly updates to their corpora, to their vocabularies, and I can think of something else that starts with an L and ends with an M that has to manage vocabularies over time. You know, they have weekly updates, they have monthly soft updates to their entire dictionary, and they have yearly hard updates to make sure that they are representing language as it currently stands.
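To make the snapshot idea concrete, here's a minimal sketch (all names hypothetical, not any dictionary's or tokenizer's real API) of a versioned vocabulary, where each update produces a new dated snapshot instead of overwriting some single authority. This is the same bookkeeping an LLM's vocabulary needs over time.

```python
from datetime import date

class VocabularySnapshot:
    """A vocabulary as a dated snapshot of usage, not an authority."""

    def __init__(self, snapshot_date: date, entries: dict[str, str]):
        self.snapshot_date = snapshot_date
        self.entries = entries

    def updated(self, snapshot_date: date, added=None, retired=None):
        """Return a new snapshot; earlier snapshots stay queryable."""
        entries = dict(self.entries)
        entries.update(added or {})
        for word in retired or ():
            entries.pop(word, None)
        return VocabularySnapshot(snapshot_date, entries)

v2013 = VocabularySnapshot(date(2013, 1, 1), {"throw": "to propel"})
v2014 = v2013.updated(date(2014, 6, 1),
                      added={"yeet": "to throw forcefully"})
print("yeet" in v2013.entries, "yeet" in v2014.entries)  # False True
```

The design choice mirrors what the dictionaries do: soft updates produce new snapshots on a schedule, and old snapshots remain available as records of how the language looked at that time.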
So, um, beyond that, basically, I'm just gonna stop there with this problem. The idea here is, when you consider your LLM like a beard, you need to think about not just getting to the length that you want and the style that you want, but how do you stay there? How long can you stay there?
Is your LLM going to be just a period piece for 2023, or is it going to go further? Can you make it last until 2030? And how do you do that? Um, do you need it to last that long? Maybe your LLM is so specific, maybe it's a financial one, like the ones that we work with at MasterCard, where we're working with language that doesn't really change all that fast, and so we don't have to worry about that as much.
This is just a piece of consideration that can really level up your ability to create these with skill. The next problem, uh, this is the yeet problem, and this has to do with morphology. There are two things here, and both of them have to do with predictability. You know, when we use popular methodologies for tokenization, like byte-pair encoding, or like SentencePiece, or even the ChatGPT encoding, we have
some problems, right? Uh, there's a reason why Goat-7B is able to outperform GPT-4, which is supposedly 1.7 trillion parameters, on every bit of arithmetic, whether that's addition or division, multiplication, subtraction. The reason being that GPT-4 used statistical tokenization methodologies that have to do with frequency.
We don't determine morphemes based on frequency. And so they ran into problems with numbers, where it's grouping commonly grouped digits together. And so when you ask it to do math problems, sometimes it can't adequately see the problem the way that you need it to. Um, this happened in 2014 with the word "yeet," which seemingly came out of nowhere, popularized on Vine.
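A toy sketch of that number failure mode, assuming a made-up frequency-derived vocabulary (this is an illustration, not GPT-4's actual encoding): because merges come from corpus frequency rather than from morphemes, the same digit ends up inside different tokens depending on its neighbors.

```python
# Hypothetical merged pieces a frequency-based learner might pick up.
VOCAB = {"12", "23", "1", "2", "3", "4", "+", " "}

def greedy_tokenize(text, vocab, max_len=2):
    """Longest-match-first tokenization over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            tokens.append(text[i])  # fall back to the raw character
            i += 1
    return tokens

print(greedy_tokenize("123 + 4", VOCAB))  # ['12', '3', ' ', '+', ' ', '4']
print(greedy_tokenize("234", VOCAB))      # ['23', '4']
```

The digit 3 lands inside "12|3" in one string and "23|4" in the other, so arithmetic over the token stream never sees stable place values.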
The thing is, though, English has sets of sounds and letters that are allowed to be together, and it has other sets that are not allowed to be together. And those allowances don't change as fast as the words themselves. So "yeet" is a predictable word for English. In fact, it's been in English before. If you look at older English, when you yeeted someone, it wasn't throwing them or anything like that.
It was using the word "ye" for them, using that level of formality for them. So piggybacking off of work that other people have already done to show you what's predictable can really level up your ability to tokenize, and your ability to utilize morphology to make your models better. Uh, next portion.
This is the kimono problem. This is basically the same thing, just understanding, deep down, that when your model is splitting words, it's trying to find basic units of meaning. "Mono" is a unit of meaning, right? And so for "kimono," a model that is tokenizing English looks at "mono" as a unit of meaning, and "ki" is not a unit of meaning.
"Kimono" is a borrowed word in English. And so, um, there are a lot of places where we end up using tokenization and morphology and syntax as if our language exists completely in a vacuum. This is one of the reasons that multilingual models tend to outperform monolingual models on the exact same tasks.
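A minimal sketch of that split, with hypothetical vocabularies chosen for illustration: greedy longest-match segmentation finds "mono" inside "kimono" but has no unit for "ki," while a vocabulary that also ingested Japanese can keep the borrowed word whole.

```python
def split_longest_match(word, vocab):
    """Greedy longest-prefix segmentation over a subword vocabulary;
    characters not covered by the vocabulary fall through one at a time."""
    pieces, i = [], 0
    while i < len(word):
        for end in range(len(word), i, -1):
            if word[i:end] in vocab:
                pieces.append(word[i:end])
                i = end
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

english_vocab = {"mono", "gram", "graph"}          # illustrative only
multilingual_vocab = english_vocab | {"kimono"}

print(split_longest_match("kimono", english_vocab))       # ['k', 'i', 'mono']
print(split_longest_match("kimono", multilingual_vocab))  # ['kimono']
```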
You look at MMS versus ChatGPT, you look at mBART versus BART. It keeps happening over and over again. The more languages you're able to ingest, the better your tokenization is, which means the better your model can see what it is trying to work with. Let's go to the next one. This is the Socratic problem.
I'm kind of annoyed with the way that I named this one, but it's dealing with pragmatics. It has nothing to do with Socrates; it's just the entailment, the pragmatics. And in true MLOps fashion, we're gonna talk about speed here. Uh, this is for a company that basically wanted a system that queries ChatGPT,
and it queries it for 20 college-level biology questions. Um, when I did this with vanilla ChatGPT using the OpenAI framework, it got seven outta 20, which is pretty remarkable, but it's not where they wanted it to be. And it took about four minutes to do it. And admittedly, that's unfair, because averaging over several runs, it took about one minute once the API had warmed up and recognized my connection and everything.
I'm gonna show you, though, doing it locally with Falcon-7B-Instruct, how long it took on average using guidance, which provided pragmatic instruction for chain of thought, allowing it to prompt itself multiple times, allowing it to do system prompts, and basically coaxing the knowledge right out of the model. I don't even know whether or not Falcon trained on biology textbooks, but this is what ended up happening.
Oh, is it gonna play? Ah, shoot. It looks like it's not going to. So, um, I'll just tell you: it took about two seconds to go through all 20 questions, and it ended up getting 17 out of 20 instead of seven out of 20. So it went up by ten questions, a 50-percentage-point jump in accuracy, and it sped up almost a hundredfold, just from using some pragmatic instruction on the inference side.
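For flavor, here's a hedged sketch of what that inference-side pragmatic instruction can look like. The model call itself is omitted (any local model such as Falcon-7B-Instruct could sit behind it), and the function below is my own illustration rather than the guidance library's API; what matters is the structure: a system role, a chain-of-thought cue, and a constrained single-letter answer format.

```python
def build_cot_prompt(question: str, choices: list[str]) -> str:
    """Assemble a chain-of-thought, multiple-choice prompt with a
    system instruction and a constrained final-answer format."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        "System: You are a careful biology tutor. Reason step by step, "
        "then answer with a single letter.\n\n"
        f"Question: {question}\n{lettered}\n\n"
        "Reasoning: Let's think step by step."
    )

prompt = build_cot_prompt(
    "Which organelle produces most of a cell's ATP?",
    ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus"],
)
print(prompt)
```

Tools like guidance automate this pattern, interleaving generated reasoning with constrained answer slots, so the model prompts itself multiple times inside one call.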
It's amazing. This is one of the things that I'm really looking to see explode in the next little while, and if you're not using it, you should be. There are other tools like LangChain, there are other tools like vector databases, that can add even more to this, like document retrieval. Amazing. You should be using this stuff.
Um, the last little bit, which I'm really hoping is going to play.
Oh geez. Well, I'm really sorry, this is not playing. I would love to be able to show you these. Basically, this is the "I never said I loved you" problem. This is getting into the phonetics. "I never said I loved you" is a sentence that's impossible to accurately understand through text,
because you can put emphasis on every single word, in order, and it will change the entire meaning of the sentence. Compare "I NEVER said I loved you" to "I never said I LOVED you," right? And that is information that is immediately lost when we reduce it to text. Um, for this, on the left we have two text-to-speech models. Oh, and a question from the chat: no, fewer people using Falcon does not make it faster, because Falcon is local.
It's downloadable, so it's just downloading that makes it faster. Just giving you a little, like, wrap-it-up warning. Oh yeah, sure. Well, the videos aren't playing, so we're good. Um, Tortoise and ElevenLabs are text-to-speech, so they are completely reduced to text, and I wish that I could show this to you, I wish it was playing, but Tortoise, there's a lot of stuff wrong with it: it ruins Trump's accent, makes him sound like he's from all over and speaking French, so it's even worse.
ElevenLabs has it pretty Trumpy, but it's still not very good. And neither of them captures any of the melody; neither of them captures any of the phonetic information that gets lost. Uh, SVC is a speech-to-speech model, so it's only phonetic, and it does a pretty good job, but it makes it sound like Trump has a severe cold and is French.
Whereas the phonetic-plus model, which ingests both text in the form of the International Phonetic Alphabet and an actual audio clip as a reference, sounds really, really fantastic. In fact, I posted it on my LinkedIn, so if you want to get a sense of what this sounds like, um, you can go there. And I'll actually give you this here.
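Going back to the "I never said I loved you" point for a second: the ambiguity can be enumerated in a couple of lines, which also shows exactly what plain text throws away.

```python
# Stress each word in turn; every variant carries a different implicature.
sentence = "I never said I loved you".split()
readings = [
    " ".join(f"*{w}*" if j == i else w for j, w in enumerate(sentence))
    for i in range(len(sentence))
]
for r in readings:
    print(r)
# e.g. "I *never* said I loved you"  (a denial)
#      "I never *said* I loved you"  (maybe I showed it instead)
```

Six words, six readings, and none of that prosodic information survives the reduction to an unmarked character string, which is what a text-only TTS model gets to work with.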
Here are the three QR codes to connect with me. The left one is LinkedIn, the right one is my portfolio site, and the bottom one is my YouTube. Thank you so much. I'd love to connect with you all and talk more about this. All right, Lily. Awesome. Thank you so much, Chris. Yeah. Please join folks in the chat. I feel like they're gonna want links and questions and all that stuff.
Thank you so much. Sure. All right. So long.