LLM vs. XGBoost - Can a Fine-Tuned LLM Beat XGBoost on Tabular Data?
As a Senior Data Scientist and Team Lead, Sebastian fills the modeling role in teams of multidisciplinary developers and guides the translation of business requirements to statistical models. He has 5+ years of experience building 6 products and 5 POCs in retail and manufacturing using cloud solutions. Following current research, Sebastian regularly experiments with NLP models.
LLMs are tremendously flexible, but can they bring additional value to classification tasks on tabular datasets?
I investigated whether LLM-based label predictions can be an alternative to typical machine learning classification algorithms for tabular data, by translating the tabular data into natural language and fine-tuning an LLM on it.
This talk compares the results of LLM and XGBoost predictions.
We've got our next speaker on deck: it is Sebastian, with a very good question that he will address, LLMs versus XGBoost. Can a fine-tuned LLM beat XGBoost on tabular data? Tell us, Sebastian, can it be done?
Thank you. Hi everyone. Today I will talk with you about what LLMs can do for predictive analytics and whether they can beat XGBoost on tabular data.
At IWT we do full-stack data science consulting, everything from architecture development to predictive analytics tasks, and most of our data comes in tabular form. So I asked myself: can we use these large language models on tabular datasets and apply them to standard predictive analytics tasks?
Today I will present my findings from those experiments. I will start with a short introduction, then finding a suitable task and dataset, then how we go about fine-tuning those LLMs, the results, a few experiments, the conclusion, and next steps.
Large amounts of structured, tabular data are around and in use all the time, and the question is: do the LLMs know something about the data that's maybe not in the table itself? The large language models are trained on vast amounts of data; maybe they have read the finance subreddit and know something about a dataset that the data itself doesn't contain.
That is what I wanted to evaluate. In the past, when BERT came out, I already experimented with how BERT compares to XGBoost, and back then XGBoost was still the clear winner. But with the current hype and the great results LLMs have produced, I wanted to look at it again and see how the performance compares now. The first step was to find a suitable task and a suitable dataset, and that proved much harder than expected. When you look at the UCI Machine Learning Repository or some other standard dataset, you can just copy a CSV line into ChatGPT and ask: hey, will that customer churn, yes or no?
And ChatGPT, or GPT and the other large language models, have read that dataset 200 times during training and can tell you with one hundred percent accuracy whether the customer will churn or not. So it needed to be a dataset that is not in the training data. We looked for one that was published after the training cutoff, and it also needed to come from a domain the LLM could be expected to know something about.
Going with some IoT or other sensor dataset with a couple hundred columns, X1 to X250, all float values, we can't expect the LLM to know anything about that. So we ended up with a customer churn prediction dataset from a telecommunications provider that was uploaded only recently.
Then, how do you go about fine-tuning on that tabular data? Basically, there are three steps. The first one is to translate the CSV file into natural language. I guess you could even go with the comma-separated list of each line, but taking a dataset with a codebook and translating each row into real sentences is much closer to the text these models are used to.
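A minimal sketch of what that translation step could look like; the file name and column names (tenure, contract, monthly_charges) are illustrative assumptions, not the actual codebook of the dataset:

```python
# Sketch of the CSV-row-to-sentence translation step.
# Column names and file name are assumptions for illustration only.
import pandas as pd

def row_to_sentence(row: pd.Series) -> str:
    # The real wording would be driven by the dataset's codebook.
    return (
        f"The customer has been with the provider for {row['tenure']} months, "
        f"is on a {row['contract']} contract and pays "
        f"{row['monthly_charges']} per month."
    )

df = pd.read_csv("churn.csv")
df["prompt"] = df.apply(row_to_sentence, axis=1)
```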
The fine-tuning itself then works because these models predict the next word. So you provide basically only the word "yes" or the word "no" as the target and fine-tune the model to predict that word as the outcome, instead of the 0/1 label you would have in a standard machine learning model.
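Continuing the sketch above, the translated rows could be turned into prompt/completion pairs where the completion is just that single word; the JSONL layout follows the older completion-style OpenAI fine-tuning format, and the separator string is an assumption:

```python
# Build fine-tuning examples where the model only has to predict " yes" or " no".
import json

with open("train.jsonl", "w") as f:
    for _, row in df.iterrows():
        example = {
            "prompt": row["prompt"] + "\nWill this customer churn?\n\n###\n\n",
            "completion": " yes" if row["churn"] == 1 else " no",
        }
        f.write(json.dumps(example) + "\n")
```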
In the end we compare the results on a train/test split. So the whole pipeline basically is: you have your CSV data and you translate it to natural language. We used ChatGPT for the whole experiment, via the OpenAI APIs, because they are really easy to use and everything is there. You do the translation with ChatGPT to get your prompt file, then there is a fine-tuning endpoint that gives you the trained model, which you then use for the predictions in the next step and compare against XGBoost. With the churn dataset, the classic metric to evaluate the classification is the AUC.
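A rough sketch of that fine-tuning and prediction loop, using the legacy openai 0.x Python SDK with completion-style fine-tuning; the base model choice ("ada") and reading the probability of the " yes" token from the logprobs are assumptions, and newer SDK versions expose different endpoints:

```python
# Upload the training file, start a fine-tuning job, and later score prompts.
import math
import openai  # legacy 0.x SDK

train_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(training_file=train_file.id, model="ada")
# ... wait for the job to finish, then fetch the resulting model name:
fine_tuned_model = openai.FineTune.retrieve(job.id).fine_tuned_model

def churn_probability(prompt: str) -> float:
    resp = openai.Completion.create(
        model=fine_tuned_model,
        prompt=prompt,
        max_tokens=1,
        temperature=0,
        logprobs=2,
    )
    top = resp.choices[0].logprobs.top_logprobs[0]  # token -> logprob
    return math.exp(top.get(" yes", float("-inf")))
```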
On the right-hand side you can see the plot of that. The blue line is XGBoost, which has an AUC of 0.93, and the LLM comes in at 0.916, not far behind. So we can see that the LLM actually performs reasonably well on this churn prediction task. I was quite happy to see that: even if the LLM is not better than XGBoost, or exactly on par with it, the difference is quite negligible, and it really does perform well on this dataset. I also tried it with ChatGPT without any fine-tuning; you can see the red line here. I just asked ChatGPT with one prompt from the translated dataset: here is a customer from the telecommunications company, do you think they will churn or not? After about a hundred tries I stopped, because the answers were worse than random. So some fine-tuning is necessary. One caveat to the whole thing is that I also tried a logistic regression, and you can see it is even better than the LLM, with an incredibly high AUC.
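For reference, a minimal sketch of the two classical baselines and the AUC evaluation on the same kind of split; one-hot encoding every remaining column is a simplification, and file and column names are the same assumptions as before:

```python
# Train the two classical baselines and compare them by AUC on a held-out split.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("churn.csv")
X = pd.get_dummies(df.drop(columns=["churn"]))
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

xgb = XGBClassifier(eval_metric="logloss").fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("XGBoost AUC:", roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1]))
print("LogReg  AUC:", roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))
```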
So maybe this dataset wasn't the best one to really gauge that, or you would need one or two more datasets to see how it goes. One question was: does the LLM actually know something about this dataset or not? To probe that, we tried adding the customer ID to the prompt to see whether that improves the fine-tuning; if the LLM knows the data, maybe that would help. But it was not the case: adding the customer ID changed nothing in the results, it was just added noise.
What happens when we remove, like the most important feature for xg Boost from the XG Boost and from the LM prompt? Uh, will that actually change the performance or it will decrease the performance of the, of the models. But will the magnitude of decrease be kind of the same between the um, XG Boost or maybe.
Can the LM compensate for that a little bit. Uh, but there, uh, it was roughly the same decrease when removing the most important barrier with so no wind for the LM there. Um, also I tried a bit with the train test split, trying the different seeds. So each time only, only one change, but, uh, uh, variations changed.
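A sketch of how those two checks could look on the XGBoost side, dropping the most important feature and varying the split seed; file and column names remain the assumptions from the earlier sketches:

```python
# Robustness checks: drop XGBoost's most important feature, and vary the split seed.
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("churn.csv")
X, y = pd.get_dummies(df.drop(columns=["churn"])), df["churn"]

def auc_for(features: pd.DataFrame, seed: int) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Find the most important feature according to XGBoost, then drop it.
full_model = XGBClassifier(eval_metric="logloss").fit(X, y)
top_feature = pd.Series(full_model.feature_importances_, index=X.columns).idxmax()

print("all features:", auc_for(X, seed=42))
print(f"without {top_feature}:", auc_for(X.drop(columns=[top_feature]), seed=42))
print("different seeds:", [round(auc_for(X, seed=s), 3) for s in (0, 1, 2)])
```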
As a conclusion, I think we can say that the LLM performs reasonably well on this classification task. Of course, it takes a bit more data preparation than directly using XGBoost on the table.
But think about mixed datasets: classical tabular data plus free text. Normally you try to extract features from the text and put them into your machine learning model. Now you could actually try it the other way around: translate your tabular data into sentences, add that to the text data you already have, and use the fine-tuned LLM to do the predictions, which is, I guess, a bit more flexible as an approach. The cost of the whole experiment was maybe a hundred bucks or so, so it was cheaper than expected and perfectly fine for first experiments.
Working with the OpenAI API was slower than expected, though. When I ran my predictions it took about five minutes for a thousand predictions, which surprised me. I am not sure whether that is reasonable for an actual production use case or whether it is something I could still improve upon.
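Not something covered in the talk, but one assumption worth testing is that the wall-clock time is dominated by sequential round trips; a hedged sketch of sending the prediction requests concurrently (within the API's rate limits), with a placeholder model name:

```python
# Hypothetical speed-up: fire the completion requests concurrently instead of one by one.
from concurrent.futures import ThreadPoolExecutor
import openai  # legacy 0.x SDK, as in the earlier sketch

def predict(prompt: str) -> str:
    resp = openai.Completion.create(
        model="ada:ft-placeholder",  # placeholder for the fine-tuned model name
        prompt=prompt,
        max_tokens=1,
        temperature=0,
    )
    return resp.choices[0].text.strip()

prompts = ["..."]  # the translated test-set rows
with ThreadPoolExecutor(max_workers=8) as pool:
    predictions = list(pool.map(predict, prompts))
```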
Next steps and open questions are of course always there. One I find very interesting: can I concatenate all the churn datasets I can find, translate them to natural language, and see if fine-tuning on that joint dataset is better than on every individual one? Could we use SHAP values to identify which part of the input prompt was used for the classification and then compare that with XGBoost? I am not sure yet whether that could help or not.
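On the XGBoost side, the SHAP part of that comparison is straightforward; a minimal sketch, assuming the same features and file name as in the earlier sketches:

```python
# SHAP values for the XGBoost model: which inputs drive each prediction,
# to later compare against what the fine-tuned LLM appears to rely on.
import pandas as pd
import shap
from xgboost import XGBClassifier

df = pd.read_csv("churn.csv")
X, y = pd.get_dummies(df.drop(columns=["churn"])), df["churn"]

model = XGBClassifier(eval_metric="logloss").fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # global view of feature contributions
```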
Can we extract any relevant information from the LLM to enrich the tabular dataset itself? Maybe those hallucinations everybody tries to get rid of are good for something after all, if we can enrich our datasets with some reasonable guesses from them.
Here are the links to the OpenAI API and the dataset. Thank you for your time.
Awesome, thank you so much. We are going to get you off the stage and get ready for our next speaker. Thank you so much, Sebastian, this was great. Bye.