Exploring Cloudera's LLM Integration in Hue
Björn Alm Mangell, Senior Staff Software Engineer at Cloudera, has been coding since 1996 with a focus on business value, AI, and beautiful user interfaces. He began his career as CTO of a Dutch AI company but has since transitioned to coding with one foot in the startup scene and the other in large tech companies (IBM, Cloudera). Lately, he has been one of the initiators and developers behind Cloudera's SQL AI Assistant, an NL to SQL tool that brings the power of modern NLP models to your organization's data.
Björn Alm explores the integration of Cloud Era's SQL AI assistant, called LLM, in the application called Hue. This integration aims to revolutionize the way SQL is handled, moving from manual, hands-on SQL writing to a more automated process. Björn demonstrates how LLM can generate SQL queries using natural language prompts, modify existing SQL statements, and provide explanations for complex queries. The talk also delves into LLM's ability to optimize poorly written SQL queries and auto-correct syntax errors, ultimately making the process of working with SQL more efficient and user-friendly. Björn's presentation offers a deep dive into the capabilities of LLM and its potential to transform the workflow of SQL analysts.
Björn Alm [00:00:00]: My name is Björn Alm Mangell and I'm one of the developers behind what we call the cloud era and SQL AI assistant. And it is currently integrated in an application called Hue, which I'm also working on. And as of yesterday it is actually publicly available to our clients. It's behind a technical preview flag, but it's there, it's in production and you can try it today. I'm going to show you what it's all about. But first, it's currently integrated in an application called Hue, which is like a swiss arminite for the SQL analyst. And up until now our end users, they have written all their SQL by hand. So we have a parser and it has an auto suggest and on the left side you have all the tables and metadata and you can browse the language, reference, et cetera.
Björn Alm [00:01:03]: But it's still been a manual kind of hands on job. And as of now you don't have to do that much yourself. You can ask DLLm to do most of the work for you. The AI assistant is activated by clicking the blue dots. And we have a couple of options here related to SQL. And depending on what you're currently doing in the editor right now we're doing nothing. It's empty, there's no SQL there. These different options will be available or not.
Björn Alm [00:01:35]: So right now the only thing we can do is start to generate SQL using natural language. So let's list the most common countries of origin for online sales customers. So this is a prompt. I've typed it before. That's why it was available in a drop down history lister. Now let's execute it. A lot of stuff is happening here in the background and this is actually by the way, this is actual data running in actual hive database somewhere in the US. I have my demo server.
Björn Alm [00:02:13]: This is pretty fast. We can see that the LLM comes back with some SQL. For me, I see that it's using the Cbirth country. And if we look down here, there is an assumptions box as well, which is good. And it mentions the seabirth country there. And of course this is an assumption, but it's a really good assumption. It's the same assumption that I would make as an analyst if I had access to the metadata and was looking at how can I sort of write this in SQL. But it's still good to remember that there are a lot of assumptions here and they are listed.
Björn Alm [00:02:49]: It also finds the web sales table. That's cool. It groups by countries, what I wanted, it orders and it limits because I wanted the most common countries, right? Not all the countries. Let's insert this. It also inserts as a comment the actual prompt that I was using. If we execute this, it could be fast. It could take some time. Now hopefully it doesn't take too long.
Björn Alm [00:03:21]: I can tell you a little bit about what's happening behind the scenes when I execute this. All right, well, it's actually return already. This is good. I get a list of numbers, but there's also a null value there. I don't want that. Let's do something else. Let's edit this existing SQL and exclude all the null values. Exclude null rabbits.
Björn Alm [00:03:48]: I do the same thing again, but currently this setup is kind of different because I'm actually passing in a SQL statement and I'm asking to have it modified. But behind the scenes similar things are happening. We have a rag architecture here as well. We extract the relevant tables and all the metadata needed and we pass it in as the context using a specific template for this particular task. It's actually just not one template, a lot of templates, different calls that we stitch together. But basically this is what we're getting back where seabird country is not null. Perfect. I can insert this, we don't have to test run this.
Björn Alm [00:04:27]: It's a fairly straightforward thing, but I want to edit it some more. Now I can pick up my previous prompts by just tabing here and I can continue and say, and focus on, yeah, we got it here as well. Focus on year 2001. If you wonder why 2001 is because my demo and test data is old. So that's where there's actually some data. We hit enter. We're doing the same things now and it's generating the SQL query. And here again, since we passed in a SQL statement and we expect some modifications, the result is actually shown to us as a diff.
Björn Alm [00:05:09]: And here it's quite obvious that the diff here is some date in an interval. Okay. There's an assumption saying that the date range for year 2001 is from and then these numbers. Okay, so I might want to double check that because the LLM sometimes delivers more readable SQL for me. And sometimes it just takes a shortcut and leaves out certain tables that were put in this data set that enables human to write more readable SQL. So in this particular example, I insert this. I know it has looked in this date dimension table. We can execute this and I can talk while it's executing.
Björn Alm [00:05:57]: So this date dimension table is a way to allow human analysts to use dates in this data set. Because dates in these data sets are actually represented by big ints. You can see here, the date type here is big int. And the LLM has to figure out that it doesn't have to use this data mentioned table because it somehow knows that this is 1 January 2001. That's pretty cool. When I actually first saw this, when I was playing around with this, I was kind of surprised and it took me a little digging to find out what was happening. And I can show you if you're interested. That's it.
Björn Alm [00:06:40]: I get the result. It's slightly different here now when we're focused on the year 2001. But let's look at this number. So I ask the LLM, what is the big int value of January 1, 2001? That's like, okay, if I would approach to try to figure this out, what I would write, I'm asking the LLM this for it to come up with a SQL statement for me. Okay. It's using this date SK, which is good, and it's using the date dimension and it's adding my condition there in the wordpost. Execute this. Now, all of you, of course, remember exactly those numbers in the date interval from the previous SQL.
Björn Alm [00:07:26]: Oh, sorry, my bad. Execute. Now I can tell you it was this number. So that was a way to verify that the LLM actually created a correct SQL query, even though at a first glance I didn't actually understand it. But it can also help you understand the SQL. And that brings us to the next section. We've looked at how to generate and how to edit existing sqls, but let's look at explaining an existing SQL statement. So I have written this, but I'm supposed to use it.
Björn Alm [00:08:03]: But I do want to know what it does. I could spend time on this trying to figure it out. I think most of you could, but it's time consuming and you need to think head might hurt. So let's just ask the LLM to explain what it does. And the LLM again then is getting the complete relevant context it needs to understand what this query does. So this query, it has some acronyms here, abbreviations remove some vowels there in the names, but doesn't matter because it has the context and all the table names and it knows how to explain this to me in plain English. So this SQL query retrieves the custom ids of customers who have returned products to a store in New Mexico in the year 2000. And total return fees are greater than 20% above the average return fees of customers who have returned products from the same store.
Björn Alm [00:08:56]: Okay, so that's pretty advanced. Just to say, difficult to get your head around, especially just looking at a SQL statement, but the LLM can sort of condense it for you. And at least now I have a much better feeling for what this SQL statement does. And there's also a longer explanation that in more details shows me what's going on with this SQL. I can now choose save this as query and give it a name. And I can share this with my colleagues and when they pick it up the next time they have this common to explain. So this is really handy. Another thing that we looked at was optimizing existing SQL queries.
Björn Alm [00:09:45]: This is SQL query written by a human. It's not very well written. It can be optimized. We could probably do it manually, but again, why do that when we can ask CLM to do it in a matter of seconds? So now, again, getting the whole context for element tables, packaging that in a specific prompt template, asking it to optimize. And we see here in the diff, it removes a lot of code, makes a few adjustments, and we get something that is much easier back. We're also getting an explanation of what it has done. So the original query had unnecessary sub queries and joins. By removing sub queries and joining the table directly, the query becomes simpler and more efficient.
Björn Alm [00:10:35]: So it goes on explaining now what it has done in this optimization is really good. We can insert it and we can execute it. The result would be the same, but it would be easy to read the query and it would be faster to execute it. Now here is a SQL statement that I've written. I'm going to execute it. You can tell that I wrote it because there's a parse exception here. It's marked here. We see some red lines, but there are red lines all over the place.
Björn Alm [00:11:05]: And I can debug this and I can fix it, but Dellm can do that as well, way faster than I can. So I click fix. And this is the crux as well. The fix button is only available if there's actually something to fix because there's an online parser in this editor and it has a consultant look at what's happening in the editor and if there's a parse error there, it will enable the fixed button. Taking this further, we can do stuff like suggesting this, hey, we're seeing this, do you want to take an action? But right now we're enabling a fixed button for you. And we see that I made a couple of mistakes, I forgot the end parenthesis, I misspelled the table and it changes this because it knows that there is a store sales table. So there's a very big probability. That's what I actually meant.
Björn Alm [00:11:57]: And I misspelled a couple of keywords here. So now I can insert it, I can run it. I think it will take some time to actually get the result of this back. But we can see here the editor does not warn us underline anything with parse error indicators. And if there would have been a parse error on the server side by now, I would have received it. So the LLM saw my incorrect SQL query for me. That's what I wanted to show today. Thank you.
Björn Alm [00:12:29]: It was keeping trying.