MLOps Community

Visualize - Bringing Structure to Unstructured Data

Posted Sep 03, 2024
# Data Visualization
# RAG
# Renumics
Markus Stoll
CTO @ Renumics

Markus Stoll began his career in industry at Siemens Healthineers, developing software for the Heavy Ion Therapy Center in Heidelberg, where he learned about software quality while developing a treatment machine weighing over 600 tons. He earned a Ph.D. focusing on combining biomechanical models with statistical models, through which he learned how challenging it is to bridge the gap between research and practical application in the healthcare domain. Since co-founding Renumics, he has been active in the field of AI for engineering, e.g., AI for Computer-Aided Engineering (CAE), implementing projects, contributing to the company's open-source library for data exploration on ML datasets (Renumics Spotlight), and writing articles about data visualization.

SUMMARY

This talk is about how data visualization and embeddings can support you in understanding your machine-learning data. We explore methods to structure and visualize unstructured data like text, images, and audio for applications ranging from classification and detection to Retrieval-Augmented Generation. By using tools and techniques like UMAP to reduce data dimensions and visualization tools like Renumics Spotlight, we aim to make data analysis for ML easier. Whether you're dealing with interpretable features, metadata, or embeddings, we'll show you how to use them all together to uncover hidden patterns in multimodal data, evaluate the model performance for data subgroups, and find failure modes of your ML models.

TRANSCRIPT

Markus Stoll [00:00:00]: I'm Markus. I'm from Renumics. I'm co-founder and CTO of Renumics, and I like my coffee as a latte macchiato, sometimes with vanilla flavor.

Demetrios [00:00:15]: Welcome back to another MLOps Community podcast. I'm your host, Demetrios. Today, talking with Markus has been enlightening on the car simulation front, well, just simulations in general, and how he's been doing some really cool stuff with embeddings. We got into it at the end on how messy this data can be, what's been taking up a ton of his time, and why he feels there aren't the right tools for the job when it comes to some of these data pieces that he has. And it's not only for simulations, but just for the immense amount of data that he's been gathering with these use cases that he's working on in partnership with car companies. So we got to geek out a little bit on car companies and roads. And if you're Italian, I mean no harm when I say that you have shitty roads. It's just a fact of life.

Demetrios [00:01:13]: Also Greek. I threw you under the bus. Let's be honest, there are some places that have really nice roads. And me living in Germany, I'm going to look down on all of you that do not have the nice roads, 'cause Germany's got some nice roads. I'm gonna give it up to Holland. Holland, they might have some of the best roads in all of Europe.

Demetrios [00:01:35]: They spend a lot of money, and those roads are super smooth. All right, let's get into this conversation already. Enough about them roads. If you like it, share it with one friend, please. All right, real quick, I want to tell you about our virtual conference that's coming up on September 12. This time we are going against the grain and we are doing it all about data engineering for ML and AI. You're not going to hear RAG talks, but you are going to hear very valuable talks. We've got some incredible guests and speakers lined up.

Demetrios [00:02:12]: You know how we do it for these virtual conferences. It's going to be a blast. Check it out right now: you can go to home.mlops.community and register. Let's get back into the show, and let's just start from square one. What are you up to these days?

Markus Stoll [00:02:33]: Currently, I'm working on getting our company compliant. So I'm CTO, so my role actually is to take care of scalability, at the right time and at the right level in the projects. And currently we are working on getting a certification for our projects, and that's what I am working on the most at the moment. But aside from that, I'm also working on a lot of projects. For example, I worked very hard, and had a lot of fun working on it, on visualization for RAG. So I applied UMAP and other visualization techniques to RAG to give the possibility to take a deeper look.

Markus Stoll [00:03:31]: Visualization into rack data on the document side, but also on the query side, published data.

Demetrios [00:03:41]: Visualizing the embeddings.

Markus Stoll [00:03:43]: Yeah, I visualize the embeddings. So I reduce the embeddings using UMAP to a two-dimensional map. And this helps a lot when you try to get a good overview of your data. So you can see clusters of different topics forming, for example. Or, a lot of the time, you can find some anomalies where you see that you have included some text that you actually didn't want to be in your database. For example, in this demo I wrote, I realized that Wikipedia footer and header texts were also included in the database, and they popped up as little clusters with a lot of points. And I spotted that only because I was using this visualization technique.
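
A minimal sketch of this reduction step, assuming document embeddings are already computed as a NumPy array (the shapes and data here are placeholders):

```python
import numpy as np
import umap  # pip install umap-learn
import matplotlib.pyplot as plt

# placeholder for real document embeddings, e.g. ~1,000-dimensional vectors
embeddings = np.random.rand(500, 768)

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)  # (n_docs, 2) map coordinates

plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("2D UMAP of document embeddings")
plt.show()
```

Unexpected small clusters sitting apart from the rest, like the Wikipedia headers and footers mentioned above, tend to stand out immediately in such a plot.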

Demetrios [00:04:42]: And you're exploring the datasets through this visualization, basically the embeddings visualization, on, I think, a 3D graph, right? And so you can see...

Markus Stoll [00:04:58]: A lot of people are using a 3D graph, so they reduce the high-dimensional data, I think almost 1,000 dimensions, to three dimensions. But I prefer to use a 2D representation because that is a little bit easier to overview. So it's reduced more, so maybe you can see a little bit less, but it's much easier to present. If you are creating slides from it or showing someone specific clusters, it's much easier from my perspective if you have reduced it to 2D, because you can use a plain image; it's very hard to represent 3D graphics in a presentation, or to work together on it in a team, that's a little bit harder. But yeah, I use this 2D map with a lot of points on it and check out the different clusters, look for clusters that are somewhere outside the other clusters, and check: is this expected? What's going on there? Is this data actually useful, or is this some anomaly? And it gets even more interesting if you include the questions in it.

Markus Stoll [00:06:21]: So you can, for example, project the questions into the same space, all your questions, reference questions or questions from your users, to find out which part of your document space is, for example, important for your users, or is important when you evaluate your model. So you can find, for example, clusters in your documents that are not covered by your reference questions. And then you can think about it and maybe add more questions, or maybe this cluster isn't that relevant and you can drop it from your document store.
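
As a hedged sketch of this projection, reusing the fitted `reducer` and document `coords` from the previous snippet, question embeddings can be mapped into the same 2D space (the data is again a placeholder):

```python
# questions embedded with the same model as the documents (placeholder values)
question_embeddings = np.random.rand(50, 768)
q_coords = reducer.transform(question_embeddings)  # project into the document map

plt.scatter(coords[:, 0], coords[:, 1], s=5, label="documents")
plt.scatter(q_coords[:, 0], q_coords[:, 1], s=25, marker="x", label="questions")
plt.legend()
plt.show()
```

Document clusters with no nearby question markers are exactly the uncovered regions worth discussing with the customer.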

Demetrios [00:07:07]: And if I'm understanding this correctly, you're also creating embeddings for the questions, or you're just looking at which embeddings the questions are pointing to.

Markus Stoll [00:07:18]: Yeah, you can use both. Usually you embed the question anyway to find the relevant documents for it, so you often already have an embedding for each question. But yes, of course, it's also very important to find the actual nearest neighbors of this question in the embedding space, which are the documents that are relevant for this question. Usually the five or 20 nearest neighbors are taken into account when answering the question.
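
A minimal retrieval sketch along these lines, using the `embeddings` and `question_embeddings` arrays from the snippets above:

```python
from sklearn.neighbors import NearestNeighbors

k = 5  # or 20, as mentioned above
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(embeddings)
distances, doc_indices = nn.kneighbors(question_embeddings)
# doc_indices[i] holds the k most relevant document ids for question i
```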

Demetrios [00:07:50]: And then you're getting a better idea of which documents are getting high usage and which documents are getting lower usage.

Markus Stoll [00:07:58]: Exactly. You can use that to color your similarity map, so you can see how often the documents are used or referenced by questions, and color-code that to get some kind of a heat map and see which area in your embedding space is important or is referenced a lot, and which is not.
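
Continuing the sketch, the retrieval results from the nearest-neighbor example can color the 2D map:

```python
# count how often each document shows up in the retrieval results
counts = np.zeros(len(embeddings))
for row in doc_indices:
    counts[row] += 1

plt.scatter(coords[:, 0], coords[:, 1], c=counts, cmap="hot", s=5)
plt.colorbar(label="times retrieved")
plt.show()
```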

Demetrios [00:08:24]: So how have you been using this? Like, what are you able to do once you have this heat map or once you know that the documents are being used more frequently? What do you do with that information?

Markus Stoll [00:08:37]: I think the most interesting part for me at the moment was to see how the project's evaluation questions are represented there. I have seen a lot of cases where the reference questions only span a very limited area in the embedding space, and there are many other documents in the space. And then it's very important to talk to the customer, or whoever is interested in the data: is something missing here? Do we not need that data, or should we try to add new questions? So I think the coverage idea for the reference questions is very, very important. It's a very efficient step in the project to do, to get a better understanding of the expectations of the customer. I haven't used it yet in production to see how it works out with real end users, to see which documents are important for them. But yes, that's also a very interesting point. You could use that to optimize your model, or to optimize the whole approach, or to reduce the data for the use case based on this information about how it's actually used, and maybe also compare the reference questions that you have worked out with the customer against questions that are really asked by the users in the end system.

Demetrios [00:10:20]: Yeah. You're able to figure out what are those 20% of the documents that are getting 80% of the queries, and really make sure those are pristine. I also like this idea of, as you're building a product, you gain insights into what the expectations of your users are.

Markus Stoll [00:10:39]: Yeah.

Demetrios [00:10:41]: So this visualization is going to help you see, in a different way, this data that is coming through as people are interacting with your RAG system.

Markus Stoll [00:10:54]: And I think it's also important to collect information from the customer, or in the end, the user, on how good the results of the RAG system were for specific queries, to get this feedback, and then use this feedback in the visualization again, to maybe find areas or clusters in the document space where the system doesn't work as expected. Not only to find out what is covered, but also to find out which cluster in the document space has which quality level based on the feedback.

Demetrios [00:11:46]: So you're taking it a step further. It's not just what questions are coming through, what documents are relevant, what are the best ways to answer these questions, but also what is the actual output, and then using some kind of evaluation criteria on top of that, and being able to see what the output was and visualize it differently than just a bunch of question answer pairs.

Markus Stoll [00:12:14]: Yeah, exactly. And I think the important step is not to use some automatic measurements or criteria, but real user feedback. So we always try to include, very early in our projects, some feedback mechanism for the users: is this answer good or bad? Why? Should we maybe discuss this answer again together with you? I think that's much more important than using some RAG metrics to decide if the quality is okay.

Demetrios [00:12:54]: Yeah. So, just so everybody knows, this is an open-source project, right? We can all go and download this and start playing around with it.

Markus Stoll [00:13:04]: Today we have different projects. So we have Spotlight. Spotlight is our tool that we use to look into different kinds of data. So it has support for 3D data, like 3D meshes; that's where we actually come from, we did a lot of development for simulation engineering, so that's supported there. And video and audio and images, but also text. And it has support for visualization of embeddings as a UMAP. And you can use different embeddings and visualize, for example, also RAG data, and visualize the questions in the embedding space.

Markus Stoll [00:13:55]: And the documents in the embedding space. And that's what I have written about on Towards Data Science a few months ago. You can check it out; it's all open source. And there's also a RAG project from us, which includes some limited, very simple RAG techniques. It's a vanilla RAG, and it doesn't include the feedback mechanisms, actually. But you can use it for basic projects and visualize embeddings of documents and questions with it, as in the article.
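
A hedged sketch of what loading such data into Renumics Spotlight can look like (the column names and the dtype mapping here are assumptions; check the Spotlight docs for your version):

```python
import pandas as pd
from renumics import spotlight  # pip install renumics-spotlight

df = pd.DataFrame({
    "document": ["first text chunk ...", "second text chunk ..."],
    "embedding": [[0.1, 0.3, 0.5], [0.2, 0.1, 0.9]],  # per-chunk embeddings
})
# opens an interactive browser view with a similarity map for the embeddings
spotlight.show(df, dtype={"embedding": spotlight.Embedding})
```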

Demetrios [00:14:38]: But it's not only for RAG, right? You can also do classifications. Or is Spotlight specifically for RAG?

Markus Stoll [00:14:47]: So Spotlight is specifically for, I would say, machine learning data. So it has support for text, support for images, videos, a lot of audio, because we have customers that use it for audio, and also for 3D animations. Actually, we used it for exploration and classification of simulation data of car crashes.

Demetrios [00:15:21]: Oh, wow.

Markus Stoll [00:15:22]: Where you can compare the end result of a car crash based on the similarity of the deformation, for example. But that's something you have to prepare in the project. So the idea of Spotlight is to visualize different kinds of data, like 3D data or audio or images and a lot of different stuff, but together with embeddings. And you can use these embeddings to create some kind of a similarity map. And then you can use the similarity map to browse through your data. You can zoom in, select a specific point, which can be, for example, the result of a car crash, and then you can compare two points in the similarity map that are next to each other by visualizing the actual simulation, like the 3D animation of the car crash. So the idea is to have a tool that combines the machine learning view of the data, like UMAP or statistics of each point in the dataset, but also has the possibility to zoom very deep into each point to find out what the actual difference is.

Markus Stoll [00:16:44]: So it's the idea to bring the machine learning engineer on the same page together with the domain expert, for example, a simulation engineer for our cases.

Demetrios [00:16:57]: Okay, and let's talk a little bit more about the simulation engine, because this is fascinating to me: the idea of creating a 3D space and then simulating car crashes and creating embeddings out of those simulations, so you can know which car crashes failed in which ways, and which failed in similar ways. And if you wanted to then search for failures or car crashes, I guess you could search through the embedding space.

Markus Stoll [00:17:28]: Yep, that's the idea, if it works well. You have to do some work, of course, to get these embeddings. It's not the best idea to just take the whole animation, the whole part, or the whole car into the embedding space, but it's often very easy to just define: this is our region of interest, defined by maybe a thousand points, and use these thousand points to create an embedding for this area of the mesh. And then you can cluster the results of the simulations using a UMAP, for example, based on the embeddings, and find clusters of similar defects where the results of the car crash were very similar. And maybe you can find some outliers where you have found a very, very good configuration for some material.
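
A hedged sketch of the region-of-interest idea, assuming the final node displacements of a selected mesh region are available per simulation run (all names, shapes, and data below are illustrative, not the project's actual pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

n_runs, n_roi_nodes = 200, 1000
# displacements[r, i] = (dx, dy, dz) of region-of-interest node i at the end of run r
displacements = np.random.rand(n_runs, n_roi_nodes, 3)  # placeholder data

features = displacements.reshape(n_runs, -1)            # one flat vector per run
embedding = PCA(n_components=32).fit_transform(features)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(embedding)
# runs sharing a label deformed similarly in the region of interest
```

The low-dimensional `embedding` can then go through UMAP and into a similarity map, as described above.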

Demetrios [00:18:33]: So this is working with car companies on testing new materials or testing new designs?

Markus Stoll [00:18:43]: Yeah, that's a use case for that. This specific case I'm talking about now is not from a customer use case; it's a showcase from us. And there are people that work like that, not all of them with Spotlight, but some of them. But there are, of course, also different areas. We have a lot of projects in this engineering space with car companies. This starts with maybe this simulation post-processing, where you can organize the results using embeddings, but also, for example, in the pre-processing of the simulation, where there's a lot of manual work to do.

Markus Stoll [00:19:32]: For example, if you combine hundreds or thousands of parts together in a simulation, you have to define how these are connected. And you don't always have the information on how these parts are connected; you have to look into it and compare: okay, this is part A, this is part B, and this is usually connected with this kind of connection. And in a very early project, we built a classifier that automatically detects the different parts and gets an idea of what the connection of this part should be. An engineer can use this tool to create hundreds of connections automatically, review them afterwards, and then use them in a simulation.

Demetrios [00:20:31]: Oh, nice. And so this, again, going to this specific use case: it is basically so that when someone is creating a new style of connections or new parts in a car, they're able to keep that going, so that the simulation can understand how the dynamics would be and how physics would work with that? Or is it so that the next person who comes and tries to do stuff with it knows the way that those parts were put together?

Markus Stoll [00:21:04]: In this project, it was about creating the correct representation for the simulation. For example, if you have a screw in reality, in the simulation you would just tie the points where this screw actually would be together using a constraint. So you wouldn't model the connection itself as it really is; it's a very simple model. And you can have a catalog of these simple models. And we trained a classifier to find out which of these connectors to take in which situations.

Demetrios [00:21:56]: I've got to come clean with you. I absolutely am addicted to these simulation videos on TikTok. But only when you get somebody good that's giving a really nice voiceover and making it really funny, because you have different cars getting smashed by semi trucks and it's like, would you survive? And then you have people giving voiceovers that are hilarious. But for the most part, you're not surviving when you're getting hit by a semi truck.

Markus Stoll [00:22:27]: Yeah, yeah.

Demetrios [00:22:28]: That's the key takeaway there. As if people didn't know that one already. But let's talk more about what some of the other pieces that you have been working on are, what some cool use cases that you've been seeing out there are.

Markus Stoll [00:22:45]: A very cool use case for us at the moment is the evaluation of test data. So if you build, for example, a car, then you test it on test tracks with test drivers, and you get a lot of results. And it's very hard to get all the results into a representation that really allows you to find anomalies very easily, or find good setups, or find errors. At the moment, I think it's actually a lot of manual work. You have an expert that has to go through a lot of data.

Demetrios [00:23:27]: And that's just because there's so much data that you're getting.

Markus Stoll [00:23:31]: Yeah, so much data. So it's a lot of work to do, and I think the tools are not as good as they could be, actually, from a machine learning engineer's perspective. So yeah, we are currently working on projects where we use anomaly detection or classification to classify bad behaviors or errors during the test drives of cars, where we can use automatic approaches and reduce the time that is necessary to actually review the data.

Demetrios [00:24:22]: This is basically like, well, you and I are both in Germany, so let's take some German cars, right? We'll go with Porsche, just 'cause. So this is Porsche saying: we've got a real car with a real driver, we're gonna go take it onto the track that has a lot of different styles of roads. It's got the Italian roads with all the bumps in it. It's got the German roads with the smooth sailing where you can go on the Autobahn at 200 km an hour. It's got many different scenarios, maybe dirt roads, whatever. You have that test drive in real life; we're no longer talking about simulations, right? So it's in real life. It comes back after, and it probably does various of these.

Demetrios [00:25:03]: I imagine it's going through and doing many of them, and you're trying to just take it to the limit sometimes, or you're trying to just see how it drives normally, changing gears, making sure it does everything that's table stakes well. And then when it gets to those edge cases, see how it performs there. I can see how, if you have the minimal amount of sensors on a car, that would just be so much data that you have to sift through and see if there's any kind of anomalies. But I can only imagine that Porsche has more than the minimal amount of sensors, right? There's probably sensors in everything that you now are looking at, and you're trying to see: is this working properly, or is this working less than properly? Because I imagine if you're a human, you don't necessarily feel if something's off. But you want to make sure that, all right, it works for 10,000 km or 10,000 miles, but we want it to work for 100,000 miles. So we want to be able to catch things early and fast. And now you're trying to sift through the data and see if there's going to be something that breaks.

Demetrios [00:26:23]: How do you even go about that?

Markus Stoll [00:26:25]: It really depends on the data that you already have. Of course, you have very long records, you have a lot of channels, you have different modalities. We have sensor data for acceleration, for example, or sound, maybe also a camera. So a lot of different data. And the first thing to do when we set up a project like that is to see: is there historical data already? Do we have some data that we can use? Then we can, for example, go for a classifier for specific errors that we have already seen in the past, to find them again in the new data, automatically. Or, if we don't have the data, then we can go for anomaly detection approaches to find some very, very rare cases, and look deep into that point in time, in all channels and all videos, and find out what's going on there.
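
A minimal sketch of such an anomaly detection pass over windowed multichannel sensor data; the features, shapes, and contamination rate are placeholders, not a real project setup:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

signal = np.random.rand(100_000, 8)        # placeholder: 8 sensor channels over time
win = 256
n_windows = len(signal) // win
windows = signal[: n_windows * win].reshape(n_windows, win, 8)

# crude per-window features: mean and standard deviation per channel
feats = np.concatenate([windows.mean(axis=1), windows.std(axis=1)], axis=1)

flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(feats)
anomalous = np.where(flags == -1)[0]       # window indices worth a closer look
```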

Markus Stoll [00:27:34]: And of course, you have this idea that I often prefer, to do some visualization of it. So you use embeddings again for the different signals, and maybe for the images, for the audio data, and project that into different similarity maps, go over your clusters, and find out what's going on in the different clusters. Then you can filter out one cluster if it's not interesting, and find out what you end up with that is interesting for you. And maybe you can use a simple concatenation of the different embeddings from the different modalities into one visualization. Or you can use multiple visualizations that are linked together, so you can select data points on one side based on the audio signal, and on the other side you see, okay, whether the cluster of the audio signals also forms a subcluster in the sensor space, and use that together to find some specific clusters again. So you can actually use the embeddings for data exploration. But I think in the real projects we are doing at the moment, we are focusing on very specific cases where we can use classification, or maybe anomaly detection, for example, to find out the simplest thing: that something is wrong with your setup. That's a very common case.
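
A sketch of the simple concatenation idea mentioned above, joining per-modality embeddings into one vector per data point before a joint UMAP (the shapes are assumptions; normalizing first keeps one modality from dominating):

```python
import numpy as np
import umap
from sklearn.preprocessing import normalize

audio_emb = normalize(np.random.rand(300, 128))    # placeholder audio embeddings
sensor_emb = normalize(np.random.rand(300, 64))    # placeholder sensor embeddings

joint = np.concatenate([audio_emb, sensor_emb], axis=1)
joint_2d = umap.UMAP(n_components=2).fit_transform(joint)  # one combined map
```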

Markus Stoll [00:29:30]: Just to do some anomaly detection before you start the actual measurement, to find out: maybe you have connected your sensor the wrong way, maybe you have switched two cables, or something else.

Demetrios [00:29:47]: Yeah, yeah, I can imagine that happens more than we would like to admit. And it's good to find that early, so that you don't waste your time on bad data, trying to figure out what's going on here, why is this data not giving me these answers, and then you recognize, okay, well, there was some cable switched or there was some type of sensor that was broken or whatever.

Markus Stoll [00:30:11]: And one thing that we are currently starting to look into is the combination of the RAG world and this time series data world, so that you can use questions not only for your documents, but also include the data, by simple generation of SQL statements maybe, or by code generation, to give the engineer or the test driver the possibility to explore the data very easily.

Demetrios [00:30:47]: Wait, so explain this more because this seems awesome. So basically you're combining these two worlds, but I didn't quite understand what that entails.

Markus Stoll [00:31:01]: For example, you could ask the system: which was my best round on the test drive? And the system creates an SQL statement to query this data from the store and find the fastest round. Or you can try to ask more difficult questions, maybe to find what could be the reason for something: I heard something was different than usual, there was an anomalous sound in round five, what could this be? And then the system can try to bring the relevant data to you.
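
A hedged text-to-SQL sketch of the first example; the model choice, schema, prompt, and database file are all assumptions made for illustration:

```python
import sqlite3
from openai import OpenAI

SCHEMA = "laps(lap_id INTEGER, driver TEXT, lap_time_s REAL, track TEXT)"
question = "Which was my best round on the test drive?"

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[{
        "role": "user",
        "content": f"Schema: {SCHEMA}\nReply with one SQLite query only, answering: {question}",
    }],
)
sql = resp.choices[0].message.content.strip().strip("`")

conn = sqlite3.connect("test_drives.db")  # hypothetical database file
print(conn.execute(sql).fetchall())
```

In practice the generated SQL would be validated before execution; this sketch skips that step.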

Demetrios [00:31:48]: So it's interacting with the tabular data in natural language.

Markus Stoll [00:31:54]: Yeah, that's the idea. Maybe by simple SQL statements, maybe by code generation, but the idea is to make the tabular data accessible for test drivers or test engineers, not only for ML or data engineers.

Demetrios [00:32:14]: Yeah. So it's that test driver who says: I heard, when I was going around that fourth lap, there was something that was making a weird sound, or the brakes weren't working as well as I would have hoped. Can you pull up why that could have been? And so then you have that type of thing. Yeah, that's very cool.

Markus Stoll [00:32:38]: Yeah. And the big point, I think, is what the result of the system is then. At the moment, maybe you get an answer in text, but I think it would be much better to get some visualization of the different channels that were relevant for this event, and maybe a highlight of the specific event in these channels, and maybe also a visualization of the whole data point in your similarity map with the UMAP representation. So I think just answering the question with text is okay, but a very good user experience would be that you then get an automatic zoom into the data that is interesting for you.

Demetrios [00:33:24]: Yeah. More than you could ever want. And then you can go and you can use what's relevant to you. So then now talk to me about how you're seeing multimodal being used. Because you mentioned there's audio data, there's visual data, there's also the sensor data, and it feels like all of that is playing a part to give you a much better and robust picture. And so what do these things entail? That's almost like what you were just saying, where at the moment when you're asking this text to SQL type of question, you're just getting text back as an answer. But in reality what you really want is all the relevant data. And so talk to me about the multimodality.

Markus Stoll [00:34:17]: I think at the moment, for us, it's a common approach to have different models, so also different embeddings, for the different modalities. Maybe in the future it would also be possible to have all the different modalities combined into one model, trained on the behavior of the car simultaneously in the signal space, in the audio space, and in the corresponding image space. But at the moment, maybe that's a little bit too tough for us. I usually try to keep things simple, so I prefer to use one model that is specifically trained for audio, and use this maybe for embedding the results or for classification, and add a second model trained on video or images for the second group of embeddings or classifications. Of course, for classification it could make sense to train a classifier on the combined output of both models, the embeddings; that could actually be a good idea. But yeah, we're currently not working on one model for all modalities together, but on different models, each pre-trained on data from the Internet and maybe fine-tuned on data from the different projects, but usually specific to one modality: for audio or other signals, and a different one for images.
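
A small sketch of that combined-embeddings classifier idea, with placeholder data and a logistic regression standing in for whatever model a project would actually use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

audio_emb = np.random.rand(300, 128)   # from an audio-specific model (placeholder)
image_emb = np.random.rand(300, 256)   # from an image-specific model (placeholder)
labels = np.random.randint(0, 2, 300)  # placeholder labels, e.g. "ok" vs. "defect"

X = np.concatenate([audio_emb, image_emb], axis=1)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```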

Demetrios [00:36:05]: Yeah, it feels like the ROI of trying to figure out if one model can rule them all might not be there, especially if you're finding success with each model being very specific to the specific modality. I could see that.

Markus Stoll [00:36:23]: I think the big problem in the domain we are currently working in, the industrial AI domain, is that you can't easily use a pre-trained model that is trained on data from the Internet, because the sensor data or the audio data that is relevant in the project is very, very different from the data you can find on the Internet and the models pre-trained on that data. I think this is very clear if you look into, for example, image processing. So you get a classifier pre-trained on Internet data: it's cats and dogs and persons, all the large datasets. In industry, you then have maybe very, very boring images from a production machine where you want to classify products, whether the result looks okay or has a defect. So that's a very different use case than what the models are trained on.

Demetrios [00:37:42]: So are you creating all your own custom embedding models, or are you fine-tuning different models that already exist, like just grabbing base models from Hugging Face and then fine-tuning them? Or both?

Markus Stoll [00:37:56]: So usually we try to fine-tune models on our very different data. But I'm not sure if there's one strategy for us that always works. We always have to figure out what's best for the specific project; it's always different. Sometimes we can get very good results based on pre-trained models, but sometimes it's clear that we have to do a lot of custom fine-tuning, or even train from scratch. We also sometimes train models from scratch for specific use cases. For example, this detection of part connections for the simulation was completely trained by us, without initial weights.

Demetrios [00:38:48]: Yeah, it sounds like it can get messy, especially because you're not getting that much data out there on the free Internet.

Markus Stoll [00:38:56]: Yeah, exactly. It's also always tough to get enough data. In the projects, there's often a lot of data available, but quality is always challenging. And of course, you often want to have labeled data. And that's rare.

Demetrios [00:39:20]: Yeah, that's expensive and rare.

Markus Stoll [00:39:24]: So usually we have to set up a labeling process in the projects. At the moment, not for the RAG projects, of course, that's a little bit easier, but in the engineering domain with test data or 3D mesh data, there's always a lot to do on the labeling side.

Demetrios [00:39:44]: Well, yeah, it's a little bit counterintuitive because as we were just talking about, you get so much data when you go for a test drive in one of these cars. But then you need to label that data, you need to clean that data. And just one test drive, I imagine, will take you so long to get actual gold standard data from.

Markus Stoll [00:40:07]: Yep, exactly. And also, the data you have is often scattered across different systems based on the measurement. For example, for your sensors, maybe you have the data in a different subsystem than for your audio data, because the device is from a different vendor. So you have to combine the signals again, and you maybe have to re-synchronize them, because you don't know how to align them.

Demetrios [00:40:36]: No, wait. So what do you do there? Because I just think about when I make music and I have two microphones, I clap so that I can easily synchronize them after the fact in post production. But I don't see you going before a test drive and clapping so that you can easily synchronize them.

Markus Stoll [00:40:59]: Yeah, you have to do something like that to get them synchronized. Or maybe you can try to have some really good clocks, where you can use the timestamps for the synchronization. Or, if you don't have that, then you can go for specific events that you can use to re-align the signals. Because often you not only have an offset between them, you also have a drift, because the frequencies of the measurements are not perfect, and then you have to do this not only at the beginning, but again and again.
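
A minimal sketch of event-based alignment with cross-correlation, assuming two channels at the same sampling rate that both see the same event (handling drift would mean repeating this over successive chunks):

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

a = np.random.rand(10_000)   # placeholder channel from device A
b = np.roll(a, 250)          # placeholder channel from device B, a shifted copy

corr = correlate(b, a, mode="full")
lags = correlation_lags(len(b), len(a), mode="full")
offset = lags[np.argmax(corr)]  # number of samples to shift b to align with a
```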

Demetrios [00:41:46]: Yeah. And so that feels like just a whole lot of time that you're spending trying to get that data so that it's synchronized and then you label it, and then maybe you can do something with it.

Markus Stoll [00:42:02]: And a different problem that usually occurs is that you don't always have a good standardization of the test drives. So you have different test drives with signals, and maybe some different captions for the signals. And you have to combine this all, if you want to train from it, into some dataset that is standardized. And this can also be very challenging if you have been recording for a long time. That's actually something we also want to try using LLMs for: to find out what the current name of a channel is and what its name could have been, maybe in a test drive from five years ago. Doing this matching, which column has what meaning, is something we're trying to figure out with LLMs at the moment.

Demetrios [00:43:09]: Yeah, it's like, hey, given these attributes, given this type of data, given this last data set, give us your best guess.

Markus Stoll [00:43:19]: Yeah, exactly. And if that doesn't work out, maybe also by comparing the data, looking at what ends up with similar embedding results, for example. So based on similarity, you can try to do this matching, just to get a good representation at the beginning of the dataset.
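
A hedged sketch of this similarity-based matching, embedding channel names and pairing old to new columns by cosine similarity (the model name and all column names are assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

old_cols = ["eng_rpm", "brk_press_fl", "acc_x"]  # hypothetical legacy channel names
new_cols = ["engine_speed", "brake_pressure_front_left", "longitudinal_acceleration"]

model = SentenceTransformer("all-MiniLM-L6-v2")
sim = cosine_similarity(model.encode(old_cols), model.encode(new_cols))
for i, old in enumerate(old_cols):
    print(old, "->", new_cols[sim[i].argmax()])
```

The same idea works with embeddings of the signal data itself, as mentioned, rather than the channel names.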

Demetrios [00:43:41]: Yeah. So now I'm understanding more clearly and vividly why you were mentioning at the beginning of this conversation how you felt like there weren't proper tools in place for this type of thing. It feels like you're probably doing a lot of stuff manually. It's taken you a ton of time. There's a bit of pain, whether it's the types of data being unsynchronized or coming from different vendors. So the sensors are going to be completely different. You have that data drift, and being able to get that data in order is the first step before you can even think about any of this other cool stuff that you're doing with it and the embeddings and all that fun stuff, you gotta just get that data in order.

Markus Stoll [00:44:33]: Yeah, usually we try to focus on specific questions in the projects, and we only have to clean up the data for the specific use case. So I don't think it's a good idea to start with cleaning up the data until it's perfect; rather, try to clean up the data and get all the data you need for a specific use case, like a classification for a very specific thing in the data. Because if you try to get the perfect data at the beginning, maybe you take a lot of time to create a very good dataset, and I think often you don't know enough about how the perfect dataset would look until you start trying to use it, for example by training a classifier and actually using this classifier. So I think it's very important to go through all the steps of the project, through the whole pipeline, very fast, and then iterate from the beginning again. So don't create a perfect dataset, but just make a start, use it, and find what the real problem in the dataset is and fix that, because you can't fix all the problems there.

Demetrios [00:45:57]: So that's fun to think about. It's taking that same mentality of having an MVP. And I think I've heard somewhere someone saying: if you're not embarrassed by the first version of your product, you waited too long to ship. And so you're saying, hey, just get something for one use case so you can start to understand what the data should look like, and then slowly expand your use cases. And by the time that you have more and more use cases, then you'll understand what the larger corpus of data should look like. Because if you aren't doing it use-case driven, you are at risk of spending a lot of time curating that data in a way that isn't going to be useful for the use case.

Markus Stoll [00:46:49]: Yeah. And maybe also, I think a very important point is that maybe in the beginning you don't even understand the use case. I think it's very important to close the loop, to get an understanding of what the customer or the user actually needs in the end. That's also true for our RAG projects at the moment. We usually start with a vanilla, very simple RAG system, just one that can be implemented in a few days or a few weeks, then give it to the customer and look at how they use it, what questions are being asked, and then we can try to optimize into some multi-step RAG or all the fancy stuff you can use to increase the quality. But really start with a vanilla RAG; keep it simple. Yeah.

Demetrios [00:47:42]: Get something out there fast, see what is hitting, and then learn and iterate from there. It's a great strategy for doing that. I also like that. I think you can do that with just about anything. And if you are coming into it with that mindset of let's just test and iterate and test and iterate, and eventually we're going to start to hit the direction that we want to be going in. Or as I've heard other people put it, it's almost like in the beginning you don't have to know exactly where you want to go. You just want to be directionally correct.

Markus Stoll [00:48:21]: Yeah.

Demetrios [00:48:23]: So as long as you're moving in the right direction, you're good. Well, this has been awesome, man. I appreciate you coming on here and taking the time to have a second conversation with me, because our first one didn't get recorded. But I'm just lucky, because I get to chat with you two times now instead of one.

Markus Stoll [00:48:42]: Thanks for the invitation.
