GPU Accelerated Data Curation for LLMs // Ryan Wolf // DE4AI
Ryan is a Deep Learning Algorithm Engineer at NVIDIA, focusing on data curation for foundation models. Previously, he graduated from UIUC and worked on AI systems at the healthcare technology company Advent Health Partners. Right now he is focused on the development of NeMo Curator, an open-source library that enables GPU-accelerated data curation.
Scaling large language models is a well-discussed topic in the machine learning community. Providing LLMs with equally scaled, well-curated data is less discussed but incredibly important. We will examine how to curate high quality datasets, and how GPUs allow us to effectively scale datasets to trillions of tokens with NeMo Curator.
Adam Becker [00:00:01]: Ryan, I think you're already. You're tuned to go. To go next. Maybe you'll bust in song. Maybe you'll start telling us, what are some songs from Hamilton? I can't tell you. All right. The correct answer is Ryan. Oh, my God.
Adam Becker [00:00:22]: Ryan, maybe it is true. Maybe you are gonna bust into song. Okay. I learned a lot from this, and in particular, I learned that. Ryan. Ryan, are you with us?
Ryan Wolf [00:00:42]: Yeah, I'm with you.
Adam Becker [00:00:43]: Is it true, or was that.
Ryan Wolf [00:00:46]: It is true. It is true, and I'm a little bit disappointed. You thought that was an embarrassing fact. I'm quite proud of it. I think it's. I think it's a skill. It's a skill I refined over many years.
Adam Becker [00:00:54]: Well done. When did you. You saw it in person, or you saw it in the. How did you first watch it?
Ryan Wolf [00:01:00]: No, I didn't see it in person. No. I listened to the soundtrack first for the first couple years, and then when they came out on Disney, I watched it there, and then I finally went and saw it in person a couple years ago.
Adam Becker [00:01:08]: You saw it. Okay. Was it good just as you imagined, or even. Yeah.
Ryan Wolf [00:01:12]: Oh, better, better, better. So much better.
Adam Becker [00:01:15]: Did they change the cast? I heard they changed the cast from the first set, right?
Ryan Wolf [00:01:18]: Oh, yeah, yeah. It's been going on since, like, 2016 or something. If the same cast was doing it this whole time. Oh, man, I'd feel bad for them.
Adam Becker [00:01:24]: For people who are not well versed in it. What's the best song to look up?
Ryan Wolf [00:01:29]: Oh, man.
Adam Becker [00:01:30]: Or one of your favorite ones?
Ryan Wolf [00:01:31]: I would go with "My Shot." It's one of the opening songs, one of the first songs in the musical. It's very good.
Adam Becker [00:01:36]: "My Shot." Okay, Ryan, this is your shot. Now, do you have your screen up? You do?
Ryan Wolf [00:01:43]: I think I do, yeah.
Adam Becker [00:01:44]: Today we're gonna talk about GPU-accelerated data curation for LLMs. Take it away, Ryan.
Ryan Wolf [00:01:51]: Great. Thank you so much. I really appreciate it. So, yeah. Hello, everyone. My name is Ryan, and I'm a deep learning algorithm engineer here at Nvidia. And for the past year or so, my team has been working on developing tools to assist researchers in curating large datasets for training LLMs and multimodal models. And I'm coming here today to share some of the challenges that we faced and some of the lessons that we've learned when scaling our data curation efforts.
Ryan Wolf [00:02:18]: In particular, I want to focus on some of the steps of our pipeline that we were able to accelerate with GPUs and share how we did it. Let me give a brief overview of what I want to cover. First, I want to give some motivation for data curation as a whole to explain why we're doing all this in the first place. Next, I'll walk through an example of what a typical data curation pipeline looks like for LLMs and discuss where in the pipeline we start seeing bottlenecks. And then we can talk about some of the individual steps that we've looked at and GPU accelerated, starting with fuzzy deduplication, then moving on to semantic deduplication, and then ending on some classifier models. Okay, let's talk about the goal. So data curation is all in service of the model. We evaluate the effectiveness of data curation methods entirely by how much they improve the downstream accuracy of the model.
Ryan Wolf [00:03:12]: As a bonus, if we can get some improved accuracy with fewer training steps, we'd like to get that too. And improving downstream accuracy through data curation typically comes from two things. The first is removing data that is actively harmful for your model to see during training. This is data that is just better off not seeing in the first place. Second is getting the best subset of data possible for your compute budget. Often you'll have more data than you can actually train on, so it's important to train on the best data that you can with the amount that you have. So an example pipeline kind of looks like this at a high level. Let's start from the beginning.
Ryan Wolf [00:03:49]: At the very left, you start off with your raw data hosted on various sources. This could be cloud storage, it could be your local workstation or your cluster. You first need to aggregate the data and then extract it into some usable form. For web data like Common Crawl, for example, this typically involves extracting plain text from the raw HTML pages. And next you might apply some basic cleaning and preprocessing steps. This would be something like fixing any character encoding issues that you might have in the data that are artifacts from the raw data itself or from the extraction process, or doing some simple language identification or preliminary quality filtering using some tiny n-gram models. You then might apply some simple heuristics, like the Gopher rules, to remove documents that are too repetitive in your dataset, or documents that might have too much punctuation for their length: very simple heuristics.
Ryan Wolf [00:04:44]: Then, in the later stages, we get into some of the beefier, more compute-heavy operations like deduplication. Typically you want to remove duplicate records from your dataset. This is because research has shown that repeating tokens during training is usually not as beneficial as seeing new tokens. But what counts exactly as a duplicate is a little bit subjective, and there are a couple of ways we can categorize them. So, exact duplicates are character-by-character duplicates in text, and you can easily identify them by hashing all your documents in the dataset and then comparing hashes to see if they're the same or not. Fuzzy deduplication is also character based and designed for text, but instead tries to capture near duplicates as well, documents that might only vary slightly. This could be some legal boilerplate that is common across multiple documents, but with slight variations like name changes.
Ryan Wolf [00:05:33]: Or it could be an article that heavily quotes another article in your dataset. We'll discuss the exact algorithm that's used for determining what exactly a duplicate is under fuzzy deduplication later. Semantic deduplication is the final deduplication method that I'll mention. Instead of looking at character-level similarity of the documents, it tries to assess the semantic meaning of each document or image and compare the semantic meaning via embeddings. So it'll embed each document in the dataset and then remove documents that have too-similar embeddings. Semantic deduplication is unique in that it can be applied across modalities, and we'll discuss more about that and the benefits of cross-modality deduplication later. Data annotation is the last stage, where you might also want to annotate and filter data based on other signals and more complex classifiers like BERT models. And then finally, at the very end of your pipeline, you can aggregate all of your data from the original sources and from each individual curation pipeline and blend it together.
Ryan Wolf [00:06:36]: And through this pipeline, you might see a pattern start to form. We start at the beginning doing some of the less computationally heavy stuff, some of the simple heuristics, some of the very tiny n-gram models, and then slowly build up to the deduplication and data annotation with bigger models. This is simply because the scale of datasets today is so large that you have to really be mindful of the data that you're processing and try to minimize the amount of data that you process with some of the more compute-heavy methods. And so, speaking of scale, scale is really the primary concern with data curation right now. Llama 3.1 was trained on 15 trillion tokens, but the raw dataset size was almost certainly much larger. And several open datasets that I've listed here are based on dozens of Common Crawl snapshots totaling hundreds of terabytes in raw size. And for this reason, we can't apply the heavy classifiers on all of our data. Even though BERT models are thought of as small in terms of modern LLMs, they're usually way too big to run across hundreds of terabytes of data given the limited compute budget that you might have.
Ryan Wolf [00:07:34]: But I do want to focus in on these final few steps of the pipeline that I brought up, the deduplication and the data annotation, because that's where we saw the best opportunities to heavily speed up our curation. So we found that exact, fuzzy, and semantic deduplication can be accelerated with GPUs. And while BERT classifier models are typically already on the GPU, we still found other ways to optimize them for offline inference, and similarly for semantic deduplication, we've optimized the embedding creation along with the rest of the deduplication pipeline. So now let me walk you through the first step of the pipeline that I want to talk about, which is fuzzy deduplication, and the process that our team went through when developing this. So we knew that fuzzy deduplication was an important part of the curation process. It allows us to identify documents that are similar on a character level, like I mentioned before. And the process for identifying fuzzy duplicates goes like this.
Ryan Wolf [00:08:30]: You take your dataset, which is a collection of documents, and you generate a MinHash signature for every document. MinHash signatures are basically a list of hash values for a given document, and these hashes are computed in a special way, such that if two documents overlap for, say, 80% of their MinHash values, the character-level overlap is likely 80% as well, meaning that around 80% of the content of each document is the same between them. Now, it would be great if we could pairwise compare all of the MinHash values in our dataset with all the other values in the dataset. But again, scale is an issue here, and that is impossible when we get to billions of documents. So instead, we approximate the comparison by grouping the MinHash signature into buckets. Remember, a signature is just a list of hash values, so grouping in this case just means segmenting the list. So if you have a signature with 200 hashes and you have 20 buckets, you'll have ten hash values per band, or per bucket, I should say.
Ryan Wolf [00:09:29]: And so then in this LSH step, we combine all of the hash values in a single band with another hash function, such that we're left with 20 hash values total. We then consider documents duplicates if any of these resulting 20 hash values are the same between two documents. We also consider the network of duplicates here. Because a document has multiple bands in it, it will still have multiple hash values, so it can collide with multiple documents at once. And so you can imagine, let's say, the first and second documents in your dataset share the same hash value in one band, and then the first and third share the same value in the third band. We consider all three documents to be part of one connected component of duplicates that we then try to identify, remove the duplicates, and keep only one for the training set. So our team wanted to do this at large multi-terabyte scales, but we were running into bottlenecks on the CPU.
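As a rough illustration of the MinHash-plus-LSH scheme described above, here is a minimal, self-contained Python sketch: character shingles, 200 hash functions with the minimum kept per function, then 20 bands of 10 values hashed down to one key each so that a single band collision flags a candidate duplicate. This is a toy under assumed parameters, not NeMo Curator's implementation; the hash construction, helper names, and example strings are illustrative choices only.

```python
import hashlib

NUM_HASHES = 200      # hash functions per document (signature length)
NUM_BANDS = 20        # bands for LSH; 200 / 20 = 10 values per band
ROWS_PER_BAND = NUM_HASHES // NUM_BANDS
SHINGLE_LEN = 20      # sliding-window width in characters


def shingles(text: str, k: int = SHINGLE_LEN) -> set[str]:
    """Character-level sliding windows ("shingles") of length k."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def hash_family(seed: int):
    """One cheap hash function per seed, built on top of md5."""
    def h(shingle: str) -> int:
        data = f"{seed}:{shingle}".encode()
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big")
    return h


HASHES = [hash_family(seed) for seed in range(NUM_HASHES)]


def minhash_signature(text: str) -> list[int]:
    """For each hash function, keep the minimum value over all shingles."""
    shings = shingles(text)
    return [min(h(s) for s in shings) for h in HASHES]


def lsh_band_keys(signature: list[int]) -> list[int]:
    """Hash each band of the signature down to a single integer key."""
    keys = []
    for b in range(NUM_BANDS):
        band = signature[b * ROWS_PER_BAND:(b + 1) * ROWS_PER_BAND]
        keys.append(hash((b, tuple(band))))
    return keys


# Toy "legal boilerplate" documents that differ only in a name.
doc_a = "This license applies to Ryan. " + "Standard boilerplate terms. " * 30
doc_b = "This license applies to Adam. " + "Standard boilerplate terms. " * 30

sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
# Fraction of matching signature positions approximates the shingle overlap.
print(sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_HASHES)
# Candidate duplicate if any band key collides.
print(any(x == y for x, y in zip(lsh_band_keys(sig_a), lsh_band_keys(sig_b))))
```

In a real pipeline the band keys would be grouped across the whole corpus so that any two documents sharing a key fall into the same bucket, rather than being compared pairwise as in this two-document toy.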
Ryan Wolf [00:10:22]: We ran our original CPU implementation on a multi-node CPU setup, and we benchmarked it. We had originally planned on scaling up this deduplication to orders of magnitude more data, but by our estimate, it would take weeks to deduplicate, and our research team couldn't wait that long. And so my teammates built a GPU-accelerated version that does the entire process that I previously outlined on the GPU. We use Dask-cuDF, a distributed data frame library that's also GPU accelerated, to parallelize the MinHash and LSH computations across multiple nodes, and cuGraph provides the GPU acceleration to compute the connected components at the very end of the pipeline. So the entire end-to-end pipeline can now run entirely on the GPU. And as you can see from the numbers I've given here, we were able to drastically reduce the time that the deduplication took. The figure on the right demonstrates some of the ablations that we've done with this method, demonstrating that this deduplication provides a significant benefit to the downstream model performance compared to not deduplicating your data at all.
Ryan Wolf [00:11:30]: So, yeah, that's the overarching view of fuzzy dedupe and one of the modules that we have accelerated with GPUs. Now, I want to go into a different form of deduplication, semantic deduplication, which has become really popular in data curation recently. Semantic deduplication, like I mentioned at the beginning of the presentation, is very popular for multimodal models, and it works like this. You take all of the elements in your dataset, whether that be text, images, or whatever modality you're working with, and you generate an embedding for each one of them. In the diagram, this generation of embeddings is just represented by populating this plot with dots. Now, we'd ideally like to compute the pairwise similarity between all of the embeddings in our dataset using something like cosine similarity or L2 norm. But just as with fuzzy dedupe, the pairwise computation across an entire dataset is impossible. So as a preliminary step, we use k-means clustering to cluster the embeddings into different segments.
Ryan Wolf [00:12:30]: And so here in the diagram, that's represented as assigning a different color to each of the clusters. Now, we can compute the pairwise similarities within each cluster, thereby reducing the computational complexity and actually providing a scalable solution. And if two or more embeddings are within a certain distance of each other, which you can tweak via a threshold, we typically keep the one that's closest to the cluster centroid and remove the rest.
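Here is a minimal CPU sketch of that semantic deduplication flow: embeddings are clustered with k-means, and within each cluster any item whose cosine similarity to an already-kept item exceeds a threshold is dropped, preferring the item closest to the centroid. It uses scikit-learn as a stand-in; the `embed` step, cluster count, and threshold are placeholder assumptions, and NeMo Curator runs the equivalent steps on GPU with libraries such as cuML.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity


def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 8,
                   sim_threshold: float = 0.95) -> list[int]:
    """Return indices of documents to keep after semantic deduplication.

    embeddings: (n_docs, dim) array, e.g. from a sentence-embedding model.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(embeddings)

    keep: list[int] = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        # Rank cluster members by distance to the centroid (closest first),
        # so the most central member of a duplicate group is the one we keep.
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        order = idx[np.argsort(dists)]

        sims = cosine_similarity(embeddings[order])  # pairwise, within this cluster only
        kept_positions: list[int] = []
        for i, doc in enumerate(order):
            # Drop this document if it is too similar to one we already kept.
            if any(sims[i, j] >= sim_threshold for j in kept_positions):
                continue
            kept_positions.append(i)
            keep.append(int(doc))
    return sorted(keep)


# Usage sketch (embed() is a hypothetical embedding step):
#   embs = embed(list_of_documents)
#   keep_ids = semantic_dedup(embs, n_clusters=8, sim_threshold=0.95)
```

Clustering first is what makes this scalable: the expensive pairwise similarity matrix is only computed within each cluster, never across the full dataset.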
Ryan Wolf [00:12:59]: Yeah, so let's discuss the importance and the impact of this method. In the chart on the right, you can see that for image curation in particular, semantic deduplication is incredibly valuable for improving your model's downstream accuracy. In this case, it's able to match the performance of a baseline model with only half the number of training iterations. In terms of the time spent on semantic deduplication, most of the time is spent on the embedding creation. And most folks understand that you need to run embedding creation on a GPU, but we have found some unique ways to squeeze more performance out of the embedding creation, for text in particular, that people might not be aware of. I'll discuss more on that later when I get to the classifier models. But besides that, we also found that moving the k-means clustering step from CPU to GPU and accelerating it with cuML provides a pretty good speedup as well.
Ryan Wolf [00:13:51]: The thing that we're the most proud of, though, is not really any individual speedup number, but the scalability of this, and the fact that I think we were the first open-source solution that can scale to multiple nodes and multiple GPUs for semantic deduplication. And so we're currently seeing how far we can take that in our experiments. And lastly, let's talk about the classifiers. In particular, I want to focus on single-GPU classifiers in the 100 million to 1 billion parameter range. These classifiers are typically already run on the GPU, compared to smaller, tiny n-gram models that might have only hundreds of parameters or something. And these classifiers are often used at the end of the pipeline to help annotate your dataset with some other quality signal. It could be a domain label. For example, if you're trying to train an LLM that's very proficient in sports, you might use a domain classifier to find all the sports data in your datasets and only fine tune or train the LLM on that.
Ryan Wolf [00:14:50]: Or you can use a simple quality BERT model to add another layer of filtering on top of it and say, hey, I only want the highest quality documents, the ones that this more sophisticated model deems high quality. And these classifiers, again, while small in comparison to state-of-the-art LLMs, are still large enough that they really need to be run at the end of the pipeline, on as little data as possible. And since these models typically already run on the GPU, you might be wondering what can be done to speed them up at all. There are actually a few speedups we can get kind of for free that don't involve changing the underlying model at all. First, by profiling the memory usage of the model at various sequence lengths and batch sizes, we can dynamically adjust the batch size based on the sequence lengths in the batch. This generally looks like saying, hey, if you have a bunch of small sequences, let's just increase the batch size so we utilize all the GPU memory that we have available. And for some BERT models, we can also accelerate them with GPU tokenization. Going back to the semantic deduplication, these methods are also how you can get a speedup in the embedding creation, again for free, without changing the model at all.
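Here is a minimal sketch of that "free" inference speedup: bucket inputs by token length and pick a batch size per bucket from a pre-profiled memory budget, so short sequences get large batches and long ones get small batches. The profiling table, token counts, and helper names are made-up placeholders; the real values would come from actually measuring GPU memory at different sequence lengths and batch sizes for a specific model.

```python
from typing import Callable, Iterator

# Hypothetical profile: max batch size that fits in GPU memory for a given
# maximum sequence length, measured offline for a specific BERT-sized model.
PROFILED_MAX_BATCH = {128: 256, 256: 128, 512: 64}


def batch_size_for(seq_len: int) -> int:
    """Pick the profiled batch size for the smallest bucket that fits seq_len."""
    for bucket in sorted(PROFILED_MAX_BATCH):
        if seq_len <= bucket:
            return PROFILED_MAX_BATCH[bucket]
    return min(PROFILED_MAX_BATCH.values())  # fall back to the smallest batch


def dynamic_batches(docs: list[str],
                    token_len: Callable[[str], int]) -> Iterator[list[str]]:
    """Yield batches of documents, sized by the longest sequence in the batch.

    Sorting by length keeps each batch's padding (and memory use) tight.
    """
    docs = sorted(docs, key=token_len)
    i = 0
    while i < len(docs):
        bs = batch_size_for(token_len(docs[i]))
        batch = docs[i:i + bs]
        # Re-check against the longest sequence actually in the batch.
        bs = batch_size_for(token_len(batch[-1]))
        yield docs[i:i + bs]
        i += bs


# Usage sketch (tokenizer and model are placeholders for a Hugging Face setup):
#   for batch in dynamic_batches(corpus, lambda d: len(tokenizer.encode(d))):
#       inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to("cuda")
#       logits = model(**inputs).logits
```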
Ryan Wolf [00:15:58]: Now, as you can see here, we've ablated on a quality classifier that I mentioned in the previous slide and demonstrated that, hey, this is a pretty good performance boost from using the quality classifier. And again, using our techniques, we can accelerate the inference of the model by 40% without changing anything in the model, just plain PyTorch. You can download the model, strap on these profiling techniques, and get a free performance boost. And so, yeah, that's pretty much it. We've learned a lot about curating datasets, and my team is really trying to share all the knowledge and the tools that we've built. I've provided a link to the NeMo Curator repository. This is where all the methods and tools that I've discussed so far are being developed, have been developed, and will be developed. We're adding new stuff all the time.
Ryan Wolf [00:16:46]: And I've also provided a link to the documentation for NeMo Curator to help you get started. There's also a link to the Docker container, so you can pull it down and immediately start running the code on your machine right now, right after the presentation if you want. And yeah, I'll be here for questions, obviously, at the end of this, but if you don't get your question answered or you think of something later, here's a link to my email and other ways you can contact me. So, yeah, finally, credits. I want to give a huge shout out to all my teammates. My teammates deserve all the credit for everything that I've mentioned here. They've been fantastic.
Ryan Wolf [00:17:23]: It's been a huge effort over the past year, and so many people have made so many great contributions to get this to where it is. And also, thank you to the organizers for letting me come out and give a talk about this. So, yeah, I'll now take questions.
Adam Becker [00:17:35]: Thank you very much. This was excellent. And I was actually struggling with a very similar problem for a very long time. And this is a very tricky and pernicious problem to solve. So I want to see if other people have questions, and until they do, can you go back a few slides, to the one with the MinHash?
Ryan Wolf [00:17:55]: Sure.
Adam Becker [00:17:56]: And I want to know how deep I can go in terms of. Okay, so the.
Ryan Wolf [00:18:00]: This one?
Adam Becker [00:18:01]: Yeah. So let's just, like, break it down step by step. We have a bunch of documents, and we suspect that there might be some duplicates there, but they're not duplicates in the typical sense. It's not necessarily an identical document, where you can just produce some kind of hash of the document and then compare the hashes. There's going to be something that is different here, right?
Ryan Wolf [00:18:22]: Yeah.
Adam Becker [00:18:22]: Okay, so. And then let's just imagine that we're dealing with just, like, the single-modal kind of sense. So let's say just text. Can you remind us what might be one example of such a near-identical document that isn't exactly identical?
Ryan Wolf [00:18:41]: Sure. Yeah. I think a good example of this would be a legal document, for example, where you might have a license file or something that is very similar, has a lot of the same boilerplate across many different things, but might have different people in charge. You might say, hey, this copyright applies to me, Ryan, but not to you, Adam. And it's the exact same boilerplate that comes after that, where it's like, hey, here's all the terms of service and such, but at the very beginning, you have a slight difference. So that's one example.
Adam Becker [00:19:08]: And so. But even in that situation, then, considering those duplicates is a function of the business requirements. Right. Like, the business would have to tell you, for what purposes? Because I can imagine situations where you wouldn't want to deduplicate; both of those might sound totally different. Right. But if all you're doing is looking for templates of documents, well, then, yes, in that case, they're, like, virtually the same. So is that one way for us to think about it, right? Like, there isn't just one right answer here.
Adam Becker [00:19:40]: There's like a business dependence.
Ryan Wolf [00:19:42]: Sure. Yeah, yeah. This is very business dependent. What I'm talking about in this presentation, and what I focused on, was more the foundation model pre-training side of things, where it's like, okay, when you're feeding trillions of tokens to an LLM, you don't really want to see repeated tokens more than, like, four times, because then you're just wasting compute when you could be seeing fresher tokens and getting better performance out of your model.
Adam Becker [00:20:03]: Yeah, great. So it's almost like just increasing diversity.
Ryan Wolf [00:20:06]: Yes, exactly.
Adam Becker [00:20:07]: And then, like. And, like, minimizing cost of just. You don't want to just train on the same thing over and over again.
Ryan Wolf [00:20:12]: Exactly.
Adam Becker [00:20:13]: Okay, great. So now we took those documents, and we computed Minhash for them. Do you know how that's done? Yeah, tell us a little bit about that.
Ryan Wolf [00:20:22]: Yeah, yeah. So the MinHash, it's a very interesting idea. Essentially, the MinHash is kind of an approximation of this Jaccard similarity, is what it's called. And so what Jaccard similarity does is it basically takes a sliding window over your text, let's say of, like, 20 characters or something. So you might have a sliding window over your text that's 20 characters long. You start with the first window.
Ryan Wolf [00:20:48]: You slide it over one character, then you have a new window. In a perfect world, you would save all of these windows. You'd end up with a giant list of windows of characters. You do this for every document in your dataset. Then what you would do is compare each list between two documents. You'd say, hey, I have this 20-character element in this list. Is it in this other document as well? And you would do this for each window in the set for a given document. And then the number of windows that are the same, or common, between them is the amount of overlap that you have between your documents. And so what min hashing does is say, okay, I can't do that, because that's a really expensive operation if your documents are big.
Ryan Wolf [00:21:38]: What I'm going to do instead is take a hash value of each window. In particular, I'm actually going to take several hash values. I said 200 in my example: I'm going to hash each window 200 different times. And then, cumulatively, as I do that for each window, I'm only going to retain the minimum value of each hash across all the windows. Okay, so that looks like you took each window.
Adam Becker [00:22:02]: For each window, you came up with, like, multiple hashes.
Ryan Wolf [00:22:07]: Yes. You hash it using 200 different functions.
Adam Becker [00:22:10]: Oh, different functions. And how do they... Okay, so you just run, like, different hashing functions. Yes, totally different. Okay. So now you have, let's say, 200 different hashing functions per window. So now you have 200 values per window, and then you have 200 values per other window.
Ryan Wolf [00:22:32]: Yep.
Adam Becker [00:22:33]: And then you try to find that. Yeah, go ahead.
Ryan Wolf [00:22:34]: And then, for each one of these hashes that you've generated, each one of these 200 across all of your windows, you just take the minimum one of each hash. So each hash is just an integer, and you just take the minimum one of each. And this is mathematically proven to be a good approximation of the Jaccard similarity, because then you can do a set comparison between, instead of all the windows, just the min hashes of each set, and say, hey, how much overlap is there between the min hashes of one document versus another document?
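To make that "good approximation" point concrete, here is a small self-contained sketch that compares the exact Jaccard similarity of two shingle sets against the fraction of matching MinHash values. The shingle length, hash construction, and example strings are assumptions for illustration; with 200 hash functions the estimate lands close to the exact value.

```python
import hashlib

def shingles(text, k=20):
    # Character-level sliding windows of length k.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hash_val(seed, shingle):
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash(text, num_hashes=200):
    s = shingles(text)
    # One minimum per hash function, taken over all shingles.
    return [min(hash_val(seed, sh) for sh in s) for seed in range(num_hashes)]

a = "the quick brown fox jumps over the lazy dog " * 40
b = "the quick brown fox leaps over the lazy dog " * 40

sa, sb = shingles(a), shingles(b)
exact = len(sa & sb) / len(sa | sb)                        # true Jaccard similarity
sig_a, sig_b = minhash(a), minhash(b)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(f"exact Jaccard: {exact:.2f}  MinHash estimate: {estimate:.2f}")
```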
Adam Becker [00:23:07]: Okay, so the min hash might turn out, let's say, for, like, for my 1st, 20 characters might come out to be five.
Ryan Wolf [00:23:14]: Yeah.
Adam Becker [00:23:15]: Right. And then we go to another document and you look at any one of those windows and it comes out to be eight.
Ryan Wolf [00:23:22]: Yup.
Adam Becker [00:23:24]: Right. This is the min hash for each one of those.
Ryan Wolf [00:23:26]: It's one of the min hashes, yes.
Adam Becker [00:23:28]: So wait, wait, it's the min hash for each window? It's the minimum. It's the lowest number for each of those windows.
Ryan Wolf [00:23:38]: Yeah. So the important thing to remember here is that, at the end of the day, what we're getting is a list of hash values. So each document is going to map to a list of hash values. And so, yeah, for a given document, one of the hash values might be five, and for the other one, one of the hash values might be eight.
Ryan Wolf [00:23:53]: And so in that case, you'd say, okay, that's not in common; that part of the document, or I say "part of the document" in a very abstract sense, but that part of the document is not the same. But you might have another hash value where this one is ten and this one is ten. And in that case, okay, these are the same, and so you have some similarity between the documents there.
Adam Becker [00:24:09]: Got it.
Ryan Wolf [00:24:10]: Because you know that for some window of text, they share the same value.
Adam Becker [00:24:15]: Right. Okay, so it's identical. We're looking for almost-identical. They're unlikely to be the same if they're different. If all the windows from one document are totally different from the windows in another document, then we're saying they're just, like, maximally dissimilar.
Ryan Wolf [00:24:34]: Yep.
Adam Becker [00:24:34]: Okay. That's right. So it's because there are other algorithms that might play a similar game, but with, like, string edit distance. That's not what this is.
Ryan Wolf [00:24:44]: No, not string edit distance at all. Nope.
Adam Becker [00:24:45]: Great. Okay, this is just straight up overlap that we're comparing. Great. Now you took this and then you put it into these buckets.
Ryan Wolf [00:24:56]: Yep. So the idea here is that we would ideally like to say, hey, you've got 200 values for each document in the dataset. Let's compare all of them with each other and measure the overlap. So for documents A and B, we know that, okay, 50% of the hash values are the same between the two of them, and so we're going to say that they're 50% equal. We'd love to do this, but we can't, because the dataset's too big. So what we do is an approximation.
Ryan Wolf [00:25:23]: We split up the MinHash signatures into individual components. So this would be like we group them, basically. So let's just imagine taking one group of this. We have 200 hash values to begin with. Now let's just take the first ten and hash them again; let's combine them together and hash them again. And then we do this for the other document: we combine them together and hash them again.
Ryan Wolf [00:25:49]: Now, we're only gonna get a collision if they're all the same, right?
Adam Becker [00:25:53]: Mm hmm.
Ryan Wolf [00:25:55]: But we do this 20 unique times, once for each band. And so, at this point, we use this approximation to say, okay, if any one of these 20 collisions occurs, then we consider it a duplicate.
Adam Becker [00:26:07]: Got it. Got it.
Ryan Wolf [00:26:10]: And the intuition behind that is that we're saying, like, okay, if at least a good chunk of some part of the document overall is the exact same as a good chunk of another document overall, we can say that, okay, they're probably overall duplicates.
Adam Becker [00:26:30]: LSH. That's the locality-sensitive hashing?
Ryan Wolf [00:26:32]: Yeah. Locality-sensitive hashing.
Adam Becker [00:26:34]: Nice. Okay. And then you put that in as you're looking for connected components. And then you keep the one that is closest to, like, the centroid, based on some kind of radius. Yeah, go ahead.
Ryan Wolf [00:26:49]: Yeah, there is no centroid here. Sorry. So, we get the connected components where we say, hey, we know that these two documents overlap in one band; they have the exact same value in this one band. But there are multiple bands in each document, and so you may have document A that has the same band value as document B in the first spot.
Ryan Wolf [00:27:07]: But then in, like, the seventh spot or whatever, it shares the same band value as document C. In this case, it's like D0 shares the same first band as D4, and D0 also shares the same first band with Dn-1. But here you can see Dn-1 shares, like, the last band with D5. Theoretically, you actually could connect these.
Ryan Wolf [00:27:28]: Or you probably should connect these in the diagram, have this connection right here, where you're saying, okay, within each bucket, sure, you have a connection, but across buckets, too, you now have a connection. Because you know that one document was the same as this document, and this document is the same as this document, so by transitivity, they should all be roughly the same.
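That transitive grouping, where a band collision between D0 and D4 and another between D0 and Dn-1 merges all of them into one duplicate group, is exactly a connected-components computation. Here is a small CPU sketch with SciPy; in the pipeline described in the talk, the equivalent step runs on GPU via cuGraph. The edge list is placeholder data mirroring the example in the discussion.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

# Edges: (doc_i, doc_j) pairs that collided in at least one LSH band.
# Placeholder data echoing the talk's example: 0-4, 0-(n-1), (n-1)-5, with n = 8.
n_docs = 8
edges = [(0, 4), (0, 7), (7, 5)]

rows, cols = zip(*edges)
adj = coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(n_docs, n_docs))

# Undirected connected components: every doc in a component is a duplicate of
# the others by transitivity, so we keep one representative per component.
n_components, labels = connected_components(adj, directed=False)
for c in range(n_components):
    group = np.where(labels == c)[0]
    keep, drop = group[0], group[1:]
    print(f"component {c}: keep doc {keep}, drop {list(drop)}")
```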
Adam Becker [00:27:45]: Do you ever have a situation where you have to tweak, maybe even, like, the number of buckets? Because otherwise, you're just going to get, like, massive connected components, where all of them basically share enough in common with another one, and so all of them end up gluing in.
Ryan Wolf [00:28:01]: Yeah, yeah. That is definitely a parameter you need to tweak and be aware of. We've typically found that 20 buckets is what we use, or 20 bands, roughly, and that's good enough. Because each band, to be clear, is like a 32-bit integer or something, and so you need an exact collision in that 32-bit integer in order for it to register as a duplicate.
Ryan Wolf [00:28:23]: And so when you have 20 bands, each a 32-bit integer, usually that's good enough for, like, the multi-terabyte scale, we found.
Adam Becker [00:28:31]: Makes sense. This is fascinating, by the way. I mentioned I had worked on a similar problem, and I was trying to use a similar kind of solution. And I couldn't quite get that to work, because for the use case that I had, I simply couldn't tolerate any inaccuracies.
Adam Becker [00:28:58]: And I needed to fully explain every one of the steps. It was a smaller data problem, but I bumped up against the combinatorial challenge of having to make these comparisons. I was dealing with, like, 100 million records, having to make comparisons between all of them. Is this for, like, an entity resolution kind of deduplication situation? The moment you try to make comparisons across everything, it's an n-squared problem, and it blew up. So I needed to find different ways. I think they called it blocking, like coming up with these blocks within which you can then make comparisons.
Adam Becker [00:29:34]: And I think it's similar to your buckets a little bit.
Ryan Wolf [00:29:37]: Yeah, it's very similar to the buckets, and probably very similar to the clustering step in semantic deduplication. That's right.
Adam Becker [00:29:48]: Very cool. Ryan, thank you very much for sharing this with us. This is fascinating.
Ryan Wolf [00:29:53]: Sure. Yeah. Thank you so much for having me. I appreciate it.
Adam Becker [00:29:55]: Yeah, thanks.