MLOps Community

CV with Fashion Colors

Posted Jul 01, 2024 | Views 152
# RGB Values
# Multimodal Models
# Capsule
Kelsey Pedersen
Co-founder & CTO @ CAPSULE

Kelsey Pedersen is a co-founder and Chief Technology Officer at Capsule.


Kelsey demystifies extracting and applying fashion colors from product images to enhance e-commerce search algorithms.


Kelsey Pedersen [00:00:00]: So a little bit more about Stitch Fix. Stitch Fix was really interesting because I think it was one of the first companies that was really able to dial into specific attributes of fashion products. At Stitch Fix, even though we had over 250 data scientists, the large majority of merchandising attributes, as you can see, like color, pattern, style, length, were still manually input by our merchandising team. Some of that was beginning to be automated, and I'm sure a lot of it has become much more automated since I left. But with the advancement of computer vision models and generative AI, it's now a lot easier to parse out and identify these different fashion attributes and tag them in our database. So today I'm going to be talking about how we at Capsule have approached our first iteration of extracting fashion attributes, and specifically color, in our mobile application. So a little bit about Capsule.

Kelsey Pedersen [00:01:04]: Capsule is a visual e-commerce search engine. Like I said, the core part of our app right now is Shazam for shopping. How it works is you upload an image into our app, and then we show you the most visually similar products across the entire Internet. It's similar to Google Lens or Pinterest, but really focused on shopping. The main use case our users have in the app right now is finding visually similar options at the price point they shop at: being able to upload a photo of a product that's hundreds of dollars and find a dupe at H&M or Zara for the same look. The foundation of the tech we're building is extracting these various attributes from the user-uploaded image to return these visually similar options. And one thing that became very evident after we launched is that one of the most important distinguishing features of a product, from the user's side, is the color.

Kelsey Pedersen [00:02:12]: You can imagine, if we recommend two similar products but one's hot pink and one is blue, it's going to look really bizarre in our results. So, like I said, there are a lot of existing options. Google obviously has Google Image Search, but the way the majority of users begin their shopping journey today is still a text search in Google. You can see up here, most people looking for a blue dress will start with a pretty basic query like "blue dress," and Google returns a lot of very generic options. And many people, even if they know exactly what they're looking for, often lack the specific fashion vocabulary to refine their search with only text. The other thing to note here is that if you're searching "blue dress" as a user, you obviously don't have to be very nuanced, and Google doesn't really understand what the user is searching for, so they're able to get away with a much more varied color response.

Kelsey Pedersen [00:03:19]: Another example of color in our existing shopping ecosystem today is on Nordstrom.com. You can see up here, the way Nordstrom buckets its colors is at this very high level of blue, purple, green, yellow, orange, and users are able to click in and filter by blue. This is pretty similar across most e-commerce search engines: they keep the color hierarchy high level. Here are a few other examples. Shopbop, The RealReal, and Macy's all keep it at this higher-level hierarchy. But with Capsule, just to set the scene: if a user searches this blue dress and we return these options, nothing is close enough to the exact color of that initial dress for users to feel satisfied. So today we're going to be talking about two different strategies we've explored to extract color from a product. When we talk about color, we're mainly going to be talking about RGB values, which I'm sure most of you are familiar with.
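
The high-level bucketing that retailers like Nordstrom use can be recovered from a precise RGB value with a nearest-neighbor lookup. Here is a minimal sketch; the bucket names match the talk, but the RGB anchor values are illustrative assumptions, not anything Capsule or Nordstrom publishes:

```python
import math

# Representative RGB anchors for each high-level bucket (assumed values).
BUCKETS = {
    "blue":   (0, 0, 255),
    "purple": (128, 0, 128),
    "green":  (0, 128, 0),
    "yellow": (255, 255, 0),
    "orange": (255, 165, 0),
}

def to_bucket(rgb):
    """Map a precise RGB triple to the nearest high-level color bucket."""
    return min(BUCKETS, key=lambda name: math.dist(rgb, BUCKETS[name]))

print(to_bucket((30, 60, 200)))  # a royal blue -> "blue"
```

This is the lossy direction: many distinct RGB values collapse into one bucket, which is exactly the precision problem the talk goes on to address.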

Kelsey Pedersen [00:04:31]: But it's a pretty common way to represent color within web development. And just like the world in general, and it's split into r means red, g means green, and b means blue. And the really cool thing about rgb values is there's over 16 million different options that you can get from rgb values. So the level of precision makes our search significantly better. The other kind of call out too, is within ranking and thinking about our embedding models, we're storing everything as vectors. And the nice thing that just kind of comes with RGB values is RGB is already like a vector data type, so it makes it really easy to query off of that as well. So as probably all of you guys are waiting to hear, how do we actually extract this RGB value from our product images? So I'm going to be talking through two different ways that we are doing that today. First is using more traditional computer vision machine learning algorithms to understand the frequency of pixels within the image.

Kelsey Pedersen [00:05:45]: And then the second is using multimodal models to extract the RGB value along with many other attributes at the same time from the product image. Um, first is, um, this is how we're thinking about, and this is how we've implemented it thus far. Um, so what we start doing is there's kind of a couple steps that we kind of have to do before we actually start extracting the color. As you can see here, um, first we start by cleaning up the image. So we take out the model to decrease the amount of noise with the color. Um, we shrink it down, we resize the image of, and then we're using the k means clustering algorithm, which is a very popular kind of like computer vision machine learning model that just comes. Basically you can import the library within python. It's really fast and easy to use, and that helps us get the dominant color within the image.

Kelsey Pedersen [00:06:42]: A couple kind of interesting things with that is, within this algorithm you can determine the amount of nodes that you want to parse out of it, the number of color pixels, and kind of the number of color options that you can pull out. And I would say the main thing with kind of fidgeting around with this was determining the number of nodes that we needed to make the most dominant color most accurate. So this is kind of like a fun prompting exercise that we do in a lot of other ways, but with this algorithm to really determine what was the most accurate, and for us it was five nodes where that often resulted in the top node being accurate. So you can see here, this is what we end up with. We have the initial image, we have the altered image, and then we get the top colors. This top RGB value of R 543 155 is the color that we end up extracting, and it's about 30% of that image. Okay, part number two, this is kind of the other way that we've been thinking about extracting color and playing around with the new advances with multimodal models that have been released more recently. The other way that we've kind of played around with doing this so far is using GPT four to pass in the image and extract a text description of that image.

Kelsey Pedersen [00:08:07]: So this I was just playing around with. I'm not actually trying to extract the RGB value. GPT four doesn't give that to you without prompting. You can see it just defines it as a royal blue gown, which actually is even more specific than what we would get with a lot of other older models. But with a quick prompting change, you can see describe this image using the RGB code. We actually get the RGB value out of GPT four as well, which is really cool. I honestly was pretty impressed that it did it this well. I didn't do any kind of pre processing of the image I just passed in the whole model, and it was still able to extract the right values.

Kelsey Pedersen [00:08:51]: You can also see here too, that by just saying describe this image, we were able to get a lot of other fashion attributes out of it as well, including tiered ruffles, layers, bodice, spaghetti straps. And we could theoretically embed this whole description and add it to our search as well. So then just kind of a quick comparison against the two different strategies. The first, by using computer vision to extract the code, the RGB code. There's a couple advantages to this strategy that we're currently using. One, it's free, which is always great. It's really fast since we're not making any external API requests or dealing with hosting any external models. And it's also structured and consistent, so we always know that we're getting back the same vector.

Kelsey Pedersen [00:09:46]: And then the second option that I just talked through is using multimodal models to extract the RGB code and many other attributes. And I would say the main advantage of this is that we are able to get a much more robust description. And by embedding this additional information we could probably improve our search. Yeah, improve our search from where it is today. And then the other thing too is there's no image preprocessing, so we don't have to deal with cleaning up the image, removing the model, shrinking it down and getting it ready to run through that algorithm. And then the final thing is, it doesn't require any other engineering. It's just really fiddling around with the prompt and making sure that we're getting consistent responses back. So this you can see now, this is now that we're taking into account the rgb value.

Kelsey Pedersen [00:10:35]: Initially we had a huge set of color dresses, and now this is taking into account the rgb value, which you can see now we'll rank the images in a much more visually sensical way. And the scores underneath are just the similarity score with just the rgb value, and you can see that we rank them significantly better. Then finally, this is how it comes to life in the app where now when a user searches the image, um, you get to see all these like very very very um, color specific, um, results. Uh, one kind of final thing to note is like one thing that's been really fun to play around with is being able to weight all these attributes against each other. Obviously, color isn't the only thing that we weight in our algorithm, but, um, since we've launched, I think with user learnings, that's becoming a very dominant feature. Um, and so making sure that we're getting that right on the majority of our searches. Yeah, that's it. My name's Kelsey.

Kelsey Pedersen [00:11:38]: If you're interested in trying our app, feel free to download it. And, yeah, let me know if you have any questions. But that's it.

Q1 [00:11:46]: How do you get the catalog data?

Kelsey Pedersen [00:11:48]: That's a fantastic question. How do we get our catalog data? There are a couple of different ways. One is we have our own catalog of products, which we're scaling up; we have over 15,000 brands in our catalog today. So a lot of what you were talking about, embedding a lot of data and making sense of it, is something we deal with day to day. The other way is basically understanding the image and then running Google search requests to pull in additional products that aren't in our catalog already. So it's those two things happening simultaneously right now.

Q1 [00:12:27]: Do you also provide suggestions on, like, what goes well with this? So, for example, if I throw in one of my white tunics.

Kelsey Pedersen [00:12:34]: Yeah.

Q1 [00:12:34]: It would suggest that you can wear this t-shirt.

Kelsey Pedersen [00:12:38]: Yes, I think that's a great idea. We haven't done anything around discovery yet, but at Stitch Fix we called that "complete the look": okay, if you know you want this one item, how do you create a whole outfit around it? I think that'll probably be our first pass at discovery. I'm actually talking to a founder tomorrow who's built a similar type of product. [Audience question, inaudible.] Yeah, that's a good question. Right now we're not; we're just setting the most dominant color. But for products that have a lot of pattern in them, we're right now identifying the pattern too. So if we can do a pattern and the top color, that usually gets it close enough, but I think pretty soon we're going to have to figure out a way to incorporate the top handful of colors and rank them by dominance. [Audience question about gradients, inaudible.]

Kelsey Pedersen [00:13:39]: Well, thankfully, there aren't many gradients in fashion products, so we haven't really done anything around that. But that's a good question. I think it follows up on your question of how we handle multiple colors in a product, and I think that might actually be a pattern we should potentially include.

Q2 [00:14:02]: And this is, like, a curveball question. Sorry. There was some bad press a certain company got when they announced that this gray-colored plus-size clothing was "manatee gray."

Kelsey Pedersen [00:14:18]: Wait, was what?

Q2 [00:14:19]: Manatee gray. It was plus-size clothing, and they called it manatee gray, which is very offensive.

Kelsey Pedersen [00:14:24]: Yes.

Q2 [00:14:24]: And so, well, I'm hoping the LLMs are much better at describing the dresses than that.

Kelsey Pedersen [00:14:31]: Yeah.

Q2 [00:14:32]: What kind of challenges do you see in terms of making sure that people are finding great dresses for different bodies and different shapes?

Kelsey Pedersen [00:14:42]: Yeah, that's a good question. We haven't dealt much with shape yet; we've punted on shape and size. I would say that's probably one of the hardest things to normalize across e-commerce, just because that data is so variable; there's so much variance across retailers. But it will be interesting to see: can we extract that from a product image, or does it depend more on other size data we're pulling in? A lot of e-commerce companies now are really trying to be more size-inclusive, so it might not be something we can fully extract just from the image. Yeah.

Q2 [00:15:35]: So does it work with very complex patterns? Right.

Kelsey Pedersen [00:15:39]: Yeah, I would say multicolor is harder, but with the way we're embedding, we're embedding just the specific product with CLIP, and CLIP actually does a better job with pattern, so we're less reliant on the color there. With the existing models we use, this is kind of filling in the gap in that CLIP understanding. But with very specific patterns, CLIP does a much better job of ranking without this additional piece of information. So it'll be interesting to see what people are searching. If people are searching these more varied products, or there's a lot of vintage clothing with a lot of different colors in it, where it's not a solid image, I think there's diminishing value in even understanding all the colors in it, and there might be other, more important attributes to weight.

Q3 [00:16:41]: This is more of a complicated question, but a lot of people have been suggesting features for the app. In your eyes, what does a one-year plan look like? And how are you factoring in monetization, or whether you might exit to a company like Amazon that would build this into their app? Because I imagine monetization would be very difficult for something like this; people would just click the link and go buy somewhere else instead of paying for a membership or something like that. How do you see all those factors contributing to your plan?

Kelsey Pedersen [00:17:10]: Yeah, that's a great question. Are you an investor? You've got the good business mindset. So right now we've definitely focused on being more of a utility tool, where search is the core function of our app. I think where things get really exciting is around the "complete your look" idea. We have a lot of women right now using the app to search for wedding dresses or bridal looks, and I think even beyond a specific outfit, it's about understanding what broad concept a user is searching for and providing recommendations around that. If you compare it to a Google or a Pinterest, they're not really providing that level of personalization around your search history. And I think that area is a really interesting way to monetize and work directly with brands: to push products that are in line with the user's search intent and don't feel pushy, but also don't degrade the results we're showing for the actual search.

Kelsey Pedersen [00:18:12]: Because I think one of the pitfalls of where Google is right now is that you search for anything in Google and it doesn't feel as trustworthy. There's a ton of sponsored posts, and they weight heavily on page rank, so you're seeing a lot of Amazon, Walmart, and Banana Republic-style brands, when it's like, no, I actually want to search and find brands that are owned by just one person and are more local and authentic. Being able to surface those in a way that feels authentic to the search, while providing this broader way to discover things outside your search, is pretty interesting. So that's where I see us in a year: beyond this utility search, really building a broader discovery platform, where we're then compensated by click-through or some sort of affiliate model. Yes. Yeah.

Kelsey Pedersen [00:19:11]: Segment Anything is amazing; we're using it. And I think the question of "is it a shirt or is it a dress?" is a parallel example to this: okay, we implemented a version one of identifying the category, but we can utilize new models coming out and ask, can we do this better than we currently are? For sure. Yeah, that's a great question. So I played around with CIELAB, which is another color space, and a couple of others, and we found there wasn't enough differentiation in how those color spaces represented the products we have. But I think that's fascinating for extracting color from user-generated photos, or things that need more nuance, or where the lighting is weird and we have to normalize it.

Kelsey Pedersen [00:20:12]: But for now, RGB is totally fine and the easiest. Thank you.
