MLOps Community

Textify, Text Diffusers, and Diffusing Problems in Production

Posted Mar 15, 2024 | Views 666
# Diffusion
# UX
# Storia AI
SPEAKERS
Julia Turc
Co-CEO @ Storia AI

Julia is the co-CEO and co-founder of Storia AI, building an AI copilot for image editing. Previously, she spent 8 years at Google, where she worked on machine-learning-related projects including federated learning and BERT, one of the large language models that started the LLM revolution. She holds a Computer Science degree from the University of Cambridge and a Master's degree in Machine Translation from the University of Oxford.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I learned that building machine-learning-powered applications is hard, especially when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board races to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

SUMMARY

AI image generators have seen unprecedented progress in the last few years, but controllability remains a hard technical and UX problem. Our Textify service allows users to replace the gibberish in AI-generated images with their desired text. In this talk, we dive into the technical details of TextDiffusers and discuss challenges in productionizing them: inference costs, finding the right UX, and setting the right expectations for users in a world of cherry-picked AI demos.

TRANSCRIPT

Textify, Text Diffusers, and Diffusing Problems in Production

AI in Production

Slides: https://drive.google.com/file/d/1G3qygBzKdbzVvpqd_fvG2eJ8WMAszEfZ/view?usp=drive_link

Demetrios [00:00:05]: Okay, and next up we have Julia. Julia, are you here?

Julia Turc [00:00:10]: I'm here. Can you hear me?

Demetrios [00:00:12]: I can hear you, and I remember you from the quiz. I don't know if you were around for the quiz, but I don't think a lot of people guessed your name, but your name won. It was one of the answers for the quiz, and the question had to do with Transylvania. Today we're talking about text diffusers. As soon as that came on my radar, I was very curious about it and I started looking into it, and it sounds absolutely fascinating. So I'm stoked to have you share it with us. I'm going to share your screen here and take it away. I'll be back soon.

Julia Turc [00:00:52]: Awesome. Well, thanks, everyone, for being here. I'm Julia. I'm co-founder of Storia, and I'm here to talk about Textify, which is one of our most popular features. We have 25,000 people who have tried it out, and they've created hundreds of thousands of assets. But what we want to talk about today is our journey as AI founders and all of the many hats that we have to wear and all of the faces that we have to make every single day. I'm sure that most of you are familiar with the status of Gen AI today. We can think of it as a pyramid, where at the base level there are people who have enough resources and expertise to build these foundational models, which fuel the rest of the pyramid.

Julia Turc [00:01:45]: These tend to be very general APIs, like text-to-text, or text-to-image, or text-to-video. And if you tuned in last Thursday for my co-founder's comedy bit, you know how we feel about this layer. The next layer of abstraction is specialized models. These are usually built by people who either have proprietary data, so they fine-tune open-source foundational models on it, or who want to deliver something task-specific, in which case the API might be a bit more nuanced or more complex than simple text-to-Y. And then finally there are workflows and applications, which are usually built by people who have deep industry expertise. They mostly call existing APIs, and their value proposition is very different from core Gen AI; Gen AI might just be an add-on.

Julia Turc [00:02:42]: Where we operate today is kind of level 2.5, where we build our own specialized models, and we expect a lot of our moat to come from there. But the decision on what to build is informed by the consumer application that we're also building. The consumer application is called Storia Lab, and the end goal is to displace incumbents like Photoshop and Canva for non-designers. Without a doubt, image generation has been democratized over the last couple of years by providers like Midjourney and Leonardo and Stability and OpenAI and so on. But if you want fine-grained control over these images, you might still have to pull your images into Photoshop. So our philosophy is to focus on tasks that might not go viral, might not be as interesting as text-to-video, but that actually help people deliver value. And the good news is there's a mountain of research that is still going unnoticed. The particular feature that I'm going to describe today is called Textify, and its goal is to seamlessly change text in images with minimal effort from your side.

Julia Turc [00:04:01]: So you don't have to deal with the learning curve of Photoshop, but you still have full control to decide where the text goes and what shape it should have. These are some input-output pairs that we generated with Textify, and you can see that we're doing a pretty good job at preserving the initial style, even for complicated situations like this neon image over here, for instance. And we're doing better and better when it comes to cursive. Of course, we're not perfect, and sometimes you can see artifacts, like for this word "weight", which arguably is a little bit too spread out. So we're still working on it; it's still a work in progress. Now, the research task behind Textify is formally known in academia as scene text editing, and it's basically the task of automatically modifying text in images that are potentially photorealistic, while preserving the original background and the original text aesthetics. So if you're not a nerd and you don't particularly care about model architectures, I will give you permission to tune out for the next couple of slides.

Julia Turc [00:05:16]: But I promise that it will get more interesting after that. And there might or might not be a little Family Feud game at the end. We published a Medium article with the full literature review we did around scene text editing, and here I'm just going to go through a high-level timeline of how these architectures evolved. The task was formalized in 2019, when a paper called STEFANN used GANs to do character-level edits. This was pretty interesting because it defined the task itself, but it was pretty limited, because if you wanted to change an entire word, you would have to do it sequentially, letter by letter. It also had the limitation that the target word needs to have the same length as the source word. A few months later, a couple of papers found a way to expand this to word level. Their approach was to break the task down into three simpler tasks.

Julia Turc [00:06:19]: Task number one is to take the original image and wipe out the text; in this case, the "pedestrians" image would just have "pedestrians" removed. Task number two is to render your target text, in this case "sessions", with the desired font, but on a plain background. And task number three is to fuse the two together. Breaking the problem down into three sub-modules is conceptually very elegant, but it requires direct supervision for every single module, which requires labeled data for every single stage, which constrains you to synthetic data. Because they couldn't really use photorealistic text, the generalization capabilities are pretty limited. Fast-forward one year, and another group of papers managed to train models with a single GAN loss, which doesn't require labeled data, allowing them to reintroduce photorealistic data and get much better results. And of course, in 2021, when the entire image generation field moved towards diffusion models, scene text editing adopted them as well.

Julia Turc [00:07:37]: Most people here are probably familiar with the architecture of diffusion-based models, but I grabbed a screenshot from The Illustrated Stable Diffusion here, at the level of abstraction that is most relevant for this particular task. You have your text encoder, which takes natural language and translates it into a continuous embedding space that jointly represents images and text. Then the embeddings are fed into what's called here the "image information creator", which is the actual diffusion: the sequential denoising steps. Once we have a final latent representation of the image, the image decoder maps it back into pixel space. So when it comes to research around scene text editing, the question that needs to be answered is: how do we specify additional requirements, in addition to the text that you want to show up in the image? Things like the desired position, the desired style, and so on. And of course, there are as many answers as there are papers.
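To make the three stages concrete, here is a minimal sketch using Hugging Face diffusers components. This is an illustration of the generic Stable Diffusion pipeline, not Storia's production code; the model ID is an assumption, and classifier-free guidance is omitted for brevity.

```python
# Sketch of the three stages: text encoder -> denoising loop -> image decoder.
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint for illustration
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")  # stage 1
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")  # stage 2: the "image information creator"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")  # stage 3: image decoder
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")

prompt = "a neon sign that says OPEN"
ids = tokenizer(prompt, padding="max_length", max_length=77,
                return_tensors="pt").input_ids

with torch.no_grad():
    # Stage 1: map natural language into the joint text/image embedding space.
    embeddings = text_encoder(ids).last_hidden_state

    # Stage 2: start from pure noise in latent space and denoise sequentially.
    latents = torch.randn(1, 4, 64, 64)
    scheduler.set_timesteps(50)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=embeddings).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # Stage 3: decode the final latent representation back into pixel space.
    image = vae.decode(latents / vae.config.scaling_factor).sample
```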

Julia Turc [00:08:57]: An interesting one is TextDiffuser, which came out of Microsoft Research. In their version 1.0, the way they represent positions is they take the target word, in this case "work", render it on a white background with an Arial font in the desired position, and then pass it through a pretrained U-Net that gives them a segmentation map. The segmentation map is able to capture both positional information and some information about the shape of the characters. So this segmentation map that you see here is what gets passed into the image creator, in addition to the text embeddings. A few months later, they built on top of research out of Geoff Hinton's lab, which discovered that, surprisingly, language models are actually good at encoding position. Now, this is a little bit unintuitive, because position is represented as numbers. How many times has a model seen a token like "x125" in a training set? Probably very few times. So the intuition would be: how can this possibly have a meaningful embedding? But it turns out that our intuition just doesn't hold anymore with these massive models. They're able to encode position by simply using tokens reflecting the top-left and bottom-right coordinates of a rectangle, so "x5 y70" simply means the pixel at location (5, 70).
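As a rough illustration of the two conditioning schemes: the exact preprocessing and token vocabulary in the TextDiffuser papers differ, and the font path and token format below are assumptions.

```python
from PIL import Image, ImageDraw, ImageFont

# TextDiffuser 1.0 style: render the target word at the desired position on a
# plain white canvas; a rendering like this is then run through a pretrained
# segmentation U-Net to get a character-level mask.
def render_text_layout(word, box, canvas_size=(512, 512)):
    x0, y0, x1, y1 = box
    canvas = Image.new("L", canvas_size, color=255)  # white background
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype("Arial.ttf", size=y1 - y0)  # font path is an assumption
    draw.text((x0, y0), word, font=font, fill=0)
    return canvas

# TextDiffuser 2 style: serialize the box corners as plain coordinate tokens
# and let the language model embed them (hypothetical token format).
def box_to_prompt_tokens(word, box):
    x0, y0, x1, y1 = box
    return f"x{x0} y{y0} x{x1} y{y1} {word}"

layout_mask = render_text_layout("work", box=(100, 200, 260, 240))
print(box_to_prompt_tokens("work", (100, 200, 260, 240)))  # x100 y200 x260 y240 work
```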

Julia Turc [00:10:17]: Now, what's being passed to the image creator is basically the target text, plus the positional encoding of where you want the text to be. These architectures, unfortunately, have not yet converged to a place of clarity or simplicity. For instance, this is what the architecture of AnyText looks like, a paper that came out of Alibaba in December 2023. I'm not going to go through the details of it, but that is a normal person's reaction when they look at this architecture, especially realizing that, in addition to the diffusion model, there are so many additional models, including pretrained components like OCR systems and text encoders and so on. So if you were bored out of your mind, you can snap back in now, because we're going to talk about the actual topic of this conference, which is taking research models and actually productionizing them. When you want to build specialized models, the first thing you do is define your task.

Julia Turc [00:11:29]: You find a proxy for this task in research; in our case, that's scene text editing. Then you do a literature review, and you discover tens or maybe even hundreds of papers that are promising and relevant. The question is: which ones should you try out? Which ones should you invest effort into? So here's our reasoning. First we ask, do they have open-source code? If the answer is no, well, as two people who have worked in research before, we know that reimplementing from scratch would be complete suicide for a pre-seed-stage startup that obviously has a limited runway. If the answer is yes, then the next question is: do they also have an open-source checkpoint? If they don't, we're going to procrastinate and look for lower-hanging fruit to start with. If both of these answers are yes, then it should be pretty straightforward to simply productionize these models, right? You would expect that, but you will have an initial shock when you realize that quality is just not as good as advertised in the real world. And that's probably because real-world images are a lot more diverse than the training set.
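Written out as code, that triage logic is roughly the following hypothetical helper, not Storia's actual tooling:

```python
def triage_paper(has_open_source_code: bool, has_checkpoint: bool) -> str:
    """Roughly the paper-triage decision tree described in the talk."""
    if not has_open_source_code:
        # Reimplementing from scratch is not viable for a pre-seed startup.
        return "skip"
    if not has_checkpoint:
        # Retraining is possible but expensive; look for lower-hanging fruit first.
        return "defer"
    # Code plus checkpoint: evaluate it, but expect an initial quality shock on
    # real-world images that are more diverse than the training set.
    return "evaluate"
```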

Julia Turc [00:12:49]: There's also the obvious limited-API problem, where most work coming out of research just supports the standard thing, which might be a PNG of fixed size 512x512. And finally, you have to face the fact that this is a research task that's only a proxy for what you need, so you might have to support additional things; in our case, that's font family or font color, which none of these research papers address. So after the initial shock wears off, what you do is you take these models that are available out of the box, you evaluate them, you figure out which models do well in which circumstances, and potentially write your own logic to route the right queries to the appropriate model. You add additional logic that deals with the limited research API, like cropping and resizing and scaling and inpainting. And last but not least, at least in our case, you have to build a pretty front end, because it's not 2010 anymore, unfortunately, and consumers do have high expectations when it comes to what they see. Maybe the only notable exception that I can think of is Midjourney, which forced people to tolerate the Discord interface.
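The additional logic that deals with a limited research API might look roughly like this hypothetical wrapper, assuming a model that only accepts and returns 512x512 RGB images:

```python
from PIL import Image

def edit_text_region(image, box, run_model):
    """Crop a square region around the text box, run the fixed-size research
    model on it, and paste the edited patch back into the full image."""
    left, top, right, bottom = box
    # Expand the box to a square with some surrounding context.
    size = 2 * max(right - left, bottom - top)
    cx, cy = (left + right) // 2, (top + bottom) // 2
    crop_box = (max(cx - size // 2, 0), max(cy - size // 2, 0),
                min(cx + size // 2, image.width), min(cy + size // 2, image.height))
    crop = image.crop(crop_box)
    edited = run_model(crop.resize((512, 512)))  # the model's fixed-size API
    result = image.copy()
    result.paste(edited.resize(crop.size), crop_box[:2])  # scale back and paste
    return result
```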

Julia Turc [00:14:12]: Now, the harder path in the decision tree happens when the code is available, but the checkpoint is either unavailable or just not good enough quality, and you have at least a hunch that you could do a better job and retrain it to get better quality. There's still an initial shock when you first try to train it. We did this for TextDiffuser, and we encountered mostly the standard challenges that you would expect when you start training. The first one is obviously data. Data is always noisy; LAION-5B, the dataset that Stable Diffusion was trained on, has very noisy labels, which might be fine when you're training Stable Diffusion or a very general API like text-to-image, but it becomes a lot less tolerable when you're trying to build a model for a very specialized task. And TextDiffuser itself comes with a dataset that has some corrupted labels, potentially because of OCR failures, and some image URLs don't exist anymore, and so on.
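The kind of dataset filtering this implies might look like the sketch below; the field names, threshold, and second OCR pass are assumptions, not TextDiffuser's actual cleanup:

```python
import requests

def is_clean(row, ocr_fn, min_agreement=0.8):
    """Drop rows whose image URL is dead or whose stored label disagrees too
    strongly with a fresh OCR pass (a crude character-level comparison)."""
    try:
        resp = requests.head(row["image_url"], timeout=5, allow_redirects=True)
        if resp.status_code != 200:
            return False  # the image URL no longer exists
    except requests.RequestException:
        return False
    predicted = ocr_fn(row["image_url"])  # re-run OCR to catch corrupted labels
    matches = sum(a == b for a, b in zip(predicted, row["label"]))
    return matches / max(len(row["label"]), 1) >= min_agreement
```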

Julia Turc [00:15:19]: The next obvious problem is compute, because GPUs are not cheap, and thank God for the fact that early-stage startups are being showered with free credits from the cloud providers. But there is a time constraint: there is pressure to run all of your risky experiments and validate your market hypotheses in year one, before these credits run out. Now, running code that you just find on GitHub is never straightforward; somebody always forgot to put some dependencies in that requirements.txt. Another interesting anecdote is that when we tried swapping Stable Diffusion 1.5 for SDXL, we expected a pretty smooth transition, maybe some failed tensor shapes that we could easily fix. But then, if you have to deal with GPU and CUDA issues, you might have to fix the sorts of errors that are super low-level, and you might waste an entire day just trying to chase down what sort of ops are failing on the GPU.

Julia Turc [00:16:26]: When it comes to training, the biggest challenge for this particular task is that we need to find the right trade-off between reproducing the style of the text in the original image and spelling the target text correctly. So we need to choose a set of hyperparameters that makes the right trade-off between these two. And choosing the right hyperparameters is particularly difficult in a world where there are no good automated evaluation metrics. While you're training your model, there is no reliable metric that you can look at and have the peace of mind that your model is getting better. Basically, what we do today is, after every few steps, we just run inference, and then we use our very subjective human judgment to decide whether the model is improving or not.
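A minimal sketch of that loop, assuming a diffusers-style pipeline and Weights & Biases for logging (which Julia mentions later in the Q&A); the prompts and interval are illustrative:

```python
import wandb

EVAL_EVERY = 500  # run inference every few hundred training steps
eval_prompts = ["a neon sign that says OPEN",
                "a chalkboard menu that says BRUNCH"]

def maybe_log_samples(step, pipeline):
    """There is no reliable automated metric for this task, so log sample
    generations for subjective human inspection instead."""
    if step % EVAL_EVERY != 0:
        return
    images = [pipeline(p).images[0] for p in eval_prompts]
    wandb.log({"samples": [wandb.Image(img, caption=p)
                           for img, p in zip(images, eval_prompts)]},
              step=step)
```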

Julia Turc [00:17:25]: So after a few weeks of hard work, all is well. You can see here some before-and-afters. TextDiffuser out of the box is not very good at producing cursive text, but with a lot of dataset cleanup and replacing Stable Diffusion 1.5 with SDXL, which was trained at higher resolution, we're getting a lot closer to cursive. Now, I cannot end this talk without addressing the elephant in the room, which is that a system like this is going to be abused sooner rather than later. So here's Steve Harvey asking you: what documents do you think people are attempting to fake with Textify? I would like to see your answers in the chat, to see how close you get to the answer.

Demetrios [00:18:09]: So let's see. I think there might be some significant delay with the chat, but let's see if anybody stuck around to answer it. We could give it a few seconds, but can I start?

Julia Turc [00:18:22]: Of course.

Demetrios [00:18:24]: Okay, first of all, receipts, right? Like invoices, contracts.

Julia Turc [00:18:30]: Good one. Let's be more creative. These are the obvious ones, right?

Demetrios [00:18:38]: Okay. Text messages, for sure. You could do, well, ads, right? So you could do false advertising: "I paid different things; these are all the prices."

Julia Turc [00:18:54]: Terms and conditions. A lot of inspiration from you.

Demetrios [00:18:58]: Yeah. Insurance documents, my flight seat number. Let's see. Okay, condominium bylaws, invoices, sales docs, driver's license. I'm getting invoices, passports. Wales.

Demetrios [00:19:25]: Nice, Mike Wales. Let's see, invoices, passports, passports. Tax returns from Frederick.

Julia Turc [00:19:32]: Yeah. Okay, so that's a pretty comprehensive list that I see; people have a lot of imagination. Here are the top hits. Obviously things you would expect, like IDs, credit cards, and diplomas, but there are some pretty creative ones. Some people are trying to increase their follower count on Instagram screenshots. Some people are faking their running stats on Strava. Some people are faking their blood test results, for some reason.

Julia Turc [00:20:00]: And one person was trying to get into the Four Seasons for free. So, very good guesses. That was not a comprehensive list; there's a lot more to it. Which brings us to this moral dilemma. As startup founders, our resources and time are so limited. So is this something we should invest in right from the start? Is this the right time to invest in preventing abuse? And even if we did have infinite resources and time, who's the right entity to be policing these things? Should I, as a startup founder, be policing your Strava stats? Should I be preventing you from changing them? Of course, some cases are more clear-cut than others. Faking IDs is probably universally a bad thing, but there are exceptions: what if you're just trying to make a cool slide deck and you're doing it for comedic purposes? One way of answering these questions is to look at what incumbents do.

Julia Turc [00:21:04]: You can definitely do these things in Photoshop today. But admittedly it's not the same scale, because you need skills, you need a Photoshop license, and you can't just automatically do it 1,000 times in 5 seconds. So I'll leave you there with this open question today, and thank you for your attention. And if you need help productionizing visual Gen AI in your system, we'd like to hear from you, and you can reach us at standards at Storia AI.

Demetrios [00:21:39]: Awesome. Okay, Julia, thank you very much for that. Let's see if we have any questions in the chat. It's going to take a couple of seconds for people to probably get their thoughts in order. What absolutely blew my mind is the positioning. I did not see that one coming. You're right, it is completely counterintuitive. But once you kind of think about it for a second, it's also extremely intuitive.

Demetrios [00:22:11]: Right? For some reason I just didn't think about it until you mentioned it. I guess at that point it depends: you want to be encoding it with the x, and then the position is the pixel position, or I guess it doesn't even really need to know what it is. Do you have a sense of whether it's just the relative distance between the x's and the y's, or is there something more absolute about it? If you zoom into the weights, do you have a sense of what's going on?

Julia Turc [00:22:36]: The way the paper does it is absolute values, which is even more surprising, because if you were to pass the length of the rectangle, the width and the height, maybe there would be more intuition there, because those are relative measures. But yeah, it's pretty mind-blowing. And as an NLP person who joined NLP back when there were dependency parsers and grammars and computational linguistics theories, it's just mind-blowing to me that all of those intuitions need to be put aside, and you basically surrender to big data and all of the lack of intuition that comes with it.

Demetrios [00:23:18]: Yeah, amazing. So as you're building out this product, do you think about a particular target audience, or are you mostly in R&D right now, trying to get the technology to a particular level? How do you split your attention between these two things?

Julia Turc [00:23:35]: Where we're getting more traction is people who are non-designers, who basically are not willing to learn Photoshop. I had this very interesting email thread with a truck-driving company in New York City that was reporting bugs, and doing it very consistently. At some point I asked him, why are you giving me your time and attention? And he said, well, it's still easier than learning Photoshop. I think that's a cute anecdote for who our target audience is. These are people who feel like image generation has been democratized. So this person was generating his trucks with an Empire State Building background in Midjourney, but he wanted to put his phone number and his contact details on the truck, and he just didn't know how. And I suspect that as the technology gets better and better, designers will eventually want to use it as well. Because why spend 13 minutes instead of 5 seconds to do something? And those 13 minutes: if you read our article, you'll see that's exactly how long a Photoshop expert took to fix some misspellings in an image.

Julia Turc [00:24:45]: So eventually, when the quality gets good enough, I suspect designers will be on board as well.

Demetrios [00:24:51]: I had a designer reach out to me the other day, and they tried to put something first, I think, in Midjourney and then in ChatGPT-4, and they tried to get some text on, like, a billboard or something. And obviously it didn't get it right; it doesn't know how to actually generate text in that elegant way. So is there some kind of play by these big players that will ultimately overtake this sort of thing, where GPT-5 or whatever is going to come out of the box with these things already? Or do you feel like the changes that you have to implement right now are still so low-level that whatever comes next, even if they just add a few billion extra parameters or whatever, is still not going to be architecturally correct enough to get the job done?

Julia Turc [00:25:43]: So we are seeing progress when it comes to the quality of text. They started off by generating complete gibberish; DALL-E 2 could never spell a word. Now it's getting better and better, up to the point where it only makes a few misspellings. And Midjourney v6 and Imagen and DALL-E 3 actually do spell text pretty correctly most of the time. However, I feel like even when they have 100% accuracy in spelling, there will still be cases when people want to modify a particular piece of text. And in that case, the text-to-image paradigm is just not the right UX. So even if the model is absolutely perfect, I do not want to express in words, hey, I want this text to be five pixels to the left, and I want the color to be a bit more in line with the background.

Julia Turc [00:26:32]: And I actually would like this shade of blue, and so on. So even if DALL-E 100 becomes absolute perfection, I just don't want to interact with it in that particular way. So it's very unlikely that OpenAI will be interested in building these very specific interfaces, or at least not until they reach AGI and they get bored and have nothing else to do.

Demetrios [00:26:59]: No, 100%. And for the people who talk a lot about moats, and what's defensible and what's not: it is these workflows, it is this understanding of your actual user, that I think is going to make the difference. So, very nice. A couple of questions from the audience. Let's see. One is: has someone developed a tool that allows you to build checkpointing without coding it? I have a feeling that's probably related to your tree at the beginning about how to assess the viability of going with open source.

Julia Turc [00:27:30]: I'm not sure if it's referring to checkpointing when it comes to your developer flow, like, here are the things that I tried and here's how I documented them, or more specifically to, during my training process, how well is my model doing?

Demetrios [00:27:46]: Yeah, that's a good question. Let's see what they say. I suspect it's the latter. This is sort of how I interpreted it, but we can give them a second if they're still around to answer it. Nice. Okay, well, if they come through, we'll pick it up.

Julia Turc [00:28:06]: Weights & Biases is very helpful when it comes to visualizing your logs and keeping track of which models you ran with which parameters.

Demetrios [00:28:18]: Yeah, my intuition is that they're referring to: let's imagine that you're finding a paper, and that paper seems interesting; they're implementing a new diffuser, whatever, in an interesting way, and they've also open-sourced the weights. But you want to checkpoint the model during its training, so you want to see how it's been improving in its capabilities, but then also the data. So this is my impression. Okay, let's see. They said model checking.

Demetrios [00:28:51]: Yeah, this is what they're saying. Model checking. Yeah, I don't know, I guess it depends what your framework is. I mean, there are certain things without checkpointing, without coding it. Yeah, I guess I'm not entirely sure what you mean by coding it. Okay, we have another. Julia, if you have some intuition about what they might mean, then jump in. Otherwise we can move on to the next one.

Demetrios [00:29:18]: Ali says, does this solution end up as a feature function in existing tools like Photoshop?

Julia Turc [00:29:24]: Yeah, the Textify feature that I just described is definitely just one feature of a larger platform that we're trying to build out. There are a lot of other operations that would be nice to have, like being able to replace an existing object without having to manually trace it with a brush like you have to do today. Inpainting is another thing that should be a lot easier to do, but with a text-to-image interface it's not that easy. So definitely this has to turn into a fully fledged offering where you can import your image and then do whatever you can do with Photoshop today, but again, in a much shorter time span.

Demetrios [00:30:08]: Yeah, nice. Wonderful. Julia, please stick around in the chat some more in case folks have more questions, or, Mark, in case you want to ask your question again, perhaps differently. Julia, thank you very much for joining us, and best of luck with Storia.

Julia Turc [00:30:27]: Thank you for your time.

Demetrios [00:30:28]: Thank you.
