The Variational Book
Yuri evolved from biomedical engineer to wet-lab scientist, and more recently transitioned his career to computer science, spending the last 10+ years developing projects at the intersection of medicine, life sciences, and machine learning. His educational background is in Biomedical Engineering, with an M.S. from Columbia University and a B.S. from the University of California, San Diego. Current interests include generative AI, diffusion models, and LLMs.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Curiosity has been the underlying thread in Yuri's life and interests. With the explosion of Generative AI, Yuri was fascinated by the topic and decided he needed to learn more. He pursued that learning by reading, deriving, and understanding the seminal papers of the last generation. The endeavor culminated in a book on the topic, The Variational Book, which Yuri expects to release in the coming months. More detail about the topics he covers can be found at www.thevariationalbook.com.
Yuri Plotkin [00:00:00]: My name is Yuri Plotkin. I'm the author of The Variational Book. You can find more at thevariationalbook.com or at The Variational on X. And I like my coffee. I'm pretty simple. Just black. Just black. It doesn't even have to be good coffee.
Yuri Plotkin [00:00:16]: It just has to be black. I drink all types of coffee.
Demetrios [00:00:22]: What is happening, MLOps community? We are back for another podcast. As usual, I am your host, Demetrios. And today, talking with Yuri, we got into the pedigree of models, specifically all of these image generation models, and I really appreciated how he broke down the lineage, opening my eyes to the fact that there is a bit of standing on the shoulders of giants. We knew it, but here he talks about it very clearly, how one model inspired the next, and it left us with being able to use text-to-video, where we are today in this day and age. So cool to see. He's got a book out, as he mentioned in the intro. Go check it out if you want to start getting into the architecture and understanding of these models. He talks about how he was inspired to write the book after diving into reading so many different papers, seeing the common thread that went through them, and understanding the math and why things were the way they were.
Demetrios [00:01:39]: A big caveat to this is that his book does not talk at all about LLMs. He said there's enough conversation and literature out there on those specific types of models. So he hit the other ones. Hope you enjoy it.
Demetrios [00:03:41]: Okay, 20 seconds before we jump back into the show.
Demetrios [00:03:44]: We've got a CFP out right now for the Data Engineering for AI and ML virtual conference that's coming up on September 12. If you think you've got something interesting to say around any of these topics, we would love to hear from you. Hit that link in the description and fill out the CFP. Some interesting topics that you might want to touch on could be ingestion, storage, or analysis, like data warehouses, reverse ETLs, dbt techniques, et cetera, et cetera. Data for inference or training, aka feature platforms, if you're using them, how you're using them, all that fun stuff; data for ML observability; and anything FinOps that has to do with the data platform. I love hearing about that. How are you saving money, how are you making money with your data? Let's get back into the show now. Let's start with this, though.
Demetrios [00:04:35]: I think that you've got an interesting background. I want to get into some of it because you were in biology, right? And then you jumped into the AI scene. Can you talk to me about the switch and what made you want to go from being a biologist to a generative AI-ologist?
Yuri Plotkin [00:05:00]: Yeah, it's a great question people ask me, I think, all the time, and I don't have a clear answer. But I'll tell you, kind of as a kid, I always loved biology. There was always something fascinating about it. This is a personal kind of story. When I was a kid, I used to love ketchup, and I haven't really shared it outside the family circle, but I used to ask my grandparents if ketchup kills germs and they would just lie to me to have me eat the food. So, yeah, I always had this fascination or interest in the biology side of things. I really don't know where that comes from. I think it's just a little bit intrinsic, a little bit curiosity.
Yuri Plotkin [00:05:46]: So that kind of manifested into me studying biomedical engineering for my undergraduate and master's. And I loved it because it was a good hybrid between learning the biology and an application, how it transfers to solving real-world problems, from medical devices, or nowadays it's very computational. So there's a lot of problem solving and analytics, as you could imagine, from the engineering side. So I really enjoyed that because of that critical thinking aspect. At some point I transitioned to wet-lab research and I really enjoyed that. And yeah, there was kind of a switch where I maybe got back to some roots, or got back to some math, and started studying computer science. And honestly, I haven't looked back.
Yuri Plotkin [00:06:39]: It's been a very fascinating topic. I love, I think, the thread through my trajectory in life, which is learning. So the ability to learn and the ability to develop new skills. I think in this context, these are hard skills, you know, the science, understanding the science. And, yeah, there's just something amazing about the sciences, where it's like the scientific method: you could test something and it follows some laws or some observations, or it's very systematic. And I love that reproducibility about the science. So I think I'm digressing, but, yeah, I think just curiosity kind of led me to computer science, and the more I started to learn, the more I kind of loved it. And the more it was just kind of.
Yuri Plotkin [00:07:39]: Yeah, took over, I suppose. And the more you learn, the more you want to know. The more you learn, the more you realize, you know, the less you know. So I think it's just, it's been kind of a natural evolution, I think, for me to get into the field.
Demetrios [00:07:58]: So you have been writing a book recently, The Variational Book. What is it and how did it come about?
Yuri Plotkin [00:08:07]: So, The Variational Book, it's about all things generative AI. I discuss literally the last ten years of generative AI algorithms, not including LLMs. That's the one kind of elephant in the room that I don't touch. But I basically derive and explain logically all the steps in terms of how the algorithms work, why they work a certain way, and provide anecdotal examples, more intuitive examples of the rationale behind them. Initially, it started as a way for me to personally learn, reading a lot of papers, trying to distill a lot of the in-between kind of knowledge that was transferred in the papers. And, yeah, it kind of snowballed from a short blog post that I was going to write, and then I just kept writing and kept writing and kept writing. So I'm pretty excited about the book because it's very thorough, and I think I provide a different context and reference frame for describing these algorithms, and also for the pieces that I include.
Yuri Plotkin [00:09:31]: So you can think of it as like a piece of a puzzle: each publication has a small piece, and sometimes it's hard to fit those pieces together, how everything relates to each other. A lot of these algorithms have evolved over time, and now you see there's a lot of push on, like, text to image, text to video, these diffusion models. And these diffusion models actually have, you could say, ancestral precedents in other earlier models. So we're talking about continuous normalizing flows, normalizing flows, score matching. You're talking about variational autoencoders. These are all, I'd say, ten-year-old techniques that are still widely used and applicable and very powerful methods, but yet they share similarities with these newer methods. So I think the exciting part about my book is I create that common thread that lets you, in a stepwise fashion, understand things by starting from Bayesian modeling to latent models and working your way towards more generative methods.
Yuri Plotkin [00:10:42]: And just what it's about, I think it's a great resource for any machine learning engineer, anyone in the field, anyone who's an undergraduate or graduate student, because I literally try to not leave any logical gaps. So I tried to reduce the activation energy, or the effort, required to learn it. It's still very technical, still requires a lot of effort, but I think it's a great kind of progression of ideas that I'm able to explain and provide. In terms of how it came about, it started with reading papers; it was more self-interested, I want to learn myself. It was a way for me to read papers, to try to understand the papers. And I guess the saying goes, you don't really understand it until you either teach it or you write about it.
Yuri Plotkin [00:11:40]: So this was my way of gaining a better understanding. And it's been several years, but the byproduct has been, I think, a nice collection of writings that fulfilled at least my initial interest, but also, I think, provides a lot of value to a wider audience, to a lot of people in universities and in the field.
Demetrios [00:12:10]: So it's almost like you have this family tree of the different models. I really like how you put that together. The thing that is a very strong design decision on your part is not including LLMs in this book. Why did you not do that?
Yuri Plotkin [00:12:28]: I didn't include LLMs because they're inherently a different, you could think of it as a different architecture from the machine learning side, although maybe that's disputable at some level. But yeah, I focused on Bayesian latent models and variational inference, and that kind of spilled over to a lot of these newer techniques. As I mentioned, diffusion models, score matching, normalizing flows, variational methods; a lot more of how do you compare two distributions, or how do you train a model to fit a distribution of data? So it's definitely focused on more Bayesian kinds of methodology. Suffice to say, there are also a lot of other books that have come out in the LLM space, both on the more theoretical side and on the practical implementation of how you do LoRA and RAG and RLHF, and so on and so forth. So the space is a little crowded. So I thought there'd be a little bit more emphasis on more traditional approaches that are still foundational. And I think the theory, although it might be a bit different, still spills over to some of the modern techniques.
Demetrios [00:14:00]: Speaking of diffusion models, right now, I think last week, or just a few days ago, there was a new open source. Well, semi open source, because these days, when it comes to models being open source, you can't really believe any of that. The semi open source model that came out, Flux, which is a diffusion model. And it's super cool to see because it has very high quality, and it seems, just from me playing around with it for 15 minutes, that it's easy to prompt and you can get some very high quality images. So there are a lot of people pushing the boundaries in that direction. When you were reading about all these different models, and almost like the evolution of the models per se, was there one thing or a few different things that surprised you or stood out to you?
Yuri Plotkin [00:14:57]: Yeah, I think there's a common thread. If we start with, let's say, the initial conception of variational autoencoders, you encode some data into a latent space, and then you have this one-step procedure where you sample this latent space and you generate whatever you're trying to generate, your synthetic data example. What was interesting, as you get into the weeds of the evolution of the techniques, is GANs, which again have a discriminator and a generator; again, the GAN generator samples in a one-step process, and you generate your synthetic data sample. Albeit the training algorithms are a little bit different for those techniques, and use different types of objectives and approximations, you could say. What stood out, when I started the deep dive into diffusion-based models, is you have essentially, you can think of it as a similar, analogous process, but it's just repeated many, many times for the training and inference steps. So, for example, you start with a data sample, and you just corrupt it a thousand times, adding noise at each iteration, until you just end up with some Gaussian noise. The inference part is you start with Gaussian noise, and then you try to decorrupt it and generate a synthetic example that comes from the same data distribution as the original training data. So you're relearning to generate those images.
Yuri Plotkin [00:16:43]: The caveat there is, these models are extremely powerful in the fidelity and the quality of the samples, but the bottleneck is the inference, where you have to do 1,000 steps to generate one example. And if you compare it to, let's say, GANs, you could draw thousands of samples from GANs in mere minutes. To do the same with the diffusion models, if we reference back to 2020, the DDPM paper was a seminal paper, that would have taken hours because of just the computational effort to do the inference. So in a sense, that was really surprising, because they are very similar, but it's just the number of steps. So the interesting part is the evolution of diffusion models. The last few years have focused on the inference side, where instead of 1,000 steps, you could reduce it to 100, then to ten. Now they're trying to do one or two.
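To make the corruption-and-denoising loop Yuri describes concrete, here is a minimal, hypothetical NumPy sketch of a DDPM-style process. The noise schedule, shapes, and the zero-returning predict_noise stand-in are illustrative assumptions, not anyone's actual implementation; a real model would train predict_noise to estimate the noise added in the forward pass.

```python
import numpy as np

T = 1000                                      # number of forward corruption steps
betas = np.linspace(1e-4, 0.02, T)            # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)               # cumulative products used in the closed forms

def forward_noise(x0, t, rng):
    """Closed-form forward step: corrupt x0 to the noise level of step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def predict_noise(x_t, t):
    """Stand-in for a learned denoiser eps_theta(x_t, t); returns zeros here."""
    return np.zeros_like(x_t)

def sample(shape, rng):
    """Ancestral sampling: start from pure Gaussian noise and take T sequential
    reverse steps, each conditioned only on the previous one (the slow part)."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        x = mean + (np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else 0.0)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8,))                # pretend this is a data sample
x_noisy, _ = forward_noise(x0, t=500, rng=rng)
x_gen = sample((8,), rng)                     # 1,000 sequential steps per generated sample
```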
Yuri Plotkin [00:17:47]: So I think that was very surprising, in the sense that the techniques are very related. You're sampling the latent space in a very high-level, abstracted sense, so I was surprised by how related they are in that sense, and then it just comes down to more practical kinds of differences on the inference side of things.
Demetrios [00:18:22]: Yeah. So the nice piece about that is, and I like that you brought up how we're starting to see an evolution within diffusion models of cutting down those steps, because that is where the time-consuming part comes in. And I imagine a lot of us out there have played around with it, especially if you were playing around with Stable Diffusion in 2022, or whenever it came out. You would see the image kind of form, and then it would kind of get darker and darker, and then boom, it would be there. And it was like watching a Polaroid. And those were those steps, right? I think it was like every 15 steps or 20 steps. And on the UI that you had, you could say how many steps you wanted, and if you put too many steps, it would overbake sometimes. So it's funny how that is the.
Yuri Plotkin [00:19:21]: Thing in real life. I've tried baking, and I'm not.
Demetrios [00:19:25]: Too many steps too. Goes back to that. But the piece there is now, how can we. It's almost like the architects are trying to figure out, (a) how to optimize it so it's faster, and (b) how to make it so we can't really overbake it, so we can idiot-proof it a little bit, for people like myself to not be able to come up with something when we don't want it. And one thing that I was thinking about as you were talking about this, it just reminded me of a huge takeaway that I got from reading Joe Reis's book, Fundamentals of Data Engineering, which is how useful it is for engineers to understand the tools and the space that they play in and really understand the pros and cons of each. And so, as you were saying, well, you know, you have GANs that can be really fast, but they only have one step, and they don't have all of these cool generative properties as much as a diffusion model. But a diffusion model is probably going to be a little bit slower.
Demetrios [00:20:33]: So when you're architecting what you need to architect, you can make your decisions based on that, because you can say, do we really need it to go through all these steps, or do we have the ability to wait one second or half a second? Sure. If not, and we just need to classify a picture, then maybe it's better to use a different type of architecture; we don't need this diffusion architecture. And Joe was talking about it a bunch in the book, on understanding your compression algorithms, or understanding which databases are powerful in which ways. And so it just reminded me exactly of that.
Yuri Plotkin [00:21:18]: Yeah, I think those are good points. I would also add, I think what's very exciting nowadays, at least from my perspective and vantage point, is that a lot of this compression on the inference side has direct implications on the video generation side. And, well, I think the current range is maybe around five minutes. I don't know what each company's secret sauce is, but the timeframe over which you could generate a video is still limited to a certain degree. But I think the exciting part is that those two things go hand in hand. You reduce the inference time, and, let's say for video generation, you can create longer videos. And from what I'm seeing, there are hundreds of companies spawning, I've talked to people before, trying to do this video generation stuff.
Yuri Plotkin [00:22:19]: So I think it's still very early stages, but it's pretty significant in terms of its applicability and potential impacts on a lot of industries. And maybe because I'm in LA, I think a lot of the entertainment industry is cautiously scared. But very interesting, if.
Demetrios [00:22:42]: I'm understanding you correctly, it's because of the ability for us to go down to one- or two-step generations with these diffusion models. It's much easier to create videos from these one- or two-step generations, and then, basically, you start to create 24 frames, and then you have the 24-frame-per-second video, and you can add those up. Is it different when it comes to video? Because I had always thought the models that were being used to output the videos were different, like a whole different architecture than the models that were being used to output the photos.
Yuri Plotkin [00:23:33]: I'd have to review it just so I don't misspeak. But for video generation, it's usually U-Nets that are used, and they have some way of incorporating the time domain elements. So the issue with video generation is how you stitch certain frames together, or different scenes. For diffusion models, it could vary; it could be U-Nets as well. But I'd have to take a look at it. I think it depends on the exact implementation.
Demetrios [00:24:07]: So it goes back to what you were saying on how there is this lineage, or almost a common thread, that is going through all of these models. And you're really standing on the shoulders of giants when you're doing anything with diffusion models or U-Nets, coming from the optimization. And the efficiency gains that you get with diffusion models also help later on for the U-Nets, and vice versa, I would imagine.
Yuri Plotkin [00:24:40]: Yeah, I mean, from my literature review and just the process of writing The Variational Book, you definitely see a lot of, I would say, I don't know what the right quantity is or how to quantify it, but you definitely see a lot of spillover or translation. I'm not quite sure what the right verbiage is, but you definitely do see a lot of, I don't know, borrowing or learnings. Inspiration. Yeah, inspiration. Maybe the best way to phrase it is learnings from how earlier publications work and stuff like that.
Yuri Plotkin [00:25:29]: In terms of my writing for the book, I tried to stay away from specific architectural changes that might have been published, because those are very specific, and I tried to focus on the core algorithms at hand. So maybe I'm not the.
Demetrios [00:25:48]: What do you mean there?
Yuri Plotkin [00:25:49]: Like, for example, in GAN training, or in any architecture or any model training, you could do specific, let's say, architectural changes. Say, you know, for GANs, how do you incorporate style into the GAN generation? So how do you precondition the GAN to generate certain types of styles? Or they would modify the architectures in ways that might increase, let's say, fidelity somehow. I tried to stay away from architecture-specific considerations, because I think those are more abundant, and what I tried to focus on more is the core algorithm and the logic used. And of course, there are seminal papers; for example, again, if we stick with GANs, there's the truncation trick, or style-based generators, or Wasserstein GANs, that improved either the objectives or did something in terms of the sampling. For example, the truncation trick: not sampling from the tails of the distribution, to avoid maybe rare events or rare generations. So, yeah, I tried to stay away from architectural changes.
Yuri Plotkin [00:27:18]: So maybe I'm not the best person to comment on how things have evolved.
Demetrios [00:27:29]: No, I understand that. Talking about all the different pieces, the ways that one is using the other, I think, was the original thing.
Yuri Plotkin [00:27:37]: It's just like, for example, if we talk about LLMs, the original Transformer paper had one layer normalization after, I think, the transformer block or something like that, and then in hindsight they added one before as well. So those architectural changes can have significant impact on the actual performance of the method and algorithm. But I've stayed away from those a little bit and tried to make it more focused on the algorithms themselves. Of course, once someone gets assimilated or comfortable with the algorithmic approaches, then you can pull up arXiv or do a Google Scholar search to see maybe what the latest and greatest approaches are.
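As a concrete illustration of the layer-normalization placement Yuri mentions, here is a hypothetical PyTorch sketch contrasting a post-LayerNorm block (in the spirit of the original Transformer) with a pre-LayerNorm variant; the module names and sizes are made up for illustration, not taken from any particular model.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy Transformer block showing post-norm vs. pre-norm placement."""
    def __init__(self, d_model=64, n_heads=4, pre_norm=False):
        super().__init__()
        self.pre_norm = pre_norm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        if self.pre_norm:                          # normalize before each sublayer
            h = self.ln1(x)
            x = x + self.attn(h, h, h)[0]
            x = x + self.ff(self.ln2(x))
        else:                                      # original style: normalize after the residual
            x = self.ln1(x + self.attn(x, x, x)[0])
            x = self.ln2(x + self.ff(x))
        return x

x = torch.randn(2, 10, 64)                         # (batch, sequence length, d_model)
print(Block(pre_norm=False)(x).shape, Block(pre_norm=True)(x).shape)
```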
Demetrios [00:28:34]: Yeah. So it's almost like there are remixes of each algorithm to try and figure out the best ways of optimizing it, and seeing how it can go from this diffusion model that was originally taking thousands of steps to get to the image generation, to now it just takes one step, and that's because of some architectural changes that have happened.
Yuri Plotkin [00:28:59]: Sure, sure. Yeah. I can't remember off the top of my head, but there are certain publications, like the initial diffusion models started with DDPM. That was a big one; that was kind of the breakthrough moment for those types of modeling, even though diffusion itself, the idea, was published circa 2015, I think. But I'm not going to try to say his name.
Demetrios [00:29:29]: Right, but that's always fun, man. Pronouncing the names on these papers is a great time. Yeah. You want to get humbled.
Yuri Plotkin [00:29:39]: Yeah. But I guess it's interesting. The literature is filled with a lot of these methods. To your point, there are architectural changes and there are more algorithmic changes. So from the diffusion side, architecture changes could include some type of distillation, whereas for more algorithmic changes, there was a paper called DDIM, denoising diffusion implicit models, where they focused on changing the Markovian nature. They basically made the inference process non-Markovian, in a high-level way that allows you to reduce the number of inference steps that are required. Whereas in the original diffusion papers, it's very conditional on the previous time step, which makes it a very sequential process.
Demetrios [00:30:39]: First of all, you said a word that went right over my head. What is Markovian?
Yuri Plotkin [00:30:44]: Markovian. So in a probabilistic sense, you can have a first-order Markov model where the probability of x at time t is conditioned on x at t minus one, so the time step before. In this sense, it's a first-order Markovian model because you're only conditioning on one prior time step, and the only thing that really matters to predict the future is the previous time step. And you see Markovian types of modeling are very predominant in a lot of RL literature and assumptions, and a lot of machine learning basically simplifies these types of sequential decision processes using some type of Markov assumption, although that's changed quite a bit, I think.
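As a toy illustration of that first-order Markov property, here is a hypothetical two-state chain in Python; the transition probabilities are made up, and the point is only that each new state is drawn conditioned on the previous state alone.

```python
import numpy as np

rng = np.random.default_rng(0)
transition = np.array([[0.9, 0.1],     # P(next state | current state = 0)
                       [0.2, 0.8]])    # P(next state | current state = 1)

state = 0
trajectory = [state]
for _ in range(10):
    state = rng.choice(2, p=transition[state])  # depends only on x_{t-1}, nothing earlier
    trajectory.append(int(state))
print(trajectory)
```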
Demetrios [00:31:42]: How so?
Yuri Plotkin [00:31:44]: I mean, you could think about, if we focus on diffusion models, let's say you have denoising diffusion implicit models. So now you're assuming a non-Markovian inference structure. Without getting into the math of it, you're basically not assuming that your inference is dependent on just the previous time step; for these implicit models, it's dependent on the original data sample as well. You bake that into the probability distribution that you're trying to model. So with all these diffusion models, you're trying to learn at each sequential step what the distribution looks like. And if you take 1,000 forward steps and you know what the distribution looks like at each time step, then you could easily take the reverse process back, because you're just sampling from a distribution. But in non-Markovian diffusion modeling, implicit models for example, you're not assuming just one conditional; you're not assuming a Markovian past or a Markovian model. I'd have to maybe go through some of the math to explain it a little bit better.
Yuri Plotkin [00:33:13]: For example, also, and I could be wrong, but if you look at language modeling in terms of attention mechanisms. Right. What attention does is, initially you would have sequential generation, right? You're sampling from some multinomial distribution, and you're getting your next character, your next token, or whatever it is, but they're biasing it with attention these days. Now you have multiple, let's say, hidden states, and then Attention Is All You Need.
Yuri Plotkin [00:33:41]: When the Transformers came out, that changed how attention was defined. So this is an interesting question, and I don't really know the theory behind it in terms of, let's say, whether attention is relatable to a first-order Markov process or a second-order or whatever it might be. There are definitely some probably deep connections there, and I would be cautious about speaking on it, on a personal level.
Yuri Plotkin [00:34:16]: But, yeah, there are definitely some connections, and that's kind of focused on the interpretability of these models and how they actually work. Right. So I think probably in the next few years we'll have more literature or publications on that.
Demetrios [00:34:37]: Yeah. Or one of the listeners can just tell us in the comments, let us know.
Yuri Plotkin [00:34:42]: Yeah, you said everything. Well, hopefully they say you said pretty much everything correctly. But.
Demetrios [00:34:50]: Of course.
Demetrios [00:34:51]: All right, real quick question for you. Are you a Microsoft Fabric user? Well, if you are, you are in luck, because we are introducing SAS Decision Builder. It's a decision intelligence solution that is so good, it makes your models useful, because let's face it, your data means nothing unless you use it to drive business outcomes. It's something we say time and time again on this very show. But wait, what do you mean by nothing? Well, SAS Decision Builder integrates seamlessly with Microsoft Fabric to create effortless business intelligence flows. It's like having a team of geniuses you manage in your pocket, without all that awkward small talk. With Decision Builder, you'll turn data into insights faster than brewing a double espresso. And you know how much we like coffee on this show.
Demetrios [00:35:44]: Visually construct decision strategies, align your business and call external language models. Leverage decision Builder to intuitively flex your data models and other capabilities. Breakneck speeds. There's use cases for every industry, including finance, retail, education and customer service. So stop making decisions in the dark. Turn the lights on with SAS decision Builder. Yes, I did just make that joke. Watch your business shine, SAS decision builder, because your business deserves better than guesswork.
Demetrios [00:36:20]: Want to be one of the first to experience the future of decisions? Well, sign up now for our exclusive preview. Visit sas.com fabric or click the link below.
Demetrios [00:36:38]: Now, if I'm understanding it correctly, and what I really like about that, and I appreciate you breaking down what Markovian is for me: Markovian basically means just use the last step as your reference to create the next step. And non-Markovian is try and use the whole picture. And what we trained, if it's the diffusion model, it's saying there were a thousand steps from when we had a nice picture to when we had perfect noise. Basically use that whole 1,000-step journey as your reference.
Yuri Plotkin [00:37:18]: Not necessarily, not quite, but just to say that in a first-order Markov model, as you mentioned, you're just assuming that your future depends only on the previous state. And that could be for diffusion models, reinforcement learning, or whatever it might be. Basically, in a denoising diffusion probabilistic model, in the forward training pass you have a conditional distribution, your surrogate q(x_t) conditioned on x_{t-1}. So imagine you start from t equals zero, you take a step, you only care about your previous distribution. What the non-Markovian denoising diffusion implicit models do is learn a different distribution during inference. What they're learning is q(x_t) conditioned on x_{t-1}, the previous time step, just as before, but they're also conditioning it on the original image, x_0. So now you're conditioning on two variables, you could say, instead of one.
Yuri Plotkin [00:38:44]: And the distribution you learn is non-Markovian in that sense. What the denoising diffusion implicit models show is that even non-Markovian distributions can have the same marginals as the original denoising diffusion probabilistic model did. So the beauty of it is what they showed. And I think a lot of this gets lost in translation just by speaking about the mathematics, but in layman's terms, I think what's pretty amazing about these techniques is that for training, nothing changes. And in a lot of these models, you don't really care about the training portion of it, because if training takes a thousand runs or a thousand steps to do the noising steps, that's okay. But you care about the inference, the reverse, because you care about the generation.
Yuri Plotkin [00:39:45]: And you want that to be quick. So if you have, again, to our discussion, a thousand steps, it's going to be very slow. But these implicit models, because they form non-Markovian distributions, what they show is you could train the model the same way, but on the inference side, because of those non-Markovian distributions, in essence, this is all just kind of verbiage; maybe this is a selling point, like, you know, buy my book or subscribe, because I go over all the details of the math. But in essence, what that means is you're reducing your inference time using these implicit non-Markovian distributions. And that goes into the fact that you can generate the same marginals, the distributions that you're interested in, when performing that inference.
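To make the inference-time savings concrete, here is a hypothetical sketch of a DDIM-style sampler in the spirit of what Yuri describes: the same kind of trained noise predictor (stubbed out here), but a deterministic reverse update that forms an estimate of the original sample x_0 and jumps through a much shorter sequence of timesteps. The schedule and the zero-returning predict_noise are assumptions for illustration, not the paper's code.

```python
import numpy as np

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))    # assumed schedule

def predict_noise(x_t, t):
    """Stand-in for the learned eps_theta(x_t, t); a trained model goes here."""
    return np.zeros_like(x_t)

def ddim_sample(shape, num_steps=50, seed=0):
    """Deterministic DDIM-style sampling over a subsampled set of timesteps."""
    rng = np.random.default_rng(seed)
    timesteps = np.linspace(T - 1, 0, num_steps).astype(int)  # e.g. 50 steps instead of 1,000
    x = rng.standard_normal(shape)                            # start from Gaussian noise
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps_hat = predict_noise(x, t)
        # Estimate x_0 from the current noisy sample, then jump straight to t_prev.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        x = np.sqrt(alpha_bars[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bars[t_prev]) * eps_hat
    return x

print(ddim_sample((8,)).shape)   # the same marginals in theory, far fewer reverse steps
```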
Demetrios [00:40:39]: Dude. So I'm going to start using Markovian as an insult for people who are living in the past. You know, like people who can't let the past go. They're just training; they're living their life in the past and not able to get over that last step.
Yuri Plotkin [00:41:03]: Yeah, you would. That would kind of make sense, I think, from a nerd's perspective.
Demetrios [00:41:13]: Yeah. I might just be the one that looks like an idiot by saying that. Nobody's going to understand me. I'm just going to be like, what the hell are you talking about, man? And then I have to say, well, you know, Markovian is actually when it's t minus one, that's all that matters. And then I look like an idiot because I can't actually explain why or what Markovian is, but it does.
Yuri Plotkin [00:41:41]: You could get more complicated. You could have second order or third order. It's a little tricky, because then you can condition on two. So you could have a Markovian process, but it's not first order. So we could work on the, on the.
Demetrios [00:41:55]: Joke on the burn. That's it.
Yuri Plotkin [00:41:59]: You have a comedian and a computer scientist, or two computer science people, working on making sure that the delivery is good, but also that it's factually.
Demetrios [00:42:09]: Yeah, factually correct. And it actually makes sense; the analogy holds up. So, you know what else I was thinking, man? I would love to see what you were talking about, all these different remixes of the models, and specifically the model architectures, in like a lineage, family tree type of a graph or visualization, to see: all right, here are all the GANs, here are all the different ways that we've tried to tweak GANs, and then here are all the diffusion models, or whatever it may be. And you can climb up the tree and see how diffusion models and GANs came from the same mother or father or whatever it may be.
Yuri Plotkin [00:42:54]: Yeah, I think that would be pretty powerful. That would require another. Some more effort that.
Demetrios [00:43:04]: No, I'm not expecting you to put.
Yuri Plotkin [00:43:05]: That, but potentially I'll have it to you on Monday.
Demetrios [00:43:08]: Yeah, totally. I'll be expecting that before this comes out. I imagine somebody that loves data visualization has thought about that and put it together, potentially, and showed some way that all of this is linked, because that's a huge takeaway from what you're saying, and really from your book: that you can look to the past for inspiration and see how it has given us the future, or given us at least the present.
Yuri Plotkin [00:43:44]: Yeah, I mean, I think that's totally the case. To your point, what I try to embody in this book is, it's a little bit biased, because I cherry-pick methods that worked and were well cited and, you know, rose to the top of the pile. But I essentially try to do as you described: go through the evolution from the beginnings of Bayesian models, and just Bayesian model selection or something more old school, towards newer flavors, let's say diffusion models or text to video and stuff like that. So I think that's a very good point, and I think some of the things I try to do actually lend themselves towards that, without you actually having to read the whole landscape. On a personal level, I'm always curious, now that I've gone on this journey, writers always talk about some of the writer-specific considerations, like how do you know when you've finished, how do you stop? And unfortunately, the computer science literature is so vast and there are so many great papers and so many great topics.
Yuri Plotkin [00:45:08]: And when you try to cover a lot, that basically means I just have to keep writing for, I don't know how long. But yeah, it's an interesting point, I'd say.
Demetrios [00:45:30]: Yeah, you have the benefit of time too, now that you've been able to see all of these different models that bubbled up and have withstood the test of time, which I think is also a huge factor. It would probably be really hard to do what you're trying to do now for the most cutting-edge models, because all of these variants are trying to take hold and we don't know which ones are going to be the ones that will stand the test of time.
Yuri Plotkin [00:46:07]: Yeah, yeah. There's definitely some selection bias there, or some benefit I have, in that these methods have withstood the test of time, I suppose, allowing me to even write this book, because it's hard to predict the future. I think maybe only a few voices out there might have some reasonable competency to try to play Nostradamus.
Demetrios [00:46:38]: Well, it does make a lot of sense. You named it The Variational Book, and you're really looking at different variants of this original model. Is that where the inspiration came from?
Yuri Plotkin [00:46:52]: Pretty much, yeah. I initially focused on. First, I'd say I was very excited that the URL wasn't taken when I purchased it. And for anyone who wants to remove the "the" and just do variationalbook and do a mimic copy, I got that one, too.
Demetrios [00:47:11]: So they can do it.
Yuri Plotkin [00:47:13]: They can do it. You've got to be a little bit more creative. But, yeah, it started with variational inference and, again, kind of variational techniques: how you have some posterior distribution, some data distribution that is theoretically unknown, and how you could model that distribution through a sample of data and build some surrogate approximation to it. Those techniques, I think, have been foundational for computer science in general. But, yeah, it's been pretty cool, because it starts from the Bayesian side. If I go through my chapters, I start talking about Bayes' theorem, which is a seminal theorem that, I think, is easy to memorize, but it's very hard to conceptualize what it's actually doing and what it actually means. But once you conceptualize it, I think it's extremely powerful, because, again, no matter what the model is, I bet there's some interpretation that incorporates some type of Bayes' theorem.
Demetrios [00:48:21]: But I go, how did you conceptualize it?
Yuri Plotkin [00:48:26]: So, I mean, you could memorize it, but from a latent space model you can conceptualize it, and this is some of the things I talk about to help bring some intuition to it. From a latent space model, you always have variables that you observe, which are your x's, and then you have your latents, which are things that you don't observe. And you could think of both your latents and your observed x's as forming some very complicated dependency graph with connections, some of which you might observe and some of which you might not observe. And the goal is to learn that joint structure, the probability of your x and z, your joint distribution. It's actually a very interesting question, because a lot of people ask you to explain what generative means, and I always go to Bayes and the joint distribution, but if you don't know what that is, then it's hard to conceptualize what generative is.
Yuri Plotkin [00:49:34]: Like, you could generate stuff, I think it's in the name.
Demetrios [00:49:37]: So that makes me think, because you have these strong dependencies with the things you can see and the things you can't see. It's almost like if you want to find a certain part of latent space, sometimes you have different things. Like, I'm thinking specifically of LLMs, and sometimes you want to make the LLM perform better, or do something, be more rigorous. And so you can say things like, think very hard, or think through this clearly, or things like that. And it makes me wonder if that's because there are these dependencies, so you almost zoom to that part of latent space by using these trigger words.
Yuri Plotkin [00:50:29]: Yeah, yeah. I mean, I think that's essentially it. I think a lot of interesting research, a lot of the big companies, are looking into that, like the interpretability, and that's the whole thing. It's like prompt engineering: yeah, you could do it, but explaining why it might give you a better answer, I think that's the interesting part. And to your point, you're potentially sampling from a different part of that joint distribution, or somehow, depending on how you would conceptualize it. But, yeah, it's a very interesting topic. And a lot of the methods, again, stem from just these basic core ideas and from the variational side, I think, to your original questions.
Yuri Plotkin [00:51:11]: So I talk about model selection. I talk about exponential families; they're very important in modeling in general and in machine learning. And then you could talk about, well, if you build some approximate distribution with certain assumptions, which we call the surrogate, how do you measure the difference between your surrogate and your true data distribution? And essentially, the KL divergence that everyone posts about is one metric for how you could do that, and you could define your objective that way. Then we basically snowball into more complicated topics of how you model the surrogate distribution and what assumptions you can make. And with a lot of these generative models, the underlying theme is that they're able to model a more complicated surrogate distribution. And because it's more complicated, you can incorporate a lot more of these dependencies that you might miss with a simple distribution.
Yuri Plotkin [00:52:13]: Analogously, and not to hate on linear regression, linear regression is a great, very powerful model, but it has many assumptions, in terms of linearity, baked into it. If you want something more complicated, then you could do function approximation using a neural network, and there are pros and cons to that. So it's kind of similar in trajectory, where you can build a simple surrogate distribution, but then you can work your way towards more complicated stuff. And this is essentially what a lot of these generative methods have done over the last however many years.
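As a small, hypothetical illustration of the "measure the gap between the surrogate and the target" idea Yuri mentions, here is the closed-form KL divergence between two univariate Gaussians in Python, the kind of term that shows up when the surrogate is Gaussian and the reference is a standard normal; the specific numbers are made up for illustration.

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in closed form."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
            - 0.5)

# The further the surrogate drifts from the target, the larger the divergence.
for mu in [0.0, 0.5, 2.0]:
    print(f"mu_q = {mu}: KL = {kl_gaussians(mu_q=mu, sigma_q=1.0):.3f}")
```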
Demetrios [00:52:53]: Yeah, and not only that, it just goes back to the idea of understanding what you're working with, so you can know what the benefits are, or the strengths and weaknesses of each model. I can't tell you how many times I've heard the question, how much machine learning do I need to know to be an ML engineer? Or, I'm coming from the data engineering space and I want to get into ML, what should I do? How should I go and learn the depths of this? Or do I just need to understand what people are saying, or what they mean, when they talk about the different model architectures? And it kind of reminds me, again, of that idea that the more you can understand these model architectures, the better off you're going to be, because you can suggest different architectures and different models for the different use cases and your needs.
Yuri Plotkin [00:53:48]: Yeah, 100%. A shameless plug, I was going to say: go read The Variational Book to learn more and more, I think, theory that transcends a lot of ML. But, yeah, I think that hits the nail right on the head. In order to be able to make changes or propose changes, you have to understand. It's very easy to come up with an objective and say, code it, or see it in code and say, well, okay, this is what we use. But the question is, why are you using it? What are the implications? And I think that's where some of the boundaries between maybe ML engineering and theory and ML science come into play. And I think both are crucially important. So, yeah, I think it's good to have both.
Demetrios [00:54:41]: Yeah, it's fascinating to me. The more that you think about it, the more you realize you have to be working through so many different levels when it comes to AI in industry, because you not only want to think about the infrastructure side of things and how to optimize on that level, but also about optimizing the model and making sure that you're choosing the right model for the job. Are you choosing the right databases for the job? Are you choosing the right data for the job? All of that. Then even going up a level and recognizing, what are we really even trying to accomplish here? When I get tasked with making our user retention better and I decide that I want to use machine learning for that, maybe I say, all right, well, if we have a better recommender system, then we can have better user retention. You have to be able to think through the business side of things, all the way down to the technical side and the data side and the machine learning side. So you're playing at so many different layers.
Yuri Plotkin [00:55:54]: I think, with some humility, we could always personally do better with all of those. And, yeah, I think that's the exciting part, because some of that manifests differently as time evolves, but it's also something to keep striving for and learning. On a personal level, there's always way more to learn and way more to understand. And every time I see really cool new papers, I'm like, oh, I should maybe incorporate that. On the one hand, it's a dopamine rush, because it's a great new idea that I could learn, right, and understand, and you kind of get into the heads of how people are thinking about these things, too.
Yuri Plotkin [00:56:42]: On the other side, it's like, oh, it's a lot more work.
Demetrios [00:56:47]: Yeah. You have to balance it and decide, well, this has been awesome, man. I appreciate you, Yuri.
Yuri Plotkin [00:56:53]: Yeah. So it's great.