MLOps Community

Zipline Roundtable episode: Building Real-Time ML Systems with Zipline + Chronon

Posted Jun 17, 2026 | Views 40
# Real-time ML
# Zipline
# Chronon

Speakers

German Krikorian
Software Engineer @ Credit Karma

German is a Software Engineer on the Feature Platform team at Credit Karma. Since joining the company during the early development of its recommendation system, they have played a key role in building and scaling the platform over the years. Their work focuses on feature pipelines and the feature store, which serves as critical infrastructure supporting numerous teams and business verticals across the organization.

Raj Katakam
Staff Machine Learning Engineer @ Intuit Credit Karma

Raj Katakam architects ML Infrastructure at Credit Karma (Intuit). He holds a Master's in Software Engineering from Carnegie Mellon and a B.Tech in EECE from IIT Kharagpur. His interests include ML Infrastructure, Distributed Systems, Real-Time Data Processing, and Generative AI. His current focus is on providing feature engineering platforms, production GenAI infrastructure, vector databases, ML model serving, and MLOps pipelines for fraud detection, personalized recommendations, financial insights, and model explainability.

Mick Jermsurawong
Member of Technical Staff @ OpenAI

Led Flyte ML training/experimentation at Stripe, and now leads Chronon for ML features at OpenAI.

Ben Magyar
Staff Engineer @ Depop

Ben Magyar is an engineer at Depop working on ML and data systems. Before Depop, he worked on Search at Etsy. Most of his work is around the infrastructure and operational problems that come with running ML systems at scale.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Real-time ML use cases like personalization and risk decisioning come with a unique set of challenges: serving fresh feature values at low latency for inference, generating temporally consistent backfills for training, and building complex chains of on-demand, batch, and streaming transformations. In this roundtable, practitioners from Intuit Credit Karma, Depop, and OpenAI share how they use Zipline and the OSS Chronon project to solve these challenges and deploy real-time ML use cases in production.


TRANSCRIPT

Demetrios: [00:00:00] Ho-ho. We are live. Welcome everyone to this incredible day of good old fun. I see in the chat that some are asking, "Are we live? Are we not? Is it happening?" I just wanna confirm, tell me in the chat you can actually see us. We are full on in the session. We've got a... Eh. There's the, there's the... That was me talking in the background, so it should all be live.

Demetrios: So we've got a jam-packed day for you. The next hour, we're gonna be talking to some incredible folks that I wanna bring onto the stage right now. We're gonna be doing it in a really cool format where each one of the participants today is going to give a bit of background on what they do and what they're up to, and then we're gonna have a nice big round table discussion.

Demetrios: So, [00:01:00] I'm gonna start bringing people onto the stage. First up, I see on my screen we've got Mick. Where you at, Mick? Hey, there he is.

Mick Jermsurawong: Hi. Hi Demetrios.

Demetrios: How you doing?

Mick Jermsurawong: Nice meeting y'all.

Demetrios: Yeah. Great to have you here. And al- I see we got German. Hey dude, what's going on? We've also got Ben behind... There he is. How you doing, Ben?

Demetrios: So, you guys all are doing some amazing stuff at your respective companies. I know that we had said let's do some different ideas here where you can share and give everyone context on what you're working on. Are there any takers on who wants to share first about what they're doing?

Ben Magyar: I could start, I guess.

Demetrios: There we go, Ben. I like it.

Ben Magyar: Um, so I was gonna put, [00:02:00] like, a slide together, and then I was like, "I don't know how to describe it." Um, but, uh, so at Depop we've been actually migrating over to Zipline for the past few months, and as part of that, as you imagine for, like, a marketplace, um, the big problem is the matching problem.

Ben Magyar: So we've mostly been working on the ranking model around that and forming the dataset, getting all the features up and ready. Um- Yeah. As you, like, go through that, I think, like, the big piece around Zipline and Chronon has always been how do you fit into, like, the Zipline and Chronon model? Um, and then how do you kind of bring back, uh, that orchestration and, like, the correctness aspect into your existing feature set and what your, like, existing models look like.

Ben Magyar: So a lot of time just kind of conforming back to what expectations look like, um, and having a lot of fun with that.

Demetrios: All right. And main use cases [00:03:00] you're working on?

Ben Magyar: Um, ranking models, retrieval, uh, some cases in the trust side that we're getting over to.

Demetrios: Excellent. All right. German, uh, hopefully I'm pronouncing that right.

German Krikorian: Yeah. Yeah. Uh, it's pronounced German. No worries. Hello, everybody.

Demetrios: Way off, dude. I was gonna go with German, but I was like, "There's no way."

German Krikorian: It's, uh, very common. No, no worries. Um, yeah, nice to be here. Thanks for having me. Um, just to give you a bit of background about myself. So I, I work at, uh, Intuit Credit Karma.

German Krikorian: I joined Credit Karma before it was part of Intuit, uh, in early 2017, like the very early stages of its feature store and recommendation system. So I've been here for a little over nine years at this point, and I've been mainly, like, on the data and ML realm on the online platform side of things primarily.

German Krikorian: And I started out working on, like, the online model selection and scoring service and, like, handling model deployments and refreshes and serving at scale. And that system [00:04:00] grew, uh, from a handful of new models a quarter to hundreds of refreshes weekly and, like, hundreds of models total. Um, and then I also worked on the custom-built feature store that we had.

German Krikorian: So we started out, we had like an in-house sharded MySQL, and we, we had a very, like, small feature platform, feature store at the time. There wasn't really a concept of feature store in early 2017, so which is kind of crazy to think about. But, um, yeah, w- we... It, it blew up and, you know, we, we scaled it. Um, we had to scale it.

German Krikorian: We had to change things up. We migrated to Google Bigtable and Dataflow, and that worked out very well for us, um, and evolved into, like, a custom-built feature platform with mature data pipelines. Um, but yeah, we've been just continuously evolving our custom setup and, like, I shifted my focus to feature platform exclusively.

German Krikorian: And the landscape recently especially has changed so much for the feature platform and feature stores, and there's so much tooling and so many products, [00:05:00] um, in contrast to even, like, a few years ago. So we decided to, um, do a gap analysis, and our aim was mainly to continue improving our feature platform, but sometimes the best way to do that is just to look outward.

German Krikorian: So, um, we just wanted to learn ma- primarily what's out there and see, you know, where we can improve from the things that exist today. And ultimately, that cr- uh, culminated with, like, Chronon becoming our next evolution after a bunch of talks, a bunch of, you know, POCs and trying things out. Um, and thankfully our, you know, company leadership is very encouraging and supportive to enable us to do this.

German Krikorian: So it's, it's been a really fun journey, and we're in the very early stages of Chronon adoption here, but, um, things are looking great.

Demetrios: Okay. I, I'm gonna have a lot of questions for you later, but before that, Mick, you wanna give us some background context?

Mick Jermsurawong: Yeah. Um, hi, everyone. My name is Mick. Um, I am now at OpenAI.

Mick Jermsurawong: We're working on a classical machine learning feature platform. Um, previously I was at Stripe, working [00:06:00] on, um, I led the, um, ML training with Flyte. Yeah, happy to be here.

Demetrios: Very cool. And so now, what I, what I need to do is I need to start with what your systems looked like before you started using what you're using today and what you're going with.

Demetrios: Like, give me the kind of like, uh, run, crawl, walk type of thing, or how was it? How is it? And how has that evolution happened? And, uh, maybe Mick, I'll start with you, 'cause why not?

Mick Jermsurawong: Sure. Um, yeah, OpenAI, as you know, we... I work on the product side, infrastructure for the product side. So we're not training LLMs here, but the capacity [00:07:00] crunch is real across the organization, so there's definitely value in classical machine learning models, and that's where I think Chronon really shines, to help us with these standard use cases that the industry has been solving for, you know, a decade almost.

Mick Jermsurawong: And, um, I think, yeah, Chronon is a great fit for starting with, you know, crawling, when people are actually not, um, using classical ML on the product side of our organization, because people were able to use LLMs to push through, uh, classical problems, in the sense of prompt-based asking for a classifier response.

Mick Jermsurawong: And over time, as people got used to that, there's actually not a lot of appetite to adopt classical models, because as you all know, it's actually hard to maintain an end-to-end pipeline and maintain your model life cycle, basically MLOps, right? And we didn't have a team for that.

Mick Jermsurawong: So to get started with [00:08:00] Chronon, actually a good place is using it as a way for, um, people to export offline data analytics into online serving. So basically we just use Chronon batch to calculate, using the expressive features and aggregation semantics that Chronon provides, and, um, people can readily query that on an online path.

Mick Jermsurawong: And this is like, if you squint, this looks like ML in the sense that we have features, but then the rule, it's actually just, you know, human encoded. So we provide the features, but the, it's still heuristic based. So that's how we got started. And, um, over time, I think Chronon really shines with, um, you know, real-time features and online, offline consistency, right?

Mick Jermsurawong: So with the real-time part, I think we just... we started with, um, Sora, which is, um, OpenAI, um, video, um, AI generated video content [00:09:00] platform.

Ben Magyar: Mm-hmm.

Mick Jermsurawong: And critical to that, it's recommendation system, and that's where Chronon helps us to- quickly spin up this, um, real-time features that otherwise would have taken us a lot more time.

Mick Jermsurawong: And within, within the month or so we were able to get it started.

Demetrios: Excellent. All right. Thanks for that. Uh, I'm gonna loop back around to you. I've realized I've been keeping one of our panelists in the dark. Raj, where are you? Get on here. There you go. Sorry about that, dude. I knew-

Raj Katakam: It's okay.

Demetrios: We have a surprise guest. Hey, there he is. Raj.

Raj Katakam: Hello, everyone.

Demetrios: Oh. How you doing, dude? So you were-

Raj Katakam: So sorry, I think I interrupted, uh, Mick. Uh-

Demetrios: No, it's perfect timing. We've been just talking about what your system looked like before and what it looks like now, and maybe going over some of these hard challenges that you've had as you're evolving your system to what you have now. [00:10:00]

Demetrios: So I know that's a very broad question. Let's just tighten it up by saying, hey, what did your previous system look like before Chronon? And Raj, since you're new here, I'll pick on you.

Raj Katakam: Ah, okay. So, uh, I was listening to the stream, uh, from the other one, like, uh, German did introduce it well. So, uh, we started off our feature platform in 2017, 2018-ish.

Raj Katakam: Uh, and, uh, majorly we are kind of big on, uh, GCP. So with that, uh, we kind of scaled our legacy features from around 3,000 to 50,000 now. So overall it has been a journey, uh, like, you know, with the growing business.

Raj Katakam: What we did is mainly, you can say, a kind of Lambda-style feature platform, if not completely Lambda, where we have [00:11:00] a separate dual pipeline system: a complete batch pipeline for creating features, and also streaming pipelines for creating features. But as such, we have overall gone through a phase of, uh, rapid expansion, and during that phase we have accumulated a lot of debt, uh-

Demetrios: As one does, yes.

Raj Katakam: Yes. And, uh, it's a really good, uh, setup for us to actually try out, uh, Chronon because, uh, obviously a team of four can't compete with an open source realm which has been developing for, uh, like, you know, almost seven, eight years, similar to what we have been doing too, right?

Raj Katakam: Overall, uh, yeah, this, uh, we have done some POCs. That has been a really great journey as such.

Demetrios: Excellent. All right, Ben, I'm throwing the ball over to you. What'd your system look like beforehand? And really tell us about the [00:12:00] good, the bad, and the ugly. Give us all the dirty secrets because I know it's not all roses and tie-dye.

Ben Magyar: I feel like I'm a really bad person to answer this.

Ben Magyar: I joined Depop like six months ago, so then it's like, how do you historically look back? Um, pretty tough-

Demetrios: You've only known the good.

Ben Magyar: Yeah. But I, I think, like, a good reference point for what has changed in, in just that six months since we started moving over is the, um, separation of, like, features from training data and what you generate offline and, like, use online.

Ben Magyar: Um, prior to the migration, we would have features and have feature views, but, like, there was nothing actually unifying that into a training data set. That was often pushed out onto the product teams as a responsibility for them, um, which is pretty difficult for, like, large backfills. How do you start chunking?

Ben Magyar: How do you start, you know, filling in [00:13:00] gaps when you do have problems? Um, and since we've started to move over, the MLOps, the, the platform side has started to own more of that, and there, there's been, like, a responsibility shift in what do product teams own, where do they spend their time, and where does that focus go?

Ben Magyar: Um, and that's really been, like, the six-month shift, um, at least at, at Depop.

Demetrios: Yeah, and maybe you can go a little deeper on what some of these capabilities are with, like, that, that specific issue, you know? And saying, like, what were you trying to unlock there, and how has it worked out so far?

Ben Magyar: Yeah. I think the big unlock, correctly framed, is probably like orchestrating correctness and orchestrating training data. Um, so when you write a feature, you know, you have like a 30-day window. That means that you have a 30-day dependency on some, you know, upstream source. Knowing that you're backfilling that correctly is like one [00:14:00] small step: just, how do I backfill that one feature?

Ben Magyar: Um, usually ranking models are pretty wide, like say you'll have like 300 features. Um, now how do you generate the training set with all 300 features at once? Um, and then how do you orchestrate that training set so it's fully complete across all of the dependencies, you know, with every feature having its own window and, like, alignment requirements?

Ben Magyar: Um, each feature individually would get kind of wrapped in a feature view, and you can orchestrate that separately, but the moment you combine it, and the moment you wanna, like, incrementally add features, all of that was still being managed by the product teams. Um, so what we really wanted to push for and shift to was: the more the platform owned that incremental change in every training set, um, the quicker the product teams could iterate and adopt and test new features.

Ben Magyar: Uh, that was, I'd [00:15:00] say, like the real value in the shift out of the product teams back onto the platform. Um, but also they're just not owning as much, so now they can just focus on the actual task, which is like how do you, you know, get people to match on the products that they actually wanna match over, and not necessarily how do you solve like backfill problems.
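
The window arithmetic Ben describes can be sketched in a few lines of plain Python. This is only an illustration and not Chronon's or Zipline's API: the `WindowedFeature` class, the table names, and the convention that an N-day window covers the row's day plus the N-1 days before it are all assumptions. It shows how a backfill range for a training set widens into a per-source dependency range once several windowed features are combined.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class WindowedFeature:
    """Hypothetical feature spec: an aggregation over a trailing window."""
    name: str
    source_table: str
    window_days: int


def required_source_range(feature, start, end):
    """Upstream date range needed to backfill `feature` for [start, end].

    Assumption: a row dated d with an N-day window reads source days
    d-(N-1) through d, so the earliest day needed is start-(N-1).
    """
    earliest = start - timedelta(days=feature.window_days - 1)
    return earliest, end


def plan_backfill(features, start, end):
    """Widest upstream range per source table across all features."""
    plan = {}
    for f in features:
        lo, hi = required_source_range(f, start, end)
        if f.source_table in plan:
            cur_lo, cur_hi = plan[f.source_table]
            lo, hi = min(lo, cur_lo), max(hi, cur_hi)
        plan[f.source_table] = (lo, hi)
    return plan


features = [
    WindowedFeature("views_7d", "events.views", 7),
    WindowedFeature("views_30d", "events.views", 30),
    WindowedFeature("purchases_90d", "events.purchases", 90),
]
plan = plan_backfill(features, date(2026, 3, 1), date(2026, 3, 31))
for table, (lo, hi) in sorted(plan.items()):
    print(table, lo, hi)
```

With 300 features the same calculation applies per feature; an orchestrator would additionally check that every day in each range has actually landed upstream before marking the training set complete.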

Demetrios: Hmm. So I love digging into some of these pain points, uh, because it really exposes where there's friction, and maybe we can talk about what other pain points have been solved with Chronon and what pain points you were going into. Um, I'll throw this out to anybody to... who wants to take it. Are there pain points that you've had in your mind and you were like, "Look, I'm gonna try and resolve this," and how has it gone so far?

Ben Magyar: I could grab this one too.

Demetrios: All right. You, uh-

Ben Magyar: Yeah.

Demetrios: Keep it [00:16:00] going. Keep it going while the others are thinking.

Ben Magyar: Um, so recently, like during this plan, um, we have embeddings that get generated, and this is a really good example where prior to the actual migration, um, we couldn't put the embeddings inside the feature store because of cost concerns.

Ben Magyar: Um, there were online cost concerns and offline cost concerns. Um, but that means that your training data, your features that you generate, uh, are clearly like misaligned with what you're going to serve online. Like your, your embeddings are not being sourced out of your feature store. They're completely sourced out of something else that gets orchestrated separately.

Ben Magyar: Uh, we were like really intentional that embeddings get sourced from the feature store. Every single thing that you wanna derive off that comes off this feature store. Um, the training data that dumps out the embeddings is now sourced from a single source, um, the online values also. So I think, because costs can be cut, because you can save so much, um, you can also [00:17:00] help resolve those really complex orchestration problems that occur both offline and online.

German Krikorian: I can also go next, add a few things, uh, from our perspective. So one of the things on our side is because we have like a fully custom system, I think both of us, Raj and I, mentioned there's a lot of maintenance overhead as a result, just maintaining custom tooling. So that's something we wanted to move away from, and kind of find an open source community, ideally an open source platform with a lot of, you know, really, um, solid adopters, proven at scale, which Chronon has.

German Krikorian: And, um, so that's one of the things. The other one I would say is the, uh, feature definitions, just ensuring feature correctness. I think that with a structured DSL it is significantly easier to, uh, make sure that the features that you create are great features and, you know, point-in-time correct. Uh, basically it's a construct that's front and center in Chronon, whereas if you have your own, [00:18:00] uh, DSL, your own, uh, kind of logic, it's pretty easy to get wrong, especially if you have a lot of free-form feature definitions going on where you're trying to hand-roll point-in-time join correctness.

German Krikorian: Um, generally you get it right, but, um, if you get it wrong, you have to have really good monitoring in place to be able to catch it. And so we really like that aspect of Chronon, where it makes that a lot easier.
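
For readers unfamiliar with the term, a hand-rolled point-in-time join might look like the sketch below. It is plain Python, not Chronon's DSL; the function name, the tuple layouts, and the choice to exclude events at exactly the label timestamp are assumptions for illustration. The point is that each training row may only see events that happened before its own timestamp, which is the part that is easy to get subtly wrong.

```python
from bisect import bisect_left
from collections import defaultdict


def point_in_time_counts(label_rows, events, window_seconds):
    """For each (key, ts) label row, count that key's events with
    ts - window_seconds <= event_ts < ts.

    Events at or after the label timestamp are excluded, so no
    information from the future leaks into the training row.
    """
    by_key = defaultdict(list)
    for key, event_ts in events:
        by_key[key].append(event_ts)
    for ts_list in by_key.values():
        ts_list.sort()

    out = []
    for key, ts in label_rows:
        ts_list = by_key.get(key, [])
        lo = bisect_left(ts_list, ts - window_seconds)  # window start, inclusive
        hi = bisect_left(ts_list, ts)                   # label time, exclusive
        out.append((key, ts, hi - lo))
    return out


events = [("u1", 100), ("u1", 250), ("u1", 400), ("u2", 300)]
labels = [("u1", 300), ("u1", 400), ("u2", 300), ("u3", 500)]
rows = point_in_time_counts(labels, events, window_seconds=200)
for row in rows:
    print(row)
```

A common mistake is to join labels against a feature table computed once at the end of the day, which would let the `("u1", 400)` event count toward the `("u1", 300)` label.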

Raj Katakam: I can add one more point to that. Uh, so over time we built our batch pipelines first, and then on top of that we built, uh, streaming and, uh, real-time pipelines. And one of the things which Chronon has really solved for us is a unified authoring system. So we had a different authoring style for batch and a different authoring style for, uh, streaming.

Raj Katakam: But now it kinda is unified, which kinda, uh, makes it easier for the data [00:19:00] scientists down the lane, like, uh, because they don't have to interact with two different things. They don't have to think, "Hey, should I put this here in this particular style, or should I put it here in this particular style?"

Raj Katakam: Now it's kinda unified. So, uh, from the user experience side, from the data science experience side, it has been, uh, like, you know, a really good, uh, overall experience per se.
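
The idea of unified authoring can be illustrated with a toy example: one declarative definition, interpreted by both a batch runner and a streaming runner. Everything here (`SumFeature`, `run_batch`, `StreamingState`) is hypothetical and far simpler than a real feature platform; it is a sketch of the concept, not of how Chronon implements it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SumFeature:
    """Hypothetical declarative definition, written once by the author."""
    name: str
    key: Callable[[dict], str]
    value: Callable[[dict], float]


def run_batch(defn, rows):
    """Batch interpretation: scan all rows and total per key."""
    totals = {}
    for row in rows:
        k = defn.key(row)
        totals[k] = totals.get(k, 0.0) + defn.value(row)
    return totals


class StreamingState:
    """Streaming interpretation: update the same totals one event at a time."""

    def __init__(self, defn):
        self.defn = defn
        self.totals = {}

    def on_event(self, row):
        k = self.defn.key(row)
        self.totals[k] = self.totals.get(k, 0.0) + self.defn.value(row)


rows = [
    {"user": "u1", "amount": 5.0},
    {"user": "u2", "amount": 2.5},
    {"user": "u1", "amount": 1.5},
]
spend = SumFeature("total_spend", key=lambda r: r["user"], value=lambda r: r["amount"])

batch_result = run_batch(spend, rows)
stream = StreamingState(spend)
for row in rows:
    stream.on_event(row)

print(batch_result == stream.totals)
```

Because both runners read the same definition, the author never chooses between a batch style and a streaming style, and the two paths cannot drift apart in logic.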

Demetrios: E- eliminating that friction, I like it. Yeah. Mick, I see you wanna say something.

Mick Jermsurawong: Yeah. Um, I think besides all the points that other people talked about, Chronon ergonomics and how it fits with existing workflows and, um, you know, the ML requirements of their previous systems, I think just from the scale perspective that we are starting with, we evaluated multiple solutions at OpenAI, whether we have, like, in-house, um, counters [00:20:00] where people can write their own Flink job.

Mick Jermsurawong: I think... I mean, Flink really suffers from, you know, um, a large bootstrap time when your time window is really large, and Chronon, with this Lambda architecture, is able to effectively address that. Um, we also have an in-house, um, sort of OLAP system where the real-time events are ingested and people have the benefit of dynamic queries that are able to get more expressive features.

Mick Jermsurawong: But Chronon strikes a balance: you have, you know, expressive SQL, but with, um, sort of more, uh, restricted expressions, so you're able to materialize this and actually serve it with much, um, smaller latency compared to dynamic queries issued to OLAP. And, um, so I think, considering it from the scale perspective, that's something that really compels us to use Chronon.

Mick Jermsurawong: And the other part is, I think, the team at open source is absolutely world-class, and I think they really understand this [00:21:00] problem. Nikhil and the team have been thinking about this for so long at really granular detail. Um, so, um, I-

Demetrios: Yeah. They've been living the pain.

Mick Jermsurawong: Yeah. I think we fully trust them, and it has been super collaborative, so I'm definitely advocating for, um, open source Chronon, just because of the founders and the engineering team there.

Demetrios: Ooh, that is a glowing recommendation. I like that. And you spoke a little bit about this stack. Uh, I like how you gave me some frame of reference on it, but maybe you can enlighten me a little more on how Chronon fits in the overarching stack and where it plays.

Mick Jermsurawong: So it's a part of, um, the data platform. So we have different, um, standard data products. Um, how it fits in... So, Chronon, maybe... I guess, let me know if I'm off here, but I would start with just, like, how I understand Chronon in our [00:22:00] system and how it fits with the overall architecture. So basically, because it's a Lambda architecture, you have a batch side and you have a streaming side.

Mick Jermsurawong: And so from the batch side, it fits very much with our in-house Spark system. And, um, on a daily basis, this batch part of the Lambda architecture is being, so-called, compacted: we create the historical, sort of, um, aggregations from history up to, like, midnight UTC, and then we export this to our, um, KV lookup.

Mick Jermsurawong: And then on the real-time side, of course, the source here is, um, real-time, um, Kafka events, where we have Flink that, um, continuously consumes these events and writes out pre-aggregates for, um, the streaming features, and we push this to a KV store. And then during serving time, when the client comes in, let's say we want to get, um, [00:23:00] features for a particular user, um, the Chronon serving layer knows how to pull in from the batch-side store that has been exported from batch Spark, and then from the KV store, which has the intraday pre-aggregates, um, and merge them together to get the full, um, features for the real-time one.
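
The serve-time merge Mick describes can be sketched as follows. This is a deliberately simplified illustration and not Chronon's serving code: the two dictionaries stand in for the batch export and the streaming KV store, the names are hypothetical, and only an unwindowed running count is handled (a windowed feature would also need to drop events that age out of the window).

```python
from datetime import datetime, timedelta, timezone

# Stand-ins for the two stores; a real deployment would use a KV store client.
batch_store = {}      # key -> (snapshot cutoff, count of events before cutoff)
streaming_store = {}  # key -> list of (bucket start, count of events in bucket)


def fetch_count(key):
    """Merge the daily batch snapshot with intraday streaming pre-aggregates."""
    cutoff, batch_count = batch_store.get(key, (None, 0))
    buckets = streaming_store.get(key, [])
    # Only add buckets the batch snapshot has not already covered,
    # otherwise events before the cutoff would be counted twice.
    fresh = sum(c for start, c in buckets if cutoff is None or start >= cutoff)
    return batch_count + fresh


midnight = datetime(2026, 6, 17, tzinfo=timezone.utc)
batch_store["u1"] = (midnight, 120)
streaming_store["u1"] = [
    (midnight - timedelta(hours=1), 5),  # already inside the batch snapshot
    (midnight + timedelta(hours=1), 3),
    (midnight + timedelta(hours=2), 4),
]
streaming_store["u2"] = [(midnight + timedelta(hours=3), 2)]  # no batch row yet

print(fetch_count("u1"), fetch_count("u2"), fetch_count("u3"))
```

The split also hints at why the bootstrap problem mentioned earlier goes away: the streaming job only ever needs to hold events since the last batch cutoff, however long the feature's window is.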

Demetrios: Awesome. And how about you guys, maybe somebody else on here, how does it look? Is it very similar to that? Is it different? Uh, anybody wanna give a stab at it?

German Krikorian: I can hop in. Um, so it's very similar on our side also. That's the beauty of Chronon: it's portable. I think the difference on our side is the KV store.

German Krikorian: I mean, we're in GCP, so we use Bigtable, and we have a lot of experience with Bigtable, and we really like Bigtable, so it's nice that it's, uh, natively supported within, uh, Chronon. But aside from that, yeah, the batch and streaming pipelines are very similar, you know, Spark and Flink, and, uh, we have a similar, you know, serving [00:24:00] layer that is able to pull from both of those and, uh, serve those features.

German Krikorian: Architecturally, before our, before Chronon, our system was pretty similar, so that kind of made it easier to adopt in that we also had, like, batch pipelines and streaming pipelines, except in our case we were relying more heavily on BigQuery and Dataflow as opposed to Spark and Flink. Uh, but the fact that it's, like, cloud agnostic and also pretty similar is a really strong point for us, especially because we're part of Intuit now, and Intuit is on AWS.

German Krikorian: So the cloud interoperability, uh, gives us a lot of advantages and kind of future potential integrations that make this a lot easier. Um, so it felt more like it wasn't really a huge architectural shift, more like a natural evolution of the system that we were working towards, and we had a lot of the similar capabilities kinda in progress, in flight, uh, already.

German Krikorian: So that's kinda how we started to move towards Chronon.

German Krikorian: Anyone else wanna hop on? I'm not sure. Demetrios, can't see you.

Ben Magyar: I guess I could add a little bit onto it. Um, so for us, I'd say it's pretty much the [00:25:00] same. I think there's always like some slight difference inside of some part of it. So for us it was like, oh, on our data warehouse side, we have unpartitioned tables.

Ben Magyar: Um, they're all like clustered over timestamp columns instead. So there's like some shifts on the orchestration work and like how you declare dependencies that weren't like naturally fitting. The Zipline team was there to help on that. Um, but otherwise, yeah, it mostly follows the same kind of, you know, construct around it.

Raj Katakam: Got it.

Mick Jermsurawong: Yeah, I think it's worth pointing out that the system is very elaborate. It has a lot of moving parts, but the fact that it's actually applicable to many organizations, I think that speaks highly to the right level of abstractions, that Chronon open source is getting this right. So yeah, very, very happy to, um, onboard Chronon.

Mick Jermsurawong: I mean, without Zipline, without the Chronon, um, open source team's help, it would be really difficult to get [00:26:00] started by ourselves. And, um, I think maybe when I said, like, we were able to set up real-time streaming within one month, that is actually on the foundation of having the batch running for a while.

Mick Jermsurawong: But otherwise, I think it is really hard to, um, get started, yeah.

Demetrios: So it looks like you were able to talk without me. Uh, I think I dropped off there for about a minute, if you didn't notice. I, I'm not sure what I missed, but I'm sure it was amazing. That is one thing I, I am sure of. And so did anybody else wanna talk about this point, or should we keep it moving?

Demetrios: It's kind of a speak now or forever hold your peace type of thing. All right. We're gonna keep it cruising. I would love to know, um, as you have looked through some of these different pieces in the stack, I know that when I was with the Chronon folks in Seattle a few months ago, they were talking about ways that they're incorporating, like, the LLM capabilities into what MLOps people do on a day-to-day basis.

Demetrios: I don't wanna just be like, "Hey, you think you could use AI for some of this or, like, coding agents on this?" But maybe we could take a little bit of a broader step and think about what different use cases you see [00:27:00] in the future for using something like Chronon or Zipline. Like, what are the major use cases, and how are those different, if at all, from what you are doing or have been doing in the past?

Demetrios: Go.

Mick Jermsurawong: I think, um, with agents and generative AI, we had a lot of questions about that getting started with Chronon, coming from OpenAI. And thinking about Chronon as, you know, a tool that an agent can call to deterministically get this, and, um, do it at a much cheaper cost, I think that's...

Mick Jermsurawong: it's an endearing aspect of standard feature engineering. Um, speaking from experience and knowing a colleague at a large fintech company, um, they are using agents to orchestrate a lot of, you know, user-facing experiences, but they do rigorous back [00:28:00] testing, and they use Chronon as a part of that to actually feed input to the agents.

Mick Jermsurawong: And having Chronon fetch these, let's say, last-K features or last-K items, so that the agent can reason about the context of the world in recent history, they, um, they still use Chronon just because it unifies both the classical machine learning models and, um, the agentic work stream that actually wants reproducibility and the ability to back test.
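
A "last K items as of a timestamp" lookup, the kind of context Mick describes feeding to an agent, can be sketched like this. It is plain Python over an in-memory event log, with a hypothetical function name and tuple layout, and says nothing about how Chronon computes or stores such features. What it shows is why the lookup is reproducible: the result depends only on the log and the `as_of_ts` argument, so a back test can replay exactly what the agent would have seen.

```python
def last_k(events, key, as_of_ts, k):
    """Return the k most recent items for `key` strictly before `as_of_ts`,
    newest first. `events` is an iterable of (key, ts, item) tuples."""
    matching = [
        (ts, item)
        for event_key, ts, item in events
        if event_key == key and ts < as_of_ts
    ]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in matching[:k]]


events = [
    ("u1", 10, "a"),
    ("u1", 20, "b"),
    ("u1", 30, "c"),
    ("u2", 15, "x"),
]

# What the agent would have seen at ts=25 versus at ts=100.
print(last_k(events, "u1", as_of_ts=25, k=2))
print(last_k(events, "u1", as_of_ts=100, k=2))
```

Passing the historical decision time as `as_of_ts` during a back test plays the same role as the point-in-time join does for training data.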

Demetrios: Awesome. Brave to think about what the future might hold or make a prediction on something we have no idea about. I can chime in a little bit. Oh. Um, like from the just feature definition and just set up phase, I think im- um, using agents makes it a lot easier in terms of like you don't really have to manually write feature definitions that much anymore, especially if you have the structured DSL like Chronon [00:29:00] provides.

German Krikorian: If you give it enough context and enough examples (and there's also a Chronon skill that exists, which we started using pretty recently and it's been working pretty well), you can give it freeform text and you'll be able to generate pretty good feature definitions.

German Krikorian: Maybe it needs some slight tweaking back and forth, or updated documentation. But ultimately I think we can get to a state where we don't really have to write Chronon DSL manually at all, or almost at all. I think that's a really promising feature that will make feature iteration, and just creating features, a lot easier.

Demetrios: Hmm. Yeah, you gotta love that. Take a little bit of that painful rote work out of it. So I've got a question from the chat that's coming through. Gabriel's asking, "Do you feel Chronon has a large learning curve? [00:30:00] How easy was the adoption?" I'll tack on there: what's the developer experience like?

Demetrios: How does that work? And I think one thing worth noting for tools like this is that you all are consumers or users of the tool, but then you have your own consumers of that tool. So there are two aspects to it, and I know that makes it particularly difficult to build for this type of area, 'cause you have other personas that maybe aren't as strong in software engineering as you who also wanna leverage these tools.

Demetrios: And things that you think are blatantly obvious maybe aren't as obvious to a data scientist. So, with all that out there, what's the developer experience like?

Demetrios: And Raj, you're here, so I'm gonna throw you under the bus and ask you. [00:31:00]

Raj Katakam: Yeah. The part I liked is that Zipline made a lot of this easier. They have encapsulated, or abstracted, all of this: creating jobs from just your feature definitions, and even running backfills, seeing the impact of those new feature definitions, and evaluating those.

Raj Katakam: If you were to do that with direct open source, you'd have to create multiple jobs and then bring them all together so that you can iterate on them. Those kinds of pieces the Zipline folks have made easier with their encapsulations and the top-level standardization they have done on top.

Raj Katakam: And that eased a lot of the learning curve, I would say. So now even if you give any coding assistant these [00:32:00] docs, it automatically just makes it happen for you, and in a very easy fashion. So that's something which helped us sell it to our downstream consumers as well while we were doing this.

Raj Katakam: Because there are two aspects to it, right? One piece is that we as engineers, once they author the features, have to maintain those and keep the systems running. The second piece is that we have to make it easy for the downstream data scientists to actually try out new experiments or new features, author something, and make that particular path very easy, right?

Raj Katakam: So that's where, uh, this one has really helped us out.

Ben Magyar: The part that comes with Chronon and [00:33:00] Zipline is that it works offline, so it works on the online side. Prior to the move, there was usually some work around a feature that you generate offline, tack onto your DAG offline, and then use to train on.

Ben Magyar: But then you have to go productionize it 'cause you see some lift offline, and it's just a ton of work to do. You're orchestrating multiple model calls online, and you have to make sure that they're both getting updated around the same time. So I think the offline-online ease, where you just know it's gonna work right away, is such a...

Ben Magyar: It just brings so much benefit to the product side that they're pretty happy with it immediately, once they figure out that they don't have to go productionize every single feature they generate, because they just know it's gonna work once they flip the switch on.

Ben Magyar: I think that alleviates any of the mental [00:34:00] overhead that they might have feared early on.

German Krikorian: So, the question about the learning curve and how easy the adoption was: not to sugarcoat it, it definitely has a pretty steep learning curve, and I think adopting it is probably the biggest hurdle in the Chronon system itself.

German Krikorian: It is a brand new system. You have to learn it and understand how it works, and it's a continuous process. Especially if you have a mature system now that you understand very, very well, and you're going into this brand new system that you have to continuously learn about and understand the ins and outs of too.

German Krikorian: That does take time, and you have to reason about how you're going to set it up and align it to your organization and how you do things. Every company does things a bit differently. So there is that steep learning curve. But having said [00:35:00] that, I think working with the Zipline folks helped us significantly.

German Krikorian: We had a lot of back and forth with them, and for a lot of features that we needed that open source Chronon didn't have, we were able to collaborate, and that helped us a lot. One of the biggest ones is the GCP integration, which was, I think, non-existent at the time we were looking into Chronon, so the Zipline folks helped us a lot there.

German Krikorian: And then a couple of other things: orchestration through the Zipline hub is a lot easier compared to running your own Airflow and trying to figure out DAG dependencies. All of that kind of gets abstracted away from you. So it makes it a lot easier to set up, and it's also collaborative as a learning experience.

Mick Jermsurawong: Can I add to that?

Demetrios: Go ahead, Mick. Sorry.

Mick Jermsurawong: Yeah. I think maybe this was mentioned before, but with agents helping people write code: [00:36:00] for my team, doing both Chronon and also Flyte as an orchestration system, people really don't write code these days anymore, and they don't look at the artifacts at all.

Mick Jermsurawong: So I think agents are pretty good at reasoning about Chronon, but the key here is perhaps education for your users. People are sometimes stuck on textbook examples, like a Kaggle competition, where you have a feature set and you apply super expressive Python third-party libraries and ten layers of transformations to the features before feeding them to your training.

Mick Jermsurawong: But I think it helps to educate people that these are real-time features, and that there's actually a lot you can get from real-time events and just a standard query. So the thing [00:37:00] that I always tell my users is: okay, imagine that you have event analytics over user event streams.

Mick Jermsurawong: Chronon is actually a SQL probe. You express your SQL expression; you only have that. You put this probe into that event stream, and Chronon will be able to get you features in real time. Once people understand that and have a good mental model around it, I think it's easier for them to express to the agents what they want, and the agent will be smart enough to translate what the user wants into Chronon's language.
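
The "SQL probe" mental model can be illustrated with a minimal sketch in plain Python. This is not Chronon's API: the `EventProbe` class, its parameter names, and the event fields are all hypothetical, and the filter and row expression are Python callables standing in for a SQL `WHERE` clause and `SELECT` expression. It assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class EventProbe:
    """Hypothetical 'probe': a row filter, a row expression, and a windowed sum per key."""

    def __init__(self, key, where, select, window_seconds):
        self.key = key                        # name of the entity key field, e.g. "user_id"
        self.where = where                    # stand-in for a SQL WHERE clause
        self.select = select                  # stand-in for a SQL SELECT expression
        self.window_seconds = window_seconds  # length of the rolling window
        self._events = defaultdict(deque)     # key value -> deque of (ts, value)

    def on_event(self, event):
        """Called for every event in the stream; keeps only rows that pass the filter."""
        if self.where(event):
            self._events[event[self.key]].append((event["ts"], self.select(event)))

    def fetch(self, key_value, now):
        """Return the windowed sum for one key as of time `now`."""
        rows = self._events[key_value]
        cutoff = now - self.window_seconds
        while rows and rows[0][0] <= cutoff:  # drop rows that fell out of the window
            rows.popleft()
        return sum(value for _, value in rows)


# The "probe" is only a filter plus an expression; the aggregation is declared once.
purchases_1h = EventProbe(
    key="user_id",
    where=lambda e: e["type"] == "purchase",
    select=lambda e: e["amount"],
    window_seconds=3600,
)
```

The point of the sketch is that the user only supplies the filter, the expression, and the window; everything else (state, eviction, lookup) belongs to the platform.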

Demetrios: Yes. Sometimes, man, it is so wild how it's those simple unlocks where if you just understand how to speak to the agent and explain it, give it what it needs, uh, and it's like, "Wow, that was not very difficult from my part, but I used the right words in the right sequence, and I have a whole new world in front of me."

Demetrios: Let's keep it moving 'cause there's [00:38:00] some good questions coming through in the chat, and we only have a few minutes left. I would love to know two things. I'm gonna start with: what other feature stores or feature platforms did you all evaluate, if any? I think the Intuit Credit Karma guys said that you were already using one, and you moved off of that.

Demetrios: During that migration, did you look at anything else that was open source? I know there's a few out there. And then also, so get it ready: what are the bad parts about this? We had one person in the chat saying, "Hey, we just have all the pros and no cons? It's a little sus." So start thinking about that too.

Demetrios: Hit me. Raj, I see you're off mute. What other things did you evaluate? Let's start with that, while you were making that migration.

Raj Katakam: So for us, [00:39:00] the constraint space is quite large because we are operating in a finance space, right? It's, it's FinTech, right? So one of the biggest things is that we wanted, uh, an on-prem solution.

Raj Katakam: We didn't want to evaluate anything where we have to move the data, because moving data around introduces a different complication altogether. Even if you have all the security measures in place, it makes the whole space very complex to operate in.

Raj Katakam: So we were evaluating options where we can bring in the whole technology and host it in our own infrastructure, and that's one of the reasons why Chronon scored a lot of points. We also wanted to [00:40:00] build on something open source, so that we can make it work for our ecosystem if we diverge from it in some way.

Raj Katakam: And we also wanted to have a support group for that, to make things move faster. That's the reason why Chronon and Zipline together got a lot of points. We were looking at other options as well, but mostly all of them fell apart on this particular thing: that we wanted to bring that technology in and put it in our own infrastructure.

Raj Katakam: That's where we were giving a lot of points to. So, uh, yeah. That, that was one.

Demetrios: Yeah.

Raj Katakam: And you were talking about pain points also, right? One pain point which I do wanna talk about: this system is really good from the end user's perspective, where you're actually getting the [00:41:00] right definition when you say, "Hey, I want a feature aggregate from this point in time back over the last 360 days," or something, right?

Raj Katakam: You're getting it. But underneath, you have three ecosystems operating with each other, and they are talking to each other with precision. That makes it extremely difficult to, maybe not maintain, but debug: debuggability gets extremely tough if something goes wrong, right?

Raj Katakam: If any of these systems falls apart at some place, your final definition goes wrong, and to figure those pieces out you need to really understand how they are operating and how they are handshaking, so as to make sure that the thing running in production is right.

Raj Katakam: If something [00:42:00] goes wrong, you need to know exactly where to look so that you can bring things back to normal. Across the whole ecosystem we are adding a lot of guardrails so that such things don't happen, but overall you need to understand that these three systems have to work together to give the final product, and that's where a lot of friction lies.

Mick Jermsurawong: I'm definitely biased, so I want to be transparent.

Mick Jermsurawong: You said we only say the positive things. I was from Stripe, and we were big users and big fans of Chronon. But of course we had to do due diligence. We looked at other in-house solutions, and we looked more at the computational model and how it is actually going to stand the test of scale, and we believe that Chronon does.

Mick Jermsurawong: But of course, I also want to be [00:43:00] balanced: there are efficiency challenges with Chronon. It can be expensive. We worked with the team aggressively to build different computational schemes (not ML models) that can optimize for things.

Mick Jermsurawong: So there's definitely room for some trade-offs that you can strike to get better efficiency.

Demetrios: Are these things like data freshness and that type of...

Mick Jermsurawong: Yes. I think it's the operability of Flink state, the streaming part: how big the job would be compared to the...

Mick Jermsurawong: So maybe technically, on the serving side, if you want to do more work on the write side, you get a much cheaper read side. For us in recommendation systems, a single call to Chronon [00:44:00] usually fetches a large number of candidates, right? I think Ben might know this well too. Basically, the catalog, or the universe of things that we want to select from, is not too big.

Mick Jermsurawong: It's maybe medium-size cardinality, but the number of users that come in is a lot higher, and the fan-out of the things we want to fetch is a lot higher. So for one user, we may actually serve 1,000 ads at the beginning of the recommendation pipeline. So in that case, the read side, if it's using a traditional...

Mick Jermsurawong: When I was saying that Chronon merges pre-aggregates from multiple time buckets together because it wants to give you real-time features, your read then scales a lot with that fan-out to serve this recommendation use case. So [00:45:00] to strike the balance here, you actually do a lot more on the Flink side.

Mick Jermsurawong: You keep a timer so that you can evict. Basically the challenge here is that it's a rolling window aggregate, right? That means that Flink always has to be active and keep writing out and publishing things. Compare that with just going with pre-aggregates in the naive, basic Chronon features, where Flink actually just writes out the...

Mick Jermsurawong: So the five-minute window or one-hour window that it cares about, and after that it doesn't have to care about trying to evict or keep the state anymore. It pushes the responsibility to the reader side, but that comes with a cost. This is a recent development, but I think it actually allows Chronon to scale a lot more.
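
The read-side versus write-side trade-off described here can be sketched in plain Python. This is an illustration under stated assumptions, not Chronon's or Flink's implementation: `TiledStore`, `RollingStore`, the 300-second tile size, and the `on_timer` hook are all hypothetical names, windows are rounded to tile boundaries, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

TILE = 300  # assumed tile size in seconds (five-minute buckets)


class TiledStore:
    """Write side stays cheap: add to one tile. Read side merges every tile in the window."""

    def __init__(self):
        self.tiles = defaultdict(float)  # (key, tile_start) -> partial sum

    def write(self, key, ts, value):
        self.tiles[(key, ts - ts % TILE)] += value

    def read(self, key, now, window):
        """Return (sum, number of tile lookups) for tiles starting at or after now - window."""
        start = -(-(now - window) // TILE) * TILE  # round up to a tile boundary
        total, lookups = 0.0, 0
        for tile_start in range(start, now + 1, TILE):
            total += self.tiles.get((key, tile_start), 0.0)
            lookups += 1
        return total, lookups


class RollingStore:
    """Write side does the work: keep state, evict on timers, publish the final value."""

    def __init__(self, window):
        self.window = window
        self.events = defaultdict(deque)  # key -> deque of (ts, value), kept as state
        self.published = {}               # key -> current windowed sum, one lookup per read

    def write(self, key, ts, value):
        self.events[key].append((ts, value))
        self._publish(key, ts)

    def on_timer(self, key, now):
        # Stand-in for a streaming timer: the job must wake up even without new events.
        self._publish(key, now)

    def _publish(self, key, now):
        rows = self.events[key]
        while rows and rows[0][0] < now - self.window:  # evict expired events
            rows.popleft()
        self.published[key] = sum(value for _, value in rows)

    def read(self, key):
        return self.published.get(key, 0.0)
```

With a 900-second window and these tile settings, each tiled read touches 4 tiles, so a request that fans out to 1,000 candidates performs 4,000 tile lookups, while the rolling store performs 1,000 single lookups at the cost of keeping state and timers alive on the write side.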

Mick Jermsurawong: There are definitely more efficiency challenges [00:46:00] at scale, but I think the team is actively working on this. For a medium-scale use case, I think Chronon is definitely a great fit.

German Krikorian: Can I jump in also?

Demetrios: Yeah. Hit me.

German Krikorian: I wanted to talk a bit more about the pain points and just be transparent about some of the negatives that we've gone through.

German Krikorian: So again, it's going back to the adoption. It's not only deciding that you want to adopt Chronon; as you're actually in the middle of the adoption and planning things out, you're probably gonna run into some challenges and some things you have to reconcile in terms of how your current system operates and how Chronon wants to operate.

German Krikorian: Or you have to modify Chronon to align more towards your system, which is a challenge. A couple of examples I'll give: one of them is a feature versioning strategy. I think Chronon was a bit lacking in that department [00:47:00] when we were doing the POC, and I think the support for that is improving, 'cause a few clients, including us, wanted to push for a stronger versioning strategy.

German Krikorian: And again, the Zipline folks helped a lot there, and there's good support for versioning in the Zipline fork. But that's just one of those things where, if you wanna modify features in place, for example, that's gonna be a challenge, and there are just some quirks and some things you have to think through.

German Krikorian: The other thing is in terms of performance. In the POC, we were stretching the engine quite a bit 'cause we chose a really large use case, one of our largest use cases at Credit Karma, and a specific expensive aggregation that we like to do a lot of at Credit Karma. It was stretching Chronon a little bit, and we saw a few red flags in performance, and it was like, "Oh, I don't know, this is not looking too great."

German Krikorian: We worked, again, with the Zipline folks here, and that ultimately culminated in the skew-free union join. Nikhil gave [00:48:00] a really good talk about it a few months ago, but it improved the join performance by a huge factor, like 5 to 10x for us. So depending on your data sets, your use cases, and what you like to do as a company, you will run into those challenges.

German Krikorian: But the nice thing about having this community and having more people that use the same platform is that other people will have these challenges, and the platform will improve as a whole over time. So it's gonna get stronger and stronger, as opposed to working in a silo within your own company on a custom system.

German Krikorian: If you adopt a framework that a lot of other companies are using, struggling through certain things with, and improving, ultimately everybody will be better off for it.

Demetrios: Such a good point. And you don't have that problem where it's like, "Oh yeah, that engineer that just left knows how to do that thing."

German Krikorian: Exactly, yeah, yeah.

Demetrios: Which we've all... Like, the reason we're laughing is because we've all run into it and it's just like, "No, the documentation sucks [00:49:00] on this. Why is this not working? And it's, uh, so vague." Hopefully now with agents writing our documentation, it's not gonna be like that forever, but who knows?

Demetrios: Uh, anyway, Ben, what you got for us? You wanna bring us home? We've got about one minute left. You're on the clock, and then we'll wrap it up.

Ben Magyar: Sure. I'd say the hard part is usually during the adoption: there are some really good data expectations that sit inside of Chronon, and for sure not every company aligns with those expectations.

Ben Magyar: Like, your data's partitioned; the schema of your Kafka topic aligns with the schema of the offline table. We hit those during the adoption period, and either there are breakout strategies like, "Okay, go and re-normalize your table in a staging query within Chronon," or you push back onto the Zipline side with, "Hey, we need help," [00:50:00] and then they'll come and fill that gap with, "Okay, here's how you get support on time-partitioned tables."
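
As a rough illustration of the two expectations mentioned (a partitioned offline table, and a stream schema that lines up with the table schema), here is a minimal pre-flight check in plain Python. It is a sketch only: the `schema_mismatches` helper, the dict-based schema representation, and the `ds` partition column name are assumptions for the example, not part of Chronon or Zipline.

```python
def schema_mismatches(stream_schema, table_schema, partition_column="ds"):
    """Compare a stream schema with an offline table schema.

    Both schemas are plain dicts of column name -> type name (an assumed
    representation). Returns a list of human-readable problems; empty means aligned.
    """
    problems = []
    for name, type_name in stream_schema.items():
        if name not in table_schema:
            problems.append(f"missing in table: {name}")
        elif table_schema[name] != type_name:
            problems.append(
                f"type mismatch for {name}: stream={type_name}, table={table_schema[name]}"
            )
    # The partition column only exists offline, so it is checked on the table alone.
    if partition_column not in table_schema:
        problems.append(f"table has no partition column: {partition_column}")
    return problems


kafka_schema = {"user_id": "string", "amount": "double", "ts": "long"}
offline_schema = {"user_id": "string", "amount": "float", "ts": "long"}
for problem in schema_mismatches(kafka_schema, offline_schema):
    print(problem)
```

Running a check like this before onboarding a source surfaces the mismatches up front, which is when the choice between re-normalizing in a staging step and asking for platform support is cheapest to make.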

Ben Magyar: I think that alignment is pretty difficult to start with, but it's a one-time cost, and the moment it's over, you've got the good expectations that were built into Chronon within your org now.

Demetrios: Folks, this has been extremely helpful for me.

Demetrios: I really appreciate you all coming and doing this and being very transparent with everything. If anyone has any more questions for you, I will direct them to hit you up on LinkedIn. I think we're all on LinkedIn. And, uh, if not, then scour the internet and find them on X or wherever it may be. And we're gonna drop in a link to...

Demetrios: I'm gonna try and find Nikhil's talk that was referenced. And we will be sending [00:51:00] everyone a follow-up email so that if you want to go deeper down the rabbit hole, you can, and you can see a few of these assets that were referenced. So until next time, thank you all for doing this. This was a blast. I appreciate you, and I'll see you later.
