MLOps Community

Data Engineering for Streamlining the Data Science Developer Experience // Aishwarya Joshi // DE4AI

Posted Sep 18, 2024 | Views 1.1K
Aishwarya Joshi
Machine Learning Engineer @ Chime

Aishwarya is a machine learning engineer who has worked in software domains across the stack from firmware and hardware to Java backend development to now data science and ML. She previously studied electrical and computer engineering at Carnegie Mellon University. Currently, she works at Chime, focused on their ML platform supporting development of models for financial fraud detection and marketing use cases.

SUMMARY

At Chime, low latency inference serving is critical for risk models identifying fraudulent transactions in near real time. However, a large amount of the time spent creating these models goes to feature engineering: creating and processing features to serve models at training and inference time is key to the data science user experience, but difficult to optimize given challenges in scaling and data quality. How can we enable data scientists to deploy features for training and ensure that these features are replicated with parity for real time model inference serving, while meeting low latency requirements for fraud detection as the scale of transactions being processed grows? The answers are in the underlying infrastructure supporting feature storage and ingestion, as well as the frameworks we expose to data scientists to streamline their development workflow.

TRANSCRIPT

Adam Becker [00:00:09]: Aishwarya, are you here?

Aishwarya Joshi [00:00:10]: Hello. Can I be heard?

Adam Becker [00:00:14]: Yes, I think you're heard. Okay, awesome. Aishwarya, welcome to the stage. I'm gonna come back in a few minutes. Take it away.

Aishwarya Joshi [00:00:23]: All right, great. Hi everyone, I'm Aishwarya. Really excited to be able to speak today. I'm a machine learning engineer at Chime, which is a fintech company that aims to address everyday people's needs with free and helpful financial services. I'm going to be talking about some data engineering practices that support our data science developer experience, particularly when it comes to feature engineering. So I'll explain a bit of context to break this down. I'll get into high level infrastructure and the interfaces through which data scientists can leverage that infrastructure. And then at the end, I'll briefly touch on some common feature engineering challenges that we have to keep in mind when solving these problems.

Aishwarya Joshi [00:01:04]: So, being a fintech, we have lots of use cases for machine learning. Obviously that includes fraud detection. Examples can include payments between users, mobile check deposits, and unauthorized account logins. And many of our use cases rely on near real time inference because of very strict latency requirements, being able to respond as quickly as possible. So, for example, in terms of unauthorized account access, we want to act as soon as possible before a malicious actor can do harm by accessing the user's funds. And Chime handles a lot of data, so the challenges of low latency inference serving grow even more as our user base continues to grow and we have more and more data to store internally. So when we're developing models for some of these use cases, we generally have three types of features that the models depend on. Going from left to right: batch features, which are calculated over a window of historical data, typically at a regular cadence.

Aishwarya Joshi [00:02:04]: An example would be a feature calculating total withdrawals over the past week for a user, updated on a daily basis. Near real time features are also computed over a window of time, but typically have lower latency because we have streaming pipelines and streaming ingestion to support their computation. That could be a feature like the past five minutes of transactions for a given user. And then of course, real time features are self explanatory: they're available right away. We typically get them from the calling service in the request payload upon model invocation. So with that information in mind, as a data scientist or MLE, I really care about a few things. I want fast feature lookup to serve my real time models. I want data quality checks in my ingestion infrastructure so that I'm aware of feature quality impacting my model's performance. I want interfaces to define my feature logic without having to worry about whether I have parity between my development and production environments. And I want to make sure, and this is easier said than done, that the features generated are correct.
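
To make the three categories concrete, here is a minimal sketch of how such feature types might be represented; the class names and fields are illustrative assumptions, not Chime's actual feature library schema.

```python
# Illustrative only: these names and fields are assumptions, not Chime's schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FeatureKind(Enum):
    BATCH = "batch"              # computed over historical windows on a schedule
    NEAR_REAL_TIME = "near_rt"   # computed by streaming pipelines, fresher values
    REAL_TIME = "real_time"      # taken directly from the request payload


@dataclass
class FeatureSpec:
    name: str
    kind: FeatureKind
    window: Optional[str] = None          # e.g. "7d" or "5m"; None for real time
    update_cadence: Optional[str] = None  # e.g. "daily" for batch jobs


examples = [
    FeatureSpec("total_withdrawals_past_week", FeatureKind.BATCH, "7d", "daily"),
    FeatureSpec("txn_count_past_5m", FeatureKind.NEAR_REAL_TIME, "5m"),
    FeatureSpec("transaction_amount", FeatureKind.REAL_TIME),
]
```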

Aishwarya Joshi [00:03:04]: Especially this is tricky with data that's always evolving over time, which we'll discuss more a bit later. We can translate those needs into some technical requirements for our infrastructure. To support data scientists' and MLEs' work, I'm going to need an online feature store to serve my real time models with low latency, separate from an offline store, or at least offline computation of features. That might be sufficient for batch models and training pipelines, but it's not good enough for our real time use cases, where we have to serve features as quickly as possible and do inference in milliseconds. Secondly, a feature library will help us consolidate subject matter expertise and define features in one place, so that anyone in our company or organization can reuse them to create their own model or do data analysis. We also need to combat training-serving skew, which introduces feature disparity and, obviously, model performance degradation between training and production inference serving. We'd also want to eliminate feature mismatch caused by ad hoc feature serving logic, or by reimplementing feature logic in different environments and reinventing the wheel, when we can just have it all consolidated in one place.

Aishwarya Joshi [00:04:21]: And finally, when it comes to feature serving, I just want to make sure that I'm computing correct point in time feature data. Again, that's tricky when data is always updating over time and you have variable system lag to deal with. So to help meet some of those requirements, at a high level our platform supports different infra for batch and near real time feature serving. On the right side, we have Kinesis streams and Glue jobs to execute transformations in the Spark SQL environment. This supports our near real time features with more efficient parallel processing of large workloads, and we can maintain fresher features when it comes to near real time processing. The data is written to our online store, so models can just query those precomputed features through more performant reads. On the bottom left, for batch features, we can compute features on demand, or we can read them from an offline store. Since we have less strict latency requirements than we do with real time, we don't always want to deploy features using our near real time ingestion infra with these Kinesis streams and Glue jobs, because it's generally more expensive to use those resources.
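
As a rough sketch of the kind of near real time aggregation such a streaming job might perform, here is a Spark Structured Streaming example; the rate source and console sink are placeholders standing in for the Kinesis stream and the online store writes described in the talk.

```python
# Sketch of a near real time windowed aggregation; source and sink are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nrt_feature_sketch").getOrCreate()

# Placeholder source: in production this would be a Kinesis stream of transactions.
transactions = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("user_id", (F.col("value") % 100).cast("string"))
    .withColumn("amount", F.rand() * 100)
)

# Tumbling 5-minute window per user, e.g. "transactions in the past five minutes".
txn_stats_5m = (
    transactions
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"), "user_id")
    .agg(F.count("*").alias("txn_count_5m"), F.sum("amount").alias("txn_amount_5m"))
)

# Placeholder sink: a real pipeline would upsert these rows into the online store.
query = txn_stats_5m.writeStream.outputMode("update").format("console").start()
```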

Aishwarya Joshi [00:05:21]: So it's only beneficial if that low latency is make or break for your model performance. A model that generates a batch of predictions once per day might not really need it. And then there are interfaces that data scientists commonly interact with to leverage that infrastructure. On the right side, users define feature logic in our library to consolidate query logic, and that can also help us generate ingestion job configurations, et cetera. On the left side, users specify which features to use in model development, when they're actually training a model. So I'm going to go into a little more detail on how these interfaces enable data scientists during feature development. In the feature library on the right hand side, when users define their features, they can specify details about the feature, like metadata, along with the actual SQL query logic. For the features, we can detail the windows of time that we want the feature to be computed over, and the cadence at which we want the job to be kicked off, so that we can run all those transformations and then update the value in the feature store. Once we define this feature information, CI jobs help us generate the configuration we need for Glue jobs, the configuration we need for data quality checks, et cetera.
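
A hypothetical feature definition along those lines might look like the sketch below. The pieces it contains (metadata, SQL query logic, a window, a cadence, quality check settings) come from the talk, but the concrete field names, SQL dialect, and shape are invented for illustration.

```python
# Hypothetical feature definition; the structure is an assumption, not Chime's interface.
feature_definition = {
    "name": "total_withdrawals_past_week",
    "owner": "risk-ds",
    "description": "Sum of withdrawal amounts per user over a trailing 7-day window",
    "entity": "user_id",
    "window": "7d",
    "update_cadence": "daily",  # how often the ingestion job is kicked off
    "sql": """
        SELECT user_id,
               SUM(amount) AS total_withdrawals_past_week
        FROM transactions
        WHERE type = 'withdrawal'
          AND event_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
        GROUP BY user_id
    """,
    "quality_checks": {
        "max_null_rate": 0.01,
        "freshness_sla": "26h",
        "alert_channel": "pagerduty:risk-features",
    },
}
# CI could take a definition like this and generate the Glue job and
# data quality check configuration from it, as described in the talk.
```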

Aishwarya Joshi [00:06:29]: And data scientists are also able to update those data quality checks via config, and set up what kind of alerts they want in PagerDuty and how they want to conduct checks in their SQL query logic. So the interface enables them to specify which features to use, and the logic for those features defined in our feature library is reused during model development on the left hand side, which lets us make sure we maintain the same logic regardless of the environment we're doing feature computation in. Data quality checks enable data scientists to specify checks at different points of the ingestion flow. We can check before a data transformation occurs to make sure our source data is good quality and there's no corruption or missing data; that can save us money before we kick off a pipeline that's going to fail or error out anyway. Then of course there are checks after we do the transformations, to make sure we didn't lose or corrupt data, and after data is loaded to our sink we can check it again. A data quality check there is useful to make sure that downstream consumers are not receiving corrupted data, those downstream consumers being real time models that rely on the features ultimately landing in the feature store. We can check null rates, duplicates, freshness, et cetera, all the usual stuff. I'll close out then by briefly highlighting some of the challenges we have to address in feature engineering.
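
The checks described above (null rates, duplicates, freshness) could be sketched roughly as follows, using pandas purely for illustration; the thresholds and the helper function are assumptions, and a real pipeline would run equivalent checks inside the ingestion jobs and alert via PagerDuty rather than return a list.

```python
# Illustrative checks only; thresholds and helper are assumptions, not Chime's code.
import pandas as pd


def run_quality_checks(df: pd.DataFrame, ts_col: str, key_cols: list,
                       max_null_rate: float = 0.01,
                       max_staleness: pd.Timedelta = pd.Timedelta("1h")) -> list:
    failures = []

    # Null rate: corrupted or missing source data shows up as unexpected nulls.
    worst_null_rate = df.isna().mean().max()
    if worst_null_rate > max_null_rate:
        failures.append(f"null rate {worst_null_rate:.3f} exceeds {max_null_rate}")

    # Duplicates: the same record loaded twice silently inflates aggregates.
    if df.duplicated(subset=key_cols).any():
        failures.append("duplicate records found for key columns")

    # Freshness: stale data means downstream real time models see old features.
    latest = pd.to_datetime(df[ts_col], utc=True).max()
    staleness = pd.Timestamp.now(tz="UTC") - latest
    if staleness > max_staleness:
        failures.append(f"data is stale by {staleness}")

    return failures
```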

Aishwarya Joshi [00:07:51]: So feature availability lag: that's a challenge that can introduce feature disparity and make point in time feature generation trickier. Just to define it again, point in time feature generation is basically making sure that when we generate features for a sample at a given timestamp, we are only joining records from different sources using data that's available at or before the prediction timestamp. If you include data after that, it constitutes data leakage from future data values and will give you an inaccurate picture of your model's performance, inconsistent between training and deployment. In addition to this, we also see delay from source data availability, just being loaded and available in the source storage location, and also delay due to processing itself. So we have to be able to account for these delays at training and inference serving, when data scientists actually configure the features for model development on the left hand side here, or in the feature library on the right hand side here. We essentially parameterize that, so that if a feature is served, it might be delayed by, say, five minutes because of lag at inference time.

Aishwarya Joshi [00:08:59]: That means the feature is only based on data available up to five minutes before the prediction is generated. So at training time, you also have to make sure you're only creating the feature based on data up to five minutes before the timestamp. And by timestamp I mean the point in time the data became available for feature processing in the first place. So to close out, I hope you gained some insights from this quick talk: namely, more understanding of the infrastructure requirements for both batch feature processing and real time feature processing to maintain model performance in production, and also how we can expose unified frameworks for data scientists despite this divergence in the underlying infrastructure. So yeah, thank you everyone for listening.
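
As an aside, the point in time join with a parameterized availability lag described above can be illustrated with a small pandas sketch; the five-minute offset, column names, and data are made up for the example.

```python
# Point in time join with an availability lag; data and names are illustrative.
import pandas as pd

labels = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "prediction_ts": pd.to_datetime(
        ["2024-09-01 12:00", "2024-09-01 12:30", "2024-09-01 12:10"]),
})

features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "feature_ts": pd.to_datetime(
        ["2024-09-01 11:50", "2024-09-01 12:28", "2024-09-01 11:00"]),
    "txn_count_5m": [3, 7, 1],
})

availability_lag = pd.Timedelta("5min")

# Only join feature rows available at or before (prediction_ts - lag), so
# training never uses data the online serving path would not have had yet.
labels["join_ts"] = labels["prediction_ts"] - availability_lag
training_rows = pd.merge_asof(
    labels.sort_values("join_ts"),
    features.rename(columns={"feature_ts": "join_ts"}).sort_values("join_ts"),
    on="join_ts",
    by="user_id",
    direction="backward",
)
print(training_rows)
```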

Adam Becker [00:09:41]: Thank you, Aishwarya. We do have a couple of questions. I know we don't have too much time for questions, but if it's okay, I want to ask them. Okay, so Rohan here asks: how do you handle feature drift for features that do change over time?

Aishwarya Joshi [00:09:56]: Features that do change over time. So that is a whole presentation in and of itself, probably, around observability, and I'm sure some people are talking about it. Essentially, you need to build out monitoring capabilities for your models. You can see the distribution of the predictions that the model is generating, and that can give you insight into how some of your features are changing. Another way is tracking feature lineage: being able to see upstream sources for your features and seeing any changes in those that might impact the features your model actually relies on. That can indicate to you whether there's potentially feature drift down the line; again, upstream changes can impact that. And yeah, monitoring, with something like Datadog, is essentially what you would need to build out for your models, and also for your pipelines that do data ingestion, like in my diagram where I showed real time feature ingestion.

Aishwarya Joshi [00:10:51]: Yeah.

Adam Becker [00:10:53]: Yes. How do you package your ML models for serving on Bedrock? And how would you specify the output collections on Dynamo?

Aishwarya Joshi [00:11:00]: Serving on Bedrock? So we're not using Bedrock necessarily, but essentially we are using containers to serve models, and we're able to duplicate the logic we have for our workflows, for inference and for training, in our container environment. For real time or batch, we configure those beforehand, and we can generate model artifacts to a storage location and quickly read them in our actual inference environment from that storage location. Basically we read which version it is and make sure we have the correct artifact to execute predictions in production. Not sure if that answers the question, but yeah.

Adam Becker [00:11:57]: If not, they can find you in the chat. Thank you very much, Aishwarya.
