Data Engineering for AI/ML
Data can't be split into microservices! But teams should own their data! But there should be one definition for metrics! But teams can bring their own architectures! Data platform teams have a tough job: they need to find the right balance between creating reliable data services and decentralizing ownership -- and rarely do off-the-shelf architectures end up working as expected. In this talk, I'll discuss Whatnot's journey to providing a suite of data services -- including machine learning, business intelligence, and real-time analytics tools -- that power product features and business operations. Attendees will walk away with a practical framework for thinking about the maturation of these services, as well as patterns we've seen that make a big difference in increasing our adoption while reducing our maintenance load as we've grown.
Stephen Bailey · Sep 24th, 2024
Aishwarya Ramasethu · Sep 18th, 2024
Utilizing LLMs in high-impact scenarios (e.g., healthcare) remains difficult due to the necessity of including private or sensitive information in prompts. In many scenarios, AI/prompt engineers might want to include few-shot examples in prompts to improve LLM performance, but the relevant examples are sensitive and need to be kept private. Any leakage of PII or PHI into LLM outputs could result in compliance problems and liability. Differential Privacy (DP) can help mitigate these issues. The Machine Learning (ML) community has recognized the importance of DP in statistical inference, but its application to generative models, like LLMs, remains limited. This talk will introduce a practical pipeline for incorporating synthetic data into prompts, offering robust privacy guarantees. This approach is also computationally efficient compared to alternatives such as privacy-focused fine-tuning or end-to-end encryption. I will demonstrate the pipeline and examine the impact of differentially private prompts on the accuracy of LLM responses.
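To make the idea concrete, here is a minimal sketch of the prompt-assembly step, assuming a hypothetical `dp_synthesize()` stand-in for a differentially private synthetic data generator; the names and records are illustrative and not from the talk.

```python
# Minimal sketch of assembling a prompt from DP-synthetic few-shot examples.
# dp_synthesize() is a hypothetical stub: a real pipeline would sample from a
# generative model trained under an (epsilon, delta) privacy budget, so that no
# individual sensitive record can leak into the prompt.

from dataclasses import dataclass

@dataclass
class Example:
    note: str    # synthetic (privacy-preserving) clinical note
    label: str   # e.g., a triage category

def dp_synthesize(n_examples: int, epsilon: float) -> list[Example]:
    """Hypothetical stand-in for a DP synthetic data generator."""
    canned = [
        Example("Patient reports mild headache, no fever.", "low-acuity"),
        Example("Chest pain radiating to left arm, onset 20 min ago.", "emergent"),
    ]
    return canned[:n_examples]

def build_prompt(query: str, shots: list[Example]) -> str:
    """Format synthetic records as few-shot examples ahead of the real query."""
    blocks = [f"Note: {ex.note}\nTriage: {ex.label}" for ex in shots]
    blocks.append(f"Note: {query}\nTriage:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    shots = dp_synthesize(n_examples=2, epsilon=4.0)  # epsilon = privacy budget
    print(build_prompt("Sprained ankle after fall, swelling present.", shots))
```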
Miriah Peterson · Sep 18th, 2024
Software practitioners work to make their systems reliable; we hear teams boasting of four or five 9s of uptime. Data systems, however, depend on data that can be out of date or late: pipelines and automated jobs fail to run, and data sometimes arrives late, changing the outcomes of processing jobs. All of these situations are examples of Data Downtime, and they lead to misleading results and false reporting. As a Data Reliability Engineering (DRE) team, we borrowed tools and practices from SRE to build a better data system. In this talk, we will explore real-world reliability situations for our data systems and address three major topics to strengthen any pipeline. Data Downtime: what it is, how it affects your bottom line, and how to minimize it. Data Service Level Metrics: metadata for your data pipeline, and how reporting on pipeline transactions can lead to preventative data engineering practices. Data Monitoring: what to look out for, and how to distinguish system failures from data failures.
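As an illustration of the data service level metrics idea, a minimal sketch follows, assuming each pipeline run logs a small metadata record; the record shape and thresholds are illustrative, not the speaker's implementation.

```python
# Two simple data SLIs computed from pipeline run metadata: freshness (is the
# latest successful load within its staleness budget?) and run success rate
# (the pipeline analogue of request uptime).

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RunRecord:
    finished_at: datetime
    rows_loaded: int
    succeeded: bool

def freshness_sli(runs: list[RunRecord], now: datetime, max_staleness: timedelta) -> bool:
    """Data is 'up' if the most recent successful run is within the staleness budget."""
    successes = [r.finished_at for r in runs if r.succeeded]
    return bool(successes) and (now - max(successes)) <= max_staleness

def success_rate(runs: list[RunRecord]) -> float:
    """Share of runs that completed without failure."""
    return sum(r.succeeded for r in runs) / len(runs) if runs else 0.0

runs = [
    RunRecord(datetime(2024, 9, 18, 6, 0), 120_000, True),
    RunRecord(datetime(2024, 9, 18, 7, 0), 0, False),   # failed job -> data downtime risk
    RunRecord(datetime(2024, 9, 18, 8, 0), 118_500, True),
]
now = datetime(2024, 9, 18, 9, 30)
print("fresh:", freshness_sli(runs, now, timedelta(hours=2)))
print("success rate:", success_rate(runs))
```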
Nikhil Simha · Sep 18th, 2024
This talk will introduce the open source Chronon project, authored and maintained by Airbnb and Stripe. It will cover the technical problems that Chronon solves and how organizations can use it to accelerate their AI/ML efforts.
Yangqing Jia · Sep 18th, 2024
LLMs have become the de facto standard in the modern AI toolchain, but they still come with a lot of confusion -- quality, speed, cost, etc. In this talk, I will share a few observations we have about LLMs, from both an algorithm engineer's and an infra engineer's perspective, on how we can best utilize them in our daily operations. I'll also touch briefly on how enterprises think about their IT and AI strategy, given that fast-changing computation patterns are disrupting conventional cloud in unprecedented ways.
Ryan Wolf · Sep 18th, 2024
Scaling large language models is a well-discussed topic in the machine learning community. Providing LLMs with equally scaled, well-curated data is less discussed but incredibly important. We will examine how to curate high quality datasets, and how GPUs allow us to effectively scale datasets to trillions of tokens with NeMo Curator.
Mark Freeman · Sep 18th, 2024
When companies explore data quality initiatives, it’s common to wonder whether data contracts or observability is more critical. In this talk, we’ll clarify the unique roles each plays: data contracts focus on preventing known data quality issues, while data observability detects unknown issues across the entire data system. Drawing on real-world insights, we’ll show how these two approaches complement one another—think of observability as a flashlight illuminating the whole data landscape, while contracts act as a laser pointer, targeting specific areas. Attendees will learn why using both is essential for ensuring data reliability and efficiency.
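A rough sketch of the distinction, with illustrative names that are not from the talk: a contract check rejects a known bad shape at write time (the laser pointer), while an observability check watches live behavior for unexpected deviations (the flashlight).

```python
# Contrasting a data contract check (known expectation, enforced up front) with
# an observability check (unknown issue, detected from live metrics).

from statistics import mean, stdev

# --- Data contract: targeted check on an agreed schema ----------------------
ORDER_CONTRACT = {"order_id": int, "amount_usd": float, "user_id": int}

def enforce_contract(row: dict) -> None:
    """Reject a row up front if it violates the agreed schema."""
    for field, expected_type in ORDER_CONTRACT.items():
        if field not in row or not isinstance(row[field], expected_type):
            raise ValueError(f"contract violation on field '{field}': {row!r}")

# --- Observability: broad monitoring over the whole system ------------------
def volume_anomaly(daily_row_counts: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates strongly from the recent baseline."""
    mu, sigma = mean(daily_row_counts), stdev(daily_row_counts)
    return sigma > 0 and abs(today - mu) / sigma > z_threshold

enforce_contract({"order_id": 1, "amount_usd": 19.99, "user_id": 42})   # passes
print(volume_anomaly([10_200, 9_950, 10_080, 10_150], today=3_100))     # True: investigate
```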
Rebecca Taylor · Sep 18th, 2024
There is often a disconnect between what is taught about model serving and what is actually standard practice in industry. Your deployment design is often severely impacted by the unique data and platform setup of your company as well as financial constraints. Here I discuss some of these constraints as well as how to build designs that can fit within them.
Aishwarya Joshi · Sep 18th, 2024
At Chime, low-latency inference serving is critical for risk models identifying fraudulent transactions in near real time. However, to create these models, a large amount of time is spent on feature engineering -- creating and processing features to serve models at training and inference time is key to the data science user experience, but difficult to optimize, with challenges in scaling and data quality. How can we enable data scientists to deploy features for training, and ensure those features are replicated with parity for real-time model inference serving, while meeting the low latency requirements for fraud detection as the scale of transactions grows? The answers lie in the underlying infrastructure supporting feature storage and ingestion, as well as the frameworks we expose to data scientists to streamline their development workflow.
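One common way to achieve the training/serving parity the abstract describes is to define feature logic once and call it from both paths; the sketch below illustrates that pattern with a hypothetical transaction-count feature, not Chime's internal framework.

```python
# A single feature definition shared by the offline (training) and online
# (inference) paths, so the model sees the same transformation in both places.

from datetime import datetime, timedelta

def txn_count_last_24h(transactions: list[datetime], as_of: datetime) -> int:
    """Single source of truth for the feature logic, shared by both paths."""
    return sum(1 for t in transactions if as_of - timedelta(hours=24) <= t <= as_of)

# Offline: materialize the feature for a training snapshot at a fixed point in time.
history = [datetime(2024, 9, 17, 22, 0), datetime(2024, 9, 18, 3, 30), datetime(2024, 9, 10, 9, 0)]
training_row = {"user_id": 42, "txn_count_24h": txn_count_last_24h(history, datetime(2024, 9, 18, 8, 0))}

# Online: the same function runs against data fetched at inference time, keeping
# the low-latency path consistent with what the model saw during training.
online_features = {"txn_count_24h": txn_count_last_24h(history, datetime.utcnow())}
print(training_row, online_features)
```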
Shailvi Wakhlu · Sep 18th, 2024
Uncover the secrets to harnessing quality data for amplifying business success. This talk equips you with invaluable strategies and proven frameworks to navigate the data lifecycle confidently. Learn to spot and eradicate low-quality data, fortify decision-making, and build trust with data. With streamlined prevention strategies and hands-on diagnostics, optimize efficiency and elevate your company's data-driven initiatives.
Ciro Greco · Sep 18th, 2024
As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iteration, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this talk, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. To demonstrate the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and full pipeline reproducibility with a few CLI commands.
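For intuition only, here is a toy, in-memory illustration of the branch and time-travel semantics the abstract describes; it is not the Bauplan or Nessie API, which provide equivalent operations over real object-storage tables.

```python
# A toy catalog with Git-like semantics: commits create immutable snapshots,
# branches isolate experimental pipeline runs, and reads can time-travel by
# snapshot id -- the properties that make reruns reproducible.

import copy

class ToyCatalog:
    def __init__(self):
        self.branches = {"main": []}   # branch name -> list of snapshots (dicts of table -> rows)

    def commit(self, branch: str, tables: dict) -> int:
        self.branches[branch].append(copy.deepcopy(tables))
        return len(self.branches[branch]) - 1   # snapshot id, usable for time travel

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        self.branches[name] = copy.deepcopy(self.branches[from_branch])

    def read(self, branch: str, snapshot: int = -1) -> dict:
        return self.branches[branch][snapshot]  # default: latest snapshot

cat = ToyCatalog()
cat.commit("main", {"orders": [{"id": 1, "amount": 10}]})
cat.create_branch("dev")                                 # isolate an experimental pipeline run
cat.commit("dev", {"orders": [{"id": 1, "amount": 10}, {"id": 2, "amount": 7}]})
print(cat.read("main")["orders"])                        # main is untouched -> reproducible
print(cat.read("dev", snapshot=0) == cat.read("main"))   # time travel back to the branch point
```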