Data Engineering for AI/ML
Data can't be split into microservices! But teams should own their data! But there should be one definition for metrics! But teams can bring their own architectures! Data platform teams have a tough job: they need to find the right balance between creating reliable data services and decentralizing ownership -- and rarely do off-the-shelf architectures end up working as expected. In this talk, I'll discuss Whatnot's journey to providing a suite of data services -- including machine learning, business intelligence, and real-time analytics tools -- that power product features and business operations. Attendees will walk away with a practical framework for thinking about the maturation of these services, as well as patterns we've seen that make a big difference in increasing our adoption while reducing our maintenance load as we've grown.
Stephen Bailey · Sep 24th, 2024
Aishwarya Ramasethu · Sep 18th, 2024
Utilizing LLMs in high-impact scenarios (e.g., healthcare) remains difficult due to the necessity of including private or sensitive information in prompts. In many scenarios, AI/prompt engineers might want to include few-shot examples in prompts to improve LLM performance, but the relevant examples are sensitive and need to be kept private. Any leakage of PII or PHI into LLM outputs could result in compliance problems and liability. Differential Privacy (DP) can help mitigate these issues. The Machine Learning (ML) community has recognized the importance of DP in statistical inference, but its application to generative models, like LLMs, remains limited. This talk will introduce a practical pipeline for incorporating synthetic data into prompts, offering robust privacy guarantees. This approach is also computationally efficient compared to alternatives such as privacy-focused fine-tuning or end-to-end encryption. I will demonstrate the pipeline and examine the impact of differentially private prompts on the accuracy of LLM responses.
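To make the idea concrete, here is a minimal sketch of the prompt-assembly step, assuming a hypothetical `dp_synthesize()` stand-in for a differentially private synthetic data generator; the names and records are illustrative and not from the talk.

```python
# Minimal sketch of assembling a prompt from DP-synthetic few-shot examples.
# dp_synthesize() is a hypothetical stub: a real pipeline would sample from a
# generative model trained under an (epsilon, delta) privacy budget, so that no
# individual sensitive record can leak into the prompt.

from dataclasses import dataclass

@dataclass
class Example:
    note: str    # synthetic (privacy-preserving) clinical note
    label: str   # e.g., a triage category

def dp_synthesize(n_examples: int, epsilon: float) -> list[Example]:
    """Hypothetical stand-in for a DP synthetic data generator."""
    canned = [
        Example("Patient reports mild headache, no fever.", "low-acuity"),
        Example("Chest pain radiating to left arm, onset 20 min ago.", "emergent"),
    ]
    return canned[:n_examples]

def build_prompt(query: str, shots: list[Example]) -> str:
    """Format synthetic records as few-shot examples ahead of the real query."""
    blocks = [f"Note: {ex.note}\nTriage: {ex.label}" for ex in shots]
    blocks.append(f"Note: {query}\nTriage:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    shots = dp_synthesize(n_examples=2, epsilon=4.0)  # epsilon = privacy budget
    print(build_prompt("Sprained ankle after fall, swelling present.", shots))
```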
Miriah Peterson · Sep 18th, 2024
Software practitioners work to make their systems reliable; we hear teams boasting of four or five 9s of uptime. Data systems, however, depend on data that can be out of date or late: pipelines and automated jobs fail to run, and data sometimes arrives late, changing the outcomes of processing jobs. All of these situations are examples of Data Downtime, and they lead to misleading results and false reporting. As a Data Reliability Engineering (DRE) team, we borrowed tools and practices from SRE to build a better data system. In this talk, we will explore real-world reliability situations for our data systems and address three major topics to strengthen any pipeline. Data Downtime: what it is, how it affects your bottom line, and how to minimize it. Data Service Level Metrics: metadata for your data pipeline, and how reporting on pipeline transactions can lead to preventative data engineering practices. Data Monitoring: what to look out for, and how to distinguish system failures from data failures.
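As an illustration of the data service level metrics idea, a minimal sketch follows, assuming each pipeline run logs a small metadata record; the record shape and thresholds are illustrative, not the speaker's implementation.

```python
# Two simple data SLIs computed from pipeline run metadata: freshness (is the
# latest successful load within its staleness budget?) and run success rate
# (the pipeline analogue of request uptime).

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RunRecord:
    finished_at: datetime
    rows_loaded: int
    succeeded: bool

def freshness_sli(runs: list[RunRecord], now: datetime, max_staleness: timedelta) -> bool:
    """Data is 'up' if the most recent successful run is within the staleness budget."""
    successes = [r.finished_at for r in runs if r.succeeded]
    return bool(successes) and (now - max(successes)) <= max_staleness

def success_rate(runs: list[RunRecord]) -> float:
    """Share of runs that completed without failure."""
    return sum(r.succeeded for r in runs) / len(runs) if runs else 0.0

runs = [
    RunRecord(datetime(2024, 9, 18, 6, 0), 120_000, True),
    RunRecord(datetime(2024, 9, 18, 7, 0), 0, False),   # failed job -> data downtime risk
    RunRecord(datetime(2024, 9, 18, 8, 0), 118_500, True),
]
now = datetime(2024, 9, 18, 9, 30)
print("fresh:", freshness_sli(runs, now, timedelta(hours=2)))
print("success rate:", success_rate(runs))
```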
Nikhil Simha · Sep 18th, 2024
This talk will introduce the open source Chronon project, authored and maintained by Airbnb and Stripe. It will cover the technical problems that Chronon solves and how organizations can use it to accelerate their AI/ML efforts.
Yangqing Jia · Sep 18th, 2024
LLMs have become the de facto standard in the modern AI toolchain, but they still come with a lot of confusion -- quality, speed, cost, etc. In this talk, I will share a few observations we have about LLMs, from both an algorithm engineer's and an infra engineer's perspective, on how we can best utilize them in our daily operations. I'll also touch briefly on how enterprises think about their IT and AI strategy, given that fast-changing computation patterns are disrupting conventional cloud in unprecedented ways.
Ryan Wolf · Sep 18th, 2024
Scaling large language models is a well-discussed topic in the machine learning community. Providing LLMs with equally scaled, well-curated data is less discussed but incredibly important. We will examine how to curate high quality datasets, and how GPUs allow us to effectively scale datasets to trillions of tokens with NeMo Curator.
Mark Freeman · Sep 18th, 2024
When companies explore data quality initiatives, it’s common to wonder whether data contracts or observability is more critical. In this talk, we’ll clarify the unique roles each plays: data contracts focus on preventing known data quality issues, while data observability detects unknown issues across the entire data system. Drawing on real-world insights, we’ll show how these two approaches complement one another—think of observability as a flashlight illuminating the whole data landscape, while contracts act as a laser pointer, targeting specific areas. Attendees will learn why using both is essential for ensuring data reliability and efficiency.
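A rough sketch of the distinction, with illustrative names that are not from the talk: a contract check rejects a known bad shape at write time (the laser pointer), while an observability check watches live behavior for unexpected deviations (the flashlight).

```python
# Contrasting a data contract check (known expectation, enforced up front) with
# an observability check (unknown issue, detected from live metrics).

from statistics import mean, stdev

# --- Data contract: targeted check on an agreed schema ----------------------
ORDER_CONTRACT = {"order_id": int, "amount_usd": float, "user_id": int}

def enforce_contract(row: dict) -> None:
    """Reject a row up front if it violates the agreed schema."""
    for field, expected_type in ORDER_CONTRACT.items():
        if field not in row or not isinstance(row[field], expected_type):
            raise ValueError(f"contract violation on field '{field}': {row!r}")

# --- Observability: broad monitoring over the whole system ------------------
def volume_anomaly(daily_row_counts: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates strongly from the recent baseline."""
    mu, sigma = mean(daily_row_counts), stdev(daily_row_counts)
    return sigma > 0 and abs(today - mu) / sigma > z_threshold

enforce_contract({"order_id": 1, "amount_usd": 19.99, "user_id": 42})   # passes
print(volume_anomaly([10_200, 9_950, 10_080, 10_150], today=3_100))     # True: investigate
```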
Rebecca Taylor · Sep 18th, 2024
There is often a disconnect between what is taught about model serving and what is actually standard practice in industry. Your deployment design is often severely impacted by the unique data and platform setup of your company as well as financial constraints. Here I discuss some of these constraints as well as how to build designs that can fit within them.
Aishwarya Joshi · Sep 18th, 2024
At Chime, low-latency inference serving is critical for risk models identifying fraudulent transactions in near real time. However, to create these models, a large amount of time is spent on feature engineering -- creating and processing features to serve models at training and inference time is key to the data science user experience, but difficult to optimize, with challenges in scaling and data quality. How can we enable data scientists to deploy features for training, and ensure those features are replicated with parity for real-time model inference serving, while meeting the low latency requirements for fraud detection as the scale of transactions grows? The answers lie in the underlying infrastructure supporting feature storage and ingestion, as well as the frameworks we expose to data scientists to streamline their development workflow.
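One common way to achieve the training/serving parity the abstract describes is to define feature logic once and call it from both paths; the sketch below illustrates that pattern with a hypothetical transaction-count feature, not Chime's internal framework.

```python
# A single feature definition shared by the offline (training) and online
# (inference) paths, so the model sees the same transformation in both places.

from datetime import datetime, timedelta

def txn_count_last_24h(transactions: list[datetime], as_of: datetime) -> int:
    """Single source of truth for the feature logic, shared by both paths."""
    return sum(1 for t in transactions if as_of - timedelta(hours=24) <= t <= as_of)

# Offline: materialize the feature for a training snapshot at a fixed point in time.
history = [datetime(2024, 9, 17, 22, 0), datetime(2024, 9, 18, 3, 30), datetime(2024, 9, 10, 9, 0)]
training_row = {"user_id": 42, "txn_count_24h": txn_count_last_24h(history, datetime(2024, 9, 18, 8, 0))}

# Online: the same function runs against data fetched at inference time, keeping
# the low-latency path consistent with what the model saw during training.
online_features = {"txn_count_24h": txn_count_last_24h(history, datetime.utcnow())}
print(training_row, online_features)
```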
Shailvi Wakhlu · Sep 18th, 2024
Uncover the secrets to harnessing quality data for amplifying business success. This talk equips you with invaluable strategies and proven frameworks to navigate the data lifecycle confidently. Learn to spot and eradicate low-quality data, fortify decision-making, and build trust with data. With streamlined prevention strategies and hands-on diagnostics, optimize efficiency and elevate your company's data-driven initiatives.
Ciro Greco · Sep 18th, 2024
As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iteration, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this talk, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. To demonstrate the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and full pipeline reproducibility with a few CLI commands.
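For intuition only, here is a toy, in-memory illustration of the branch and time-travel semantics the abstract describes; it is not the Bauplan or Nessie API, which provide equivalent operations over real object-storage tables.

```python
# A toy catalog with Git-like semantics: commits create immutable snapshots,
# branches isolate experimental pipeline runs, and reads can time-travel by
# snapshot id -- the properties that make reruns reproducible.

import copy

class ToyCatalog:
    def __init__(self):
        self.branches = {"main": []}   # branch name -> list of snapshots (dicts of table -> rows)

    def commit(self, branch: str, tables: dict) -> int:
        self.branches[branch].append(copy.deepcopy(tables))
        return len(self.branches[branch]) - 1   # snapshot id, usable for time travel

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        self.branches[name] = copy.deepcopy(self.branches[from_branch])

    def read(self, branch: str, snapshot: int = -1) -> dict:
        return self.branches[branch][snapshot]  # default: latest snapshot

cat = ToyCatalog()
cat.commit("main", {"orders": [{"id": 1, "amount": 10}]})
cat.create_branch("dev")                                 # isolate an experimental pipeline run
cat.commit("dev", {"orders": [{"id": 1, "amount": 10}, {"id": 2, "amount": 7}]})
print(cat.read("main")["orders"])                        # main is untouched -> reproducible
print(cat.read("dev", snapshot=0) == cat.read("main"))   # time travel back to the branch point
```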