MLOps Community

Data Engineering for AI/ML

Video

GenAI in production with MLflow // Ben Wilson // DE4AI

We'll be covering the recent advancements in the supported integrations for GenAI application lifecycle management, from supported GenAI application tracking and evaluation to deployment and monitoring. Part of this talk will focus on the future of GenAI support in MLflow and what our vision is for supporting advanced agentic solutions.
Ben Wilson · Sep 17th, 2024
19:53
Video

Building Hyper-Personalized LLM Applications with Rich Contextual Data // Mike Del Balso // DE4AI

In the era of AI-driven applications, personalization is paramount. This talk explores the concept of Full RAG (Retrieval-Augmented Generation) and its potential to revolutionize user experiences across industries. We examine four levels of context personalization, from basic recommendations to highly tailored, real-time interactions. The presentation demonstrates how increasing levels of context - from batch data to streaming and real-time inputs - can dramatically improve AI model outputs. We discuss the challenges of implementing sophisticated context personalization, including data engineering complexities and the need for efficient, scalable solutions. Introducing the concept of a Context Platform, we showcase how tools like Tecton can simplify the process of building, deploying, and managing personalized context at scale. Through practical examples in travel recommendations, we illustrate how developers can easily create and integrate batch, streaming, and real-time context using simple Python code, enabling more engaging and valuable AI-powered experiences.
Michael Del Balso · Sep 17th, 2024
28:17
Video

Unified Data + AI Governance with Unity Catalog // Michelle Leon & Victoria Bukta // DE4AI

In today’s multi-vendor data and AI landscapes, organizations often find themselves struggling with fragmented governance. The proliferation of diverse tools and platforms leads to increased overhead, making it challenging to maintain a unified governance strategy across data and AI assets. This session will explore what a typical multi-vendor organization looks like and highlight the common challenges they face. We’ll delve into the complexities of the current governance space, focusing on the inefficiencies and risks that arise from tool sprawl. The talk will then introduce Unity Catalog’s mission to simplify and unify governance across diverse data formats and AI assets. Attendees will gain insights into how Unity Catalog’s multi-format, multi-asset approach enables seamless governance, empowering organizations to effectively manage their data and AI resources under a cohesive framework. Join us to discover how Unity Catalog can transform your organization’s governance strategy, reducing overhead and enhancing control over your data and AI assets.
Michelle Leon & Victoria Bukta · Sep 17th, 2024
24:56
Video

Building Data Infrastructure at Scale for AI/ML with Open Data Lakehouses // Vinoth Chandar // DE4AI

Data engineers love to solve interesting new problems. Sometimes an existing off-the-shelf tool will suffice; sometimes we have to get creative and come up with new ways to build with our existing toolkit. And, perhaps most rewarding, some use cases call for us to develop something completely new that takes on a life of its own - see Apache Spark, Apache Kafka, and the entire data lakehouse category for somewhat recent examples. AI and ML engineers find themselves at these crossroads all the time. In this keynote, we will explore how a data lakehouse architecture with Apache Hudi is being used to support real-world predictive ML and vector-based AI use cases across organizations such as NielsenIQ, Notion, and Uber. We’ll explore how a data lakehouse can be used to ingest data with minute-level freshness and provide a single source of truth for all of an organization’s structured and unstructured data. We’ll show how the lakehouse can be used for feature engineering, to generate accurate training datasets and generate production features. We’ll further explain the role of the lakehouse for GenAI use cases, allowing organizations to operate vector generation pipelines at scale and integrate with vector databases for real-time vector serving.
Vinoth Chandar · Sep 17th, 2024
29:41
Video

11 lessons learned from doing deployments // Sol Rashidi // DE4AI

With 200+ POCs built and nearly 40 products in production, Sol walks us through the journey of developing AI products at scale and the 11 lessons learned along the way. Spoiler alert: only 30% of the challenges are tech related; 70% are non-tech issues!
Sol Rashidi · Sep 17th, 2024
35:59
Video

How Feature Stores Work: Enabling Data Scientists to Write Petabyte-Scale Data Pipelines for AI/ML

The term "Feature Store" often conjures a simplistic idea of a storage place for features. However, in reality, feature stores are powerful frameworks and orchestrators for defining, managing, and deploying data pipelines at scale. This session is designed to demystify feature stores, outlining the three distinct types and their roles within a broader ML ecosystem. We’ll explore how feature stores empower data scientists to build and manage their own data pipelines, even at petabyte scale, while efficiently processing streaming data, and maintaining versioning and lineage. Join Simba Khadder, founder and CEO of Featureform, as he moves beyond concepts and marketing talk to deliver real-world, applicable examples. This session will demonstrate how feature stores can be leveraged to define, manage, and deploy scalable data pipelines for AI/ML, offering a practical blueprint for integrating feature stores into ML workflows. We’ll also dive into the internals of feature stores to reveal how they achieve scalability, ensuring participants leave with actionable insights. You’ll gain a solid grasp of feature stores, equipped to drive meaningful enhancements in your ML platforms and projects.
# Feature Store
# Petabyte-Scale
# Featureform
Simba Khadder · Sep 17th, 2024
30:33
Video

The Daft distributed Python data engine: multimodal data curation at any scale // Jay Chia // DE4AI

It's 2024, but data curation for ML/AI is still incredibly hard. Daft is an open-source Python data engine that changes that paradigm by focusing on the 3 fundamental needs of any ML/AI data platform:
1. ETL at terabyte+ scale: with steps that require complex model batch inference or algorithms that can only be expressed in custom Python.
2. Analytics: involving multimodal datatypes such as images and tensors, but with SQL as the language of choice.
3. Dataloading: performant streaming transport and processing of data from cloud storage into your GPUs for model training/inference.
In this talk, we explore how other tools fall short of delivering on these hard requirements for any ML/AI data platform. We then showcase a full example of using the Daft Dataframe and simple open file formats such as JSON/Parquet to build out a highly performant data platform, all in your own cloud and on your own data!
Jay Chia · Sep 17th, 2024
26:34
Video

DuckDB is fast for analytics, but what can it do for AI? // Mehdi Ouazza // DE4AI

The need for versatile and efficient search mechanisms has never been more critical. This talk will explore DuckDB's underrated search capabilities and usage within an LLM stack.
Mehdi Ouazza · Sep 17th, 2024
12:49
Video

From Notebook to Kubernetes: Scaling GenAI Pipelines with ZenML // Alex Strick van Linschoten // DE4AI

This lightning talk demonstrates how ZenML, an open-source MLOps framework, enables seamless transition from local development to cloud-scale deployment of generative AI pipelines. We'll showcase a workflow that begins in a Jupyter notebook, with data processing steps run locally, then scales up by offloading intensive training to Kubernetes. The presentation will highlight ZenML's Kubernetes integration and caching features, illustrating how they streamline the development-to-production pipeline for generative AI projects.
Alex Strick van Linschoten · Sep 17th, 2024
12:42
Video

LLMs in Financial Services: Personalized Portfolio Recommendation Engines // Akmal Chaudhri // DE4AI

This session will delve into how LLMs can be leveraged to create highly personalized and efficient portfolio recommendation engines. Using a Kafka stock ticker feed, tick data will be ingested into a database system where we'll query the data using natural language and build a simple chatbot using speech-to-text.
Akmal Chaudhri · Sep 17th, 2024
15:56
Video

Turn Data Chaos into AI Strategy with Programmatic AI Data Development // Elena Boiarskaia // DE4AI

Join Elena and learn how programmatic data development is transforming enterprise AI specialization from unconnected, manual data tasks into a streamlined, strategic development process. Learn how a programmatic approach to data development allows enterprises to efficiently manage, curate, and label data at scale, accelerating production AI and aligning models to unique business criteria, which is especially critical in sectors like banking and healthcare, where accuracy is non-negotiable. Discover how Snorkel's data development platform empowers AI teams to build and release faster with high-quality custom training data sets.
Elena Boiarskaia · Sep 17th, 2024
27:14
Video

Going Beyond Two Tier Data Architectures with DuckDB // Hannes Mühleisen // DE4AI

DuckDB is an in-process analytical data management system. DuckDB is lightweight yet fast and available under the permissive MIT license. DuckDB can be deployed everywhere, from a smart watch to a big iron server. This flexibility has led to a plethora of new and exciting data architectures, for example on-device processing, SQL lambdas, efficient large-scale pipelines, in-browser SQL, and more. In this talk, Hannes will give an overview of architectures observed in the wild and some ideas on what would be possible.
Hannes Mühleisen · Sep 18th, 2024
31:55
Video

Data Infrastructure Cost: Tips to Keep Our CFOs Happy // Jose Navarro // DE4AI

“Hey Platform team, any idea why the cloud bill is up by X% this term compared to our previous one?” For platform engineers working at organizations developing or using AI products, that X% can get quite high very quickly. In this talk, I will show some strategies that you can use to reduce the cost of your Data Infrastructure, share the responsibility across product teams, and control it over time.
Jose Navarro · Sep 18th, 2024
12:33
Video

Supercharging Your RAG System: Techniques and Challenges // Tengyu Ma // DE4AI

Retrieval-augmented generation is the predominant way to ingest proprietary unstructured data into generative AI systems. First, I will briefly state my view on the comparison between RAG and other competing paradigms such as finetuning and long-context LLMs. Then, I will briefly introduce embedding models and rerankers, two key components responsible for the retrieval quality. I will then discuss a list of techniques for improving the retrieval quality, such as query generation/decomposition and proper evaluation methods. Finally, I will discuss some current challenges in RAG and possible future directions.
Tengyu Ma · Sep 18th, 2024
40:20
Video

Putting the AI back in Medallion Lake Design // Simon Whiteley // DE4AI

In recent years, companies have seen an explosion in adopting lakehouses and reaping the rewards, but time and time again, we hear from people that they regret the layering of their lake. The zones don't quite fit what they were trying to achieve, and no one in the company understands what "silver" vs. "gold" actually means. Worst of all, it has become the domain of engineers & analysts alone. The original boom of data lakes was down to the AI revolution, so how do various AI personas fit into the mix? In this session, we'll recap a mature, production-grade lake design, then overlay our various AI activities on top. Whether you're an AI Engineer, citizen scientist or grizzled data science guru, you'll leave this session with a better understanding of how lakehouse design works for you.
Simon Whiteley · Sep 18th, 2024
13:28
Video

AI-Powered Data Unification for Data Platforms // Shelby Heinecke // DE4AI

A robust data platform is the first step to ensuring that downstream AI is grounded on accurate and relevant data. And while data platforms power AI, did you know that AI can also power data platforms? In this talk, we will discuss one of the most critical operations in data platforms, data unification, and discuss how we use small, efficient LLMs to power this step in Salesforce’s data platform.
Shelby Heinecke · Sep 18th, 2024
13:16
Video

Partnering with Product for Effective, quality Data Ingestion & Training Data // Daniela Santisteban

The product organization at a company can vary vastly, but getting the right PMs on your side can give you certainty on what data can be collected, influence the architecture to preemptively set up ML models for success, and prove out models' ROI to the business.
Daniela Santisteban · Sep 18th, 2024
18:43
Video

Data Scientists & Data Engineers: How the Best Teams Work // Panel // DE4AI

There are clear patterns that make the highest functioning data teams work so well together. In this panel we will explore what data scientists and data engineers need to know about each other's responsibilities to speak the same language and align incentives.
Beverly Wright, Joe Reis, Sadie St. Lawrence & 2 more speakers · Sep 18th, 2024
27:55
Video

The Only Constant is (Data) Change // Panel // DE4AI

If there is one thing that is true, it is that data is constantly changing. How can we keep up with these changes? How can we make sure that every stakeholder has visibility? How can we create a culture of understanding around data change management?
Benjamin Rogojan, Christophe Blefari, Chad Sanderson & 2 more speakers · Sep 18th, 2024
40:50
Video

Engineering Your AI Platform // Panel // DE4AI

To build a solid AI platform, it’s important to zero in on what really matters. This panel will dive into the key lessons from the evolution of data engineering and MLOps, including how the industry shifted from niche tools like feature stores to broader platforms. They'll discuss whether separate data and ML platforms are necessary or more effective when integrated, particularly for companies with smaller data teams. By taking a step back and looking at what’s actually worked in the world of MLOps and the recent buzz around LLMs, this panel will also dive into the merging roles of data engineering, analytics, MLOps, and whether the distinct ML engineer role is still relevant. Finally, they’ll share insights on designing an AI platform that’s practical, future-proof, and free from unnecessary complexity.
Tobias Macey, Daniel Svonava, Colleen Tartow & 1 more speaker · Sep 18th, 2024
30:09
Video

The Evolution of Lyft's Feature Store // Devon Mittow // DE4AI

A brief overview of how Lyft's ML feature store has changed and evolved alongside the business.
Devon Mittow · Sep 18th, 2024
13:13
Video

Real-Time Event Processing for AI/ML with Numaflow // Sri Harsha Yayi // DE4AI

At Intuit, our machine learning teams encountered significant hurdles in event processing and running inference on streaming data. Integrating with various messaging systems such as Kafka was both time-consuming and complex. Additionally, our teams needed capabilities for intermediate processing before executing inference as part of their workflows. The need to scale event processing and inference in response to fluctuating event volumes added another layer of complexity. To address these challenges, we developed Numaflow, an open-source, Kubernetes-native platform designed for scalable event processing. Numaflow streamlines integrating with event sources and enables teams to perform event processing and inference on streaming data without a steep learning curve. This talk is geared towards ML engineers, data engineers, application developers, and anyone interested in event processing or inference on streaming data. We will demonstrate how Numaflow overcomes these challenges and simplifies the process.
Sri Harsha Yayi · Sep 18th, 2024
22:38
Video

Implementing Data Capture for ML Observability and Drift Detection // Pushkar Garg // DE4AI

Modern ML Systems comprise complex data pipelines and multiple transformations happening in multiple layers of the system, such as the Data Warehouse, Offline Feature Store, and Online Feature Store. One important aspect of productionizing any ML Model is to implement ML Observability. The key component for enabling ML Observability is efficient data capture running on the prediction endpoints. In this talk, I will share my experience of implementing Data Capture by coding up an in-memory buffer and the lessons learnt while doing so. I will also touch on how downstream monitoring jobs consume these data capture logs to complete the loop on ML Observability.
Pushkar Garg · Sep 18th, 2024
23:49
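
To make the in-memory buffer idea from the abstract above concrete, here is a minimal sketch in Python. It is not the speaker's implementation: the `CaptureBuffer` class, its `sink` callback, and the `max_records` threshold are all hypothetical names chosen for illustration. It assumes the prediction endpoint calls `capture` once per request and that a batch of JSON lines is what downstream monitoring jobs would read.

```python
import json
import threading
from typing import Any, Callable, Dict, List


class CaptureBuffer:
    """Thread-safe in-memory buffer that hands prediction records to a sink in batches."""

    def __init__(self, sink: Callable[[List[str]], None], max_records: int = 100) -> None:
        self._sink = sink                  # hypothetical writer, e.g. one that appends to log storage
        self._max_records = max_records    # flush threshold (an assumption, tune per workload)
        self._records: List[str] = []
        self._lock = threading.Lock()

    def capture(self, features: Dict[str, Any], prediction: Any) -> None:
        """Record one request/response pair and flush once the buffer is full."""
        line = json.dumps({"features": features, "prediction": prediction})
        with self._lock:
            self._records.append(line)
            batch = self._drain() if len(self._records) >= self._max_records else None
        if batch:
            self._sink(batch)              # write outside the lock so other requests are not blocked

    def flush(self) -> None:
        """Force out whatever is buffered, e.g. on shutdown or from a timer."""
        with self._lock:
            batch = self._drain()
        if batch:
            self._sink(batch)

    def _drain(self) -> List[str]:
        # Caller must hold the lock; swaps in a fresh list and returns the old one.
        batch, self._records = self._records, []
        return batch


# Usage: a plain list stands in for the real sink.
written: List[List[str]] = []
buffer = CaptureBuffer(sink=written.append, max_records=2)
buffer.capture({"amount": 12.5}, 0)
buffer.capture({"amount": 980.0}, 1)   # second record reaches the threshold and triggers a flush
buffer.capture({"amount": 3.0}, 0)
buffer.flush()                          # pushes out the one remaining record
```

Batching keeps per-request overhead to a list append, at the cost of losing whatever is still buffered if the process dies before a flush; a real deployment would have to decide how much of that risk is acceptable.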
Video

Real-Time Data Streaming Architectures for Generative AI // Emily Ekdahl // DE4AI

Bridging the Gap Between Batch Processing and the Lakehouse for Next-Gen Customer Experience. As Generative AI (GenAI) and large language models (LLMs) evolve at an unprecedented pace, traditional machine learning architectures that rely on batch processing and static data can no longer keep up with the amount of data they need to process. To beat competitors, numerous organizations are implementing real-time data streaming solutions, leveraging technologies like Apache Kafka and Apache Flink. These tools work together to ingest and process data in real time, which, when combined with a vector database, can significantly boost the performance and reliability of GenAI applications. In this talk, we’ll dive into the benefits of the "shift-left" paradigm, which is all about moving from the old-school batch and lakehouse models to real-time data products. This shift allows companies to create GenAI applications that are more responsive and context-aware. By integrating streaming data with real-time model inference and using the Retrieval Augmented Generation (RAG) method, companies can cut down on latency and ensure their LLMs deliver up-to-date responses. We’ll cover key architectural patterns, potential challenges, and best practices for making this transition, all while sharing real-world examples of how integrating Kafka and Flink with vector databases can lead to next-level NLP applications.
Emily Ekdahl · Sep 18th, 2024
12:19
Video

Reproducible data science over data lakes // Ciro Greco // DE4AI

As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.
Ciro Greco · Sep 18th, 2024
11:59
Video

Data Quality: Preventing, Diagnosing & Curing Bad Data // Shailvi Wakhlu // DE4AI

Uncover the secrets to harnessing quality data for amplifying business success. This talk equips you with invaluable strategies and proven frameworks to navigate the data lifecycle confidently. Learn to spot and eradicate low-quality data, fortify decision-making, and build trust with data. With streamlined prevention strategies and hands-on diagnostics, optimize efficiency and elevate your company's data-driven initiatives.
Shailvi Wakhlu · Sep 18th, 2024
28:07
Video

An Overview of Common ML Serving Architectures // Rebecca Taylor // DE4AI

There is often a disconnect between what is taught about model serving and what is actually standard practice in industry. Your deployment design is often severely impacted by the unique data and platform setup of your company as well as financial constraints. Here I discuss some of these constraints as well as how to build designs that can fit within them.
Rebecca Taylor · Sep 18th, 2024
17:33
Video

Data Engineering for Streamlining the Data Science Developer Experience // Aishwarya Joshi // DE4AI

At Chime, low-latency inference serving is critical for risk models identifying fraudulent transactions in near real time. However, to create these models, a large amount of time is spent on feature engineering. Creating and processing features to serve models at training and inference time is key to the DS user experience, but difficult to optimize given the challenges in scaling and data quality. How can we enable data scientists to deploy features for training and ensure that these features are replicated with parity for real-time model inference serving, while meeting the lower latency requirements for fraud detection as the scale of transactions being processed grows? The answers are in the underlying infrastructure supporting feature storage and ingestion as well as the frameworks we expose to data scientists to streamline their development workflow.
Aishwarya Joshi · Sep 18th, 2024
12:09
Video

Do We Really Need Data Contracts and Observability? (Hint: Yes) // Mark Freeman // DE4AI

When companies explore data quality initiatives, it’s common to wonder whether data contracts or observability is more critical. In this talk, we’ll clarify the unique roles each plays: data contracts focus on preventing known data quality issues, while data observability detects unknown issues across the entire data system. Drawing on real-world insights, we’ll show how these two approaches complement one another—think of observability as a flashlight illuminating the whole data landscape, while contracts act as a laser pointer, targeting specific areas. Attendees will learn why using both is essential for ensuring data reliability and efficiency.
Mark Freeman · Sep 18th, 2024
13:40
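
The split described in the abstract above, contracts for known issues and observability for unknown ones, can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than anything from the talk: the `CONTRACT` schema, the field names, the null-rate metric, and the `tolerance` threshold are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical contract: field name -> required Python type.
CONTRACT: Dict[str, type] = {"user_id": int, "amount": float, "currency": str}


def contract_violations(record: Dict[str, Any]) -> List[str]:
    """Contract check: catch known, agreed-upon problems in a single record."""
    problems = []
    for field, expected in CONTRACT.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


def null_rate(records: List[Dict[str, Any]], field: str) -> float:
    """Observability metric: share of records where a field is absent or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


def drifted(baseline: float, current: float, tolerance: float = 0.1) -> bool:
    """Flag a metric that moved more than `tolerance` away from its baseline."""
    return abs(current - baseline) > tolerance


good = {"user_id": 1, "amount": 9.99, "currency": "USD"}
bad = {"user_id": "1", "amount": 9.99}
batch = [{"country": "DE"}, {"country": None}, {"country": None}, {"country": "US"}]

print(contract_violations(good))
print(contract_violations(bad))
print(drifted(baseline=0.05, current=null_rate(batch, "country")))
```

The contract check targets fields someone explicitly agreed on, while the null-rate comparison can surface a change in a field (`country` here) that no contract mentions, which is roughly the laser pointer versus flashlight distinction the abstract draws.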
Video

GPU Accelerated Data Curation for LLMs // Ryan Wolf // DE4AI

Scaling large language models is a well-discussed topic in the machine learning community. Providing LLMs with equally scaled, well-curated data is less discussed but incredibly important. We will examine how to curate high quality datasets, and how GPUs allow us to effectively scale datasets to trillions of tokens with NeMo Curator.
Ryan Wolf · Sep 18th, 2024
30:02