MLOps Community
LIVESTREAM
Data Engineering for AI/ML

Your AI Means Nothing Without the Data.

Everyone in data engineering is fighting a hard battle you know nothing about.

This is a conference at the intersection of Data and AI. It will be fun and educational. Don't believe me? Check out what past guests have said.

Oh yeah, and the speakers are top of their game.



Speakers
Sadie St. Lawrence
Founder / AI Instructor @ Human Machine Collaboration Institute / LinkedIn Learning
Hannes Mühleisen
Co-Founder & CEO @ DuckDB Labs
Sol Rashidi
CEO and Founder @ ExecutiveAI
Tengyu Ma
Co-Founder and CEO @ Voyage AI
Shelby Heinecke
Senior AI Research Manager @ Salesforce
Joe Reis
CEO/Co-Founder @ Ternary Data
Ryan Wolf
Deep Learning Algorithm Engineer @ NVIDIA
Yangqing Jia
Founder @ Lepton AI
Chad Sanderson
CEO & Co-Founder @ Gable
Vinoth Chandar
Founder/CEO @ Onehouse
Miriah Peterson
Data Engineer @ Soypete tech
Benjamin Rogojan
Data Science And Engineering Consultant @ Seattle Data Guy
Michael Del Balso
CEO & Co-founder @ Tecton
Pushkar Garg
Staff Machine Learning Engineer @ Clari Inc.
Simon Whiteley
CTO & Co-Owner @ Advancing Analytics
Shailvi Wakhlu
Founder @ Shailvi Ventures LLC
Aishwarya Joshi
Machine Learning Engineer @ Chime
Aishwarya Ramasethu
AI Engineer @ Prediction Guard
Ciro Greco
Founder and CEO @ Bauplan
Sridhar Natarajan
Senior Software Engineer @ Intuit
Daniela Santisteban
Product Manager - Data Taxonomies @ Numerator
Nikhil Simha
CTO @ Zipline AI
Sri Harsha Yayi
Product Manager @ Intuit
Tobias Macey
Associate Director of Platform and DevOps Engineering @ Massachusetts Institute of Technology (MIT)
Alex Strick van Linschoten
ML Engineer @ ZenML
Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community
Beverly Wright
VP - Data Science & AI / CAIO @ Wavicle Data Solutions
Rebecca Taylor
Tech lead: Personalization @ Lidl e-commerce
Emily Ekdahl
AI Ops Engineer @ Gusto
Jacopo Himberg
Director, Data @ Wolt
Devon Mittow
Staff Software Engineer @ Lyft
Jose Navarro
MLOps Engineer @ Cleo
Stephen Bailey
Data Engineer @ Whatnot
Mark Freeman
Tech Lead, GTM Engineering @ Gable
Victor Cuadros
Senior Software/Data Engineer @ Microsoft
Akmal Chaudhri
Technical Evangelist @ SingleStore
Jesse Anderson
Managing Director @ Big Data Institute
Mehdi Ouazza
Data Eng & Devrel @ MotherDuck
Michelle Leon
Staff Product Manager @ Databricks
Victoria Bukta
Member of Technical Staff @ Databricks
Christophe Blefari
CTO & Co-founder @ NAO
Daniel Svonava
CEO & Co-founder @ Superlinked
Maggie Hays
Founding Community Product Manager, DataHub @ Acryl Data
Korri Jones
Senior Lead Machine Learning Engineer @ Chick-fil-A, Inc.
Nehil Jain
MLE Consultant @ TBA
Sonam Gupta
Sr. Developer Relations @ aiXplain
Valdimar Eggertsson
AI Developer @ Snjallgögn (Smart Data inc.)
Simba Khadder
Founder & CEO @ Featureform
Jay Chia
Cofounder @ Eventual
Colleen Tartow
Field CTO @ VAST Data
Elena Boiarskaia
Head of Applied Machine Learning @ Snorkel AI
Ben Wilson
Software Engineer, ML @ Databricks
Agenda
Track 1
Track 2
Track 3
Track 4
1:00 PM, GMT
-
1:20 PM, GMT
Opening / Closing
Welcome - Data Engineering for AI/ML
Demetrios Brinkmann
1:20 PM, GMT
-
1:45 PM, GMT
Keynote
11 lessons learned from doing deployments

With more than 200 POCs built and nearly 40 products in production, Sol walks us through the journey of developing AI products at scale and the 11 lessons learned along the way. Spoiler alert: only 30% of the challenges are tech related; 70% are non-tech issues!

Sol Rashidi
1:50 PM, GMT
-
2:15 PM, GMT
Presentation
Data Teams survey 2024

Diving into the results of the 2024 Data Teams survey.

Jesse Anderson
2:20 PM, GMT
-
2:45 PM, GMT
Keynote
Building Data Infrastructure at Scale for AI/ML with Open Data Lakehouses

Data engineers love to solve interesting new problems. Sometimes an existing off-the-shelf tool will suffice; sometimes we have to get creative and come up with new ways to build with our existing toolkit. And, perhaps most rewarding, some use cases call for us to develop something completely new that takes on a life of its own - see Apache Spark, Apache Kafka, and the entire data lakehouse category for somewhat recent examples.

AI and ML engineers find themselves at these crossroads all the time. In this keynote, we will explore how a data lakehouse architecture with Apache Hudi is being used to support real-world predictive ML and vector-based AI use cases across organizations such as NielsenIQ, Notion, and Uber.

We’ll explore how a data lakehouse can be used to ingest data with minute-level freshness and provide a single source of truth for all of an organization’s structured and unstructured data. We’ll show how the lakehouse can be used for feature engineering, to generate accurate training datasets and generate production features. We’ll further explain the role of the lakehouse for GenAI use cases, allowing organizations to operate vector generation pipelines at scale and integrate with vector databases for real-time vector serving.

Vinoth Chandar
2:50 PM, GMT
-
3:15 PM, GMT
Presentation
Going Beyond Two Tier Data Architectures with DuckDB

DuckDB is an in-process analytical data management system. DuckDB is lightweight yet fast and available under the permissive MIT license. DuckDB can be deployed everywhere, from a smart watch to a big iron server. This flexibility has led to a plethora of new and exciting data architectures, for example on-device processing, SQL lambdas, efficient large-scale pipelines, in-browser SQL, and more.

In this talk, Hannes will give an overview of architectures observed in the wild and some ideas on what would be possible.

Hannes Mühleisen
3:15 PM, GMT
-
3:35 PM, GMT
Break
The Booth Crawl
3:35 PM, GMT
-
3:45 PM, GMT
Lightning Talk
Data Infrastructure Cost: Tips to Keep Our CFOs Happy

“Hey Platform team, any idea why the cloud bill is up by X% this term compared to our previous one?”

As platform engineers working at organisations developing or using AI products, we can see that X% get quite high very quickly.

In this talk, I will show some strategies that you can use to reduce the cost of your data infrastructure, share responsibility for it across product teams, and keep it under control over time.

Jose Navarro
3:50 PM, GMT
-
4:15 PM, GMT
Presentation
Building Hyper-Personalized LLM Applications with Rich Contextual Data

In the era of AI-driven applications, personalization is paramount. This talk explores the concept of Full RAG (Retrieval-Augmented Generation) and its potential to revolutionize user experiences across industries. We examine four levels of context personalization, from basic recommendations to highly tailored, real-time interactions. The presentation demonstrates how increasing levels of context - from batch data to streaming and real-time inputs - can dramatically improve AI model outputs. We discuss the challenges of implementing sophisticated context personalization, including data engineering complexities and the need for efficient, scalable solutions. Introducing the concept of a Context Platform, we showcase how tools like Tecton can simplify the process of building, deploying, and managing personalized context at scale. Through practical examples in travel recommendations, we illustrate how developers can easily create and integrate batch, streaming, and real-time context using simple Python code, enabling more engaging and valuable AI-powered experiences.

Michael Del Balso
4:20 PM, GMT
-
4:45 PM, GMT
Presentation
The Daft distributed Python data engine: multimodal data curation at any scale

It's 2024, but data curation for ML/AI is still incredibly hard. Daft is an open-source Python data engine that changes that paradigm by focusing on the 3 fundamental needs for any ML/AI data platform:

  1. ETL at terabyte+ scale: with steps that require complex model batch inference or algorithms that can only be expressed in custom Python.

  2. Analytics: involving multimodal datatypes such as images and tensors, but with SQL as the language of choice.

  3. Dataloading: performant streaming transport and processing of data from cloud storage into your GPUs for model training/inference.

In this talk, we explore how other tools fall short of delivering on these hard requirements for any ML/AI data platform. We then showcase a full example of using the Daft Dataframe and simple open file formats such as JSON/Parquet to build out a highly performant data platform - all in your own cloud and on your own data!

Jay Chia
4:50 PM, GMT
-
5:15 PM, GMT
Presentation
How Feature Stores Work: Enabling Data Scientists to Write Petabyte-Scale Data Pipelines for AI/ML

The term "Feature Store" often conjures a simplistic idea of a storage place for features. However, in reality, feature stores are powerful frameworks and orchestrators for defining, managing, and deploying data pipelines at scale. This session is designed to demystify feature stores, outlining the three distinct types and their roles within a broader ML ecosystem. We’ll explore how feature stores empower data scientists to build and manage their own data pipelines, even at petabyte scale, while efficiently processing streaming data, and maintaining versioning and lineage.

Join Simba Khadder, founder and CEO of Featureform, as he moves beyond concepts and marketing talk to deliver real-world, applicable examples. This session will demonstrate how feature stores can be leveraged to define, manage, and deploy scalable data pipelines for AI/ML, offering a practical blueprint for integrating feature stores into ML workflows.

We’ll also dive into the internals of feature stores to reveal how they achieve scalability, ensuring participants leave with actionable insights. You’ll gain a solid grasp of feature stores, equipped to drive meaningful enhancements in your ML platforms and projects.

Simba Khadder
5:20 PM, GMT
-
6:00 PM, GMT
Roundtable
Guest Roundtable Discussion
6:05 PM, GMT
-
6:15 PM, GMT
Lightning Talk
DuckDB is fast for analytics, but what can it do for AI?

The need for versatile and efficient search mechanisms has never been more critical. This talk will explore DuckDB's underrated search capabilities and usage within an LLM stack.

Mehdi Ouazza
6:20 PM, GMT
-
6:30 PM, GMT
Lightning Talk
From Notebook to Kubernetes: Scaling GenAI Pipelines with ZenML

This lightning talk demonstrates how ZenML, an open-source MLOps framework, enables seamless transition from local development to cloud-scale deployment of generative AI pipelines. We'll showcase a workflow that begins in a Jupyter notebook, with data processing steps run locally, then scales up by offloading intensive training to Kubernetes. The presentation will highlight ZenML's Kubernetes integration and caching features, illustrating how they streamline the development-to-production pipeline for generative AI projects.

Alex Strick van Linschoten
6:35 PM, GMT
-
6:45 PM, GMT
Lightning Talk
LLMs in Financial Services: Personalized Portfolio Recommendation Engines

This session will delve into how LLMs can be leveraged to create highly personalized and efficient portfolio recommendation engines. Using a Kafka stock ticker feed, tick data will be ingested into a database system where we'll query the data using natural language and build a simple chatbot using speech-to-text.

Akmal Chaudhri
6:50 PM, GMT
-
7:15 PM, GMT
Presentation
Unified Data + AI Governance with Unity Catalog

In today’s multi-vendor data and AI landscapes, organizations often find themselves struggling with fragmented governance. The proliferation of diverse tools and platforms leads to increased overhead, making it challenging to maintain a unified governance strategy across data and AI assets. This session will explore what a typical multi-vendor organization looks like and highlight the common challenges they face.

We’ll delve into the complexities of the current governance space, focusing on the inefficiencies and risks that arise from tool sprawl. The talk will then introduce Unity Catalog’s mission to simplify and unify governance across diverse data formats and AI assets. Attendees will gain insights into how Unity Catalog’s multi-format, multi-asset approach enables seamless governance, empowering organizations to effectively manage their data and AI resources under a cohesive framework.

Join us to discover how Unity Catalog can transform your organization’s governance strategy, reducing overhead and enhancing control over your data and AI assets.

Michelle Leon
Victoria Bukta
7:15 PM, GMT
-
7:35 PM, GMT
Break
Musical Entertainment By Yours Truly
Demetrios Brinkmann
7:35 PM, GMT
-
8:05 PM, GMT
Panel Discussion
Data Scientists & Data Engineers: How the Best Teams Work

There are clear patterns that make the highest-functioning data teams work so well together. In this panel, we will explore what data scientists and data engineers need to know about each other's responsibilities to speak the same language and align incentives.

Beverly Wright
Sadie St. Lawrence
Joe Reis
Victor Cuadros
8:05 PM, GMT
-
8:30 PM, GMT
Keynote
Supercharging Your RAG System: Techniques and Challenges

Retrieval-augmented generation is the predominant way to ingest proprietary unstructured data into generative AI systems. First, I will briefly state my view on the comparison between RAG and other competing paradigms such as finetuning and long-context LLMs. Then, I will briefly introduce embedding models and rerankers, two key components responsible for the retrieval quality. I will then discuss a list of techniques for improving the retrieval quality, such as query generation/decomposition and proper evaluation methods. Finally, I will discuss some current challenges in RAG and possible future directions.

Tengyu Ma
Sponsors
Diamond
Gold
Silver
Community
Event has finished
September 12, 1:00 PM, GMT
Online
Organized by
MLOps Community