MLOps Community

Scalable Python for Everyone, Everywhere, Conversation with the Creators of Dask

Posted Oct 14
# Presentation
# Coding Workshop
SPEAKER
Matthew Rocklin
CEO @ Coiled, Maintainer of Dask

Matthew is an open source software developer in the numeric Python ecosystem. He maintains several PyData libraries, but today focuses mostly on Dask, a library for scalable computing. Matthew worked for Anaconda Inc for several years, then built out the Dask team at NVIDIA for RAPIDS, and most recently founded Coiled Computing to improve Python's scalability with Dask for large organizations.

Matthew has given talks at a variety of technical, academic, and industry conferences. A list of talks and keynotes is available at https://matthewrocklin.com/talks.

Matthew holds a bachelor’s degree from UC Berkeley in physics and mathematics, and a PhD in computer science from the University of Chicago.

SUMMARY

Parallel Computing with Dask and Coiled

Python makes data science and machine learning accessible to millions of people around the world. Historically, however, Python hasn't handled parallel computing well, which causes problems as researchers tackle increasingly large datasets.

Dask is an open source Python library that extends the existing Python data science stack (NumPy, pandas, scikit-learn, Jupyter, ...) with parallel and distributed computing. Today Dask has been broadly adopted by most major Python libraries, and is maintained by a robust open source community across the world.

This talk will discuss parallel computing generally and Dask's approach to parallelizing an existing ecosystem of software. Finally, we'll address the challenges of robustly deploying distributed systems, which ends up being one of the main accessibility challenges for users today. We hope that by the end of the meetup attendees will better understand parallel computing, have built intuition around how Dask works, and have the opportunity to play with their own Dask cluster on the cloud.

Check out our posts here to get more context around where we're coming from:
https://medium.com/coiled-hq/coiled-dask-for-everyone-everywhere-376f5de0eff4
https://medium.com/coiled-hq/the-unbearable-challenges-of-data-science-at-scale-83d294fa67f8

Dask

What is it?
  • Parallelism for analytics

What is parallelism?
  • Doing a lot at once by splitting tasks into smaller subtasks that can be processed in parallel (at the same time)
  • Distributing work across multiple machines and then combining the results
  • Helpful for CPU-bound work, such as doing a bunch of calculations on the CPU. The rate at which the process progresses is limited by the speed of the CPU

Concurrency?
  • Similar, but things don't have to happen at the same time. They can happen asynchronously, and they can overlap
  • Shared state
  • Helpful for I/O-bound work, such as networking or reading from disk. The rate at which the process progresses is limited by the speed of the I/O subsystem

Multi-core vs distributed
  • Multi-core is a single processor with two or more cores that can cooperate through threads (multithreading)
  • Distributed is work spread across multiple nodes communicating via HTTP or RPC

Why is this hard?
  • Python has its challenges due to the GIL (global interpreter lock), which lets only one thread run Python bytecode at a time. Many other languages don't have this problem
  • Shared state can lead to race conditions, deadlocks, etc.
  • Coordinating work across the machines

For analytics?
  • Calculating statistics on a large dataset can be tricky if it can't fit in memory
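The split-then-combine idea in the notes above can be sketched with nothing but the Python standard library. This is not Dask code and not taken from the talk. It is a minimal illustration, and the names `chunked`, `partial_stats`, and `parallel_mean` are hypothetical. It computes a mean by reducing each chunk to a (sum, count) pair and then combining the pairs. It uses a thread pool for simplicity, so under the GIL the pure-Python arithmetic will not actually speed up. A process pool or a library such as Dask would be the usual choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List, Tuple


def chunked(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` items from `values`."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def partial_stats(chunk: List[float]) -> Tuple[float, int]:
    """Reduce one chunk to the small summary needed for a mean."""
    return sum(chunk), len(chunk)


def parallel_mean(values: Iterable[float], chunk_size: int = 1000,
                  workers: int = 4) -> float:
    """Split the data into chunks, summarize each chunk, combine the results.

    Note: Executor.map submits every chunk up front, so this sketch shows
    the split/combine pattern only. It does not bound memory use.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunked(values, chunk_size)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("cannot take the mean of an empty dataset")
    return total / count


if __name__ == "__main__":
    print(parallel_mean(range(1_000_000)))
```

The per-chunk summary is tiny compared with the chunk itself, which is why this pattern suits statistics on data that is too large to hold in memory at once: only the summaries need to be gathered in one place.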
