What Does James Lamb – A Machine Learning Engineer Do?
September 15, 2022
💡 This is a project for the MLOps Community to fully understand what different people do at their jobs. We want to find out what your day-to-day looks like.
From the most granular to the most mundane, please tell us everything! This is our chance to bring clarity around the different parts of MLOps ranging from big companies to small start-ups.
Today we shine the spotlight on Mr. James Lamb, Machine Learning Extraordinaire and LightGBM maintainer. You can find the first one of the series with Frata here.
Machine Learning Engineers At Work
Name: James Lamb (https://github.com/jameslamb)
Official Title: Staff Machine Learning Engineer
Company: SpotHero
Years in the game: 17 years since my first paycheck, 10 years working on computers with data, 8 years full-time
Years specifically working on ML: 6
Direct reports: 0
What was your path into ML?
Luck, privilege, and geography contributed to me learning about timeseries data, which set me up to get a job on an amazing data science team at an IIoT startup, which gave me the core skills and self-confidence to pursue a career in data science and software engineering.
The Whole Story
Ok it was a long journey….stay with me here.
When I went to college, I majored in marketing in a business school. I wanted to work in marketing for a record label. At the school I was at, you could earn a double major by just taking one additional class. I was very fortunate to go to a high school that offered Advanced Placement (AP) courses, so I had 3 Economics credits already. Out of pure laziness, I decided to use those 3 credits to pick up a second major in economics.
Going into senior year of college, I thought “I like school, I’m gonna do grad school. I’m a busy businessperson (or at least want to be), so guess I’ll do an MBA”. One of my advisors told me “don’t do an MBA, it’s not really for you…you don’t have enough professional experience”. She then told me that I was luckily at one of a small handful of schools that offered a terminal masters degree in Economics. So I did that!
In that Master’s in Applied Economics, I learned enough math and statistics to be dangerous, and got experience applying the theory to real-world datasets. My thesis was a time-series forecasting project, applying some flavors of the ARCH/GARCH models popular in finance to grocery store sales. This was a super repetitive process… fit, predict, store predictions, add another week of data, re-fit, predict, store predictions… etc. I suspected that I could go a lot faster and use a lot more data if I could just write code, but I was afraid to devote the time to learning. I thought “at least with this button-clicking and Excel formulas, I know I’ll finish in time”.
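For readers who have not seen it, the fit, predict, store, add-a-week, re-fit loop is often called expanding-window (or rolling-origin) forecasting. Here is a minimal Python sketch of that loop. It is an illustration only: the weekly_sales numbers are invented, and the stand-in "model" (the mean of the last four weeks) is an assumption chosen for brevity, not the ARCH/GARCH models from the thesis.

```python
from statistics import mean

# Hypothetical weekly sales figures, invented for illustration only.
weekly_sales = [120.0, 130.0, 125.0, 140.0, 150.0, 145.0, 160.0, 155.0]


def fit(history, window=4):
    """'Fit' a stand-in model: the mean of the most recent `window` observations."""
    return mean(history[-window:])


def predict(model):
    """One-step-ahead forecast; for this stand-in model it is just the fitted mean."""
    return model


def expanding_window_forecasts(series, min_train=4):
    """Fit, predict, store the prediction, add one more week of data, and repeat."""
    stored = []
    for end in range(min_train, len(series)):
        history = series[:end]     # all data available before week `end`
        model = fit(history)       # re-fit on the grown history
        forecast = predict(model)  # predict the next (not yet seen) week
        stored.append({"week": end, "forecast": forecast, "actual": series[end]})
    return stored


forecasts = expanding_window_forecasts(weekly_sales)
for row in forecasts:
    print(row)
```

Swapping a real model into fit and predict leaves the loop itself unchanged, which is what makes scripting the process faster than repeating it by hand.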
As soon as I finished that degree, I started taking online courses to learn how to code. I took every language tutorial on Codecademy (link) and a handful of R courses on edX. I think the thing that permanently put me on a different career path was the Data Science Specialization from Johns Hopkins University, on Coursera (link).
I did those online courses on nights and weekends for about a year. I started them with a narrow interest in learning how to code in R, because I wanted to do more ambitious economics work like the “Atlas of Economic Complexity” (link), “Santa Fe Artificial Stock Exchange” (link), and Raj Chetty’s various projects on economic mobility (link). But besides R, those courses introduced me to core skills that I’d need to do professional applied statistics (“data science”). In those courses, I created my first git repo, wrote my first function, built my first interactive dashboard, and trained my first tree-based model.
By 2016, I had a few years of professional experience doing data work in economics jobs, and knew enough machine learning and R to have a shot at entry-level data science jobs. Luckily for me, a startup in my hometown of Chicago got a big round of funding and started hiring data scientists with experience in time-series data (see this profile in Inc.).
I started at Uptake in July 2016 and over my 3+ years there, I worked with literally dozens of world-class engineers and data scientists. I learned SO MUCH in my time there. That job gave me my first experiences putting machine learning models into production, writing code and services intended to be used by other people, and communicating data science concepts to non-data-scientist audiences.
During that time, I also published my first open source project, uptasticsearch (co-authored with Austin Dickey and Nick Paras), was invited to join as a maintainer on a high-profile machine learning project (LightGBM), and earned a Master’s Degree in Data Science. I left that job with the raw skills to do machine learning engineering work, and a focus on getting better at it.
Since then, I’ve worked as a data science consultant, backend engineer, and machine learning engineer. I’ve worked in the world’s largest companies and at a startup with fewer than 15 employees. Throughout that time, I’ve loved contributing to open source projects, speaking about data science at conferences/meetups, and helping other people start their machine learning careers.
What interests you about your current position?
There’s a high level of trust and autonomy in the Engineering organization I’m part of…I can see every line of code in the company and provision any infrastructure I want. It’s awesome to be able to go from “hey I think we should do {X}” to working on {X} in weeks, not years. And to be clear, I don’t mean “we’re a small overworked group of engineers YOLO-ing things up into k8s and praying it works”. There’s a strong culture of code review, testing, and monitoring guiding everything we do.
As weird as it sounds, I also really love the simplicity of SpotHero’s business model. We’re a marketplace where people buy and sell parking. That’s it. Whether I’m working on data pipelines to power marketing automation, upgrading to a new major version of a backend Python library, or trying to cut 30 seconds out of a CI pipeline, I can always tie what I’m doing back to how it helps our business. That hasn’t been true at every job I’ve worked.
In my role as an open source maintainer, I love getting to learn about random technology via issues, feature requests, and the research process involved in debugging. Like did you know shared libraries (e.g. a .so) have an optional property called RPATH that can be used to tell library-loading code to look in an alternative directory for dynamically-linked libraries? (click here for more details than you could ever want) Or did you know that conda patches many popular Python libraries to make them work differently when installed via conda? If you have conda in your development environment, try looking at some of the files returned by this:
I don’t know if I would have learned and internalized all this random stuff on my own, but learning them along the way while working on a specific task like investigating a bug has been really fun.
What are some things that drive you crazy about your position?
There’s a solid amount of stuff in my current company’s environment that’s like “thing someone created manually 3 years ago and everyone involved with it is gone and no one left knows whether or not we can delete it”. I love doing maintenance work (seriously!), but the investigations involved in trying to understand those old things can be draining. And it’s definitely frustrating when your focus is broken by an on-call alert related to something you’ve never heard of, which you eventually learn is old and unnecessary.
In my role as an open source maintainer, it’s frustrating to deal with people who aren’t respectful of my time. For example, I once received a “bug report” on a project I maintain that was just titled "Doesn't work on AWS" with no other information. Having to say over and over again “what version are you using? can you provide a reproducible example showing what you tried? what operating system are you on? can you share any logs?” is exhausting. I actually almost quit open source recently after a few particularly bad months dealing with unhelpful and rude people, but “Working in Public” by Nadia Eghbal (link) helped me develop a healthier, more sustainable approach to that work.
What does your company do?
SpotHero is a parking marketplace. People who operate parking facilities (which can be as small as a single spot behind an apartment and as large as a downtown parking garage) list parking spots with us. Drivers can reserve them ahead of time on our website, mobile app, or over the phone. We deal with all the integrations work, like generating a QR code in the app that drivers can scan to make the gate at a garage open.
What is your team responsible for?
At SpotHero, the roles “data engineer” and “machine learning engineer” coexist on the Data Engineering team.
We own a mix of things:
- bespoke data pipelines that populate specific data sources
- reusable services used by other developers, including:
- data lake for large structured datasets (Trino)
- data warehouse for fast, powerful queries over curated datasets (Redshift)
- a remote development environment for data scientists (Databricks + some mildly fancy AWS and container stuff)
- workflow orchestration + scheduler for data-intensive scheduled jobs (Airflow + some mildly fancy k8s stuff)
- a custom low-code* service for scheduled re-materialization of views
- a custom low-code* service for persisting messages from Kafka topics as Hive tables
- vendor integrations for replicating application data to the data lake, and for persisting clickstream data
*“low-code” = YAML 😛
My team also has a consulting role within the company. We answer other teams’ questions like “how should we make this application data available to analysts?” and “how should we store predictions from this model?”.
What projects are you working on in the next 6 months?
As of this writing, I’m not sure what I’ll be working on at SpotHero for the next 6 months.
In the open source world, I’m planning to focus on:
- finally getting LightGBM 4.0 released (https://github.com/microsoft/LightGBM/issues/5153)
- returning to active participation in the Dask ecosystem (https://github.com/dask)
- helping out where I can with hamilton (https://github.com/stitchfix/hamilton)
- finishing the new “Practical Deep Learning” course (https://course.fast.ai/) to expand my knowledge of neural networks
What tech do you touch on a daily basis? For what?
As a Machine Learning Engineer at SpotHero:
- Google → trying to figure out how to make software work
- Python → for many things (click here to see my talk “Every Way SpotHero Uses Python”)
  - sample of libraries: numpy, pandas, psycopg2, requests
- make → gluing together development commands (e.g. running tests, building images, etc.)
- docker → for distributing software as self-contained environments
- helm (link) → modifying Kubernetes resources
- k3d (link) → running Kubernetes on my laptop to test things
- asdf-vm (link) → installing and using multiple different versions of CLIs
- git + GitHub → collaborating on code changes
- Amazon Redshift → storing small-to-medium-sized tabular datasets and querying them with SQL
- Trino (link) → reading large datasets stored in Apache Parquet files in cloud object storage
- Databricks Container Services (link) → running JupyterLab in containers on arbitrary-sized Amazon EC2 instances
- sops (link) → storing secrets in source control
- Prometheus → collecting operational metrics + generating alerts based on those metrics
- Grafana → creating plots of metrics, and dashboards of those plots
- Apache Airflow → executing scheduled workloads and providing operational control over them (e.g. aggregating logs, visualizing historical runs, triggering alerts on job failures)
As an open source contributor / maintainer:
- Google → trying to figure out how to make software work
- CMake → compiling a C++ library with many combinations of operating system, architecture, compiler, and library features
- R → I contribute to a couple of libraries written in this language
  - sample of libraries I use: {data.table}, {jsonlite}, {lintr}, {Matrix}, {testthat}
- Python → I contribute to a couple of libraries written in this language
  - sample of libraries I use: dask, numpy, pandas, pytest, scipy
- GitHub Actions + Appveyor + Azure DevOps → continuous integration
- readthedocs → deploying and hosting documentation
- docker → for distributing software as self-contained environments
What are your main responsibilities?
As a machine learning engineer at SpotHero, my responsibilities are:
- design, implement, and maintain systems (infrastructure + applications + libraries) used to ingest, transform, validate, store, and serve data
- create and maintain data pipelines that ingest and transform data
- design, implement, and maintain systems (infrastructure + applications + libraries) used to develop machine learning models and integrate those models’ outputs into SpotHero’s operations (either as part of other systems, or in the form of dashboards and reports informing human decision-making)
- advise other engineers, data scientists, and analysts at SpotHero on how to best use the available systems at SpotHero to ingest, transform, validate, store, and serve data
😬 If this all sounds kind of vague, see the “What do your days consist of?” section below for specific examples.
As a staff-level engineer, I have some additional responsibilities, like:
- writing specific, honest, inclusive job descriptions and designing interviews to evaluate candidates for new roles
- teaching and supporting more junior engineers
- contributing to company-wide engineering efforts, like updates to shared infrastructure
As an open source maintainer, my responsibilities are:
- asking “can you please not post screenshots of code and logs? are you able to provide a reproducible example?” every day
- documenting bugs and features in enough detail that contributors can help with them
- defending against unrestricted growth of the public API
- contributing code to fix bugs, improve testing / docs, and add new features
- speaking at conferences and meetups (and the MLOps Community podcast, hi! 👋🏻)
What do your days consist of?
As a machine learning engineer at SpotHero, my time roughly breaks down as follows.
20% – recurring meetings
- team-specific meetings
  - Standup, Planning, Refinement, Retro (click here to learn about “agile ceremonies”)
  - on-call retro where we review the types of incidents on-call engineers have been dealing with and look for patterns
- large-group meetings where people share presentations about what they’ve done, or where cascading updates from executives are shared
  - department-wide meeting
  - company-wide All Hands
  - company-wide Sprint Review Demos (all engineers + product + agile)
- “guild” meetings (engineering interest groups)
  - Backend Unite (all backend engineers)
  - Ops Guild (all engineers especially interested in infrastructure + developer experience)
10% – non-recurring meetings
- “let’s talk about this proposal and agree on the approach we want to take”
- “hey can you teach me how to do this thing”
- “can we pair-program for a bit? I can’t figure out this bug and need another set of eyes”
- required trainings (e.g. information security, anti-harassment)
5% – free-use time for learning and experimenting
Once every two weeks, every engineer at SpotHero participates in something called “Discovery Day”.
This is a half day dedicated to working on whatever you want, and it doesn’t have to be SpotHero-related. I’ve used it for things like:
- taking online courses
- making open source contributions
- working on side projects like https://github.com/jameslamb/pydistcheck
- preparing conference talks
- trying to get a functioning Trino + Hive Metastore + Iceberg + S3 setup running on my laptop, to help in reproducing and reporting bugs in Trino
30% – writing
Writing is a really important part of my job, and something I put a lot of energy into. My primary value to SpotHero is my knowledge and ideas… given clear descriptions of those things, anyone could turn those ideas into software.
Some types of writing I produce in this job:
- developer-facing documentation
- backlog tickets
- design proposals
  - I’m a HUGE fan of the “Architecture Decision Record” approach, described in https://github.com/joelparkerhenderson/architecture-decision-record.
- incident postmortems
- operational runbooks
- debugging Slack threads
15% – reviewing others’ work
My team uses pull requests on GitHub as a way to get asynchronous feedback on code. We also use asynchronous comments on written proposals similar to architecture decision records (link) as a way to make larger design decisions.
As a result, a significant portion of my time is spent reviewing these outputs from other engineers and providing suggestions. I’d love to write a whole blog post or give a lightning talk some day about how to be effective in this important type of work…but I’ll spare you all for now.
10% – support
Support takes many forms on the Data Engineering team at SpotHero.
It includes the following activities:
- responding to on-call incidents
  - sometimes this means “turn it off and turn it back on”
  - sometimes this means “spend 6 hours looking for a root cause, push a quick code fix to stabilize things, then spend the next 4 days working on a more permanent fix”
- answering “hey how do I {do-thing}” questions from users of my team’s systems
- manual administrative actions like:
- modifying users and permissions
- reporting issues in third-party services to those companies’ support teams
10% – writing code
“Writing code” takes many forms in my current role.
Some representative examples:
- modifying a Helm chart for Apache Airflow so that Airflow schedulers and workers use git-sync (link) to read job configuration files from a git repository
- creating a Python module with code to pull down schemas from an external schema registry, flatten them out into a data frame, and upload them to Redshift. And a Dockerfile to distribute that code as a container image.
- modifying a Kafka Streams application written in Kotlin which joins clickstream and transactional data
- writing Terraform code to allow management of permissions, secrets, cloud storage, and network-level security settings for users of a “run-Jupyter-on-a-fat-virtual-machine” service
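To make the “pull down schemas, flatten them out into a data frame” example a little more concrete, here is a rough Python sketch of just the flattening step. Everything specific in it is an assumption made for illustration: the Avro-style layout with nested records, the sample user_event schema, and the flatten_schema helper are hypothetical, and the registry download and Redshift upload are left out entirely.

```python
# Hypothetical Avro-style schema, standing in for one fetched from a schema registry.
schema = {
    "type": "record",
    "name": "user_event",
    "fields": [
        {"name": "event_id", "type": "string"},
        {
            "name": "vehicle",
            "type": {
                "type": "record",
                "name": "vehicle",
                "fields": [
                    {"name": "make", "type": "string"},
                    {"name": "year", "type": "int"},
                ],
            },
        },
    ],
}


def flatten_schema(record, prefix=""):
    """Walk nested record fields and return one row (dict) per leaf field."""
    rows = []
    for field in record["fields"]:
        path = f"{prefix}{field['name']}"
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "record":
            # Nested record: recurse, extending the dotted path.
            rows.extend(flatten_schema(field_type, prefix=f"{path}."))
        else:
            # Leaf field: record its dotted path and its type as a string.
            rows.append({"field_path": path, "field_type": str(field_type)})
    return rows


rows = flatten_schema(schema)
for row in rows:
    print(row)
```

A list of flat dictionaries like this can be handed straight to pandas.DataFrame(rows) to get the tabular form that a warehouse load would expect.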
My time as an open source maintainer roughly breaks down as follows.
- 30% – writing & contributing to bug reports and feature requests
- 30% – reviewing issues and pull requests
- 30% – writing code & responding to reviewer feedback
- 10% – speaking at conferences and meetups
What kind of metrics do you follow closely?
The only metric that I ever manually review reports on or make quarterly goals about improving is cloud cost.
Otherwise, my team currently just focuses on providing capabilities and ensuring that all the things we say work continue to work.
My team basically owns two types of software:
- systems (infrastructure + services + libraries) used to do stuff with data
- pipelines that do specific stuff with specific data
For those systems, we capture the metrics necessary to detect problems (e.g. CPU utilization, memory usage, disk usage, lag on Kafka topics), and use those metrics to generate alerts that automatically create incidents for on-call engineers on the team. We react to those incidents and alerts as they happen, but don’t typically review the metrics proactively or try to tie them to business value.
For pipelines, the business value of having the pipelines and the requirements for them to be considered “working” are presented to my team by other teams asking us to implement and maintain them. From the moment we agree to build and maintain them, the only metrics we follow are those necessary to meet the requirements.
I’m not saying that this is an ideal state to be in, but it is where we’re at right now.
War stories?
Here are the clickbait titles to real stories from my ML past. I won’t write out the full stories here, but they could be fun in a future blog post!
- “I once accidentally released LightGBM”
- “A frontend developer once catastrophically broke an ML model my team owned by adding code to divide a number by 1000, for display reasons”
- “It’s because of me that pip install pandas no longer downloads any PowerPoint files”
- “I once worked on a failure prediction model for water filtration systems used in large breweries…while sitting in a bar in a foreign country”
- “I spent a week at a CAT training facility in a converted mine in Arizona, learning how to diagnose problems with mining equipment….for (data) science”
- “The wildest data ingestion pipeline I ever wrote involved walking a USB stick from one cubicle to another and abusing Excel’s Paste as Values”
- “My friend Rita and I won $250 for implementing time-series cross-validation in a programming language where every variable name has to start with a %”
- “I once watched someone pull up a virtual keyboard on their Mac because some administrative action on the company’s Active Directory (AD) instance required a keyboard shortcut with the Windows key”
- “At one company, the easiest way to ship a single if statement was through two layers of transpilation used to produce a custom Kafka stream-processing app and dedicated output topic for model results. And it made sense!”
Who do you admire?
In the areas of machine learning and computers, I really admire the people who take time to make complicated topics accessible to wide audiences, who are genuinely knowledgeable and talented practitioners, and who demonstrate patience and empathy in their in-person and online interactions with the people using their software.
Here’s a short list of those people that I follow:
- Uwe Korn (https://twitter.com/xhochy)
- Vicki Boykis (https://twitter.com/vboykis)
- Julia Evans (https://twitter.com/b0rk)
- Jacob Tomlinson (https://twitter.com/_JacobTomlinson)
- Jay Qi (https://github.com/jayqi)
- Stephanie Kirmer (https://twitter.com/data_stephanie)
In general, like just in life, I tend to admire people who:
- acknowledge their privilege and the degree to which luck has contributed to their success
- communicate with humility and empathy
- talk in specifics
- try to make the world a little bit better for everyone in it
Looking Ahead
Where do you want to take your career next and why?
5 years from now, I want to be doing less implementation work and more architecture / design work, and I’d like to be doing that on systems that include machine learning workloads.
I’m not the most talented programmer or statistician, but I do think I have the qualities required to design large systems and the interactions between different systems, like:
- breadth of knowledge
- attention to detail
- ability to decompose large problems into smaller ones
No one should pay me to re-write some Java services into Rust or design a large-scale experiment, but I do feel confident proposing a design for something like “how should batch re-training of machine learning models on sensitive data be performed?” or “how should the company store and provide access to container images?”.
I really like doing that work. I enjoy the challenge of breaking a large, important, ill-defined problem down into more manageable pieces and some criteria for choosing between different options.
What advice do you have for someone starting now?
- GO 👏🏻 TO 👏🏻 MEETUPS 👏🏻
- If you feel you’re at least 50% qualified for a role based on the job description, apply. Lots of otherwise-good engineering and data science teams don’t put enough energy into making their job descriptions realistic and inclusive.
- If a recruiter or hiring manager tells you “we’re kind of like a small startup inside a large, old, publicly-traded company”, be VERY skeptical.
- Focus on jobs where you’ll learn transferable skills. A job where you’re writing Python code or using AWS services to create dashboards is going to create a lot more opportunities for you in the future than one where you’re using some random commercial BI software (or even worse, the company’s proprietary software).
- Most jobs around machine learning are in companies using it to build products or make business decisions… not in research labs. A week spent learning how to create Docker images will improve your marketability a lot more than a week learning one more neural network architecture.