
What I Learned Building Platforms at Stitch Fix


September 30, 2022
Demetrios Brinkmann

Five lessons

by Stefan Krawczyk

Why build a platform?

Picture this. You’re an individual contributor at some company, writing “code” to get your job done. I’m casting a wide net here: you could be a full-stack data scientist at Stitch Fix creating models that plug back into the business, or a software engineer at a startup writing product features; basically, anyone who develops some “software” that somehow moves the business forward. In general, it is easy to get started and deliver value, since things are relatively simple at first. Consistently delivering value over time, however, is hard. You can easily reach terminal velocity and end up spending all your time keeping your prior efforts running, or fighting their details to expand and do more, instead of moving your business forward. So how do you prevent this? At some point you need to start building abstractions to reduce maintenance costs and increase your development velocity; this is, after all, what all the big tech companies do internally. What these abstractions add up to is a platform, i.e. something you build on top of. Building good platforms, however, isn’t that straightforward, especially as businesses grow and scale.

I was lucky enough to spend the last six years focusing on “engineering for data science” and learning to build great platforms for the world-class data science team at Stitch Fix. During this time, I saw lots of platform successes and failures firsthand. Now, there is plenty of material available on what types of platforms have been built (see any big tech company’s blog) and how to think about building a software product (e.g. building an MVP), but very little on how to start a platform and build one out. In this post, I will synthesize my major learnings about how to build platforms into five lessons. My hope is that these five lessons will come in handy for anyone trying to build a platform, especially in the data/ML space.

Background Context

When I joined the data platform team back in 2016, Jeff Magnusson had just written Engineers Shouldn’t Write ETL. I was excited to build out capabilities for data scientists operating in a no-hand-off model, which at the time was an avant-garde way to run a data science department (if you haven’t read either post, it’s worth the read). At a high level, the platform team operated without product managers and had to come up with platform capabilities that moved data scientists forward, who in turn moved the Stitch Fix business forward. As cheesy as it sounds, what Jeff Magnusson wrote was true: ‘Engineers should see themselves as being “Tony Stark’s tailor”, building the armor that prevents data scientists from falling into pitfalls that yield unscalable or unreliable solutions.’ We really did get to dream big with the tooling we built. Now, how did things work out in practice? Well, some people’s ideas and efforts flopped hard and others were smashing successes, hence the motivation for this post.

Before we go further, a quick word on nomenclature. I will use the term “platform” in a loose metaphorical sense: it is anything you build on top of. So if you provide a web service API, a library, a UI, etc., that other people build on top of, then you are building a platform. I’m also being liberal with the term “API”, using it to cover the entire UX of your platform unless noted otherwise.

Lessons Learned

Here I’ll present five lessons. While the lessons can be read independently of one another, I highly recommend reading them in order.

Lesson 1: Focus on adoption, not completeness

Everyone wants to build the perfect platform for their stakeholders, with all the bells and whistles attached. While this is well-intentioned, it commonly leads to a trap: building too much with no early adopters. For those familiar with the terms MVP & PMF, that is basically what this lesson is about.

Let me put this into context. The Stitch Fix Data Platform team operated without product managers. So each platform team had to figure out what to build and who to build it for. A simple solution here could be “just hire a PM”, but (1) technical ones are hard to find (especially back in 2016) and (2) it would go against how we wanted to operate. A lot of engineers had to learn the hard way that they couldn’t just build something in isolation; going off for a quarter and then going “tada”🎊 wasn’t going to guarantee anyone would use what you were building. In fact, that was a recipe to get you fired!

Why might this happen? Well, if you have a vision for a platform that fulfills a wide array of use cases, it’s tempting to build for all of them from the very beginning. This is an arduous process, and it takes a long time to arrive at something usable. My metaphor to describe this is: if you want to build a house (which represents your platform), you generally start with the foundations and then build upwards, adding walls, the ceiling, and then, once the exterior is done, the internals; the house isn’t livable or usable until everything is completed. If you build a platform this way, it’s very easy to go away for a long time and not have anything to show for it. Worse yet, you waste a lot of effort building a house that no one wants, e.g. one with a single bathroom, only to discover your end users need a bathroom for each room.

So instead, try to find a way to build “vertically”, a single room at a time, so that the room is habitable and someone can make use of it before the entire “house” is completed. Yep, go ahead, try to picture a house where only the structure for one room exists, yet that room is functional; that’s the image I’m going for. While we might not build a house like this in the real world, we can always build software like this, so bear with me. That said, modular construction is all the rage these days, so maybe I am onto something with this metaphor… By building a room at a time, you get validation faster and have time to pivot and course-correct as you fill out the rest of the house. Now, this doesn’t solve the problem of knowing which room to build first, and thus who to build for first. Remember, there is a human side to building a platform: determining who your first users are and getting their commitment can make or break your project. Here are two patterns that I saw work well:

  1. Adopt existing user tooling
  2. Partner closely with a team and a specific use case

Adopt existing user tooling

The data scientists Stitch Fix hired were a capable lot. If there was a gap in some area of the platform, you could be sure that data scientists had filled that void themselves and built something. As a team determining its own product roadmap, we were on the hunt for capabilities to build and extend, so inheriting homegrown tooling/frameworks/software made a lot of sense. Why? Adoption was all but guaranteed; the platform team only had to polish and generalize. If they built a shack that worked for them, then coming in and doing a remodel gave you a very specific set of parameters to work with. One caveat with this approach is that you need to see a bigger vision than what their solution currently provides, e.g. more capabilities or support for more users, or else you’ll be doing a remodel for little benefit.

For example, there was a homegrown tool that one of the teams had come up with for their own particular business context. It was a configuration-driven approach to standardize their team’s model training pipelines. They had built it because they needed it to solve some very specific pain points they were having. We did not partner in building it because we were not in a place to support such an endeavor at the time (we were even skeptical of it). Fast forward a year, and suddenly more data science teams hear about it and want to start using it. Problem is that it was very coupled to the context of the originating team, who had little incentive to support other teams using it. Perfect problem for a platform team to step in and own! Importantly, we could see a grander vision with it and how it could serve more use cases. See this post for the outcome and extensions we added.

I like this approach in particular because:

  1. You didn’t spend time iterating yourself to determine what to build for people to adopt it. Win.
  2. You got someone else to prove its value. Win.
  3. You can then have good reason to inherit it and improve it. Win.

Note: inheriting can get political at times, especially when the person building it doesn’t want to give it up. If there are clear platform responsibility boundaries in place this isn’t a hard pill to swallow, but if it’s a surprise to the creator then options are to have them transfer to the platform, or simply have a hard conversation… In general, however, this should be a win-win for everyone involved:

  1. a win for the team that created the tool because they are now unburdened by its maintenance.
  2. a win for you because you can take over the tool and take adoption and capabilities further than it would otherwise have gone.
  3. a win for the business because it’s not wasting resources on speculative efforts.

Partner closely with a team and a specific use case

I recall one conversation with a platform engineer. They were balking at the feedback that they should be able to deliver something sooner for people to get their hands on it. “No, that’s not possible, that will take two months” (or something to that effect). I agreed, yes, this is a challenge, but if you think about it long enough, there generally are ways for any platform project to be chunked in a way that can show incremental value to bring a stakeholder along.

Showing incremental value is important; it helps keep you aligned with your stakeholders/users that you’re targeting. It is also a good way to de-risk projects. When building platforms you have technological risk to mitigate, i.e. proving that the “how” will actually work, and adoption risk, i.e. will someone actually use what I’ve built. With our house-building metaphor, this is what I mean by figuring out how to build a habitable room without completing the entire house. You want to bring your stakeholder along from architecture diagrams, to showing sample materials, to building something that minimally works for their use case.

Practically speaking, a way to frame delivering incremental value is to do time-boxed prototyping and make go/no-go decisions based on the results. It is far better to pay a small price here and learn to kill a project early, versus making a large investment without mitigating the key risks to success. Do this by targeting a specific, narrow use case, then determining how to broaden the appeal by expanding the platform “horizontally” to support wider use cases.

For example, when we set out to build our capability to capture a machine learning model and deploy it with no extra work on the part of the data scientist, we partnered very closely with a team that was embarking on a new initiative. You could think of them as a “design partner”. They had a narrow use case: they wanted to track what models were built and then selectively deploy those models in batches. This enabled us to focus narrowly on two parts: saving their models, and owning a batch job operator that they could insert into their offline workflows for model prediction. Constraining the work to a team that had a deadline gave us clear constraints for delivering incrementally: first the API to save models, then the job to orchestrate batch predictions. Because we had a vision of supporting other teams with these capabilities, we knew not to over-index on engineering towards this one team. By working closely with them we ensured we got adoption early, which provided valuable feedback on our intended APIs and batch prediction functionality. In turn, they got a partner that supported and heard their concerns and was aligned to ensure that they were successful.

As an astute reader, you might be thinking this just sounds like agile project management applied to building a platform. My answer is yes, you’re basically right, but many a platform engineer likely hasn’t had this sort of framing or mentorship to see the connection, especially in a world where product managers would do this type of thing for you.

Lesson 2: Your users are not all equal

As engineers, we love building for possibilities. It’s very easy for us to want to ensure that anyone can do anything with the platforms that we provide. Why is that? Well, I’m stereotyping here, but we generally want to be egalitarian and treat every user that we’re building for equally in terms of providing support and functionality.

That is a mistake.

Two facts:

  1. Users you build for will fall on a spectrum (a bell curve, if you will) of abilities. There will be average users, as well as outlier users. Outlier users are your most sophisticated users.
  2. Features you add to the platform do not contribute equally to development costs and maintenance.

In my experience, outlier users want your platform to support more complex capabilities because they want you to serve their more sophisticated needs. This generally means higher development and maintenance costs for you to implement such a feature. So you really have to ask yourself, should I:

(1) design for this feature at all?

(2) actually spend time building it and maintaining it?

Or (3), push back and tell that user they should build it themselves?

You might be thinking that what I’m talking about is simply a case of over-engineering. While, yes, this does have that flavor, over-engineering has more to do with what the solution is, versus actually deciding whether you should support some functionality in the platform or not. Using our building a house metaphor, should you build in some sophisticated custom home automation system because someone wants voice-activated lights, or should you just tell the user to figure out how to provide that feature themselves?

Unless you’re looking to build a completely new platform and searching for a customer, or there are otherwise compelling business reasons, you as a platform builder should learn to say no (in a nice way, of course). In my experience, more often than not, these features end up being related to speculative efforts. I found it is better to wait and ensure that the effort proves valuable first, before determining whether it should be supported. Remember, these asks come from sophisticated end users, so they can very likely get by supporting it themselves. Note that if you take this strategy, it can feed into the “adopting homegrown tooling” strategy from lesson 1.

Lesson 3: Abstract away the internals of your system

Over time, less and less infrastructure/tooling is being built within an organization as the maturity of technology providers in whatever domain you’re in has grown. Invariably you, as a platform builder, will integrate with some third-party vendor, e.g. AWS, GCP, an MLOps vendor, etc. It’s very tempting, especially if the vendor solves the exact problem you want to solve, to straight up expose their API to users you’re building the platform for since it’s a quick way to deliver some value.

Exposing APIs like this to an end user is a great recipe for:

  1. Vendor lock-in.
  2. Painful migrations.

Why? You have just given up your ability to control the API of your users.

Instead, provide your own version of that API. This should take the form of a lightweight wrapper that encapsulates the vendor API. Now, it’s easy to do this poorly and couple your API to the underlying one, e.g. by using the same verbiage, same data structures, etc.

Your design goal should be to ensure your API does not leak what you’re using underneath. That way, you retain the ability to change the vendor without forcing users to migrate, because you keep the degrees of freedom to do it without requiring users to change their code. This is also a good way to simplify the experience of using a vendor API, as you can lower your users’ cognitive burden by making common decisions on their behalf, e.g. how things are named, structured, or stored.

For example, we integrated an observability vendor into our systems at Stitch Fix. Exposing their python client API directly would have meant that if we ever wanted to change it or migrate away, it would be difficult to do so. Instead, we wrapped their API in our own client library, being sure to use in-house nomenclature and API data structures. That way we could easily swap this vendor out if we wanted to in the future.
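To make the shape of such a wrapper concrete, here is a minimal sketch. All names are illustrative, not Stitch Fix’s actual API, and `VendorMetricsClient` is a stand-in for a real third-party SDK:

```python
class VendorMetricsClient:
    """Stand-in for a third-party observability SDK we don't want to leak."""

    def submit_datapoint(self, metric_id: str, val: float, labels: dict) -> None:
        # A real SDK would ship this over the network; here we just print.
        print(f"[vendor] {metric_id}={val} {labels}")


class Metrics:
    """In-house facade: in-house nomenclature, no vendor types in signatures."""

    def __init__(self) -> None:
        # The vendor is an implementation detail; swap it here to migrate.
        self._client = VendorMetricsClient()

    def emit(self, name: str, value: float, **tags: str) -> None:
        # Common decisions (naming convention, tag shape) are made once, here,
        # so users never touch vendor verbiage or data structures.
        self._client.submit_datapoint(f"stitchfix.{name}", value, dict(tags))


metrics = Metrics()
metrics.emit("model.latency_ms", 42.0, team="styling")
```

Because user code only ever imports `Metrics`, replacing `VendorMetricsClient` with a different vendor’s client is a change confined to one file.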

Note, this isn’t an unreasonable approach to take with your sister platform teams either, if you use their APIs. Some rhetorical questions to think about: do you want to control your own destiny, or be coupled to their goals and system design?

Lesson 4: Live your users’ life cycle

If you operate with product managers then they should ostensibly know and be aware of your users’ life cycle, to help guide you as you build your platform. As we had no product managers at Stitch Fix, we were forced to do this ourselves, hence this lesson. Now, even if you do have product managers, my guess is that they will still appreciate you taking on a bit of this burden.

The capabilities and experiences that you provide for your end users result in downstream effects over time. While it can be easy to gloss over the intricacies of your users’ workflows, especially if they stretch past your platform, doing so will inevitably result in tenant and community issues (to use our housing metaphor).

Tenant issues are generally small problems, like simultaneous faucet usage reducing everyone’s water pressure; they only require small tweaks to fix or mitigate. E.g. you made it super easy to launch parameterized jobs, and now people clog up your cluster with work while your cloud expenses jump. What’s the quick fix here? Perhaps you ensure jobs are always tagged with a user and an SLA, so you can quickly identify who is using all your cloud resources and make decisions about where to route tasks based on priority. Or, at minimum, identify who you need to talk to before killing their jobs.
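The job-tagging tweak can be as small as requiring tags at submission time. A hypothetical sketch (these names are illustrative, not a real Stitch Fix API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobTags:
    user: str  # who to talk to when jobs need to be killed
    sla: str   # e.g. "prod-daily" or "adhoc"


def submit_job(command: str, tags: JobTags) -> dict:
    """Every job must carry tags; routing decisions key off the SLA."""
    # Ad-hoc work goes to a preemptible, lower-priority queue so it can't
    # starve production workloads.
    queue = "batch-low" if tags.sla == "adhoc" else "batch-high"
    return {"command": command, "queue": queue, "owner": tags.user}


job = submit_job("python train.py", JobTags(user="stefan", sla="adhoc"))
```

With tags mandatory, the “who is clogging the cluster?” question becomes a lookup rather than an investigation.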

“Community issues” are bigger problems. For example, say you’ve built an awesome house (platform) that can support many tenants (users), but street parking around it is minimal; you didn’t account for this. Anytime someone (i.e. a potential user) wants to visit the house they struggle to park their car and have to walk a long way. If not fixed quickly, these issues can really hurt your platform. To try to illustrate this point, say you focused on making one aspect of a user’s workflow really easy with your platform, but you neglected how it fits into their bigger picture. For example, you might have just increased the total amount of work someone needs to fulfill to get something to production because their development work isn’t directly translatable to your production platform system. In which case, your platform solution that was initially met with enthusiasm turns into dread because there is a particular sticking point that your end users hit time and again. A smoking gun indicating that this is happening is when end users come up with their own tooling to get around this problem.

So what should you do? Walk in the shoes of the end user, and take a macro view of how what you are providing fits into what work they need to get done. Here are a few approaches to mitigate problems:

  1. Be an end user: actually, use your platform and get things to production on it.
  2. Model the hypothetical: draw the flow chart of your users’ workflows and then think about ramifications of whatever platform feature you’re providing (works for every situation).
  3. Bring in an end user: bring a user on for an internal rotation – they should be able to understand and explain this to you and your team (bring someone to help be a better voice for your users).
  4. Build relationships: build deep enough trust and relationships with your peers such that you can ask blunt questions like “what do you hate about your workflow?”, “if there was something you wouldn’t have to do in getting X to production, what could it be?”. Sometimes your users are just anchored and resigned to the fact they can’t change the world, where in fact they can by giving you feedback. Other times they don’t feel safe enough to give you the feedback you really need, so you’ll need to build trust for that to happen.

If you do the above for long enough, you can start to intuit what’s going to happen more easily and thus determine what extra features you might need, or potential issues to anticipate and plan for.

Lesson 5: The two-layer API trick

In this lesson, I put forward my high-level framing of how I think when I set out to build a platform. This is essentially the playbook I came up with to help deliver successful platforms at Stitch Fix. I concede it might not always be possible to follow this approach, due to tight requirements or the nature of your platform, but as you build higher-level abstractions you should be able to apply this way of thinking. As you go through this lesson, you’ll hopefully see connections with the prior four lessons. But first, some motivation.

Motivation

(1) Remember the sophisticated user of your platform who asks for that complex feature? Since you said “no, go build it yourself”, they will likely go ahead and do so. But if they’re successful, you’re going to want to inherit their code, right? Wouldn’t you want to make that inheritance a simpler process if you could?

(2) it is easy to write very coupled, non-generalizable code when providing platform capabilities, i.e. it’s hard to break apart and extend/reuse. This isn’t a bad thing if you’re getting started and need to get something out there, but it becomes a problem when you want to extend your platform. In my experience, especially if you don’t have the time for “tech debt” projects, it’s easy for such coupled code to snowball and thus significantly impact your team’s delivery of work.

(3) in lesson three, the focus is on not leaking vendor API details. I think that’s a good approach, in effect, you create two layers of APIs, but it’s quite focused on the micro problem of vendor API encapsulation. How can we extend that thinking further and provide ourselves with some framing for our entire platform?

Two layers of APIs

To help with maintaining and growing a platform, you should think about building two layers of APIs:

  1. A bottom layer allows one to build “anything” but in a bounded way.
  2. A second, higher-level layer provides a less cognitively taxing, opinionated way to do something.

Using our house building analogy here, the lower layer represents the house’s foundation, plumbing, and electrical; it bounds the shape and surface area of the house. The higher-level API corresponds to what a room is; its features and layout, e.g. for your users you have placed the refrigerator, stove, and sink to form a kitchen triangle because for anyone doing cooking that’s a pretty good setup. Then if someone wants something more complex in their room, we’ve made it easy to take the walls off and get access to the plumbing and electrical so they can rearrange it how they want instead.

Let’s expand on these two layers more concretely.

What is this bottom API layer?

The purpose of this “low-level API” is to let you express anything you want your platform to do, i.e. this API captures base-level primitives. That is, this is your base capability layer; using it means you can control all the minutiae.

The goal of this layer is not to expose it to your end users per se. Instead, the goal is to force you to define a clear foundation for yourself (pun intended, given our house building metaphor) to build off of. Therefore you should consider yourself the primary target of this layer. For example, this layer could have APIs for reading and writing data in various formats, where using it means making decisions about file names, locations, which function to use for what format, etc.
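A minimal sketch of such a bottom-layer primitive, using hypothetical names (this is not a real Stitch Fix API): the caller decides the file name, location, and format; the platform only bounds which formats exist.

```python
import json
import pickle
import tempfile
from pathlib import Path

# The bounded set of supported serialization formats. Each writer takes
# (object, binary file handle).
_WRITERS = {
    "json": lambda obj, f: f.write(json.dumps(obj).encode("utf-8")),
    "pickle": pickle.dump,
}


def write_artifact(obj, path: Path, fmt: str) -> Path:
    """Low-level primitive: the caller is explicit about everything."""
    if fmt not in _WRITERS:
        raise ValueError(f"unsupported format: {fmt}")
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        _WRITERS[fmt](obj, f)
    return path


# Using the bottom layer means making all the minutiae decisions yourself:
out = write_artifact(
    {"accuracy": 0.92},
    Path(tempfile.gettempdir()) / "artifacts" / "metrics.json",
    fmt="json",
)
```

Notice how much the caller has to decide; that cognitive load is exactly what the higher layer will remove.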

What is this second API layer?

The purpose of this “higher level API” is to provide a simple experience for your average user, built solely on top of your lower-level API. You are essentially encoding a convention into this API to simplify the user’s platform experience, since you have made the lower-level API decisions for them. For example, building off the lower-level example above, this layer could expose simple APIs for saving machine learning model objects. This is a simpler API because you’ve already made the decisions on file name conventions, location, format, etc. to save that model, so your platform end user doesn’t have to.

The goal of this layer is to be the main API interface for your platform end users. Ideally, they can get everything they need done with it. But if they need to do something more complex, and it doesn’t exist in this higher level API, they can drop down to the lower level API you have provided to build what they need for themselves.
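Here is an illustrative sketch of the two layers side by side (hypothetical names throughout): the bottom layer is explicit about every decision, while the top layer bakes in the conventions of location, naming, and format on the user’s behalf.

```python
import pickle
import tempfile
from pathlib import Path

# Platform-chosen artifact root; in a real system this might be object storage.
BASE = Path(tempfile.gettempdir()) / "platform-artifacts"


def write_object(obj, path: Path) -> Path:
    """Bottom layer: the caller controls the full path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def save_model(model, name: str, version: str) -> Path:
    """Top layer: one opinionated call; no path or format decisions to make."""
    return write_object(model, BASE / "models" / name / f"{version}.pkl")


# The average user makes one obvious call:
saved_path = save_model({"weights": [0.1, 0.2]}, name="demand-forecast", version="v1")
# A power user who needs a different layout drops down to write_object directly.
```

The drop-down path is the point: the opinionated call and the escape hatch are both first-class, and the top layer contains no logic the bottom layer can’t express.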

Why two layers should work

By forcing yourself to think about two layers you:

  1. Make it harder for you and your team to couple concerns together, since, by design, you’re forcing yourself to determine how a more opinionated capability (higher-level API) decomposes into base-level primitives (lower-level API).
  2. More easily bound how the platform takes shape, because you define your base foundation layer. This helps you support more sophisticated users, who can peel back the opinionated layer and do more complex things without you having to explicitly support them. By enabling complex users in this way, you buy time to think about whether you should support their use case in a first-class manner (see how this can feed into the “Adopt existing user tooling” part of Lesson 1).

Now some of you might react and balk at the idea of supporting two APIs, as it sounds like a whole bunch of work on API development, maintenance, and versioning. To that, I say, yes, but you’re largely going to be paying it anyway if you’re following good documentation and API versioning practices. Whether it’s internal or external to your team shouldn’t really change much, except how and where you communicate. If you take the alternative approach of building a single API layer, your initial costs might be lower, but the future maintenance and development costs are going to be much higher; you should expect that your platform needs to change over time. E.g. security-related updates, major library versions, new features, etc. My argument here is that it’ll be easier to do so with two API layers than a single API layer.

Two brief examples

To help crystalize this, let’s look at two examples of this two-layer API thinking in action.

Example 1

For example, when we introduced our configuration-based approach to training models, it was built on top of our model envelope approach to capturing models and then enabling deployment. So if someone didn’t want to use our configuration approach to creating a model, they could still make use of the model envelope benefits by dropping down to use that API.

Example 2

At Stitch Fix, we made it easy to build FastAPI web services, but users did not actually have to know or care that they were using FastAPI. That’s because they were using a higher-level opinionated API that let them focus on just writing python functions, which would then be turned into web service endpoints running on a web server; they didn’t need to configure the FastAPI web service themselves because it was already taken care of for them. This functionality was built on top of FastAPI as the base foundational layer. Should a user want more functionality than the upper opinionated layer could provide, they could invoke the lower-level FastAPI API directly instead.
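The shape of that opinionated layer can be sketched as a decorator plus a registry (all names here are hypothetical, and a plain dict stands in for the FastAPI wiring so the sketch runs without FastAPI installed):

```python
# path -> user function; the platform owns this registry.
_ROUTES = {}


def endpoint(path: str):
    """All the user sees: a decorator that registers a plain python function.
    No FastAPI imports appear in user code."""
    def register(fn):
        _ROUTES[path] = fn
        return fn
    return register


@endpoint("/recommendations")
def recommendations(client_id: int) -> list:
    # An ordinary function; the platform turns it into an endpoint.
    return [f"item-{client_id}-a", f"item-{client_id}-b"]


def build_app():
    """Platform-owned: the only place that would know FastAPI exists.
    Roughly: app = fastapi.FastAPI(); for each (path, fn) in _ROUTES,
    app.add_api_route(path, fn); return app. Here we return the registry
    itself as a stand-in."""
    return dict(_ROUTES)
```

The design choice is the same two-layer trick: user code depends only on `endpoint`, so the platform team is free to change web frameworks inside `build_app`, while power users can still construct a FastAPI app by hand.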

Summary

Thanks for reading! In case you’ve been skimming, here’s what I want you to take home. To build platforms:

  1. Build for a particular vertical/use case first and deliver incremental value, where either you inherit something that works, or target a specific team that will adopt your work as soon as it’s ready.
  2. Don’t build for every user equally. Let sophisticated users fend for themselves until it’s proven that you should invest your time for them.
  3. Don’t leak underlying vendor/implementation details if you can. Provide your own thin wrapper around underlying APIs to ensure you have more options you can control when you have to make platform changes.
  4. Live your users’ lifecycles. Remember that you provide for and shape the experience of users using your platform, so don’t forget the macro context and the implications of your UX; drink your own champagne/eat your own dog food so that you can ensure you can foresee/understand resonating impacts of what you provide.
  5. Think about providing two layers of APIs to keep your platform development nimble:

    a. Think about a bounded foundational API layer. That is, what are the base level primitives/capabilities you want your platform to provide, and thus what’s a good base to build on top of for yourself.

    b. Think about an opinionated higher-level API layer. This layer should be much simpler to use for the average user than your lower foundational API layer. To handle more complex cases, it should still be possible for more advanced users to drop down to your bounded lower-level foundational API.

If you disagree, have questions or comments, tweet at me https://twitter.com/stefkrawczyk, connect with me on LinkedIn, or hit me up in MLOps Slack @Stefan Krawczyk to chat more.

To close

I am thrilled to share the insights I’ve garnered from my time at Stitch Fix with you (hopefully they’ve been useful!). Since I left, however, I have not just been editing this blog post. I’ve been scheming about building a platform myself. Stay tuned!

Also, special thanks to Elijah, Chip, and Indy who gave valuable feedback on a few drafts of this post; errors and omissions are all mine.

