Eliminating Garbage In/Garbage Out for Analytics and ML
SPEAKERS

Santona Tuli, Ph.D. began her data journey through fundamental physics—searching through massive event data from particle collisions at CERN to detect rare particles. She’s since extended her machine learning engineering to natural language processing, before switching focus to product and data engineering for data workflow authoring frameworks. As a Python engineer, she started with the programmatic data orchestration tool, Airflow, helping improve its developer experience for data science and machine learning pipelines. Currently, at Upsolver, she leads data engineering and science, driving developer research and engagement for the declarative workflow authoring framework in SQL. Dr. Tuli is passionate about building, as well as empowering others to build, end-to-end data and ML pipelines, scalably.

Roy is the head of product at Upsolver, helping companies deliver high-quality data to their analytics and ML tools. Previously, Roy led product management for AWS Glue and AWS Lake Formation.

At the moment Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
SUMMARY
Shift data quality ownership and observability left, making it easy for users to catch bad data at the source and stop it from entering your analytics/ML stack.
TRANSCRIPT
Hi, I'm Santona Tuli. I'm head of data at Upsolver, and I have a morning ritual: every morning I do a pour-over coffee for myself. It takes about 10 minutes and it frees my mind. I buy local coffee, so now that I live in Northern Virginia, I get coffee from this local roaster called Vigilante Coffee. They're quite good. And when I move, I try out what's around and switch to something new.

Hi, my name is Roy Hassan. I'm the head of product at Upsolver. I don't actually drink coffee; I drink tea. I'm a big fan of teas, different kinds of teas. My favorite is a matcha tea. I actually get my matcha from Laird Superfood. It has matcha, of course, and it has some functional mushrooms. I drink that every single day; it gives you some clarity. It's awesome, I love it. In addition to that, I drink chai every afternoon as well. So that's my daily ritual.
Welcome, welcome, welcome to the MLOps Community Podcast. I am your host, Demetrios, and today I'm flying solo as a host, but we've got two guests. Double the fun! They're both working at Upsolver, and it was an absolute pleasure to chat with them. Let's dive in real quick.

Just to get a little background on who these people are, before I give you my takeaway and we jump into the conversation: Santona has a Ph.D., and she began her journey into data through the world of physics, banging particles together at CERN to detect rare particles, which is absolutely awesome.

I think I remember one time reading in the newspaper about how France was gonna bang some particles together and we might blow up, and I'm in Europe, so watch out for that. But I don't know if Santona had a part in that or not. I just remember that. Now that I've told this story, I realize it was not as good as I remembered.

That was a ride with no destination. There we go. This is what you get to expect from me, folks. I'm on my own today; I've got no one to save me. Anyway, Santona went on to work in Python when she realized, hey, I really like data orchestration, and she started working on data orchestration tools like Airflow, helping improve the developer experience.

And I believe she went and worked with the managed Airflow company called... Astronomer. That's right, Astronomer. And currently she is at Upsolver, and she's got a bit of a product mindset. She leads data engineering and science, so data engineering and data science, and she is driving developer research and engagement for the declarative workflow authoring framework in SQL.
I think Roy was kind to me and gave me a much less detailed intro about himself and his background. So: Roy is head of product at Upsolver. He's helping companies deliver high-quality data to their analytics and ML tools. Previously, before he was at Upsolver, he led product management for AWS Glue and AWS Lake Formation.

I loved this conversation, because both Santona and Roy bring a product mindset to what they're doing, and we dove deep into that. We talked a lot about how you can be more successful coming from that product mindset. They mentioned how you want to talk to the end users, whoever it is that you are building for, and you want to ask them. And it sounds simple. It's like, yeah, of course you do that, but how many of us actually do it? If we're building platforms, like an ML platform for our company, or any kind of data platform, do we go out and talk to the users who are using these platforms? Do we ask them every day: how can I make it better?

What can I do to meet your needs? What are you struggling with? This is the product mindset that both of our guests were advocating for, and it was cool to see. I think everyone is going to need a little more product fundamentals in the way they see the world, especially as more and more of these hard problems, which we thought could only be solved with some complicated machine learning, start to become less and less hard, because you can ping an API that's just a large language model. Maybe I'm oversimplifying, and I'll be the first to say I don't know how true that is going to be. But I still think that, whether or not it's true, product mindset and product fundamentals are going to be useful for you along your journey.

The other thing Santona mentioned that I wanted to point out is how cross-functional her job is, and how important this idea of dogfooding is. If you are building for others, you want to ask them what their pains are, but you also want to work with the tools you're creating, to see what is painful, so you can have a little more empathy for those users. So let's dive into this conversation and hear it from them directly.

But before we do, I must mention that this is a podcast that loves to be shared. It's like your aunt's Facebook posts: they just love to be shared with the family, and they get about two likes. But you know what? Might as well share it. If you found some value in it, please go ahead and send it to one of your friends. Let 'em know that they gotta listen to this.

It is an absolute joy that I get to do this every week. I get to come and sit here and be with you all, and interview such bright minds, and hear the way they see the world, the way they build product, the way they're thinking about machine learning and MLOps. Oh, I love it. And it is thanks to you that this is possible. Let's jump into the conversation.
We need to begin with the backstory on how this happened, how we are here with the both of you, because I think it is worth mentioning: I was hosting the DRE Con conference, or I guess that's a little repetitive, DRE Con conference, "con." Yeah. But I was hosting the Data Reliability Engineering conference, and Roy, you were there, watching and soaking in all of the insights, as was I while hosting. We sparked up a conversation, and I realized that you have a lot of knowledge and wisdom worth sharing.

So I invited you on here, and you said: that's great, but it would be even better if I brought Santona with me, because she is doing all kinds of incredible stuff with machine learning. So this is the tag team, if you will. Roy, you are more in the data engineering space; Santona, you're on the machine learning engineering side. But I get the feeling you probably have a lot of overlap, and I'm excited to explore that overlap. I'm excited to explore what you all have been doing, and then I'm also very excited to talk about the different solutions and these buzzwords: what is hype, and what is not hype, in your eyes these days? Because there is so much noise, whether it's just around data engineering concepts or if, God forbid, we start talking about AI and how much noise is in that field. So yeah, it's probably worth doing a little round on who you all are, what you're doing, and how you came to be where you're at. Roy, how about you kick us off, please?
Yeah, sure. I'm currently the head of product at a company called Upsolver. We're a data analytics product focused on making high-quality data ingestion easy for companies and for users. I started my data career working for AWS back in 2016, doing a lot of hands-on work directly with customers to build their data infrastructure: the data platforms, all the AWS services. Building data lakes, building data warehouses, all the infrastructure that goes with them. Then, throughout my years in AWS, I switched roles. I was doing some go-to-market work for analytics for a while, and then I became a product manager.

I was a product manager for AWS Glue and then AWS Lake Formation, so data processing and then data security. And then I left and went to Upsolver. So I have a lot of hands-on experience helping people write Spark code, debug HiveQL code, debug clusters and issues like that; a lot of hands-on experience building and dealing with these things. When I'm participating in events like DRE Con, you hear a lot of high-level talk about these things, and to me it's always: well, what's the next level down? How do you actually implement it, and what does that mean?
So that's what gets me excited.

That is where I would love to go today in this conversation. Yeah. But before we do: Santona, for those who are not already familiar with you (and if they aren't, I highly recommend giving you a follow on LinkedIn, because you share so much insight on there), can you give us your story?

Absolutely. Thank you so much, Demetrios. Kind as always. As you said, I've traditionally been doing a little more ML than analytics or data engineering. I started through particle physics, where it's really more end-to-end data science. ML is used to extract the signals, but there's just so much more that goes into it. There's capturing the data while it's being produced at a really fast rate, pipelining it, feature engineering, heavy feature engineering. And then you get to the ML. And I think that's true everywhere, right? You can't find an ML use case that doesn't involve a lot of feature engineering.

From there, I worked as an NLP engineer, an ML engineer, for an NLP product. I owned the ML service that we had, so it was a lot of fun, with a lot of DevOps-type work in addition to what you might naively think of as ML. Then I did some analytics work at Astronomer, which was the last place I was at before Upsolver, writing pipelines in Airflow, both for analytics use cases and ML use cases.

And then I switched over here to Upsolver, with Roy. We're a data engineering tool. We help you ingest real-time streaming or batch data into your warehouse or lake. We also have a lake that does in-place transformations, very optimized. It's a cool tool that I wanted to help explore.
First question, coming in hot out of the gate: given the product and machine learning profiles we have in this conversation, how do you two find it optimal to work together?

So I think there are maybe two parts to this question. The first part is how Santona and I work together to evangelize that idea, and the second part is how the product teams, the data teams, and the ML teams work together.
On how Santona and I work together: Santona is the one who spends a lot of time out in the community. I do as well, but she spends a lot of time with the community, with data engineers, ML engineers, operators, business leaders, et cetera, to understand what some of the challenges are.

We all talk tech and we can all read slides, but when the rubber meets the road, a bunch of issues come up that people sometimes feel are just part of the job and don't need describing, because "I'm just going to deal with them."

Data quality is one of them. It's "well, that's just the way things have been; why am I even talking about this?" But we want to hear those things. We want to get that information out, and Santona does a really good job of extracting those pain points, those paper cuts.

Then together, we sit and try to understand: what are the best ways we can think about solving these things? Is there technology on the market that's already addressing them? Is there new technology we should develop? How do we incorporate those things into our products so we can then help our customers solve these paper cuts? Because the simpler we make life for our users, the better it's ultimately going to be.
On the second part, how companies work together: I think there's a shift in companies' mentality toward becoming more product oriented. Initially we were data-driven companies; now we want to become product-oriented companies. It's really easy to say that, really easy to put that label on, but in reality it's actually really hard to implement, because having a product mindset is not just thinking "I want to package this into a product."

You have to think about what your users need, how you're going to solve those problems, what technology exists, and what the best way is to solve those problems, not just to solve them. So I think that's the piece we're still trying to educate on, and not just us.

I think the industry as a whole is really trying to learn how to apply product best practices, product thinking, to more parts of the business. The teams that build the user-facing applications understand product better. They do surveys, they do studies. They get feedback, they measure everything, they study it, and they figure out how to make things better. The internal teams don't necessarily do that, especially on the data side. The data side has always been "we're basically the end, right? All the data flows downstream to the analytics."

And then when you get there, somebody tries to do something with it. But there's very rarely a conversation with the end consumer to say: does this meet your needs? What can I do better? How can I improve this? What we always see is finger pointing. The business says: these guys aren't moving fast enough, they need to do more for us, this thing's always broken, how do we get them to move faster? You know what, they're not going to move faster, so I'm just going to do it on my own. That's not the way we solve things. We've got to communicate, we've got to work together. So we're definitely seeing that conversation start to become more front and center, which is really awesome to see.

And we're obviously participating in that, trying to learn from it, and seeing what we can apply on the technology side, on the tool side, to make people's lives easier. And then, of course, there are people processes that you can't really build into a tool.

The most straightforward way I can think of to answer that question is by expressing how much I love the fact that my role is cross-functional, and just how cross-functional it is. It's product, and it's data, and it's engineering, and it's also user research and market research.
All of these things have to work together in order for me to do my job well. And it extends much beyond me. In the tooling space, both at my previous company and at this company, if you want to build a useful product for other data scientists and data engineers, one that's also easy to use, user-friendly, you have to use the product. We call it dogfooding, right? You have to use your own product, and you have to use the other products out there, play around with them. And that's another part of this that I love: I just get to play with tools. The number of tools I know how to use is much larger than it would be if I were on a team using a single stack. That exposure, and talking to people and learning so much, is what I love about this role.

And do you feel like that is common across different teams, especially when you're looking at the machine learning engineer position? Because there are so many stakeholders, you need to be talking to so many different people and looking at so many different tools, since one stakeholder may optimize for one thing and another may be looking at something else.
I would definitely say so. And that's, I think, the other angle of the data product discussion today: what about building data products or ML products, as opposed to being a product manager or a data scientist for a data tool? And yes, the degree of overlap there, in my experience, is huge. Good data scientists and good machine learning engineers will have a really thorough understanding of whatever they're building, as a product. And I've found that it's a skill you can develop with time, so it's certainly learnable. But it's not just about talking to the different stakeholders and figuring out what everyone wants. You could get scope creep, and things could get really large. It's about distilling that down, strategizing about what to build and what not to build, and how to address as many of those stakeholder needs and use cases as possible by building one thing. All of that thinking hard about solving a problem, I think, is the common thread.
It doesn't matter if it's ML work or data work or product work; it's really just problem solving.

I was just going to add to that. I think Santona really hit it on the head. We're throwing the word "product" around a lot, but many times in these organizations there isn't a clear definition of what it means. There are a lot of things that go into building a product, a sustainable, reliable product, and those are the things we really need to help companies unpack. Remember, within a startup you have ML teams and so on, but everyone is building a product; it's a product company, so they know. In enterprises, though, there's a product team out there on the side, and there are a lot of internal infrastructure teams that can say the word "product" but don't necessarily know what that means and what it entails. And it's really important to distill that down into what it actually means.
How do you build a roadmap for the thing you're building? What infrastructure are you going to build to serve it to your end consumers? What guarantees and reliability are you going to give them? Is there an API spec? Is this thing going to be 99.9% available or not? What does your support look like? If something breaks, what is that going to look like? There's a bunch that goes into making something a product, beyond just having a roadmap and "treating it as a product." What does that even mean? I think a lot of people are still trying to understand what it means.
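As a quick aside on that 99.9% figure: here is a back-of-the-envelope Python sketch of what each availability target actually buys you in downtime budget. The tiers shown are the conventional "nines," my assumption, not numbers from the conversation.

    # Rough annual downtime budget implied by each availability target.
    hours_per_year = 365 * 24

    for uptime in (0.99, 0.999, 0.9999):
        downtime_hours = hours_per_year * (1 - uptime)
        print(f"{uptime:.2%} uptime allows ~{downtime_hours:.1f} hours of downtime per year")

    # 99.00% -> ~87.6 h/yr, 99.90% -> ~8.8 h/yr, 99.99% -> ~0.9 h/yr

The jump from "mostly up" to "three nines" is roughly a factor of ten in allowed downtime, which is why the question "is this thing going to be 99.9% available or not?" is a real product commitment, not a throwaway line.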
We're saying the words, but we're not actually doing the actions. And in our tooling, we're trying to build more automation around this. In Upsolver, for example, when a user uses it to deliver a dataset to an end consumer, whether it's for analytics or ML, we try to bundle a lot of the capabilities you'd need around data management: consistently updating the data, and making sure that if your source system breaks or stops delivering data for some reason, your pipelines don't break so that all of a sudden you have to fix them. We automate the recovery from those failures. Those are infrastructure things you have to build to have a reliable data product. If you're not using a tool, you've got to do it all on your own. I really think our tooling, as an industry, needs to get better at simplifying this, because the teams that are building products aren't infrastructure experts. They don't have those tools, they don't have those capabilities. An application team will have a team that does infrastructure, a team that handles the backend servers and the databases; they're designed, they're structured, to deliver a product. But an ML team is not designed to deliver a product. They're designed to deliver a model. And yet they're responsible for spinning up Kubernetes or whatever, with an API front end, and delivering a highly scalable inference endpoint, which maybe they don't know how to do. They can technically put it out, but they don't know how to manage it as a product.
And I think that's where the disconnect is in a lot of these conversations.

There's one thing I wanted to add, or rather a slightly different view I'd offer than Roy's. It's so interesting to me to think about product for external end users as opposed to data and ML products for internal usage. There are things that transfer, and there are things that don't. So I kind of want to hear from Roy what you think about this: you just said we say "product," but it's not always the same product concepts and ideas. I feel like there is something to be said for it being a different scale, right? If you're building an ML application, say, your total addressable market is very large; everyone could find some potential use for your tool. Versus when you're building a prediction that's going to help, let's say, your customer success team with their work: that is an ML-based data product that is internal, with a much smaller user base. It's a very narrow use case. So which product concepts do you think we absolutely should bring into that internal product, versus which really don't make sense to bring?

Scale is the one that's always first in the conversation.
And you're right: in some cases, if it's externally facing, that scale is usually really high. If it's internal, like the support team example you just gave, scale isn't the problem. It's the reliability, the availability of that service. If your ML prediction service for your support team goes down and they can't use it, is it a deal breaker? Is the world going to end? Probably not. But if you're going to call it a product, if you're going to actually treat it that way, then treat it that way. If today your company has two support engineers and tomorrow you have 50, well, if you built it to support two people with some easy Python script running on your local machine, it's not going to work for 50 users anymore. And if tomorrow somebody says, oh wow, this is really awesome, I want to embed this in dashboards, then it's no longer an individual user logging in and clicking a thing. Now it's embedded in dashboards, and those dashboards are automatically refreshed every 10 minutes. Your scale just went way up, without your even thinking about it.

So I think scale is one that can come and go; it really depends. But the other ones are reliability: what's the uptime of this thing? How can I depend on it? What's its quality? Who's in charge of that model or that dataset, so that if it looks weird or incorrect, or has missing values, who am I going to call? Who is going to fix it? And if I have additions, if I have roadmap changes I want to make, I've got to have a process, a way for somebody to come in and say, hey, I want you to add this or change that. The other thing you often see is feedback: how do I learn from what's going on?
I can throw a model over the fence and have people use it, but if I don't have any mechanisms for collecting feedback, I don't know whether it's good or not. I only know whether people use it. And if they stop using it, do I just abandon it? I don't know what happened. So feedback is huge, and it's something I don't see a lot of in internal tools. It's usually based on emails and asking people, not on collecting live feedback like we do in product. Think of how many millions of events we collect from all over our product just to understand usage. Those are things we talk about, but they're usually a bit harder to do, and folks don't typically implement them, especially for internal-facing products.
How many times have we heard, "oh, this is the company tool and it's so painful to use"? Yep. It really is. That's the downstream effect of not having those mechanisms in place. And sometimes you have to use those tools; there's no other tool. Yeah, exactly, to the dismay of the whole team. They're not stoked on it, but they have to.

That's a perfect example. If you hate that tool so much but you treat it as a product, then you'll know right away that the people using it aren't happy. If you don't treat it as a product, then eventually you'll hear about it through all the emails and you'll see it, but you won't know why.

Especially since we tend to get attached to our tools sometimes, too. Say you're building developer productivity tools internally, and you say, oh, this is the best way to do this, let's get everyone in the company using it. But maybe it has a bad UI, and you have no way of telling that people are actually going through back doors to avoid using your tool.

This is a great segue into another idea I was thinking about as you were talking, which is building for data engineers versus building for data scientists, and how different those profiles seem to be.
I mean, you do have the occasional crossover data scientist who understands Kubernetes and data pipelines and everything you need to know about DevOps and data engineering, and they are that unicorn you want to hold onto and never let go. But for the most part, a lot of the data scientists who come into the field have backgrounds in economics or physics, and maybe they're messing around with a bit of Python, but they're not necessarily so advanced when it comes to the coding side of things. I know there are a lot of people who are very vocal about this, trying to help data scientists get more used to Git or to clean-code fundamentals. But as we're talking about this product idea and building products, especially if you're building an internal platform for your company, you need to serve not only data engineers and machine learning engineers, but also data scientists. What are some things that automatically come to mind?

I am going to be a little bit controversial. Uh oh, I like that. So maybe I've been especially fortunate in this, but I've had the fortune of working with folks across various organizations who are more technical. And I have actually found, in a couple of cases, that we were not treating our user base right: we underestimated what they would have wanted.
So it's all about figuring out what your users actually want; that's the bottom line. My point really is that you can go both ways. You can end up building something that's too complex for someone to use in their workflows, and then it's useless. But the trade-off is usually that a tool is either complex or it doesn't do everything; that's usually where the trade-off sits. So if you try to make everything too streamlined, too point-and-click, you end up in situations where it's not a flexible tool.

Personally, too, I would rather learn how any given tool or library works and get it to do what I want than be restricted. That's one of my biggest gripes with BI tools, for instance, and data visualization tools in particular. I would much rather generate a plot in Plotly or something like that than make a pre-made Sankey chart. That's because of how I see it: I do like to have the power user in mind when thinking about products and who we're building for. But I want to temper that by also saying that when you're building product, they say it's 80/20, right? Make sure you build something most people are going to get value from, and then you can think about how to bring more people into the discourse.
So that's my semi-controversial answer.

I'll just add to that. I agree with Santona. The way I see it, or the way I explain it, is: the product you're building needs to target a specific audience, a specific target user. You're never going to build a product that targets everybody, because you'll make everybody unhappy. So you need to focus on who your target audience is, and on what you need to give them to make their experience the best and let them work in the fastest, most convenient way they know.

For example, I may build a product targeted at data engineers, because I want a data engineer to have an easy way to move data from a source system, define data quality expectations on that data, and then deliver it to a target system like Snowflake or Redshift or a data lake. I want to make that experience as easy as possible, so I'll build my product to focus on that. Now, I don't want to alienate the other users. A machine learning engineer may say, hey, I want to read data from Postgres, this particular table, and store it in a data lake as CSV files, because I then want to build my TensorFlow model on top of it. Great. Do I need to make my user experience, my product, fit them? They're not quite my target audience, so I can try. But the way I look at it is: focus your user experience on the target audience, and then make your product extensible.
Have the APIs, have the SDK, have the CLI, have the hooks, so users can come in and say: hey, I'm an ML engineer, actually, and I feel very comfortable inside a Jupyter notebook. I don't want to use your UI; I just want to plug into it. I want to use the engine, but with my own interface. And I think that's the most powerful way to do things. That's the way we thought about design in AWS, too: you have the core design for your target audience, but the APIs and the integrations are there, so if you want to plug in something else that's easier for you, go do it. Because at the end of the day, you make people most efficient by letting them pick the tools they already know and run with them. Forcing them into a new user interface or a new tool, and having to teach them how to use it, may work, but it's going to take a long time, and people aren't always going to be happy: "I already have all my libraries and all my tools in this other system; how do I get them over here?" Allow me to bring my tool and plug into your engine. And that's really the big benefit of a data platform. A platform is not about solving everybody's problems; it's about providing an interface to the core engine, so you can plug in different systems and make it easy for everybody to be effective.
Love that. I think with good products it's almost like little Easter eggs you can find if you're a power user. One easy example that comes to mind: you might have a pre-configured default Kubernetes pod that launches for any user, but then, as a secondary thing, you let people configure what they want in their pod. So the person who knows how to do that, great; and the person who doesn't know is still going to get something that should be fairly optimized for simple use cases.
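A minimal Python sketch of that "sane default, power-user override" pattern; the field names and values here are illustrative assumptions, not any platform's actual pod schema:

    # Every user gets a working default; power users override individual fields.
    DEFAULT_POD = {
        "cpu": "1",
        "memory": "4Gi",
        "image": "datascience-base:latest",  # hypothetical base image
        "gpu": 0,
    }

    def pod_spec(overrides: dict | None = None) -> dict:
        # User-supplied values win; everything else falls back to the default.
        return {**DEFAULT_POD, **(overrides or {})}

    print(pod_spec())                              # the simple case just works
    print(pod_spec({"gpu": 1, "memory": "16Gi"}))  # the power-user case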
So, changing gears a little bit. Roy, you mentioned before we hit record that you wanted to discuss what value we're solving for with these users: we're creating solutions, but what are we creating these solutions for? Can we dive into that a little, and maybe you can kick off that discussion for us?

Yeah. It's a broad statement, so let me try to focus it. I'll tell you, in particular, what we're trying to do at Upsolver. We originally built a data pipeline solution: users could write SQL, and it builds a data pipeline for them to move data from point A to point B, with transformations in the middle, et cetera. A lot of the focus in the industry has since shifted toward identifying and improving the quality of data. We help users build the pipeline, the infrastructure to deliver data, but we didn't really pay attention to the data inside of it; that was up to you. And as we started having more and more of those conversations with customers and potential customers, we heard the same challenge over and over again.
It was: yeah, okay, you can move data into the warehouse; yes, you can get it in there, but it's garbage in, garbage out. How do we address that? Now, we can run tests and do all these things, but they're a bit reactive; they're after the fact. If I run the test, the data is already in the warehouse. People are already depending on it, the model is already trying to run on it, and it's already breaking. So what do I do now? The test failed, okay, fantastic. Now I've got to go find where the problem is and fix it, but you already loaded the data in. So we took a step back and said: okay, we can just be a dumb pipe and load stuff in, or we can double-click on these problems and find solutions for our users that actually help them solve the real problem: how do we change the outcome? Just loading data through a pipeline does not change the outcome. Loading reliable data, loading high-quality data in a reliable way, changes the outcome, because now you can start building trust in that data.

Before, it wasn't reliable. If something showed up, great. If some rows were missing, I didn't really know, until I look at my report on my dashboard and all of a sudden my bar graph is a little smaller than it used to be, and I'm wondering what happened. I can't trust it. And we see this in our users, where they say: you know what, I don't really trust this; I'm just going to unload the data from the warehouse into Excel and do it on my own, because I can trust myself. So that's what I'm talking about when I say there's a data problem, a data quality problem: quickly unpacking what that actually means and finding those paper cuts is what allows you to actually solve the problem.
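As a concrete illustration of catching bad data before it lands, here is a minimal "shift left" quality-gate sketch in Python. The expectations, field names, and the load/quarantine helpers are all hypothetical placeholders, not Upsolver's actual API:

    # Validate records on the way in; only clean rows reach the warehouse.
    def violations(record: dict) -> list[str]:
        errors = []
        if record.get("order_id") is None:
            errors.append("order_id is null")
        amount = record.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            errors.append("amount is missing or negative")
        return errors

    def load_to_warehouse(rows: list[dict]) -> None:
        print(f"loaded {len(rows)} clean rows")   # stand-in for the real load

    def quarantine(rows: list[dict]) -> None:
        print(f"held back {len(rows)} bad rows")  # kept at the source for review

    def ingest(batch: list[dict]) -> None:
        good = [r for r in batch if not violations(r)]
        bad = [r for r in batch if violations(r)]
        load_to_warehouse(good)
        quarantine(bad)

    ingest([{"order_id": 1, "amount": 9.99}, {"order_id": None, "amount": -5}])

The contrast with the reactive approach Roy describes is that the test runs before the load, so downstream consumers never see the bad rows in the first place.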
Two of the biggest paper cuts we hear about are: how do I know when my schema changes, and how do I know how fresh my data is? They sound fairly simple, right? But these problems come up in every conversation we're having. The first is: how do I know when somebody adds a column, drops a column, or changes a column? The second is: how do I know when my table was last updated? Super simple, right? But it seems to be a problem. So if these are really painful paper cuts, why don't we go solve them? Those are some of the problems that Santona, myself, and the engineering team dove into: take the big problem of data quality, unpack it, and find those little things, those paper cuts, that we can go solve.
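To ground those two paper cuts, here is a minimal Python sketch of both checks: a schema diff between yesterday's and today's column mappings, and a freshness test against a staleness threshold. The table contents and the one-hour threshold are illustrative assumptions:

    from datetime import datetime, timedelta, timezone

    def schema_diff(old: dict, new: dict) -> dict:
        # Compare two {column: type} mappings: adds, drops, and type changes.
        return {
            "added":   [c for c in new if c not in old],
            "dropped": [c for c in old if c not in new],
            "changed": [c for c in new if c in old and new[c] != old[c]],
        }

    def is_stale(last_updated: datetime,
                 max_age: timedelta = timedelta(hours=1)) -> bool:
        # Freshness: has the table been updated within the allowed window?
        return datetime.now(timezone.utc) - last_updated > max_age

    yesterday = {"order_id": "bigint", "amount": "double"}
    today = {"order_id": "bigint", "amount": "decimal(10,2)", "channel": "varchar"}
    print(schema_diff(yesterday, today))
    # {'added': ['channel'], 'dropped': [], 'changed': ['amount']}
    print(is_stale(datetime.now(timezone.utc) - timedelta(hours=3)))  # True

Neither check is hard on its own; the pain Roy describes is that most stacks don't run them continuously at the ingestion boundary, where a change or a stall would be caught before anyone downstream notices.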
So good. So yeah, it's thinking about the bigger picture, thinking about the user experience from a holistic point of view, as opposed to "this is my space that I've carved out and I'm going to focus on just this space." It's thinking about the end result. I think in marketing they have this term, "jobs to be done." Have you heard of that one before? Yeah. Yep. It's kind of like that: hey, I just want the end result. I've always heard it said as: I want a picture on my wall. I don't want a hammer and a nail and a handyman to put that picture on the wall; I just want the picture on the wall. Let's get to that. How can I get there most efficiently, whether it's with a drill or a hammer or a handyman coming to help me?

That's right. That's right. And sorry to double back on what you just said, but I think that's a perfect example of what we see on LinkedIn these days.
And I love LinkedIn, I love all the people who post stuff there, but that's what you see: these architecture diagrams, these long posts about the handyman and the hammers and the nails and the wall and the tape measure, all that kind of stuff. And I'm like, yeah, technically it's fantastic, but I just want to know when my table was last updated. Just make that easy for me. Give me that solution, and please don't make me buy a whole new system just to tell me that one bit of information. That's another thing that drives me a little crazy in our industry these days: if there's a problem, there's a product for it. I don't need 15 different products to solve these little paper cuts. How do I consolidate? I wrote something like that on LinkedIn recently. As a vendor, my job is to help solve problems and make users' lives easier, but it's also my job to help you simplify your end-to-end architecture. The more products I add, the better for me, but it's not good for you, because now you have 15 different products, 15 different vendors you've got to manage, deal with, pay money to, get support from, and all those things. Just make it simpler. If we can simplify it, let's simplify it.
Yeah, the tooling sprawl is real. And it's not only real in data engineering; it's the same with machine learning engineering. It's everywhere. And a lot of times, I think, with machine learning engineering and MLOps in general, what you get is different tools that each have a certain value prop, and that's what they message as their main value prop. But they also do five other things. And then there's another tool with its own main value prop, but it does the same five other things. So if you want those two main value props, you have to decide what to do about those five other things: which tool are you going to use for them? They may not be good at those other things. They may just say they do them, or have a basic capability, but they may not be good at them; they may not be the right tool for them. So you've got to get yet another tool for those other things. That's right.

It's definitely a bit of a pain. And so is navigating this space to know which tools do what. At least for me, it was even worse a year ago. Now I think it has probably moved more into the large language model sphere: it's very much like this now, where you're trying to navigate these websites and you can't really tell what the different tools do, because it all sounds the same, with a little bit of marketing jargon sprinkled on top. My favorite thing used to be finding websites that would quote the "80% of machine learning models never make it into production" line. Whoever made up that statistic has got to be so happy, because it's the most quoted stat by every tool out there.
These days I don't see it as much, because all the machine learning models are large and they're all in production now; it's just how reliable they are, and that is another story.

I think it's also really important for all of us to clearly distinguish between "this is fun and I want to go play with it and check it out" and "this is the day job, and this needs to be reliable, accurate, and stable." Playing with a lot of cool stuff and then quickly trying to bring it to your day job, saying "let's just throw it in here"? No. This is a business. You're running a business; you need to be reliable. You can't just bring toys into your day job. In your free time, go for it. Go nuts, have fun, learn all this stuff. But draw a clear line around when you shift things over. And I think that's where some of the maturity around these models and these tools really needs to come in. I can play with them, but until they're mature, until they have a lot of best practices built around them and some good case studies of companies using them at scale and going through all the pains and tribulations, don't bring them into your day job, unless you have a big team and you're comfortable testing this. Most companies don't have that. Most companies don't have an army of data engineers who can go play with all these new tools, and who, if Flink breaks for some reason, can go write some code, contribute it back to open source, and fix the problem. Not many people have that. So you've got to be careful when you bring these tools into your day job.
Yeah, you've got to look at it more like a skunkworks program, and recognize that this is R&D; this is not core to our business. So I wanted to ask a question here, because before we started recording we also mentioned things you would have loved to have learned five years ago. If you were talking to yourself five years ago, what are some things you could tell that past self that would help accelerate their growth and knowledge in this field?

So five years ago I was finishing up my Ph.D., or getting close to it, and I think at the time I hadn't fully committed to the idea of doing data science outside of academia. I was considering different options. So it's going to be a very different answer from Roy's, but I would have liked to know more about what options were out there for someone with an analytical problem-solving mindset. I think I fortunately ended up in an area that's a very good fit for me, and I really enjoy doing the work. But that's what would have been relevant to me, so I feel like that's who I'm speaking to right now: grad students who are about to figure out what's next. That's the part that's relevant. And then, I learned this with time and kind of proved it to myself, but it also would have been nice if someone had said: hey, a lot of the work you've done is super transferable to what you're going to be doing five years from now.
The statistics courses you've had to take, figuring out confidence intervals and things like that: it's all useful and impactful, and people are going to respect your data analysis because you have that background.

How about you, Roy? Five years ago, what would you have liked to have known, or what are some things you feel would have helped accelerate your learning? It could even be three years or two years ago; five years is maybe too far back.
Well, yeah. I think what would have helped me three or so years ago is understanding better how to navigate the strong opinions people have about solutions, technologies, approaches, things like that. I feel like in the technology space, especially in data engineering and ML, it's almost like religion; different people have different religions within this industry. Like ETL versus ELT: it's such a big religion, you're either one or the other. Or data modeling: there are people who live and die by data modeling, and others who say it's pointless. I'm not saying one is better than the other. All I'm saying is that, for me, it would have been better to be prepared for those preferences, that sort of religion around them.

Because in my mind, it's technology: there are multiple ways to solve the same problem. Choose one. If you want to do something a certain way, that's fine; that's your choice. But people are very much of the mindset that this is the only way to solve the problem, this is the best way, full stop. And you can argue all those points until you're blue in the face; you're always going to make the same arguments. This one is scalable; no, this one's scalable. This one is performant; no, this one is performant. So just knowing how to navigate that a little more smoothly, in a way that doesn't trigger all these emotions in different people, would have made things a little bit easier. It's very easy to spark those conversations, to draw out these eccentric views and get people riled up and crazy about this stuff, and I don't know how productive that is, to be honest. So maybe, if I had to summarize: knowing how to navigate personal opinions and preferences around technologies and ways of doing things would have helped me connect with more people and maybe be more effective in the way I communicate.
Well, thank you both for coming on here. I think we're going to end it there; that was a perfect finishing point. Thank you both. For anyone who wants to follow you, we'll leave all the links in the description, so anyone who wants to continue the conversation can check those out and reach out to you.
Awesome. Thank you.