
Building Safer AI: Balancing Data Privacy with Innovation

Posted Aug 08, 2024 | Views 39
# AI
# Data Privacy
# Innovation
# DataGrail
Stephanie Kirmer
Senior Machine Learning Engineer @ DataGrail

Stephanie Kirmer is a senior machine learning engineer at DataGrail, a company committed to helping businesses protect customer data and minimize risk. She has almost a decade of experience building machine learning solutions in industry, and before going into data science she was an adjunct professor of sociology and higher education administrator. She brings a unique mix of social science perspective and deep technical and business experience to writing and speaking about the challenges facing the machine learning and AI industry today. Learn more at www.stephaniekirmer.com or www.datagrail.io.

SUMMARY

The balance between AI innovation and data security and privacy is a major challenge for ML practitioners today. In this talk, I’ll discuss policy and ethical considerations that matter for those of us building ML and AI solutions, in particular around data security, and describe ways to make sure your work doesn’t create unnecessary risks for your organization. It is possible to create incredible advances in AI without risking breaches of sensitive data or damaging customer confidence, through planning and thoughtful development strategies.

TRANSCRIPT

Stephanie Kirmer [00:00:09]: So thank you all for joining me today. I'm really excited to be here to talk about data privacy. My name is Stephanie and I am a machine learning engineer at DataGrail. We're a data privacy software company. We help businesses handle their data privacy needs, from the consent form on their website, to their request management, to mapping their internal data to make sure that they know where their high-risk or personal data is located. But that's not what I'm talking about today. I'm going to talk a little bit today about how AI development and AI engineering puts you at risk of data privacy mishaps, violations, and problems, and how you can avoid that by taking some pretty easy steps, actually, in your development process. I've been in the data science and machine learning space for almost ten years, and as I said, I'm a machine learning engineer.

Stephanie Kirmer [00:01:00]: I build models at DataGrail for our products, and I have a long history of experience building AI- and ML-based features for products. So I have some idea of what I'm talking about here, and I have made some of the mistakes that I'm going to try and help you not make in the future. Okay, so the first thing I need to say before we go anywhere is that I'm not a lawyer and this is not legal advice. The things I say in here may be my opinion, they may be my advice, but they are not legal advice. So please take that into consideration whenever I say things. I might repeat this from time to time. But before we go too deep, I want to make sure that we are talking about the same things and that we mean the same things by the words that we're using. First thing: data privacy and data security are two different things.

Stephanie Kirmer [00:01:42]: And this is important for us to recognize because we need to know what it is we're trying to protect and what risks we're trying to avoid. Data privacy is how we use the data. It's how we manage the data, what we do with the data, and it's being responsible, ethical, and obviously compliant with the law. This includes giving people opportunities to consent to how we're going to use their data, things like that, and we're going to talk about all that stuff a whole lot more. Data security is the infrastructure side: it's the technical and, you know, sort of physical protections that make sure that your data is protected from breaches. They prevent hackers from getting into your data. We're not going to talk about that nearly as much today. We're going to mostly stay on the data privacy side of things, but they will both come up from time to time.

Stephanie Kirmer [00:02:28]: Both of these, though, are extremely important, and I don't want to de-emphasize data security in any way. But today we're thinking about data privacy: how you get the data, what you do with it, how you use it, and what kind of interpersonal and legal protections you make sure exist. We also need to know what we mean when we talk about personal data. Are we talking about PII, which is, you know, your Social Security number, for example? Are we talking about more expansive things? Especially in the ML space, I think, we tend to think about PII. We think, well, you know, if you're working in healthcare or something like that, you have scary data and you can't do anything with it, but everybody else is just free to do whatever they want. That's not the case, because, especially in most of the legal frameworks that I'm going to talk about, we really need to be thinking about personal data, which is data that you could potentially combine together to then identify an individual, or data that, if it's attached to an individual, could be used as grounds for discrimination.

Stephanie Kirmer [00:03:24]: Protected classes, sensitive stuff, the stuff about you that you don't necessarily want put in the newspaper or shared around to all of your friends and family, stuff like that. So you can see on the slide we've got a whole bunch of examples. I'm not going to go through them all, but these are kinds of things that technically count under data privacy law as personal information, stuff you do need to be careful with and protect, that are not your Social Security number or your home address or your credit card number or things like that. There's more to it than that. So, as you can see, geolocation data: super important, exactly where you are at any given time. It's one example of the many things that you need to be worried about protecting if you're using any of this in your AI or ML development process. There's another thing we need to talk about, which is data localization. And I can't give this nearly as much time as it deserves, because this is a very complicated concept, but the idea is that certain laws about data privacy regulate where the data can be stored.

Stephanie Kirmer [00:04:21]: And we're not just talking about "the cloud." We think of our data as just living in the cloud, like it's not actually real anywhere. It is, though. There's a data center somewhere, there's a server somewhere. It has electricity and cooling systems. There is some place where your data is actually stored on a piece of hardware, and that is what data localization is about. So, for example, if you get data from a customer who is, say, a Russian citizen, then for part of the data's lifecycle that data needs to be stored on a server in Russia. Then it can be moved elsewhere. There are whole rules and systems about how that can work, but this is just one example.
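
In practice, where your data physically lives often comes down to which region you point your storage at. Here is a minimal sketch of what that can look like, assuming you happen to be on AWS S3 with boto3; the bucket name and region are purely illustrative, and which region is actually required depends on whose data it is and what your counsel tells you.

```python
# Hedged sketch: pinning a storage bucket to a specific region with boto3.
# The bucket name and region below are illustrative assumptions, not a
# recommendation for any particular jurisdiction.
import boto3

def create_regional_bucket(bucket_name: str, region: str) -> None:
    """Create an S3 bucket whose contents physically live in `region`."""
    s3 = boto3.client("s3", region_name=region)
    # Note: us-east-1 is the one region where CreateBucketConfiguration
    # must be omitted entirely.
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# e.g. keep EU residents' raw data on EU infrastructure
create_regional_bucket("example-eu-customer-data", "eu-central-1")
```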

Stephanie Kirmer [00:04:54]: Many, many countries, especially in APAC, have rules around data localization that you need to be aware of. So this is a thing you need to consider. You can't just dump your files in S3 in us-east-1 and be happy. You might need to put that data somewhere particular, depending on the data and depending on who the data is about specifically. So now let's talk about laws. Isn't this gonna be fun? Everybody loves listening to people talk about laws. It's actually very interesting, I promise. There are things that you can generally say you need to be able to do in order to be in compliance with a comprehensive data privacy law.

Stephanie Kirmer [00:05:30]: Now, that's a concept that can be complex, but usually "comprehensive" means it has at least the components that I'm gonna talk about here. There are variations, there are nuances. Once again, not legal advice, but this is the stuff you need to at least be aware of and be sure that you've considered. Okay: informed consent to data usage. The person who's giving you their data needs to get information about what you're going to do with it and how you're going to use it, and they need to be able to say yes or no. Second, if they say no, you can't discriminate against them on the basis of that. So you can't say, hey, you said no to our cookie policy, so you can't use our website anymore.

Stephanie Kirmer [00:06:06]: That's not allowed, at least under comprehensive data privacy legislation, which is not in every jurisdiction. It also defines covered data, as we talked about, as more than just PII, more than just the bare minimum; comprehensive means covering all kinds of data about individuals. You need to have public, transparent communication about what you do with data, not just to the person giving you the data, but available to the general public, so people know what you're doing, where you're storing it, and how you're using it. You need to have a method for people who did say yes to say no later. They need to be able to say, hey, I told you you could use my data, but not anymore. And if you're thinking about training ML models and you're suddenly having a cold sweat, I understand.

Stephanie Kirmer [00:06:47]: We're going to talk in a minute about what it means when someone revokes consent. But finally, you also need to allow individuals to access, correct, and delete stored data about themselves. This is why our request management product is so, so very valuable, because you need to be able to manage all that: figure out, oh, we have this person's data somewhere, and how are we using it? What are we doing with it? Can we correct it? Can we delete it? The laws say you have to be able to do that. There are specific things about ML and AI as well in some of these laws. There's minimization, which is the general idea that you must use the minimum amount of data you absolutely need to train your model for whatever project you're doing, and you have to be able to prove, more or less, that you considered exactly what the minimum was. How did you decide what the minimum was, and did you take that into account? You also need to prevent your model from being used for illegal or discriminatory uses, or both.

Stephanie Kirmer [00:07:40]: You can't use ML models to discriminate on protected characteristics, especially in the EU under the EU AI Act, but this is something that you might find in other jurisdictions as well. All of this stuff is not necessarily part of comprehensive data privacy legislation, but these are things that are coming along more often and are being found in more jurisdictions going forward, because people are more aware of AI and they're more aware of wanting to protect their data than they ever have been. And finally, some jurisdictions will require impact assessments for large models that are going to be applied across large populations. Basically: this algorithm you want to put in your product and deploy, you need to take a look at it and do an evaluation to determine what risks there may be to people in the worst-case scenario, if this model really gets out of hand or does something unexpected. It's kind of like doing human subjects research, if anyone's been involved in that kind of thing: you have to think about the worst-case scenario, write it up, and make sure that you've submitted that to the regulatory authority, if that's the sort of thing that applies to your particular business. So now let's talk about the jurisdictions and the regions. I have a couple of slides in the appendix to this deck, which I think we'll be able to distribute later on, that give specific country-by-country evaluations of some of this stuff.

Stephanie Kirmer [00:08:54]: We don't have time to go over all of that today, unfortunately. So I'll just say: in the United States, we have very fragmented policies. Many states have different data privacy laws. Some of them are comprehensive, some of them are not. But keep in mind that when products are used by EU citizens, even if you're a company based in the United States, it's the EU citizen's jurisdiction that matters. You need to follow the law that applies to the people whose data you're using, not necessarily just the law that applies to where you happen to be standing at a given time. In the EU, they have very strict policies. We have the EU AI Act, we have GDPR.

Stephanie Kirmer [00:09:27]: Those are serious. They do not mess around, and they have real implications for your business. If you violate them, or are found to have violated these laws, they will fine you, and they will be very, very serious about enforcing these rules. But it can be a little bit easier, honestly, because these regulations are pretty consistent. You can rely on them being the same across the entire EU. Individual EU member states can have additional or different laws, but those can't really contradict the overarching EU-wide rules. Elsewhere, there are actually strong nationwide data privacy laws popping up in a lot of places.

Stephanie Kirmer [00:10:03]: Latin America, APAC: you're seeing India, Thailand, Vietnam, China, and obviously Brazil coming on with these kinds of things. So if you're doing business with customers, or collecting data on people, who reside in these places, you need to be aware of all the laws that could be applicable to those individuals and the data that comes from them. So talk to your general counsel's office; they should have awareness of this kind of thing. Now, just a quick review of the US state level, because we're here and it's interesting. The green on the map, what you're seeing there, are the states that have a comprehensive data privacy law. This is from the International Association of Privacy Professionals.

Stephanie Kirmer [00:10:46]: I highly recommend checking out their website if you have any interest in this kind of thing. These are the states with a comprehensive data privacy law. That doesn't necessarily mean there are no data privacy laws in the other places; it's just that there's nothing comprehensive. It means they don't have the full complement of regulations and protections for individuals that the organization finds meets the standard. But there could still be other stuff, which means it's really, really hard to follow all of the different little levels of regulation and rules. So this can be very annoying. And this is where we get into: okay, I've given you the scary stuff, right? All the different varieties of laws and all the ways you could get in very big trouble. These are my tips for how to cope with all of this and how to actually get your AI development and your machine learning done without causing yourself terrible headaches.

Stephanie Kirmer [00:11:36]: Okay, before you start modeling, these are the things that I recommend you do. First: know your responsibilities. Know whose data is going to be involved, know what it is you plan on building, and know what jurisdictions may apply, so that you can tell which experts or professionals in this space you might need to talk to, and consult your legal department if you have any questions. Your legal department is who should be responsible for really knowing what your obligations are and how to protect your business. That's their job. You are a machine learning engineer or a data scientist; it is not necessarily your job to also know all of these laws, but you should be the person who goes and asks the right questions. Then we need to collect data. You might already have your data collected, you may already have data stored, and that may create challenges.

Stephanie Kirmer [00:12:24]: But if you get the opportunity to start before the data has been collected, get involved in the consent process. Figure out what your company is writing in that fine print. When someone says yes to the website or yes to the terms of service, find out what they're agreeing to, so that you can make sure you've got the permissions you need in order to do the kind of machine learning or AI development that you want to do. That may be a conversation with your legal team as well. But this is the stage where you need to understand what consent you're actually getting, not later on when you've already got the data and you're like, what did these people agree to? I don't know. Don't get yourself into that position. It's terrible.

Stephanie Kirmer [00:13:01]: Then collect and use the minimum amount of data. Not less, because then you'll have a problem where you won't be able to build a model that's worth a damn, and that's a problem. But not more either. Don't collect extra data just on the off chance that maybe it might be handy someday, because you're just collecting risk. That's all you're really doing: collecting risks to your business, and that's a situation you don't wanna be in. Set up processes to monitor that consent as well, 'cause when the consent changes, when someone decides, hey, I said yes, but now I'm gonna say no, legally you are obligated to enable them to do something with that, and then you are obligated to act on it.

Stephanie Kirmer [00:13:36]: And then just make sure you're meeting the data localization requirements. Once again, talk to that legal department and figure out exactly where your data needs to be stored. And when in doubt, don't use personal data in the model that you're building. You might be saying, but how the hell do I build a model with no data? I will talk about that in a moment as well. So let's go back to the question of revoking consent, though, because this is a big part of comprehensive data privacy legislation: someone should be able to say, I told you you could use my data, but now I've changed my mind. So what do you do if you're building a model, or you've already built a model, and a customer whose data is in that training set says no? Now what do I do? Can I use the model? The general consensus, once again not legal advice, but the general professional consensus amongst the experts, is that you can still use the model that you built for inference purposes, even if that consent has been revoked later, but you cannot train any more models on that customer's data going forward. So it's important to keep the dates exactly tracked down and figure out when the consent was revoked, because after that point, you cannot train on it anymore.

Stephanie Kirmer [00:14:40]: And you need to be able to erase that customer's data, personally identifiable or personal data, from your data set. Okay? Second, I want to really emphasize that alternatives to individual personal data are still valuable. You can use synthetic data: you can create a data set synthetically that has the same general characteristics as the personal data, then wipe the personal data, and you don't have to use it anymore. You can also aggregate the data. As soon as the data can no longer be pulled apart to identify individuals, and it can no longer be associated with individual people even when it's pulled together, it's not personal data anymore. Then none of these laws apply, and you can go on your merry way.
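
To make the consent-revocation mechanics a bit more concrete, here is a minimal sketch of excluding revoked users before assembling a new training set. It assumes a pandas setup with hypothetical column names (`user_id`, `consent_revoked_at`); your consent records will look different, and whether this is sufficient for your situation is a question for your legal team.

```python
# Hedged sketch: drop records from users who have revoked consent before
# building a new training set. Column names are hypothetical.
import pandas as pd

def filter_revoked_consent(records: pd.DataFrame,
                           consents: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows belonging to users whose consent is still in force."""
    merged = records.merge(
        consents[["user_id", "consent_revoked_at"]], on="user_id", how="left"
    )
    # A null revocation timestamp means the user has not revoked consent.
    still_consented = merged["consent_revoked_at"].isna()
    return merged.loc[still_consented].drop(columns=["consent_revoked_at"])
```

Keeping the revocation timestamps themselves (separately from the training data) is also what lets you document that an already-trained model predates a given revocation, which matters for the "inference is still okay, further training is not" consensus described above.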

Stephanie Kirmer [00:15:18]: What exactly counts as personal data? Of course, you should talk to your legal team and make sure that you have exactly the right advice about what counts as personal data and what does not. But aggregating it can really help. Using open source data is always a great option. We have open source data for all kinds of use cases all around; it may be relevant for your case, it may not, but keep it in mind. De-identification, hashing the data, is great. And the pieces of the component data that could be used as personal data: if you don't need them, don't save them.

Stephanie Kirmer [00:15:49]: Delete those things from your data set and you will not have to worry about them anymore. And that's real; it will just reduce your risk tremendously. Make sure that when you de-identify that data, it's not reversible in any way. Make sure that it could never be returned to a state where it could be personal data, and then you can sleep well at night and everything will be great. Now, sometimes you might need to use individual data. There may be reasons why this is absolutely necessary for your use case, and that is a risk-reward trade-off situation. You need to understand the risk that you are taking when you do that, and that may be okay. And once again, these are your conversations with your colleagues, conversations about your business.
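
As one illustration of the de-identification advice above, here is a minimal sketch of dropping direct identifiers and replacing a user ID with a keyed hash. The column names and the salt handling are assumptions made for the example; note that a keyed hash is only pseudonymization, so to meet the "nothing reversible" bar you would destroy the salt, or drop the key column entirely, once you no longer need it for joins.

```python
# Hedged sketch of basic de-identification on a DataFrame.
# Column names are hypothetical; what counts as personal data in your
# jurisdiction is a question for your legal team, not this snippet.
import hashlib
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "email", "phone", "street_address"]

def deidentify(df: pd.DataFrame, secret_salt: str) -> pd.DataFrame:
    """Drop direct identifiers and replace user_id with a keyed hash."""
    out = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    # Keep a stable join key without keeping the raw identifier.
    out["user_key"] = out["user_id"].astype(str).map(
        lambda uid: hashlib.sha256((secret_salt + uid).encode()).hexdigest()
    )
    # Warning: anyone holding secret_salt can recompute this mapping, so this
    # alone is pseudonymization, not irreversible de-identification.
    return out.drop(columns=["user_id"])
```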

Stephanie Kirmer [00:16:30]: So now let's go into during and after modeling. These are the bits of advice that I recommend for this particular part. Know where your data is going. If you transmit data outside your network to tune a third-party model, know what the provider is going to be doing with that data. Consider your applications. Remember: no discrimination and no illegal activity using your model, and certain jurisdictions may have different opinions about what's illegal in these cases. And then, finally, test your model.

Stephanie Kirmer [00:16:56]: Red team it, build some sort of guardrails, so that the inference process will not accidentally reveal any scary personal data about individuals. Now, my closing thoughts. Think about it like it's your data. Think about what you would want it to be used for and how you would want consent to be handled when it's your own data or that of your family or friends. Regulation is for rights protection. It's for protecting us as individuals and protecting our privacy and our security. In the face of very large, very powerful businesses, I think that regulation is our sort of offset to the power of those businesses.
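
Circling back to the guardrails point above: one small, concrete example of that kind of check is an output filter that redacts obvious personal-data patterns before a model response leaves your service. This is a hedged sketch; the regexes are illustrative and nowhere near exhaustive, and a real guardrail effort would also involve red-team evaluation rather than pattern matching alone.

```python
# Hedged sketch of a crude inference guardrail: redact obvious PII-shaped
# strings from model output. Patterns are illustrative, not exhaustive.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace anything matching a known PII pattern with a placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text

print(redact_pii("Reach Jane at jane.doe@example.com or 555-867-5309."))
# -> "Reach Jane at [REDACTED EMAIL] or [REDACTED PHONE]."
```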

Stephanie Kirmer [00:17:37]: Think about that risk-reward situation. Like I said, you might need to use personal data; it might have to happen. But then it needs to be a valuable model that's really going to make a difference, something super, super high reward, for you to want to take that risk. And plan ahead. Do this stuff before you collect your data. Do this stuff before your model is already out the door. Then you will not have to stop halfway through, turn around, and figure out how to do it all over. So with that, thank you very much for your time.

Stephanie Kirmer [00:18:07]: And we have time for one small question. Anybody? Yeah, what's up? "I know you guys don't do HIPAA compliance, I think, but are you familiar with it?" Yeah, I worked in healthcare for a long time. So HIPAA compliance is specifically around making sure that data housed by healthcare providers is not unexpectedly revealed to the public. It's a data security sort of problem more than anything, because you're trying to make sure that unauthorized users do not have access to personal healthcare data that is collected by a healthcare provider. So, like, your doctor can't go telling his buddies about that weird growth or something like that, right? But that doesn't necessarily fall under the coverage of these data privacy laws, because those are meant to be broader: they're about how the data can be used by other kinds of organizations.

Stephanie Kirmer [00:19:08]: HIPAA compliance is a whole field of law. There are lawyers who just make that their whole business, and I don't, unfortunately, and DataGrail doesn't offer tailored solutions for that at this point. It's adjacent, but different. Yeah.
