Robustness, Detectability, and Data Privacy in AI

Vinu Sankar Sadasivan is a final-year Computer Science PhD candidate at The University of Maryland, College Park, advised by Prof. Soheil Feizi. His research focuses on Security and Privacy in AI, with a particular emphasis on AI robustness, detectability, and user privacy. Currently, Vinu is a full-time Student Researcher at Google DeepMind, working on jailbreaking multimodal AI models. Previously, Vinu was a Research Scientist intern at Meta FAIR in Paris, where he worked on AI watermarking.
Vinu is a recipient of the 2023 Kulkarni Fellowship and has earned several distinctions, including the prestigious Director's Silver Medal. He completed a Bachelor's degree in Computer Science & Engineering at IIT Gandhinagar in 2020. Prior to his PhD, Vinu gained research experience as a Junior Research Fellow in the Data Science Lab at IIT Gandhinagar and through internships at Caltech, Microsoft Research India, and IISc.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Recent rapid advancements in Artificial Intelligence (AI) have made it widely applicable across various domains, from autonomous systems to multimodal content generation. However, these models remain susceptible to significant security and safety vulnerabilities. Such weaknesses can enable attackers to jailbreak systems, allowing them to perform harmful tasks or leak sensitive information. As AI becomes increasingly integrated into critical applications like autonomous robotics and healthcare, the importance of ensuring AI safety is growing. Understanding the vulnerabilities in today’s AI systems is crucial to addressing these concerns.
Vinu Sankar Sadasivan [00:00:00]: Hi, my name is Vinu Sankar Sadasivan. I am a final year PhD student at the University of Maryland. Currently I am a full time student researcher with Google DeepMind working on AI jailbreaking. So today we'll be discussing more about the hardness of AI detection and red teaming of these AI models, especially with a focus on generative models. And yes, I generally do not drink coffee, and if I do get coffee, I go for a latte.
Demetrios [00:00:35]: What is happening? Good people of the world, welcome back to another MLOps Community podcast. I am your host, Demetrios. And today we get talking about red teaming and also how to jailbreak those models. Vinu did a whole paper and PhD on being able to identify LLM generated text. So we talked about watermarking and if it is a lost cause. Okay, so let's start with why watermarking is so difficult. And you basically told me, or you didn't say it, but I read between the lines in our conversation before we hit record, which was all of that stuff that you see where you can turn in a piece of text and it will tell you what percentage is AI generated versus not. That's kind of...
Demetrios [00:01:43]: Or what.
Vinu Sankar Sadasivan [00:01:46]: Okay, so I wouldn't say it's bullshit, but I would say it's something we should not be completely relying on. So we had this paper where we were researching different kinds of detectors. To start with, watermarking is just one kind of detector that we are looking at, and it has been a predominant, famous method; watermarking has been there forever. It has been there for images and for text for a very long time. For text, earlier we used to put, say, a spelling mistake in between text, and if the spelling mistake keeps repeating, you'd call it a watermark, or a double space, or the pattern of the spaces or the punctuation, these kinds of things could be a watermark. But now things have changed. AI is blooming, people have made new methods for watermarking. So what I'm saying is watermarking is a really good technique.
Vinu Sankar Sadasivan [00:02:48]: But the problem is when there are attackers in the setting which you're looking at, then it might not be as effective as we think it is. Okay, so in the paper we look at four different kinds of detectors, and one of the most important ones is watermarking, which also really works well. Another kind is using a trained detector, which is what I think most of the detectors out there are using right now, because language models are not yet completely watermarked, not all of them are. It's basically a classifier where you give it an input and, just like saying dog or cat, it says AI text or not AI text. And the other one is zero-shot detectors, where you don't have to train a network, but you just use a network to look at the statistics, say, look at the loss values and see if the loss value is low.
Vinu Sankar Sadasivan [00:03:43]: If it has a low loss value, it is probably AI generated text. If it has a higher loss value, it is probably human generated text, because generally AI text quality is higher, so the loss values are lower, and hence it's AI text. And retrieval is another method where you store all the AI generated text in a database and then, given a candidate text, you search whether that text is present in the database or not. So we look at all these kinds of detectors and we break them empirically in two ways. One is to make AI text look more like human text, and the other way is to make human text look more like AI text. So type 1 and type 2 errors, both. And we also show a theoretical, fundamental trade-off for these detectors, and we show that as language models get bigger and bigger, detection gets harder.
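To make the zero-shot idea concrete, here is a minimal sketch of a loss-based detector. The scoring model (GPT-2 via Hugging Face transformers) and the threshold are illustrative assumptions, not the setup used in the paper.

```python
# Minimal sketch of a zero-shot, loss-based AI-text detector.
# The threshold below is illustrative only; a real detector would
# calibrate it on held-out human and AI text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def average_loss(text: str) -> float:
    """Per-token cross-entropy of the text under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # labels=ids gives the LM loss directly
    return out.loss.item()

def looks_ai_generated(text: str, threshold: float = 3.0) -> bool:
    # Lower loss (the model finds the text "easy") -> more likely AI generated.
    return average_loss(text) < threshold
```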
Vinu Sankar Sadasivan [00:04:41]: Because if you think intuitively about it, language models are highly capable. When they get bigger, they can easily mimic the style of human writing if you give the relevant instructions to them. Yeah, or even enough data. So if I give it a longer text and say, okay, this is the way Donald Trump writes or talks, and I ask it to mimic that, with a lot of data given to it, it probably would mimic it very well. And it gets harder as language models get even bigger. So that's one of the other concerns. So watermarking is just one of the tools which we analyze in the paper, and we show it's easy to break them if an attacker really wants to.
Vinu Sankar Sadasivan [00:05:21]: So the takeaway from the paper would be: if someone really wants to attack something, there exists no foolproof technique right now, and our theory shows that there will not exist anything like that. So it's really good to have one layer of security like watermarking for now, just to tackle cases where people are directly using AI text out of an AI model. And we can detect such text very well using watermarking. But if I really want to remove it, it's easy to remove that.
Demetrios [00:05:55]: Now what are the cases where folks, A, would want to know when it is only AI generated text, and then B, when someone would want to be adversarial and not let the person know that they are only using AI generated text?
Vinu Sankar Sadasivan [00:06:13]: Yeah. Yes, that's a good question. So the case where I wouldn't want to show that it's AI text is when I'm a student trying to do plagiarism. I'm submitting my assignment, but I used ChatGPT to write my answers. In that case I wouldn't want my professor to know that I used AI to submit my assignments. That's one major scenario which people are looking at right now, because that's where a lot of the money is. So all the leading text detection tools basically focus on plagiarism, because there's a lot of money there. And the other case, where you really want to detect it, again, plagiarism would be one case.
Vinu Sankar Sadasivan [00:06:59]: It's like a min-max game here. So students want to make it look like human text, but professors want to actually detect it as AI text, if it actually was AI text.
Demetrios [00:07:08]: Yeah.
Vinu Sankar Sadasivan [00:07:09]: And the other case would be spamming, phishing, those kinds of attempts, where you might actually be talking with a chat agent who is not a human. They might be scripted to fool you into some scams. And it's easier for them to scale up these scamming attempts if they have access to AI, which is really dangerous. What if there's an AI model later on which is so natural, it converts text to speech, and they're basically simulating a call center, talking to multiple people in parallel, trying to scam them and make a lot of money out of it? So here they would really want to make AI text or AI speech or whatever modality it is look more human-like, so that humans don't detect it, but they still get to do whatever they want to do, the adversarial objectives they have, without getting caught.
Demetrios [00:08:11]: Now you mentioned there were ways that you broke these different settings or the detectors. And one way was making AI generated text look more human. Was that just through the prompt?
Vinu Sankar Sadasivan [00:08:25]: No. So that is one of the methods which you can actually do. But as we discussed before the conversation, with the evolution of these systems it has become harder to do that. So now I think the models are well fine-tuned to somehow imprint the AI signatures better, so it's harder to actually use input prompts to just change or affect the detection. I was recently trying with all these tools, Gemini, ChatGPT and Anthropic's Claude, to see how you can give prompts to make them look like less AI text. I can't really make a study out of it because it's very hard to do manual prompting on thousands of texts myself and then do something.
Vinu Sankar Sadasivan [00:09:19]: But what I figured out, commonly for most of these AI models, was that if I give it prompts like convert passive text to active text, use simpler sentences, don't use longer sentences, avoid 'and'/'or' and heavy punctuation, and things like that, the features which models generally use to write very high quality text, and make them try to be a little lower quality, using less rich vocabulary like a human would, fewer punctuation marks, shorter sentences and things like that, I get to make them break sometimes, but it's hard. And I've been noticing this because I try these out with some gap in the timeline. So every few months I try this out, and I find that it's getting harder to do. So the way we did it in our paper, which aligns with the theory of our paper, is basically to use a paraphraser. So basically, paraphrasing is where you rewrite a text.
Demetrios [00:10:37]: Okay, yeah.
Vinu Sankar Sadasivan [00:10:38]: Yes. And why we did that was to get our attack, the empirical attack, to be in line with our theory. One other reason is that we also wanted to attack watermarking, and one of the attack methods which the first AI text watermarking paper showed was to change words, just replace words with other words. That's the first naive attack you would think of. So given a candidate AI text, can you do minimal editing to it, changing some words to synonyms or changing some of the function words, adding words like 'and' or 'or' and things like that, making minimal edits in terms of edit distance, and see if I can break it or not?
Demetrios [00:11:31]: But is that basically like you saying Control-F for all the instances where 'delve' is in there, and then you replace all of the 'delve's with another word? Okay, yes. And then it passes with flying colors.
Vinu Sankar Sadasivan [00:11:50]: So that would be the case for some other detectors, but it's not the case with watermarking.
Demetrios [00:11:54]: So.
Vinu Sankar Sadasivan [00:11:55]: So watermarking is quite robust to it. If you really want to attack watermarking by just changing words, you might have to change almost 50% of the words in place, which is going to be a very hard task. Right? Because you can't change a lot of the words in place. Even given a sentence with 10 words, you can hardly change two or three words in place without affecting the quality or the meaning of the sentence. So that's where paraphrasing is really important. Basically, given a sentence, I can completely change the structure of it. I can turn an active voice sentence into a passive voice one. So it completely flips, changes the structure.
Vinu Sankar Sadasivan [00:12:32]: I can join different sentences with 'and's and 'or's and things like that using a paraphraser, which was not the case if I simply changed different tokens or words in the passage. So I think to explain the attack well, it's important to go into how watermarking works.
Demetrios [00:12:52]: Yeah.
Vinu Sankar Sadasivan [00:12:52]: The way the first text watermarking paper did it is a simple algorithm, but it works really well. So you're writing a passage. Think of it this way. You start writing a passage or an essay about a dog. So you say, the dog was playing. This is how a human would write a text.
Vinu Sankar Sadasivan [00:13:18]: You wouldn't think exactly how to pick the words. You just want to pick the words so that it makes sense. But how the watermarked AI would think is: okay, I started with the word 'the'. Now, for the next word, I can only pick words from 25,000 of the 50,000 words I have access to in my vocabulary. So suppose I have 50,000 words in my vocabulary. The AI would partition that into two halves. They call one half the red list and the other half the green list. Now, the AI would focus on always picking the word from the green list.
Vinu Sankar Sadasivan [00:14:00]: So eventually, when it writes word by word, it would try to make most of the words in the passage come from the green list. A human does not know what the red list and green list are, so he or she might end up taking almost 50% of the words from the red list and 50% from the green list. Right. And this watermarked AI model will end up producing a passage which mostly has green list words. So this is the watermark. The AI model knows: given a passage, if it has 90% green words and 10% red words in it, it's very highly likely to be AI text.
Vinu Sankar Sadasivan [00:14:44]: But given a human text, okay, it's around 50% red words and 50% green words. Now, there are more advanced versions of this which make it better in terms of quality. What they do is, instead of having a hard partitioning of red list and green list. Because sometimes when I write 'Barack Obama', the word 'Obama' might be in the red list, but I want it to be in the green list because I wrote the word 'Barack' before. I want 'Obama' to come after 'Barack'. So what they do is make this red list and green list partitioning dynamic. At every step, when I choose a new word, my red list and green list will keep changing.
Demetrios [00:15:25]: Damn. All right.
Vinu Sankar Sadasivan [00:15:26]: It would change based on the previous word which was written. So if I had written 'Barack', for the next word the green list and red list will be determined by that word 'Barack'. It will be modeled in such a way that 'Obama' will mostly be in the green list, something like that.
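A toy sketch of the red-list/green-list scheme described above, in the spirit of the first text watermarking paper. The vocabulary, seeding, bias strength, and detection logic here are illustrative assumptions, not the published algorithm.

```python
# Toy sketch of a red-list/green-list watermark: the previous token seeds a
# pseudorandom split of the vocabulary, generation is biased toward the green
# half, and detection counts green tokens. All constants are illustrative.
import random

VOCAB = [f"word{i}" for i in range(50_000)]  # stand-in for a real tokenizer vocab

def green_list(prev_token: str, fraction: float = 0.5) -> set[str]:
    rng = random.Random(prev_token)            # previous token seeds the split
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def watermarked_choice(prev_token: str, candidates: dict[str, float], bias: float = 4.0) -> str:
    """Pick the next token, boosting the scores of green-list tokens."""
    green = green_list(prev_token)
    scored = {tok: score + (bias if tok in green else 0.0) for tok, score in candidates.items()}
    return max(scored, key=scored.get)

def green_fraction(tokens: list[str]) -> float:
    """Detection: fraction of tokens in the green list of their predecessor."""
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:]) if tok in green_list(prev))
    return hits / max(len(tokens) - 1, 1)

# A passage with roughly 50% green tokens looks human-written; a passage with a
# much higher green fraction looks watermarked.
```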
Demetrios [00:15:42]: So this begs the question, all of the model providers or all of the model creators need to have this inside of their models for it to be valuable. Right? And you also kind of all have to be on the same page on how you're doing it. So that then when you have some kind of watermarking detector, it can detect if it's AI generated content. I guess you could take all the different ones. Each flavor of model has its own flavor of watermarking. And then the detector could just have all of these different ways that it can detect it. But you need to start at the model level. And if the model providers aren't doing this, then you can't detect it.
Vinu Sankar Sadasivan [00:16:34]: That's a really great observation. That's one of the major limitations which watermarking models have right now. I'm glad you pointed it out. If OpenAI has its model watermarked and Gemini does not, or vice versa, it does not make any sense, because an attacker who wants to use a model would go for the non-watermarked AI model which exists out there. And the other main concern, which I don't know if anyone is talking about right now, is that we already have open source models released which are not watermarked. I have Llama 3.2, the larger models; I can always download them and keep them on my hard disk. They're pretty good at doing whatever I want right now for scamming and things like that, or even plagiarism. I can always use them for writing AI text, at least at what we think is good quality right now, at least at par with humans.
Vinu Sankar Sadasivan [00:17:37]: So it's also crazy that we are still trying to do watermarking. Okay, I understand in the future it might be more powerful, watermarking might be something which we want, but to some extent we have done the damage already. Someone who wants to do damage in the future could still do a lot of automation with the open source models which are already out there, which are not watermarked. And also, if all these AI companies come up with different watermarking schemes, it might be harder to do detection in the future, like you mentioned. So they always have to be on the same page, with someone regulating them on how to watermark and how to do detection and things like that. Because, say, in the future there are 10,000 AI companies and you don't know what is coming from where, you might be spending a lot of compute on detecting which text came from where and looking at the provenance of the AI text, which might be hard.
Vinu Sankar Sadasivan [00:18:33]: Yeah, so there are like technical limitations to this problem which we haven't addressed yet.
Demetrios [00:18:41]: Yeah, the genie's already out of the bottle. So what are you going to do on that? It does feel like, especially with the SEO generated content from AI models that you can see. I don't know if Google does or doesn't punish you if you throw up a bunch of different blog posts at one time and you're just churning out AI generated content. I think I read somewhere on their SEO update that if it is valuable to the end user, then you're not going to take a hit on your SEO score. But you have to imagine that there is a world where they're looking and they're seeing, hey, this is 90% AI generated. If they can figure that out, they would want to. But right now it's almost like, yeah, maybe they can figure it out a little bit. As you're saying that if they watermarked things and if everyone was on the same page with watermarking, then it would be useful.
Demetrios [00:19:49]: Or we could potentially see that. But at this point in time that's not happening.
Vinu Sankar Sadasivan [00:19:54]: Yeah. Yeah, that's true.
Demetrios [00:19:57]: So then, what else about watermarking before we move on to red teaming? My other favorite topic.
Vinu Sankar Sadasivan [00:20:02]: Yeah, so I think I was coming to the attack which we were doing in the paper, to remove watermarks.
Demetrios [00:20:08]: Yes.
Vinu Sankar Sadasivan [00:20:09]: So the really important thing with the text watermarks that exist right now is this dependence on the previous word which was sampled. So if I had sampled 'Barack', I want to sample 'Obama' next. I might have a random red list or green list for the next word, but I would increase the probability of sampling the word 'Obama' so that it is sampled. That's one thing, so that the text quality is not affected. And the other thing is that the green list and red list are now dynamic, so they change with every word, because the previous word is basically the seed to the random number generator for partitioning the vocabulary into the red list and green list. So the way watermarking works is basically this.
Vinu Sankar Sadasivan [00:20:56]: And if you want to attack it, just think of it: if I change one word in the middle, it might affect the red list and green list of the next word, right? But the problem is, if I change just one word, it only changes the red list and green list for the next word. So if I need to really change the number of green words in it, I have to essentially make a lot of edits. But what if I rearrange these words? Then the ordering is completely disturbed and the red lists and green lists are completely random now. So that's what happens when you rewrite a sentence. You generally don't try to preserve the exact ordering of the words; you write it so that it's essentially rewritten. So if I have a longer sentence, I can probably swap sentences A and B, I could say B and then A. Even within A,
Vinu Sankar Sadasivan [00:21:52]: I could change the grammar maybe, or I could change active voice to passive voice and so on, or even use synonyms and things like that. So the attack which we look at is using an AI model itself to paraphrase the AI text. Ideally you would want the AI paraphraser's output also detected as AI text, because it is again AI, not a human. But what we observed was that if you use a paraphraser model to paraphrase the AI text, the output you get is mostly detected as human text by a lot of these detection techniques. One of the most robust techniques is watermarking; it is still slightly robust to paraphrasing, because the current paraphrasers are not trained to do this. If we give really good manual prompts to the paraphraser, we could actually do it in one shot. But what we ended up showing was something called recursive paraphrasing, where the AI text which you paraphrased once is again given back to the paraphraser to paraphrase once more.
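A rough sketch of the recursive paraphrasing loop. The `paraphrase` and `detector_flags_as_ai` callables are placeholders for whatever off-the-shelf paraphraser and detector are available; they are assumptions, not the specific models from the paper.

```python
# Sketch of the recursive-paraphrasing attack: keep feeding the text back
# through a paraphraser until a detector stops flagging it, or a round budget
# runs out. `paraphrase` and `detector_flags_as_ai` are placeholder callables.

def recursive_paraphrase(text: str, paraphrase, detector_flags_as_ai, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        text = paraphrase(text)             # rewrite the whole passage
        if not detector_flags_as_ai(text):  # stop once the watermark signal is diluted
            break
    return text
```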
Vinu Sankar Sadasivan [00:22:59]: So it will be paraphrased twice, and then you can keep doing this multiple times, depending on the strength of the watermarking. We find that just after two rounds of recursive paraphrasing, the watermarking scheme's accuracy goes below 50%. So we see that two rounds of paraphrasing is enough to break the watermarking algorithm. And this is what we essentially show in our theory too. The theory goes like this. Suppose you have a distribution of text, which is basically AI text, plus human text, which is a subset of that, right? So given a passage A, I can also look at another set B, which is essentially all the passages similar to A in meaning. Even if I take something from B, I'm okay to replace A with that, right? But here is the problem for watermarking. In the set B, suppose there are a hundred passages which I could replace the passage A with.
Vinu Sankar Sadasivan [00:24:11]: For a watermarking algorithm, I can't say that out of the hundred, fifty of them are watermarked, because if that's the case, it's likely that a human writing a passage would land on the watermarked label 50% of the time. So I have to make sure that the likelihood of a human writing a passage with similar meaning and it being labeled watermarked is low. For that, out of the hundred, I have to say only one or two of the passages are labeled as watermarked and the others are not. Still, the false positive rate, which is a human writing a text and it being detected as AI text, is 1% or 2%, which is quite high. But okay, let's be lenient on that and just say one or two of the 100 texts are labeled as AI watermarked text. But now the attack is easy. I have the first text, labeled as watermarked. What if I use a paraphraser which is really good, and I hop from the first text to one of the hundred texts at random? Very likely I'll land on a text which is not watermarked.
Vinu Sankar Sadasivan [00:25:15]: Because the watermarking was designed in such a way, it's likely I land on another text which is not watermarked. So if you look at it, this is a trade-off. If I try to increase the watermarking's strength against paraphrasing, I have to increase this number from 1 out of 100 labeled as watermarked to, say, 10 out of 100 or 50 out of 100 labeled as watermarked. But if I do that, I'll end up having a human falsely accused of plagiarism with a higher chance. So it's essentially a trade-off between type 1 and type 2 error if we use these kinds of detection systems. This is what our theory shows. Our theory says that even for the best detector that can exist out there, we upper bound the detection performance using the distance between the distributions, which is jargon and we don't need to go into that. But essentially, for the best detector out there, and we are not claiming that's watermarking, even something better than watermarking, the best detector that can theoretically exist is upper bounded by a quantity which we characterize in our paper, which is still not 100% reliable.
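For reference, the bound in the paper has roughly this form, where TV is the total variation distance between the AI model's text distribution M and the human text distribution H (notation simplified here):

```latex
% Upper bound on the area under the ROC curve (AUROC) of any detector D,
% in terms of the total variation distance between the model distribution M
% and the human distribution H:
\mathrm{AUROC}(D) \;\le\; \frac{1}{2} + \mathrm{TV}(\mathcal{M}, \mathcal{H}) - \frac{\mathrm{TV}(\mathcal{M}, \mathcal{H})^2}{2}
```

As the model distribution approaches the human distribution, TV goes to zero and the bound approaches 1/2, no better than a coin flip, which is the "detection gets harder as models get better" point made above.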
Vinu Sankar Sadasivan [00:26:28]: And the performance of that still has a trade-off with respect to the true positive rate and the false positive rate, which are the type 1 and type 2 errors. So you have to go down on one of the errors to make it better on the other. That's the major highlight, the major result which we show in our paper. Following up on that, paraphrasing is really considered a good attack right now, and a lot of the leading text detection tools like Turnitin have been using methods to deal with it. What they are doing, as they mentioned in a recent blog post, is to use a paraphrase detection tool which tells you if an AI text was paraphrased or not. If it was paraphrased, you can still say it's AI text. And the problem is, if you do that, you end up hurting your false positive rates.
Demetrios [00:27:21]: Yeah, well, yeah, that's exactly what it sounds like is that you have this upper and lower bound or the right and the left side of the spectrum. And the more that you go onto the right side of the spectrum, then you're going to get these false positives and you can't really win no matter what you try and do.
Vinu Sankar Sadasivan [00:27:41]: Exactly. So if more money is in not flagging innocent kids as plagiarizing, okay, you can choose to make money like that. But if more money is in saying, okay, as an AI detection tool I caught more students plagiarizing, maybe that's more efficient for them. But it can actually end up giving them a bad reputation, falsely accusing students.
Demetrios [00:28:08]: So.
Vinu Sankar Sadasivan [00:28:08]: So it's a choice they need to make. But the real question is, do we really want to use this for such strict plagiarism detection? Because I think as we go ahead and these technologies come up, we have to find a way to use them collaboratively in our work, because they improve our productivity instead of replacing us. I believe they improve our productivity, and we have to learn to use them as a tool instead of focusing completely on plagiarism.
Demetrios [00:28:43]: So yeah, you were talking about that for a second, you were mentioning how you feel like it's becoming more and more difficult to make these models write more like humans, and they're almost being trained, or red teamed, to not write like a human and to have their own distinct AI voice.
Vinu Sankar Sadasivan [00:29:10]: Yeah, yeah. So this is just speculation, I'm not sure what actually goes into the training of these models. But from what it looks like, I think there have been recent steps taken by these AI giants to make the models easier to detect. It could be either that, or it could be the AI detection companies doing a better job at making their detectors. It could be either one of them. But I also feel that at some point, when watermarking was introduced, some of these models' text quality actually went down a little bit. While I was looking at some of the comments on Twitter, I saw people saying, is it just me, or is everyone else finding that ChatGPT's text quality has gone down a little bit?
Vinu Sankar Sadasivan [00:30:05]: I'm not sure if it's because of these kinds of trainings added on top of it. It could be that, or it could be some other safety alignments which they have added on top of the model which actually take a toll on the performance.
Demetrios [00:30:18]: Yeah, it's a side effect.
Vinu Sankar Sadasivan [00:30:19]: It's the side effect of it. So there's a trade-off in everything. If you try to make detection better, you have to trade off on the text quality or on the type 1 or type 2 errors of detection and things like that. But yeah, my speculation is that models could have been fine-tuned to make detection more possible, because recently governments have been pushing these tech giants to have watermarking embedded in them. DeepMind recently released their watermarking paper in Nature, and Meta has already been in the game for image watermarking and things like that. OpenAI, I'm not sure what their scenario is, but I think all these companies are being pushed to have watermarking, so potentially they're doing something which helps detection. But these are some of the trends which I've been noticing, which again is speculation.
Vinu Sankar Sadasivan [00:31:22]: But yeah, I think one common feature with all these models has been that they have been trained to have really good text quality, so that could be another side effect of it. They try to use very poetic words and poetic devices, ornamental words and things like that, to make text look very nice. Earlier, in my experience, it was very easy to change that with some instructions, and I think that has changed because of the training these models have gone through to make detection far easier. Recently, if I say, okay, make it sound like how Donald Trump talks, the text which comes from the model is still detected as AI text, which was not the case when I had checked a year or so back. So, yeah, potentially it's either the models being trained to do that, or the detection tools improving with time.
Demetrios [00:32:21]: Yeah. Okay, so let's take a turn for red teaming and just give me the lay of the land on how red teaming has changed over the years because of all of the models getting better. I think there's a whole lot more people that are red teaming models, whether they are getting paid to red team them or not. Everybody loves to think about or loves to be able to say, I got ChatGPT, or insert your favorite model, to say this or to do this. It's almost like a badge of honor that we can wear on the Internet. So what have you seen over the years and how has it differed?
Vinu Sankar Sadasivan [00:33:01]: Yeah, so I think it's a very recent advancement which we've been seeing in AI, people trying to jailbreak, trying to do red teaming. But I'd like to see more of it. It started off with people doing it manually, and then it went to automated things. The effort which people have been putting in is almost similar, because it's just manual attempts, trial and error. Some insights you get from the model, the feedback you get, you put back into the prompt. It's an iterative process. So it started off like that. It was called Do Anything Now, DAN.
Vinu Sankar Sadasivan [00:33:38]: So there was a page where people were compiling different ways to jailbreak these models. You could either write manual system prompts or manual input prompts to make the model think that you are innocent, that you are not going to do something harmful. One of the classic examples is where you have to make the model say how to make a bomb, which for some reason you can find on Google, but you don't want your model to output it. But.
Demetrios [00:34:08]: Oh, I never thought about that.
Vinu Sankar Sadasivan [00:34:10]: Yeah, but I mean, sure, it's still a good objective to keep in mind. You can just take it as a toy example for now. We don't want the model to talk about something. Okay, it's totally cool, we just don't want it to talk about it. And how do we do that? Because 'bomb' can come up in different contexts, right? We can't just use a word filtering algorithm. A bomb could be something explosive.
Vinu Sankar Sadasivan [00:34:34]: Or I could also say, okay, that was the bomb, maybe to say that was a fantastic thing or something like that. I don't know in what context you would be using it unless I understand what it is. So I can't just use a string matching algorithm that says, okay, if there is 'bomb' in it, I don't reply to that. That's not going to work. Even the word 'screw': if I say I'm going to screw that in, it might be okay, but if I'm saying I'm going to screw you up, then it could be something offensive.
Vinu Sankar Sadasivan [00:35:02]: So there are things which come with context. It's not easy to do word filtering in most of these cases. And if you have to make a blacklist of these words, it's going to be really large, and you can't end up doing that, especially when these models are getting multilingual now; it's hard to maintain a list of words which you want to filter on. But the thing which people used to do was to write manually crafted prompts where they say, my grandmother is sick and I have to make a magical potion for her, or say it's for my school project and I really have to get a good grade. It's funny, it's exactly how you would try to fool a human to get them to answer something.
Vinu Sankar Sadasivan [00:35:47]: Working on the sympathy aspects, the emotional aspects of it. And since the models are human aligned, somehow it ends up breaking them, because that's how they were trained: they were safety aligned with human values. So it is probably expected that they break to these methods where a human would break, an average human or a below average human, based on the way in which they were trained. So that was the initial evolution of red teaming, where it started, where people showed, okay, there are these manual techniques where I can write prompts to break them, which started off as a prompt engineering trick.
Vinu Sankar Sadasivan [00:36:28]: People initially used prompt engineering to improve model performance, but now they started using it to break models. And now it's more like an iterative process. You give it a question with a prompt, it does not answer it, and you refine the prompt. But you essentially get not much signal: it ends up saying, I can't answer your question, which is really not a good signal for you. So you end up doing trial and error to reiterate and refine your input prompt such that the model somehow breaks in some later iteration of your attack. But from there it has advanced a lot more.
Vinu Sankar Sadasivan [00:37:07]: People have come to a point where it is more automated now, which is more dangerous, because then the attacks get scalable and do much more harm than what they could do with manual system prompting. So what attackers have been doing after that: there was this paper last year from Andy Zou, where they introduced an algorithm called GCG, which is essentially a gradient-based algorithm. What they do is, if I have a question like 'how to make a bomb', I add 20 random tokens after that and then use gradients to optimize them, basically the suffix tokens, which make no sense to us. But when added as a suffix to the input 'how to make a bomb', the model just breaks. Yeah, exactly. It doesn't make any sense to us, but the model somehow interprets it and breaks.
Vinu Sankar Sadasivan [00:38:12]: So this is something which was really expected if you come from the machine learning adversarial literature, because machine learning models, which are essentially neural networks, can be broken with some input perturbations. This is a well known strategy in the adversarial literature. But the thing was, it was hard for language models, because the way they work is very different from the traditional machine learning models which we have used. In traditional machine learning, people were mostly focused on computer vision and continuous data, where you have, say, images, which is a different kind of data: it's just pixels with values between 0 and 255, or between minus 1 and 1. It's continuous data. But when you come to text, it's discrete data, just, say, 50,000 tokens. So you just say this token is 255, or maybe 50,000, or something like that.
Vinu Sankar Sadasivan [00:39:07]: So they're discrete. The thing is, when you take gradients of this with respect to the input, since the input is discrete, it's really hard, because the gradients are not defined at all the points. So the attacks were not effective most of the time when we tried to attack language models. But this recent work did multiple tricks, multiple hacks, to make that work out. And this is the end result: you add a random jargon suffix to the end of the question, and it will.
Demetrios [00:39:38]: Break and it's just random words that if so like I can have my prompt and then I can just think of random words. Or is there specific words that you throw on to the end or just specific letters?
Vinu Sankar Sadasivan [00:39:54]: Yes, it looks random to us, but to the model it does not. They are specific words according to the model, just optimized to make it break.
Demetrios [00:40:02]: But how does the attacker know what the optimized words are?
Vinu Sankar Sadasivan [00:40:07]: Yes. So one way is, if you have direct access to the model, you can take gradients of the model and update the words. The other method which they show is something called transferability, which is again a known phenomenon in the adversarial literature. What you do is, you have access to, say, three or four open source models, and you find a suffix which breaks all of them. Now, if you use this suffix on a new model which you never had access to, it might end up breaking too. So that's transferability, and they show that these suffixes can be transferred to a new model. You can use Llama to train the suffixes and use them to attack ChatGPT.
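A simplified sketch of the adversarial-suffix idea. GCG itself uses token gradients to propose candidate substitutions; this version uses plain random search over one suffix position at a time, and `target_loss` is a placeholder for the attacked model's loss on a desired affirmative response.

```python
# Simplified random-search sketch of the adversarial-suffix idea. GCG proper
# guides the search with gradients; here we just mutate one suffix token at a
# time and keep changes that lower the loss of a target "compliant" response.
# `target_loss(prompt)` is a placeholder, not a real API.
import random

def optimize_suffix(question: str, vocab: list[str], target_loss,
                    suffix_len: int = 20, steps: int = 500) -> str:
    suffix = [random.choice(vocab) for _ in range(suffix_len)]
    best = target_loss(question + " " + " ".join(suffix))
    for _ in range(steps):
        i = random.randrange(suffix_len)
        candidate = suffix[:]                  # mutate one suffix position
        candidate[i] = random.choice(vocab)
        loss = target_loss(question + " " + " ".join(candidate))
        if loss < best:                        # keep the change if it helps
            suffix, best = candidate, loss
    return " ".join(suffix)
```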
Demetrios [00:40:50]: And so it doesn't matter, it's not about the training of it, it's really about the model architecture.
Vinu Sankar Sadasivan [00:40:57]: Exactly. So one thing is probably that most of these models are transformer based. The other thing is the data: most of these models end up using very similar kinds of data, the structure is the same, so the way you can break them is probably similar. That's why I think transferability works, because in a lot of the cases which I have looked at in language models, transferability works quite well. I think it's because of the underlying similar text which was used for training.
Demetrios [00:41:27]: And so this is one vulnerability, we could say. How are the model providers combating it?
Vinu Sankar Sadasivan [00:41:36]: Yes. So one thing is, again, to use automated red teaming to combat this. That's one way. Okay, so let's go again in a sequential way through the evolution of the different systems too. The particular method which I just discussed suggests adding random jargon, right? So if you look at it, the quality of the text goes bad. So one easy way to defend is: I look at the input prompt's text quality, and if it's really bad, I just don't answer it.
Vinu Sankar Sadasivan [00:42:08]: I just say, I don't understand your question. That's one way. Which is why later attacks came to improve the readability of the suffix: they actually make text which is readable and do similar attacks, so the models can't really detect them based on the quality of the text. So what the models end up doing there, one thing is chain of thought. The other is to have something like Llama Guard, where they use a copy of the Llama model and train it like a classifier. It is trained to see if the input is harmful or not.
Vinu Sankar Sadasivan [00:42:46]: Right. So a pre-trained Llama model is taken and it is trained to just do a classification task, where it is given a set of harmful prompts and a set of not harmful prompts and made to give a label, 0 or 1. If 1, it's harmful; if 0, it's not harmful. And you can also add to the training data set of this classifier the kinds of adversarial prompts which we designed earlier, to make it robust to these kinds of attacks. But again, if you use AI to defend AI and one of the AIs in your pipeline breaks, all of them break. Yeah.
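A sketch of the guard-model pattern described here, with a generic binary classifier standing in for something like Llama Guard. The label convention and refusal message are illustrative assumptions.

```python
# Sketch of the guard-model pattern: a separate safety classifier screens the
# user prompt (and optionally the response) before the main model answers.
# `safety_classifier` is a generic stand-in (1 = harmful is an assumption).

def guarded_generate(prompt: str, safety_classifier, generate) -> str:
    if safety_classifier(prompt) == 1:          # input-side check
        return "Sorry, I can't help with that."
    response = generate(prompt)
    if safety_classifier(response) == 1:        # optional output-side check
        return "Sorry, I can't help with that."
    return response
```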
Demetrios [00:43:20]: And because this is all on the input, right? There's none that are happening on the output? Or is there also another filter that's happening on the output, in case it gets through the input catch-all?
Vinu Sankar Sadasivan [00:43:34]: So there are multiple ways these systems work. Some work just on inputs, because if you want to really save on compute time, it's better to look at the input. But if you have the capacity to look at both inputs and outputs, that's the best method. So Llama Guard ends up looking at the output and the input, if it has the capability of doing that. And some methods even look at the internal activations of the model. There are detectors which are trained so that, if I have a layer of the transformer and I look at the activations of its different neurons, and, say, a certain set of these neurons activates more, it's probably a harmful prompt. So there are even methods which look at the internal activations of these models to see if it's harmful text or not.
Demetrios [00:44:23]: So it's like basically the dark side of latent space. They can know where that is, the back alleys. And they can say like, hey, if you're traveling in these back alleys of the latent space, you're probably up to no good.
Vinu Sankar Sadasivan [00:44:35]: Exactly. So they call it circuit breaking, where you try to make the model understand whether its activations are heading toward the dark side, and then just break it off right there and stop the generation. That's why sometimes you have seen the models just stop generating some things: they're probably uncertain about it, or the circuit is just broken so that they don't continue, because it's probably going toward something harmful.
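A sketch of the activation-monitoring idea: a small linear probe reads a hidden-layer activation and halts generation when it looks harmful. The layer choice, probe weights, and threshold are illustrative assumptions, not any deployed circuit-breaking system.

```python
# Sketch of monitoring internal activations: a linear probe reads a hidden
# state and predicts whether generation is heading somewhere harmful; if so,
# we stop ("break the circuit"). All constants here are illustrative.
import numpy as np

class ActivationProbe:
    def __init__(self, weights: np.ndarray, bias: float, threshold: float = 0.5):
        self.w, self.b, self.threshold = weights, bias, threshold

    def is_harmful(self, hidden_state: np.ndarray) -> bool:
        score = 1.0 / (1.0 + np.exp(-(self.w @ hidden_state + self.b)))  # logistic probe
        return score > self.threshold

def generate_with_circuit_breaker(step_fn, probe: ActivationProbe, max_tokens: int = 256) -> str:
    """step_fn() is a placeholder that yields (next_token, hidden_state) per step."""
    out = []
    for _ in range(max_tokens):
        token, hidden = step_fn()
        if probe.is_harmful(hidden):   # activations drifting toward the "dark side"
            break                      # stop generation right here
        out.append(token)
    return "".join(out)
```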
Demetrios [00:45:04]: Wow. And have you seen, because I know you're doing a bunch of red teaming right now for DeepMind, right? And I can only imagine that you've been having fun with all of the models, not just DeepMind's. Have you seen different ways that certain models are stronger and certain models are weaker?
Vinu Sankar Sadasivan [00:45:27]: So red teaming to me is a somewhat broad term, because in my last work, which was published at ICML, we look at a red teaming algorithm, an automated algorithm to make, again, readable suffixes which break the model, but in a fast way. The GCG algorithm which I mentioned takes something like 70 minutes to optimize the suffix, which is really long. Our attack would take about one GPU minute to attack the model, and that's what was published at ICML. In that particular paper we just propose the algorithm, but the capability of this algorithm is tremendous, because it's fast, and from an academic setting back then with fewer GPUs we could try out different attacks. One is jailbreaking, which is one kind of red teaming that exists. The other we do is something called a hallucination attack, where we change the prompt such that the models end up hallucinating much more. And the third one is something called a privacy attack, where we attack the prompts such that the existing privacy attacks' performance is boosted. For example, there's something called a membership inference attack, where, say, I give a quoted text passage from Harry Potter and ask if it was part of your training data or not.
Vinu Sankar Sadasivan [00:46:56]: So there were attacks to do that well before. But, let me guess, it was? So yeah, we can't really give a guarantee on that right now, but most definitely these models used Harry Potter for training. They actually sometimes generate verbatim exact text from Harry Potter.
Demetrios [00:47:19]: Yeah. Oh, that's classic.
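A sketch of a simple loss-based membership inference test in the spirit of what Vinu describes: if the model's loss on a candidate passage is much lower than on comparable text it has not seen, the passage was plausibly memorized. The `avg_loss` scorer and the margin are illustrative assumptions.

```python
# Sketch of a loss-based membership-inference test: compare the model's
# per-token loss on a candidate passage (e.g. a Harry Potter excerpt) against
# its loss on reference passages the model has presumably not seen.
# `avg_loss` is a scoring function like the average_loss sketch earlier;
# the margin is an illustrative assumption, not a calibrated statistic.

def likely_in_training_data(candidate: str, references: list[str], avg_loss,
                            margin: float = 0.5) -> bool:
    ref_losses = [avg_loss(r) for r in references]
    baseline = sum(ref_losses) / len(ref_losses)
    # Much lower loss than comparable unseen text suggests memorization.
    return avg_loss(candidate) < baseline - margin
```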
Vinu Sankar Sadasivan [00:47:22]: Yeah. So what we did in the third attack, the privacy attack, was attack the input such that the performance of these privacy attacks improves. So there are different kinds of attacks which you can think about, and people have mostly been looking at jailbreaking for red teaming, but we have noticed it really depends upon your training. Llama is one really good open source model which I know is good at resisting jailbreaks. When you compare it to other models, like Mistral's models or Vicuna and models like these, we find that Llama is quite robust to jailbreaking attempts. So are ChatGPT, Claude and Gemini, but Claude does really well in terms of defense most of the time. They have a lot of safety filters and chain of thought going on in the background, which I understand is why they are very good at it.
Vinu Sankar Sadasivan [00:48:27]: Because they're mainly an AI safety research organization, so they are focused more on safety. I'm assuming they have put more compute into making their models better at safety. But coming back to open source models, Llama has been really good at resisting jailbreaking. We found, though, that it's easier to jailbreak Llama when you ask it to generate fake news. Llama has a vulnerability there: it's easier to make it generate fake news when compared to other models. And one thing to note here is that this is just jailbreaking attacks, and all these models have been fine-tuned to be resistant to jailbreaking attacks.
Vinu Sankar Sadasivan [00:49:07]: But when we move to hallucination attacks, we find that Llama is equally breakable compared to all the other models. The hallucination attack is essentially that we add a suffix and the model ends up saying things like: okay, eating watermelon seeds is dangerous for you, you might even end up dying from eating a watermelon seed; or, you might end up dying if you walk into a wardrobe. Things like that. We have actual examples where Llama ends up doing this after our attack. And it's crazy how it works, because it really depends upon how you fine-tuned your model.
Vinu Sankar Sadasivan [00:49:39]: If you forgot to fine-tune your model to be robust to hallucination attacks, your chances are gone. Once you deploy the model, it's out there, and people can attack it to make it hallucinate more and spread misinformation. So it really depends on how your training is done. When comparing these open source models to the production models, the closed models which are already out there and not open sourced, I think those have been extensively fine-tuned to be robust to these kinds of techniques. Also, in an academic setting at least, we actually let them know this attack exists and that we are going to publish it, so they get time to adapt to it if it's a really important attack. And these companies have these programs, you were earlier asking if people are paid to attack or not, where they have a bounty program: you red team them, find vulnerabilities and report them, and the model is actually trained to be better on that. So they are doing good red teaming research to be well ahead in the game before someone outside their organization breaks it.
Vinu Sankar Sadasivan [00:50:51]: So I think, like how the open source community grew, if the red teaming community grows very large and it outgrows the teams that the companies have, it might be harder for companies to keep up with the red teaming approaches that will exist. But we could make it a little harder for the attackers by adding the kinds of defenses which these companies use right now. But again, it's the same problem which existed in detection. I believe it's not easy to have a complete solution for jailbreaking, because if you look at it, the definition of jailbreaking itself is not fully clear to us. What are the kinds of questions the model is not supposed to answer? If we do not know that, we don't know how to train the model for it. That's the fundamental problem which we are looking at. If we can't define the problem, how do we find a solution which is well defined? We need to define the problem first, which is very ambiguous, because the context changes and the scope of harmful questions changes and things like that.