MLOps Community

Reliable Hallucination Detection in Large Language Models

Posted Mar 04, 2024 | Views 373
# Hallucinations
# LLMs
# Intuit
SPEAKERS
Jiaxin Zhang
AI Staff Research Scientist @ Intuit

Jiaxin Zhang is a Staff Research Scientist at Intuit AI Research. He is passionate about building AGI capabilities for solving complex real-world tasks ranging from CV to NLP. His research interests span multiple areas, including large language models, reliable and robust AI, uncertainty quantification, and optimization. He has published several papers in top-tier AI conferences, including NeurIPS, CVPR, EMNLP, AAAI, AISTATS, UAI, etc. Prior to joining Intuit, he was a research staff member at Oak Ridge National Laboratory, US Department of Energy, where he won the "Promising Early-Career Researcher Award". He received his Ph.D. from Johns Hopkins University.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, was due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

SUMMARY

Hallucination detection is a critical step toward understanding the trustworthiness of modern language models (LMs). To achieve this goal, we re-examine existing detection approaches based on the self-consistency of LMs and uncover two types of hallucinations resulting from 1) question-level and 2) model-level, which cannot be effectively identified through self-consistency check alone. Building upon this discovery, we propose a novel sampling-based method, i.e., semantic-aware cross-check consistency (SAC3) that expands on the principle of self-consistency checking. Our SAC3 approach incorporates additional mechanisms to detect both question-level and model-level hallucinations by leveraging advances including semantically equivalent question perturbation and cross-model response consistency checking. Through extensive and systematic empirical analysis, we demonstrate that SAC3 outperforms the state of the art in detecting both non-factual and factual statements across multiple question-answering and open-domain generation benchmarks.
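
To make the approach concrete, here is a minimal sketch of the sampling-and-cross-checking flow described above, assuming hypothetical helpers `ask(model, question, temperature)` for an LLM call, `paraphrase(question, k)` for generating semantically equivalent rephrasings, and `agree(a, b)` for a semantic agreement judgment; it illustrates the idea rather than reproducing the SAC3 reference implementation.

```python
# Minimal sketch of cross-check consistency (illustrative, not the official SAC3 code).
# `ask`, `paraphrase`, and `agree` are assumed helpers: an LLM call, a semantically
# equivalent question rewriter, and a semantic agreement judge, respectively.

def cross_check_consistency(question, target_model, verifier_models,
                            ask, paraphrase, agree, n_samples=3):
    """Return a consistency score in [0, 1]; low scores suggest hallucination."""
    target = ask(target_model, question, temperature=0.0)  # most deterministic answer
    checks = []

    # Self-consistency: re-sample the same question on the same model.
    checks += [agree(target, ask(target_model, question, temperature=1.0))
               for _ in range(n_samples)]

    # Question-level cross-check: semantically equivalent perturbations.
    perturbed = paraphrase(question, k=n_samples)
    checks += [agree(target, ask(target_model, q, temperature=1.0)) for q in perturbed]

    # Model-level cross-check: verifier models answer the original and perturbed questions.
    for vm in verifier_models:
        checks += [agree(target, ask(vm, q, temperature=1.0))
                   for q in [question] + perturbed]

    return sum(checks) / len(checks)
```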

TRANSCRIPT

Reliable Hallucination Detection in Large Language Models

AI in Production

Slides: https://docs.google.com/presentation/d/1KAPSg8hvHOwirVq4Hl6fy-YbMerMOUfX/edit?usp=drive_link&ouid=112799246631496397138&rtpof=true&sd=true

Adam Becker 00:00:05: I've dipped my toe into just the fascinating and vast-scale research that's going on right now about hallucinations. But I look forward to seeing you take us on a deep dive into what the latest research has to say. So would you like to share your screen?

Jiaxin Zhang 00:00:23: Yes. So go. Let me share my screen. Can you see that?

Adam Becker 00:00:32: We can see it. Okay, thank you very much. I'm going to be back in 30 minutes to check up on you.

Jiaxin Zhang 00:00:42: Okay, cool. Hello everyone. Very excited to have this opportunity to share my thoughts around hallucinations, as Mihail just gave a very great warm-up. So today I want to present some of our research work on reliable hallucination detection in large language models. My name is Jiaxin Zhang. I'm a research scientist from Intuit AI Research. Okay, so let me give a brief overview of our company. If you're from the United States, you are familiar with taxes and small business.

Jiaxin Zhang 00:01:20: I think you are maybe familiar with our company and our products, like TurboTax, Credit Karma, QuickBooks, and Mailchimp. Overall, at Intuit, our target is complete financial confidence. I highlight confidence because for financial companies, confidence, trustworthiness, and reliability are very critical to our customers. The relationship we build strongly relies on the confidence and trust we have in each other. Before GenAI came along, Intuit had already shifted to building an AI-driven expert platform to help people achieve financial confidence with TurboTax, Credit Karma, and QuickBooks, one person, one community at a time. And now we have entered the GenAI zone.

Jiaxin Zhang 00:02:22: So we have built two very impressive products, and I think that's the topic of today, AI in production. The first one we call Intuit Assist; it's an AI-powered financial assistant. Right now, if you log into either TurboTax, Credit Karma, or QuickBooks, you will see Intuit Assist as a new feature. It is a new financial assistant that uses the power of GenAI to give you intelligent and personalized recommendations. It will do the hard work for you to improve your small business and personal financial success. I think that's the product side. Again, accuracy, reliability, and trustworthiness in GenAI products are very important.

Jiaxin Zhang 00:03:13: Beyond the external product, we also have an internal product on the platform development side, which we call Intuit GenOS. It's a generative AI operating system. If you look at it, you will see we want to build this platform to accelerate development velocity in the GenAI era. Here, thanks to our GenOS architect Maureen, we have multiple features and components: GenStudio is for developers, GenUX is a library of user interface tools, and there is also GenRuntime, the gen orchestrator, plus agent tools. We also build the financial large language models.

Jiaxin Zhang 00:04:08: And then again, like I said, accuracy, reliability, and trustworthiness in the GenAI era are also very important. Okay, let's move to the LLM reliability side. For LLM reliability, I think hallucinations are very important, because one recent paper calls hallucination an inherent limitation of large language models. Typically we have two types of hallucinations. One is the factuality hallucination. That means, for example, you ask a question about who was the first person to walk on the moon and you may get a hallucinated answer.

Jiaxin Zhang 00:05:00: But if you check online using some external tool like Wikipedia, you will see that the correct answer is different. The other type is the faithfulness hallucination. For example, you want to summarize the following news article, and given this context, you may generate an answer with some hallucinated information that you cannot trust, right? Another way to categorize hallucinations is intrinsic versus extrinsic. Intrinsic means the output directly conflicts with the source material, introducing factual inaccuracy or logical inconsistency. Extrinsic hallucinations, on the other hand, do not directly contradict the source but are unverifiable against it, comprising additional information that was not mentioned before. Research in this area is very active, and if you look at recent survey papers around this area, I just listed three important ones. Most of them are pretty recent, from maybe 2022 to 2023, even 2024.

Jiaxin Zhang 00:06:21: Regarding hallucination, as Adam mentioned, the question is: where does the hallucination come from? Why do we get these hallucinations? This survey gives a very comprehensive overview of hallucination research in large language models from three aspects. First are the causes: hallucination from your dataset, hallucination from your training, and hallucination from your inference. The second component is the hallucination detection benchmarks. This part is very important because on the applied side we mostly rely on some commercial or black-box model like ChatGPT or GPT-4, these kinds of models. We don't really understand how they handled the data during training or inference. We can just run their model, but we want to detect whether it hallucinated or not in some cases. The research community also wants to build hallucination benchmarks to verify what the most effective hallucination detection measures are, what the current status of different models is, and what the accuracy and performance are. Right. And the third point is the mitigation.

Jiaxin Zhang 00:07:42: The mitigation part means that once we detect the hallucination, we also want to reduce or even mitigate it. Right. There are multiple different methods, like data-related, training-related, and inference-related; I think they correspond to the causes in the first component. Here I just listed a benchmark using different datasets and different methods to test the hallucinations. And the last one is from our paper published at EMNLP 2023, and today I will bring you more details. Okay, again, hallucination.

Jiaxin Zhang 00:08:22: I think the talk will cover two components; rather than the first one, the causes, we will mainly focus on the detection and the mitigation parts. Detecting hallucination in LLMs is very critical for assessing the reliability and trustworthiness of the generated content. Typically there are multiple ways, by defining a metric like a fact-based metric, a classifier-based metric, or a QA-based metric, and also using some uncertainty estimation methods to give an estimated score for your generated content. We may also leverage the strengths of different LLMs by just prompting them to check the user query and the LLM generation and define some binary judgment or other rules to give the answer and tell whether it is hallucinated. Most of these methods, for example, may need some external knowledge, or pretrained models, or defining some agent, or a lot of additional resources. But in practice, can we pursue a kind of zero-resource method that can be applied to any different type of model? Motivated by this requirement, one recent approach was proposed, called SelfCheckGPT. The target is to propose a kind of zero-resource, black-box detection method for large language models.

Jiaxin Zhang 00:10:05: The basic idea is strongly motivated by this intuition: if the LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. That means if you ask what one plus one is, in most cases it'll be two. Right. However, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. That means if you give a very large number and ask, can you tell me whether this number is a prime number or not, it's hard to say; maybe each time the LLM generates a different answer. I think these are contradictory responses.

Jiaxin Zhang 00:10:57: So based on this intuition, SelfCheckGPT proposed to generate different sampled responses given your query and to check whether these responses are consistent or not. This method is also inspired by the earlier self-consistency chain-of-thought papers. I think it's pretty efficient and useful, but we wanted to re-examine this intuition. Is it always true? Probably not, because we observed that solely checking the self-consistency of the LLM is not sufficient for detecting factuality. For example, here is an illustrative example to show LLM self-checking for this case. You ask the query: is pi smaller than 0.2? I think it's a pretty classical trick question, but I'm not sure whether ChatGPT or GPT-4 today, the recent models, have fixed this problem. If you ask this question several times, you will always get the wrong answer, which we call consistently wrong. That means the generated responses to the same query may be consistent but not factual, right? Another scenario is that the generated responses may be inconsistent with the original answer that is factually correct.

Jiaxin Zhang 00:12:21: That means you ask this query, and then you first get the most deterministic answer by setting the temperature to zero; that's the target answer. Then you generate different samples and check whether the sampled answers are consistent with your target answer. For this case you will see, oh, the answer is no, because if we increase the temperature, we increase the randomness and get more stochastic responses, right? And then we will get a wrong answer. So relying only on the assumption of SelfCheckGPT, you will get a totally wrong answer. I think that's our key observation, and we want to propose some new mechanism to fix this kind of mismatch, the gap between the true detection performance and the SelfCheckGPT method. Here is another illustrative example for the same query. We observe that if we perturb the user query a little bit using some semantically equivalent prompt, and if the LLM has the knowledge, it should give the same answer.
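
As a rough illustration of the self-consistency check being discussed, and of why it can be fooled, here is a small sketch; `ask` and `agree` are the same assumed helpers as in the earlier snippet, standing in for an LLM call and a semantic agreement judge.

```python
# Sketch of a SelfCheckGPT-style self-consistency score. If the model is consistently
# wrong on a trick question, the stochastic samples all match the deterministic target,
# so this score stays high even though the answer is not factual; that is the gap the
# question-level and model-level cross-checks are meant to close.

def self_consistency_score(question, model, ask, agree, n_samples=5):
    target = ask(model, question, temperature=0.0)       # deterministic target answer
    samples = [ask(model, question, temperature=1.0)     # stochastic samples
               for _ in range(n_samples)]
    return sum(agree(target, s) for s in samples) / n_samples
```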

Jiaxin Zhang 00:13:33: So first we perturb the original question. We get different questions, but they are semantically equivalent. Then we call the LLM to answer them, and you will see it gives different answers. Another case is that we do not want to rely on just one specific model like GPT-3.5-turbo or GPT-4; we want to incorporate more models, which we call verifier models: different black-box models like Bard or Claude, or even open-source models like LLaMA. For the same query, can we get different answers? So for that case, we check the self-consistency itself, and also the cross-question consistency, the cross-model consistency, and the cross-model and cross-question consistency.
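
One simple way to produce the semantically equivalent perturbations mentioned here is to ask an LLM to rephrase the question; the prompt below is a hypothetical illustration rather than the exact template from the paper, and `ask` is the same assumed LLM-call helper as before.

```python
# Hypothetical prompt for generating semantically equivalent question perturbations.

def generate_perturbations(question, model, ask, k=3):
    prompt = (
        f"Rephrase the following question in {k} different ways without changing "
        f"its meaning. Return one rephrasing per line.\n\nQuestion: {question}"
    )
    lines = ask(model, prompt, temperature=0.7).strip().splitlines()
    return [line.strip() for line in lines if line.strip()][:k]
```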

Jiaxin Zhang 00:14:19: Then we want to put them together to give you a more comprehensive or ensembled assessment of the query and the answers. Based on this insight, we proposed what we call SAC3, semantic-aware cross-check consistency. We re-examine existing detection approaches based on self-consistency, like SelfCheckGPT, and uncover two types of hallucinations resulting from the question level and the model level. The question level is something like: if your query is ambiguous or imprecise, maybe we use some perturbations to get a clearer expression such that the LLM can understand it better. The model level means we want to leverage the strengths of different verifier models and see if we can get more consistent responses, rather than just relying on one specific model. Based on this motivation, we propose this sampling-based method, and this method expands on the principle of the self-consistency check.

Jiaxin Zhang 00:15:28: We have three typical stages. One is the question-level cross-checking, and then the model-level cross-checking with additional verifier models, and these models can be black-box or open-source. Stage three is how to calculate the consistency score and put everything together: take all the responses and check whether they are consistent or not with the target response. Here we also propose to combine the query and answer together as a QA pair to check the consistency. Eventually we put all of them together to define an overall consistency score as a metric to evaluate the degree of consistency. We also incorporate an additional model weight hyperparameter, because for some domain-specific problem you may have high confidence in a particular model, and you can add a hyperparameter to increase the weight of that model.
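
Schematically, a weighted combination of the kind described could look as follows, where z_q is the question-level (perturbed-question) consistency on the target model, z_m and z_qm are the verifier-model consistencies on the original and perturbed questions, and lambda_m is the per-verifier model weight; this is an illustrative form, not necessarily the exact equation from the paper.

```latex
z_{\mathrm{SAC}^3} \;=\; z_{q} \;+\; \frac{\sum_{m \in \mathcal{M}} \lambda_{m}\,\bigl(z_{m} + z_{qm}\bigr)}{\sum_{m \in \mathcal{M}} \lambda_{m}}
```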

Jiaxin Zhang 00:16:24: Now we want to test our performance, and we built this benchmark. As the survey paper mentioned, we test two types of QA scenarios. One is the classification QA task, where the task is to determine whether a number is a prime number. The second one is to check whether a specific person is a senator who graduated from a specific school. Here we compare our proposed methods with the naive SelfCheckGPT, which we call SC2, and you will see our SAC3 with all components, like the question perturbations, the model perturbations, and the cross-check, gets very good performance with a significant improvement. Here we incorporated two smaller verifier models, one is Falcon and another is Guanaco, but feel free to add whichever verifier models you prefer.

Jiaxin Zhang 00:17:22: So I think it's pretty flexible. Another task is generation QA, meaning HotpotQA or NQ-open, these kinds of scenarios, and there we also see strongly outperforming results compared with the naive baselines. Another key question is how many samples you need. Here we test different numbers of samples and see that we may still use maybe three to five to reduce the cost and get very competitive performance compared with a larger sample size, so I think that's not a big deal. And this one shows the effect of the verifier model weight on the final performance. Here is our GitHub; if you are interested, you can just take a look. We also recently released the first SAC3 package, which significantly reduces the cost, because the method is sampling-based.

Jiaxin Zhang 00:18:16: But we can use parallelization, like a multi-threading strategy, to make all the sampling phases parallel, and the consistency check is also parallelized. So right now, for HotpotQA, for each query we can reduce the time from maybe 15 seconds to just two or three seconds. All right, so after the detection, another task is hallucination mitigation, meaning, okay, we detected this hallucination, how can we improve or fix that? Right. Typically we know some commonly used ways like RAG, or advanced prompting and prompting tricks, or from the model side, we can fine-tune the model, use RLHF, or choose a different model to figure it out. But right now, what we want to do is not rely on external knowledge or make additional assumptions that restrict the solution to specific resources. We still want to propose a zero-resource, black-box method as a plugin, so that any type of hallucination detection and mitigation works as a whole framework. So here we proposed two mitigation solutions built on top of the consistency concept, and both are zero-resource, black-box methods.
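
The parallelization mentioned above can be sketched with standard Python concurrency: the sampling calls (extra samples, perturbed questions, verifier models) are independent and I/O-bound, so issuing them from a thread pool cuts wall-clock latency. The snippet assumes the same hypothetical `ask` helper as the earlier sketches.

```python
# Illustrative parallel sampling: submit all independent LLM calls at once and
# collect the results, instead of waiting for each call in turn.

from concurrent.futures import ThreadPoolExecutor

def sample_in_parallel(questions, model, ask, temperature=1.0, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(ask, model, q, temperature) for q in questions]
        return [f.result() for f in futures]
```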

Jiaxin Zhang 00:19:45: The first one we call LLM check-and-fusion. The insight of this method is kind of like an ensemble. You will see some related papers, like LLM-Blender, which leverage different types of open-source models and merge them together to give you a more reliable answer; I think that's our related work. The other one is the divide-and-conquer reasoning method, which means we want to make the whole long context shorter and shorter, check it sentence by sentence, and also leverage the strengths of the LLM with self-correction to further improve the inconsistency and mitigate hallucinations. So the first is the scalable consistency ensemble method. The insight, as I mentioned, is that from your input you send the query to different LLMs, which can be black-box or open-source, that's fine, and then you get a bunch of responses, and the next step is how to check the consistency among them, right.

Jiaxin Zhang 00:20:52: So here we proposed a strategy we call "you only prompt once". It's a new prompting trick to check the pairwise consistency by using just one inference to get the consistency results of the QA pairs. Then we do voting to get the most consistent responses and also rank the responses from one to k. Eventually we define another step, a fusion of the responses, to merge them together, which leverages the strengths of each response and also mitigates their potential weaknesses. For this case, we just merge three models: GPT-3.5, GPT-4, and PaLM 2.
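
To make the "you only prompt once" idea concrete, a meta-prompt that checks the pairwise consistency of k candidate answers in a single call might look like the sketch below; it is a hypothetical template, not the one released with the paper.

```python
# Hypothetical YOPO-style meta-prompt: all candidate answers go into one request and
# the model judges every pair at once, so pairwise consistency costs a single
# inference instead of one call per pair.

def build_yopo_prompt(question, candidate_answers):
    numbered = "\n".join(f"[{i + 1}] {a}" for i, a in enumerate(candidate_answers))
    return (
        "You are given a question and several candidate answers.\n"
        f"Question: {question}\n"
        f"Candidate answers:\n{numbered}\n\n"
        "For every pair of candidate answers, state whether they are semantically "
        "consistent (Yes/No). Then rank the answers from most to least consistent "
        "with the others."
    )
```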

Jiaxin Zhang 00:21:44: And for this case we can see our ensemble method performs much better than the single-model use case, specifically for these four datasets. Here is our meta-prompt template for YOPO and also for the fusion component. For this work, I think it's pretty flexible to incorporate more open-source models rather than relying only on black-box models; it depends on the specific use case. Okay, the second one is DCR consistency. We call this divide-conquer reasoning for consistency evaluation and improvement of large language models. For this case, like I mentioned in the beginning, another type of hallucination comes from the summarization task: given this long paragraph, you want to get a short abstract. How can I guarantee there is no hallucinated information extracted from the paragraph? For this kind of long-context comparison, a consistency check is pretty challenging.

Jiaxin Zhang 00:22:51: So the objective is to check the consistency between one reference and one candidate. The reference can be your original context paragraph, and your candidate can be the extracted abstract or shorter paragraph. Rather than just directly comparing them at the paragraph level, and if you are familiar with evaluation metrics like G-Eval, that is the state-of-the-art method for this kind of paragraph-level consistency evaluation, we propose this divide-and-conquer solution, which means we define several LLM agents. The first agent we call the divide-conquer evaluator. That means we divide the whole paragraph-level consistency check into a sentence-by-sentence consistency check.

Jiaxin Zhang 00:23:47: The second part we call the auto-metric converter. That means, based on the sentence-level consistency decisions and reasons, it comes up with a numerical score, and we use this score for the consistency evaluation, rather than just having the LLM say yes or no, hallucinated or not hallucinated, consistent or not consistent. We use this agent to get a decision and also get some reasons. The reasons mean that if the sentences are consistent, okay, we say it's fine; but if they are not consistent, we get some reasons.

Jiaxin Zhang 00:24:28: Based on these reasons, we want to leverage them and propose a third agent, a reason-assisted improver. That means, given these reasons, can we think about how to improve the candidate to exactly match the reference? That's what we call the third agent: improving the candidate at the sentence level and making the consistency better. I think that's the overall idea. We have used this method to test the semantic consistency check and also the summarization task and got a very significant improvement. This is just an illustrative example: from the reference and the candidate, we first run the first agent to get the sentence-level check and whether each sentence is consistent or not. Then we use the second agent to analyze the reasons, assign a score of minus one or plus one, and calculate the final score to evaluate the sentences. Then we leverage the reasons from this agent; we can say, okay, if they do not agree, here are the reasons.
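
Putting the three agents together, the iterative improvement loop described here can be sketched roughly as follows; `divide_conquer_evaluate`, `to_score`, and `improve_candidate` are placeholders for the divide-conquer evaluator, the auto-metric converter, and the reason-assisted improver, which are LLM-backed in the actual DCR pipeline.

```python
# Rough sketch of the DCR-style loop: evaluate sentence by sentence, convert the
# decisions and reasons into a numeric score, and let a third agent rewrite the
# candidate using those reasons until it is consistent enough.

def dcr_improve(reference, candidate, divide_conquer_evaluate, to_score,
                improve_candidate, threshold=0.95, max_rounds=3):
    score = 0.0
    for _ in range(max_rounds):
        # Agent 1: sentence-level consistency decisions plus reasons.
        verdicts = divide_conquer_evaluate(reference, candidate)
        # Agent 2: map each decision to +1 / -1 and aggregate into a score.
        score = to_score(verdicts)
        if score >= threshold:
            break
        # Agent 3: regenerate the candidate guided by the reasons for inconsistency.
        candidate = improve_candidate(reference, candidate, verdicts)
    return candidate, score
```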

Jiaxin Zhang 00:25:35: So we want to use that to regenerate a new candidate to fit the original reference, and you will see the whole process is iterative: we can implement this multi-round improvement to get more consistent results. Here we show the benchmark and the experiments. We outperformed the state-of-the-art measures by a large margin on the SummEval dataset, which is a very important dataset, and we get very good performance compared with G-Eval, GPTScore, or conventional score metrics. Also, like I mentioned, this method is not only for the consistency check; we also provide the capability for consistency improvement based on the reasoning part, so we significantly reduce nearly 90% of output inconsistency.

Jiaxin Zhang 00:26:33: So it's very promising for effective hallucination mitigation. I think those are the results for these three benchmarks. All right, that's all my prepared material, but let me give a brief summary and leave some challenges and open questions. As we know, hallucination is very challenging; it's an inherent limitation of large language models, and typically for production we want to implement hallucination detection first and then think about some mitigation approaches. And the third point: consistency is pretty essential and helpful for detecting, evaluating, and mitigating hallucinations.

Jiaxin Zhang 00:27:20: And SelfCheckGPT is a good approach, I think, based on the intuitions, but self-check consistency may not be good enough. That's the reason we proposed SAC3: it cross-checks over question and model perturbations to get good performance. We also proposed two mitigation solutions: one is the ensemble, the other is divide-conquer reasoning. And beyond that, even though we now have a bunch of papers, research progress, and bigger models, we still have a lot of challenges. For example, hallucination in long-form text generation, at the paragraph level or document level.

Jiaxin Zhang 00:28:03: We also have RAG, retrieval-augmented generation, but we cannot totally eliminate the hallucination; zero hallucination is impossible at the current stage, so how can we figure it out? The third one is hallucination in multimodality, like large vision-language models, as Mihail showed in his case. Maybe it's more challenging than the pure text-only models. We also leave some open questions for discussion. For example, can the self-correct mechanism help in mitigating reasoning hallucinations? From our case, we say yes, but we still have a small gap, and there is also the trade-off between the LLM cost and the multi-round improvement.

Jiaxin Zhang 00:28:51: The second is: can we actually capture LLM knowledge boundaries? Here, in most cases, we focus on zero resources, without additional knowledge. But in many cases, like RAG or knowledge graphs, we incorporate additional knowledge, and that knowledge is up to date and does not rely on the pretrained models. Can we make sure all the knowledge is correct and leverage this kind of knowledge to mitigate hallucinations? The third one is: can we strike a balance between creativity and factuality? That means if you ask some creative question, like writing a very fantastic or romantic poem, in some cases hallucination can be beneficial, right? But in other cases it's not; we want a very rigorous answer and result, specifically for domain-specific problems like healthcare and finance. If you just get a wrong number, it will cause a big loss. Right. So how to trade off creativity and factuality? I think this is also a very interesting research problem. And here, if you are interested in this kind of work, feel free to look at our papers.

Jiaxin Zhang 00:30:10: Our GitHub repos for SAC3 and DCR are already released online. And finally, let me appreciate my team members. We have Wendy and Damon, Kamalika and Kumar; they are from Intuit and Intuit AI Research. For this work we also have academic collaborators: Zhuohang, a PhD student, and Bradley Malin from Vanderbilt University. Wonderful collaborative work, and thank you, everyone. I think that's it.

Jiaxin Zhang 00:30:45: Any questions? Thank you for your attention and joining this talk.

Adam Becker 00:30:49: Nice, Jiaxin, and thank you very much. Wonderful talk. However long we have, I'm going to try to get as much out of this as I can, and I'm going to ask you a bunch of questions. I think there are a couple from the audience. Let's see. One of them has to do with different GPT models. So Slava is asking: have you measured the consistency score for different GPT-4 models?

Jiaxin Zhang 00:31:18: Oh, that's a good question. The short answer is no, but we noticed this and we want to verify it, because for our GPT model we use the Azure API, and we also build a wrapper, like GenStudio, on top of the GPT API. So in our company, we don't frequently update this version, but we want to try to see the effect of different versions. Again, for the prime number case, we already noticed that with the previous GPT-4 version the results were totally different from the current version. So maybe, I think, OpenAI noticed this kind of thing and fixed it.

Adam Becker 00:32:08: Okay, so until I get other questions from the audience, people do want to read the paper. Can you scroll up to one of your first slides? I have some questions about that.

Jiaxin Zhang 00:32:20: This one.

Adam Becker 00:32:21: Okay, go to the next one. Next one. Okay, so here we're already beginning to categorize two very different types of hallucinations.

Jiaxin Zhang 00:32:34: Right?

Adam Becker 00:32:35: This is sort of where it seems like we're beginning. You have the factuality hallucination, and you have the faithfulness hallucination. Factuality is where the answer that it is giving you is absolutely wrong in the first place. Right. We know that it is not factual. In the faithfulness one, it isn't that it's wrong, it's just that it's not actually relevant to the context that we asked it to collect it from. And so it's misleading. Even though it might be true, it's misleading.

Jiaxin Zhang 00:33:03: Right.

Adam Becker 00:33:05: Okay, cool. And then let's go on to the next one. And there are going to be different ways of detecting this.

Jiaxin Zhang 00:33:16: Right.

Adam Becker 00:33:17: You can then detect whether or not the hallucination is a faithfulness one versus a factuality one.

Jiaxin Zhang 00:33:22: Right.

Adam Becker 00:33:22: Like the different detection mechanisms are going to help us to parse which one is which, correct?

Jiaxin Zhang 00:33:28: Yes.

Adam Becker 00:33:29: All right, let's go on to the next one.

Jiaxin Zhang 00:33:32: Yes. So one comment: for factuality, if we can introduce additional external knowledge, that would be helpful, for example RAG, this kind of thing. But the second one, I think, is related to some specific task, like summarization, so this kind of hallucination may be introduced by the information extracted from the given reference, and then it will mislead, like you said, with some misleading information. So they are two different types. For this one, maybe the zero-resource approach may be better, but for that one, I think incorporating knowledge will be better.

Adam Becker 00:34:09: Okay, so even for the factuality one, it seems to me that we can continue to divide this up into different subcategories. Because it could be that it simply is lacking the factual information in the network itself, right, where the fact that it was Neil Armstrong was just missing from the model; or it could be that it actually did know that it was Neil Armstrong, but at that point it is inconsistent. So there's consistency versus completeness, or...

Jiaxin Zhang 00:34:46: Right, yes, exactly. So yeah, the proposed method, I think, is pretty generic and can handle different scenarios, but if we know some prior information, maybe that can help us build on top of the existing method to figure it out better.

Adam Becker 00:35:02: Jiaxin, this was absolutely fascinating.

Jiaxin Zhang 00:35:05: We appreciate you coming here.

Adam Becker 00:35:07: And a bunch of folks are already going to look at the paper and they've asked for the GitHub link. So thank you very much for coming.

Jiaxin Zhang 00:35:14: All right. Yeah, thank you so much.
