Entity Resolved Knowledge Graphs
Expertise in machine learning, natural language, graph technologies, distributed systems, and related topics. Known as a "player/coach", with 40+ years of tech industry experience, ranging from Bell Labs to early-stage start-ups. Started coding in 1972; has used email for work since 1981; software lead for neural network hardware accelerators 1989-96; guinea pig for AWS 2005-ff; led early teams identified as "Data Science" in 2009-ff.
Board member for Argilla.io; Advisor for KUNGFU.AI, DataSpartan.co.uk; Lead committer on pytextrank and kglab. Formerly: Director, Community Evangelism for Apache Spark at Databricks.
Werner Herzog is his spirit animal.
Knowledge graphs have spiked recently in popular use, for example in retrieval augmented generation (RAG) methods used to mitigate hallucination in LLMs. Graphs emphasize relationships in data, adding semantics, more so than SQL or vector databases do. However, data quality issues can degrade linking during KG construction and updating, which makes downstream use cases inaccurate and defeats the point of using a graph. When you have join keys (unique identifiers), building relationships in a graph may be straightforward, although false positives (duplicate nodes) can result from: typos or minor differences in attributes like name, address, phone, etc.; family members sharing email; duplicate customer entries, and so on. This talk describes what an Entity Resolved Knowledge Graph is, why it's important, plus patterns for deploying entity resolution (ER) which are proven to work. We'll cover how to make graphs more meaningful in data-centric architectures by repairing connected data: Unify connected data from across multiple data sources. Consolidate duplicate nodes and reveal hidden connections. Create more accurate, intuitive graphs which provide greater downstream utility for AI applications.
Slide deck: https://docs.google.com/presentation/d/1kq0KoovXORof2EuN8XUu3TsxpuVSTRJ8/edit?usp=drive_link&ouid=103073328804852071493&rtpof=true&sd=true
Paco Nathan [00:00:09]: Maybe there are two Bob R. Smiths, or maybe there's a Robert Smith and a Bob R. Smith Junior. How do you know whether they're the same person or different people? Okay, so just real quick, entity resolution. If you've ever seen the movie 21, has anybody seen it, about the MIT card thing in Vegas? Yeah. So Senzing has been around for over 20 years.
Paco Nathan [00:00:35]: This was actually used as part of the bust for the scene in the movie there. The idea is that if you have two or more data sets, and they could be structured data or semi-structured data, and you have at least two or more features, columns, fields, whatever, in each of those data sets, then you can start to triangulate on what are the entities that are in common across those different data sets. And so it provides a kind of semantic overlay. And so, as you bring your data sets together, one of the things that comes out of this is: here are the entities, here are the relationships between them, here are some properties of supporting evidence. And this is a way of basically creating building blocks, like the backbone of a graph. And so one of the ways I like to look at this is to think of it as an imperfect map. If you think about data sets, somewhere inside of your data set there are entities that are being described.
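To make the "two or more data sets, two or more features" idea concrete, here is a minimal sketch in Python. It is only an illustration, not how Senzing or any production entity resolution engine works: the records, the field names, the `resolve` helper, and the 0.85 threshold are all assumptions invented for this example, and plain string similarity from the standard library stands in for real matching logic.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve(left, right, features=("name", "address"), threshold=0.85, min_matches=2):
    """Pair records across two data sets when enough features agree.

    Requiring at least two agreeing features is the "triangulation":
    a shared name alone is not treated as evidence of the same entity.
    """
    matches = []
    for rec_a in left:
        for rec_b in right:
            agreeing = [
                f for f in features
                if similarity(rec_a[f], rec_b[f]) >= threshold
            ]
            if len(agreeing) >= min_matches:
                matches.append((rec_a["id"], rec_b["id"], agreeing))
    return matches


# Hypothetical records, invented for illustration only.
crm = [
    {"id": "A1", "name": "Bob R. Smith", "address": "123 Main St, Las Vegas"},
    {"id": "A2", "name": "Alice Jones", "address": "9 Elm Ave, Reno"},
]
billing = [
    {"id": "B1", "name": "Bob R Smith", "address": "123 Main Street, Las Vegas"},
    {"id": "B2", "name": "Bob R. Smith", "address": "77 Ocean Dr, Miami"},
]

matches = resolve(crm, billing)
for left_id, right_id, evidence in matches:
    print(left_id, right_id, evidence)
```

In this toy data, the two "Bob R. Smith" records with different addresses stay separate, while the pair that agrees on both name and address is linked, and the list of agreeing features is kept as supporting evidence.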
Paco Nathan [00:01:33]: It could be ships at sea, it could be molecules in drug discovery. It could be a lot of different kinds of things. But then there's also relationships between those entities, and there's also a lot of annotations, properties that need to be recorded for either the entities or the relations. So I like to look at datasets in general, data in general, as this kind of imperfect map of what would become building blocks for a graph. And if you've done natural language work, you've probably... Anybody heard of named entity recognition before? NER. So this is different. NER is where you have unstructured data, and you parse it, and you get tokens in a stream, and then you identify spans of tokens, which are noun phrases.
Paco Nathan [00:02:19]: And then NER is where you go and label noun phrases. Is it a person? Is it a place? Is it a thing? Is it a unit of currency, etcetera? Entity resolution is different. This is where you're looking at structured data, and you're trying to find which are the common parts that connect. And actually the two can work together pretty well. There's also another term, entity linking, which is something entirely different, but also related in the tool chain. So I just want to give some background, because I know this is kind of a niche area. I've just recently joined the company in the past month, and I'm building out the graph practice. But what we found is that there are a lot of places where traditionally they're doing entity resolution, and that's very important.
Paco Nathan [00:03:00]: Like police departments, before they go and, like, respond to an incident, they want to make sure they're going to the right address. We find that in these kinds of applications, usually there are graphs, because usually police investigations will also do some type of graph work to understand, you know, is there money laundering in effect, or were there priors? There's usually a graph somewhere adjacent to it. Ben Lorica has a really good breakdown here, a white paper describing this, and I also just wanted to surface some use cases here. Esri has a nice one. Kineviz was recently exploring how you can essentially use LLMs to start to understand the things that are buried inside of SEC filings. So if you're familiar with, like, 8-K reports that companies make, say a company has had a big security breach, they can bury that somewhere inside the text of a disclosure document, and it's very difficult to parse that and bring it out. But with graph technology, you can start to surface these kinds of things. A lot of this is basically beneficial ownership, money laundering, illegal fishing, a lot of catching bad guys.
Paco Nathan [00:04:05]: So there's some use cases here. This one's pretty fun. It's from Esri, and it's actually using geosignals plus entity resolution plus, like, chain of ownership of boats to catch boats that are out at sea that go dark, turn the transponders off, maybe turn back on with a slight spelling error in the transponder. But they're actually part of an illegal fishing operation. How do we catch them over time? So, cases like that. In our practice area, we've definitely got some community forums. I've got a playlist of various different vendors working in this space, trying to show what's going on through YouTube videos. And by the way, I'll just do a little shout out. A couple of people I know here have done a great book that I'll recommend.
Paco Nathan [00:04:49]: Knowledge Graph-Enhanced RAG, from Tomaž Bratanič and Oskar Hane, that just came out. It's an early release. I have a talk where we go through working with some open data sources. So WHISARD is wage compliance from the Department of Labor, PPP is from the Chamber of Commerce about PPP loans during the pandemic, and SafeGraph is about places, businesses in a particular place. The tutorial is linked here, but it's basically: take a bunch of different data sources, find out where they link up, what are the connections between them through the entity resolution. Build a knowledge graph, and then you have downstream use cases. It might be a matter of visualization to do some investigation, or it might be a matter of analytics to look for patterns. Or more and more, it's a matter of doing graph RAG and building agents.
Paco Nathan [00:05:39]: That's the most interesting part. Now, if you want to run through the tutorial, I'm not going to do it here, obviously, but it's just really four steps, and it's on GitHub. And it's just this idea of understanding this technology using some open data and some open source. It takes about a half hour to run through. And like I say, we have a lot of data here about businesses and the locations and the addresses, and they don't quite line up. But it's interesting because you can see who are the bad actors. This is all in the Las Vegas area. And, like, who are the employers who are maybe cutting some corners with their employees. And we pull it all together, and then we do this, like, semantic overlay of basically, what are the entities that will link out to the records and make sense of them, which ones have supporting evidence that could go through an audit, and what are the relationships between them? So, like here, I'm showing something about Mandalay Place, which is a casino that has a lot of side businesses.
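The semantic overlay described above (entities that link out to source records, with supporting evidence and relationships between entities) can be sketched as a small graph-building step. This is a hedged illustration in plain Python, not the tutorial's actual code: the `resolved` and `related` structures, the record identifiers, and the `build_graph` helper are all hypothetical, and the data is invented rather than taken from the Las Vegas data sets.

```python
# Hypothetical ER output: each resolved entity lists the source records
# it was assembled from, using "<source>:<row id>" identifiers.
resolved = {
    "E1": {"name": "Example Casino", "records": ["dol:101", "ppp:7", "places:55"]},
    "E2": {"name": "Example Gift Shop", "records": ["places:56"]},
}

# Hypothetical relationships between entities, each carrying the
# records that serve as its supporting evidence.
related = [
    ("E2", "E1", "shares_address", ["places:55", "places:56"]),
]


def build_graph(resolved, related):
    """Return (nodes, edges) for a small entity-resolved graph.

    Entities become one kind of node and raw records another, so every
    entity can be traced back to the evidence behind it.
    """
    nodes = {}
    edges = []
    for entity_id, info in resolved.items():
        nodes[entity_id] = {"kind": "entity", "name": info["name"]}
        for rec in info["records"]:
            nodes.setdefault(rec, {"kind": "record", "source": rec.split(":")[0]})
            edges.append((entity_id, "resolved_from", rec, []))
    for src, dst, relation, evidence in related:
        edges.append((src, relation, dst, evidence))
    return nodes, edges


nodes, edges = build_graph(resolved, related)

# List the source records behind one entity, e.g. for an audit trail.
audit_trail = [dst for src, rel, dst, _ in edges
               if src == "E1" and rel == "resolved_from"]
print(audit_trail)
```

Keeping records as their own nodes, rather than merging them away, is one way to preserve the provenance the talk emphasizes; a graph database or a library would play the role of these dictionaries in practice.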
Paco Nathan [00:06:35]: And we put it into a graph. Just looking at this, yeah, we do have a data quality problem here, because some of these have a real long tail as far as how many records. There's a lot of ambiguity in there. But you can see, it's like, if you just take the raw records, you get this. But once you start to do the semantic overlay with entity resolution, you start to see patterns jump out. And you can drill down into that and see, like, here is a casino and the businesses it owns. And then I'll just say that one of the things I like to describe about using graphs is to think about graphs in terms of levels of detail. So you can think at a very low level, you've got sources, and then at a higher level, you've got some organization of data engineering work to ingest it, and a lot of provenance and evidence of how it might link together, but not necessarily the parts that are linked.
Paco Nathan [00:07:22]: And then up above, you have another level where we would call it more of a knowledge graph, where you actually have the entities and define what the relationships are. You can use this to do inference. You can use this when you're running RAG. I just want to throw out a few resources about graph-enhanced RAG. Definitely we're seeing a lot of lift by using a knowledge graph to help ground what's going on in an LLM. And there's a study that came out from LinkedIn recently showing almost 30% lift in customer service. There are several others. I've got a Hugging Face-based collection of different papers that are showing lift from using graphs and RAG.
Paco Nathan [00:08:02]: And we also have a Discord channel that Neo4j is sponsoring about graph RAG, from developers who are working on this. So definitely join us. I think this is a really exciting area, but I also, you know, I think there's upsides and downsides. Just real quick. If you're just using RAG as a way of grounding your data, it's interesting, you know, it's kind of repurposing recommender system technology. The thing that a graph buys you is being able to hop, to go through multiple hops to find better answers. So for instance, here a cat is related to a kitten, is related to a tiger and a lioness. If you're just looking at, like, what's the similarity in words, you're not going to find relationships like that.
Paco Nathan [00:08:42]: But if you can go a few hops out, you can leverage graphs to find out where the connections are and inform what you're going to provide as a result. And lastly, there's a much bigger talk, and there's a lot of links and primary sources and whatnot here. So please load up the slides and let me know if this is of interest. Thank you very much.
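As a small illustration of the multi-hop point in the talk, the sketch below runs a breadth-first search over a toy graph to collect everything within a few hops of a starting node. The edges are assumptions loosely modeled on the cat, kitten, tiger, and lioness example, and `neighbors_within` is a hypothetical helper, not part of any graph RAG library.

```python
from collections import deque

# Toy undirected edges; which animals connect to which is an assumption
# made for this example.
edge_list = [("cat", "kitten"), ("cat", "tiger"), ("tiger", "lioness")]

graph = {}
for a, b in edge_list:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def neighbors_within(graph, start, max_hops):
    """Breadth-first search returning {node: hop distance} up to max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]  # report only the other nodes
    return seen


two_hops = neighbors_within(graph, "kitten", 2)
three_hops = neighbors_within(graph, "kitten", 3)
print(two_hops)
print(three_hops)
```

Word similarity alone would not connect "kitten" to "lioness", but in this toy graph the link appears once the search is allowed to go three hops out, and those extra nodes could then be passed along as context for a response.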