GraphBI: Expanding Analytics to All Data Through the Combination of GenAI, Graph, & Visual Analytics
SPEAKERS

Paco leads DevRel for the Entity Resolved Knowledge Graph practice area at Senzing.com and is a computer scientist with 40+ years of tech industry experience and core expertise in data science, machine learning, natural language, and graph technologies. He's the author of numerous books, videos, and tutorials about these topics. Paco hosts the monthly "Graph Power Hour!" webinar, and joins Ben Lorica for a monthly AI recap on "The Data Exchange" podcast.

Weidong Yang, Ph.D., is the founder and CEO of Kineviz, a San Francisco-based company that develops interactive visual analytics solutions to address complex big data problems. His expertise spans Physics, Computer Science, and Performing Arts, with significant contributions to the semiconductor industry and quantum dot research at UC Berkeley and in Silicon Valley. Yang also leads Kinetech Arts, a 501(c) nonprofit blending dance, science, and technology. An eloquent public speaker and performer, he holds 11 US patents, including the groundbreaking Diffraction-based Overlay technology, vital for sub-10-nm semiconductor production.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Existing BI and big data solutions depend largely on structured data, which makes up only about 20% of all available information, leaving the vast majority untapped. In this talk, we introduce GraphBI, which aims to address this challenge by combining GenAI, graph technology, and visual analytics to unlock the full potential of enterprise data.
Recent technologies like RAG (Retrieval-Augmented Generation) and GraphRAG leverage GenAI for tasks such as summarization and Q&A, but they often function as black boxes, making verification challenging. In contrast, GraphBI uses GenAI for data pre-processing—converting unstructured data into a graph-based format—enabling a transparent, step-by-step analytics process that ensures reliability.
We will walk through the GraphBI workflow, exploring best practices and challenges in each step of the process: managing both structured and unstructured data, data pre-processing with GenAI, iterative analytics using a BI-focused graph grammar, and final insight presentation. This approach uniquely surfaces business insights by effectively incorporating all types of data.
TRANSCRIPT
Weidong Yang [00:00:00]: So I'm Wei. My full name is actually Weidong Yang, but Wei is easier to pronounce. And I'm the CEO of Kineviz, a visual data analytics company. And I love coffee. I think civilization starts with the invention of coffee, so I have to drink a coffee. I do add milk to coffee because black coffee is a little bit too strong for me.
Demetrios [00:00:26]: Welcome back to another MLOps Community podcast. Today we are lucky enough to have not one, but two graph experts who have been doing this for a very long time. I got schooled. I felt like I learned a ton about how to use graphs as tools and ways that we can leverage them better. Let's get into this conversation with Paco and Wei. As always, I'm your host, Demetrios. And you know what is a huge help? If you can hit a little review on whatever you are listening to this on, that would mean the world to me. Boom.
Demetrios [00:01:00]: Let's jump into it. We were talking about PII and using different methods to anonymize data. Right. And Paco, you had said something that I didn't fully understand, and then Wei, you said something else that I didn't fully understand. So maybe we can rehash that and I can understand it the second time.
Paco Nathan [00:01:26]: Awesome. Well, I was going to ask if you all ever came across it. There's another podcast that I follow called The Dark Money Files, and it's a couple of consultants who have worked in banks and understand a lot of the ins and outs of financial crimes and investigations. And so I was just going to preface it because they've had a great series recently. If you've ever heard of this thing called a SAR, it's a suspicious activity report. And the laws are really weird depending on what country the bank is in. But basically, if you're at a bank and you see some suspicious activity, like there's a money transfer and the counterparty is like a known terrorist group or something, you see something weird going on. Okay.
Paco Nathan [00:02:12]: Number one, you have an obligation to report a crime to a criminal investigation unit. If you see something suspicious and you don't report it, that's a crime. Yeah. At least if you see something suspicious, you have not an obligation but a responsibility to send it up the chain so that other financial houses might share it. But if you send too much information, you might get sued. And then there's these reports, and it costs on average about $50,000 to process each report, so you don't want to generate too many of them. And machine learning models could generate thousands a day, which would be like, you know, tens of millions of dollars of liability. So this whole space of like, what do I do? I'm getting attacked, and what do I do? Because, I mean, also these people are taking money, and in some situations, as a bank, you might have to compensate if there is some kind of scam.
Paco Nathan [00:03:09]: So you could be losing money and facing legal threats from three sides. And meanwhile there's this thing called a SAR, and I've actually been yelled at for asking, when I was supposed to integrate with something, can I see what the schema is? No, you're not allowed to. No, it's too confidential. So it's just this whole tangle of worms about what you actually do once you have evidence of financial crime, or even suspicion of it. What next steps you take are really tangled. And I think Weidong, you probably have a lot more experience about this in certain theaters too.
Weidong Yang [00:03:48]: So I have some similar experiences where you're not even allowed to see the schema, because the schema may actually reveal some secrets, or certain activities may become a liability to certain parties. So that can be pretty tricky.
Demetrios [00:04:07]: And so it basically gives away information. If you were looking at it, because you know the schema, you can guess a few other parts of this puzzle and get information that people don't want out there.
Paco Nathan [00:04:22]: The bank, the banks are using a lot of data that come from providers. There may be other cases where there's data that's coming from say, public sector agencies, crime investigations. There may be intelligence reports and so there may be parts of the schema that are highly sensitive and only certain people are allowed to see.
Demetrios [00:04:42]: But you were saying that with graphs and anonymizing that PII, you're still able to gather insights, right?
Paco Nathan [00:04:52]: Yeah, that was cool. We were just in a talk and Brad Corey from NICE Actimize was showing where they're preparing to do RAG, and they were using, I think, Bedrock, and they know that they've got a hot potato. They know they've got a lot of customer PII that just can't go outside the bank. So what they were doing is substituting PII with unique identifiers, tokens they generate on the fly, and then they make the round trip after they've run through LLMs and made a summary, and they replace the tokens with the highly confidential material they just keep internally. And so this is a way of being able to use some sort of external AI resources, but still manage a lot of data privacy.
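To make the round trip Paco describes concrete, here is a minimal Python sketch, assuming a simple regex-based detector. The function names and patterns are illustrative only, not the actual NICE Actimize or Bedrock implementation.

```python
import re
import uuid

# Sketch: swap PII for opaque tokens before calling an external LLM,
# then restore the confidential values afterwards. The regexes and
# function names here are illustrative assumptions.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def tokenize_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with unique placeholder tokens; keep the mapping in-house."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        for match in set(pattern.findall(text)):
            token = f"<{label}:{uuid.uuid4().hex[:8]}>"
            mapping[token] = match
            text = text.replace(match, token)
    return text, mapping

def detokenize(text: str, mapping: dict[str, str]) -> str:
    """After the LLM summary comes back, restore the confidential values."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

safe_text, vault = tokenize_pii("Customer 123-45-6789 wired funds to jane@example.com")
# summary = call_external_llm(safe_text)   # hypothetical external call; it sees tokens only
summary = safe_text                          # stand-in for the LLM round trip
print(detokenize(summary, vault))
```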
Demetrios [00:05:40]: Oh, that's cool. Yeah, I've seen it with. We had these folks on here from Tonic AI and they were talking about how they would use basically the same information but swapping it out. So if it is someone's name, they just change the name. So it went from Paco to John. And if it is a Social Security number, they would swap out the Social Security number and totally randomize the number, but it still is a Social Security number. So at the end of the day you get almost like this double blind. So even if you're a data scientist looking at the information, you can understand it, but you don't know if it is the true information that's going to reveal that PII.
Paco Nathan [00:06:29]: Interesting. Interesting. Yeah.
Weidong Yang [00:06:34]: Although I do see situations where even the structure of the document itself reveals information that you do not want people to know. In the investigation space, very often you do not want people to know who is being investigated, so even the structure of the document being revealed can become a problem. So Paco, at some point I felt like an in-house, on-prem LLM might be necessary. Especially, I just read news that the M3 Ultra Mac Studio with 500 gigs of RAM can run large language models at 20 tokens per second? That could potentially be an interesting solution for that.
Paco Nathan [00:07:37]: Yeah, I mean, for our end use cases, you know, like 60% of those are air gapped, and so the largest chunk of that is going to be a lot of public sector agencies running in SCIFs. So they can't send any data out. Yeah. And there's good news for running really interesting LLMs on, you know, local hardware. There's a lot of really good news. I will shout out to my friends over at Useful Sensors, Pete Warden and company. I'll put that in the chat. You can do a lot with local hardware.
Demetrios [00:08:18]: Yeah. What are they doing?
Paco Nathan [00:08:22]: Useful Sensors. So Pete Warden and Manjunath Kudlur, they were part of the TensorFlow team at Google, and for I think like eight years they evangelized the use of deep learning inside of products at Google internally, and then they left, and the team has a startup in Mountain View now. And what they're showing is, hey, here's like $50 worth of hardware, here's an Arm chip with a neural network accelerator on it, and we can run three LLMs on battery power. So it is pretty cool, because they came out of the TinyML world. I don't know if you've ever seen the conference.
Demetrios [00:09:04]: Oh yeah.
Paco Nathan [00:09:05]: And so this is a lot of the specialty that Pete has, and Manjunath, he was on the CUDA team at Nvidia before. So I mean, these folks really know how to make AI infrastructure run on hardware, and particularly how to handle a lot of low power and low latency kinds of situations and where to punch through the bottlenecks. You don't necessarily have to have a ginormous GPU cluster, although in some cases it helps, but especially when you're running inference, you can be running on much lower power and doing really interesting things out in the field.
Demetrios [00:09:48]: So wild. Now I know that we had originally wanted to chat a bit about this idea that I think, Wei, you had proposed, and it's a little bit of a differentiation on GraphRAG. So maybe you can set the scene for us, because I want to go deeper there.
Weidong Yang [00:10:14]: Yeah, I run the danger of pulling way too far, because fundamentally I think with LLMs, how machines process information has changed. Before LLMs, everything is exact, symbolic matching, like all the APIs, all the rigid data structures. Just think about Deep Blue beating chess. Everything is rigid knowledge as rules and things. LLMs changed everything, because LLMs start to understand things on a contextual basis, start to understand fuzzy things. And it suffers the same weakness as a human being, not exact: we glide over information, we draw conclusions, we make leaps, make jumps. But at the same time, the LLM's ability to reason like a human, that for me has fundamentally changed how we approach computing. And so in applying LLMs to analyze documents, my feeling, my analysis, is that now we can let the LLM work more like a human rather than like a machine.
Weidong Yang [00:11:44]: That also implies what kind of data structure is preferred for an LLM, which, I would argue, is a data structure, a data management approach, that preserves as much contextual information as possible, preserves as much nuance as possible. The subtle nuances may come out to be important. So I use the example of my wife, who is Brazilian. An American tourist in Brazil gets invited to a house party, is told the party starts at 6pm, so as a good American guy he shows up promptly on time at 6pm, and the hostess comes out still wrapped in the shower towel and totally confused. Right. And it turned out, over there, 6pm is when the hostess starts thinking about the party, starts going out shopping, preparing food, and getting ready, and people usually don't show up until like two or three hours later. And it's a big culture difference.
Weidong Yang [00:12:55]: Yeah, right. If we try to capture that in a knowledge graph, what kind of construct allows us to capture those subtle cultural nuances that might become important in understanding the document later? So I think that's the challenge. Yeah. Paco, you want to add something there? I'd like to hear what you think.
Paco Nathan [00:13:18]: Well, from a perspective of natural language, something that the models bring in, but it's kind of a nuance and I don't think it's talked about a lot. There's a very recursive nature to how we as people talk with each other and tell stories and share information. We do reference it in the sense of going down the rabbit hole. Like if you follow a thread too far, you're kind of going down the rabbit hole. And there's this very recursive nature of how we think and especially how we express. It certainly comes across in written language. Although we tend to think of written language as something linear. There's paragraphs and sentences and it can all be diagrammed, but when you look at the actual references that are inside of those sentences, they're making recursive calls throughout a story, throughout somebody's speech or throughout a book.
Paco Nathan [00:14:09]: And you know, we can try to linearize that and come up with like an index or a bibliography, but at the end of the day it's a graph, and you get this very self-referential thing in any text. And this is something that the LLMs have really, I think, pulled out. And in the talk we were just in, Tom Smoker from WhyHow was also showing how they leverage ontology, they leverage schema, and chase after information recursively. So that's just another kind of view on this. But I love how you all are approaching this. You have a very powerful view of kind of relaxing the constraints up front, but then having the context propagated through.
Weidong Yang [00:14:55]: I realized there's an important difference in philosophical approach between East and West. Western philosophy very much drives towards the nature of things, and that's important: that curiosity about the nature of things, the desire to have a definitive definition of the nature of something, led to the great scientific discoveries over the past several hundred years. Eastern philosophy, on the other side, is focused on the contextual, focused on the shifting, changing nature of things. Like the Chinese Daoist "bible", the first verse says that if you name something, you get it wrong; it's not permanent. It's really focused on the impermanence of things. It focuses on how everything changes its nature in context with other things.
Weidong Yang [00:15:52]: So that is essentially a graph. Now you're putting both things together. Okay. I have to say that the attitude of, oh, everything changes, thus we cannot say anything, thus everything is fuzzy, very much contributed to why Chinese technology and science developed very far until about a thousand years ago and then stalled. And a lot of that is attributed to this philosophical attitude, which reduced a lot of the curiosity to drive down deeper into the nature of things. However, there is some practical application of that approach, which today, with LLMs and graphs, we really see: it's a great combination where you allow certain things to be drilled down, to be very definitive, to be clearly defined within the context, but a lot of the contextual information stays fuzzy. So in fact, Paco, I feel really excited about integrating Senzing and our GraphXR together as a solution, because Senzing helps to drive this definitive part.
Weidong Yang [00:17:09]: Once you have the definitive part drilled down, named, defined, it really speeds things up to make a lot of assessments fast, definitive, and precise, which is crucially important. But on the other hand, you allow this loose structure of information, decomposed as a graph, that you can easily retrieve without losing the nuances, the subtlety, the cultural differences; you still preserve that. So those things coming together, my feeling is, is how you want to ground the LLM: to make it precise, accurate, and to know its limits, know when it does not know and not make a judgment. I think that's also very, very important. So in my mind, graph and AI right now present an opportunity to allow this Western way of driving at the nature of things and the Eastern way of focusing on the contextual information to come together to solve practical problems.
Paco Nathan [00:18:18]: So, so very well said. And you know, the challenge that we face is we don't really know what the downstream application will be like. We're doing investigation, we're doing some kind of discovery. Whether you're trying to find, you know, money launderers or whether you're trying to find, you know, who's my best customer for this hotel. It's a discovery process and by nature of discovery, you don't know what the answers are. In fact, in a complex system you don't even know where or how to, you know, it's Unknown unknowns. Right. So by preserving that context then you are sort of fortifying yourself so that when the time presents, you'll be able to make the right discoveries.
Paco Nathan [00:18:58]: You won't have cut them off in advance. If you go back to before relational databases came out, you go back to some of the earlier writings from Ted Codd, and one of his colleagues was William Kent, who did a book called Data and Reality. If you go back to some of the early, like 1970s, thinking about data management, it's really interesting to see where the lines are drawn. Because in this Western view, so much of data management was about, let's have a data warehouse, let's pretty much throw away the relationships, let's focus on the facts. We have, as we were saying, a very Western view of, I just want to know millions of facts and I will piece them together with a query; I'm not really interested in preserving the context. So I think we have a long history from data warehousing of going too far on the Western side.
Demetrios [00:19:54]: Well, what is interesting to me is the conversation that we had with Robert Caulk here probably three months ago, and how he said they've completely thrown out ontologies, and for his specific use case that isn't the way that they wanted to go. And I wonder if you guys have thought through that and what that looks like, what the benefits are, and is it one of these things where you potentially are experimenting on those levels too?
Weidong Yang [00:20:28]: In my perspective, ontology is important, but you have to know the boundaries. I'll give a parallel to theories in physics, like Newton's laws. Newton's laws are important; they capture important truths about nature. However, just like any physics theory, and I'm a physicist, the moment a theory is proposed, a very important concept is that it is waiting to be disproved. So you never accept a theory as the truth of everything. Well, Paco is a mad scientist, so I think he's also very familiar with the concept.
Weidong Yang [00:21:08]: When you propose a theory, it may test true, but you're always looking for situations, looking for the boundaries, where the theory stops being true. So I don't think ontology is anything different. An ontology needs to be very well grounded; the context needs to be defined, and within this context the ontology's knowledge is real. The problem I see with a lot of traditional knowledge graph approaches is that people ignore the fact that an ontology has to be confined within a specific domain. The moment you step out of the domain, you have a problem. But the other thing is, within this domain, ontology is fantastic. It helps you to solve problems so much faster, so much more precisely. But again, as long as you can define the boundaries, define the domains, it's great.
Paco Nathan [00:22:12]: What Rob Caulk and Elin Törnquist and others at Ask News are doing is they're looking at news sources, especially regional news sources across the world, and they really are finding hard evidence, groundbreaking evidence on the ground, literally. If you're doing ESG work and you're trying to do due diligence on a company or a set of suppliers, and you want to find out what their operations are really like over in that other country where they're based, and then you find out they're engaged in, I don't know, child labor or something, you want to make other arrangements before your shareholders find out. So I think with Ask News, they're out there working with those publishers, collecting that news and representing it in a graph. And yeah, as you were saying, ontologies really don't work across domains. You really want to focus more on a closed world within a domain. Having a full enterprise-wide ontology: nice idea, but I rarely see it work.
Paco Nathan [00:23:22]: And I think that in the case of like understanding news reports and the world, you don't know what the domain is in advance. You only know this is what is being published. And so I think by relaxing that constraint at Ask News, they're able to come up with a graph of like, here are things that are related, you can, you can follow this evidence and you can find more historically about this area. I think those are very important. But ultimately it will be shaped by some kind of context, some type of shared definitions. And ontology is really more about sharing definitions and making sure we're describing the same thing. Because I swear, you go to a big company, use the word customer in front of one VP in sales, it means something different to the VP in charge of procurement. So even the words themselves don't cross domains.
Paco Nathan [00:24:15]: The graph is basically our idea that we know that there's connections. Like if you, if you do have your, your operations data, but then you also have your, your like sales data, you know, there's some connections across there. It's not exactly the same, but some stuff is connecting. So graphs show where those connections are. But I think, you know, think about like the example of Google Maps. Like there's different levels of detail and of course, any video game of course has this too. But you know, if you're taking satellite data and like trying to stitch together a map, you zoom, can see the beach and you zoom in, you see the car tracks and you zoom in further. At some point you're going to get to pixels, right?
Demetrios [00:24:53]: Yeah.
Paco Nathan [00:24:54]: And you zoom out and maybe you see this landscape of like a beach next to the ocean. But then probably you zoom out at some level and they've got like the name of the beach. Right. So there's like a high level detail. I think graphs are much the same. There are connections at the low level, like Ask News is saying is like, you know, here's reporting from Zimbabwe. This is like the reporters on the ground. But then you zoom out and you're like, okay, well you know, what impact does this have on our supply network? Do we have to really make different plans? Is there going to be like a war breaking out that causes all those shipping containers to be delayed by three months? I think at some level you need to think of the graphs as sort of collecting higher and higher into more abstracted, more refined concepts, if you will.
Paco Nathan [00:25:43]: And so the stuff at the low level is kind of like, let's see how it all fits together. The stuff at a higher level it's like, oh, actually we can maybe do some inference on this or we can use this to help structure other data that we're going to piece together.
Weidong Yang [00:25:57]: So Demetrios, you actually touch upon a really big subject there.
Weidong Yang [00:26:06]: The exploratory process is coming up with the questions. Knowing what question to ask is often 80, 90% of the work. So a prescribed thing that gives you the answer often misses the point, or misses important subtlety. But the problem is, how do you discover the question you need to ask? And the way our brain works, our perception, our visual perception: our brain is fantastic. I don't want to call it a machine, I don't even want to call it a tool, but it has this great power of seeing patterns in the information. Like we look out at the sky, we see the clouds, we have some concept, we have some kind of... like, you are a performer.
Weidong Yang [00:26:58]: I look at your performance, your dance; there's information being expressed without being able to verbalize it, to define it. But you have to watch it to feel it. Maybe if you watch long enough, you start being able to describe it. You start being able to say, oh, there is something there. So in a way, the graph is a fantastic medium for visualization. You look at the information expressed, and it's just like how our brain works. When I think about you, Demetrios, I immediately think about Paco, because we're in the same podcast room together. So that's association.
Weidong Yang [00:27:39]: So this association of multiple pieces, information entities in the space: if you visualize it effectively, it helps you to see the patterns, helps you to see the missing links, missing patterns, things that get our attention. And then we can start to formulate the question, and formulate the answer to the question. So more than a tabular data structure, I have to say, the graph really helps us to engage our brain in this way to spot important information. Just go watch a dance performance: you see something definitive happening, and you know it before you engage your language, your logical thinking. Afterwards, things start, concepts start to form, and then you can start to build things around it.
Demetrios [00:28:41]: Oh, dude, how cool is that? Yeah, it's. You know it before you can express it in that way.
Weidong Yang [00:28:47]: Absolutely, yeah. I think a lot of analytics workflows work the other way around. We focus so much on building up the queries, building up the programs, to drive the answer. But as Paco and I, in the investigative space, all know, too often getting the hint is 80% of the work.
Paco Nathan [00:29:18]: Like, if you know that you're being attacked, you know that they came in through some vector, there's probably some set of machines that are compromised. You're not seeing that. You're seeing where, you know, the bad things are happening, stuff is being stolen or whatever. So looking across your network, just building up a graph of, like, the associations of what's happening during an attack, there's some placeholders. There are definite questions that could be generated, like, which machine was compromised? Maybe I should fix that. So I think from an operational perspective, you know, I mean, you kind of have to think of. I mean, we do think about that, right? We do think about, like.
Weidong Yang [00:29:52]: Right.
Paco Nathan [00:29:52]: How do we identify those unknowns? But the problem is that the more complex the problem becomes, the more that those unknowns are not something that can really be charted. They have to be sort of poked at and explored.
Weidong Yang [00:30:05]: Yeah.
Demetrios [00:30:05]: And I think that's why, Wei, what you're saying, with the graph being this visual medium that we can poke at and we can explore, and it gives us a different perspective with which we can work with and wrestle with the data, is something that I hadn't heard before. But it makes complete sense from a historical perspective.
Paco Nathan [00:30:27]: In terms of data, you know, something to bring out would be spreadsheets, because spreadsheets are sort of my go-to example of this. It's all in tabular form, it's very sort of, you know, left brain, everything is very buttoned down. But the thing about spreadsheets that you never see is there's a really complex graph behind them, and they only work because of that. But they never show that, they just show the tabular part. But all the real knowledge and dynamics, all the real information you're capturing in a spreadsheet, is about those different dependencies and how that graph functions.
Demetrios [00:31:00]: Classic. Of course we don't see it because that would be absolute chaos for us. Right.
Paco Nathan [00:31:05]: Mind blown.
Weidong Yang [00:31:07]: The graph is this fantastic medium for this perceptive thinking. Well, the challenge is, when we talk about graph, I think we really need to separate two things: graph as a medium of information capture, and graph as a medium to help us think. They are two different things. For graph as information capture, the sole purpose is to capture information as precisely as possible, as completely as possible. You want to capture as much truth as possible. However, for graph as a way of thinking, the raw captured graph preserves a lot of truth, but the problem is we can only hold about seven pieces of information in our brain at any given moment.
Weidong Yang [00:31:58]: We'll be overwhelmed by all those graphs. Let's think about our brain in that way. Even the vector embedding, I call it an implicit graph, because vector embeddings give you a medium to compute similarity effectively. You can construct a graph.
Paco Nathan [00:32:18]: Construct a graph out of it.
Weidong Yang [00:32:19]: Exactly. You can manifest a graph out of it. So you will see that the graph captured at that layer, at that stage, is really designed to preserve the ground truth, as much truth as possible. But then you need a way to work the data into a form that we can easily digest with our perceptive power. That is the challenge. This is also why, in my mind, a lot of people know in theory that the graph is how we think, and thus is important, but in practice there is a barrier. How do you reconcile the need between graph as an information capture medium and graph as a medium to support our perceptive thinking?
Weidong Yang [00:33:11]: It's a very different thing.
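Wei's point that embeddings are an "implicit graph" can be made concrete: from a similarity matrix you can materialize an explicit k-nearest-neighbor graph. Below is a small sketch using numpy and networkx; the embeddings are random stand-ins, and k is an arbitrary choice.

```python
import numpy as np
import networkx as nx

# Sketch: manifest an explicit graph out of an "implicit" one.
# Embeddings let you compute similarity; keeping each node's top-k
# most similar neighbors as edges yields a kNN similarity graph.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 64))          # stand-in document embeddings
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normed @ normed.T                   # cosine similarity matrix

k = 5
G = nx.Graph()
for i, row in enumerate(similarity):
    # skip self-similarity (rank 0), keep the k closest neighbors
    neighbors = np.argsort(row)[::-1][1:k + 1]
    for j in neighbors:
        G.add_edge(i, int(j), weight=float(row[j]))

print(G.number_of_nodes(), G.number_of_edges())
```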
Demetrios [00:33:16]: Just going back to what you were saying, with how we can relate to each other because we're on this podcast together, we've done stuff together. Maybe there's certain things that come up in our memories that are going to be the most pertinent to that graph that we have in our head, but it's never going to expand more than seven hops or seven different parts of that graph.
Paco Nathan [00:33:44]: Have you ever worked with, there's a kind of, I guess, rubric might be a way to say it, that came out of Carnegie Mellon, out of CMU. Jeannette Wing had this idea of what's called computational thinking. And so it was sort of a four-step process of breaking down a problem and then being able to abstract it back out. It's really powerful, and I've used it a lot in courses teaching people. But I think that there may be something kind of emerging as graph thinking. And so just to throw out a strawman here, this is kind of thinking out loud. But one of the things that we see in fin crime, in financial investigations, is a kind of graph thinking, a four-step process repeated over and over, where, you know, you do your best to build out this graph, and it might have hundreds of millions of nodes or billions of nodes or some ginormous number, something beyond human scale, beyond human comprehension. But then step two: partition.
Paco Nathan [00:34:46]: So can we break out this enormous graph into some subgraphs, some patterns that are interesting? Like, hey, this looks like a really good customer. Or hey, this looks like a money mule fraud scheme. And so you do this dimensional reduction, because you go from like 5 billion nodes in a graph down to maybe 10 or 20 subgraphs that are interesting. And there are graph algorithms like Louvain, or weakly connected components, or different ways to get down to that scale. And in machine learning in general, we're looking a lot at dimensional reduction, right? So once you've got down to that scale, now you can use other graph algorithms, like maybe betweenness centrality or different forms of centrality, to understand how these parts are connected. And gosh, maybe there's one node in there who's orchestrating the whole crime ring, which is typically the case. There might be a person with a bunch of shell companies, right, and they're doing fraud. So that's step three: leveraging certain types of graph algorithms, think of PageRank, to bubble up to the top the parts that are probably good first steps to investigate.
Paco Nathan [00:35:59]: And then step four, put it through a work process. And I mean, if you're working with people in a bank, put it through case management tools; a level-one analyst gets assigned it, they go and they start poking around the graph, they do something interactive, they work with the visualization and they apply what they've learned. Or you may have some agents involved there too, to help summarize and dig up parts. But it's a workflow. So it's kind of a four-step process of graph thinking, if you will, that can be applied and can integrate people and AI technology together.
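As a rough illustration of the first three steps of that loop, here is a toy networkx sketch: build a graph, partition it with Louvain, then rank nodes inside each community with a centrality score. The transaction data is fabricated purely for illustration, and real fin-crime graphs are orders of magnitude larger.

```python
import networkx as nx

# Step 1: build the graph (accounts and transfers). Toy data only.
G = nx.Graph()
edges = [("a1", "a2"), ("a2", "a3"), ("a3", "a1"),        # tight ring
         ("a3", "shell_co"), ("a4", "a5"), ("a5", "a6")]  # plus a quieter tail
G.add_edges_from(edges)

# Step 2: partition into candidate subgraphs (Louvain community detection;
# greedy_modularity_communities is a stand-in on older networkx versions).
communities = nx.community.louvain_communities(G, seed=42)

# Step 3: inside each community, rank nodes by centrality to surface a
# likely orchestrator (PageRank here; betweenness works similarly).
for nodes in communities:
    sub = G.subgraph(nodes)
    ranked = sorted(nx.pagerank(sub).items(), key=lambda kv: kv[1], reverse=True)
    print("community:", sorted(nodes), "-> top node:", ranked[0][0])

# Step 4 would hand the top-ranked subgraphs to a case-management
# workflow for a human analyst (not shown).
```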
Weidong Yang [00:36:37]: Yeah, I want to add one more thing to what Paco said. It's really, really important to be able to narrow it down, to identify things, to reduce, reduce, reduce. But there's also another aspect, which is simplification, abstraction. Very often when you capture the data, you don't really know the domain, or you don't know the future question. So the domain is wide. But when you look for the answer, the domain is narrowed. When the domain is narrowed, for example, I call Paco a mad scientist; at some point I can just refer to Paco as the mad scientist. I don't need to add information, because the mad scientist is Paco, and that's only valid in a specific domain. So the reason I say that is because when the domain is wide, when you capture information, I prefer what I call a pure edge approach.
Weidong Yang [00:37:41]: In the graph, an edge has no properties; it's just an edge, just an association. Anything that needs properties, anything that may be amended later, that may have something pointing to it or out of it, you keep as a node. Now, as you're thinking, very often a relationship like "I know Paco" can carry a lot of context in it already. I don't need additional information to tell how I know Paco. It can just be in there. "I know Paco" itself is sufficient.
Weidong Yang [00:38:19]: So what that means is, when we present "I know Paco" as a single relationship, in the data layer there might be thousands or tens of thousands of pieces of information there, but it comes out as one single piece of concise information. I think that is where an analytical workflow, or visual analytics workflow, should be: being able to go from very detailed, broad, large information and distill or aggregate it down to a simple representation that is grounded in that particular domain, in that particular context. So we can communicate in simple language rather than carrying a lot of information every time. "I know Paco," that's it. We don't need to know how we know each other, or where we know each other from, in certain contexts.
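A minimal sketch of that pure-edge idea: fine-grained context is captured as nodes (meetings, podcasts, email threads), and for a given domain all of it is collapsed into a single concise relationship. The node names and the "KNOWS" label are illustrative assumptions, not Kineviz's actual data model.

```python
import networkx as nx

capture = nx.Graph()
# Capture layer: people connected only through context nodes.
for event in ["podcast_2024", "graph_meetup", "email_thread_17"]:
    capture.add_edge("Wei", event)
    capture.add_edge("Paco", event)

def summarize_knows(g: nx.Graph, a: str, b: str) -> nx.Graph:
    """Project the detailed capture graph into a single presentation edge."""
    shared = set(g.neighbors(a)) & set(g.neighbors(b))
    view = nx.Graph()
    if shared:
        # one edge carries the conclusion; the evidence count rides along
        view.add_edge(a, b, relation="KNOWS", evidence=len(shared))
    return view

presentation = summarize_knows(capture, "Wei", "Paco")
print(list(presentation.edges(data=True)))
# [('Wei', 'Paco', {'relation': 'KNOWS', 'evidence': 3})]
```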
Demetrios [00:39:23]: Is it almost like the data underneath is like an iceberg in a way, and you knowing Paco is like the tip of the iceberg? You have that one piece of information, but then if you wanted to get more granular, you can go down and see the whole iceberg.
Weidong Yang [00:39:43]: Yes way.
Paco Nathan [00:39:45]: Could we say then that, you know, we pull everything in, we connect everything together, and it's very noisy. We can go up different levels of abstraction, but to your point, then we're going up levels of abstraction in particular domains, for a purpose. So we have some shared definitions, and then we can start to say, okay, now let's do our Louvain partitioning or whatever. Then we start to drill down into subgraphs. It's like maybe a five-step process.
Weidong Yang [00:40:14]: Yeah. Even with a Louvain community calculation or any centrality calculation, the graph has to be simple, because very often the graph we talk about is what I call a multimodal graph, a multi-domain graph that has different types of information in one graph. So computing a centrality on that kind of hybrid graph, as a hypergraph, is very challenging. Or what does the result even mean if you mix humans and emails? It's difficult. So that process itself, to me, means we already need to prepare, to transform our graph data into a form that is suitable for that centrality computation. Very often you have to project into a specific domain for that computation to happen.
Paco Nathan [00:41:18]: Very good.
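One common way to do the projection Wei describes is a bipartite projection: a mixed person/email graph is awkward for centrality, so collapse it into a person-to-person graph before computing anything. A small sketch under those assumptions, with illustrative data:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Mixed-domain graph: people and emails in the same structure.
B = nx.Graph()
people = {"alice", "bob", "carol"}
emails = {"mail_1", "mail_2"}
B.add_nodes_from(people, kind="person")
B.add_nodes_from(emails, kind="email")
B.add_edges_from([("alice", "mail_1"), ("bob", "mail_1"),
                  ("bob", "mail_2"), ("carol", "mail_2")])

# Project into a single domain: people linked when they share an email.
person_graph = bipartite.weighted_projected_graph(B, people)

# Centrality is now computed over a people-only graph, so the result
# has a clear interpretation within that domain.
print(nx.betweenness_centrality(person_graph))
```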
Demetrios [00:41:19]: That that's what I was thinking is like the data that you have only becomes relevant once you've narrowed it down in a certain way and you're looking at a certain plane of that domain and you say okay now, now we're going to be focusing in on this plane. That's when certain nodes and certain data and certain connections become relevant. Because you're looking at that layer almost in my head if I visualize it and we're talking about that Google Maps example again, you're diving deeper and deeper and you see different structures depending on the layer that you're looking at.
Paco Nathan [00:42:06]: And this fits very well with data mesh kinds of concepts, you know, Zhamak Dehghani talking about how different domains share. You have to abstract and you have to come up with relations. I think Chad also has the idea of contracts, you know, where you have relations across domains. So you share some definitions. You have to condense down to that level before you can go across domains. So yeah, if we use the domains in an organization to kind of guide when and where and how we condense down, then we can really, really take advantage of this kind of abstraction.
Demetrios [00:42:43]: But it's almost like, I realized after I said it, there's two vectors, or two dimensions, that you are looking at when you are zooming in or zooming out. Because you're playing on the field of granularity, but you're also playing on the field of the domain and what is relevant in that domain. So if we have that X and Y axis, you can get more granular inside of the domain, but then you can also just go along the X axis and change domains. And so, like a kaleidoscope, when you turn it, you see a whole different set of relationships.
Paco Nathan [00:43:31]: Yeah. And I mean, in an enterprise context this gets really bizarre, because the people in the domains that you depend on may not even know that you're out there. You know, you may be consuming from some log files from another application that are totally driving your product. So can we have some sort of contract so that we know about each other? But yeah, scooting across the domains, that's the key challenge to leveraging these kinds of technologies, because usually you are in a particular domain when you're making those decisions. But for most applications you have to combine a couple of domains. Right. So usually there's something interesting going on between, like, sales and procurement, or sales and marketing, or some other business unit. So oftentimes you will have to combine.
Demetrios [00:44:21]: And do you then try and create two different graphs that are connected to each other, or is it one larger graph? How do you look at it in that regard?
Paco Nathan [00:44:35]: Well, federation sounds good. I think trying to have one ginormous graph is usually weird, and those projects usually don't ever end. But federating and being able to go across domains and say, okay, over there, let me send you something, I'd like to know what results you can bring back. So are you making a prompt in GraphRAG across a different domain? Are you making a query? Are you running some algorithm? Whatever. There's some kind of information transfer, but federation.
Weidong Yang [00:45:06]: Yeah, I can talk about a couple of my personal experiences. First, bringing information into a graph is a step forward, a step up, because information in a tabular format needs to be confined to very specific definitions; it's a pretty narrow domain. Here's one example: I looked at the US flight records. You can download them from the Department of Transportation; they release them every two weeks after the fact, and the damn thing has 140 columns, I think, really, really wide. And the reason is because the flight may get diverted.
Weidong Yang [00:45:53]: Whenever the flight gets diverted, you add about 10 or 15 columns of information. So then you need to capture that the flight may be diverted more than once. You need it twice; is that enough? No, some are diverted three times. Three is not enough; no, some are four times.
Weidong Yang [00:46:10]: So they actually allow five diversions. But if your flight is diverted six times, too bad, it cannot exist. So that's the limit of the tabular format for information capture. With a graph, that's relaxed a lot. Naturally you can have a thousand diversions, I don't care; the graph can just keep amending to it. So that is really, really a big improvement with the graph, to allow you to have a lot more flexibility in capturing the information.
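Here is a rough sketch of that modeling shift: instead of a fixed block of DIV1, DIV2, ... columns, each diversion becomes its own node hanging off the flight, so any number of diversions needs no schema change. The column names are illustrative, not the exact Bureau of Transportation Statistics header.

```python
import networkx as nx

# One wide table row with repeated diversion columns (illustrative names).
row = {
    "FLIGHT_ID": "UA123_2024-01-02",
    "ORIGIN": "SFO", "DEST": "ORD",
    "DIV1_AIRPORT": "DEN", "DIV2_AIRPORT": "OMA", "DIV3_AIRPORT": None,
}

G = nx.DiGraph()
flight = row["FLIGHT_ID"]
G.add_node(flight, kind="flight", origin=row["ORIGIN"], dest=row["DEST"])

i = 1
while row.get(f"DIV{i}_AIRPORT"):            # one node per diversion, unbounded
    div_node = f"{flight}/div{i}"
    G.add_node(div_node, kind="diversion", airport=row[f"DIV{i}_AIRPORT"], order=i)
    G.add_edge(flight, div_node, relation="DIVERTED_TO")
    i += 1

print(list(G.edges(data=True)))
```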
Weidong Yang [00:46:46]: And the other thing is, very often in the tabular format, it's very difficult to check the mismatches. We have an example of bringing in datasets managed by two or three different departments in the same organization: everybody knows the other person's data has a problem, but you can't force other people to fix it. But with the graph, when you bring things together, you immediately see the mismatches. So we have one example of a company that spent a couple of years and could not reconcile the data. But once they brought the data into a graph, they started to see the mismatches. In one month they fixed the data problem.
Demetrios [00:47:31]: But they start to see the mismatch because of the dependencies.
Weidong Yang [00:47:36]: Because now, let's say you know a record is unique, right? But then when you link the other records together, you see, oh, this record is actually duplicating what other systems recorded differently. Somebody made a mistake there.
Paco Nathan [00:47:54]: Yeah, we see that a lot for entity resolution, where you think a Social Security number should be unique, but then you're bringing in data from some other sources. And there was an application where maybe early on the product manager said, yeah, we need to collect the Social Security number, and then later on they said, oh no, we can't do that, just put in, you know, a dummy number. And so now you've got this data set that has, you know, 5,000 instances of the same Social Security number. So once you start to put it in a graph, you're like, wait, isn't that supposed to be unique? How come there's this enormous node with all these things connected to it? Something's wrong.
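The data-quality signal Paco mentions is easy to see once identifiers become nodes: a "unique" value shared by thousands of records shows up as one suspiciously high-degree node. A toy sketch, with fabricated data:

```python
import networkx as nx

G = nx.Graph()
# 5,000 customer records all pointing at the same dummy SSN node...
for i in range(5000):
    G.add_edge(f"customer_{i}", "ssn_000-00-0000")
# ...plus one record with a genuinely unique value.
G.add_edge("customer_real", "ssn_123-45-6789")

# An SSN node with degree > 1 violates the "supposed to be unique" assumption.
suspicious = [(node, deg) for node, deg in G.degree()
              if node.startswith("ssn_") and deg > 1]
print(suspicious)   # [('ssn_000-00-0000', 5000)]
```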
Demetrios [00:48:33]: So it's really also a great way to figure out data quality issues.
Paco Nathan [00:48:38]: Yeah. Although there's security. I mean going back to what we were talking about before. If you are looking in financial investigations, if you're looking at some sort of criminal investigation, okay, maybe you've got some open data like here's you know, sanctioned shell companies or whatever. And then maybe you've got some private information like customers. But maybe you've also got some feeds of like oh yeah, here's an active investigation. We're looking at these people, but then these particular people, they have, you know, immunity because they're diplomats. So like there's all these different levels of security and you, you start to pull it all together in a graph, you get a very comprehensive view.
Paco Nathan [00:49:21]: Maybe not everybody can even see that. Like you don't, you know, you don't want the police officers who are doing parking tickets to know that, you know, XYZ diplomat might be investigated for a crime. Like that information should not go out. Yeah. So where do you draw the line? Because the graph really brings it all together that then how do you handle security issues?
Weidong Yang [00:49:44]: Yeah, access control with a graph is automatically harder than with a tabular, relational database.
Demetrios [00:49:52]: Well, it feels like one of these things you were talking about, Wei, with the ways that you visualize it. You can almost create different access controls on the visualizations. So I don't know if you've thought through that in a way, but is that kind of how you go about it?
Weidong Yang [00:50:14]: So fundamentally, access control needs to be in the data management layer. If the database can support access control, you're great. However, we actually run into situations where the database does not have sufficient access control to support the business need. So in that situation we actually have to implement a filter layer in the data access: when we pull the data from the database, depending on the roles and teams, we prohibit certain information from being accessed. But that's not a fundamental solution. The fundamental solution has to be in the data management layer.
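A minimal sketch of what such a filter layer between the database and the client might look like, assuming a simple role-to-policy table; the role names, labels, and properties are hypothetical, not any product's actual policy model.

```python
# Role-based redaction applied to raw query results before they are returned.
ROLE_POLICY = {
    "analyst":    {"hidden_labels": {"diplomat"}, "hidden_props": {"ssn"}},
    "parking":    {"hidden_labels": {"diplomat", "active_investigation"},
                   "hidden_props": {"ssn", "address"}},
    "supervisor": {"hidden_labels": set(), "hidden_props": set()},
}

def filter_results(nodes: list[dict], role: str) -> list[dict]:
    """Drop nodes and properties the caller's role may not see."""
    policy = ROLE_POLICY[role]
    visible = []
    for node in nodes:
        if node.get("label") in policy["hidden_labels"]:
            continue                                   # drop the whole node
        visible.append({k: v for k, v in node.items()
                        if k not in policy["hidden_props"]})
    return visible

raw = [{"id": 1, "label": "customer", "ssn": "123-45-6789"},
       {"id": 2, "label": "diplomat", "name": "X"}]
print(filter_results(raw, "parking"))   # [{'id': 1, 'label': 'customer'}]
```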
Paco Nathan [00:51:00]: It's a hard problem. In previous work, which was more like knowledge graphs being used for large-scale manufacturing, one of the things we ran into is security access. Because you take procurement data plus some operations data plus some sales data, put it all into a graph, and suddenly you have a picture of how the company works. But it's a really confidential picture. It's like maybe the board could see this, but nobody else in the company should see it. So there's a real power there, but there's always a risk. And how do you manage that is a mind-bogglingly difficult problem.
Weidong Yang [00:51:41]: I read a book talking about certain intelligence communities when they go to other countries. In the past you would use falsified identities, but today that's not a good idea anymore, because of all the open-source intelligence out there. Even if you want to withhold certain information, people can stitch together a picture, because related pieces of information sit out there on social media. Maybe there's a picture of you with somebody that you did not take, you did not post, but somebody posted on Instagram. And so all that information out there is essentially a graph that can link back to you, even though you try really hard to stay hidden. That's a fundamental problem in terms of privacy and security: you want to control access to information, but you have all those connections in the graph, and that makes it really, really hard.
Paco Nathan [00:52:54]: And correlated with that, when I talk with people in enterprises who are doing large-scale knowledge graph practices, the one thing that I keep hearing over and over again is companies using graphs for market intelligence, or maybe sometimes you would say competitive intelligence. But a lot of this might be for sales win-back strategies: trying to understand who's the competitor that got our bid away from us, how can we go back and try to give them a better quote?
Demetrios [00:53:25]: Oh wow.
Paco Nathan [00:53:25]: And so I've, I've heard this over and over again where like that's one of the first graphs that starts making a lot of money is like literally doing intelligence inside the enterprise.
Demetrios [00:53:38]: Yeah, I was going to go down that route of, like, let's talk about a few other cool use cases that you have seen, whether it's just graphs or it is GraphRAG, which is a hot term these days, you know.
Paco Nathan [00:53:56]: I mean, you know it's interesting. There's a lot of graph database vendors and they really kind of lean heavy on the, on the graph query side of how to run this. And that's something that's very familiar with people in data engineering, data science, you know, using a query. But I think in the graph space there are other areas that aren't query first like using graph algorithms or using. There's a whole other area of what should be called statistical relational learning. But you know, you've probably heard of like Bayesian nets or causality or different areas over there of using graphs. But then there's also graph neural networks, like how can we train deep learning models to like understand patterns and try to suggest, hey, I'm looking at like all the contracts you have with your vendors and I noticed that these three here are missing some terms. Do you, you know, is that a mistake? So I I think that, you know, there's.
Paco Nathan [00:54:53]: There's the queries, there's the algorithms, there's the causality area, there's also graph neural networks, and there's a few other areas too. But these are all different camps inside of the graph space. They don't always necessarily talk with each other. But I think it's really fascinating now that we're starting to see more and more hybrid integrations of them.
Weidong Yang [00:55:21]: Yeah, I'd like to point out that fundamentally, graph and table are two sides of the same coin. As a physicist, we look at sound, at music, both from the frequency domain, like, is it A, C, D, E, F, what's the frequency distribution, and also from the waveform, the time domain. In some situations you want to filter or access things in the frequency domain; sometimes it makes more sense in the waveform domain. Same data. A graph, essentially, is a join. If you think about the large language model neural network, it's a graph, but it's a gigantic, extremely sparse matrix, which is a table, right? And the fact that it's such a giant sparse matrix is why Nvidia is so hot today, because Nvidia has these GPUs that can process those matrices. But guess what? My brain consumes about 19 watts of energy. The GPUs running a large language model consume tens of thousands of watts of energy for similar computation needs. And that's extremely inefficient.
Weidong Yang [00:56:48]: Even though the computation unit is much smaller than my neuron, you'd think it should compute with higher efficiency. That's precisely because they're dealing with an extremely sparse matrix: they're not treating the neural network as a graph, they're treating the neural network as a matrix. And that's fundamentally the problem for power efficiency. So there are certain models coming up that really deal with AI as a graph, with several orders of magnitude of savings in energy consumption. So, in real-world applications, one of the reasons why graph hasn't been taking off as we all think, for the past 20 years it's been, oh, graph is going to take off, graph is going to take off. But no, it did not.
Weidong Yang [00:57:35]: The fundamental problem is that we are so familiar with all the tools and methodologies; the workflows are well established in the tabular-based way of thinking. The Department of Transportation does not release the flight data as a graph. They release it as a table because it's easy to access; we have all the mature tooling. To change that is extremely difficult. So in a way, I would argue that AI is almost made for graph, because AI suddenly allows you to process unstructured information, like emails, reports, podcast transcriptions like this one, videos, into a structured form that a computer can access. But guess what? It is a graph that AI will convert that data into. So now you suddenly have this. Some people argue, I think, that 80% of information exists in unstructured form.
Weidong Yang [00:58:41]: Some people argue that the percentage is even larger. So AI suddenly makes the majority of the information available for the analytic workflow, for assessment. And the funny thing is, it needs graph to do that. So my assessment is that because of AI, because of generative AI, we are actually entering a boom, an exponential growth era for graph, because of the availability of the data. It's...
Demetrios [00:59:16]: Like the Internet of things. We've been waiting for it to happen since 20, 20, 10 or 2005, whenever, and it's always just around the corner. But now it does make sense that if you have all of this unstructured data and you have these relations, then that sounds like a graph to me.
Paco Nathan [00:59:37]: Yeah. And going back to 1980s-era hard AI, whether we're talking about A-star, B-star kinds of algorithms or talking about planning systems, all of these were expressed as graphs. And some of the early thinking that was pre-Google, that led to Google, they were talking about graphs. Some of that work actually came out of groupware, but based on graphs. So it's there.
Demetrios [01:00:05]: Funny you say that, because one of the talks at the AI Quality Conference back last year was from the guy who created Docker, Solomon. And his whole talk was really, like, everything's a graph if we really break it down. It's all graphs and how one thing relates to another thing.
Paco Nathan [01:00:26]: I'll throw something else in to kind of go back to our early part, where we were talking about East meets West. There's a really favorite book of mine from the early days, this is going back to the early 90s, the early days of neural networks, about this idea of, yeah, there are some conventions in the West, maybe we can back off. It's by a USC professor called Bart Kosko. It's called Fuzzy Thinking, and it's sort of his critique of science, but more from a lens of Eastern perspectives. I know that this book is more than 30 years old, but I think that there are some really great perspectives there that weigh in a lot.
Paco Nathan [01:01:09]: Especially what Wei was saying about like, where are we now with LLMs and how are we leveraging those in the context of graphs.
Demetrios [01:01:17]: So I think the other thing. Was there anything else that you guys wanted to talk about before we jump? I know there's a lot of cool data visualization stuff that you're doing, Wei.
Weidong Yang [01:01:30]: Yeah, I just want to add one thing. I just want to say the visualization is not the end; the goal is to support analytics. So I know everybody, when it comes to graph, talks about graph visualizations, but in my mind what we really need is visual analytics. How can we visually transform the information? How can we visually go from information that was suited for data management, for data capture, and work it step by step towards information that's suitable for presentation, for answering the specific questions in that particular domain? That step requires a transformation of the data. It's not just a filter, but fundamentally a mutation of the graph schema: the schema you have for data capture is not a schema suitable for presentation. They are two different things. If you think about the big data era, the development of MapReduce allowed you to have this step-by-step flow of information from the originally captured tabular format into a final, very different table that you can present.
Weidong Yang [01:02:54]: Graph analytics needs the same thing: a step-by-step calculus, a set of operators, to transform your data from the form in which it was captured into the form you want to present in order to answer the question. That calculus, I think, needs to come in two forms. It needs a form in which you can process data in large quantities, mutating a large graph step by step, but it also needs to be visual. You need a parallel set of operators that a data analyst, or ideally a domain expert, someone who can't write Python or Cypher queries or GQL but who has the domain knowledge, can use. Because graphs are so visual, you look at one and say, hey, I want to simplify this. Oh, I see Paco and Wei have so many meeting points.
Weidong Yang [01:04:07]: Let's abstract that out. Let's create a single inferred relationship, that Wei and Paco know each other, and get rid of all the other edges. Or maybe say, hey, Paco knows a million people. Maybe I underestimated a little, Paco, sorry about that; no kidding, you probably know more than that. From the graph we can quickly compute that number and use it to make Paco's node very, very big, because Paco knows a million people, right? That kind of operation is highly intuitive. So I want to stress this: visualization for graphs is not the end.
Weidong Yang [01:04:46]: Visualization for graphs is a tool you use to transform the graph to get to the answer.
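As a rough illustration of the kind of graph operators Wei describes, here is a minimal sketch using NetworkX (an assumption; Kineviz's own tooling is not shown). It collapses repeated "met at" edges into a single inferred "knows" relationship and sizes each node by its connection count. The node names, relation labels, and attribute names are made up for illustration.

```python
# Minimal sketch of two graph "operators": edge abstraction and a derived
# display property. Names and attributes are illustrative, not a real schema.
import networkx as nx

# Capture schema: one edge per meeting (a multigraph).
raw = nx.MultiGraph()
raw.add_edge("Paco", "Wei", relation="met_at", event="GraphConf")
raw.add_edge("Paco", "Wei", relation="met_at", event="MLOps podcast")
raw.add_edge("Paco", "Demetrios", relation="met_at", event="MLOps podcast")

# Operator 1: abstract repeated meetings into a single weighted "knows" edge.
presentation = nx.Graph()
for u, v in set(raw.edges()):
    presentation.add_edge(u, v, relation="knows",
                          weight=raw.number_of_edges(u, v))

# Operator 2: compute a derived property (degree) and store it for display,
# so a well-connected node renders larger in the visualization.
for node in presentation.nodes():
    presentation.nodes[node]["display_size"] = presentation.degree(node)

print(presentation.edges(data=True))
print(presentation.nodes(data=True))
```

The point of the sketch is that each step is a small, explicit mutation from the capture schema toward a presentation schema, the "calculus" Wei refers to, rather than a single opaque query.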
Paco Nathan [01:04:53]: Well put. Very good.
Demetrios [01:04:55]: Yeah. That is very much in line with what you were saying earlier about how, when you don't know the question, that's sometimes the hardest part. Being able to wrestle with the data in different forms, one of them being visualizing it in different ways, is a tool that can help you get to the answer, or first to the question, which can then lead to the answer you're looking for.
Weidong Yang [01:05:26]: Yeah, and to mutate the graph visually so you can start poking at it.
Demetrios [01:05:33]: Yeah, exactly. It does feel like the ability to mutate the graph is such a strong tool, for all the reasons we mentioned: the depth, the way you're able to look at the domains, or find anomalies, or find data quality issues, whatever your use case is. It's very cool. Instinctively, though, it does sound a bit manual.
Weidong Yang [01:06:11]: Right.
Paco Nathan [01:06:13]: So far, yes. I think Wei has brilliant examples of what they're doing with SightXR: leveraging 3D visualization, zooming in and out, in conjunction with algorithmic approaches, using graph algorithms to focus the lens, focus the searchlight. I think more can be automated over time, and maybe this is where agents come in: actually helping determine how to be the cinematographer on the graph.
Weidong Yang [01:06:42]: Yeah, so there are definitely ways of helping you look at different perspectives. Very often we deal with data that is graph-connected in nature but also dimensional: each node has many properties, and each property is a dimension, so it's high-dimensional information. Which dimension set do you take, in combination with the network information, to help you see? You need a versatile, flexible way of choosing the dimension set. Very often, when you shift from one dimension to another, you reveal some flocking, things that go together; some clustering starts to happen. You realize, hey, those things always move in the same direction. Those signals help you formulate a lot of ideas and instincts from the data.
Weidong Yang [01:07:36]: And then when you see that, the next thing you want is to capture it as a feature. Can you represent what you see as a feature, so it becomes a thing, an entity in your visualization that you can put back in there? That is visual analytics.
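For the dimension-and-feature idea Wei describes, here is a minimal sketch, assuming NetworkX and scikit-learn rather than any Kineviz tooling: pick a subset of node-property dimensions, cluster the nodes on them, and write the resulting cluster label back onto the graph as a new feature. The graph, property names, and cluster count are placeholders.

```python
# Minimal sketch: choose a dimension set, cluster on it, and capture the
# visual pattern as a node property ("feature") that can feed back into
# tables, filters, or further graph operators.
import networkx as nx
from sklearn.cluster import KMeans

g = nx.karate_club_graph()  # stand-in for a property graph
# Pretend each node carries numeric properties (dimensions); here we derive
# two simple ones from the structure itself.
for n in g.nodes():
    g.nodes[n]["degree"] = g.degree(n)
    g.nodes[n]["clustering"] = nx.clustering(g, n)

# Choose a dimension set and cluster the nodes on it.
dims = ["degree", "clustering"]
X = [[g.nodes[n][d] for d in dims] for n in g.nodes()]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Capture what you see as a feature: store the cluster label on each node.
for n, label in zip(g.nodes(), labels):
    g.nodes[n]["cohort"] = int(label)
```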
Demetrios [01:08:01]: Whoa. So capturing it as a feature and then you can feed it into the tabular data in a way.
Weidong Yang [01:08:09]: Yes, exactly.
Demetrios [01:08:11]: Guys, this is awesome. Is there anything else that you want to hit on before we stop? I feel like I've learned a ton just from talking to you all. I knew it was going to be a great conversation. I was hanging onto my seat this whole time, like, oh my God, I'm learning.
Weidong Yang [01:08:27]: Yeah. On cross-domain work, I want to share one funny example of how difficult crossing domains is. This example is extreme cross-domain. I organize Kinetech Arts, a dance and science nonprofit. One thing we do is, every Wednesday, bring together people from the engineering and science domains and people from the dance, art, and music domains.
Weidong Yang [01:08:59]: We explore something together and have a conversation. At the very first meeting, about 11 years ago, we had about 20 people sitting in the room, everybody in very vibrant conversation. And then I suddenly realized something: it's true that everybody speaks English, but nobody could understand each other, because they were using the same vocabulary, but, just like Paco talked about earlier in the enterprise setting, because of domain differences the words mean totally different things. When a physicist talks about energy, we have a very concrete thing we call energy. What a dancer calls energy is a very different kind of energy.
Weidong Yang [01:09:55]: Yeah. When computer people talk about Python, we're not talking about a snake, but when dancers hear Python, they're like, why are you bringing a snake into the conversation? So, just to echo what Paco said earlier: in the enterprise data context, the domain is very, very important. Be aware of the domain, know its limits, and find ways to cross domains. For us, it has generated a lot of compositions. I think it's a human problem, not a technical problem; technology can help, but only so much.
Demetrios [01:10:44]: We had a conversation on here a few months ago with folks who had created a data analyst agent, and they said one of the hardest parts of making that agent successful was first creating a glossary of business terms, really trying to nail down these fuzzy words that mean one thing to one person and another thing to another person. The quintessential example is the MQL: on one team an MQL is one thing, and when you go to another team, an MQL is another thing. They both mean marketing qualified lead, but when does a person become a marketing qualified lead? What do they have to have done, or what stage are they in? The LLMs understand what an MQL is, kind of, but you really have to flesh out this glossary to spell out all of the terms you use and that are in your database, so that when the agent needs to go pull how many MQLs we had last week, it understands what that means.
Paco Nathan [01:12:14]: Yeah, that's your semantic layer right there. That's a controlled vocabulary, and if you put enough of these together, you get your own ontology.
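As a rough sketch of the glossary idea, not the actual system the guests describe: a small controlled vocabulary keyed by team, injected into an analyst agent's prompt so a fuzzy term like "MQL" resolves to the right definition. Team names, definitions, and function names are hypothetical.

```python
# Minimal sketch of a glossary acting as a semantic layer for a data analyst
# agent. All entries and helper names here are hypothetical.
GLOSSARY = {
    ("sales", "MQL"): "A lead that booked a demo within the last 30 days.",
    ("growth", "MQL"): "A lead that downloaded two or more gated assets.",
}

def resolve_term(team: str, term: str) -> str:
    """Return the team-specific definition of a business term, if any."""
    return GLOSSARY.get((team, term.upper()), f"No agreed definition of {term}.")

def build_prompt(team: str, question: str, terms: list[str]) -> str:
    """Prepend the relevant controlled vocabulary to the agent's prompt."""
    defs = "\n".join(f"- {t}: {resolve_term(team, t)}" for t in terms)
    return f"Business glossary for team '{team}':\n{defs}\n\nQuestion: {question}"

print(build_prompt("sales", "How many MQLs did we have last week?", ["MQL"]))
```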
Demetrios [01:12:22]: Yeah, exactly.