MLOps Community

Unleashing Unconstrained News Knowledge Graphs to Combat Misinformation

Views 1.6K
# Knowledge Graphs
# News
# Emergent Methods
SPEAKERS
Robert Caulk
Founder @ Emergent Methods

Robert is the Founder of Emergent Methods, where he directs research and software development for large-scale applications. He is currently overseeing the structuring of hundreds of thousands of news articles per day in order to build the best news retrieval API in the world: https://asknews.app.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Indexing hundreds of thousands of news articles per day into a knowledge graph (KG) was previously impossible due to the strict requirement that high-level reasoning, general world knowledge, and full-text context must be present for proper KG construction.

The latest tools now enable such general world knowledge and reasoning to be applied cost-effectively to high volumes of news articles. Beyond the low cost of processing these news articles, these tools are also opening up a new, controversial approach to KG building: unconstrained KGs.

We discuss the construction and exploration of the largest news knowledge graph on the planet, hosted on an endpoint at AskNews.app. During the talk we aim to highlight some of the sacrifices and benefits that go hand in hand with the infamous unconstrained KG approach.

We conclude the talk by explaining how knowledge graphs like these help to mitigate misinformation. We provide some examples of how our clients are using this graph, such as generating sports forecasts, generating better social media posts, generating regional security alerts, and combating human trafficking.

TRANSCRIPT

Robert Caulk [00:00:00]: Robert Caulk. I'm CEO of Emergent Methods and I take my coffee black.

Demetrios [00:00:06]: Knowledge Graphs. We're talking knowledge graphs today with Rob. Welcome back to the MLOps Community Podcast. I'm your host, Demetrios. And just as it says on the tin, we are all about them Knowledge graphs. The conversation with Rob centered around how he ingests thousands of news sources, how he uses a combination of vector search, keyword search, LLMs, AI and of course knowledge graphs to give two sides to every news story. I thought it was fascinating, especially when he went into the part about spinning up very small knowledge graphs for the end user and then blowing them up when the user is done with the query. As always, if you like it, tell one friend.

Demetrios [00:00:58]: That's all we ask. And last but not least, there's a really cool initiative going on in the MLOps community, which is you can, if you want, just meet somebody else in the MLOps community. We're doing curated intros, so if you want to do that, reach out to me, because you get to meet somebody else in the MLOps community that is also looking for a new friend. And you can choose: do you want to meet somebody that's about your same seniority level, or maybe higher? Maybe you want somebody like a mentor. And I think that is really cool. I got to give a shout out to the folks that are running that initiative. This is all Benoist's brainchild.

Demetrios [00:01:41]: We've got about 90 people doing it right now, and only two people have not repeated it after the second round. So we've done these twice already, two different rounds of introductions. Two people said no, they didn't do it, but that was only because they couldn't meet and didn't have the time for it. If you want to do it, hit me up. We'll make sure it happens. And let's jump into this conversation. Knowledge graphs and the ontology. You said there's kind of some hot takes.

Robert Caulk [00:02:20]: Hot takes. I gotta start with saying, hey, I'm Rob, I'm the founder of Ask News, and we're building this data source, a real-time data source for news, and I'm sure I'll chat a bunch about that. But we kind of stumbled into knowledge graphs and stumbled into ontologies, so our opinion probably has some temporal bias, right? Because depending on when you enter one of these domains, you have some bias about the latest thing. It's kind of like if you're young and you're listening to a song that's actually a remake of a song from the 80s, but the first one you heard is the new one, and then you hear the original and you're like, ah, it's just not so good. It's kind of the same when it comes to some of the AI domains. And so we stumbled into: do we build an ontology for this massive news knowledge graph that we're building, or do we go ontology free? And at the end of the day, all the signs pointed to ontology free, because of our end user and what they needed from it. It doesn't mean that ontologies don't have a great use case still, and they always will. But you enter this new world of AI interactions with AI, interactions with people, and trying to convey information, and it's very different from interacting with a graph database in 2018, when it was, hey, back then with an ontology, I've got my:

Robert Caulk [00:03:52]: Trump is a part of this article, which is a part of this publication, which was published on this day in this month. That's an ontology, and you could potentially navigate through it to get to other people that are a part of similar things. But then when you're trying to convey information quickly to a journalist or an analyst or another LLM, you're missing some of the high-resolution relationships. At that point, Trump is a part of this article, which is a part of this publisher, and Biden is a part of this article, which is also a part of the publisher, or maybe he's a part of the same article. But you don't really know how Trump and Biden relate. And that's where we've stumbled into: what if we go ontology free and let the relationships kind of form themselves?

Demetrios [00:04:40]: Can you explain what the difference is there? So an ontology is just basically all the metadata around certain subjects, or whatever it is that you're looking at?

Robert Caulk [00:04:53]: Yeah. So a traditional ontology would be a very structured knowledge representation of your domain. It might be your company, might be whatever domain you're trying to model. And by having the strict metadata definitions like you're saying, it allows you to quickly traverse and efficiently filter really large swaths of data, because you've done that metadata filtering and you've assigned relationships which are not assignable in any other way besides knowing that they have similar metadata. So that's great, but you have these very defined relationships. Trump is a part of this article, this article was published on this day. So "published on" is defined, "a part of" is defined, the article type, the person.

Robert Caulk [00:05:47]: But what about, you know, Trump defeated Kamala, right? Defeated. Now you have to add a new relationship. You have to then add that into your ontology and be ready for it at all times. But then you bring that to the limit case, and it's just an unlimited number of possible relationships between people and organizations. Kamala can be a part of the Democratic Party, Trump can be a part of, whatever, MAGA. And so by leveraging this high-level reasoning and intelligence which LLMs have now ushered in, you can start defining relationships in high resolution. And with this, there's a big weakness, and we'll talk about that. But there's a big strength.

Robert Caulk [00:06:34]: The strength is that with this high resolution, I can now look at a very small piece of my graph and quickly consume that information as an analyst, or pass it to an LLM in a very prompt-token-optimized way. Right? Because instead of a full paragraph, which is like, oh, Kamala got defeated by Trump and then Trump joined MAGA, blah blah blah, all these extra nouns and verbs, it's literally a JSON structure passed to an LLM. You've now reduced your token usage and you've actually magnified the most important relationships for the LLM to move forward on to glean insights. So now you're working in a much different world than the previous 2018 ontology structures. It's almost just a different type of approach. But the guys that are using traditional knowledge graphs and ontologies haven't quite grasped this idea that maybe there is a new use case for knowledge graphs that wasn't quite so useful back then. I don't know if that clarifies the question.
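
[Editor's note] The token-savings point above can be sketched concretely. This is a toy illustration, not AskNews's actual schema: the `s`/`r`/`o` field names and the example extraction are invented.

```python
import json

# A toy illustration: relationships as a compact JSON structure instead of prose.
# The "s"/"r"/"o" field names and the example extraction are invented.
article_text = (
    "Kamala Harris was defeated by Donald Trump in the election, "
    "and Trump then aligned himself with the MAGA movement."
)

# What an ontology-free extraction step might emit for that sentence:
triples = [
    {"s": "Donald Trump", "r": "defeated", "o": "Kamala Harris"},
    {"s": "Donald Trump", "r": "aligned with", "o": "MAGA movement"},
]

# Serialized without whitespace, this is what gets packed into the LLM prompt,
# dropping the filler nouns and verbs of the original paragraph.
compact = json.dumps(triples, separators=(",", ":"))
print(compact)
```

The downstream LLM then reasons over only the magnified relationships rather than the full article text.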

Demetrios [00:07:43]: Yeah, basically you're giving high-definition relationships, in a way, as opposed to lots of noise that you have to filter through to get what you're looking for. You're trying to say, what are the most important relations that we have with whatever the subject is? And then we can filter from there. And I guess the main thing that comes to my mind is how you were saying you have almost infinite, or infinite minus one, relations and edge cases, that long tail of relations that you can have when you're doing it the ontology way. How does having no ontology avoid that?

Robert Caulk [00:08:34]: Avoid the... how does it avoid what?

Demetrios [00:08:37]: The, like, infinite relations problem. Or do you still have that?

Robert Caulk [00:08:41]: Yeah, so the infinite relations problem means that your database is just kind of screwed if you're going to try to model it with the same type of database. Right. Because graph databases are not designed to handle an infinite number of relationships coming in; they're indexing in a way that allows you to filter. So you have to approach the problem from a different side if you're going to do something like this, which means a hierarchy of databases, and some of them are not graph databases at all. And so for us, a lot of what we do is just storing a lot of high-fidelity information, then putting that into a graph database with a finite number of relationships, and then traversing around. So you're looking at the world's largest news knowledge graph sitting on S3, more or less, then using some traditional traversals or traditional indexing and retrieval methodologies, and then finding: okay, this is the part I want. Because nobody wants to view a million relationships all in one graph.

Robert Caulk [00:09:46]: It's just useless. Yeah, so that's how you kind of solve it, and then you can cache it. And I don't work at all with Memgraph, but I'll plug them. It's a great cache for the data. It's almost designed for this use case, because it's designed to be run in memory, and that allows you to do very high-performance cache movement in and out very quickly. So you're still interacting with the data with the same type of queries, but you're not having to deal with that massive infinite number of relationships, if that makes sense.

Demetrios [00:10:22]: Actually, I've heard Memgraph being mentioned before, and the way I think I heard it described was that it's almost like a DuckDB of graph databases. So what are the benefits there? Is Memgraph just small and fast? Why do you like it?

Robert Caulk [00:10:44]: Well, I like it because it's fully open source. If you get involved in Neo4j, it's really geared at funneling you to enterprise, and I come from an open source background, so everything we do is open source and everything we consume is open source. If Neo4j's enterprise product really was compelling, then okay, I get it. But Memgraph's open source is actually compelling and fully functional. There are some additional features, as always, but generally speaking, like Qdrant, the base database that is open source is extremely usable in production. To me that's a big distinction. But the underlying design, and I'm really not an engineering expert, I haven't even reviewed their source code or anything, but from my bird's-eye view of it, reading docs and interacting with their community:

Robert Caulk [00:11:42]: Yeah, they've really focused on performance, so they use in-memory storage. And you know the difference between storing stuff in memory and on disk: it's going to be a lot faster. Neo4j keeps everything on disk, using a cache, most likely. But Memgraph took the reverse approach. It said everything's in memory, and now if you want to store stuff on disk, it's kind of a feature. And people do want to do that as well for much larger graphs.

Robert Caulk [00:12:09]: But Memgraph, I think, caught onto the idea that you don't actually need a massive graph. What you really want is the right graph, and that can be constituted of, you know, a hundred nodes, but as long as it's the right graph and you can traverse it and filter it the right way, it becomes very valuable. And doing it quickly and on the fly, like we're doing, loading stuff in, deleting it. We'll literally create a database and then just delete it, right? Because when we're done with it, we're done with it. So it's a really interesting use case. It fits really well in that paradigm.

Demetrios [00:12:46]: Yeah, it's ephemeral in that sense. Okay, I like that. And it makes sense why, in a way, you can compare it to DuckDB, because it's more like, yeah, small graphs. You don't need to boil the ocean. You just need the graphs that are important for you, for the time that they're important for you, and then you're done. And it almost reminds me of Redis too, how they took that approach of, hey, let's cache things, let's make it in memory, let's make it really fast and let's get you what you need. If that's your use case, then go for it.

Robert Caulk [00:13:23]: Yeah, exactly. No, I think it's a really fun time to watch all of the different databases and services choose their features, and even us as a company trying to choose what we offer. And that's what I realized is the most difficult aspect of interacting in the AI industry right now: deciding what people appear to want versus what people might actually need. That is a really difficult thing to connect. So I admire Memgraph for taking a slightly nonconventional route. A while ago, even before AI, they were pushing this sort of thing. But there's this other aspect of graph DBs, where maybe you don't even need them, depending on what you're really doing. You can still do GraphRAG without a graph DB.

Robert Caulk [00:14:11]: And that's where I'm really watching the ecosystem to see where that goes, because the graph DB itself is very expensive to run. It is very computationally expensive, and if you can get away without it, it's much better for GraphRAG, especially for speed and stuff. So that's part of the industry. I don't know how it's going to go. I would love to watch and see what they do.

Demetrios [00:14:40]: But how do you even envision that happening? And what makes you think that it's moving in that direction and that's a possibility in the future?

Robert Caulk [00:14:49]: I mean, at the end of the day, you can simulate a graph database with Postgres tables and linked lists, right? What the graph DB does is make that multi-hop query really nice and easy and fluid; the UI is there, and it's computationally effective. The index itself has prepared you for these multi-hops. But truly, all you're doing is connecting metadata on documents, right? And so you could build a graph DB in Postgres. A lot of people do it, and if you don't have that demand for very complex queries, you might actually be better off with just Postgres, which is wild to think about, right? Because we're in this world where everyone says, oh, vector DBs, graph DBs, NoSQL DBs, all of that, and then add the vectors to the NoSQL, add the graph to the vector. At some point, you can sometimes actually simplify things a little bit by just saying, hey, this is my metadata.

Robert Caulk [00:15:51]: I have four points of metadata. I want to be able to do a couple of hops. I don't need to do something wild. So maybe I just custom-engineer it. Maybe that's how RAG goes, especially if RAG becomes a super domain-specific use case, where every time you need to engage RAG, you're doing it with domain expertise of engineering behind it. And in that case, if you had to do some engineering anyway, maybe it makes sense to not deal with a really heavy database and instead do something more niche, even with Redis. I don't know.
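
[Editor's note] As a minimal sketch of the "just use Postgres" idea above, here is a multi-hop traversal over a plain edges table using a recursive CTE. sqlite3 stands in for Postgres so the snippet is self-contained, but the same `WITH RECURSIVE` query runs on Postgres; the table layout and the sample edges are invented.

```python
import sqlite3

# Simulate a graph database with one plain relational table of edges.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, rel TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("Biden", "member_of", "Democratic Party"),
        ("Democratic Party", "endorsed", "Harris"),
        ("Harris", "debated", "Trump"),
    ],
)

# Multi-hop traversal: everything reachable from 'Biden' within 3 hops.
rows = conn.execute(
    """
    WITH RECURSIVE hops(node, depth) AS (
        SELECT 'Biden', 0
        UNION
        SELECT e.dst, h.depth + 1
        FROM edges e JOIN hops h ON e.src = h.node
        WHERE h.depth < 3
    )
    SELECT node, depth FROM hops WHERE depth > 0
    """
).fetchall()
print(rows)
```

Using `UNION` rather than `UNION ALL` deduplicates revisited nodes, which is what keeps the traversal from looping on cyclic graphs.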

Demetrios [00:16:29]: Yeah, it's so funny. It reminds me of that meme with the nitwit and the ninja, how it starts over here with, I'll just use Postgres, and then there's the bell curve graph and it's like, no, we need a vector database and a graph database and NoSQL, and then on the other side you have the ninja who's like, I'll just use Postgres.

Robert Caulk [00:16:52]: Yeah, exactly. Postgres is one of the best inventions of the last 20 years, I think, because it replaced Oracle, generally speaking. And Oracle was this miserable vendor lock-in. I still have friends in the industry whose whole job, for years, has been trying to migrate away from Oracle.

Demetrios [00:17:14]: Job security right there.

Robert Caulk [00:17:16]: It's absolute insanity. So yeah, I agree with this meme, but we do use vector databases. I don't want to hate on vector databases.

Demetrios [00:17:27]: We're talking in the future. This is where you're thinking maybe there's a potential here; let's see, and if you can, then great. But at this moment in time you don't, and that's really good to call out. It also reminds me, when you talk about Oracle and migrating off of it, there was an episode of Acquired that I listened to on AWS, and I think it wasn't until 2018 that they had fully migrated completely off of Oracle. 2018, for AWS, or well, I guess it was Amazon, because a lot of it probably started on Oracle stuff. And so you can imagine how hard it must be for folks to get off of Oracle.

Demetrios [00:18:14]: So anyone that is out there completely locked into Oracle, you're fighting a good fight.

Robert Caulk [00:18:19]: Yeah. Join the Postgres world.

Demetrios [00:18:22]: Yeah, they're like, you're oblivious to all these pains that I have. So anyway, let's talk about what you are actually doing, as opposed to what is potential in a utopian world where we can just use Postgres for everything. Because it feels like, as you were saying, you have many layers of databases and an actual structure around that, and the way that the queries go through each of these layers is kind of interesting to me.

Robert Caulk [00:18:53]: Yeah, there's a lot of engineering behind it, which is super fun to be a part of and to solve in real time. If we back up to how we generate the data, it might help to understand how the data is structured and then queried. We're doing a lot of processing of news, and for each news article, we're extracting and enriching as much information as we can to build a synthetic representation. So we're not actually dealing with the full text ever. We're going and we're extracting evidence, we're extracting the key people, we're extracting the geocoordinates related to the locations mentioned in the article. We're extracting all sorts of stuff, including relationships.

Robert Caulk [00:19:38]: And that happens for every single article. And that means one article might mention Joseph R. Biden, another article might mention President Biden, and another might mention just Biden. And obviously that is one node, or maybe it's not, depending on what you want in your system. Right. And so being able to choose how those nodes are disambiguated is huge. But for the generation of that data itself, since we're going ontology free and we're saying, okay, we're extracting a person from this article, and he might have a different name in that article, we had to figure out how to do this and build up from it.

Robert Caulk [00:20:17]: We fine-tuned Phi-3-mini in order to build the graphs, and I'm happy to chat about how we did that after. But basically Phi-3-mini has one job, and that's to take a text and generate an ontology-free graph. And so it does it for this article, Joseph R. Biden; this article, Biden; this article, President Biden. And then we store all of that information in a hierarchy of databases, in some ways. A lot of UUIDs pointing to different UUIDs. Maybe it's too complex, I don't know, but it seems to work quite well. A lot of the data is stored in S3, and so we're doing retrieval on these UUIDs based on a lot of pretty traditional techniques, to be completely honest.

Robert Caulk [00:21:04]: I mean, we do have vector representations there, but keyword search is still really, really good. I think that's something that's maybe also a little contentious these days: maybe you don't actually need these dense vectors. You might not even need sparse vectors. Even with just basic keyword metadata filtering, you can get so far, to just get to the base of, okay, here's an article that has Biden, Biden, Biden. And then you can take those graphs, start doing your disambiguations on top of them, and put it into Memgraph. And then you can do traversals where you're not dealing with three nodes which are clearly all the same person. Now you have one node. It's almost like a compression and decompression of information: it's compressed at that base layer, and then you have a decompression algorithm, which inside of it is a disambiguation.

Robert Caulk [00:22:02]: And so that really helps with solving these big challenges on the largest news knowledge graph in the world. Once you're in Memgraph, you can have all the fun you want with whatever queries you want, and then even filter it down further if you want.
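
[Editor's note] A toy sketch of the disambiguation ("decompression") step described above: per-article graphs name the same person differently, and a canonicalization pass merges them into one node before loading into Memgraph. A real system would use context and embeddings rather than a fixed lookup table; the alias map here is invented for illustration.

```python
# Invented alias map: surface forms -> canonical entity name.
ALIASES = {
    "joseph r. biden": "Joe Biden",
    "president biden": "Joe Biden",
    "biden": "Joe Biden",
}

def canonical(name: str) -> str:
    """Map a surface form to its canonical entity, falling back to the input."""
    return ALIASES.get(name.strip().lower(), name)

# Three per-article nodes collapse into a single canonical node.
per_article_nodes = ["Joseph R. Biden", "President Biden", "Biden", "Kamala Harris"]
merged = {canonical(n) for n in per_article_nodes}
print(sorted(merged))  # ['Joe Biden', 'Kamala Harris']
```

After this pass, a traversal touches one "Joe Biden" node instead of three duplicates.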

Demetrios [00:22:20]: And so why is it that you were saying the Memgraph instances were spun up and then blown away in moments? What does the life cycle of a Memgraph instance look like?

Robert Caulk [00:22:37]: Good question. So when our users come in and they want to interact with this knowledge graph, we're taking their query and we're saying, okay, how do we build a graph for this person from what we know here? So there is actually some AI on the incoming side, because they're not using traditional query languages for the graph. They're just using natural English language. And so it's on us to say, okay, these are some important keywords we're going to use for filtering, these are some important concepts that we should keep in place. And so then we go fetch all that information, do all of the combination, and we can actually put that into a Memgraph instance for them right then and there. And at that point we can actually use traditional queries, and maybe we have also already prepared a traditional graph query based on what they asked.

Robert Caulk [00:23:28]: So maybe they said, give me all of the relationships of Biden within three hops. That's it. And then tell me, what's the biggest threat to Biden's administration based on three hops, or something, right? Something that is clearly English language requires a lot of basic retrieval and the construction of some real query. And so then, once it's in Memgraph, we can run that actual query. And spinning up the Memgraph instance is actually quite fast. So I know we kind of hated on graph DBs, but at the scale of what these users are looking for, they're not looking to build a massive graph. They want to know. Usually there's a time constraint as well.

Robert Caulk [00:24:12]: So you're only looking at the graph between now and the last 72 hours, or even the last week. And no one would say, I want the graph for the last two years, because, A, it would probably break Memgraph anyway, and, B, it's just too much data to even parse and try to glean insights from. Usually what you want to know is what was happening with Biden right before a certain event. So that would be an analyst saying, okay, leading up to Biden exiting the race, I want to know all of the people within three hops that said something related to Biden exiting, then give me a summary of all of those interactions and tell me who were the most likely individuals that contributed to maybe influencing Biden or influencing the media, or whatever kind of analytics insight you're after.

Robert Caulk [00:25:08]: Yeah, basically we would take that and then have to do this whole decompression, get it here, extract it for them, and also even give them access directly to that Memgraph instance to then maybe build their own queries. And as soon as they're done, we wipe it clean. We're done with it. At that point there's nothing more for us to do. That graph will never need to be created again, because someone else is going to have a slightly different query. And as you can see, this decompression has so many, I guess, reasoning checkpoints that it most likely won't happen again.

Robert Caulk [00:25:44]: We do have some caching, but it's very, very limited in this sense.

Demetrios [00:25:48]: Yeah. And so when you're loading up the information into Memgraph for the user to interact with, how do you know what is enough if they don't specify how many hops? And also, how are you not grabbing things that may or may not be relevant, or maybe relevant for the next query? Is that also just loaded at the time when someone says, okay, now I want to go a little bit deeper over here, I'm going to poke around in this area? Then you just grab that data and serve it up?

Robert Caulk [00:26:26]: So you mean like modifying an existing graph is what you're saying?

Demetrios [00:26:30]: Yeah, exactly. Modifying an existing graph, or, I want to know how you are creating the surface area of the graph. Because I'm assuming that not everyone says, give me three hops.

Robert Caulk [00:26:50]: Yeah, you have to.

Demetrios [00:26:52]: Do you have to say that? Or is that like one of the prerequisites?

Robert Caulk [00:26:55]: No, no. I mean, luckily we can kind of decompress from anything. It's on you to decide what level of structure and specificity you're after. There could be irrelevant things that are connected, but there's going to be a connection, right? Like, if we're talking about this presidential thing, it's like, okay, here's Biden floating here.

Robert Caulk [00:27:17]: And then, like, a couple hops over here, you've got some random sporting event that had nothing to do with it, but there was a mention of something related to something in the anthem, and then just this weird connection. It might be floating around. But in some ways that is actually what a lot of our analysts are after, because that's where you get to a connection that's hard to get to through vector search, because you've got this very, very dissonant set of connections.

Demetrios [00:27:50]: Yeah.

Robert Caulk [00:27:51]: Like, for example, one of our favorite clients is Love Justice. They combat human trafficking. It's an NGO in South Africa. And what they do is, they'll say, okay, I know there was this incident of criminal activity at this shop, and this person reported an incident and the police are investigating it. But then this shop is also connected to this one person over here, and that person might be benign, but it turns out that they're also connected to something over here, which basically opens up: okay, why is this person connected?

Robert Caulk [00:28:32]: They follow this path through, and all of a sudden they've kind of uncovered a hidden insight, and that allows them to then say, hey, this is somebody or something or some place or some product that we need to look into, because it's weirdly connected. So that's essentially one of the use cases you would use the knowledge graph for.

Demetrios [00:28:53]: But, so the surface area...

Robert Caulk [00:28:55]: Sorry, I kind of drifted off the surface area question. But no, that's okay. You get what you get, and you hope that that raw representation and that metadata filtering are going to get you enough of a surface area that you can then play a lot with once you're in Memgraph. So maybe it's a weakness, maybe it's a strength.

Demetrios [00:29:14]: Yeah. It reminds me of that moment in the series The Wire. I don't know if you ever saw that. I can't remember which season it was, but all of the police are sitting around in their stakeout room trying to figure out how the drug dealers have money, or who is giving money to whom, when this drug dealer is supplying the streets with drugs. But then the money is going into all of these accounts, and with one of these accounts it's like, actually, what's going on here? They're following the money and they see that it's going to a politician. And it feels like, damn, dude, if they just had Ask News, they could have made that whole episode like 30 minutes shorter.

Robert Caulk [00:30:04]: Exactly. That's funny that you say that, actually. Who's that guy, Matt Gaetz? The New York Times released an article when Matt Gaetz was the nominee for the head of the Justice Department.

Robert Caulk [00:30:20]: All the dirty laundry was coming out as usual. But one of the dirty laundry pieces about his interactions with minors apparently came from the investigation into him. They built a Venmo payment knowledge graph. Did you see this?

Demetrios [00:30:35]: No.

Robert Caulk [00:30:37]: And it was just so damning. It was like, Gaetz pays this guy a lot, and this guy pays all the minors a lot. And sometimes he would even pay the minors directly, and it was like, man. And then he would put little notes like "for tuition." You couldn't make a better knowledge graph set of metadata than using Venmo and adding notes like "for tuition" from Gaetz.

Robert Caulk [00:31:07]: I think people are missing the power of the graphs, especially people like that, like Gaetz. But you've got companies that are designed around using the power of graphs to solve these criminal investigations. Like one of our advisors, Paco Nathan. You know him, right? I think we're connected.

Demetrios [00:31:25]: Paco. That's great. Yeah, Paco's awesome. He's a legend in this space and in AI in general.

Robert Caulk [00:31:32]: He really is. Did he tell you about what Senzing is doing? Yeah, that's crazy. A lot of what they'll do is use open source intelligence.

Robert Caulk [00:31:44]: They do a lot of disambiguation. And what they're able to do is open up insights through knowledge graph connections that are just totally hidden when tracking criminal activity. Some of that even comes down to imagery. I think one of the examples he gave was fishing, protecting the oceans from overfishing and so on. You can use imagery, information from ports, boat registrations, and you can connect all of that. And all of a sudden you start to see: this boat is doing something weird, it's connected to this thing. And then besides that,

Robert Caulk [00:32:22]: one of my favorite examples was the Silk Road investigation, how they found the creator of Silk Road. You know Silk Road?

Demetrios [00:32:31]: Yeah, for sure, I've heard of Silk Road. It was a drug marketplace, right?

Robert Caulk [00:32:40]: Yeah. And it was based on Bitcoin. Essentially, the way the feds, the FBI, identified the creator and found him was by mapping out all of the wallet addresses and the movement of money between wallets. I think Cosmograph even has the Silk Road investigation available on their website to visualize. It's really cool. You see all the wallet addresses floating around, and basically, through these connections, they were able to say, okay, this wallet is showing up a lot in these very suspicious payment transactions, or there is a wallet that's suspicious.

Robert Caulk [00:33:21]: And the way people glean information from knowledge graphs, once you start seeing it, you realize, okay, there's a lot of power there, and that is with ontology. So there's definitely still a good use case for ontologies.

Demetrios [00:33:38]: And you mentioned how AI is on the front end when the user interacts and sends their queries, but you also have AI on the back end when you're sending queries to an LLM. Where does AI show up everywhere in the pipeline? Besides Phi-3, which is an awesome use case, just fine-tuning Phi-3, and it's very cool to hear you did that. When you fine-tuned it, how did you fine-tune it? Did you distill it?

Robert Caulk [00:34:07]: Oh, we have a blog post I can post as a comment on the podcast, but.

Demetrios [00:34:13]: Nice.

Robert Caulk [00:34:13]: Yeah, we fine-tuned it. A lot of it was engineering the dataset. We pretty much did a knowledge transfer from GPT-4o, which is not new. I think even back in the day, Alpaca was the first to show that, hey, you could actually steal some of this knowledge from ChatGPT, from GPT-3.5. And now we basically do the same thing, except with GPT-4o: label a bunch of data, then engineer that data in a very diversified way to cover the parameter space as best as we can, the parameter space we knew we would be looking in. That means ensuring you have topics like politics, sports, entertainment; ensuring you're hitting all the languages; ensuring you're crossing all the different continents and countries and sources and so on.
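The distillation recipe Robert describes, a teacher model labeling articles and the dataset stratified across topics and languages, can be sketched roughly like this. The field names, topics, and the stubbed teacher call are all illustrative assumptions, not the actual Emergent Methods pipeline:

```python
import json
from collections import Counter

# Hypothetical sketch of the knowledge-transfer dataset: a "teacher"
# (GPT-4o in the talk) labels articles with graph triples, and the
# dataset is stratified over (topic, language) cells so the student
# model (Phi-3) sees the whole parameter space during fine-tuning.
TOPICS = ["politics", "sports", "entertainment"]
LANGUAGES = ["en", "de", "ru", "ar"]

def teacher_label(article_text):
    """Stand-in for a GPT-4o call that returns relationship triples."""
    return [{"subject": "stub", "relation": "stub", "object": "stub"}]

def build_example(article_text, topic, language):
    """One fine-tuning record: prompt in, teacher's triples as target."""
    return {
        "prompt": f"Extract entity relationships as JSON:\n{article_text}",
        "completion": json.dumps(teacher_label(article_text)),
        "topic": topic,
        "language": language,
    }

def coverage(dataset):
    """Count examples per (topic, language) cell to check diversity."""
    return Counter((ex["topic"], ex["language"]) for ex in dataset)

# Build one example per cell to enforce coverage of the grid.
dataset = [
    build_example(f"sample article {t}/{l}", t, l)
    for t in TOPICS for l in LANGUAGES
]
```

The `coverage` check is the "diversity enforcement" idea in miniature: before training, you verify no (topic, language) cell is empty rather than hoping the crawl happened to be balanced.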

Robert Caulk [00:35:04]: So: labeling that dataset with GPT-4o, ensuring it's diversified, and then fine-tuning Phi-3 on it. And it turns out it outperforms Claude 3.5, and it's free and fast, right? If we were to run on Claude 3.5, you'd get great results. But imagine putting half a million articles a day through Claude 3.5. You'd pay probably 10k a day, and I'm not even kidding. So all of a sudden you've got this tiny model that fits on basically a consumer graphics card, and it's just pumping out Claude-level quality. Of course, it's not going to do anything else, right? It's only going to build the graphs, but they're better than Claude's graphs, which is fun. But okay, to answer the end of your question: one of my favorite reasons, the actual main impetus for taking this whole ontology-free route, was conveying information in a very concise, token-optimized way.

Robert Caulk [00:36:05]: So, like you said, what happens on the end with the AI? Now we're trying to communicate to an LLM. A lot of our users might not even need the Memgraph component of this. A lot of them will say, I need a graph of Biden's interactions related to Hunter Biden, or whatever. So we build them this very concise representation of relationships, and they'll take that as JSON or YAML and hand it directly in a prompt to an LLM and continue with their insights, right? Maybe we're just part of their chain. We're not the beginning nor the end. We're just one component getting them that real-time grounding. But in order to build that relationship graph of Biden and Hunter, we probably had to read a thousand, two thousand articles, distill them, enrich them, understand them, build relationships. And now it's all been condensed into maybe 300 or 400 tokens.

Robert Caulk [00:37:11]: So they're saving a ton of money there. I mean, they're paying us, because obviously they pay for the request, but they don't have to do any of that aggregation of information. You know, what if a Chinese outlet talked about Biden randomly? How would you have gotten that, right? And we're doing all this in under a second, too, and passing that to the next LLM, and then maybe appending after. So their prompt might look like this: they get the YAML or the relationship graph from us, they add it to the beginning of their prompt, and they say, here are the relationships related to the news of Biden and Hunter. And then: here's my other context that only I know about, in my own application, related to this piece of whatever, to help answer the user query.

Robert Caulk [00:37:58]: And here's another set of context. They're painting their own context. We're just a part of it. But now their context window is basically fully open. They didn't have to put a bunch of articles in it, they didn't have to preprocess all this text. It's just, boom, JSON. And the LLM loves interacting with it. The other beautiful part of this is that this is the innovation of us connecting publishers in a very equitable way to LLMs.
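The prompt-assembly pattern Robert just walked through, graph first, private context second, user query last, can be sketched as below. The graph payload and its field names are a hypothetical stand-in, not the real AskNews response schema:

```python
import json

# Illustrative compact relationship graph, as it might arrive from a
# news-graph API. The schema here is assumed for the example.
relationship_graph = {
    "entities": ["Joe Biden", "Hunter Biden"],
    "relations": [
        {"source": "Joe Biden", "relation": "father_of",
         "target": "Hunter Biden"},
    ],
}

def assemble_prompt(graph, private_context, user_query):
    """Prepend real-time graph grounding to app-local context,
    leaving the rest of the context window free for the caller."""
    return (
        "Here are the relationships extracted from recent news:\n"
        + json.dumps(graph, indent=2)
        + "\n\nAdditional application context:\n" + private_context
        + "\n\nUser question: " + user_query
    )

prompt = assemble_prompt(
    relationship_graph,
    "Our internal notes about the ongoing investigation.",
    "Summarize the latest Biden-related developments.",
)
```

The point of the pattern is that a few hundred tokens of structured relationships replace thousands of raw article tokens, so the caller's own context and question still fit comfortably.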

Robert Caulk [00:38:25]: Because now the original expression is nowhere to be found. There are no plagiarism concerns, but the publishers can still monetize the information they've produced without leaking the original expression that their journalists spend so much time writing. And they don't want to train new foundation models with their expression, with.

Demetrios [00:38:52]: The full style of writing. Yeah.

Robert Caulk [00:38:53]: But they want to monetize it in this new AI world, and they don't know how. So Ask News essentially locks that in and says, okay, come to us. If you surface, you get a royalty. And what surfaces is a synthetic contextual representation, where you've surfaced somehow, but in a lot of context with a lot of other information. You just contributed to this query, so you get a royalty, but there's nothing there that's going to get trained on or stolen or redisplayed on another website. It's only for communication to LLMs.

Demetrios [00:39:26]: Wow.

Robert Caulk [00:39:28]: Wow. There's so many.

Demetrios [00:39:30]: And you're saying everything that I like to hear. There are so many cool things that you're talking about here. So let me start from what my mind instantly jumped to, which was: oh, for a lot of your users' use cases, you're just another step in a DAG.

Robert Caulk [00:39:51]: Exactly.

Demetrios [00:39:52]: It's like, go grab the context from Ask News, and then they can use that however they choose. Maybe it's by putting it into a context window for their LLM call, or maybe it's just to have some more information to play with. And then number two, it is so nice that you're doing the heavy lifting to synthesize all that data and find the real signal in all of it. Because, yeah, with thousands of news sources, I can imagine that some of them are really bad, some of them don't mean anything, and some of them are actually quite good. So how do you know which ones are good or bad? Or do all of the people get paid equally? Could I just start churning out shitty things and start getting paid on Ask News, even though I'm not known for my quality?

Robert Caulk [00:40:57]: That's a good question, one of the most common questions we get. And I like how people are aware of the dangers around potentially propagating misinformation. This one is super interesting because we've just been in the mix of it, observing it and running research on it. One of the outlets that people hate that they might stumble across is RT, right? It's like, oh, this is a Russian propaganda network, everything they say is fake. But actually, the beauty of what we're doing is this contextual representation.

Robert Caulk [00:41:36]: First of all, you can filter out RT if you don't want it, right? You can say, I don't want that. But even when you include RT, if you take a contextual picture of French, German, English, Italian, Arabic, Russian, and Ukrainian outlets discussing the Russia-Ukraine war, all of a sudden the LLM is smart enough to detect what the alignments are and what is poking out and contradicting, right? We're able to leverage the LLM's intelligence at that point. The stories we write on Ask News, that's precisely what we do. We cluster a bunch of articles, and then we do what's called diversity enforcement. So RT might show up, but knowing how what RT says relates to what the Ukrainian source says on the same topic is actually the right piece of information to convey to the reader, because that shows up as a contradiction. It's like, oh, RT says that no civilians were killed.

Robert Caulk [00:42:36]: Every other source in every other language in every other part of the world says that a bunch of civilians were killed. That's just a piece of information that we are relaying. It was a contradiction that the LLM was able to identify because it stuck out like a sore thumb. Now you as the reader decide what to do with that information. It helps you; it reinforces that RT might be reporting incorrectly. But if we remove RT from the whole interaction, now we're actually kind of biasing away from Russia. We're not giving Russia even a say. Sometimes in an RT article there might actually be really important information to know about Russia.

Robert Caulk [00:43:15]: Obviously, there's that understanding that it's coming from a state-run megaphone, and that should be part of how you consume that information. But a lot of times they're reporting actual events that are occurring, and knowing what they say is important. So if you want to start publishing a bunch of articles that might surface in these queries, that's perfectly okay. If you want to monetize your Substack, you can do that, right? You can come and add your Substack, and then we'll go and build that synthetic relationship representation for you, leave your content where it is, and make it retrievable for the developer. And if you said something that's retrievable, they can play with it. The last point I would make on that: we have a ton of filtering mechanisms. As part of our enrichment, we check the reporting voice.

Robert Caulk [00:44:12]: So we'll check if it's persuasive, sensational, objective, investigative, analytical. You could say, I only want objective. Some of our analysts say, I only want sensational, because they actually want to see the sensational side of it. Maybe they only want sensational Russian sources, because that's the part of the context they need to paint in their next step, right? It's almost like a chosen bias with your filter, where you can say, I only want this, I only want that. You can also filter on page rank. This helps, right?
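The "chosen bias" filtering Robert describes, selecting articles by reporting voice and source page rank, amounts to a simple filter over enriched metadata. The dict layout below is a hypothetical stand-in, not the real AskNews API response:

```python
# Illustrative enriched-article records. The "voice" label comes from
# the enrichment pass described in the talk; fields are assumed here.
articles = [
    {"title": "Markets close mixed", "voice": "objective", "page_rank": 7},
    {"title": "SHOCKING collapse imminent!", "voice": "sensational",
     "page_rank": 2},
    {"title": "Inside the probe", "voice": "investigative", "page_rank": 5},
]

def filter_articles(items, voices=None, min_page_rank=0):
    """Keep only articles matching the requested reporting voices
    and meeting a minimum source page rank."""
    return [
        a for a in items
        if (voices is None or a["voice"] in voices)
        and a["page_rank"] >= min_page_rank
    ]

# An analyst who only wants objective reporting:
objective_only = filter_articles(articles, voices={"objective"})

# Or filter on source authority instead of voice:
high_rank = filter_articles(articles, min_page_rank=5)
```

Either filter can also be inverted, e.g. `voices={"sensational"}`, which is exactly the "I only want sensational Russian sources" use case from the conversation.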

Demetrios [00:44:44]: Wait, that sentiment analysis is done by another model in the background?

Robert Caulk [00:44:50]: Yeah, actually, Llama 3.1 70B is running basically the full synthetic representation, which includes the reporting voice, the sentiment, the extraction of key evidence, supporting details, any key people, stuff like that. That would be the synthetic representation. And you know, Llama 3.1 70B, you've played with it, right? It's wildly capable, and we see it works very well, especially when it comes to sentiment analysis: positive, neutral, negative. You see basically this massive shift when you compare the Llama models against what sentiment analysis was before. It used to be very, very difficult to get that right. We're still not perfect, but this was a step change.

Robert Caulk [00:45:47]: This was going from right 50% of the time to right 85% of the time. So many of our analysts and clients used to say, you have sentiment analysis? I don't touch that. Like, I've seen enough sentiment analysis, thank you.

Demetrios [00:46:02]: Yeah, I've been burned by them enough that I just don't want to. Exactly.

Robert Caulk [00:46:07]: So we have to convince them. It's like, actually, we've done the research, we've done the checks. It's not perfect, but it's very, very good. Llama 3 has high-level reasoning, and when you have that higher level, you can do things like this. It's cool.
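The single-pass enrichment Robert describes, one instruction-tuned model emitting sentiment, reporting voice, and key people together, can be sketched like this. The model call is stubbed out, and the JSON schema is an illustrative assumption:

```python
import json

# Hypothetical enrichment prompt: one pass over the article produces
# the whole "synthetic representation" as structured JSON.
ENRICHMENT_PROMPT = """Analyze the article and reply with JSON containing:
- "sentiment": one of positive / neutral / negative
- "reporting_voice": one of objective / sensational / investigative / analytical / persuasive
- "key_people": list of names

Article:
{article}"""

def call_llm(prompt):
    """Stand-in for an inference call to a model like Llama 3.1 70B.
    Returns a canned reply so the sketch runs offline."""
    return json.dumps({
        "sentiment": "neutral",
        "reporting_voice": "objective",
        "key_people": ["Jane Doe"],
    })

def enrich(article_text):
    """Run the enrichment prompt and parse the structured reply."""
    reply = call_llm(ENRICHMENT_PROMPT.format(article=article_text))
    return json.loads(reply)

enriched = enrich("Officials confirmed the agreement on Tuesday.")
```

Asking for all the labels in one structured reply is what makes this cheap per article: one inference call yields every field the downstream filters need.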

Demetrios [00:46:23]: And so maybe you explained this, but I didn't catch it the first time around. For these news sources, you're crawling everywhere you can to get the articles, you're then using Phi-3 to figure out the nodes, and then when you've figured out what the nodes are, you use Llama 3 to break it down?

Robert Caulk [00:46:51]: Close. First, we aren't crawling everywhere. We're only crawling where we're allowed to crawl, or where we have a licensed source, or where someone has signed up voluntarily and given us the ability to go behind a paywall. Otherwise, we'll never go behind a paywall. Now, for the construction of the information itself, when we get to the article, Llama does its extraction, Phi-3 does its extraction, GLiNER does its extraction. We didn't talk about it, but this is a big pipeline.

Robert Caulk [00:47:23]: GLiNER is like our crown jewel at Emergent Methods. It turns out it was like the 10th most downloaded model on Hugging Face this year. I found this out through Clem's post, actually. They have the 21 days of Hugging Face or whatever, and day two was the most downloaded models. I was looking through, and it's GLiNER. It's just an entity extractor.

Robert Caulk [00:47:46]: We're not even really that brilliant. We didn't invent the architecture. We just engineered a dataset which was diverse and fine-tuned it. So we're kind of fine-tuners, not the architectural masters of transformers, it turns out. But it works very well, and people caught on. And it's really cheap.

Robert Caulk [00:48:05]: These are all super cheap models, which is why you can kind of daisy-chain them up. Llama 3 is pretty cheap; I mean, it's the most expensive one. But then Phi-3 is super cheap, GLiNER is super cheap. And that allows you to really build out that full synthetic contextual picture of a single article. And you do that for all of them. Every single article we find, we put it through that pipeline.

Robert Caulk [00:48:29]: So all of that sits there. Now go ahead, which one?

Demetrios [00:48:33]: Okay, wait, so you have Llama 3 going first and then it goes down the line? Or is it all at the same time, like asynchronously?

Robert Caulk [00:48:44]: It depends. Some of them are async, if they can be async. I think the graph and the GLiNER steps run async at the same time because they don't depend on each other. But yeah, Llama 3 goes first, because Llama 3 gets you that original, full distillation of information, and then you can start processing on top of it, which is really nice. A lot of these small models like GLiNER have a context window of 512 tokens.
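The ordering just described, the big model's distillation first, then the independent extraction steps fanning out concurrently on its output, maps naturally onto `asyncio.gather`. All three model calls are stubbed here; the function names are illustrative, not the actual pipeline code:

```python
import asyncio

# Sketch of a dependency-ordered async pipeline: the distillation
# step must finish first; graph building and entity extraction then
# run concurrently because neither depends on the other.

async def llama_distill(article):
    await asyncio.sleep(0)  # placeholder for real inference latency
    return f"summary of {article}"

async def build_graph(summary):
    await asyncio.sleep(0)
    return {"nodes": [summary]}

async def gliner_entities(summary):
    await asyncio.sleep(0)
    return ["Entity A"]

async def process(article):
    summary = await llama_distill(article)      # must complete first
    graph, entities = await asyncio.gather(     # independent, run together
        build_graph(summary),
        gliner_entities(summary),
    )
    return {"summary": summary, "graph": graph, "entities": entities}

result = asyncio.run(process("raw article text"))
```

Running the distillation first also solves the 512-token problem mentioned above: the small models receive the condensed summary rather than the full article text.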

Robert Caulk [00:49:13]: That's at least the general pipeline, and it's evolved. It's been really fun. We just get these free tools all the time, man. I feel like every other month is Christmas, because we were sitting around running Llama 2 13B, the WizardLM. I don't know if you remember WizardLM back in the day. It was from Microsoft.

Robert Caulk [00:49:37]: They pulled it. But we were running that, and then Llama 3 came out and we're like, all right, let's see. This seems smarter. Boom. And it was smaller and faster and smarter. What does that do to the ecosystem, right? I'm a tiny startup; we have a very small amount of funding and resources. In your opinion, because you're tapped in, you're seeing startups like mine, it's got to have a massive ripple effect, right?

Demetrios [00:50:06]: Dude, it's so funny, because on one hand, you have to be, I'm not going to say on the cutting edge, but you have to be testing all of this stuff out, because you are resource-constrained. And then I see the other end of the spectrum, where folks are just trying to figure out the product itself, and once they figure out whether they can make a product that someone loves, then they'll figure out how to optimize on pricing and latency and all of that. So you almost have two sides. There's the pricing-insensitive side that says, whatever, we'll just throw everything at OpenAI or Claude because we want the best, and that's the least of our worries right now. And then you have this side where it's like, I'm going to fine-tune everything I can to our specific use case and throw a lot of different models together. And what you're doing with the databases and how you're architecting things is so cool to see, because it shows the possibility of what's out there.

Robert Caulk [00:51:26]: Let me ask you a question. It's kind of a tangent, but I've been wondering lately about OpenAI's pricing, because you mentioned it. The pricing's going down and down and down, so why don't I just use that? And for sure we use it, don't get me wrong. Like I said, to label our data we use GPT-4o and Claude, and for doing the high-level forecasts we'll run those through them, because it makes sense; they're very smart. But I'm really scared as a startup about this downward pricing trend, because I think this is a race to the bottom, with these companies willing to take on a ton of debt to subsidize those prices. What happens as soon as that debt comes due? All of a sudden, what do we do? The prices go up, and all of these businesses like mine, and I'm worried about this, have been exposed to a $2.50-per-million-token GPT-4o price and said, I can build around that and build a real product, and then I start charging customers based on that.

Robert Caulk [00:52:39]: That ground level. All of a sudden, in two years if we're lucky, the gravy train is over. All of these companies say, listen, we can't do it anymore. All of our investors, Sequoia, want their money back; Microsoft too. Now it's going to start ticking up, right? And then that's just going to put pressure on us. Do you think that's going to happen, or can you alleviate some of my concerns?

Demetrios [00:53:04]: Yeah, it does feel a little bit like the early days of ride sharing, right? Exactly. You had lots of subsidized rides just to get people taking Uber and making it their go-to. And now, if you look, Uber is not cheaper than a taxi, that's for sure. So it does kind of have that feel. Especially when you hear things like OpenAI lost 5 billion and made 3 billion, and you're like, well, why did they lose so much money? Oh, maybe it's because they were subsidizing the inference. That's the elephant in the room that I haven't heard a lot of people talking about.

Demetrios [00:53:46]: So you could be right. I hope you're wrong, and I hope what happens is that you build that muscle of using open source models so much that by the time Llama 5 comes out, you're like, yeah, I don't even need OpenAI, or if I do need to hit an API, it's not going to make a meaningful difference on the cost, because ideally the majority of businesses are maturing with the market. That's my take on it. But I do think there's a race to the bottom. The other thing I was going to say is, they can't go down forever. There's got to be a stopping point.

Robert Caulk [00:54:36]: Yeah.

Demetrios [00:54:37]: It's like when you're running a race toward a finish line. Mathematically, every time you go from point A to point B you can cut the remaining distance in half, and that could go on for infinity. But in reality, at some point you cross the finish line. That's what it feels like here. At some point we're going to hit a stage where you can't go lower, because what are you going to do, charge nothing? That doesn't work.

Robert Caulk [00:55:15]: No, I'm scared. I like what you said about building that open source muscle, and we're trying; I think we're prepared. But I look at some clients who want to use these high-level derivatives for a certain price, and I'm thinking, one guy said, I want a two-year contract. I cannot forecast OpenAI's pricing that far out. So that's hard. But the other part, beyond the pricing of inference: the data itself is going to get more expensive.

Demetrios [00:55:44]: Okay.

Robert Caulk [00:55:45]: Because that gravy train is also coming to a very abrupt end. You've probably watched the whole Perplexity and OpenAI legal fights with publishers. They're signing up with some platforms, they're fighting others. The cost of true, good data is only going to go up from here. And you've maybe seen on LinkedIn there are ads now for, like, make 1,600 bucks a week doing coding problems to make training data for us, right? I like the premise, I like what they're after. Pay someone to generate your data, get the data you want; it makes sense. But that doesn't equate to $2.50 per million GPT-4o tokens.

Robert Caulk [00:56:27]: I'm sorry, it doesn't. That's expensive.

Demetrios [00:56:29]: Yeah, that's a very expensive one. Actually, I just saw that Uber opened up its gig economy: as an Uber contractor, you can now give rides, you can deliver food, and you can label data. That's their next move. Interesting. So that's pretty wild. And then on the other side, you see LinkedIn, which is trying to farm data by saying, you're an expert in this field, why don't you answer this question? And my favorite person that I follow on LinkedIn is this guy who just blatantly gives the wrong answers to all of those. He's completely trying to data-poison them.

Robert Caulk [00:57:14]: I really don't like that system.

Demetrios [00:57:17]: Yeah, but you kind of read the answers and you go, this might make it past; this might actually work, you know? You kind of have to read it and know that he is data poisoning to understand what's going on. So yeah, when the LinkedIn LLM comes out, I'm not going to have much trust in it because of that guy.

Robert Caulk [00:57:42]: Makes sense. I mean, with even LinkedIn data itself having value, maybe their LLM is going to be trash. But some of our competitors are companies like Exa. And I'm seeing these business models crop up, and I think I'm missing something, and I want you to tell me what I'm missing. The business model of Exa is: we scrape the entire web without any concern, including LinkedIn, including Threads, including Instagram, and deliver it to you and structure it for you. And actually, the tech is quite interesting. I like what they're doing. I think it's very compelling.

Robert Caulk [00:58:23]: But how is it a sustainable business model when LinkedIn is going to be like, nah, I'm good. You're not going to scrape our stuff anymore, right?

Demetrios [00:58:29]: Yeah.

Robert Caulk [00:58:30]: How is that a strong business model?

Demetrios [00:58:34]: Yeah. And even X is charging for API usage, so all of a sudden, every time you're hitting an API, you're paying. And that is one of the biggest questions I had for you: isn't the delay too long? I'm assuming all of the news sources you're getting come in when something is published on a blog or a publication. So isn't there too long of a gap between when something happens and when it gets published? I know the gap is very short, but for someone to actually write something and then push publish is much different from me sending out a tweet.

Robert Caulk [00:59:21]: This is an awesome discussion. I'm really enjoying this one. There are different timescales of business insight, okay? Let's take forecasting, for example. Typically, event forecasting is not about the next 10 minutes; say you want to forecast whether Pete Hegseth is going to be removed as the nominee for Secretary of Defense.

Demetrios [00:59:51]: Yeah.

Robert Caulk [00:59:52]: That requires quite a lot of information from the archive, actually, not just the current news. You do want the latest data points, for sure, because that helps you project forward. But understanding how those data points developed is equally important, if not sometimes more so. Especially if you're asking something like: is the S&P 500 going to close higher or lower in 2025? That doesn't really require the last tweet from Warren Buffett. It requires building out an understanding of what the Fed is doing, an understanding of the global economy, a very high-level contextual picture, which we can now do with LLMs, which is insane. So that's a resolution where having news up to the last five minutes is very useful. But like you said, the best example is: yeah, a lot of prediction market users use us, but they can't use us for every market.

Robert Caulk [01:00:56]: They can't use us for every prediction market because of what you just said. There are certain prediction markets where we're very usable, and there are certain prediction markets where we're completely useless, because they depend only on what happens on Twitter, and you'd need to be basically tracking Twitter and be the fastest to click to get the right prediction market price. Versus that Pete Hegseth example, where you can make the call 8 days in advance, 10 days in advance, and do research. Seeing that variety is really interesting. We like to specialize in structuring the news. We have a lot of analysts and firms like Love Justice using us for those sorts of investigations. But there are certainly applications where we're not the right one, and I like that.

Robert Caulk [01:01:41]: I even think it would make sense for us to reach out to another service that maybe specializes only in Twitter, right? I mean, we could use the Twitter API, and we have some development on that, but maybe there's a better service that can structure it in a more retrievable way, the way we do it with news. But they're focused.

Demetrios [01:02:03]: So you can plug that into your service, and then you do have that real time. But one thing is clear: just like we were talking about with Memgraph, right, Memgraph took steps and made decisions on how they were going to do things, and it feels like you've done the same. You said, we are going to create our product for these use cases, and right now we aren't even going to worry about the use cases that need to be really, truly real-time, the latest tweets about whatever, or the latest posts if it's breaking news.

Demetrios [01:02:47]: That isn't necessarily where the strong value comes in. It comes in more from understanding how the whole thing plays together.

Robert Caulk [01:02:56]: Exactly. And there's also some consideration of factuality, right? If it's just pushed to Twitter, there's a strong chance it might actually be untrue. Some of our analysts really want us to track Telegram channels.

Demetrios [01:03:11]: Oh, I was going to say that too. And you could do it, especially for the Ukraine war.

Robert Caulk [01:03:15]: Exactly. That's where a lot of the real latest information comes from. But there's a lot of misinformation there too. Or maybe not even misinformation; they're just not sure what occurred. They're speculating. It's speculation, and that shouldn't necessarily be written down as news. Just the fact that someone had to sit down and verify that this is something we want to put on a server as a news article gives it an additional level of credibility. Again, RT is doing it.

Robert Caulk [01:03:43]: So we take that with a grain of salt. But there's something to be said for the difference between publishing versus just a 200-character tweet or a message on Telegram. So I think using that context smartly is good. I think it would be really smart to do something like CAMEL-AI. I don't know if you've heard of them. They're an agentic, multi-agent framework. They're playing with letting agents make more agents, talk to each other, and come back, and it's wild. They just have a bunch of tools. To me, that really is the solution.

Robert Caulk [01:04:20]: Ask News is one tool that one of the agents might want to use. But then you have a Twitter tool, and then the agent can define that contextual picture. It can say, this is the news, this is Twitter, be careful, right? This is just Twitter. Let's use it for what it is and paint that contextual picture.

Robert Caulk [01:04:40]: I think that might be a good way, and maybe in the future we end up going the route of expanding. But I think you're right that, as a startup, we're really focused on getting news right. Yeah.

Demetrios [01:04:52]: One thing that you're doing is you're giving the whole picture and you're giving both sides and maybe some of these sides of the picture. Me as a consumer, I wouldn't necessarily go and seek out, but I am or I like to think of myself as open minded and I don't want to just read stuff that is coming from an echo chamber. And so I appreciate different viewpoints. And now how are you making sure that the stuff that you're grabbing isn't just fake news? Is it? Because you have a good solid graph of it and you can see that's part of it.

Robert Caulk [01:05:36]: There's a lot of human in the loop. We have an editor on staff, an editor in chief who is in our staff. He was editor of the Rocky mountain news for 20 years. Right. So this, this is someone who takes this seriously. There's a lot of human review of the data, a human review of the output, human review of what sources we're tracking. You know, if you submit, if you submit a sub stack and it's clearly just complete and utter trash. We, we, we won't, we won't include it.

Robert Caulk [01:06:10]: But it's, it's usually where our standards will allow a lot to come in. Not really, but because, and only because once you get to the point where you're painting the picture as a user or as us, we are the user of our own service in some ways. Yeah, you can really make strong decisions of how to put competing perspectives against each other. And like I said, you can filter on objective reporting voice.

Demetrios [01:06:37]: Yeah.

Robert Caulk [01:06:37]: You can detect when the voice of what was written was trying to evoke emotion. And that is a strong indicator that this is probably, there may be interesting details there, but that's an indicator that maybe we should leave this source with less credibility. You can even still pass it to the LLM, but you can just say label this, this is sensational. Right. And now treat the sensational sources as they should be treated with less credibility. You can, it's a big gray zone of how you can approach that problem. It's a big experiment to be completely frank. Like the front news, the front page of Ask News, that is where we do that high level clustering.

Robert Caulk [01:07:19]: We're using our own service on the back end, but we're writing our own work which is bringing in the, the Chinese, Japanese, French, all of the different perspectives in one. And that has been. When we started, we didn't know if it would work very well. But what we've noticed over the past year, we've been tracking the Israel Palestinian conflict, which is maybe the most polemic topic that currently exists. And every time we write an article, not every time, but many times, I'll look at one and I'll say I'm going to send this to someone I know who is very pro Israel and just ask them for their opinion. And I'm going to, or I'm going to send this to someone who's very pro Gaza and just ask them for the opinion. And honestly they're, they, they always come back and say that's a decent representation because the way that it reports is following these editorial standards of, well, according to the, the Health Ministry of, of Palestine, this is what happened. And that's kind of what an Israel reader wants to see.

Robert Caulk [01:08:26]: Right? Because they want to know what the, that health ministry is saying. And as long as you say that's who said it, this is where the information came. It's, it's up to a pro Israel person to decide if they trust it or not. And same with the, the Gaza side If, oh, the government of Israel has said, you know, Benjamin Netanyahu came out and said this is what happened. Well, at least it's being reported in that fashion. And, and then the contradictions are painted, right? It's okay, this, the, the Israel government says this, the Palestinian government says this, and this is the contradiction.

Demetrios [01:08:59]: That's it.

Robert Caulk [01:09:00]: Take it or leave it, right? At no point did we say anything and this was fact. Nothing was true, nothing was false. But we said these are where the alignments are. Believe it or not, a lot of times the Israel, you know, the stuff coming out of the Israel government actually aligns with some of the stuff coming out of the Gazan Health Ministry because a lot of the facts are indisputable about where the thing occurred. Right? There are journalists on the ground, there are pictures, there is a lot of aerial evidence. So the key alignments are very nice to see and I think that helps those people when I send the polemic topics out. So I'm not saying we're solving it, but we definitely have found a very unique contextual representation and you know, the final point of evidence for this methodology. One of our favorite clients, other favorite client is the University of Texas and they're doing research into misinformation detection.

Robert Caulk [01:09:57]: And their new method, CREDI Rag is a rag based credibility system based on Ask News but also using some wild techniques of building out social graphs between who said what and why they, when they said it. And but Ask News is this key component and they're finding that this is working really well bringing in all this different, bringing in competing perspectives for this. They'll basically, they take a Reddit post that might be misinformation and they search Ask News database to get all of the diversified viewpoints and then they pass that plus the Reddit to an LLM and you can classify it. You can say, okay, what does this seem to be misinformation? Does it seem to not be? And that works very well. Their accuracy is up to like 95% on the detection, which is up 25% from the previous state of the art. So that's a really fun validation. I love bragging about it. I appreciate you giving me the moment.

Demetrios [01:11:00]: For that, but yeah, of course. It's so cool to see. And I'm a firm believer that there's almost like three sides to every story. You know, your side, my side, and the truth is somewhere in between there. Exactly. This in a way is how you can get both sides of the story and you see where it matches up and Then you see where the narratives take different turns for better, for worse. So that, that's super cool that you give that because I, I like your friends that you're sending these articles to. Want to feel like I'm at least getting the other side of the story.

Demetrios [01:11:43]: Whether or not I believe it, that's my choice or whether or not I choose to support it. I guess less than believe it is do I want to support that argument or another argument. But I want to at least know what are both of these arguments exactly something that I was thinking about too. Are you using podcasts? Because it feels like that podcast transcripts could be a good news source too.

Robert Caulk [01:12:08]: We are looking into. We have an in the development environment YouTube transcripts. But the legality of it, of going public with it is some. Is the main blocker right now. There's a lot of. Yeah, you got to be very careful with how you use YouTube data because they're not really designed to be freely scraped like that.

Demetrios [01:12:29]: Yeah.

Robert Caulk [01:12:30]: So yeah, I think we would probably need to sign up individual podcasts to really get them in. But there's so much good information being transacted on podcasts. Like I completely agree even in you news YouTube videos themselves. You know, I think you probably know of dw, right. The, the German one. I find them to be very, you know, matter of fact straight and I would love to get that information as another competing perspective from their action from their YouTube videos. But the, the. The scraping YouTube legality is, is a, is one that's a bit hard to.

Robert Caulk [01:13:06]: To handle.

Demetrios [01:13:07]: Interesting. Well man, this has been fascinating. Before we go, I have to ask because you kind of talked about it back when we were mentioning the ontology. You said the way that you're doing no ontology is beneficial in some ways but you said for other ways it's not beneficial. What are the other ways? For those people that have sat with us for this whole conversation, at least give them some kind of payoff and let them know what the other side of the story is. Because I know that that is, that is also important. Like we can't paint it out to be all roses and tie dye, right?

Robert Caulk [01:13:48]: Yeah. A lot harder to handle more data. It'll become slower to do your traversals. It'll be not only that, but harder to formulate your query in some ways depending on if you're going to stick with this very free flowing ontology of ontology freeness. You can't really decide your, your queries are harder to build because you don't have a definition a priori in order to decide how that query should propagate. So that would be the biggest limitation. However, with the raw data and like I said, then moving into a MEM graph, you could, you can use our metadata to, to build it out. It's just a matter of deciding what you need.

Robert Caulk [01:14:34]: But communicating to LLMs, I find that to be the, the, the biggest difference is like, the strength of the ontology is querying and being able to extract insights that way and getting really directly to where you need to go in that graph. Even using Node 2 vec in order to, you know, predict links and stuff, like doing things that are very, very creative, but passing really highly resolved information to an LLM to extract that, that might be where there's a slight weakness. So that would be, I think the Give.
