MLOps Community

How Grounded Synthetic Data is Saving the Publishing Industry // Robert Caulk // Agents in Production 2025

Posted Aug 06, 2025 | Views 21
# Agents in Production
# Synthetic Data
# Publishing
# Ask News

SPEAKER

Robert Caulk
CEO @ AskNews

Robert is a scientist by trade, now focused on building the largest news knowledge graph on the planet to feed context-hungry agents.


SUMMARY

Synthetic data plays an important role in the news ecosystem. Publishers are now monetizing a synthetic version of their data to help feed news-hungry agents in the wild. We discuss how grounded synthetic news data not only protects publishers against copyright infringement, but also reduces hallucination rates for broad agents built to use hundreds of tools. As agents become better and better generalists, the data that they retrieve via tool use needs to be packed up and "context engineered" for quality and ease of consumption. The ancient adage was never more relevant: "Quality in -> Quality out". Enter the world's largest news knowledge graph: a perfectly searchable, highly accurate news context delivery machine, geared for high-stakes agentic decision-making tasks far and wide. Some of those tasks include fact-checking, geopolitical risk analysis, event forecasts for prediction markets, and much, much more.


TRANSCRIPT

Robert Caulk [00:00:00]: Thanks for coming to my talk. My name is Robert, I'm the CEO of Emergent Methods, and our primary product is AskNews. We're trying to save the publishing industry with grounded synthetic data, and we're going to dive right in. But before I do, I just want to say that there's a whole team behind us, especially Elin Törnquist and Wagner Costa Santos. Let's jump in here. So the plight of the modern publisher is, you know, they've been publishing data on the Internet for 30 years now, and they've had an ad model that works pretty well. And now we're in a different world, where people are using ChatGPT, and ChatGPT may go visit a website on your behalf at that moment in time. There's just no ad revenue for a bot, unfortunately.

Robert Caulk [00:01:07]: And so how do we subsidize that publisher data, that journalism, and keep these publishers alive in this new world of AI? Especially with low subscription sales, people getting short-form content on Google, and, worst of all, original voice and expression being used to train LLMs. Right? So you've got those court cases between the New York Times and OpenAI, News Corp and Perplexity. Everybody's upset at everybody for using publisher data. So what does the web look like without publishers and journalists? I don't think that's a question we want to get the wrong answer to. I think it's pretty well established that journalists and publishers are foundational to everything that we do, including AI agents, your agent doing whatever it needs to do. If it needs real-time information, objective accuracy about what's going on in the real world, it's going to boil down to a good journalist and a good publisher. Real boots on the ground.

Robert Caulk [00:02:15]: There are some solutions to this new problem, right? There are direct license deals, where one publisher is trying to cut license deals with every single AI developer and every single AI agent. It's not super scalable, it's difficult. Good for large publishers getting in bed with large corporations, not so good for smaller publishers that don't really know where to start. Lately, the Cloudflare AI content marketplace plans are great. I think there's a lot of potential there. I'd like to see where it goes.

Robert Caulk [00:02:49]: But it doesn't actually solve the problem, as we'll see in a second here. There are some other solutions. You've got ProRata, and they're trying to basically create ad-filled AI chatbots and then direct that revenue back to publishers. That may be great for B2C, but it's not so great for someone like you trying to build an AI agent that solves real-world problems, makes forecasts on geopolitical risk, and, you know, identifies industries where technological advancements are going to occur. They all suffer the same problem: they're not controlling for that original text and expression.

Robert Caulk [00:03:28]: In every single one of these, it gets lost, and it's out in the ether. And as soon as that original text and expression is lost, it's more difficult for that publisher to just continue existing, because they've lost the expression that they put a lot of money into building. So what does it mean to bridge publishers and AI agents? Right. I think that is the most important problem we're trying to solve with AskNews. Well, the publisher wants to monetize the information. The AI agent just needs accurate information. The publisher is trying to protect original expression. Meanwhile, the AI agent just wants reliable data.

Robert Caulk [00:04:11]: You want to build a business on data that you know is going to be there in five years, right? A reliable data feed. Token cleanliness is on the AI agent side: if you've dumped raw HTML into an LLM, you know it's not the last step of your data processing pipeline. It's most likely one of the first steps, and an expensive and difficult step. Meanwhile, the publisher doesn't care about that; the publisher cares about getting users that aren't competitive with their original text. The publisher wants low user-acquisition effort; they just want AI developers to use the information easily. The AI agent just wants a lot of sources, a lot of diversity.

Robert Caulk [00:04:58]: These LLMs are incredible at reconciling tons of different data points across thousands of different sources all at the same time. With Gemini, we have context windows now up to a million tokens. Imagine just how many different sources, how many different perspectives you can pack into that to generate a forecast. It's pretty incredible, the world we live in. Obviously the AI agent wants a natural language API. The publisher doesn't know about natural language APIs. These are small publishers. They've got data on a website, but they're not really sure how the hell to get it to an AI agent.

Robert Caulk [00:05:34]: It's just sitting as HTML on a server. Speaking of servers, the server load from these AI agents going directly to scrape the same page over and over is astronomical, and it costs the publisher money. So not only are the publishers losing out on their product, but now they're paying more money to keep the same product up. And obviously there's always more, right? So this bridge is really important. How do we build that bridge? Well, that's where grounded synthetic news data comes into play. Take an original news article, remove the narrative voice, the journalistic style, and the original phrasing, but preserve the most important information and enrich it.

Robert Caulk [00:06:22]: Add sentiment, add classification, identify geocoordinates, identify entity relationships: all of the contextual pieces that help the LLM build out a better image of what the hell is going on in the real world. That's what the AI agent wants. They don't actually care about that journalistic expression. So that's the big question: how do we do that? How do we preserve that information and build that synthetic data as well as we can? First we should probably define the ideal synthetic representation of news information.
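To make that definition concrete, here is a minimal sketch of what one enriched, grounded synthetic record could look like as a Python dataclass. The field names and types are illustrative only, not AskNews's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SyntheticNewsItem:
    """Illustrative shape of a grounded synthetic news record (not AskNews's real schema)."""
    headline: str                        # rephrased headline, no original journalistic expression
    summary: str                         # the key facts, stripped of narrative voice
    published: str                       # ISO-8601 publication date -- the crucial "when"
    sentiment: float                     # e.g. -1.0 (negative) to 1.0 (positive)
    classification: list[str] = field(default_factory=list)       # topic labels
    entities: dict[str, list[str]] = field(default_factory=dict)  # {"PERSON": [...], "ORG": [...]}
    relationships: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)
    geocoordinates: tuple[float, float] | None = None              # (lat, lon) of the reported location
    source: str = ""                     # the licensed publisher the record is grounded in
```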

Robert Caulk [00:07:07]: This is called context engineering, which is a big word lately, and I'm really happy to see that happening. We introduced and coined that term. Actually, it was in, I think, December 2023, in a different talk. But Demetrios hosted us on the Vector Space Talks for Qdrant in February 2024, where we talked about context engineering for real-time news distillation. We've been screaming this from the rafters for years at this point. So let's context engineer the news. Let's do that. What does it even mean? Well, this is pretty obvious. We didn't invent how to understand the news, right? This has been around since 350 AD: the who, what, why, where, when.

Robert Caulk [00:07:40]: That's how you really understand. And if you can understand it, you can convey that to an LLM. So what do you need to understand it? You need the people, the organizations, the events, the statements, the attributions. The published date is really important, right? Knowing Joe Biden backed out of the race doesn't help me forecast what's going to happen next unless I know when: whether it happened 24 hours before the election or three months before is a big difference, right? So these are pieces of context that are key to avoiding hallucinations: locations, motivations, all of the above. When you scrape that HTML, illegally by the way, you're maybe getting some of this, rarely getting most of it, and just never getting a lot of it. And you don't have a reliable data feed; everybody loses, it just doesn't make sense. But what if you actually did have all of it?

Robert Caulk [00:08:34]: And what if you had more? That sounds good. I think we're all on the same page. Let's add more. Let's add knowledge graphs, let's add source origins and geocoordinates, and let's license the content and keep the publishers afloat, so that next year, when I still need information from that publisher, they're there and their journalists have boots on the ground for your agent that's trying to make a forecast. What's the TL;DR? This is it, right? We want token-optimized, LLM-ready, objective information. That's it.

Robert Caulk [00:09:07]: We don't want raw HTML, poorly labeled, seeds for hallucination, biased expression, none of that. Okay, so how do we build it? We've defined it, that's great, but now let's talk about how we build it. I think that's kind of why we're here. A lot of entity extraction, sentiment analysis, you know, understanding categories, correct storage types. Let's talk about how we actually get to this information. This is not easy, right? You need to follow the AP style guide. You need to understand how to extract statements and evidence and attribution.

Robert Caulk [00:09:40]: You need an in-house editor-in-chief, which we have: someone who has spent 45 years in journalism and understands how to extract information. This is how you ensure journalistic integrity in that synthetic data point. Classifying topics: very helpful. Sentiment analysis: this helps understand the why, right? I want to know what kind of sentiment is going into this story. Reporting voice: that's huge. That helps the LLM decide whether this is a credible source. Was this written in a very provocative way, or was it objective? Was it written subjectively? These are really key data points in context engineering that allow your LLM to do so much magic under the hood.
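As a rough illustration of how one such extraction pass might be framed, here is a hedged sketch that builds a prompt and hands it to any LLM callable. The prompt wording, the field list, and the `llm` parameter are placeholders, not the production pipeline.

```python
import json
from typing import Callable

EXTRACTION_PROMPT = """\
You are an editor following the AP style guide. From the article below, extract:
- statements: each claim, with its attribution (who said it) and supporting evidence
- topics: a short list of topic labels
- sentiment: a score from -1.0 (very negative) to 1.0 (very positive)
- reporting_voice: one of "objective", "subjective", "provocative"
Return strict JSON with exactly those keys.

ARTICLE:
{article}
"""


def extract_context(article: str, llm: Callable[[str], str]) -> dict:
    """Run one enrichment pass over a raw article using any LLM callable (e.g. an on-prem model)."""
    raw = llm(EXTRACTION_PROMPT.format(article=article))
    return json.loads(raw)  # assumes the model returned valid JSON; add validation/retries in practice
```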

Robert Caulk [00:10:31]: Extracting entities is huge. I want to get all of the entities so that the LLM is ready. This is the synthetic data form; the LLM can work with these entities. This is an entity, a "who": you know, Joseph Wilson is a doctoral candidate at the University of Toronto. Knowing that that character exists is key to your context. We've trained a small language model, an entity extraction model, GLiNER to be exact, to do exactly this. It's small, which enables high throughput, and it's really, really good. It was one of the top 20 most downloaded models in 2024.

Robert Caulk [00:11:08]: We're really proud of that. Building that model meant understanding how to balance perspectives. If you're going to extract entities from text, you want to be able to get to the entity. A lot of the basic spaCy entity extractors are trained on Western names, so they're very good at extracting Western named entities. But as soon as you get to an article from Africa and you're trying to extract an African name, they're less likely to extract it correctly. That's a problem if your goal is objective accuracy. In order to build a forecast, a geopolitical forecast, I want to get to all the names. I don't want to just deal with Western names.
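For reference, the open-source `gliner` package exposes a zero-shot entity extractor along these lines. The checkpoint name and example sentence below are assumed for illustration; swap in whichever GLiNER fine-tune you actually use.

```python
# pip install gliner
from gliner import GLiNER

# Checkpoint name assumed for illustration; any GLiNER model works here.
model = GLiNER.from_pretrained("EmergentMethods/gliner_medium_news-v2.1")

text = (
    "Joseph Wilson, a doctoral candidate at the University of Toronto, "
    "met with Ngozi Okonjo-Iweala in Abuja to discuss trade policy."
)
labels = ["person", "organization", "location"]

# Labels are supplied at inference time, so no retraining is needed for new entity types.
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
```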

Robert Caulk [00:11:47]: So how do you train a model like that? That's what we did. We wrote this article; I encourage you to go check it out. It's a great article. Translation: huge. How do you translate properly? Right, all of this stuff is huge. Using a good storage engine is great.

Robert Caulk [00:12:02]: Qdrant is huge. That's how you can store all of the synthetic text with all of your metadata and enable high-level filtering. Relationships are key: how do you extract who did what, and who's connected to which place? Here's an example knowledge graph. Under the hood you've got Trump and all the connections to Trump across all the different sources.
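A minimal sketch of that storage-plus-filtering pattern with the Qdrant Python client. The collection name, payload fields, and placeholder vectors are assumptions for the example; in practice the vectors would come from your embedding model.

```python
# pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import (Distance, FieldCondition, Filter, MatchValue,
                                  PointStruct, VectorParams)

client = QdrantClient(":memory:")  # swap for your own Qdrant instance

client.create_collection(
    collection_name="synthetic_news",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store one enriched record: the embedding plus the metadata used for filtering.
client.upsert(
    collection_name="synthetic_news",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 384,  # placeholder; use a real embedding here
        payload={
            "entities": ["Donald Trump"],
            "topic": "tariffs",
            "published": "2025-07-01",
            "source": "AFP",
        },
    )],
)

# Retrieve only items that mention a given entity, then rank by vector similarity.
hits = client.search(
    collection_name="synthetic_news",
    query_vector=[0.0] * 384,
    query_filter=Filter(must=[FieldCondition(key="entities", match=MatchValue(value="Donald Trump"))]),
    limit=5,
)
```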

Robert Caulk [00:12:44]: This is how an LLM views information with objective accuracy and gets to truth. Right? The goal is, I want to get a high-probability forecast for, you know, whether or not the tariffs are going to be called off next month. I need to know all of the connections between Trump and all of the countries and industries that are intertwined. Right, that's kind of basic. What about bias? That's huge. How do we remove it? We wrote a two-part series of articles on this in our bias research project. The first one was where we were trying to find that crazy agenda of DeepSeek that everyone decided to call out immediately. It turns out that DeepSeek hosted locally, not using their chat interface, is actually not biased in the way everyone assumed. Okay.

Robert Caulk [00:13:23]: It's actually biased to the Western side, which kind of makes sense, because it's a distillation of OpenAI models anyway. But proving it, understanding that it's there, allows us to mitigate it. And let's talk about part two: Bias Expert. This is another model that we just released, two days ago actually. We're really excited. It allows you to identify the bias and potentially recommend amelioration, how to improve it. So let's put it all together.

Robert Caulk [00:13:53]: A synthetic data point contains all of the most important data in a clean fashion. Token-optimized, no extraneous HTML crap, with additional enrichment: relationships and sentiment. And what else do we have here? The geocoordinates and the key points, the translation. This is what your LLM wants. Your LLM is what it eats. But don't take my word for it. Let's look at the external real-time validations.
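Putting that together, a single synthetic data point might look roughly like the following. This is a hypothetical instance built around the Biden example from earlier in the talk, not an actual AskNews record.

```python
synthetic_point = {
    "headline": "Biden withdraws from presidential race",        # rephrased, no original expression
    "key_points": [
        "Joe Biden announced he is ending his re-election campaign.",
        "The announcement came months before the election.",
    ],
    "published": "2024-07-21T18:00:00Z",        # the "when" a forecasting agent needs
    "entities": {"PERSON": ["Joe Biden"], "ORG": ["Democratic Party"]},
    "relationships": [("Joe Biden", "withdrew_from", "2024 presidential race")],
    "sentiment": -0.2,
    "reporting_voice": "objective",
    "geocoordinates": (38.9072, -77.0369),      # Washington, D.C.
    "language_original": "en",                  # translation metadata
    "source": "licensed-publisher-id",          # placeholder identifier
}
```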

Robert Caulk [00:14:24]: This is Metaculus, a quarterly real-time forecasting tournament where these LLM bots need to answer questions in real time, generating true forecasts about what's going to happen in the military, in geopolitics, in finance, in tech. And there's only one way to be right or wrong, and it's to actually get it right. There's no gaming a benchmark, there are no answers a priori; you make the forecast and you get it right or you don't. And the truth is that other providers are not taking this as seriously. Okay, and it shows, because basically the whole leaderboard is just people who have switched over to AskNews. After three quarters of this, people stopped even trying to use Perplexity, and now they've just switched to AskNews because they want to win money. This isn't really about, like, you know, anything else; it's just, hey, I have a bot, I want it to be better, I want to make money, I want to forecast the future.

Robert Caulk [00:15:19]: Okay, makes sense. This is MLOps, a conference, and so, you know, we should talk about the actual ops behind generating that synthetic data. That presentation from February 2024 goes pretty deep into this, and most of it still stands. I think we talked about Llama 2 because that was the model at the time, but besides that, everything there shows how you take a lot of articles from a lot of places and store and retrieve them properly. Go check that out; I think you'll like that presentation. But it's not just about orchestrating a bunch of LLM calls, it's orchestrating the full system.

Robert Caulk [00:16:10]: Right, your reasoning engine and all of the various databases and storage services that are going to help support that system. Let's bring it back: what is the bridge between these two things? Well, the publisher wants to sell high-value, human-seeded information, and the developer just wants to buy the right info at the right time, in the right way, for an extended period of time, and they want that to always be the case. The publisher doesn't want to lose control of original expression. And the beauty is that the AI developer doesn't even want the original expression in these high-stakes decision-making systems; they just want objective accuracy. In some ways the original expression is worse, because it hasn't been filtered through bias and reporting-voice filters and all of this cleaning.

Robert Caulk [00:17:09]: This is how the developer actually wants it: they want a clean piece of data. The developer doesn't want to waste time maintaining HTML scrapers and cleaners with hopes and prayers. They'd rather focus on their high-stakes decision-making system and let the publisher and AskNews figure out how to get the right data to them at the right time. So which publishers are already signed up? We have a lot; I think we have over seven or eight hundred of them. AFP, Agence France-Presse, is one of the top ones, and we're really proud of that. They are really excited about what this means for driving more revenue, in a complementary way, into the publishing business to help support their vision, which is, hey, journalism.

Robert Caulk [00:17:58]: They just want to keep doing journalism, and, well, all these high-stakes decision-making systems that are reliant on AI agents that need real-time info are ready to pay for this sort of good information. That's the beauty of it. How does the royalty distribution work? 50% of all of the royalties go to publishers and are split weighted by surface rate. So any time an article from a specific publisher surfaces, they get a percentage of that royalty for that month. And the other 50% allows us to maintain overhead and continue our research into bias (which is open source), entity extraction (which is open source), and all of the other tools that we're putting out.
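As a rough illustration of the royalty math described here, a small sketch: the 50% publisher pool comes from the talk, while the surface counts and dollar amounts are made-up numbers.

```python
def split_royalties(total_royalties: float, surface_counts: dict[str, int]) -> dict[str, float]:
    """Split the publisher share of monthly royalties, weighted by how often each publisher surfaced."""
    publisher_pool = total_royalties * 0.50            # the other 50% covers overhead and open research
    total_surfaces = sum(surface_counts.values())
    return {
        publisher: publisher_pool * count / total_surfaces
        for publisher, count in surface_counts.items()
    }


# Hypothetical month: AFP surfaced 600 times, two smaller publishers 300 and 100 times.
print(split_royalties(10_000.0, {"AFP": 600, "Publisher B": 300, "Publisher C": 100}))
# -> {'AFP': 3000.0, 'Publisher B': 1500.0, 'Publisher C': 500.0}
```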

Robert Caulk [00:18:56]: Who's using it? We've got a ton of different examples. A lot of universities: Columbia, George Mason University, Texas. What about some of the scientific discovery platforms? You should check some of these out; these are great. Amass Tech is really good; they're helping with scientific discovery. And we've got risk assessment from Riley Risk and Prepper AI. There are a lot of great users who show that, hey, this data is valuable and we're willing to pay publishers to get to it. So I'm not the only one behind this. I'm just kind of presenting for everybody. The team is small but strong, and we're focused on research and helping build the bridge between the publishers and the developers.

Robert Caulk [00:19:35]: If you are interested in being on either the publisher side or the developer side of this puzzle, join our Discord. That's the QR code at the bottom left there, or just shoot us a LinkedIn message and we'll be happy to chat and see if we can help. I appreciate the attention.

Skylar Payne [00:20:30]: Awesome. Yeah, love this. This is a lot of great content. I really loved the bits on evaluating the bias in your models, and it's awesome to see that you've released a model. If folks have questions, feel free to put them in the chat and I'll pass them along. But in the meantime, it seems like you've done a lot here. What do you think's next for AskNews? What's the next big problem?

Robert Caulk [00:21:01]: Yeah, I think the goal now is to bring in more publishers and start getting the publishers paid. So kind of simultaneously enlisting publishers and developers at the same time. I think we've just scratched the tip of the iceberg with what can be done with the quality of the data. And I'm really excited to see, as these new tools come out, that as soon as we release a fine-tune, there's a new model that's even better that we could fine-tune. So I'm really excited to see just how small and intelligent we can get with the bias extraction system, because that's really key. Having high throughput through a model, like, that's important, and it really changes what you can forecast and what you can do with the data. So that's.

Skylar Payne [00:21:53]: So yeah, I want to dig into that a little bit. You said high throughput on the model is important. Is that because, typically, you're going to be using it in an online setting? This wouldn't be, like, trying to mitigate bias offline with your model, or...

Robert Caulk [00:22:10]: That's a good question, and maybe I didn't touch on it, but the reason we're able to put such an extensive pipeline in front of every data point while maintaining low latency is because we process articles 24/7 as we're pulling them in. And that's high throughput: we have to process with multiple LLMs on each one. And we're a startup, so the smaller the LLM, the cheaper it is to run and the more articles we can pump through. And by the way, part of our agreement with publishers is that it all has to be done on premise. So we're not allowed to process any data using OpenAI or Anthropic, because that would defeat the entire purpose of us protecting the original text by just sending it to OpenAI to get processed. So we have to process it in our cluster.
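A minimal sketch of that kind of bounded-concurrency enrichment loop, using asyncio with stubbed-out model calls; the function names and concurrency limit are placeholders, not the actual AskNews pipeline.

```python
import asyncio

# Stubbed enrichment passes; in practice each would call a small on-prem model.
async def extract_entities(article: str) -> list[str]: return []
async def score_sentiment(article: str) -> float: return 0.0
async def store(record: dict) -> None: pass


async def enrich(article: str) -> dict:
    """Run the multiple model passes each article needs, concurrently."""
    entities, sentiment = await asyncio.gather(extract_entities(article), score_sentiment(article))
    return {"article": article, "entities": entities, "sentiment": sentiment}


async def run_pipeline(articles: list[str], concurrency: int = 32) -> None:
    """Bound how many articles are enriched at once so small models keep up with a 24/7 feed."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(article: str) -> None:
        async with sem:
            await store(await enrich(article))

    await asyncio.gather(*(worker(a) for a in articles))


asyncio.run(run_pipeline(["article one", "article two"]))
```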

Skylar Payne [00:23:01]: Can you clarify? I think the term on-prem has gotten a little bit ambiguous to me in AI systems. When I hear on-prem, I think, no, we're running the servers, not using the cloud.

Robert Caulk [00:23:14]: Exactly. Yeah.

Skylar Payne [00:23:15]: So that's it. Okay, so, like, you're not using AWS or anything even to do this processing?

Robert Caulk [00:23:21]: I mean, for processing the data? No. For serving some front-end components and stuff we'll use DigitalOcean, but AWS is far too expensive for us, so we stay far away from it. Unfortunately, DigitalOcean has been a struggle too, so we've migrated away from it. A lot of what we do is just, we brought everything back on prem. But there's a little bit of a risk there, you know: if our data center catches on fire, then we have some problems. That's the real downside of that. But, you know, we can always build another one, I guess.

Skylar Payne [00:23:55]: Awesome. Well, thank you for the time. If folks want to connect with you, is there a place they can go do that?

Robert Caulk [00:24:00]: Yeah, come over on LinkedIn. I can toss it into the chat here.

Skylar Payne [00:24:05]: All right.

Robert Caulk [00:24:06]: Basically it's Robert Caulk, C-A-U-L-K, as it is here. Awesome.

Skylar Payne [00:24:11]: You heard it here: connect with Robert Caulk on LinkedIn. Awesome. Thank you so much for your time. This is incredible.

Robert Caulk [00:24:18]: Ciao, guys.

Skylar Payne [00:24:20]: Take care.
