Intentional Arrangement: From Digital Hellscape to Information Nirvana // Jessica Talisman
SPEAKER

Jessica has over 25 years of experience developing and implementing information and knowledge architectures within a variety of domains. She has built information and knowledge management systems for e-commerce, educational technology, government, advertising technology, academic libraries, and even vendor tool-based architectures. Jessica builds semantically rich knowledge management ecosystems, comprising taxonomies, thesauri, and ontologies, for the benefit of humans and machines. Jessica currently works for Adobe as a Senior Information Architect, where she is building a semantic knowledge graph to represent Adobe content and assets.
SUMMARY
Humans have catalogued information for more than 3,000 years, a practice that has evolved as technology has advanced. The Library and Information Science discipline is responsible for building systems for organizing data, to serve our physical and digital knowledge domains. How have librarians sustained analog and digital repositories? Intentional arrangement. With the current wave of artificial intelligence (AI), many organizations are struggling with poor data management. Welcome to the digital hellscape, rife with dirty data that is un-curated, unstructured, undefined and ambiguous. To emerge from this hellscape, look towards the librarians, who can show you the way. Controlled vocabularies, taxonomies, thesauri, ontologies and knowledge graphs have all emerged from the librarian’s toolbox. In kind, AI performance is optimized when trained on the same, intentionally arranged, structured data. Harmonious data ecosystems, optimized for human and machine, are our information nirvana and can be achieved with intentional arrangement.
TRANSCRIPT
Click here for the Presentation Slides
Adam Becker [00:00:00]: Jessica, are you around?
Jessica Talisman [00:00:09]: I am here, Adam.
Adam Becker [00:00:12]: Very good to see you. I gotta confess something.
Jessica Talisman [00:00:14]: Yes.
Adam Becker [00:00:15]: When I was reading through your agenda and the abstract for the talk, I did a little bit of research, and I saw how you articulate some of the challenges that exist in the space. And I felt almost a certain type of embarrassment, because I thought to myself, somebody knows how my brain works, with the taxonomies and the ontologies. I love stuff like this, though I rarely have the vocabulary to describe all these things that I love. And just seeing the way that you describe it all made me blush, to be honest.
Jessica Talisman [00:00:57]: That's what I aim for.
Adam Becker [00:00:58]: But yeah, I was like, you know, if I have another role, it should be something like that, whatever it is that you're doing. I find it so fascinating, and I was looking forward to this talk. So I'm going to leave you to it. I'll be back in 20 minutes, and I hope everyone else has the same reaction that I did. Take it away.
Jessica Talisman [00:01:22]: Thank you. Can you see my slides? Just making sure. So my name is Jessica Talisman. I'm Senior Information Architect for Adobe's Experience League, which is the primary learning gateway and documentation platform for Adobe products, mainly the B2B powerhouse tools like Adobe Experience Manager and Adobe Analytics. I'm currently building the knowledge graph and semantic system to improve customer journeys in a multitude of ways. I've worked in knowledge management and the information architecture space since 1997. Included in that work, I've worked for the US Department of Justice, Overstock, Pluralsight, Microsoft, and even as an academic librarian.
Jessica Talisman [00:02:05]: Today, we're going to talk about intentional arrangement. Intentional arrangement is the explicit or implicit acts of organization by people, or by computational processes acting as proxies for or as implementations of human intentionality. By organizing with intention, we can go from digital hellscape to information nirvana. Information begins with data. LLMs are an assemblage of statistical models representing patterns of language. But having data, or having a lot of data, is not enough to solve business problems. To solve business problems, data must be structured and contextualized in order to deliver results relevant to a domain, a business, the Internet, and the world. In fact, the mainstreaming of AI has shined a light on data quality issues, which I think we're all familiar with, and the importance of structured semantic data.
Jessica Talisman [00:03:04]: Knowledge management practices are missing from a lot of organizations and domain spaces, and because they're absent, LLMs cannot make sense of unstructured business environments. However, everyone wants a knowledge graph, because graphs have been proven to be the answer for wayward AI implementations. But guess what? A knowledge graph is a knowledge management tool that requires knowledge management practices to render semantic structured data for machine and human retrieval tasks. Allison Gopnik in the Wall Street Journal characterizes generative AI as a cultural technology, and I think a lot of us don't think of it as such. It perfectly positions how we should be approaching AI, from training data to RIG and RAG implementations to the classification of output. Cultural technologies, in her terms, provide ways of communicating information between groups of people.
Jessica Talisman [00:04:00]: Examples, she suggests, are writing print language libraries, Internet search engines, or even language itself. She states that asking whether an LLM is intelligent or knows about the world is like asking whether the University of California's library is intelligent or whether Google Search knows the answer to your questions. It is how you use the technology that actually lends intelligence. She reminds us that previous cultural technologies also caused concerns. She gives the example of Socrates Thought thoughts about the effects of writing on our ability to remember because we have confronted these issues before, where we throw doubt in the face of technologies like the Internet, we share concerns about the potential for people using LLMs to spread disinformation and misinformation, whether intentionally or by accident. She also notes that past cultural technologies required new norms, rules, laws, and institutions to make sure that the good outweighs the ill, from shaming liars and honoring truth tellers to inventing fact checkers, librarians, libel laws and privacy regulations. So why are knowledge graphs at the center of Gartner's impact radar for generative AI? Because they are critical helper functions for the LLM. A knowledge graph represents the distilled, organizational, business and domain specific knowledge of your company.
Jessica Talisman [00:05:25]: Providing high quality, curated, and structured information to the LLM via RAG is often the missing piece, which can change the output of the LLM from a dangerous hallucination to a helpful solution to a business problem. According to the Turing Institute, knowledge graphs organize data from multiple sources, capturing information from entities of interest in a given domain or task, like people, places, or events, and forge connections between them. In data science and AI, knowledge graphs are commonly used to facilitate access to and integration of data sources; to add context and depth to other, more data-driven AI techniques such as machine learning; and to serve as bridges between humans and systems, such as generating human-readable explanations or, on a bigger scale, enabling intelligent systems for scientists and engineers. All cultural technologies rely upon knowledge management; that is just a fact. What are examples of cultural technologies? We can look at libraries, archives, museums, scholarly publication channels, academia, and even search engines such as Google and Microsoft Bing. A knowledge graph is a tool for knowledge management and therefore an essential cultural practice, focusing upon leveraging machines to support information retrieval. The ethos of knowledge management is founded upon principles such as accuracy, truthfulness, machine readability, and interoperability with minimal bias. As a cultural technology, the information we model, consume, and share should represent facts, which manifest as real ideas and real things in the real world, all with accurate context.
Jessica Talisman [00:07:06]: One thing that cannot be automated is deciding what the truth is. Humans must design for the truth. It is people that ultimately shape data to become information. Relegating the creation of knowledge graphs to the LLM is like relegating the creation of evals to the LLM without human oversight. Knowledge management as a cultural practice is about grounding data with facts to create information-rich landscapes built for reliable information retrieval. In other words, AI cannot exist without people curating and structuring data to become information, which is then presented as knowledge in a knowledge graph. This fact setting establishes ground truths to serve as LLM contextual inputs throughout the life cycle of an LLM. Knowledge management is the human touch that legitimizes AI as a trustworthy cultural technology.
Jessica Talisman [00:08:00]: For over three centuries, librarians have been managing knowledge and evolving knowledge management practices. It's kind of mind-boggling to think about the depth of experience and iterative building. With the digitization of bibliographic records in the 1960s and the emergence of the World Wide Web in the late 80s and early 90s, librarians adopted consistent and standardized methods for managing and connecting distributed big data. Authority records became central to building and connecting rich semantic knowledge management systems. Linked data proved essential for entity resolution, identity management, embedded references, and citations. This unlocked the linking of data points and records across vast, globally connected data systems, and linked data fit neatly into metadata-rich catalog records, naturally standardized and formatted using machine-readable languages and formats such as XML and RDF. These same principles were evangelized by the founder of the World Wide Web, Tim Berners-Lee, who espoused the virtues and critical importance of linked data. In his 2006 Linked Data note, Berners-Lee advocated for the use of URIs as identifiers, HTTP protocols for information retrieval, RDF for structured data, SPARQL for querying it, and the systematic linking of documents across the Web.
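Those principles can be sketched in miniature: statements are (subject, predicate, object) triples whose subjects and predicates are HTTP URIs, and queries are triple patterns with wildcards, which is the core idea behind SPARQL. This is a minimal sketch using only the standard library; all URIs and triples below are invented for illustration, not real authority records.

```python
# Tiny in-memory triple store illustrating the linked-data model.
# Every URI here is a made-up example.org identifier.
TRIPLES = [
    ("http://example.org/book/moby-dick", "http://example.org/prop/author",
     "http://example.org/person/melville"),
    ("http://example.org/person/melville", "http://example.org/prop/name",
     "Herman Melville"),
    ("http://example.org/book/moby-dick", "http://example.org/prop/title",
     "Moby-Dick"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    playing the role of a SPARQL variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Who wrote Moby-Dick?" — follow the author link, then the name link.
author = match(TRIPLES, s="http://example.org/book/moby-dick",
               p="http://example.org/prop/author")[0][2]
name = match(TRIPLES, s=author, p="http://example.org/prop/name")[0][2]
print(name)  # Herman Melville
```

The point of the sketch is that the query never embeds a definition of the author; it follows HTTP identifiers from one record to the next, which is what lets independently maintained records link up.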
Jessica Talisman [00:09:25]: If you hear today about Web 3.0 or the semantic Web, this is really not a New concept, but a continuation of the founding principles for the Web in the age of AI. Everyone using these technologies will benefit from the first principles of authority records, linked data, and library sciences as a practical method for minimizing LLM hallucinations. So we're going to dive into sort of the world of linked data and how librarians have proved out, you know, and actually built a system leveraging LLMs that spanned the course of a decade. So the online Computer Library center, or OCLC, is the owner and manager of WorldCat. So these are all librarians working in this space. This is the world's largest interoperable library catalog system, which is represented as networked semantic knowledge graphs. WorldCat is also cooperatively networked with other major knowledge graphs such as Wikidata, Amazon Books, Google Books, and academic scholarly publishers. This massive network knowledge system has been the vehicle and orchestrator of continued linked data initiatives throughout the digital repositories of knowledge.
Jessica Talisman [00:10:34]: Beginning in 2009, each of OCLC's collaborative efforts has focused upon launching large, interoperable, linked data semantic systems, culminating in the Shared Entity Management Infrastructure, completed in 2021. As data, information, and knowledge expand and contract with the speed of their creation and deletion, linked data provides flexibility without compromising the integrity of knowledge management systems. The entire shared linked data infrastructure benefits all users and systems through collaboration, and has established a global system for distributed entity and identity management. Evolving library data into linked data frees the knowledge in library collections and connects it to the knowledge streams that inform our everyday lives on the Web, through smart devices, and using technologies like artificial intelligence. So why linked data? Descriptive metadata, records, and ontologies can be heavy and verbose, making it difficult to manage knowledge at scale. Linked data streamlines the description process, which enables the serendipitous discovery of new concepts through the relationships between linked data authority records. In the case of LLMs, linked data provides structured, machine-readable descriptive data. And all the same benefits are afforded to humans:
Enhanced discoverability, improved interoperability, and contextual understanding. For the decades-long OCLC Linked Data project to be successful, communities had to agree on the meaning of common shared data. While each organization or community can have their own unique vocabularies and metadata, linked data connects unique terminology to the commons and resolves the varied terms via ontology-rich meta records and authority records. Now that the OCLC Linked Data project has reached maturity, libraries are shifting away from traditional cataloging to identity management. It's not the same identity management that we think of in terms of security or a person's identity, but actually the identity of concepts within a large ecosystem, and managing those relationships. The benefit of this is that networked linked data systems manage entities at scale to optimize knowledge discovery for humans and machines. One of the most valuable learnings from the OCLC Linked Data project is the codification of iterative data preparation and cleaning processes, something that many of us don't even pay attention to or invest in. Without data quality and knowledge management, linked data systems will suffer or fail.
Jessica Talisman [00:13:09]: This is at the heart of intentional arrangement. Focused upon every step throughout the process to optimize both the input and the output, the intentional arrangement approach creates space to focus on tasks such as curation, entity resolution, and data modeling. Often the demand for speed in development cycles bypasses critical pre- and post-processing phases, resulting in diminishing returns. This approach has been carried over into the data preparation process for AI systems, in terms of how the library and information space is managing and leveraging AI and LLM tools. To understand linked data at scale, you can visit Wikidata. As you can see here, with linked data, each concept is assigned an HTTP identifier that provides the concept's canonical definition, often doing so with links to other authority records. Modern LLMs can directly leverage these URIs, which is very useful for RAG implementations. As I mentioned before, linked data enables concept discovery by forging connections beyond what is explicitly modeled by humans.
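To make "each concept is assigned an HTTP identifier" concrete: Wikidata entity URIs follow the pattern `http://www.wikidata.org/entity/Q<number>` (the "concept URI" shown on each Wikidata page). The helpers below just build and parse that pattern locally as a sketch; they do not call the live Wikidata API.

```python
# Base of Wikidata's canonical concept URIs.
WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def concept_uri(qid: str) -> str:
    """Turn a Q-identifier (e.g. 'Q42') into its canonical concept URI."""
    if not qid.startswith("Q") or not qid[1:].isdigit():
        raise ValueError(f"not a Wikidata Q-identifier: {qid!r}")
    return WIKIDATA_ENTITY_PREFIX + qid

def qid_from_uri(uri: str) -> str:
    """Extract the Q-identifier from a concept URI."""
    if not uri.startswith(WIKIDATA_ENTITY_PREFIX):
        raise ValueError(f"not a Wikidata entity URI: {uri!r}")
    return uri[len(WIKIDATA_ENTITY_PREFIX):]

uri = concept_uri("Q42")
print(uri)                # http://www.wikidata.org/entity/Q42
print(qid_from_uri(uri))  # Q42
```

In a RAG pipeline, such a URI is the stable key: the retriever can dereference it (or a cached copy) for the canonical definition instead of carrying the definition inline.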
Jessica Talisman [00:14:19]: LLMs can leverage linked data via RAG to enable the discovery of new concepts and relationships. Linked data acts as a reference librarian for concepts defined and managed on the World Wide Web. So this is a snapshot of the linked data ecosystem, and it shows how each of these pages or documents is linked throughout, using HTTP URI-based identifiers. Each of these represents basically a meta page, an ontology-rich page, in which those identifiers are able to structure and represent information. So if we remember, a linked data ecosystem such as the OCLC Linked Data project consists of a network of knowledge graphs; it's not one single knowledge graph. To be successful, aligning to a common core understanding of entity and identity management is key. So to get an idea of the scale, we're going to look at the players in this massive linked data ecosystem.
Jessica Talisman [00:15:18]: There are more than 400 partners assigned to well-defined verticals representing distinct services and workflows. So if you can imagine, over the course of a decade, these partners worked in coordination and concert to deploy, iteratively test, and produce a really robust ecosystem of linked data that was able to bring LLMs on board as tools. You'll see each vertical is a cog in the knowledge wheel. We have traffic partners that span the divide between libraries and the World Wide Web, socializing linked objects to make them widely discoverable for all consumers. We have publishing, which contributes to the greater whole with scholarly publications and research objects. We have libraries representing curated knowledge repositories, and of course education and academia, which model skills and learning frameworks. So it goes without saying that this project would not have been possible without standards and funding. Wikidata has emerged as the most universally visible product connected with this effort.
Jessica Talisman [00:16:25]: Wikidata taxonomies and ontologies are commonly used to create knowledge graphs for RAG implementation, as the barrier for entry is very low and downloads are readily available in several machine readable formats. Many fans do not realize that what makes wiki data so robust and valuable is in fact link data. So because we sort of referred to RAG throughout the presentation so far, you know, I find it very interesting that RAG and link data offer similar benefits and returns. While RAG relies upon quality data connections, Link data's job is to ensure the quality of data connections. They are almost like two halves of a whole. Link data requires that connections are well defined and machine readable, while RAG appreciates semantic structure. In fact, link data works incredibly well with RAG implementations because link data is built for retrieval, precision synthesis, while also supporting concept discovery and entity resolution. A linked data approach streamlines knowledge representation with a simple HTTP URI identifier to enrich an LLM with real world contextual understanding.
Jessica Talisman [00:17:37]: So as we look at what activity has been going on, I came across this Google paper. It was published in September 2024, and I know that there is lots of naming of new techniques, but there is an emerging alternative to RAG named RIG, or Retrieval Interleaved Generation. The paper, Knowing When to Ask: Bridging Large Language Models and Data Commons, presents RIG alongside a new Google offering called DataGemma. DataGemma combines Gemma and Google's linked data repository, which is called Data Commons, yet another linked data repository that's been maintained and open sourced by Google. RIG allows for multiple retrievals at various stages of response generation, addressing all aspects of a complex query. This dynamic approach enables AI systems to recognize knowledge gaps as they generate a response and to fetch data from external sources multiple times during the process. By continuously retrieving and integrating information, RIG reduces the likelihood of inaccuracies in the final output.
Jessica Talisman [00:18:43]: It is this reflection from the outside world, external sources via linked data, where RIG shines. Google's Data Commons is a knowledge graph. So, not to be confused: it is a knowledge graph and linked data repository that organizes the world's public data sets and makes them universally accessible and useful. Data Commons encompasses a large range of statistical data from public sources such as the United Nations, national census bureaus, health ministries, environmental agencies, economic departments, NGOs, and academic institutions. Very similar to OCLC's Linked Data project. Currently this corpus includes more than 250 billion data points and over 2.5 trillion triples from hundreds of global sources. Google's RIG paper focuses on three main hypotheses, all of which were realized and proven true. We can see the parallels between the benefits of RIG for AI systems and the value proposition of linked data.
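The interleaving RIG describes can be sketched as a loop over a draft: the generator emits explicit data requests where it detects a knowledge gap, and the system resolves each one against an external store before finalizing the answer. Everything below is a hypothetical miniature: the [ASK: ...] marker syntax, the FACTS store standing in for an external source like Data Commons, and the pre-written draft standing in for a real model's output.

```python
import re

# Hypothetical fact store standing in for an external linked-data source.
FACTS = {
    "population of France": "68 million (illustrative figure)",
}

# A draft with an inline retrieval marker the generator would emit where
# it detects a knowledge gap; RIG resolves such markers mid-generation.
DRAFT = "France has a population of [ASK: population of France]."

def interleave(draft: str, facts: dict) -> str:
    """Replace each [ASK: ...] marker with the retrieved value,
    falling back to an explicit 'unknown' note instead of guessing."""
    def resolve(m: re.Match) -> str:
        query = m.group(1).strip()
        return facts.get(query, f"<unknown: {query}>")
    return re.sub(r"\[ASK:\s*([^\]]+)\]", resolve, draft)

print(interleave(DRAFT, FACTS))
```

The design point is that numbers come from the external store, not from the model's parameters; if the store has no answer, the gap is surfaced rather than hallucinated.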
Jessica Talisman [00:19:41]: So the first hypothesis external sources versus clarity closed internal sources are valuable for teaching LLMs when and what to ask. Linked Data is an elegant solution for reaching into the external world of knowledge for knowledge modeling to support contextual understanding. The second hypothesis we need to decide which external sources should be queried for the requested information, which is always a challenge and a struggle. Since the set of available sources may be large and dynamic, it is better that this knowledge be external to the LLM. So what's interesting is that RIG handles the complexity and weight of vast amounts of data using linked Data because it's a very economical way to package a lot of machine readable data data into a authoritative link that a machine can actually use, leverage and read. Linked Data delivers such a large amount of contextual semantic data and the packaging is so neat and efficient. It really points towards the economy of managing knowledge at scale. And finally, once we understand what external data is required, the LLM needs to generate one or more queries to fetch that data.
Jessica Talisman [00:20:56]: Different sources produce different kinds of data, and it would be beneficial if the LLM did not need to have specific knowledge about the APIs of various sources and could instead rely on a single API. I mean, that's everyone's dream. In other words, we need a single universal API for external data sources. The nice thing is that Data Commons does provide that single universal API across many different sources while still leveraging linked data. So RIG evangelizes these values and really helps to support universal API adoption by example, using Data Commons. Google's Data Commons linked data provides that persistent identifier, linked to a common data layer, to streamline knowledge management and sharing. We've already witnessed RIG at work with LLM retrievals providing citations, in addition to citations to related objects, and we see this in real time when we even perform a Google search. RIG demonstrates a meaningful shift towards intentional knowledge management practices and the application of linked data. Like two halves of a whole, linked data grounds humans and machines in reality, where all participants are accountable to a common core.
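The "one API over many sources" idea can be sketched as a thin dispatch layer: callers use a single get() and never learn source-specific interfaces. Both sources below are stubs invented for illustration; a real implementation would front actual endpoints such as the Data Commons API.

```python
# Stub fetchers standing in for two very different external sources.
def _from_stats_source(key: str):
    return {"gdp:narnia": 123}.get(key)  # hypothetical statistical source

def _from_library_source(key: str):
    return {"author:melville": "Herman Melville"}.get(key)  # hypothetical catalog

# Registry: source name -> fetch function. Adding a source never
# changes the caller-facing interface.
SOURCES = {
    "stats": _from_stats_source,
    "library": _from_library_source,
}

def get(source: str, key: str):
    """Single entry point; callers need no source-specific API knowledge."""
    try:
        fetch = SOURCES[source]
    except KeyError:
        raise ValueError(f"unknown source: {source!r}")
    return fetch(key)

print(get("stats", "gdp:narnia"))         # 123
print(get("library", "author:melville"))  # Herman Melville
```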
Jessica Talisman [00:22:08]: So linked data architectures represent intentional arrangements in the age of AI. Building upon linked data as a methodology for knowledge management means a return to the first principles established by the library sciences and the World Wide Web. The more organizations and the data industry reflect upon information and knowledge management practices, the more gaps in design decisions become clear. Knowledge management is a practice built upon thousands of years of organizing resources. If it weren't for well-structured linked data from the OCLC project, Data Commons, Wikidata, and DBpedia, LLMs would not be as powerful as they are today, because linked data describes resources. That's the most important thing. It's about the description, definition, and disambiguation of resources. Use linked data.
Jessica Talisman [00:23:02]: Experiment with knowledge expansion, entity resolution and data preparation workflows. Try RAG with ontologies bolstered by HTTP URI identifiers, linking concepts to the outside world to define and ground truths. By prioritizing resource description and resource description workflows, we focus on building towards a common core of knowledge to enrich knowledge repositories for communities of practice. Intentional arrangement is about being invested in people who are ultimately the beneficiaries and curators of information and knowledge retrieval. And here is my information sharing, I think. Yes, perfect.
Adam Becker [00:23:55]: Jessica, thank you very much for this. This was absolutely fascinating. Let's see if we have any questions. I think we have a couple of questions here. Okay. Well, we have first a comment. Amy Hodler is saying I love the cultural technology emphasis. She's part of an AI tech book club and it's surprising how often cultural items come up.
Adam Becker [00:24:16]: And then Ramona is adding to that a question is the recommendation to try and adapt or align our data to these systems or something else?
Jessica Talisman [00:24:24]: I think adapt and align. There's so much goodness when you see really rich use cases that have been successful, like the OCLC project that was built over the course of a little more than a decade. That's a proven use case. That's something that we can learn from. And part of evolving technologies and approaches is finding use cases and evidence of where implementations have been successful, and here is a successful implementation right in front of us. What's really interesting is that OCLC's WorldCat, the worldwide networked catalog, is basically the largest knowledge base and networked knowledge graph system in the world, and they were able to deploy a cataloging system called Entities. And so now library catalogs are moving away from traditional cataloging to these linked data approaches.
Jessica Talisman [00:25:18]: Because as we know, even from this conference, everyone's struggling with managing knowledge at scale. And linked data provides a solution for that.
Adam Becker [00:25:28]: We have a comment here by Jan. I think that might be how you pronounce it. It probably is not only about the way to digest and use the data these times. The way data is shaped and linked is also about getting resilient or being resilient against damage and destruction. Is that a theme that you resonate with?
Jessica Talisman [00:25:45]: Yes. And just to clarify, damage and destruction from degradation, because that happens: knowledge or information gets outdated and degrades over time. So that would be one instance. And when you use linked data, you don't have to worry as much about, for example, an ontology not validating. Broken ontologies or broken knowledge graphs are outdated. That simple link is able to harvest concepts and help bring clarity to machines and people, with new information that maybe humans overlooked in their modeling practices.
Adam Becker [00:26:30]: Do you have suggestions for how to get started integrating this linked data? RIG is one such idea. But as you were speaking, I tried to pull something up; I think I pulled up just Wikidata.
Jessica Talisman [00:26:44]: Yeah.
Adam Becker [00:26:44]: Just to see what I get right. And, and I see that. So this is interesting. I just, I punched in a few different words here and just to see. I got, I just typed in colonialism being one of them just to get the sense because I feel like even could it be that even the linking itself is political. Right. And then probably these people are kind of negotiating this at a relatively high level. It's not just at a statistical level like an AI would do.
Jessica Talisman [00:27:11]: Right.
Adam Becker [00:27:11]: These are. There's actual human beings behind this that are sort of like guarant. They're providing some, some signature of guarantee that they've discussed this in, in some depth.
Jessica Talisman [00:27:21]: But Adam, this is linked data, because you have to supply references, and they have to be authority sources. Different from Wikipedia, Wikidata is the linked data source. So if you look in the left-hand nav on that bar, you will see pages that link there, and the permanent link and concept URI. That's what you leverage. But you can detect the authority of this page in general by expanding the other things that link here, which verify the source. If we go through this, you have all the maps, and a lot of them are missing. But as you go towards the bottom, below the statements, you can also look at other instances of this thing, right? And other connected concepts, which is very helpful for expanding or fleshing out in general. And then all the way towards the bottom, you're going to have representation.
Jessica Talisman [00:28:19]: There we go, the identifiers. So it's representing the National Library, all of the libraries around the world where this is discussed with their HTTP link data identifier. You can grab it right there, you can click and go to the link.
Adam Becker [00:28:33]: Like These ones reference URLs.
Jessica Talisman [00:28:37]: Isn't it wild? And most people miss this. They don't see it and don't realize this is what you actually pull from; it's not Wikipedia. If you want to create those authority-source linked data records, this is what you're going to want to draw from. And you could choose: you could go to this colonialism page and choose one of these other identifiers instead, or go to one of those other pages and choose your sources.
Adam Becker [00:29:01]: So the idea is that you can. I understand how an agent might interact with it, but in terms of an LLM, in terms of a replacement for RAG: is the idea that I find a topic, and then I crawl through its various references, extract all of those, bundle it up in some manner, and then feed it as context?
Jessica Talisman [00:29:26]: You would want it associated. So I've seen very simple spreadsheet-based implementations where you have metadata, sort of like a key-value pair almost, where there's a reference for each. And then the more expanded or mature version of that is in a knowledge graph, where instead we are using ontologies. Or even if you had a JSON-type implementation or representation: instead of defining those things explicitly, you just include the HTTP identifier as a link, and that's it. And that manages the knowledge representation at scale. And so what happens is, as you said, Wikidata or any open knowledge source is going to expand and contract, and maintaining and keeping pace with that is near impossible. It's impossible for us to organize all of the world's knowledge. So this is an efficient way to deal with the expansion and contraction of knowledge at scale.
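The two styles contrasted here, an explicit local definition versus a record that just carries the concept's HTTP identifier, can be sketched as two small JSON records. The field names and the QID below are illustrative assumptions, not a real schema or a verified Wikidata identifier.

```python
import json

# Style 1: the definition lives in your record and must be kept current.
explicit_record = {
    "term": "colonialism",
    "definition": "a locally maintained definition that must be kept up to date",
}

# Style 2: the record defers the definition to the authority source via
# its concept URI. The QID here is illustrative, not verified.
linked_record = {
    "term": "colonialism",
    "concept_uri": "http://www.wikidata.org/entity/Q999999999",
}

print(json.dumps(linked_record, indent=2))
```

As the external source expands and contracts, the linked record stays valid because it carries a pointer rather than a copy.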
Adam Becker [00:30:28]: Jessica, I feel like I could keep you here for many, many hours and I'm going to try to do this offline then. Jessica, thank you so much for coming and sharing this valuable info.
Jessica Talisman [00:30:36]: It was really fun. Thank you.
