arrowspace: Vector Spaces and Graph Wiring
Speakers

With over a decade of experience in software and data engineering across startups and early-stage projects, Lorenzo has recently turned his focus to the AI-assisted movement to automate software and data operations. He has contributed to and founded projects within various open-source communities, including work with Summer of Code, where he focused on the Semantic Web and REST APIs.
A strong enthusiast of Python and Rust, he develops tools centered around LLMs and agentic systems. He is a maintainer of the SmartCore ML library, as well as the creator of Arrowspace and the Topological Transformer.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Meet arrowspace — an open-source library for curating and understanding LLM datasets across the entire lifecycle, from pre-training to inference.
Instead of treating embeddings as static vectors, arrowspace turns them into graphs (“graph wiring”) so you can explore structure, not just similarity. That unlocks smarter RAG search (beyond basic semantic matching), dataset fingerprinting, and deeper insights into how different datasets behave.
You can compare datasets, predict how changes will affect performance, detect drift early, and even safely mix data sources while measuring outcomes.
In short: arrowspace helps you see your data — and make better decisions because of it.
TRANSCRIPT
Lorenzo Moriondo [00:00:00]: Because once you have data evaluation, you have classification, you have search, you have a new set of tools that you can implement with this, right? And that's what graph wiring is about. What do we do with these super nice, super new, cool tools provided by epiplexity to actually supervise, manage, curate datasets for machine learning operations and large language model operations?
Demetrios Brinkmann [00:00:32]: You've been working on a lot over the last years. Can you break down for the listener what you've been sinking your teeth into?
Lorenzo Moriondo [00:00:41]: Yeah, absolutely. Very, very briefly: I've been experimenting a lot with RAGs since March last year, and I ran into some limitations of vector search. Then, as I am quite the rabbit-hole person, like probably a lot of developers and engineers around, I started to dig a little bit into the limitations of vector similarity in general. So that's how I started this project and published this library called arrowspace, which is an attempt to make vector search more accessible and more powerful in some sense, especially in the scope of text embeddings. So vector spaces, but not just vector spaces: vector spaces with highly semantic connections. So text embeddings for language, but also, in general, any kind of vector space that has an inherent component of connectivity between the features in the embeddings.
Lorenzo Moriondo [00:01:54]: And then, yeah, I went through with this and I started writing a few papers. Now I am at the fifth paper since October: I wrote two papers last year and three more now. Especially the last one, I'm very happy with, because I drew a connection between this way of doing vector search, what I call graph wiring (basically the generic application of this particular vector search), and a way of measuring information that was just proposed in January, called epiplexity. That is basically a new way of looking into entropy and complexity. So this journey brought me from vector search, trying to apply all these tools connected to RAGs, to some more interesting, higher-level kinds of concepts. But I'm also trying now to build a set of tools based on these foundations, and I think they will be very, very helpful for everybody doing language model operations, but also machine learning operations.
Demetrios Brinkmann [00:03:10]: Okay, so first of all, amazing name with epiplexity.
Lorenzo Moriondo [00:03:17]: It's not mine. It's a paper from the University of New York and Carnegie Mellon. I took all the theoretical part from them and I just wired in what I've been working on, and the things just aligned perfectly. So I think we were on the same wave somehow; we were just surfing in parallel at some point. And in January, my paper on one of my moonshots, called the topological transformer, and this epiplexity paper came out almost at the same time. And I was like, oh look, this is what I'm doing, and now I have a measure for it, epiplexity. Oh, that's great.
Demetrios Brinkmann [00:03:59]: So what was hard about vector search and what did you do to change it?
Lorenzo Moriondo [00:04:05]: Vector search has been developed over the years as a purely geometrical kind of operation. You have high-dimensional vectors, you compute the distance between two vectors, you use your favorite distance metric, which can be cosine, L2 or any other kind of metric, and you get a score out of it, right? So these two vectors are such-and-such distance apart. Applying this to what I call semantically dense embeddings, some of the information, which I demonstrated is part of the epiplexity framework, basically gets lost in the folding, in the reduction of the vector space into a geometrical space. There is some part of structural information that is encoded, in my implementation, in my algorithm, by the graph described by the relationships between the features. So you have column vectors, you compute the connections between these column vectors in the vector space, and you get this extra that I call topological-spectral information. And yeah, I went through all these experiments and basically found out in the end, as demonstrated in the last three papers I published, which I hope some people are going to double-check so I can get some very good feedback, that some piece of information gets lost during the embedding process, either in dimensionality reduction or in the way the embedding process works, by computing only on the item space and totally forgetting the feature space of the dataset.
Lorenzo Moriondo [00:06:08]: And basically this can be fixed, for these geometrical embeddings, by increasing the number of dimensions of the embeddings. I started from 384-dimensional embeddings, the usual kind of denoising autoencoder, and I ran my tests comparing cosine search on these limited-space embeddings against arrowspace, my library, on the same space, right? And I actually found that there is a way of doing this search better, because there is a chunk of information, what I call topological information, that gets lost. So applying topological search to the same space I can achieve better search, a more semantically meaningful search; there are different scores for that in general. I developed a score for this, called MrRTOP0, which I basically subtitle the topological PageRank; it's a way of measuring these things. So I measured this in the geometric search and in the arrowspace topological search.
Lorenzo Moriondo [00:07:27]: And I found this gap. Recently I've done the same thing with larger embeddings, so exactly the same thing with 1024-dimensional embeddings. And what came out is that, ah look, this piece of information that got lost in the 384-dimensional denoising autoencoder is actually preserved by the newer way of doing embeddings, but with a much higher dimension. So basically what was happening is that I could get the same level, the same quality of search as the 1024-dimensional embeddings using the 384-dimensional embeddings, because arrowspace basically rebuilds, somehow, the information that got lost in the embedding process, through what epiplexity measures as generated information. So I was generating information, in the sense explained by the epiplexity framework: I was regenerating the information that was lost by the embedding process using my algorithm, arrowspace. And the outcome is that arrowspace worked better in topological terms both in the 384- and in the 1024-dimensional space.
Lorenzo Moriondo [00:08:44]: So yes. And it basically allowed a 384-dimensional space to work almost like a space that had three times the number of dimensions. So it was a very "oh" kind of discovery.
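As an illustration of the column-vector graph described above, here is a minimal sketch in plain NumPy/SciPy, not the arrowspace API; the correlation-based affinity, the 0.3 threshold and the function name are assumptions made for the example.

```python
# Minimal sketch of a feature-space graph Laplacian (illustrative only; not the
# arrowspace API). The correlation affinity and the 0.3 threshold are assumptions.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian

def feature_laplacian(X: np.ndarray, threshold: float = 0.3) -> csr_matrix:
    """X: (n_items, n_features). Nodes are features (column vectors); edges
    connect feature columns whose correlation exceeds the threshold."""
    C = np.corrcoef(X, rowvar=False)                   # (n_features, n_features)
    np.fill_diagonal(C, 0.0)
    A = np.where(np.abs(C) >= threshold, np.abs(C), 0.0)
    return csr_matrix(laplacian(A, normed=True))

# Example: 1,000 documents embedded in 384 dimensions.
X = np.random.randn(1000, 384)
L = feature_laplacian(X)
print(L.shape)   # (384, 384): one node per feature, stored as a sparse matrix
```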
Demetrios Brinkmann [00:08:58]: So this is fascinating to me because it's almost like you're getting a bit of a cheat code. You're getting this extra dimensionality for free. What are the downsides of it? Because I imagine there's no free lunch.
Lorenzo Moriondo [00:09:15]: No, absolutely, yeah. I mean, let's take a step back to the general objective. The general objective was: we want a better, more fitting search to build RAG on, right? Because retrieval-augmented generation depends a lot on searching among documents. So the initial hypothesis was that geometric search finds the first three, four, five top-ranked documents very well, but starts to fail very steeply after the fifth document. And that's demonstrated with comparative tests; I wrote different blog posts about this, because that's how it works.
Lorenzo Moriondo [00:09:58]: It just finds the top 3-5 documents very well, but then its performance goes down steeply for the tail of the ranking. So I said, okay, but retrieval-augmented generation needs something with maybe a smoother distribution, right? We don't want peak performance on the first three and then, okay, whatever happens with the rest is fine. No, because RAG maybe has to walk down different pathways to do some kind of reasoning, some chain of thought. So you can start with the top three ranking results and have a great outcome, but at some point you start looping and get stuck in your local minima, because however many queries you run on your documents, you always return the same top 3, 4, 5. So I said there should be a way to make the distribution of the ranking less steep and allow the search to look at the lower ranks, so that if the reasoning got stuck in its local minimum on the top ranks, you can just go and look at the lower ranks.
Lorenzo Moriondo [00:11:14]: Right? So that's exactly what topological search does. There is this slider with which you can modulate how much geometric search and how much topological search you want. So if you want, okay, let's start with pure cosine search: slider at 100%, we look at the top three, top five results. But then maybe you want to open up the reasoning, right? Let's try to look a little bit at the edge of the distribution, a little bit. Let's just look at this tail.
Lorenzo Moriondo [00:11:45]: What happens if I surface less popular, lower-ranking results? So you can adjust this slider and go down to a mix of roughly 40% cosine search and 60% topological search, and you still get very good results, but they somehow give a different point of view, a different kind of approach to the problem, than the top three or top four ranks. Right, I like that. And so you can just tell your LLM, through a simple MCP server: let's adjust my search. I'm stuck in this loop at this point, maybe I should adjust my search. Okay, let's put this slider down. Let's go.
Lorenzo Moriondo [00:12:32]: Let's look more into the topological kind of search, and you just start receiving documents that are still highly related, highly meaningful for the context, but different ones, right? So you can restart your chain of thought and say, okay, this way I reason one way, that way I reason another way. What are the common patterns in these things? And so on.
Demetrios Brinkmann [00:12:56]: It's a way to let the LLM keep exploring on relevant data.
Lorenzo Moriondo [00:13:00]: Exactly. It's like when you do geometric, you basically go depth-first. If you adjust the topology, you start opening the graph and start to see it more breadth-first, right?
Demetrios Brinkmann [00:13:11]: Yeah. Now you mentioned that there's maybe a way to implement this with MCP servers. Have you seen any of the vector stores or vector databases already incorporate these algorithms into it? Would it be at that level of a vector store? Is it something that you add on top of it? Where does it play in the stack?
Lorenzo Moriondo [00:13:32]: It would just be a different track for running search: you can have geometric search and topological search. And no, nobody has implemented that yet. I have some side projects trying to do these things, like running comparative RAG installations, one using traditional geometric vector-space search, one using a mixed, hybrid topological-geometric search, and things like that. So we'll see the results soon enough, I guess. The fact is that geometric search is so well established and so well optimized that obviously it's the first way to go for 90% of use cases. Maybe right up to the point where you need some very good reasoning, or you start having very complex reasoning.
Lorenzo Moriondo [00:14:17]: That's the point we are reaching now with RAGs, right? In the last 20 years or so, and even before, geometric search was perfectly good, and it became highly optimized with things like HNSW. So now we have these hierarchical layers of graphs trying to look into billions of records, right? And it's great, but maybe we need to push these things a little bit more now, because we have more intelligent systems to deal with.
Lorenzo Moriondo [00:14:48]: Right? So that's my point of view.
Demetrios Brinkmann [00:14:52]: Yeah, so if I'm understanding you correctly, you want to use the tried and true algorithms until you saturate them, and then reach for something like this.
Lorenzo Moriondo [00:15:10]: Basically, what I'm trying to do now, in experimental terms, is add topological information to the geometric search, because in the end it's a weighted sum, right? You have an alpha that is the geometric search component and a beta that is the topological search component. So through arrowspace, through the library, you can modulate these until you see that you reach something better. Maybe it's just marginally better, a marginal 5% or 8%, as I demonstrated in my blog posts, but that maybe kicks the RAG out of its local minimum, right? And allows it to climb out of the local minimum it's running around in and maybe reach another local minimum that tells it something meaningful as well, very much related, but different in a way that lets it re-elaborate its chain of thought in some different way.
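A rough sketch of the weighted sum described here, in plain NumPy; the topological scores below are placeholders, and none of these names are the arrowspace interface.

```python
# Sketch of the blend: alpha * geometric score + (1 - alpha) * topological score.
import numpy as np

def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    q = query / np.linalg.norm(query)
    D = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return D @ q

def blended_scores(query, docs, topo_scores, alpha=1.0):
    """alpha=1.0 is pure cosine search; alpha=0.4 is 40% cosine, 60% topological."""
    return alpha * cosine_scores(query, docs) + (1.0 - alpha) * topo_scores

docs = np.random.randn(500, 384)
query = np.random.randn(384)
topo = np.random.rand(500)                   # placeholder topological scores
for alpha in (1.0, 0.4):                     # lower the slider when the loop gets stuck
    top5 = np.argsort(-blended_scores(query, docs, topo, alpha))[:5]
    print(alpha, top5)
```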
Demetrios Brinkmann [00:16:12]: So it's that razor thin edge because you don't want to introduce too much noise and then have irrelevant data come through.
Lorenzo Moriondo [00:16:21]: The fact is, the problem with geometric similarity is that it has more noise if it's run as a pure geometric search, because you just get stuff that is similar geometrically but not in meaning. There is no semantic kind of structure that tells you: this is geometrically very similar, but maybe it's totally out of context. That's why, if you run a geometric search and you ask for maybe the top 50 records, the last 20 are just garbage documents, gibberish documents, because they just happen to be close in the geometrical space but are semantically totally not relevant to what you are looking for, right? And that's what topological search fixes in the first place. Plus, it allows a better context: there is a ratio, the head-tail ratio, in which we measure the semantic distance between the head and the tail of the ranking. With geometric similarity this is always unbalanced.
Lorenzo Moriondo [00:17:40]: The head tells you about something and the tail is about something else. With topological search this is rebalanced, so the head and the tail talk in the same semantic space, because we inject semantic information into the search.
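One way to picture that head-tail comparison, as a sketch under an assumed definition (average pairwise cosine between head and tail of the ranking); the ratio used in the blog posts may be defined differently.

```python
# Illustrative head-tail coherence: how semantically close is the tail of a
# ranking to its head? Values near 1.0 suggest a balanced ranking.
import numpy as np

def head_tail_ratio(ranked_docs: np.ndarray, head: int = 5, tail: int = 20) -> float:
    D = ranked_docs / np.linalg.norm(ranked_docs, axis=1, keepdims=True)
    head_vecs, tail_vecs = D[:head], D[-tail:]
    cross = head_vecs @ tail_vecs.T        # head-vs-tail similarities
    within = head_vecs @ head_vecs.T       # head-vs-head similarities
    return float(cross.mean() / within.mean())

ranked = np.random.randn(50, 384)          # documents already sorted by score
print(head_tail_ratio(ranked))
```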
Demetrios Brinkmann [00:17:57]: Now how does this play into memory? Because I know you've been going down the memory for agents route quite deeply also.
Lorenzo Moriondo [00:18:10]: Yeah, basically at some point this became: okay, maybe we can do search and memory in the same structure, in the same data structure, right? Because we basically generate this graph out of the vector space, and this graph is a sort of permanent memory of the semantic space, because it collects the invariants of this semantic space. So if you have a bunch of legal documents about some particular topic, or a bunch of papers about some kind of philosophical topic, you build the graph Laplacian out of this embedded space of these documents. And what the graph Laplacian represents, and there is good background work on this by all the researchers who work on Laplacian representations for vector spaces, is basically the invariants of this space. So if the vector space is a text embedding, so basically a reduction of all the meaning in a given field of study, the graph Laplacian is automatically a representation of its invariants. So somehow you get some long-term memory there: you have a summary of the summaries of all the documents you have in your vector space, all compressed into a sparse matrix.
Lorenzo Moriondo [00:19:39]: Basically, the sparse matrix is a very interesting structure. I talked about it extensively in my papers and in my blog posts. It has incredible properties and it's heavily used, but in the item space of the vector space: basically, let's build the Laplacian on all the items of this vector space. What arrowspace does is flip the concept by saying, no, we want to look for invariants in the feature space. We want to look for relationships between features. What is the relationship between the color of all these documents and the length of all these documents? The sentiment in all these documents?
Lorenzo Moriondo [00:20:20]: Right? Not just: what is the relation between the color and the sentiment of document A and document B. No, we want to look into the graph of the column vectors, the feature graph.
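To make the "summary of the summaries" reading concrete, a small sketch under an assumption about how one could read invariants off the feature Laplacian (not arrowspace internals): the low-frequency eigenvectors change slowly across the corpus and act as its invariants, while the sparse Laplacian itself is the compact long-term summary.

```python
# Sketch: extract the lowest-frequency eigenpairs of a feature-space Laplacian
# and treat them as the corpus "invariants". Illustrative only.
import numpy as np

def corpus_invariants(L, k: int = 8):
    """L: dense or sparse (n_features, n_features) Laplacian. Returns the k
    smallest eigenvalues and their eigenvectors (ascending order)."""
    L = np.asarray(L.todense()) if hasattr(L, "todense") else np.asarray(L)
    vals, vecs = np.linalg.eigh(L)
    return vals[:k], vecs[:, :k]

# Toy feature graph: random symmetric affinity over 384 features.
A = np.abs(np.random.rand(384, 384))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(axis=1)) - A
vals, vecs = corpus_invariants(L)
print(vals)   # near-zero values mark tightly connected groups of features
```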
Demetrios Brinkmann [00:20:32]: And you're doing it at a document level, not at a chunking level.
Lorenzo Moriondo [00:20:37]: That really depends on how you do your embeddings. If you chunk your documents in your embeddings, you will come out with embedded chunks, right? The example I usually use is the CVE dataset, which is a dataset of reports of common vulnerabilities in software and systems. Usually these are large reports in JSON format, so they are basically text files with a title and a description. I just build my embeddings by passing the entire document.
Lorenzo Moriondo [00:21:07]: But in theory, yeah, if you have books, you can chunk them into paragraphs at embedding time. Then you have to do the reconstruction: say which chunks belong to the same book, which chunks belong to the same paragraph, et cetera. But yeah, this works the same both for documents and for chunks.
Demetrios Brinkmann [00:21:32]: Okay, so sorry I cut you off and derailed you a little bit. You were talking about how it's flipping it on its head, looking at the features as opposed to that. And so how can the features tell you information? That's what I can't make the connection on.
Lorenzo Moriondo [00:21:49]: That's the real semantic metadata that you're looking for. There is an infinite number of definitions for metadata, right? But in the knowledge graph space, for example, which you and your audience know very well, what you have is a graph that goes on top of the existing graph. So you have the graph of the nodes, and then you have the graph of your classes on top of the graph of the nodes, and you call that metadata. It's the relationships between classes of documents, the properties that connect documents in a way defined at the class level, the metadata level, not at the instance, item level. Same thing for a vector space.
Lorenzo Moriondo [00:22:34]: There is a space of the nodes, which is the space of the documents, the item space. But then there is a space that connects the features, because each node has its feature vector. So we say all the nodes of this space are represented by this feature vector: dimension 0 is the color or the length, and there is another value at dimension five that is another characteristic of the vector, and then there is how these features connect together. So you see that you go to a second-order kind of layer, which I call the metadata layer, because that's exactly what it is conceptually, right? And it turns out that there is information there that can be used for search, because you don't only search the node space, you can search the feature space. And that's what mathematicians call topological.
Lorenzo Moriondo [00:23:29]: Because this space somehow gives you the third dimension, the Riemann space of the vector space; basically, this is the space of the second derivative of the vector space. And that's where a lot of cool stuff is, because that's where all this structural information is found. And that's the structural information arrowspace injects into your query to find these new results, results related to your query that are not geometrically close but maybe follow different pathways down this morphology; they're still related, but forgotten by the geometric search.
Demetrios Brinkmann [00:24:10]: So you're kind of blowing my mind here that that metadata has any value.
Lorenzo Moriondo [00:24:18]: Exactly. That's what was forgotten, basically. It was left there like an archaeological relic with no use, right? And I found this last year myself, and it was like, come on, it's not possible that nobody looked into this before. And I said, okay, let's try; I'm actually taking this time to walk down these RAG and vector-space kinds of problems. And it's true, it is there.
Lorenzo Moriondo [00:24:50]: And through the epiplexity framework, I measured it last month. I ended up measuring what was missed by the geometric search, and it's what the epiplexity framework calls structural information. And it's measurable: it's almost 20-30% of the total information. So every time we run a geometric search, we basically lose 15-20% of the information we could have used, information that is there but that we didn't regenerate, because we didn't rebuild the graph Laplacian. And this is mathematically solid as far as I could investigate. And it would be very, very helpful if anybody who looks into this tells me: no, you're not right.
Lorenzo Moriondo [00:25:39]: I would be the happiest person if somebody found some fault in this reasoning. But at this point it looks like it works, because I have run tens of experiments, I have run tests in different settings, again shrinking or increasing the number of dimensions of the embeddings. And I hope to be right.
Demetrios Brinkmann [00:26:03]: So that's where you're getting these dense vectors for free, basically.
Lorenzo Moriondo [00:26:10]: Yeah, we are getting information out of the same dense vector space because we weren't looking in the feature space before.
Demetrios Brinkmann [00:26:16]: Yeah. Okay. I'm starting to wrap my head around it. And then the epiplexity. You should probably break down what that paper is and what your paper was, because I know you're referencing it quite a bit, but I'm not super clear on exactly what it was.
Lorenzo Moriondo [00:26:33]: I think nobody is, because it's such a new thing. I guess they have only one citation for now in Google Scholar, because it is such a new thing. So I cannot really explain it fully; maybe you can actually invite the people that wrote the paper as guests. But my basic understanding is that what Shannon information measures is what they call random entropy. They say, look, in general terms, if the universe were a vector space, you would witness Shannon entropy. But anybody that computes something doesn't look at the entire universe; they look only at the problems that are within their computing capacity.
Lorenzo Moriondo [00:27:27]: So if I can compute up to a certain level, for up to a certain time, with a certain amount of computing power, I can compute this set of algorithms, right? It is true that there are infinite algorithms out there and they are all under the law of entropy. But if I limit the investigation to only the algorithms we can compute, and that is related to what Wolfram calls computational boundedness, I guess, then we can actually measure random entropy, but also something else called structural information. And epiplexity is basically both of these things: instead of just looking at random entropy, by bounding the possible computation to the observer, to the person that runs the computation and to their computational power, we can also highlight and measure this other thing called structural information. And in this framework, the graph Laplacian in my arrowspace algorithm would be the structural information part, while geometric search deals only with this very wide, universal kind of construct, the geometric space of the vectors.
Demetrios Brinkmann [00:29:03]: Okay, so I got this far in understanding it: epiplexity is constraining the space that you work with.
Lorenzo Moriondo [00:29:13]: Exactly. Because in contemporary physics, you know something only because you go and observe it, right? So everything you see is limited by your detector: your eyes, or your hadron collider. So everything is dependent on the observer. Epiplexity does the same thing for information. It's more like what relativistic physics does, right? We don't look at the entire universe, we look at a frame of reference. We have two bodies, they move at relative speeds to each other, and what happens depends on which observer you pick and how fast this observer is going in relation to the speed of light. Epiplexity applies the same thing to information, saying: look, you cannot compute the entropy of the universe. Well, you can, but it may not be a rule that applies to every observer in the universe, because there are observers that go faster and observers that go slower, right? Same thing here: there are observers that have a vast capacity for computing and others that are limited in terms of computing. So for one observer a problem might be uncomputable, while for another observer it might be computable, because it has access to more computing power, or maybe it can run the algorithm for more time because it lives longer, or for whatever other reason, right? So it actually tries to bound every algorithm inside what is called a model, I don't want to get too much into detail, a random-walk model that is bound to the observer.
Lorenzo Moriondo [00:31:05]: So you do this operation: you take the algorithm, you encapsulate this algorithm inside a random walk that is observer-bounded, and then you compute the structural information that is generated by this bounded class between the observer and the algorithm. It's greatly interesting to me, so I hope it is for everybody.
Demetrios Brinkmann [00:31:27]: When you bound it, then you're able to get a score on the...
Lorenzo Moriondo [00:31:32]: If you want, we can actually take a look at a Jupyter notebook that I wrote exactly to explain this point in my last paper. If you want, if we have time.
Demetrios Brinkmann [00:31:43]: Yeah, I would love that.
Lorenzo Moriondo [00:31:44]: Okay, I will share it, no problem. So basically this is the paper, and everybody can go take a look at it. But if we scroll here on the left, can you see it? Yes. Yep. There is the notebooks directory, and there is this 00 notebook. In this notebook I basically try to explain to myself what epiplexity is and how it relates to my research; it is the subject of my latest paper, as you see: arrowspace, feature space, graph Laplacian, structural information. So the question was: is it true that the arrowspace graph Laplacian encapsulates structural information as described by epiplexity? That's the main question.
Lorenzo Moriondo [00:32:33]: And if we go through all the steps of this notebook, we will find that, yes, it does. But very briefly, to see what epiplexity measures, we can just take a look at the first two or three steps. Basically, what it does first is compute the minimum-length program to compute the graph Laplacian from the vector space. It's a measure of the minimal set of bits that transforms a vector space into a graph Laplacian; basically, it's a Kolmogorov-style measure of complexity. So: what is the minimal set of bits that maps my vector space to the graph Laplacian? That's the first part of computing epiplexity. Then the second one.
Lorenzo Moriondo [00:33:27]: Exactly. And it's here: the arrowspace pipeline as a prefix-free program, right? You want to compute the length of the arrowspace program. There is a mathematical construct that does this: what is the minimal number of bits to describe this program, basically. And that's the first step. This is all described in the epiplexity paper published January 6; you can go and look into that.
Lorenzo Moriondo [00:33:56]: The second step is the wrapping I was telling you about before, because we have to turn the Laplacian of this vector space into a Laplacian-constrained Gaussian Markov random field. I have problems with the acronym myself, but I tried to go through these things line by line and understand them. So we do this: we basically encapsulate this system between the observer and the algorithm inside this shell of a Markov chain model, right?
Demetrios Brinkmann [00:34:30]: Yep.
Lorenzo Moriondo [00:34:31]: And in the end we run tests on how much these things decompose and compress the original space, because basically, from the paper, you can extract three tests that tell you your epiplexity measure is correct. Here they are: one is the compression test, one is the spectral gap test, one is the downstream lift test. If all three tests pass, your algorithm can be measured in terms of epiplexity.
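For orientation only, here is what such checks could look like as rough proxies; these are stand-ins sketched by analogy, and the real compression, spectral gap and downstream lift tests are the ones defined in the epiplexity paper and in the notebook.

```python
# Rough, assumed proxies for the three checks named above. Not the paper's definitions.
import zlib
import numpy as np

def compression_test(L: np.ndarray, X: np.ndarray) -> bool:
    """Does the Laplacian-side description compress better than the raw vectors?"""
    raw = zlib.compress(X.astype(np.float32).tobytes())
    lap = zlib.compress(L.astype(np.float32).tobytes())
    return len(lap) < len(raw)

def spectral_gap_test(L: np.ndarray, min_gap: float = 1e-3) -> bool:
    """Is there a clear gap above the smallest eigenvalue, i.e. real structure?"""
    vals = np.sort(np.linalg.eigvalsh(L))
    return (vals[1] - vals[0]) > min_gap

def downstream_lift_test(score_with_topo: float, score_geometric_only: float) -> bool:
    """Does a downstream retrieval metric (e.g. MRR) improve once structure is used?"""
    return score_with_topo > score_geometric_only
```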
Demetrios Brinkmann [00:35:08]: Just so I'm clear, it's giving you more information on the randomness that you're getting from arrowspace?
Lorenzo Moriondo [00:35:18]: It's totally different from entropy.
Demetrios Brinkmann [00:35:20]: Okay.
Lorenzo Moriondo [00:35:21]: Epiplexity basically adapts traditional information theory to what we discovered with machine learning and neural networks. Traditional information theory told you that you cannot extract more information than is in the data, because it's just not there, right? It tells you that whatever you do with the data, you will get entropy, you will lose information. But we demonstrate that it's not true, because we now have algorithms that do generate information out of the vectors, out of data in general, by generating new structures.
Demetrios Brinkmann [00:35:56]: And this comes back to the whole idea of hey, if we look at the metadata and we find trends in that, that actually gives us a denser vector.
Lorenzo Moriondo [00:36:05]: Exactly. We get information that is not in the vector space. So epiplexity basically demonstrates that we can generate additional information from existing information. It's not that for every dataset, whatever you do with it, you're going to lose the information you have: you compress it, you lose information; you decompose it, you lose information. It says: no, look, now we have algorithms where, if you take an observer bounded to the algorithm, you can measure that these things generate information. It's not just thermal loss; it's not just randomness taking over.
Lorenzo Moriondo [00:36:49]: To me this is connected to what are called inverse problems, right? When you have a noisy image and you want to denoise it, that's an inverse problem. You are generating information that is not there by looking at the relations of the pixels; well, not generating exactly, but you can extract information that is hidden there by using algorithms. That's the same principle: whenever you do denoising, you generate information out of something that is not supposed to have that information. When you do super-resolution of images, for example, you are doing the same thing: you go from 720p to a higher resolution by applying super-resolution to the image. And that's exactly what epiplexity measures: how much information is generated by the algorithm, not how much information is consumed by the algorithm.
Lorenzo Moriondo [00:37:48]: And they demonstrated it mathematically. I mean, it's obviously a very young framework, so it's not established yet and it needs to be double-checked and tested, and I guess this is the first algorithm that tests itself against epiplexity. So arrowspace is the first algorithm on top of which we computed information generation, the amount of information generated by the computational process. And that's something where I said, oh wow, I connected some dots. That's the very latest thing. But if you look back at my previous papers, there is the stepwise stairway that brings you from a simple search-in-a-vector-space kind of algorithm to a more general algorithm. Because at this point, as it says in the abstract, the graph Laplacian applied by arrowspace in the feature space is generic enough to provide good approximations or good results for search, classification, anomaly detection, diffusion, dimensionality reduction and data evaluation. All of this.
Lorenzo Moriondo [00:39:18]: Obviously my idea is: okay, this is super cool, we need to use this for LLMs and machine learning operations, right? Because once you have data evaluation, you have classification, you have search, you have a new set of tools that you can implement with this. And that's what graph wiring is about: what do we do with these super nice, super new, cool tools provided by epiplexity to actually supervise, manage, curate datasets for machine learning operations and large language model operations? And you get some answers about this.
Lorenzo Moriondo [00:39:58]: But the paper that deals with applications, machine learning applications and large language models, is the previous one, which talks very extensively about how these tools can be used in AIOps or MLOps pipelines.
Demetrios Brinkmann [00:40:15]: Well, it does feel like you're doing this with datasets after the model has been created. Have you also thought about trying to go for the datasets that the models are being created on?
Lorenzo Moriondo [00:40:30]: You mean the embeddings model?
Demetrios Brinkmann [00:40:32]: Yeah, so the training data.
Lorenzo Moriondo [00:40:35]: Yeah, exactly. There is a big question mark over whatever we do with LLMs currently, because everything is connected to how we do embeddings. That's why it's very important, and there are now teams and teams of engineers working only on how we move from raw text to embeddings. It's a field of its own; we have models with 4 billion parameters doing only this at the moment. So that's one thing.
Lorenzo Moriondo [00:41:04]: If instead we go to the numerical side and say, let's look into pure numerical data, machine learning data, regressions, decision trees and all this stuff, what I found out, or my intuition, is that it really, really matters how we do feature engineering. Because let's say we have this massive amount of raw data coming from the Large Hadron Collider, or whatever other big machinery for physics, or from any other kind of measurement, right? Obviously you don't run your machine learning models on the totality of this data; we are talking about thousands of terabytes. So what you do is feature engineering: you run some models to reduce your data into something manageable, workable. The same thing happens with satellite data, right? The image you see on your screen from the satellite is just a model of all the raw data that the satellite collects and pushes down to Earth. So it's really important how the people at NASA, at ESA or wherever else design these algorithms to actually make the data usable. My intuition is that if we embed the feature relations in some way, or in a better way than we do now, while we do feature engineering for this data, we may have more powerful topological search downstream. So your question is really relevant, because it really matters how you treat the data upstream. And with arrowspace you can indeed measure how much structural information your feature engineering generates. So you can compare: if I generate this dataset using this model, it produces downstream zero-point-something epiplexity, right? Structural information.
Lorenzo Moriondo [00:43:09]: What happens if, with the same data, I use a different feature-engineering model and compare? Oh look, this one generates 1% more structural information. So you see that these things can work upstream and downstream. Because my idea in the beginning started from: okay, let's apply this to performance, large language model performance, right? We can take the latent space and compute the graph Laplacian on the latent space. That's what at Anthropic they call, I guess, mechanistic interpretability: they go and investigate, analyze the latent space to find where the best tokens are generated, in which subspace. That's exactly what you can do with the graph Laplacian; that's downstream.
Lorenzo Moriondo [00:43:59]: But upstream you can do what we were talking about before: let's measure which model does the best feature engineering. So you see that this is quite an effective point of view in terms of what we can do with the data: how good is the data, how will this dataset behave if I add something, what will this dataset look like in six months from now, and things like that.
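A sketch of that upstream comparison; the spectral-entropy proxy below is an assumption standing in for the actual structural-information estimator, and the pipeline names are hypothetical.

```python
# Compare two feature-engineering pipelines by how much non-random structure
# their feature graphs carry. Spectral entropy is an illustrative proxy only.
import numpy as np

def spectral_entropy(X: np.ndarray, threshold: float = 0.3) -> float:
    """Lower entropy of the feature-Laplacian spectrum ~ more structure."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(C, 0.0)
    A = np.where(C >= threshold, C, 0.0)
    L = np.diag(A.sum(axis=1)) - A
    vals = np.clip(np.linalg.eigvalsh(L), 0.0, None)
    if vals.sum() == 0:
        return 0.0
    p = vals / vals.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# features_a = pipeline_a(raw_data); features_b = pipeline_b(raw_data)  # hypothetical
# print(spectral_entropy(features_a), spectral_entropy(features_b))
```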
Demetrios Brinkmann [00:44:28]: Coming back around, I'm not sure I fully understood how this connects to memory with agents.
Lorenzo Moriondo [00:44:40]: Okay. The graph Laplacian is a permanent memory in the sense that it holds the invariants of your context. Basically, it tells you which pathways are possible from one feature to another. Mathematicians describe this as a three-dimensional space. So, for example, if you have outliers in your feature space, these outliers will be denoted with very high energy; if you have very connected features, these features are denoted with very low energy. And the graph Laplacian basically describes all the paths that you can take from a very connected feature to a loosely connected feature.
Lorenzo Moriondo [00:45:32]: So it basically constrains the way you can do reasoning inside the feature space of the vector space.
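The "energy" picture has a standard reading in graph terms, the Laplacian quadratic form, which is assumed here as the measure being described: a smooth signal over well-connected features has low energy, an outlier-like one has high energy.

```python
# Dirichlet energy x^T L x of a signal x over a small feature graph.
import numpy as np

def laplacian_energy(L: np.ndarray, x: np.ndarray) -> float:
    return float(x @ L @ x)

# Path graph over 4 features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
smooth = np.ones(4)                          # constant over connected features
spiky = np.array([1.0, -1.0, 1.0, -1.0])     # outlier-like oscillation
print(laplacian_energy(L, smooth), laplacian_energy(L, spiky))   # 0.0 vs 12.0
```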
Demetrios Brinkmann [00:45:44]: Okay, but we're not talking about agents remembering in the way that, oh, the agent understands that you only like vegetarian food, or...
Lorenzo Moriondo [00:45:56]: No, it's not about remembering those, the scenes or whatever they're called; they have different names for that. It's not about that. It's a different kind of memory, more like a permanent, long-term memory of a given context. But it can also be applied to those, because if you have a list of scenes that your RAG remembers, you can do the same thing, because that's a vector space itself.
Lorenzo Moriondo [00:46:22]: Right. So you can have embeddings of this context memory, I don't remember what they call it, and then do the same operation on that context. So you can actually build memories of your context. But that's a different kind of memory indeed; I think they call that transient memory or something like that. What the graph Laplacian describes or defines is the long-term, permanent relationships inside a context of documents, which, if you know them, is a memory.
Lorenzo Moriondo [00:46:58]: Right, because that's exactly what you want to remember if you are an expert in that field. Somehow you want to remember the invariants, and then connect your invariants to your application. Yeah, or at least that's how I see the process.
Demetrios Brinkmann [00:47:16]: Can you talk to me for a minute about this idea that you follow of discovery-driven development? Because I think that's also pretty fascinating for understanding a little bit more about how you work and how you go through and test some of these ideas.
Lorenzo Moriondo [00:47:35]: Absolutely, yes. Thanks for the question, because going through this, I also understood a lot about what I'm really doing here, what my method is for doing this thing. I guess we have talked about the content, and I hope the content is interesting to people. But my method is mostly based on intuition, heavily based on intuition. That works great with large language models, because that's what large language models lack: they don't have an understanding of the world that allows them to be intuitive, because they are constrained by their way of seeing the world.
Lorenzo Moriondo [00:48:18]: They only see the world through language, through text. So they cannot build the level of intuition that humans can, because we have so many more senses. So what I do is basically try to leverage this very intuitive part of what I do, and I try to inject this intuition into multi-agent kinds of research processes. So I work with different LLMs in parallel, and I try to collect the things they do and inject my intuition to correct them when I think they're going a little bit out of the scope of what I'm doing.
Lorenzo Moriondo [00:49:01]: Right. And this connects greatly with what happens in the meanwhile in scientific research, because in the meanwhile you have all these new papers published at a rate we have never seen before. So you cannot really work in your own locked-down kind of understanding of things. You always need to update this understanding; it's Bayesian, if you want, but it also means listening to what happens outside. So actually, when I found the paper that connects, I instantly tried to build a new intuition of what I'm doing based on this new paper. This happened with the epiplexity paper, and it also happened with another paper that had a great impact on me. But that's another really long talk.
Lorenzo Moriondo [00:49:59]: I mean, I think maybe you don't have the time to start on that, but yeah, that's the process, right? You have your own Bayesian process that goes on with your multi-agent things, but sometimes something from outside happens that is very meaningful to what you're doing, and you cannot just keep going without including what has happened in the meanwhile. So I said, okay, I built arrowspace, I built what I call this super crazy experiment, the topological transformer, a transformer architecture that works on spectral indexing, on topological search instead of geometric search. That is kind of my moonshot; it's something I'm not following at the moment, but it's an experiment that I did. And I said, okay, what happens if I look at arrowspace in the framework of epiplexity? And all this new stuff came out, right?
Lorenzo Moriondo [00:50:52]: So you can see how the method, being so open to the outside, putting things into question, taking questions from things that come from outside what you're doing, is fundamental, because it allows you to stay in touch with what's happening outside and to improve very much what you're doing. It's a kind of science-driven development, in this sense.
Demetrios Brinkmann [00:51:22]: And how do you go about updating these frames of reference? Because I'm assuming you're constantly reading new papers, you're constantly trying to build new intuition, but it sounds like it's only occasionally that something will hit.
Lorenzo Moriondo [00:51:40]: I have GitHub repos that are only text files, text that I've built using large language models. I say, okay, this is good, I put it in a text file inside this GitHub repository. I have a file system that basically builds up day by day on top of this thing. I have a bunch of text files that I never opened again, but they're still there, and then there are those three or four text files per directory that instead branched out into something else. So I could go through the entire history of what I've done, because I have the entire file system of what I've done since March last year.
Demetrios Brinkmann [00:52:24]: Wow.
Lorenzo Moriondo [00:52:24]: And this contains the prompts and the answers. So in theory, if I wanted to do meta-research on what I'm doing, something I have no time for at the moment because of all the other things I'm trying to carry on, I could actually look at which prompt worked best, or which answer was the most impactful on what I did in the months after, and all these kinds of things. But it's just basic prompt engineering, prompt architecting.
Lorenzo Moriondo [00:52:56]: I guess it's nothing very fancy, nothing super fun.
Demetrios Brinkmann [00:53:00]: There was one spicy question that I wanted to ask you. You've inevitably tried to play with vector search versus just giving an agent a tool, especially a coding agent, just being like, hey, grep, and seeing what the differences are. How do you compare those two? When do you think to reach for one versus the other, if at all?
Lorenzo Moriondo [00:53:30]: Yeah. Basically, what grep does is what is called lexical search, right? That is also what BM25 does: it basically counts how many times the word, or the stem of the word, is present. That's called lexical search, and that is part of geometric search. Now you have this whole wave of hybrid search, in the sense that they mix geometric search with lexical search, right? But what you're doing is still geometric search, because of how you define your context.
Lorenzo Moriondo [00:54:03]: In geometric search, you define your context by distance among vectors. In lexical search, you do more statistical kinds of things: you count the statistical characteristics of the words in the context. That is a pretty good way of establishing a context, because the kinds of words, the classes of words, the families of words you have in a text are its context; that's somehow the definition of context. The problem is how you measure these distances. Okay, this word appears 50 times in 1000 words, right? Does that mean this word defines this context? So you say, okay, let's add some geometric search, some cosine or L2 distance search, to this.
Lorenzo Moriondo [00:54:58]: And with these two things you can better tell which context a document belongs to. But then you go back to the original problem of running vector search, geometric search, on the vectors you build, right? So you go back to losing all the semantic part that the feature analysis brings in. So yeah, that's lexical search, lexical analysis.
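For reference, the hybrid idea reads roughly like this in code; a toy term-frequency scorer stands in for BM25, the 50/50 weight is an arbitrary assumption, and this is not any particular library's API.

```python
# Hybrid scoring: blend a lexical (term-count) signal with a geometric (cosine) one.
import numpy as np

def lexical_scores(query: str, docs: list[str]) -> np.ndarray:
    q_terms = query.lower().split()
    return np.array([sum(d.lower().split().count(t) for t in q_terms) for d in docs],
                    dtype=float)

def hybrid_scores(lex: np.ndarray, geo: np.ndarray, w_lex: float = 0.5) -> np.ndarray:
    norm = lambda s: s / s.max() if s.max() > 0 else s
    return w_lex * norm(lex) + (1.0 - w_lex) * norm(geo)

docs = ["the cat sat on the mat",
        "graph laplacian of the feature space",
        "cats and graphs"]
geo = np.array([0.2, 0.9, 0.5])      # cosine scores from some embedding model
print(hybrid_scores(lexical_scores("graph laplacian", docs), geo))
```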
Demetrios Brinkmann [00:55:30]: Lorenzo, this has been great, man. I appreciate you coming on here.
Lorenzo Moriondo [00:55:34]: Thanks a lot, Demetrios. Thanks a lot for what you do in the community. It's great to have people that can work so well at building up communities and products. It has been great talking to you.
