Future Proofing of Agentic AI Systems // Rekha Singhal
SPEAKER

Rekha Singhal heads research at TCS Paceport, NY, USA, where she leads the Computing Systems Research area. She is an ACM Senior Member. She contextualizes the future of business into research problems and collaborates with TCS Research & Innovation, academia, technology vendors, and startups to design solutions. Her research focuses on accelerating the development and deployment of modern enterprise AI applications in heterogeneous multi-cloud environments, and is a core component of TCS products such as Optiunique and HPC-A3.
Her research interests lie at the intersection of next-generation AI (GenAI) and the future of computing: heterogeneous architectures for accelerating AI pipelines, learned systems, high-performance data analytics systems, big data performance analysis, query optimization, storage area networks, and distributed systems. She has been associated with SPEC RG Big Data and MLPerf. She has filed 70+ patents, several of which have been granted in international territories, and has 100+ publications in international and national conferences, workshops, and journals. She has also contributed to TCS Reimagination Research books.
She has served on the program committees of conferences such as ICPE, IEEE Big Data, aiDM (SIGMOD), and TPCTC (VLDB), and for journals such as PNAS, Distributed Systems, CACM, and IEEE Transactions on IoT. She has served on the organizing committees of COMSNETS 2024 and AIML 2023.
She led a project on a Disaster Recovery appliance that was runner-up for a NASSCOM award. She received her M.Tech. and Ph.D. in Computer Science from IIT Delhi, and spent a year as a visiting researcher at Stanford University, United States.
SUMMARY
Traditionally, an enterprise needs to undergo a transformation of its business processes or IT systems when an internal disruptor (such as a merger or a new service) or an external one (e.g., a new technology like GenAI, or a pandemic) arrives. This leads to a long catch-up time for the enterprise. We can design our applications to be resilient to these disruptions using adaptable IT systems, by leveraging future advancements in computing, and by scaling toward zero downtime. Today's AI-based applications may involve traditional computing, SQL processing, deep learning, and GenAI model inference, so they are heterogeneous in their demands for compute and memory bandwidth. Some applications, like molecular simulation and portfolio optimization, are intractable and not solvable using traditional compute. Moreover, the workloads on these applications keep varying throughout their life cycle. On the deployment side, the range of computing accelerators today extends beyond traditional general-purpose computing to low-power AI-specialized hardware such as Inferentia, Graphcore, SambaNova, and Cerebras, many of which are accessible on public and private clouds, and to special hardware like quantum computers for intractable problems. Further, due to the end of Moore's law and the power hunger of AI models, there is a shift from silicon-based computing to physics-based computing for AI applications, including photonic, neuromorphic, analog, and DNA computing paradigms. Achieving the optimal balance of latency, throughput, cost, and energy requires a large design-space exploration across hardware architectures. This motivates us to build intelligent middleware for Pareto-optimal mapping of application components to a heterogeneous hardware ecosystem. This asset can recommend high-performance, low-cost, low-energy deployment options for enterprise applications.
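As a minimal sketch of the kind of search such middleware performs, the snippet below enumerates a tiny two-component design space and keeps only the Pareto-optimal plans. The component names, hardware options, and cost/latency numbers are illustrative assumptions for the sketch, not measured results.

```python
from itertools import product

# Hypothetical per-component deployment options: (hardware, $/hr, latency in ms).
# All names and numbers are illustrative placeholders, not benchmarks.
OPTIONS = {
    "embedding_model": [("A100", 3.0, 12), ("Inferentia", 1.1, 30), ("T4", 0.5, 55)],
    "llm_inference":   [("H100", 8.0, 90), ("A100", 3.0, 160), ("Groq", 2.4, 40)],
}

def enumerate_plans():
    """Cross product of per-component choices -> (plan, total cost, total latency)."""
    for combo in product(*OPTIONS.values()):
        cost = sum(c for _, c, _ in combo)
        latency = sum(l for _, _, l in combo)
        yield combo, cost, latency

def pareto_front(plans):
    """Keep only plans not dominated on (cost, latency); lower is better for both."""
    return [p for p in plans
            if not any(q[1] <= p[1] and q[2] <= p[2] and (q[1], q[2]) != (p[1], p[2])
                       for q in plans)]

plans = list(enumerate_plans())
for combo, cost, lat in pareto_front(plans):
    print([hw for hw, _, _ in combo], f"${cost:.2f}/hr, {lat} ms")
```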
TRANSCRIPT
Rekha Singhal [00:00:06]: Good morning, good afternoon, good evening, everyone, from whichever geography you are joining, and thank you for being here. My name is Rekha Singhal. I head research and innovation for TCS Paceport. TCS is Tata Consultancy Services, for people who are not aware of it. Tata Consultancy Services is an IT services company, and we do system integration for our customers. Our customers span all the domains: manufacturing, retail, whatever you may think of. What I'm going to talk about today is one of the practical problems we face with our customers. Most of our engagements are transformations for our customers.
Rekha Singhal [00:00:51]: So for any change in their businesses, technology, or ecosystem, we get into a transformation exercise, which is time consuming and money intensive. What we are envisioning for our customers is: can we build perpetually adaptable enterprise systems? We can think of it today thanks to LLMs and GenAI. What I'm going to talk about today is what we are doing internally on future proofing for agentic AI systems. As you heard in the conversation between Adam and me, which I was not actually aware all of you were listening to, he shared that things are so dynamic, changing much more rapidly than they used to in the past. So the need for future proofing is much more imperative than it used to be, because things are changing very, very rapidly. We need to be agile and resilient. That's what I'm going to cover today. Basically, I'll give you a little flavor of what I really mean by future proofing of agentic AI and what we are doing internally, so that we can leverage it in our customer engagements.
Rekha Singhal [00:02:00]: As I was indicating, I keep saying enterprises because agentic AI is a component of them. I will touch upon agentic AI, but I'm starting with the bigger thing, the enterprise itself. We have external factors: at the enterprise level, GenAI is the new technology, and people are shifting. Even at the cloud-provider level, they may have been hosting their application on AWS and are now migrating to Azure, maybe to get earlier access to ChatGPT and the OpenAI APIs and things like that. Or you may have an environmental factor like a pandemic; some external factor will disrupt your business in some sense. Or you will have your internal disruptors, maybe your own mergers or new services. At the end of the day, you get into a transformation exercise, which is time intensive and money intensive, as I mentioned. So you want your enterprise to be agile and resilient. Agile means you should be adaptable to short-term disruptions, and you should also be adaptable to long-term disruptions. So you have to predict what may happen in the future, or at least keep an eye on the future, and make your system resilient.
Rekha Singhal [00:03:13]: And all of this applies to agentic AI systems as well. Although I'm referring to the enterprise system, at the enterprise level we see that it's doable to make them future proof, because we now have LLMs and we can use multi-agent systems. Another part I'm going to touch upon today is the future of compute; these go hand in hand, as all of you are aware. And if you look at a recent Gartner report, they also talk about how we are evolving from the sequential agent-based computing we were doing in early 2025 toward an Internet of agents. One example is the NANDA project from MIT, if you're aware of it; I'm closely involved in it.
Rekha Singhal [00:03:55]: There we're looking at: if you have an Internet of agents, where each one of us has our own agent, how will the agent ecosystem and the infrastructure evolve? And once we have that, future proofing your enterprise system will be much more feasible and doable. Right? So that's where we are moving in agentic systems as well. Basically, because of agentic AI I can do the future proofing, but I have to do the future proofing of agentic AI as well. So, a couple of points; it's not an exhaustive list, just to stimulate your mind. When I say future proofing of any system, be it agentic AI or an enterprise, resilience means you should be adaptable to changes in application functionality and in compliance and security aspects. You should be adaptable, or resilient, to changes in workload.
Rekha Singhal [00:04:41]: Today you may design your system assuming a certain distribution for your user requests, let's suppose in your RAG system or in your voice agent system. But if the distribution changes, does your deployment change, and how do you really adapt to it? Or take new computing paradigms: today we see that because of AI, new kinds of hardware keep cropping up, so many accelerators. Today we are seeing mostly silicon-based accelerators; tomorrow you may see nature-inspired computing accelerators, like photonic computing, neuromorphic computing, or maybe DNA computing. And if those become here-and-now, how do you really adapt to them? Or some new technology like GenAI.
Rekha Singhal [00:05:20]: And then, can you really optimize for cost, performance, and power? That is very much the case, especially for AI workloads, because they're all very power hungry and very compute hungry. If you're building agent-based systems that are reflective in nature, your cost may go beyond control. How do you really look into those aspects, and so on and so forth? So you need future proofing for multiple reasons; I've articulated a few of them. Basically, I'm bucketing this into many components, but today my focus will be on the self-adaptable IT system.
Rekha Singhal [00:05:54]: Because this is a large topic, I can't cover everything in 20 minutes, so I'm focusing a little on the self-adaptable IT system. You can think of a self-adaptable IT system as building a digital twin of an enterprise's IT system. When you say digital twin, you should have some kind of sensing and monitoring tooling. You should have predicting and simulating capability within your framework. You should be able to generate the whole design space so that you can look into what the right deployment is for you, where again you may have fragmented resources and heterogeneity. I'll touch upon a little bit of that.
Rekha Singhal [00:06:31]: I'm giving you the overall view first. Then you have to optimize for an objective: is it your CIO's, or is it an end user's? Because the CIO may be more interested in increasing utilization and reducing cost, while your end user is interested in their own performance in terms of latency and throughput. So we have to have some kind of Pareto curve, and decide how to choose a point in that design space. And then finally, actuating and learning: how do you actually actuate the whole deployment, taking care of interoperability standards, and how do you learn once you deploy it? The real behavior may not match the prediction, and whatever you learn, you take back into the framework. So that's what a digital twin looks like.
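Below is a toy, runnable sketch of that sense-predict-explore-actuate-learn loop. Every piece of it, the telemetry, the forecast, the latency formula, the plan catalog, is a deliberately simple placeholder standing in for the framework described here, not the actual implementation.

```python
import random
import time

# Toy plan catalog: (name, cost in $/hr, base latency in ms). Illustrative only.
PLANS = [("T4-small", 0.5, 120), ("A100-medium", 3.0, 45), ("H100-large", 8.0, 20)]

def sense():
    """Stand-in for real monitoring: pretend we measured the request rate."""
    return {"req_per_s": random.randint(50, 1000)}

def forecast(telemetry):
    """Naive 'prediction': assume the current load persists over the horizon."""
    return telemetry["req_per_s"]

def choose_plan(load, latency_slo_ms=100):
    """Pick the cheapest plan whose (toy) latency model still meets the SLO."""
    scored = [(name, cost, base * (1 + load / 1000)) for name, cost, base in PLANS]
    feasible = [p for p in scored if p[2] <= latency_slo_ms]
    # Fall back to the biggest machine if nothing meets the SLO.
    return min(feasible, key=lambda p: p[1]) if feasible else max(scored, key=lambda p: p[1])

for _ in range(3):  # three iterations of the adaptation loop
    load = forecast(sense())
    name, cost, est_ms = choose_plan(load)
    print(f"load={load:4d} req/s -> deploy {name} (${cost}/hr, ~{est_ms:.0f} ms)")
    time.sleep(0.1)
```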
Rekha Singhal [00:07:13]: Right? Some of you may be more comfortable with that term, so I wanted to introduce it. We are working towards it, and in this talk I will cover only the heterogeneity part; I won't cover the other aspects. It's too much to cover, even within the digital twin of an IT system. So let me double-click on heterogeneity. I am talking about heterogeneity from the compute perspective. Last year Gartner came out with one of their future technology trends, which they publish every year, right? They talked about the need to have a hybrid computing orchestration tier, because you have a variety of computing environments today, be it quantum or photonic. They may mature in a later horizon, but they seem to be making some promise to the AI community especially, because they are nature inspired.
Rekha Singhal [00:08:03]: So they are energy efficient and they have good speed; it's just that the error correction needs to be taken care of and they are not yet scalable, but we foresee them getting there. So that's one. But if you look at this year's trend for 2026, the same Gartner has added some more layers, because now we have LLMs, model zoos, and many more capabilities. So this is what Gartner is saying, but you still have that future of compute intact, with more application layers added on top of it. What we are seeing is: if you look at performance as a whole, since I'm talking about IT systems, you have three basic pillars. You have your application, you have your workload coming from the business, and finally you have your infrastructure, where you want to deploy.
Rekha Singhal [00:08:49]: The infrastructure we are talking about is heterogeneous in nature. You have some compute that is here and now, but you also have some compute that is in the future, like custom accelerators, maybe via the RISC-V route, and you may have nature-inspired computing. So looking at all of these together, can I make my IT system future proof? I may have some kind of simulation platform for next-gen computing, and real benchmarks and real platforms for the here and now, and I mix them together to make my IT system future proof, with easy migration from one platform to another and easy evaluation of their effectiveness for my own workload. That's what we are doing in house. If you double-click further on heterogeneity, what we are looking at in TCS is that your application itself has many dimensions of heterogeneity. Your workload may be heterogeneous in nature. You may have an RDBMS doing the transaction processing and a different kind of database for analytical processing, like big data systems, and now you have vector databases like Chroma DB.
Rekha Singhal [00:09:57]: You may have traditional ML/DL as well as your new agentic AI tools. You don't have a homogeneous set of activities within your application itself; it's complete heterogeneity. On top of that, your deployment stack is also heterogeneous, as I just indicated. You have different kinds of hardware in terms of compute, memory, and interconnect. You have different frameworks; if you look at agentic AI itself, you have a variety of frameworks available today: CrewAI, AutoGen, LangChain. So how do you really interoperate across these frameworks? You have many middlewares, you have different cloud services for the same job, and you have different geolocations.
Rekha Singhal [00:10:35]: Your enterprise may be distributed across edge devices as well as geographically different locations. How do you deal with it? It's like a matching exercise: you have a heterogeneous deployment and a heterogeneous application. How do I really match them efficiently? That's the kind of problem statement I'm going to touch upon in this domain. I'll give you some flavor of a couple of solutions we have built in house; I will not go into detail. One of the challenges in front of us over the last two years, as soon as LLMs came into our lives, was: can we do LLM inference cost-effectively on fragmented resources over distributed data?
Rekha Singhal [00:11:15]: So, a couple of these things. How do I choose the right kind of accelerator for my own fine-tuned LLM? If I'm using an API call, then which API is good for me? Or, if I have my own fine-tuned model, especially for the SLM community, and many of us, including me, believe in SLMs, we will have our own SLM and use an accelerator to deploy it. So how do I do that? And I believe the agentic community is aligned more towards SLMs than towards OpenAI or other API calls for building agents. Then, how do I choose that design space for LLM deployment? Finally, the LLM I'm using is part of a big application, not standalone. So if I look at the full application, with its own ML/DL preprocessing, the whole pipeline, how do I choose the design space for cloud deployment? If I have time, I will also touch upon the edge-cloud solution we are building internally for design-space exploration. Okay, so let me take you a little bit into just the LLM space. Suppose I have an SLM that I have fine-tuned, and now I'm looking for an accelerator where I can put it up, right? If I look at my application, there may be many components, and one of the components may be an LLM, or several SLMs.
Rekha Singhal [00:12:37]: I have my own workload knowledge; I understand how the SLM will be used. Is it for a summarization application? Is it for a Q&A, question-answering application? Or is it a voice agent? What is my workload, and what guarantees do I have to ensure in terms of latency and throughput, which we call SLOs? Based on that, I may have to choose a different accelerator. It could be a Groq instance, it could be an A100 or H100, or maybe an Inferentia chip on AWS, or maybe some other kind of accelerator from a private cloud or neocloud provider. You have a variety of choices, and different kinds of storage you can have, different cloud services, AWS, Azure, Google, and within each, again, different services.
Rekha Singhal [00:13:26]: Even in this small example I'm showing you here, you have some 7.6 lakh (760,000) combinations to choose from. That's huge for a human to visualize, digest, and come out with the right choice from. As humans we may skip some of the choices, but computing helps us explore all the choices and be sure we are not leaving out one that may be optimal for our objective. Here I took the example of a research-report application for the financial sector; we keep building GenAI-based solutions. In this prototype I'm showcasing, I take a topic for the financial sector, generate some subtopics, and through Bing search get the top 10 questions for each of the subtopics. These then go to an embedding model, in a kind of RAG setup, which goes into my data space or data provider, retrieves the relevant chunks for each of those questions, assembles them through an LLM call, which could be an API call to a large LLM, and finally produces the result. I have a lot of choices: each SLM I can put on a different accelerator, the embedding model I can put on a different accelerator, a large design space. We explored it and figured out that based on the workload for that embedding model, whether you have 100, 300, 500, or 1,000 users, the choice of hardware for a given throughput or total cost changes. So at 1,000 users per hour, my choice might be Groq for one component, an A100 for another, and some T4 instances and H100s; that combination is able to give me the desired throughput. Compare that with the baseline: everybody today just goes for the NVIDIA A100 by default.
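To see how a design space reaches lakhs of combinations, here is an illustrative back-of-the-envelope in Python. The dimension counts below are assumptions made up for the sketch, not the actual breakdown behind the 7.6-lakh figure in the talk.

```python
from math import prod

# Illustrative only: how per-component choices multiply into a design space on
# the order of lakhs (1 lakh = 100,000). All counts here are assumptions.
dimensions = {
    "accelerator per SLM (6 chips ^ 4 SLMs)": 6 ** 4,
    "embedding-model instance type": 8,
    "vector-store / storage tier": 5,
    "cloud provider x region": 12,
}
total = prod(dimensions.values())
for name, count in dimensions.items():
    print(f"{name}: {count}")
print(f"total: {total:,} candidate deployments (~{total / 1e5:.1f} lakh)")
```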
Rekha Singhal [00:15:19]: Okay, that's supposedly the best, but it may not be the best for your workload of a thousand users; that's what it turns out to be, if you follow a proper research methodology, versus a pure A100 kind of setup. That's what our intention is. And you can keep changing it based on your workload, if you have a workload forecasting model. And if I go further and deeper: that was across different hardware. Now if I go across different cloud services, even if I'm using, let's suppose, only GPUs, but with different regions, different clouds, and different ways of distributing the blocks of my SLM across those GPUs, again you have a large design space. So we went further into it and expanded across these dimensions. And what we observed is that you can leverage just vLLM, or vLLM plus Ray, and I'm not going into the details of that stack, and the other option is Petals, which we are exploring internally.
Rekha Singhal [00:16:13]: It's a research prototype that came out of a university, and we're building on top of it to solve the industry problem: a good example of how industry leverages academic work to solve problems. And we figured out that if you do it the Petals way, with what we are doing internally, we are much better than what vLLM, or vLLM plus Ray, can give you in terms of solving these different problems. Right. So the takeaway for us here was that you can use the heterogeneity, and deal with it, to get what you really need, be it by choice or by design. I think I have some three or four minutes more, so I'll quickly touch upon another small POC, building on the whole story: if you have a retail application with some chatbot, you now have many more components than just the LLM.
Rekha Singhal [00:17:06]: You have some order processing, microservices, checkout, cart. And if I plot this on the whole design space, all these blue dots and purple dots you see are the design points in that ecosystem. This could again be in lakhs; I plotted a few of them. And the curve you see is the Pareto curve; I can choose any point on it, because it gives me high throughput and lower cost, depending on where I put my knobs. Again, we are able to do this by building this performance model, and we can be future proof with this for our applications, including agentic AI workloads. This is one of the demos for the tool; this is how the interface looks. I don't know how much time I will have.
Rekha Singhal [00:17:51]: Maybe I'll go through it quickly, very fast, where we just skim through all of it.
Adam Becker [00:17:57]: Don't skip it. Don't skip it. Keep going. We're eating this up.
Rekha Singhal [00:18:02]: Sure. Thank you. This is the tool we have developed internally, where we allow a user to pull in the architecture they are considering at the beginning. With this tool, you can see at the top the cost and latency; as you choose whatever infrastructure-as-a-service you want, you can watch how your cost and performance keep changing. You can choose the parameters; I have taken cost and latency here to show you, and you can keep playing around with the different services here.
Rekha Singhal [00:18:32]: I'm now choosing a different EC2 instance for my RAG chatbot here. By changing that, and by giving it the workload my RAG will handle, the throughput, meaning how many transactions or queries I'm expecting, the numbers on the right side keep changing accordingly. It's like a studio playground for a solution architect before they actually choose the services. What I've shown you is very primitive; we have gone much beyond it. You can choose different kinds of services, for AWS, for GCP, and multi-cloud as well.
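A toy version of the what-if calculation such a studio might run when the architect swaps an instance type is sketched below. The instance catalog, its throughput and latency numbers, and the `what_if` helper are made-up assumptions for illustration, not AWS price sheets, measurements, or the actual tool.

```python
# Hypothetical catalog: instance -> ($/hr, max queries/s, p50 latency in ms).
CATALOG = {
    "g4dn.xlarge":  (0.53, 40, 180),
    "g5.2xlarge":   (1.21, 110, 95),
    "p4d.24xlarge": (32.77, 900, 35),
}

def what_if(instance: str, queries_per_s: int) -> dict:
    """Estimate replica count and monthly cost for a target query rate."""
    price, max_qps, p50 = CATALOG[instance]
    replicas = -(-queries_per_s // max_qps)  # ceiling division
    return {
        "instance": instance,
        "replicas": int(replicas),
        "monthly_cost_usd": round(price * replicas * 24 * 30, 2),
        "p50_latency_ms": p50,
    }

# Swap instances for the same RAG-chatbot workload and compare, as in the demo.
for inst in CATALOG:
    print(what_if(inst, queries_per_s=250))
```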
Rekha Singhal [00:19:07]: That's all I have. Thank you for listening; I'm happy to take questions now. I kept to the time limit, Adam; I didn't want to overshoot.
Adam Becker [00:19:16]: No, you're very good. I am just stunned by all of this. I have to pick my jaw up from the floor after so much of your presentation. I will wait for questions from the chat.
Adam Becker [00:19:31]: I've got to tell you, though, about the entire lens of it. I tried to do some research to figure out what it is we were going to be talking about, and I got the impression it would be something about the future. I had no idea it would be so vivid about the future. And I'm fascinated, first of all, by the charter, the mandate, to go and research future proofing your AI agents. How does that even come about? Do you actually have large enterprises that are already concerned not just about keeping up, but keeping up in the future? And then you have an internal team that says: we're going to study the future, and we're going to see what we have to do.
Adam Becker [00:20:19]: How does that even happen? I'm blown away. Yeah.
Rekha Singhal [00:20:22]: No, no, thanks, Adam. That's correct. What I presented today is only from the computing aspect, and we have a large TCS Research organization, so there are dedicated teams looking into security and compliance, looking at the top vulnerabilities of all these agents. We keep tracking different agents, we look at the benchmarks, and we understand what kinds of hallucination and robustness challenges even the LLMs may have. And we do incorporate all of that beyond the specific compute aspects, because that by itself is challenging. Today your app may be deployed on hardware you have been paying for in dollars for years and years, when other hardware may suit it better than the default A100.
Rekha Singhal [00:21:08]: Right. But for agentic AI we're looking at everything: cost, performance, security, keeping track of the top vulnerabilities of these LLMs. So when you include an agent, we can be sure whether you are future proof against those kinds of threats or not. We look at all aspects within TCS Research; I've just scratched the surface a little bit.
Adam Becker [00:21:27]: Wow. I know. The moment you shared all of the different dimensions of research that you do, and then said we'd only really have time to talk about the heterogeneity, I thought: we've got to extend the time, then.
Rekha Singhal [00:21:42]: Yeah. And even there, you know, I didn't talk about many of the things; I really left out a lot of it. But I just wanted to give a glimpse of it. That's the reason I touched upon it.
Adam Becker [00:21:52]: Yeah, yeah. Well, we appreciate it. One or two things there, and I see that people in the chat feel the same; people have asked for the video because there was so much information. Let me start by tackling a couple of questions that have come up. We have one from Imad: is an old SLM fine-tuned to a certain domain better than a bigger and newer model?
Rekha Singhal [00:22:14]: What we are observing for our customers, especially when they have a specific task: we are actually building a platform that offers knowledge as a service.
Adam Becker [00:22:23]: Right.
Rekha Singhal [00:22:24]: Think of a level-3 knowledge worker in a data center, right? A worker who has to resolve an issue cannot act like a level 3 if they don't have that kind of experience. So how about encapsulating that experience in a kind of model and letting it be used by an L1 worker? Right. That becomes knowledge as a service. That job-specific experience we can't really capture with a general LLM, and in the end we need to deploy the SLM as well. So I believe the SLM fine-tuned for that task is better. I hope I answered that question.
Rekha Singhal [00:23:06]: I don't know.
Adam Becker [00:23:07]: But yeah, I think you did. Imad, if you have more questions and you'd like to tease it out in more detail, definitely let us know. So, I had another question for you about that studio view. Is that something that you're open sourcing or that you're allowing people to use, or is it something that you use internally? How do you see the future of this tool?
Rekha Singhal [00:23:28]: That's the disadvantage of industry players like us, large organizations: it's not open, unfortunately. But we do use it for our customers; we use it internally while designing their solutions, while providing services to them. Actually, this tool will be part of our platform, which will eventually be a service offered to our customers directly. Yes.
Adam Becker [00:23:53]: Very, very cool. We got one from Adrian. How can he work with you? This is his jam. Everybody's a big fan, I think.
Rekha Singhal [00:24:02]: Please reach out to me on my email or through Slack. I'll be happy to chat and understand how we can work together. Obviously I'm looking forward to working with the community.
Adam Becker [00:24:11]: Yes, yeah. Adrian, we'll put her LinkedIn and email below in the chat. Nice. Very cool. So, is it fair to say... you know what, let me put it to you this way: how resilient do you think existing enterprises are to the changes that are going to come our way? Compute changes, infrastructure changes, just the heterogeneity of the LLMs and the diversity of models. How resilient are they, and what do you think they could do as an organization, not just technologically, but as an organization, to start thinking about building more resilient and more robust approaches to the future?
Rekha Singhal [00:25:04]: Yeah, thanks, Adam, for this question; it's a very, very relevant, important question. What I'm seeing while working with developers is this: today a solution architect has to design the infrastructure, or choose an infrastructure, and it's basically based on one's experience in a limited scope, whatever projects the person has dealt with, and some of what they have heard. With that, they design the solution and the deployment stack for the application, which is limited, obviously, because any one person's experience is limited. But with tools like the one I showcased, we gather all this experience, let's suppose from all the people in the MLOps community today, and encapsulate it in some kind of model.
Rekha Singhal [00:25:51]: That means the experience is much wider than that of a single solution architect designing the system. So I would say that today things are not that resilient, because the way we design is limited to a certain experience. But if you start using GenAI-based solutions, where you can capture all the experiences, do a design-space exploration, and take a call, you'll be much better off, because in design-space exploration you can evaluate as much hardware as you want; it's computing infrastructure doing things for you, not a human being having to crunch the numbers in their mind. So I believe these kinds of technologies can help us build resilient systems, in a way that today's design practice does not. That's all.
Adam Becker [00:26:38]: Yeah. Are there good resources you would recommend for folks to keep up with changes in how we should even conceptualize agents? Or is it all evolving too quickly? Is there a blog, a book, a podcast, anything you would recommend?
Rekha Singhal [00:26:54]: Yeah, again, because things are evolving so fast, I just keep tabs for my own purposes. I'm a researcher, so I can share what I'm doing. For compute, I really look into all the latest research on the future of compute, and my focus is more on nature-inspired computing, because that's the future; all silicon-based computing is here and now, so I just use whatever is available, because I know things are moving pretty fast. For nature-inspired computing I track journals in photonic computing, neuromorphic computing, and DNA computing, because they will become here-and-now five years down the line. So I'm working more towards those computing paradigms: looking at the simulators, evaluating them, and seeing how applications can be mapped to those kinds of compute, including quantum. I think quantum will soon be there; for AI specifically, I don't know.
Rekha Singhal [00:27:48]: But quantum may be there for the financial sector and the health sector, for large optimization problems. So I'm looking more at that domain; the rest comes to your inbox anyway, through the MLOps Community newsletter itself or some other newsletter, you keep getting that news. But on my own, I make the effort to look beyond what is here today.
Adam Becker [00:28:10]: That is exactly what it felt like listening to you: taking a step into the beyond. Rekha, thank you very much. This was absolutely fascinating. If you're ever up for it, I would love to pick your brain about the future of bio- and nature-inspired computing too. We should do a podcast about that; I think it would be interesting. Thank you.
Rekha Singhal [00:28:29]: I'll be happy to do that. Thank you Adam. We'll be happy to work with you. Thank you so much.
Adam Becker [00:28:35]: Absolute pleasure having you on. Thank you very much for coming.
Rekha Singhal [00:28:38]: Thank you for listening. Thank you, everyone.
Adam Becker [00:28:40]: Yeah, well, thanks.

