When Agents Learn to Feel: Multi-Modal Affective Computing in Production // Chenyu Zhang
SPEAKER

Chenyu Zhang is a cross-disciplinary software engineer, researcher, and entrepreneur at the intersection of artificial intelligence, affective computing, and education. He holds a Master’s in Education (Learning Design, Innovation, and Technology) from Harvard Graduate School of Education and an Honours B.Sc. in Computer Science, Mathematics, and Statistics from the University of Toronto.
Chenyu has conducted research at MIT Media Lab and Harvard’s Berkman Klein Center, focusing on affective computing, multimodal emotion recognition, and AI interventions for learning frustration. His work has been published at SIGCSE and submitted to NeurIPS, AAAI, ACII, and AERA.
As founder of GlowingStar Inc., an MIT- and Harvard-affiliated startup, Chenyu is building the world’s first emotion-aware AI tutor designed to provide every learner—from working-class students to lifelong learners—with a personalized, empathetic learning companion. Previously, he worked at Block, Manulife, and ROSS Intelligence, leading full-stack engineering and accessibility initiatives. He has also taught at MIT, Stanford, and the University of Toronto, and mentored students through AI4ALL Ignite.
Chenyu’s broader mission is to make world-class education accessible to all by blending technical innovation, research, and coaching. He is certified as both a Co-Active Coach and a Designing Your Life Coach, and he frequently writes and speaks on AI, ethics, and educational equity.
SUMMARY
The next generation of AI agents won’t just respond to what we say—they will sense how we feel. As large language model–powered agents move from research prototypes into production, a critical frontier is the integration of multi-modal affective computing: combining voice, text, facial expressions, and interaction patterns to detect the learner’s or user’s emotional state in real time. This talk explores the challenges and opportunities of deploying emotion-aware AI tutors in production environments. Drawing from ongoing research at MIT Media Lab and Harvard, and from startup experience building GlowingStar, I will share how multi-modal signals—speech tone, facial micro-expressions, response latency, and even silence—can be fused into affective state estimates that meaningfully improve user experience. We will unpack the technical lessons learned from moving affective sensing beyond the lab: designing architectures that combine ensemble LLMs with sensor inputs, diagnosing when modalities conflict or sabotage each other, and establishing guardrails for privacy and consent in sensitive domains like education. In parallel, I will highlight multi-agent orchestration patterns—including critic–rewriter loops and role-based ensembles—that make it possible to personalize instruction, generate equitable feedback, and sustain engagement across diverse learners. By the end of this session, attendees will have a clear picture of what it takes to move multi-modal, affect-sensing agents from demos to durable production systems: the architectures, the pitfalls, and the metrics that matter. More importantly, we will consider how these lessons extend beyond education to any industry where AI agents must not only think, but also feel with and for the human in the loop.
TRANSCRIPT
Chen Yu Zhang [00:00:05]: So my talk today focuses on what happens when AI agents begin to sense and respond to human emotions in real time. Just a quick self-introduction: hello everyone. My name is Chen Yu Zhang. I'm a researcher, educator, and entrepreneur working at the intersection of affective computing, large language models, and the learning sciences. I study how AI can understand not only what learners say, but how they feel, and how that awareness can improve instruction and support. My work spans the MIT Media Lab, Stanford HAI, and Harvard. I'm also the founder of GlowingStar Inc., where I develop emotionally aware AI tutors designed to give every learner access to a world-class, personalized learning experience.
Chen Yu Zhang [00:00:52]: I'm excited to share my work with you today. This is the agenda of my talk. We'll begin by grounding the idea of affective agents through a taxonomy that classifies how emotions fit within agent architectures. Then I will highlight key challenges and the opportunities they unlock, especially around memory, learning, and multimodal sensing. Finally, we will look at the ethical and societal concerns that come with emotion-aware systems, especially in sensitive domains like education. Okay, so I want to highlight a very recent shift in the field. On the left you can see a Google Trends graph showing how interest in emotion AI and agentic AI has skyrocketed in the past few years. This isn't just academic curiosity.
Chen Yu Zhang [00:01:45]: It reflects a structural change in how AI is being built and deployed. On the right is a screenshot from OpenAI's GPT-5.1 update, where they quietly introduced personality presets. This is important because it signals that even the largest LLM providers are beginning to acknowledge the emotional layer of interaction. The presets aren't full affective reasoning yet, but they are an early step toward modeling tone, warmth, directness, and social style, all of which shape how humans interpret intent and trust. So the real question is: why now? As we move from tools to agents, the missing piece isn't more logic or more memory. It's emotional attunement, because humans rely on affective cues to assess safety, trust, confusion, and engagement. Without emotion, agents remain brittle, unable to adjust to frustration, disengagement, or excitement.
Chen Yu Zhang [00:02:55]: Emotion isn't optional in AGI development. It's a core part of how humans learn, collaborate, and make decisions. Okay, so this rise in interest and OpenAI's move toward personality-aware interactions both point to the same conclusion: future AI agents must be able to sense and adapt to how we feel, not just what we say. Now, there's been some confusion around the terminology here, right? AI agents, agentic AI, and agent AI. So let's talk about them. They're often used interchangeably, but they represent different scopes. AI agents are tools, or tool-using LLM loops.
Chen Yu Zhang [00:03:41]: Agentic AI is autonomous, with goal-directed behavior. Agent AI, introduced in a recent paper, involves embodied, perceptual, interactive systems. So when we add affect into this mix, we are expanding the perception and interaction layers by giving agents a sense of how humans feel, not just what we say. Okay, so let's take a look at some early concepts from almost 2,000 years ago. This slide shows how thinkers like Plato and Aristotle organized the mind into feeling, thinking, and doing. These ideas are 2,000 years old, but they highlight something we still recognize today: emotion is not separate from intelligence. Modern neuroscience shows an even tighter integration.
Chen Yu Zhang [00:04:41]: Affect shapes attention, memory, learning, and decision making. It isn't just about emotional expression. It's a core part of how humans process information. Current AI agents mostly cover the thinking and doing parts. What's still missing is the affective layer that guides interpretation, motivation, and adaptive behavior. Affective agent AI aims to fix exactly that gap. This is the agent architecture proposed by Google. It has all the core components we expect in modern agent systems.
Chen Yu Zhang [00:05:19]: An orchestration layer that manages profiles, goals, and instructions; a memory system with short-term and long-term components; a reasoning and planning module; and a tool interface that allows the agent to act in the world. That's a solid blueprint for cognitive and behavioral intelligence. But what's notable is what it doesn't include. There is no representation of affective context. The architecture does not sense whether the user is confused, frustrated, disengaged, or overwhelmed. Without that layer, even a well-designed agent can deliver the right information in the wrong way. This is exactly the gap that affective agent AI tries to fill, by adding emotional perception and adaptation alongside the cognitive and action loops. Ladies and gentlemen, I want to propose my definition of affective agent AI.
Chen Yu Zhang [00:06:16]: I expanded the agent architecture by making perception and emotional modeling first-class components. In most existing agent frameworks, perception is limited to text inputs or tool outputs. Here I treat perception as a multimodal layer that can ingest signals like voice, facial expressions, latency, and interaction patterns. Next to that, there's an explicit emotional modeling module. Its role is not to generate artificial emotions, but to estimate the user's affective state and feed that into reasoning and planning. This gives the agent contextual awareness that goes beyond content: it also knows when the learner is confused, frustrated, disengaged, or motivated. This also applies to the ticket-booking demo you saw in the previous speaker's session. In the multi-agent version below, orchestration becomes even more important.
Chen Yu Zhang [00:07:17]: Instead of coordinating only tasks or tools, we also coordinate emotional responsibilities across agents. One agent might detect frustration, another might critique the explanation, and another might rewrite the whole thing in a calmer or clearer style. In this setup, emotion becomes part of the system's control flow, not an afterthought. Okay, so the idea is simple: for agents to behave intelligently with humans, emotional signals must shape their decisions just as strongly as goals or instructions do. This architecture formalizes that by giving affect its own dedicated pathways inside the agent system. Next I want to walk through the taxonomy of affective agent AI, starting with perception.
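A minimal sketch of the emotion-aware critic–rewriter orchestration described above, in Python. It assumes a generic `call_llm(prompt) -> str` helper passed in by the caller; the role prompts, thresholds, and field names are hypothetical illustrations, not the actual GlowingStar implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AffectEstimate:
    frustration: float  # 0.0 (calm) to 1.0 (very frustrated)
    engagement: float   # 0.0 (checked out) to 1.0 (locked in)

def orchestrate(question: str,
                draft_answer: str,
                affect: AffectEstimate,
                call_llm: Callable[[str], str]) -> str:
    """Route a draft explanation through critic and rewriter roles,
    letting the affect estimate shape the control flow."""
    # Critic agent: checks the explanation regardless of affect.
    critique = call_llm(
        f"Critique this explanation of '{question}' for correctness "
        f"and clarity:\n{draft_answer}"
    )
    # Rewriter agent: only invoked when the learner shows frustration,
    # and instructed to slow down and simplify.
    if affect.frustration > 0.6:
        return call_llm(
            "Rewrite the explanation below in a calmer, step-by-step tone. "
            f"Address this critique: {critique}\n\n{draft_answer}"
        )
    # Otherwise a lighter revision pass is enough.
    return call_llm(f"Revise using this critique: {critique}\n\n{draft_answer}")
```

The point of the sketch is that the affect estimate sits in the control flow itself, deciding which agent role runs, rather than being appended to a prompt as an afterthought.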
Chen Yu Zhang [00:08:22]: Affective perception spans multiple senses. Humans use vision, voice, interoception, and contextual cues. Modern multimodal agents are approaching this, but still struggle with integration. This diagram reminds us how complex the human perceptual system is. In production, which is the focus of this whole talk, we use simplified signals: tone of voice, facial action units, typing latency, silence. The challenge is fusing these modalities without overfitting or drawing incorrect emotional conclusions. Next up, we have tools, and we have to talk about MCP. MCP is an important piece of the puzzle when we build these agents in production, because agents need access to tools that help contextualize emotional signals.
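Before continuing with tools, a minimal late-fusion sketch of the perception step just described. It assumes each modality has already been reduced to a hypothetical frustration score in [0, 1] with its own confidence; the weighting and disagreement penalty are illustrative choices, not the production model discussed in the talk.

```python
def fuse_affect(signals: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """signals maps modality name -> (frustration_score, confidence).

    Returns (fused_score, fused_confidence). Confidence is discounted
    when modalities disagree, so the agent avoids drawing overconfident
    emotional conclusions from conflicting cues."""
    if not signals:
        return 0.0, 0.0
    scores = [score for score, _ in signals.values()]
    weights = [conf for _, conf in signals.values()]
    total = sum(weights) or 1e-9
    fused = sum(s * w for s, w in zip(scores, weights)) / total
    # Disagreement penalty: spread between the most and least alarmed modality.
    spread = max(scores) - min(scores)
    confidence = (total / len(signals)) * (1.0 - spread)
    return fused, max(confidence, 0.0)

# Example: voice sounds tense, face looks neutral, typing latency is high.
estimate = fuse_affect({
    "voice_tone": (0.8, 0.7),
    "facial_action_units": (0.2, 0.6),
    "typing_latency": (0.7, 0.5),
})
```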
Chen Yu Zhang [00:09:27]: For example, a tool can store long-term affect history or fetch user-specific data. In affective agents, tools don't just help with logic, they help with interpreting patterns such as recurring frustration or disengagement. So tool orchestration becomes part of the emotional intelligence. Okay, memory. A big blocker is memory, a key component of any AI agent. Human memory retrieves emotionally charged experiences much more readily, but agents treat all input uniformly.
Chen Yu Zhang [00:10:16]: That means they don't prioritize certain inputs over others. They have weak forgetting and weak abstraction. Emotional signals often get lost because they aren't tagged or prioritized. In current AI agent systems, the opportunity is to introduce shallow episodic memory with emotional tagging: not only what happened, but how the user felt. This is key for tutoring systems in production. For example, if a learner struggled with recursion last week and showed frustration, the agent should remember and adjust future pacing. Now let's talk about learning in AI agents.
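Before that, a minimal sketch of the emotionally tagged episodic memory just described. All names and fields here are hypothetical; a real deployment would persist these records and, as discussed later in the talk, gate them behind explicit consent.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    topic: str          # e.g. "recursion"
    summary: str        # what happened in the session
    frustration: float  # affect tag recorded at the time, 0..1

@dataclass
class EpisodicMemory:
    episodes: list[Episode] = field(default_factory=list)

    def remember(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def salient(self, topic: str, threshold: float = 0.5) -> list[Episode]:
        """Retrieve past episodes on this topic where the learner showed
        notable frustration, so the tutor can adjust pacing up front."""
        return [e for e in self.episodes
                if e.topic == topic and e.frustration >= threshold]

memory = EpisodicMemory()
memory.remember(Episode("recursion", "struggled with base cases", 0.8))
if memory.salient("recursion"):
    pass  # start slower and revisit base cases before new material
```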
Chen Yu Zhang [00:11:14]: So humans learn through curiosity, pride, or fear of failure. Agents currently learn only through external goals defined by us. Emotional context can serve as an internal motivation signal. For example, in an education context, if a learner is disengaged, the AI agent we build should adapt and re-engage them. If a learner shows excitement, the agent should deepen the challenge. So emotional modeling allows reasoning not just about logic, but about what truly matters to human decision making. Okay, so last but not least, this is one of the most critical parts of the conversation.
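A minimal sketch of using the affect estimate as the adaptation signal just described, mapping engagement and frustration to a pedagogical move. The thresholds and move names are hypothetical illustrations, not the actual GlowingStar policy.

```python
def next_move(engagement: float, frustration: float) -> str:
    """Map the current affect estimate to a pedagogical move."""
    if frustration > 0.7:
        return "back_off"          # simplify, offer a worked example
    if engagement < 0.3:
        return "re_engage"         # switch format, ask a direct question
    if engagement > 0.7 and frustration < 0.3:
        return "deepen_challenge"  # raise difficulty while excitement is high
    return "continue"              # keep the current pacing

assert next_move(engagement=0.8, frustration=0.1) == "deepen_challenge"
assert next_move(engagement=0.2, frustration=0.2) == "re_engage"
```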
Chen Yu Zhang [00:12:09]: The ethical and societal concerns with this technology. Emotional data is far more sensitive than ordinary behavioral data, even more sensitive than PII and PHI, personally identifiable information and personal health information, because it reveals internal states people may not even be consciously expressing. That raises major questions about privacy, ownership, consent, and transparency. Users often don't know when their emotional cues are being analyzed, let alone how those inferences are stored or used. There is also the risk of manipulation. When systems can detect fear, confusion, or enthusiasm, it becomes easy to cross the line from supporting users to influencing them in ways they didn't choose. Offloading emotional labor to AI can also change human relationships; I've seen many AI companionship apps that cause serious issues with minors.
Chen Yu Zhang [00:13:28]: Further, it can reduce the development of emotional resilience. Finally, we have significant scientific challenges. Emotion recognition often fails to generalize across cultures, contexts, and demographics. Misclassification is common, yet systems can express high confidence. These issues make it essential that emotional AI be deployed cautiously, with rigorous validation and clear ethical boundaries. Okay, with that being said, thank you everyone for your attention. I'm happy to take any questions, and if you'd like to continue the conversation or explore collaboration, feel free to connect with me on LinkedIn or reach out by email. Thank you.
Demetrios Brinkmann [00:14:20]: Awesome. I've got some questions for you. Don't go anywhere, don't go anywhere just yet. I've got a question for you and there are some coming through in the chat. Yeah, so first of all, awesome talk. I love this idea. One of the first questions to come through, from Marco, is one I was also thinking as you were talking: why choose to do this at the agent level and not at the model level?
Chen Yu Zhang [00:14:52]: That's a great question. First of all, there is a debate, right, in the industry about the direction of development: do we need a bigger model, or do we need a bunch of smaller models? I think right now we are still debating the directions, and I think this can be achieved both ways. Either the big players out there build a bigger model that can sense emotions, or we have a bunch of smaller models that offer flexibility and let users pick and choose what components they want to use. So I think both directions are still valid.
Demetrios Brinkmann [00:15:31]: Okay. Yeah, it does seem like giving it up to the user in that case, and deciding almost like what sentiment you want to use as you're building it, is a fairly nice choice. Let the user decide. It's cool that you're going down the agent route and you're seeing that.
Demetrios Brinkmann [00:15:59]: Another question that's coming through is: is it limited to audio? Is it limited to text? Are there certain ways, like, does it have to be? It feels like it can only be audio, because that's where you get richer data.
Chen Yu Zhang [00:16:15]: I love this question. The question is essentially asking about modality, what kinds of signals we are processing and outputting. Right now we are mostly limited to text and audio. Basically, the input and output are textual information, audio, or sometimes images. However, I do want to argue that modalities are not limited to the three we commonly use: textual, auditory, and visual information. We can expand our imagination, right? There are all kinds of other signals we can include in the agent in the future.
Chen Yu Zhang [00:16:57]: So I do want to argue modalities are not limited to those three.
Demetrios Brinkmann [00:17:02]: Wait, what are the other ones? I have to ask.
Chen Yu Zhang [00:17:06]: How many senses does a human have? Aha.
Demetrios Brinkmann [00:17:09]: So smell, you're saying? Smell, taste, right?
Chen Yu Zhang [00:17:13]: In the future there could be more.
Demetrios Brinkmann [00:17:17]: I don't understand that at all. You got to say more.
Chen Yu Zhang [00:17:22]: So right now we haven't seen large foundation models that include those signals yet, but the technology and the training process are similar, so if we want to start modeling those things, we can also do that. Also, another trend is embodied AI, right? The robots. I see a big trend; maybe robots will have their ChatGPT moment soon. So when agents become embodied agents, in other words robots, maybe we want to add taste to them as well.
Demetrios Brinkmann [00:17:55]: Damn, that's some wild stuff. I like the way you think. This is very cool. Okay, so I'll keep rolling with you on that one. We've got robots with ears, they've got eyes, they're seeing. That's fairly common to us right now. They can talk, they can hear.
Demetrios Brinkmann [00:18:14]: And now you're thinking, well, the touch is the next one, the taste is the next one. How do you even go about training a model in those modalities?
Chen Yu Zhang [00:18:25]: Yeah, so I mean, large language models are still in the world of big data, right? If you can model those signals, you can use a similar training process to train the models to behave in a certain way. Also, the main theme of this talk is affect, in other words, emotions. Emotions are not one of the human senses, but they're crucial, and that's the main theme of this talk. So I want to argue that emotions are the missing piece in AI agent development, or at least the underdeveloped area right now. And it's a very crucial one.
Demetrios Brinkmann [00:19:01]: Well, shoot, dude, I can barely understand my own emotions. How are we going to expect the robot to understand them?
Chen Yu Zhang [00:19:08]: Yeah, I mean, that's a fair question. For robots or AI agents, there are some benchmarks out there. However, I do want to argue that those benchmarks can also be improved. Basically, they collect the data and ask three annotators to annotate the datasets. I would say those three annotators cannot represent you or me individually. Also, there are cultural differences, et cetera.
Chen Yu Zhang [00:19:40]: So I think there's still a lot of room to grow within the space of affective agent AI. That's a great question.
Demetrios Brinkmann [00:19:47]: Yeah, yeah. And the training data is my conversations with my therapist.
Chen Yu Zhang [00:19:57]: Hopefully not. And I would argue privacy is extremely important. Unless you explicitly give consent to allow the agents to train on your data, by default, we should say no.
Demetrios Brinkmann [00:20:10]: Well, bro, I love it when my mind gets bent and I have to try and fit a new idea into my frame of reference. You've done that today with this talk and with our discussion right now. Thank you so much for coming on here. We're going to keep things rolling.
Chen Yu Zhang [00:20:27]: Thank you.

