Multimodal AI Agents: Model Performance Meets User Experience

Date:
Invited Keynote Speech, 16th ACM International Conference on Multimedia Retrieval (ICMR 2026), Amsterdam, Netherlands

Multimodal AI agents are evolving into systems capable of understanding text, vision, and spatial environments while collaborating directly with humans. This talk explores how we should evaluate these agents from two complementary perspectives: model performance (how effectively they reason, perceive, and execute complex tasks) and user experience (how they influence human creativity, learning, and well-being). Through applications in mixed reality guidance, physical assembly benchmarking, virtual co-building, and gaze interpretation, we examine the opportunities and limitations of current multimodal systems. Although these agents show strong potential for transforming learning and creativity across virtual and physical environments, advances in spatial reasoning and long-horizon task execution are still needed before they can become reliable real-world assistants.