Monday, May 18, 2026

AI Papers Daily — Monday, May 18, 2026

The recurring finding across today's papers isn't that AI systems fail—it's that they fail in the specific place nobody was measuring. Models cite wrong evidence while getting answers right. Agents violate security boundaries mid-task while producing correct outputs. Robots struggle with generalization in exactly the conditions that weren't in the test set. The pattern is consistent enough to feel almost deliberate: we benchmark the endpoint, then act surprised when the path there was wrong. Several papers exist precisely because someone noticed the gap between "the model got it right" and "the model got it right for the right reasons." Here's what researchers were building—and auditing—this week.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He - 226 upvotes - arXiv
Researchers created CiteVQA, a test suite that checks whether document-reading AI models not only give correct answers to questions about PDFs, but also point to the exact right passages as evidence. They tested 20 models and found a widespread problem: models often get the answer right while citing completely wrong sections of the document—a critical flaw in legal, medical, or financial work where you need to trust where conclusions come from.
Why it matters: If you're building document AI for regulated industries or anywhere explainability matters, you need to measure whether your model's reasoning is actually correct, not just whether it guesses the right answer.
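One way to see the gap this benchmark targets is to score the answer and the cited evidence separately. The sketch below is illustrative only: the record fields, the exact-match answer check, and the rule that every cited passage must be a gold passage are assumptions made for this example, not CiteVQA's actual metric.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str
    cited: set[str]      # passage ids the model pointed to (hypothetical format)


@dataclass
class Gold:
    answer: str
    evidence: set[str]   # passage ids that actually support the answer


def score(pred: Prediction, gold: Gold) -> dict[str, bool]:
    """Score the answer and the citation independently, then jointly."""
    answer_ok = pred.answer.strip().lower() == gold.answer.strip().lower()
    # Assumed rule: at least one citation, and every citation is a gold passage.
    citation_ok = bool(pred.cited) and pred.cited <= gold.evidence
    return {
        "answer_ok": answer_ok,
        "citation_ok": citation_ok,
        "both_ok": answer_ok and citation_ok,
    }


# Right answer, wrong passage: the case an answer-only metric cannot see.
gold = Gold(answer="4.2%", evidence={"p3-para2"})
pred = Prediction(answer="4.2%", cited={"p7-para1"})
result = score(pred, gold)
```

An answer-only metric would count this prediction as correct, while the joint check does not.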
PhysBrain 1.0 Technical Report
Shijie Lian, Bin Yu, Xiaopeng Lin, Changti Wu, Hang Yuan, Xiaolin Hu, Zhaolong Shen, Yuzhuo Miao, Haishan Liu, Yuxuan Tian, Yukun Shi, Cong Huang, Kai Chen - 134 upvotes - arXiv
PhysBrain 1.0 trains AI models to understand physics and spatial relationships by first extracting structured knowledge from large amounts of human video (what objects are where, how they move, what happens when you interact with them), then teaching robots to use that knowledge. The system outperforms existing approaches on multiple robot control tasks and generalizes well to new environments it hasn't seen before.
Why it matters: If you're building robot systems or embodied AI products, this shows you can bootstrap physical reasoning from cheap human video rather than relying only on expensive robot data—potentially accelerating how fast your system learns to handle new tasks and environments.
MMSkills: Towards Multimodal Skills for General Visual Agents
Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu - 109 upvotes - arXiv
Researchers built MMSkills, a system that packages reusable instructions for visual agents (like AI that controls computers or plays games) by combining text procedures with screenshots and visual checkpoints showing what success looks like. They created a process to automatically extract these skill packages from recorded agent interactions, then designed a way for agents to reference these packages during execution—comparing live screenshots against the stored visuals to stay on track without getting confused.
Why it matters: If you're building visual agents (web automation, game-playing AI, robotics), you can now compose complex tasks from reusable skill libraries that include visual examples, not just text instructions, which makes agents more reliable at following multi-step procedures.
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao - 54 upvotes - arXiv
Researchers built FashionChameleon, a system that lets users swap clothing on people in videos in real-time (23.8 frames per second) without needing videos of multiple outfits. The trick: train on single-outfit videos but deliberately mismatch the reference outfit during training so the model learns to preserve how people move while changing clothes. They then added a technique to let users switch garments mid-generation and keep motion consistent across long videos without retraining.
Why it matters: If you're building e-commerce or creator tools, this means you can now offer interactive virtual try-on that actually runs fast enough to feel responsive, not as a batch process.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang - 51 upvotes - arXiv
Researchers studied why on-policy distillation (OPD)—a technique where you train a large language model to mimic a stronger version of itself—trains so efficiently. They found that OPD works well because it figures out early which parts of the model matter most for reasoning and focuses updates there, plus its weight changes align with the final trained model's direction from the start. They built EffOPD, a method that speeds up OPD by 3x by taking bigger steps in the direction the model is already heading, without adding extra complexity.
Why it matters: If you're fine-tuning large language models, EffOPD gives you a concrete way to cut training time by 3x without rebuilding your training pipeline or tuning new hyperparameters.
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
Hanwen Wang, Weizhi Zhao, Xiangyu Wang, Siyuan Huang, He Lin, Boyuan Zheng, Rongtao Xu, Gang Wang, Yao Mu, He Wang, Lue Fan, Hongsheng Li, Zhaoxiang Zhang, Tieniu Tan - 48 upvotes - arXiv
Researchers created DexJoCo, a collection of 11 robot manipulation tasks designed to test what dexterous hands (multi-fingered robotic hands, as opposed to simple two-finger grippers) can do uniquely—like using tools, coordinating both hands together, and solving multi-step problems. They collected about 1,100 real robot demonstrations, tested existing AI models on these tasks under different conditions (like changed lighting or camera angles), and found that current approaches struggle with long tasks and generalizing to new situations.
Why it matters: If you're building robotic manipulation products, you now have a standard set of tasks to measure whether your approach actually works better than others, rather than each company testing on their own cherry-picked problems.
Auditing Agent Harness Safety
Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang - 44 upvotes - arXiv
Researchers built a testing framework called HarnessAudit to check whether AI agents safely follow rules during their entire execution—not just whether they give the right final answer. They tested 210 real-world tasks across different agent setups and found that agents often violate safety boundaries mid-execution (accessing resources they shouldn't or leaking information to the wrong agent), even when they eventually output a correct response, and these problems get worse as tasks get longer.
Why it matters: If you're deploying multi-agent systems in production, you can't rely on output validation alone—you need to audit what happens during execution, especially resource access and information handoffs between agents, or you'll miss real security violations.
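Auditing execution rather than output can be pictured as checking every step of a trace against a policy. This is a minimal sketch under stated assumptions: the `Step` record, the `ALLOWED` policy table, and the agent names are all hypothetical and are not HarnessAudit's interface.

```python
from typing import NamedTuple


class Step(NamedTuple):
    agent: str
    action: str      # e.g. "read" or "send"
    target: str      # a resource path or the receiving agent


# Hypothetical policy: which (action, target) pairs each agent may perform.
ALLOWED = {
    "planner": {("read", "task.md"), ("send", "coder")},
    "coder": {("read", "src/"), ("send", "planner")},
}


def audit(trace: list[Step]) -> list[int]:
    """Return the indices of steps that fall outside the agent's allow-list."""
    return [
        i for i, step in enumerate(trace)
        if (step.action, step.target) not in ALLOWED.get(step.agent, set())
    ]


trace = [
    Step("planner", "read", "task.md"),
    Step("coder", "read", "secrets.env"),   # boundary violation mid-task
    Step("coder", "send", "planner"),       # final handoff looks fine
]
violations = audit(trace)
```

Here the last step is permitted, so a check on the final handoff alone would pass, while the per-step audit flags step 1.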
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
Taewon Yun, Jisu Shin, Jeonghwan Choi, Seunghwan Bang, Hwanjun Song - 35 upvotes - arXiv
Researchers built CoRD, a system that teaches smaller AI models to reason step-by-step by having multiple larger models work together during training data creation. Instead of just picking the best final answers from big models, CoRD has them collaborate at each reasoning step, using a scoring system to pick the most promising paths forward — similar to how chess engines explore multiple candidate moves. The result: smaller models matched the reasoning ability of larger ones while needing less training data.
Why it matters: If you're deploying reasoning tasks (math, planning, complex QA), CoRD lets you run smaller, faster models without sacrificing quality — cutting inference costs while keeping the reasoning capability.
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
Yang Yue, Fangyun Wei, Tianyu He, Jinjing Zhao, Zanlin Ni, Zeyu Liu, Jiayi Guo, Lei Shi, Yue Dong, Li Chen, Ji Li, Gao Huang, Dong Chen - 31 upvotes - arXiv
Researchers identified that image tokenizers (tools that compress images into discrete chunks for AI models to process) struggle with text and faces because they treat all image content equally when compressing. They built InsightTok, which uses special loss functions that pay extra attention to preserving readable text and facial details during compression, and showed this produces clearer text and better faces in generated images without hurting overall image quality.
Why it matters: If you're building an image generation product, this means you can get better text rendering and face quality from autoregressive models without retraining from scratch—just swap in a better tokenizer.
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang - 29 upvotes - arXiv
Training video diffusion models (AI systems that generate videos) to match human preferences is extremely slow—taking hundreds of GPU days. This team built Flash-GRPO, which trains the model in a single step per iteration instead of across the entire generation process, and still produces better results faster. They fixed two technical problems: making sure the model learns consistently regardless of which part of video generation it's optimizing, and ensuring gradient updates (learning signals) have consistent strength across different stages.
Why it matters: If you're building video generation products, this cuts training time from weeks to days, making it feasible to iterate on preference-aligned models without massive compute budgets.
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang - 29 upvotes - arXiv
Researchers developed NudgeRL, a method that helps AI models explore more effectively when learning to solve math problems through trial and error. Instead of just sampling more attempts (which is expensive), they guide the model to try diverse reasoning strategies by conditioning each attempt on lightweight context hints, then use a decomposed reward signal to learn from both successful and unsuccessful attempts. The approach solves harder math problems with 8x fewer samples than standard methods.
Why it matters: If you're building AI systems that learn by verification (like math solvers or code generation), this means you can get better results with a fixed compute budget by being smarter about what you explore, rather than just throwing more compute at random sampling.
ReactiveGWM: Steering NPC in Reactive Game World Models
Zeqing Wang, Danze Chen, Zhaohu Xing, Zizhao Tong, Yinhan Zhang, Xingyi Yang, Yeying Jin - 24 upvotes - arXiv
Researchers built ReactiveGWM, a system that generates realistic game video where NPCs (non-player characters) actually respond to what the player does—rather than just treating NPCs as static background. They separated player inputs from NPC behavior in the model, letting them control what strategy an NPC uses (offensive, defensive, etc.) through text prompts, and found they could apply these controls to completely different games without retraining.
Why it matters: If you're building interactive AI game experiences or simulations, you can now steer NPC behavior through language commands and reuse that behavior across games instead of building new models for each one.
Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu - 18 upvotes - arXiv
Researchers built Solvita, a system that helps AI models solve competitive programming problems by organizing the work into four specialist agents (a planner, code writer, verifier, and debugger) that each learn from past attempts. Unlike previous multi-agent approaches that forget what they've learned after each problem, Solvita keeps a learnable memory network around each agent that gets updated based on success/failure signals, so the system gets better at routing strategies based on what worked before.
Why it matters: If you're building AI coding assistants or other tools that need to solve hard problems iteratively, this shows you can get better performance by having the system learn from its own failures and past attempts without retraining the underlying model.
Hölder Policy Optimisation
Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang - 16 upvotes - arXiv
This paper fixes a problem in GRPO, a technique for training large language models where you compare multiple response attempts to figure out which ones are better. The issue: GRPO uses a fixed formula to combine token-level scores into a single training signal, which causes training either to collapse or to underperform. The researchers propose HölderPO, which swaps in a flexible aggregation formula (the Hölder mean) and automatically adjusts its settings during training (starting strict and gradually relaxing) to balance between focusing learning on rare high-value signals and keeping training stable.
Why it matters: If you're fine-tuning language models for reasoning tasks, this means you can get more stable training and better final performance (7% improvement on math benchmarks in their tests) without manually tweaking hyperparameters for each new problem.
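For readers unfamiliar with it, the Hölder (power) mean of positive values x_1..x_n with exponent p is ((1/n) * sum(x_i^p))^(1/p). The sketch below shows only that formula and how the exponent shifts weight between large and small values. What HölderPO actually aggregates and how it schedules the exponent are not specified in this summary, so nothing here should be read as the paper's implementation. The restriction to strictly positive inputs is an assumption of this example.

```python
import math


def holder_mean(values: list[float], p: float) -> float:
    """Power (Hölder) mean of strictly positive values with exponent p."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if p == 0:
        # The limit as p -> 0 is the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


scores = [1.0, 4.0]
arithmetic = holder_mean(scores, 1)    # plain average
geometric = holder_mean(scores, 0)     # limit case
harmonic = holder_mean(scores, -1)     # pulled toward the smaller value
```

Raising p moves the result toward the largest value and lowering it moves the result toward the smallest, which is the single knob a fixed aggregation formula lacks.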
MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning
Yaolun Zhang, Yujie Zhao, Nan Wang, Yiran Wu, Jiayu Chang, Yizhao Chen, Qingyun Wu, Jishen Zhao, Huazheng Wang - 10 upvotes - arXiv
Most systems that automatically build multi-agent workflows (where multiple AI agents work together) either do planning at test-time or train only the high-level 'designer' while keeping individual agents fixed. MetaAgent-X jointly trains the designer that creates the workflow and the individual agents that execute it, using reinforcement learning (rewarding good outcomes). The authors introduce techniques to keep training stable and report a 21.7% improvement over existing approaches.
Why it matters: If you're building products with multiple AI agents coordinating, you can now train the entire system end-to-end rather than hand-tuning how agents interact—meaning better agent collaboration without manual tweaking.
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan - 9 upvotes - arXiv
Researchers tackled a hard problem in AI agents that control computers: clicking on exact pixel locations in design tools, where being off by a few pixels breaks the whole drawing (unlike regular web clicking where nearby pixels work fine). They built a benchmark with nearly 5,000 design tasks and a system called PAGER that learns to plan geometric steps carefully and execute them precisely, using both supervised training on pixel-level examples and reinforcement learning with geometric feedback — and it solves 4x more tasks than existing AI models.
Why it matters: If you're building AI agents for design tools, CAD software, or any interface requiring precise coordinate control, general vision-language models will fail silently; you need task-specific training and geometric error handling to make them work.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee - 9 upvotes - arXiv
Researchers built a system that breaks down complex image editing instructions (like 'make this ad look more vegetarian-friendly') into smaller steps, then learns how to execute them by trying different tool combinations and getting feedback on whether the edits actually worked. Unlike previous methods that relied on hand-coded rules or copying expert demonstrations, this system learns directly from trial-and-error, letting the planning and execution improve together based on real editing results.
Why it matters: If you're building an image editing product, this shows you can handle vague, multi-step user requests without pre-programming every possible workflow — the system figures out the right sequence and tools by learning from outcomes rather than being told exactly what to do.
Steered LLM Activations are Non-Surjective
Aayush Mishra, Daniel Khashabi, Anqi Liu - 9 upvotes - arXiv
Researchers tested whether you can achieve the same internal behavior in large language models using activation steering (directly editing the model's internal computations) as you can through normal text prompts. They found that activation steering creates internal states that can't be reached by any prompt, proving mathematically that these two control methods are fundamentally different. The upshot: a behavior you can induce by editing the model directly is not evidence that the same behavior can be induced through clever prompting.
Why it matters: If you're building safety evaluations or interpretability tools using activation steering, you need separate testing for actual prompting attacks—success with steering doesn't prove your model is vulnerable to the attacks you care about in production.
CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
Jiale Liu, Jungang Li, Jieming Yu, Xinglin Yu, Zihao Dongfang, Zongjian Ding, Kaifeng Ding, Yi Yang, Lidong Chen, Yang Zou, Shunwen Bai, Jiahuan Zhang, Haoran Huang, Shan Huang, Yudong Gao, Mingjun Cheng - 8 upvotes - arXiv
Researchers built a method called COVER that automatically selects the minimum number of 360-degree camera views needed to see every part of a 3D indoor scene without duplicating coverage or creating depth inconsistencies. They used this to create CM-EVS, a dataset of 36,000+ panoramic images with depth maps from 1,275 indoor and outdoor scenes—using only ~25 images per room instead of hundreds, while still capturing the full geometry.
Why it matters: If you're training 3D vision models, this gives you a lean, consistent training dataset that covers full scenes without wasted redundant views, reducing storage and compute costs while improving data quality.
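Choosing a small set of views that together see everything resembles the classic set-cover problem. The sketch below uses the standard greedy heuristic as a stand-in. It is not COVER's algorithm, which this summary does not describe, and it ignores the depth-consistency criterion entirely. The view names and region ids are invented for illustration.

```python
def greedy_cover(views: dict[str, set[int]]) -> list[str]:
    """Pick views one at a time, each time taking the one that adds the most unseen regions."""
    uncovered = set().union(*views.values())
    chosen: list[str] = []
    while uncovered:
        best = max(views, key=lambda name: len(views[name] & uncovered))
        chosen.append(best)
        uncovered -= views[best]
    return chosen


# Hypothetical visibility sets: which scene regions each panorama sees.
views = {
    "hall": {1, 2, 3, 4},
    "door": {3, 4},
    "kitchen": {5, 6},
    "corner": {4, 5},
}
selected = greedy_cover(views)
```

Greedy selection is not guaranteed to find the true minimum, but it shows why redundant views such as `door` and `corner` can be dropped once their regions are already covered.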
Unlocking Dense Metric Depth Estimation in VLMs
Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke - 8 upvotes - arXiv
The researchers took a vision-language model (a system trained on both images and text, like GPT-4V) and added a lightweight module to make it predict depth maps — 3D distance information for every pixel in an image. They trained it using both vision and text data together, and it can now output both depth maps and text descriptions in a single pass, without needing to rely on external 3D models that introduce errors.
Why it matters: If you're building 3D applications (AR, robotics, autonomous systems), you can now use a single model for both language understanding and precise depth perception rather than juggling separate models.
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathor, Pratyush Patnaik, Shubhanshu Khatana - 7 upvotes - arXiv
Researchers built a system that turns smartphones into tools for collecting long videos of people doing tasks—capturing hours of footage instead of the typical few minutes. They released 200 hours of this egocentric video data, open-sourced an app anyone can use to record more, and created a pipeline to turn raw phone videos into training data for AI models that learn to do tasks by watching humans.
Why it matters: If you're building robotic systems or embodied AI, you currently can't train on enough long-duration human behavior data—this removes the expensive hardware barrier and lets you crowdsource real-world examples of how people complete multi-step tasks.
Look Before You Leap: Autonomous Exploration for LLM Agents
Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng - 6 upvotes - arXiv
The researchers found that AI agents built on language models often fail in new environments because they act too quickly based on what they already know, without first understanding what's actually possible in that environment. They created a metric to measure how well an agent explores and discovers different options, then developed a training method where agents practice exploration (learning what's available) and task-solving separately. Agents performed better when they were forced to explore before trying to solve problems.
Why it matters: If you're building AI agents that need to work in real-world or varied environments, you need to invest in making them explore and map out their options before executing tasks, otherwise they'll confidently do the wrong thing in situations they haven't seen before.
Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models
Fabian Morelli, Arnas Uselis, Ankit Sonthalia, Seong Joon Oh - 6 upvotes - arXiv
When you fine-tune CLIP (a vision model trained on images and text) on a specific task, it gets better at that task but worse at handling images it hasn't seen before. This team built a method called SAE-FT that uses a sparse autoencoder—a tool that identifies which visual features the model actually uses—to constrain which parts of the model can change during fine-tuning. By protecting the important features the model learned during pre-training, they kept performance high on new tasks while maintaining robustness to distribution shifts.
Why it matters: If you're shipping CLIP-based products, you can now fine-tune on your data without sacrificing reliability on edge cases or slightly different inputs, and you'll know exactly which visual concepts changed in the process.
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
Devin Yasith De Silva, Dhaval Patel, Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Paul J Adams, Sal Rosato, Nicolas Constantinides, Deborah L. McGuinness, Jayant Kalagnanam - 6 upvotes - arXiv
Researchers built DiagnosticIQ, a test with 6,690 multiple-choice questions based on real industrial maintenance rules, to see if large language models (LLMs) can help technicians decide what to do when sensor alarms trigger. They tested 29 different LLMs and found that while top models perform similarly to each other, they all struggle badly when rules are slightly reworded or conditions are flipped—suggesting they're memorizing patterns rather than understanding the maintenance logic.
Why it matters: If you're building AI for industrial maintenance or operational decision support, this shows frontier LLMs aren't reliable for this yet; they need better ways to verify they actually understand the maintenance rules, not just pattern-match against training data.
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach - 5 upvotes - arXiv
Researchers used teams of AI agents to automatically design new neural network architectures (the computational structures that power language models) without human guidance. The agents discovered 14 different designs that, when trained at 1 billion parameters, outperformed Llama 3.2 on both general benchmarks and specific tasks, and some versions scaled more efficiently—meaning they got better results with less computation.
Why it matters: If AI agents can reliably find better architectures than human-designed ones, you could automate the expensive experimentation phase of building foundation models instead of relying on manual architecture choices.
Efficient Image Synthesis with Sphere Latent Encoder
Tung Do, Thuan Hoang Nguyen, Hao Li - 5 upvotes - arXiv
Researchers improved a fast image generation method called Sphere Encoder by splitting it into two separate parts: a fixed image encoder (converts images to a compressed representation) and a denoising model that works entirely in that compressed space. This avoids repeatedly converting between compressed and full-resolution formats during generation, which makes generation faster and improves quality, while letting each component focus on its own job instead of competing during optimization.
Why it matters: If you're building products that need fast image generation (thumbnails, previews, real-time editing), this means you can generate decent images in fewer steps without sacrificing quality—reducing latency and compute costs.
FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction
Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian, Jian Wang - 5 upvotes - arXiv
FFAvatar takes a few casual photos of someone's face and converts them into an animated 3D avatar in seconds, without needing hours of per-person optimization. It does this by training a neural network on over 1 million videos to learn general patterns of human faces, then fine-tuning on high-quality 360-degree captures so it understands faces from any angle, and optionally personalizing to individual people.
Why it matters: If you're building avatar or video conferencing products, you can now generate realistic, animatable 3D heads from casual snapshots in real-time instead of requiring studio setups or waiting for expensive processing.
WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes
Jichen Hu, Jiawei Guo, Jiazhong Cen, Chen Yang, Sikuang Li, Wei Shen - 4 upvotes - arXiv
Researchers built WorldAct, a system that takes static 3D worlds generated by AI (like a fully rendered game level) and breaks them down into individual, moveable objects with proper physics. The system uses an AI agent to identify which objects can be interacted with, reconstructs their shapes accurately, and fills in the background—letting you actually pick up and move things around in these generated worlds while keeping everything looking coherent.
Why it matters: If you're building immersive apps or games with AI-generated 3D environments, this lets you move from static assets you can only look at to scenes where users can actually manipulate objects and interact physically.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu - 4 upvotes - arXiv
When training AI models to solve math and coding problems using reinforcement learning, the models often fail but don't learn much from those failures because the reward signal is just 'right' or 'wrong.' This paper proposes CIPO, which takes the model's own failed attempts, automatically fixes them, and trains the model to learn from those corrections—without needing human-annotated fixes. Testing across math and coding tasks, CIPO outperforms baselines and actually improves the model's reasoning ability rather than just making it pick from existing solutions.
Why it matters: If you're building an AI product that needs to solve complex problems reliably, CIPO suggests you can get better performance by teaching your model to learn from its own mistakes rather than only rewarding success—reducing your dependence on expensive human feedback.
Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu - 3 upvotes - arXiv
Researchers found that when language models are compressed using quantization (a technique that shrinks models for deployment by reducing numerical precision), previous 'unlearning' methods—which are supposed to make models forget sensitive information—stop working. The root problem: the tiny adjustments these methods make to individual parameters are too small to survive quantization's rounding process. They built MANSU, a new approach that identifies exactly which parts of the model store the information to forget, makes bigger, more targeted changes to those parts, and ensures those changes survive quantization.
Why it matters: If you're deploying a model that needs to forget data (for privacy reasons), standard unlearning breaks the moment you compress it for production—MANSU is the first method that actually holds up after quantization.
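The rounding effect described above can be seen with a toy uniform quantizer. This is a sketch under simplifying assumptions: a single weight, a fixed hypothetical step size, and plain round-to-nearest. Production quantization schemes and MANSU itself are more involved.

```python
def quantize(weight: float, scale: float) -> int:
    """Uniform round-to-nearest quantization: map a float to an integer code."""
    return round(weight / scale)


scale = 0.25          # hypothetical quantization step
w = 1.0               # original weight

before = quantize(w, scale)
small_edit = quantize(w + 0.05, scale)   # edit smaller than half a step
large_edit = quantize(w + 0.20, scale)   # edit larger than half a step
```

The small edit lands on the same integer code as the original weight, so it is erased by quantization. Only the larger edit changes the stored value.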
Learning POMDP World Models from Observations with Language-Model Priors
Valentin Six, Frederik Panse, Mathis Fajeau, Lancelot Da Costa, Mridul Sharma, Alfonso Amayuelas, Tim Z. Xiao, David Hyland, Philipp Hennig, Bernhard Schölkopf - 3 upvotes - arXiv
Researchers built Pinductor, a system that uses a large language model to guess what the hidden rules of an environment are from just a few examples of actions and observations—like figuring out how a game works by watching someone play it a few times. The LLM proposes candidate models, then the system refines them by checking how well they predict what actually happened, and it learns world models as efficiently as methods that cheated by seeing the hidden state directly.
Why it matters: If you're building an agent or robot that needs to understand its environment, you can now use an LLM to bootstrap learning from way fewer real-world interactions, cutting down expensive trial-and-error.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
Tao Zhong, Dongzhe Zheng, Christine Allen-Blanchette - 3 upvotes - arXiv
When you compress Mixture-of-Experts models (systems that route work through specialized sub-networks), existing methods only check if pairs of experts can merge together—but miss cases where three experts look fine in pairs yet break when merged as a trio. This paper treats expert compatibility as a geometric problem, finding that the problematic patterns form a specific mathematical structure (the harmonic kernel). HodgeCover uses this insight to identify which experts to safely merge, matching or beating existing compression methods while keeping the model more balanced.
Why it matters: If you're deploying sparse MoE models and need to cut inference costs without retraining, HodgeCover can compress them more aggressively than existing techniques while maintaining performance—directly reducing your inference bill.
ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing
Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc - 2 upvotes - arXiv
Researchers built ChangeFlow, a system that detects changes between satellite images by generating multiple possible change masks rather than making a single prediction. Instead of working with full-resolution images (which is slow), it works in a compressed 'latent space' using rectified flow (a technique for gradually transforming random noise into structured outputs), and gets better results than existing methods while staying fast.
Why it matters: If you're building satellite imagery analysis tools, you can now get more reliable change detection with built-in confidence scores by sampling multiple predictions—useful for flagging uncertain areas that need human review.
GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Fanxu Meng - 2 upvotes - arXiv
DeepSeek's MLA attention mechanism is highly efficient on expensive H100 GPUs but locks you into one specific hardware setup. The author built GQLA, which keeps the same trained weights but exposes two ways to decode them: one optimized for H100s and another for cheaper GPUs like the H20. A single model automatically picks the right decoding path for whatever hardware you're running on, with no retraining required.
Why it matters: If you're deploying LLMs, you can now use one trained model across different hardware without retraining, reducing both your engineering work and the cost of supporting multiple inference setups.
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
Shan Yang - 2 upvotes - arXiv
The author found that existing physics problem datasets used to evaluate vision-language models (AI systems that read images and text) have three hidden problems: some training data sneaks into test sets through paraphrasing, problems lose meaning when translated between languages, and multiple-choice questions inflate performance scores compared to harder open-ended problems. The paper introduces a cleaner 6,400-problem dataset and training recipes that improved a small model's physics reasoning by 18 percentage points, though it still lags behind larger commercial models.
Why it matters: If you're benchmarking your model's physics or reasoning abilities, your eval scores are probably inflated—you need to audit for hidden data leakage and format bias before trusting comparisons with competitors.
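A first-pass leakage audit can be sketched with word n-gram overlap. This is not the paper's audit procedure: the helper names, the trigram size, and the 0.5 threshold are assumptions, and a lexical check like this only catches near-verbatim copies. Paraphrased leakage, the kind the paper describes, would generally need a semantic similarity check on top.

```python
import re


def word_ngrams(text, n=3):
    """Lowercase the text, keep alphanumeric tokens, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_overlaps(train, test, threshold=0.5, n=3):
    """Return (test_index, train_index, score) for pairs at or above the threshold."""
    flagged = []
    for ti, t in enumerate(test):
        t_grams = word_ngrams(t, n)
        for si, s in enumerate(train):
            score = jaccard(t_grams, word_ngrams(s, n))
            if score >= threshold:
                flagged.append((ti, si, score))
    return flagged


# Made-up problems for illustration.
train = ["A block of mass 2 kg slides down a frictionless incline of 30 degrees."]
test = [
    "A block of mass 2 kg slides down a frictionless incline of 30 degrees. Find its acceleration.",
    "A photon strikes a metal surface with work function 2 eV.",
]
flagged = flag_overlaps(train, test)
```

Here the first test problem is flagged because it reuses the training problem almost word for word, and the second is not.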
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou - 2 upvotes - arXiv
Researchers built a system that watches videos of one robot or human moving and generates videos of a different robot performing the same motion—without needing paired examples of both doing the movement. The key insight: they separated 'how things move' (which transfers between robots) from 'what things look like' (which is robot-specific), using lightweight adapters to handle robot-specific appearance when only unpaired videos are available.
Why it matters: If you're training embodied AI systems (robots, humanoids), you can now generate synthetic training data for new robot designs without collecting expensive paired videos of humans and that specific robot doing the same tasks.
Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction
Hao Phung, Hadar Averbuch-Elor - 2 upvotes - arXiv
Researchers built a system that converts messy floorplan photos into clean, computer-readable vector drawings (like you'd see in CAD software). The system treats this as a sequence prediction problem—it looks at the image, then predicts one corner of a room at a time, using learned spatial 'anchors' to focus on relevant parts of the image, and can flexibly handle floorplans with many rooms and irregular shapes.
Why it matters: If you're building software that needs to digitize physical spaces (real estate platforms, AR apps, facility management tools), you can now automatically convert photos into structured data instead of manual tracing.
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro - 1 upvote - arXiv
Researchers extended SAM2 (a video segmentation model) to work with audio by creating a module called AuralFuser that combines audio and visual features to generate prompts that guide what to segment. The key problem they solved: when audio signals were added to the model through existing methods, the audio information got weaker as it moved through network layers, hurting accuracy. Their approach keeps the audio signal strong by routing it through the model's visual layer structure, achieving better segmentation accuracy without slowing down interactive use.
Why it matters: If you're building video products that need to segment or identify objects based on sound (like finding who's speaking in a video, or isolating instruments in a music video), this gives you a way to do it efficiently without needing separate specialized models.
No One Knows the State of the Art in Geospatial Foundation Models
Isaac Corley, Nils Lehmann, Caleb Robinson, Gabriel Tseng, Anthony Fuller, Hamed Alemohammad, Evan Shelhamer, Jennifer Marcus, Hannah Kerner - 1 upvote - arXiv
Researchers analyzed 152 papers on geospatial foundation models (AI trained on satellite imagery to identify things like crop health or disaster damage) and found the field has a serious measurement problem: the same model gets wildly different reported performance across papers, most papers use unique training setups making comparisons impossible, and many don't release their code. They're proposing six concrete fixes—like standardized benchmarks, releasing model weights, and reporting uncertainty—so companies and researchers can actually know which models work best for their specific problem.
Why it matters: If you're building disaster response or agriculture tools using satellite imagery, you can't trust published comparisons between geospatial models right now, so you may be picking the wrong foundation model; these standards would let you actually benchmark fairly before shipping.
Follow the Mean: Reference-Guided Flow Matching
Pedro M. P. Curvo, Maksim Zhdanov, Floor Eijkelboom, Jan-Willem van de Meent - 1 upvote - arXiv
The researchers discovered that flow matching models (a type of generative AI that gradually transforms noise into images) can be controlled by showing examples instead of retraining or adding extra networks. They built two systems: one that instantly steers a frozen pretrained model by analyzing reference images you provide, and another that learns to adapt to new reference examples at runtime without changing the model weights.
Why it matters: You can now control image generation by simply swapping in different reference examples at inference time—no retraining, no fine-tuning, no architectural tricks—making it easier to build adaptable AI products where users provide their own style guides or constraints.
Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
Konstantine Arkoudas, Serafim Batzoglou - 1 upvote - arXiv
Researchers built ProofGrid, a test suite that checks whether large language models can reason correctly by having them write and verify formal proofs in a simple notation called NDL, rather than only checking whether they reach the right final answer. Leading models handled basic reasoning tasks but failed badly on complex ones requiring multi-step logical planning. The authors also found an inconsistency: models generate broken proofs, yet correctly identify those same mistakes when shown them separately.
Why it matters: If you're shipping an AI product that needs to make reliable logical decisions (legal analysis, code verification, reasoning chains), this work shows current models will confidently produce internally-inconsistent reasoning—you need verification layers that check intermediate steps, not just final outputs.
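What "check intermediate steps, not just final outputs" means can be sketched with a toy step-level checker. This is not NDL and not ProofGrid's verifier: the formula encoding, the two rule names, and `check_proof` itself are hypothetical, and the only inference rule supported is modus ponens.

```python
def check_proof(premises, steps):
    """Validate each step of a toy proof; return the index of the first bad step, or None.

    Formulas are strings (atoms) or tuples ("->", antecedent, consequent).
    Each step is (formula, rule, refs), where rule is "premise" or "mp"
    (modus ponens) and refs are indices of earlier steps:
    refs[0] is the implication, refs[1] is its antecedent.
    """
    derived = []
    for i, (formula, rule, refs) in enumerate(steps):
        if rule == "premise":
            ok = formula in premises
        elif rule == "mp":
            ok = False
            # Both references must point at strictly earlier steps.
            if len(refs) == 2 and all(0 <= r < i for r in refs):
                implication, antecedent = derived[refs[0]], derived[refs[1]]
                ok = implication == ("->", antecedent, formula)
        else:
            ok = False  # Unknown rule names are rejected.
        if not ok:
            return i
        derived.append(formula)
    return None


premises = {"p", ("->", "p", "q"), ("->", "q", "r")}

valid = [
    ("p", "premise", ()),
    (("->", "p", "q"), "premise", ()),
    ("q", "mp", (1, 0)),
    (("->", "q", "r"), "premise", ()),
    ("r", "mp", (3, 2)),
]

# Reaches the same true conclusion "r", but skips a step: "r" does not follow
# from "p -> q" and "p" by modus ponens.
shortcut = [
    ("p", "premise", ()),
    (("->", "p", "q"), "premise", ()),
    ("r", "mp", (1, 0)),
]

first_bad_valid = check_proof(premises, valid)
first_bad_shortcut = check_proof(premises, shortcut)
```

The `shortcut` proof ends on a conclusion that really does follow from the premises, so an answer-only check would pass it. The step-level check rejects it at the step where the justification breaks.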
Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
William Lugoloobi, Samuelle Marro, Jabez Magomere, Joss Wright, Chris Russell - 0 upvotes - arXiv
Researchers showed that websites can identify which AI model is controlling a browser agent just by watching how it clicks, scrolls, and types—achieving 96% accuracy across 14 different models. They found this fingerprinting works even with timing noise added for obfuscation, and the attack generalizes across different versions and families of models, meaning a classifier trained on one model often works on others it hasn't seen.
Why it matters: If you're building AI agents that interact with websites on behalf of users, websites can now passively detect which model you run and tailor attacks to it. You'll need behavioral randomization or other defenses beyond timing delays to protect users.
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol
Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim - 0 upvotes - arXiv
When search systems need to find relevant information across documents in multiple languages, existing benchmarks only measure whether the content is relevant—ignoring whether users can actually read the results. This team built MLAIRE, an evaluation framework that measures two separate things: (1) whether a retriever finds semantically correct information regardless of language, and (2) whether it prefers returning results in the language the user queried in. Testing 31 different retrieval systems, they found that some rank high on semantic accuracy but return answers in the wrong language, while others prioritize matching the query language even if the content is less relevant.
Why it matters: If you're building RAG (Retrieval-Augmented Generation) systems or multilingual search, your retriever's language behavior directly affects whether downstream LLMs can verify and ground answers—standard metrics alone won't catch when you're returning semantically perfect but linguistically unusable results.
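The idea of scoring relevance and language behavior separately can be sketched as below. These are not MLAIRE's actual metrics: the function `score_ranking`, the result fields `relevant` and `lang`, and the two toy retrievers are assumptions chosen to show why a single relevance number hides the problem.

```python
def score_ranking(results, query_lang, k=3):
    """Score the top-k results on two separate axes, plus their overlap.

    results: ranked list of dicts with hypothetical keys
        "relevant" (bool) and "lang" (language code string).
    """
    top = results[:k]
    n = len(top)
    if n == 0:
        return {"relevance": 0.0, "lang_match": 0.0, "usable": 0.0}
    relevance = sum(r["relevant"] for r in top) / n
    lang_match = sum(r["lang"] == query_lang for r in top) / n
    # "Usable" results are both relevant and in the language the user queried in.
    usable = sum(r["relevant"] and r["lang"] == query_lang for r in top) / n
    return {"relevance": relevance, "lang_match": lang_match, "usable": usable}


# Retriever A: every hit is relevant, but most are in the wrong language.
retriever_a = [
    {"relevant": True, "lang": "de"},
    {"relevant": True, "lang": "de"},
    {"relevant": True, "lang": "en"},
]
# Retriever B: always answers in the query language, with one irrelevant hit.
retriever_b = [
    {"relevant": True, "lang": "en"},
    {"relevant": False, "lang": "en"},
    {"relevant": True, "lang": "en"},
]
scores_a = score_ranking(retriever_a, query_lang="en")
scores_b = score_ranking(retriever_b, query_lang="en")
```

On relevance alone, retriever A looks better. Once language is counted, retriever B returns more results an English-speaking user can actually read.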