AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Zhaoning Wang¹,*, Yi Zhong¹,*, Jiawei Fu², Henrik I. Christensen², Jun Gao¹,³

¹University of Michigan ²University of California, San Diego ³NVIDIA

*Equal contribution

Where + How for Functionality Understanding

A single forward pass predicts both a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact).

SOTA Affordance Segmentation

+23.9 / +26.3 mean gIoU/cIoU over the best baseline, across 8 test sets from 4 affordance benchmarks.

Largest Public Affordance Data

One of the largest public affordance datasets to date: robot, human egocentric, simulation, and real-world scan data.
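The gIoU and cIoU figures quoted above are not defined on this page. The sketch below assumes the convention common in language-conditioned segmentation work: gIoU is the mean of per-sample IoUs, and cIoU is the cumulative intersection divided by the cumulative union over the whole test set. The function name `giou_ciou` and the handling of empty masks are illustrative assumptions, not taken from the AFUN evaluation code.

```python
import numpy as np

def giou_ciou(preds, gts):
    """Compute (gIoU, cIoU) for paired lists of binary masks.

    Assumed convention (not confirmed by the source):
      gIoU = mean over samples of per-sample IoU
      cIoU = sum of intersections / sum of unions
    """
    inter = np.array(
        [np.logical_and(p, g).sum() for p, g in zip(preds, gts)], dtype=float
    )
    union = np.array(
        [np.logical_or(p, g).sum() for p, g in zip(preds, gts)], dtype=float
    )
    # Assumption: a sample where both masks are empty counts as IoU = 1.
    per_sample = np.where(union > 0, inter / np.maximum(union, 1.0), 1.0)
    giou = float(per_sample.mean())
    ciou = float(inter.sum() / max(union.sum(), 1.0))
    return giou, ciou
```

Because cIoU pools pixels across samples, it weights large masks more heavily than gIoU does, which is one reason the two numbers are usually reported together.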

We present AFUN, a step toward an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels.

Prediction Results

AFUN predictions across diverse scenes, shown side by side for every language query in each scene. Points inside the predicted affordance mask are highlighted in red, and the trajectory is colored from yellow (contact) to blue (end).


Real-Robot Deployment

Without any robot-specific finetuning, AFUN predicts a precise functional mask and 3D motion that the robot uses to plan and execute manipulation in the real world. The same model generalizes across object categories, language instructions, and embodiments, suggesting a practical path toward open-world affordance models that unify functionality perception with executable action.

Method Overview

Given an RGB-D observation and a language task description, AFUN jointly predicts where to interact (a task-conditional functional segmentation mask) and how to interact (a 3D post-contact motion represented as a Bézier spline curve). The model routes pretrained vision–language features through lightweight metaqueries into a segmentation decoder for the mask and a curve head for the 3D motion. It draws on strong vision–language, segmentation, and 3D geometric priors while training only lightweight modules, so mask and motion are predicted jointly without finetuning the large backbones.
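To make the curve representation concrete, here is a minimal sketch of how a set of 3D control points can be turned into a sampled motion curve with the Bernstein basis. It assumes a single Bézier segment; the page does not state the curve degree, the number of control points, or whether several segments are chained, and the function name `bezier_curve` and the example control points are purely illustrative.

```python
import numpy as np
from math import comb

def bezier_curve(control_points, num_samples=50):
    """Sample a single Bezier segment defined by (n + 1) control points.

    control_points: array-like of shape (n + 1, 3), e.g. the output of a
                    curve head (hypothetical; the real output format is
                    not described on the page).
    Returns an array of shape (num_samples, 3).
    """
    ctrl = np.asarray(control_points, dtype=float)
    n = ctrl.shape[0] - 1
    t = np.linspace(0.0, 1.0, num_samples)
    # Bernstein basis: B_{i,n}(t) = C(n, i) * t^i * (1 - t)^(n - i)
    basis = np.stack(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)],
        axis=1,
    )  # shape (num_samples, n + 1)
    return basis @ ctrl

# Hypothetical control points (meters) for a pull-and-lift motion.
example_ctrl = [
    [0.00, 0.0, 0.00],  # contact point
    [0.10, 0.0, 0.05],
    [0.20, 0.0, 0.15],
    [0.25, 0.0, 0.30],  # end of motion
]
example_curve = bezier_curve(example_ctrl, num_samples=20)
```

A Bézier curve always starts at its first control point and ends at its last, so under this parameterization the first control point would play the role of the contact location and the last the end of the post-contact motion.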

Data Pipeline

We build a unified data pipeline that converts heterogeneous robot, human egocentric, simulation, and real-world scan data into a shared affordance schema with language task phrases, functional masks, and object-centric 3D motion labels. Rather than approximating object motion via hand or gripper proxies, we track the object itself through depth-fused mask propagation, yielding on-object 3D trajectories at scale and producing one of the largest public affordance datasets to date.
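As a rough illustration of tracking the object itself rather than a hand or gripper proxy, the sketch below lifts per-frame object masks into 3D with the depth map and a pinhole camera model, then records one representative point per frame. This is a simplified stand-in, not the pipeline's actual depth-fused mask propagation: the per-frame masks are assumed to be given, the median is an arbitrary choice of representative point, and `backproject_mask`, `object_trajectory`, and the intrinsics tuple are hypothetical names.

```python
import numpy as np

def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels of a depth map (meters) to camera-frame 3D points."""
    valid = np.asarray(mask, dtype=bool) & (depth > 0)  # skip missing depth
    v, u = np.nonzero(valid)                            # rows, columns
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)                  # shape (N, 3)

def object_trajectory(depths, masks, intrinsics):
    """Return one 3D point per frame from per-frame object masks.

    Assumption: the median of the back-projected mask points is used as
    the on-object track point; frames with no valid pixels are skipped.
    """
    fx, fy, cx, cy = intrinsics
    trajectory = []
    for depth, mask in zip(depths, masks):
        points = backproject_mask(depth, mask, fx, fy, cx, cy)
        if len(points) > 0:
            trajectory.append(np.median(points, axis=0))
    return np.array(trajectory)
```

The resulting points are in the camera frame; expressing them in an object-centric frame, as the shared schema requires, would need an additional transform that is not sketched here.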

Qualitative Results

Affordance Segmentation

AFUN qualitative segmentation comparison

3D Motion Prediction

BibTeX

@misc{wang2026afun,
  title         = {{AFUN}: Towards an Affordance Foundation Model for Functionality Understanding},
  author        = {Wang, Zhaoning and Zhong, Yi and Fu, Jiawei and Christensen, Henrik I. and Gao, Jun},
  year          = {2026},
  eprint        = {2606.02551},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.02551}
}
