AFUN: Towards an Affordance Foundation Model for Functionality Understanding

¹University of Michigan, ²University of California, San Diego, ³NVIDIA
*Equal contribution
AFUN method overview
Where + How, Jointly
A single forward pass predicts both a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact).
SOTA Affordance Segmentation
+23.9 mean gIoU and +26.3 mean cIoU over the best baseline, across 8 test sets from 4 affordance benchmarks.
Largest Public Affordance Data
One of the largest public affordance datasets to date, spanning robot, human egocentric, simulation, and real-world scan data.
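
For readers unfamiliar with the two segmentation metrics quoted above, the following is a minimal sketch of how gIoU and cIoU are usually computed. It assumes the definitions common in referring-segmentation benchmarks (gIoU as the mean of per-sample IoUs, cIoU as cumulative intersection over cumulative union); this page does not define them, and the function names and the empty-mask convention are illustrative choices, not part of AFUN.

```python
from typing import List, Sequence, Tuple

# A binary mask as rows of 0/1 values (illustrative representation).
Mask = Sequence[Sequence[int]]


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united pixels between two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def giou_ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over a set of predicted and ground-truth masks."""
    per_sample: List[float] = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        # Assumed convention: empty prediction vs. empty target scores 1.0.
        per_sample.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_sample) / len(per_sample)  # mean of per-sample IoUs
    ciou = total_inter / total_union if total_union else 1.0  # pooled IoU
    return giou, ciou
```

Because cIoU pools pixels across samples, it weights large masks more heavily, while gIoU treats every sample equally; reporting both gives a fuller picture.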

We present AFUN, a step toward an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels.

Prediction Results

AFUN predictions across diverse scenes, each paired with a language query. Points inside the predicted affordance mask are highlighted in red, and the trajectory is colored from yellow (contact) to blue (end).

Real-Robot Deployment

Without any robot-specific finetuning, AFUN predicts a precise functional mask and 3D motion that the robot uses to plan and execute manipulation in the real world. The same model generalizes across object categories, language instructions, and embodiments, suggesting a practical path toward open-world affordance models that unify functionality perception with executable action.

Method Overview

AFUN model pipeline

Given an RGB-D observation and a language task description, AFUN jointly predicts where to interact (a task-conditional functional segmentation mask) and how to interact (a 3D post-contact motion represented as a Bézier spline curve). The model routes pretrained vision–language features through lightweight metaqueries into a segmentation decoder for the mask and a curve head for the 3D motion. It draws on strong vision–language, segmentation, and 3D geometric priors while training only lightweight modules, which enables joint mask and motion prediction without finetuning the large backbones.
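
To make the curve representation concrete, here is a minimal sketch of how a Bézier curve defined by 3D control points can be sampled into waypoints. It assumes a single Bézier segment evaluated with de Casteljau's algorithm; the number of control points, the waypoint count, and the function names are hypothetical and are not taken from AFUN's curve head.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, ...]  # (x, y, z) coordinates


def de_casteljau(control_points: Sequence[Point3], t: float) -> Point3:
    """Evaluate a Bezier curve at parameter t in [0, 1]."""
    pts = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbours until one point remains.
    while len(pts) > 1:
        pts = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_curve(control_points: Sequence[Point3], num_waypoints: int) -> List[Point3]:
    """Sample evenly spaced parameter values along the curve."""
    if num_waypoints < 2:
        raise ValueError("num_waypoints must be at least 2")
    return [
        de_casteljau(control_points, i / (num_waypoints - 1))
        for i in range(num_waypoints)
    ]


# Hypothetical control points (metres): the first is the contact point.
example_controls = [
    (0.0, 0.0, 0.0),
    (0.0, 0.1, 0.0),
    (0.1, 0.2, 0.0),
    (0.2, 0.2, 0.0),
]
example_waypoints = sample_curve(example_controls, 3)
```

A compact parametric form like this lets a model regress a handful of control points instead of a dense trajectory, and the curve always starts at the first control point and ends at the last.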

Data Pipeline

AFUN data pipeline

We build a unified data pipeline that converts heterogeneous robot, human egocentric, simulation, and real-world scan data into a shared affordance schema with language task phrases, functional masks, and object-centric 3D motion labels. Rather than approximating object motion via hand or gripper proxies, we track the object itself through depth-fused mask propagation, yielding on-object 3D trajectories at scale and producing one of the largest public affordance datasets to date.
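
As an illustration of how per-frame object masks and depth can yield an on-object 3D trajectory, the sketch below back-projects masked depth pixels through a pinhole camera model and takes their centroid in each frame. This is a simplified stand-in, not the pipeline's actual method: it assumes the masks are already propagated, the intrinsics (`fx`, `fy`, `cx`, `cy`) are known, and the centroid is an adequate tracked point. All names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Grid = Sequence[Sequence[float]]  # row-major image: depth in metres or 0/1 mask


def masked_centroid_3d(
    depth: Grid, mask: Grid, fx: float, fy: float, cx: float, cy: float
) -> Optional[Point3]:
    """Back-project masked pixels with valid depth and return their 3D centroid."""
    sum_x = sum_y = sum_z = 0.0
    count = 0
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, m) in enumerate(zip(depth_row, mask_row)):
            if not m or z <= 0.0:
                continue  # skip background and missing depth
            # Pinhole back-projection into the camera frame.
            sum_x += (u - cx) * z / fx
            sum_y += (v - cy) * z / fy
            sum_z += z
            count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count, sum_z / count)


def object_trajectory(
    frames: Sequence[Tuple[Grid, Grid]], fx: float, fy: float, cx: float, cy: float
) -> List[Point3]:
    """Build a 3D trajectory from (depth, mask) pairs, skipping empty frames."""
    trajectory: List[Point3] = []
    for depth, mask in frames:
        centroid = masked_centroid_3d(depth, mask, fx, fy, cx, cy)
        if centroid is not None:
            trajectory.append(centroid)
    return trajectory
```

Tracking points on the object itself, rather than a hand or gripper, keeps the motion label tied to the object's geometry regardless of which embodiment produced the recording.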

Qualitative Results

Affordance Segmentation

AFUN qualitative segmentation comparison

3D Motion Prediction

AFUN qualitative 3D motion comparison

BibTeX

@article{wang_afun,
  title   = {{AFUN}: Towards an Affordance Foundation Model for
             Functionality Understanding},
  author  = {Wang, Zhaoning and Zhong, Yi and Fu, Jiawei and
             Christensen, Henrik I. and Gao, Jun},
  note    = {$^*$Equal contribution: Z.~Wang and Y.~Zhong.}
}