ACCEPTED TO IEEE FG 2026

Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

Naga VS Raviteja Chappa1, Evangelos Sariyanidi1, Lisa Yankowitz1, Gokul M. Nair1, Casey J. Zampella1,2, Robert T. Schultz1,2, Birkan Tunç1,2
1The Children's Hospital of Philadelphia  ·  2Perelman School of Medicine, University of Pennsylvania

TL;DR

Micro-actions like head scratches and finger taps need different processing depending on whether they're defined by spatial pose or temporal rhythm. Micro-DualNet processes body-part features through parallel Spatial→Temporal and Temporal→Spatial paths, with learned per-entity routing. State-of-the-art on iMiGUE (76.88%) and competitive on MA-52 (65.10%), with clinical validation on 290 individuals.

Key Idea

Why Two Processing Orders?

"Covering face" is defined by its final spatial configuration. "Leg shaking" is characterized by repetitive temporal rhythm. No single decomposition handles both.

🔷

Spatial → Temporal

Spatial arrangements first, then temporal evolution. Best for position-defined actions.

🔶

Temporal → Spatial

Motion patterns first, then spatial relationships. Best for rhythm-defined actions.

🔀

Entity-Level Routing

Each body part learns its own optimal ST/TS blend via lightweight gating (sketched in code below).

🔗

MAC Loss

Cross-path coherence via entity-aware contrastive learning.
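For concreteness, here is a minimal PyTorch sketch of the last two ideas: a per-entity routing gate and an entity-aware contrastive coherence loss. The names (`EntityRouter`, `mac_style_loss`), the sigmoid gate per entity, and the InfoNCE-style formulation with temperature `tau` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EntityRouter(nn.Module):
    """Per-entity gating: each body part learns its own ST/TS blend weight."""

    def __init__(self, dim: int, num_entities: int):
        super().__init__()
        # One lightweight gate per body-part entity (head, face, hands, ...).
        self.gates = nn.ModuleList([nn.Linear(2 * dim, 1) for _ in range(num_entities)])

    def forward(self, st_feats: torch.Tensor, ts_feats: torch.Tensor) -> torch.Tensor:
        # st_feats, ts_feats: (batch, num_entities, dim)
        fused = []
        for e, gate in enumerate(self.gates):
            pair = torch.cat([st_feats[:, e], ts_feats[:, e]], dim=-1)
            alpha = torch.sigmoid(gate(pair))                  # (batch, 1) blend weight
            fused.append(alpha * st_feats[:, e] + (1 - alpha) * ts_feats[:, e])
        return torch.stack(fused, dim=1)                       # (batch, num_entities, dim)


def mac_style_loss(st_feats: torch.Tensor, ts_feats: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Cross-path coherence: pull together the ST and TS features of the same
    entity in the same clip; treat all other pairs as negatives."""
    b, e, d = st_feats.shape
    z_st = F.normalize(st_feats.reshape(b * e, d), dim=-1)
    z_ts = F.normalize(ts_feats.reshape(b * e, d), dim=-1)
    logits = z_st @ z_ts.t() / tau                             # (b*e, b*e) similarity matrix
    targets = torch.arange(b * e, device=st_feats.device)      # positives on the diagonal
    return F.cross_entropy(logits, targets)


st, ts = torch.randn(4, 6, 256), torch.randn(4, 6, 256)
fused = EntityRouter(256, 6)(st, ts)    # (4, 6, 256), routed per entity
loss = mac_style_loss(st, ts)           # scalar contrastive coherence term
```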

Method

How Micro-DualNet Works

The pipeline, stage by stage (a code sketch follows below):

🎞 Video frames → ResNet-101 + TSM backbone
📌 Spatial Entity Module (SEM): keypoint-guided cropping into six body-part entities (👤 Head, 😶 Face, 🤚 Left Hand, Right Hand, 🎽 Torso, 🦵 Lower Body)
ST Path: 📐 Spatial-T → Temporal-T
TS Path: Temporal-T → 📐 Spatial-T
🔀 Entity-Level Adaptive Routing: learned per-entity blending of the ST and TS path features
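A minimal sketch of how these stages could be wired together, assuming per-entity token features have already been produced by the backbone and SEM. `SpatialT`, `TemporalT`, `MicroDualNetSketch`, and all sizes are placeholders for illustration, not the released implementation; the two paths simply compose the same stand-in blocks in opposite orders.

```python
import torch
import torch.nn as nn


class SpatialT(nn.Module):
    """Stand-in spatial block: mixes information across entities within each frame."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # (B, E, T, D)
        b, e, t, d = x.shape
        y = x.permute(0, 2, 1, 3).reshape(b * t, e, d)          # attend over entities
        y, _ = self.attn(y, y, y)
        return x + y.reshape(b, t, e, d).permute(0, 2, 1, 3)


class TemporalT(nn.Module):
    """Stand-in temporal block: mixes information across time within each entity."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # (B, E, T, D)
        b, e, t, d = x.shape
        y = x.reshape(b * e, t, d)                               # attend over time
        y, _ = self.attn(y, y, y)
        return x + y.reshape(b, e, t, d)


class MicroDualNetSketch(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 52):
        super().__init__()
        self.st_path = nn.Sequential(SpatialT(dim), TemporalT(dim))   # Spatial-T -> Temporal-T
        self.ts_path = nn.Sequential(TemporalT(dim), SpatialT(dim))   # Temporal-T -> Spatial-T
        self.gate = nn.Linear(2 * dim, 1)                             # entity-level routing
        self.head = nn.Linear(dim, num_classes)

    def forward(self, entity_tokens: torch.Tensor) -> torch.Tensor:
        # entity_tokens: (batch, entities, time, dim), i.e. per-entity features
        # from the ResNet-101 + TSM backbone after keypoint-guided SEM cropping.
        st = self.st_path(entity_tokens).mean(dim=2)                  # (B, E, D)
        ts = self.ts_path(entity_tokens).mean(dim=2)
        alpha = torch.sigmoid(self.gate(torch.cat([st, ts], dim=-1))) # per-entity blend
        fused = alpha * st + (1 - alpha) * ts
        return self.head(fused.mean(dim=1))                           # pool entities, classify


logits = MicroDualNetSketch()(torch.randn(2, 6, 16, 256))
print(logits.shape)  # torch.Size([2, 52])
```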
Architecture

Full Pipeline

Figure 2. Full architectural overview of Micro-DualNet.
Motivation

Motion-Based vs. Position-Based Micro-Actions

Figure 1. Motion-based actions benefit from the TS ordering; position-based actions favor ST. The dual-path design outperforms either ordering alone.
Results

Comparison with State-of-the-Art

| Method | Modality | MA-52 Top-1 | MA-52 F1_mean | iMiGUE Top-1 | iMiGUE Top-5 |
|---|---|---|---|---|---|
| TSM (CNN) | RGB | 56.75% | 61.39 | 61.10% | 91.24% |
| MANet (CNN) | RGB+Pose | 61.33% | 65.59 | 62.54% | 92.18% |
| SlowFast (3D) | RGB | 59.60% | 63.09 | 58.73% | 89.41% |
| UniFormer (Trans) | RGB | 58.89% | 64.43 | 57.29% | 89.95% |
| CTR-GCN (GCN) | Skeleton | 52.61% | 56.48 | 52.94% | 89.76% |
| PoseConv3D (GCN) | Skeleton | 63.52% | 66.66 | 64.38% | 93.52% |
| PCAN (CNN) | RGB+Pose | 66.74% | 69.97 | – | – |
| Micro-DualNet (Ours) | RGB+Pose | 65.10% | 68.72 | 76.88% | 96.72% |

Component Ablation: Top-1 Accuracy (%)

| Configuration | MA-52 | iMiGUE |
|---|---|---|
| TSM baseline | 52.15 | 58.73 |
| + TS Path | 55.96 | 63.48 |
| + SEM | 59.21 | 68.87 |
| + Dual Path | 62.14 | 72.65 |
| + MAC Loss | 64.40 | 75.52 |
| Full Model | 65.10 | 76.88 |
Qualitative Results

Visualization & Analysis

Figure 4. Progressive improvement in class clustering from baseline to Micro-DualNet.
Figure 5. Dual-path achieves +31% on hard categories.
Clinical Validation

Behavioral Differences Across Diagnostic Groups

Bridging Benchmarks and Clinical Utility

Applied to 290 individuals (ages 5–52) across ASD, PSY, and TDC groups, Micro-DualNet detects micro-actions that reveal statistically significant group differences: elevated "retracting feet" in the PSY group (p < 0.001 vs. ASD) and differential "leg shaking" in the ASD group (p = 0.002 vs. PSY).

Figure 3. Violin plots of micro-action engagement by diagnostic group.
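For context, group differences like those above come from comparing per-participant micro-action statistics across diagnostic groups. Below is a hedged sketch of one way such a comparison can be run, assuming per-participant rates and a nonparametric Mann-Whitney U test; the page does not specify the exact statistical procedure, and the numbers here are synthetic placeholders, not study data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical per-participant "leg shaking" rates (events per minute).
asd_rates = rng.gamma(shape=2.0, scale=1.5, size=120)   # ASD group (placeholder)
psy_rates = rng.gamma(shape=2.0, scale=1.0, size=90)    # PSY group (placeholder)

# Two-sided nonparametric comparison of the two groups' rate distributions.
stat, p_value = mannwhitneyu(asd_rates, psy_rates, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```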
Citation

BibTeX

@inproceedings{chappa2026microdualnet,
  title     = {Micro-DualNet: Dual-Path Spatio-Temporal Network 
               for Micro-Action Recognition},
  author    = {Chappa, Naga VS Raviteja and Sariyanidi, Evangelos 
               and Yankowitz, Lisa and Nair, Gokul M.
               and Zampella, Casey J. and Schultz, Robert T. 
               and Tun\c{c}, Birkan},
  booktitle = {International Conference on Automatic Face and 
               Gesture Recognition (FG)},
  year      = {2026}
}
Acknowledgements

Supported by OD, NICHD, and NIMH grants R01MH122599, R01MH118327, P50HD105354, R21HD102078; and the IDDRC at CHOP/Penn.