ACCEPTED TO IEEE FG 2026

Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

Naga VS Raviteja Chappa1, Evangelos Sariyanidi1, Lisa Yankowitz1, Gokul M. Nair1, Casey J. Zampella1,2, Robert T. Schultz1,2, Birkan Tunç1,2
1The Children's Hospital of Philadelphia  ·  2Perelman School of Medicine, University of Pennsylvania

TL;DR

Micro-actions like head scratches and finger taps need different processing depending on whether they're defined by spatial pose or temporal rhythm. Micro-DualNet processes body-part features through parallel Spatial→Temporal and Temporal→Spatial paths, with learned per-entity routing. State-of-the-art on iMiGUE (76.88%) and competitive on MA-52 (65.10%), with clinical validation on 290 individuals.

Key Idea

Why Two Processing Orders?

"Covering face" is defined by its final spatial configuration. "Leg shaking" is characterized by repetitive temporal rhythm. No single decomposition handles both.

🔷

Spatial → Temporal

Spatial arrangements first, then temporal evolution. Best for position-defined actions.

🔶

Temporal → Spatial

Motion patterns first, then spatial relationships. Best for rhythm-defined actions.

🔀

Entity-Level Routing

Each body part learns its own optimal ST/TS blend via lightweight gating (sketched in code below).

🔗

MAC Loss

Cross-path coherence via entity-aware contrastive learning.
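For concreteness, here is a minimal PyTorch sketch of the last two ideas: a per-entity routing gate and an entity-aware contrastive coherence loss. The names (`EntityRouter`, `mac_style_loss`), the sigmoid gate per entity, and the InfoNCE-style formulation with temperature `tau` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EntityRouter(nn.Module):
    """Per-entity gating: each body part learns its own ST/TS blend weight."""

    def __init__(self, dim: int, num_entities: int):
        super().__init__()
        # One lightweight gate per body-part entity (head, face, hands, ...).
        self.gates = nn.ModuleList([nn.Linear(2 * dim, 1) for _ in range(num_entities)])

    def forward(self, st_feats: torch.Tensor, ts_feats: torch.Tensor) -> torch.Tensor:
        # st_feats, ts_feats: (batch, num_entities, dim)
        fused = []
        for e, gate in enumerate(self.gates):
            pair = torch.cat([st_feats[:, e], ts_feats[:, e]], dim=-1)
            alpha = torch.sigmoid(gate(pair))                  # (batch, 1) blend weight
            fused.append(alpha * st_feats[:, e] + (1 - alpha) * ts_feats[:, e])
        return torch.stack(fused, dim=1)                       # (batch, num_entities, dim)


def mac_style_loss(st_feats: torch.Tensor, ts_feats: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Cross-path coherence: pull together the ST and TS features of the same
    entity in the same clip; treat all other pairs as negatives."""
    b, e, d = st_feats.shape
    z_st = F.normalize(st_feats.reshape(b * e, d), dim=-1)
    z_ts = F.normalize(ts_feats.reshape(b * e, d), dim=-1)
    logits = z_st @ z_ts.t() / tau                             # (b*e, b*e) similarity matrix
    targets = torch.arange(b * e, device=st_feats.device)      # positives on the diagonal
    return F.cross_entropy(logits, targets)


st, ts = torch.randn(4, 6, 256), torch.randn(4, 6, 256)
fused = EntityRouter(256, 6)(st, ts)    # (4, 6, 256), routed per entity
loss = mac_style_loss(st, ts)           # scalar contrastive coherence term
```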

Method

How Micro-DualNet Works

The pipeline, stage by stage (a code sketch follows below):

🎞 Video frames → ResNet-101 + TSM backbone
📌 Spatial Entity Module (SEM): keypoint-guided cropping into six body-part entities (👤 Head, 😶 Face, 🤚 Left Hand, Right Hand, 🎽 Torso, 🦵 Lower Body)
ST Path: 📐 Spatial-T → Temporal-T
TS Path: Temporal-T → 📐 Spatial-T
🔀 Entity-Level Adaptive Routing: learned per-entity blending of the ST and TS path features
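A minimal sketch of how these stages could be wired together, assuming per-entity token features have already been produced by the backbone and SEM. `SpatialT`, `TemporalT`, `MicroDualNetSketch`, and all sizes are placeholders for illustration, not the released implementation; the two paths simply compose the same stand-in blocks in opposite orders.

```python
import torch
import torch.nn as nn


class SpatialT(nn.Module):
    """Stand-in spatial block: mixes information across entities within each frame."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # (B, E, T, D)
        b, e, t, d = x.shape
        y = x.permute(0, 2, 1, 3).reshape(b * t, e, d)          # attend over entities
        y, _ = self.attn(y, y, y)
        return x + y.reshape(b, t, e, d).permute(0, 2, 1, 3)


class TemporalT(nn.Module):
    """Stand-in temporal block: mixes information across time within each entity."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # (B, E, T, D)
        b, e, t, d = x.shape
        y = x.reshape(b * e, t, d)                               # attend over time
        y, _ = self.attn(y, y, y)
        return x + y.reshape(b, e, t, d)


class MicroDualNetSketch(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 52):
        super().__init__()
        self.st_path = nn.Sequential(SpatialT(dim), TemporalT(dim))   # Spatial-T -> Temporal-T
        self.ts_path = nn.Sequential(TemporalT(dim), SpatialT(dim))   # Temporal-T -> Spatial-T
        self.gate = nn.Linear(2 * dim, 1)                             # entity-level routing
        self.head = nn.Linear(dim, num_classes)

    def forward(self, entity_tokens: torch.Tensor) -> torch.Tensor:
        # entity_tokens: (batch, entities, time, dim), i.e. per-entity features
        # from the ResNet-101 + TSM backbone after keypoint-guided SEM cropping.
        st = self.st_path(entity_tokens).mean(dim=2)                  # (B, E, D)
        ts = self.ts_path(entity_tokens).mean(dim=2)
        alpha = torch.sigmoid(self.gate(torch.cat([st, ts], dim=-1))) # per-entity blend
        fused = alpha * st + (1 - alpha) * ts
        return self.head(fused.mean(dim=1))                           # pool entities, classify


logits = MicroDualNetSketch()(torch.randn(2, 6, 16, 256))
print(logits.shape)  # torch.Size([2, 52])
```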
Architecture

Full Pipeline

Figure 2. Full architectural overview of Micro-DualNet.
Motivation

Motion-Based vs. Position-Based Micro-Actions

Figure 1. Motion-based actions benefit from the TS ordering; position-based actions favor ST. The dual-path design outperforms either ordering alone.
Results

Comparison with State-of-the-Art

| Method | Modality | MA-52 Top-1 | MA-52 F1_mean | iMiGUE Top-1 | iMiGUE Top-5 |
|---|---|---|---|---|---|
| TSM (CNN) | RGB | 56.75% | 61.39 | 61.10% | 91.24% |
| MANet (CNN) | RGB+Pose | 61.33% | 65.59 | 62.54% | 92.18% |
| SlowFast (3D) | RGB | 59.60% | 63.09 | 58.73% | 89.41% |
| UniFormer (Trans) | RGB | 58.89% | 64.43 | 57.29% | 89.95% |
| CTR-GCN (GCN) | Skeleton | 52.61% | 56.48 | 52.94% | 89.76% |
| PoseConv3D (GCN) | Skeleton | 63.52% | 66.66 | 64.38% | 93.52% |
| PCAN (CNN) | RGB+Pose | 66.74% | 69.97 | – | – |
| Micro-DualNet (Ours) | RGB+Pose | 65.10% | 68.72 | 76.88% | 96.72% |

Component Ablation: Top-1 Accuracy (%)

| Configuration | MA-52 | iMiGUE |
|---|---|---|
| TSM baseline | 52.15 | 58.73 |
| + TS Path | 55.96 | 63.48 |
| + SEM | 59.21 | 68.87 |
| + Dual Path | 62.14 | 72.65 |
| + MAC Loss | 64.40 | 75.52 |
| Full Model | 65.10 | 76.88 |
Qualitative Results

Visualization & Analysis

Figure 4. Progressive improvement in class clustering from baseline to Micro-DualNet.
Figure 5. Dual-path achieves +31% on hard categories.
Clinical Validation

Behavioral Differences Across Diagnostic Groups

Bridging Benchmarks and Clinical Utility

Applied to 290 individuals (ages 5–52) across ASD, PSY, and TDC groups, Micro-DualNet detects micro-actions that reveal statistically significant group differences: elevated "retracting feet" in the PSY group (p < 0.001 vs. ASD) and differential "leg shaking" in the ASD group (p = 0.002 vs. PSY).

Figure 3. Violin plots of micro-action engagement by diagnostic group.
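For context, group differences like those above come from comparing per-participant micro-action statistics across diagnostic groups. Below is a hedged sketch of one way such a comparison can be run, assuming per-participant rates and a nonparametric Mann-Whitney U test; the page does not specify the exact statistical procedure, and the numbers here are synthetic placeholders, not study data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical per-participant "leg shaking" rates (events per minute).
asd_rates = rng.gamma(shape=2.0, scale=1.5, size=120)   # ASD group (placeholder)
psy_rates = rng.gamma(shape=2.0, scale=1.0, size=90)    # PSY group (placeholder)

# Two-sided nonparametric comparison of the two groups' rate distributions.
stat, p_value = mannwhitneyu(asd_rates, psy_rates, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```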
Citation

BibTeX

@inproceedings{chappa2026microdualnet,
  title     = {Micro-DualNet: Dual-Path Spatio-Temporal Network 
               for Micro-Action Recognition},
  author    = {Chappa, Naga VS Raviteja and Sariyanidi, Evangelos 
               and Yankowitz, Lisa and Nair, Gokul M.
               and Zampella, Casey J. and Schultz, Robert T. 
               and Tun\c{c}, Birkan},
  booktitle = {International Conference on Automatic Face and 
               Gesture Recognition (FG)},
  year      = {2026}
}
Acknowledgements

Supported by OD, NICHD, and NIMH grants R01MH122599, R01MH118327, P50HD105354, R21HD102078; and the IDDRC at CHOP/Penn.