Edge Weight Prediction For Category-Agnostic Pose Estimation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

Category-Agnostic Pose Estimation (CAPE) localizes keypoints across diverse object categories with a single model, using one or a few annotated support images. Recent works have shown that using a pose graph (i.e., treating keypoints as nodes in a graph rather than as isolated points) helps handle occlusions and break symmetry. However, these methods assume a static pose graph with equal-weight edges, leading to suboptimal results. We introduce EdgeCape, a novel framework that overcomes these limitations by predicting the graph’s edge weights, which improves localization. To further leverage structural priors, we propose integrating a Markovian Structural Bias, which modulates the self-attention interaction between nodes based on the number of hops between them. We show that this improves the model’s ability to capture global spatial dependencies. Evaluated on the MP-100 benchmark, which includes 100 categories and over 20K images, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, significantly improving keypoint localization accuracy. Our code is publicly available.


💡 Research Summary

EdgeCape addresses a fundamental limitation in category‑agnostic pose estimation (CAPE): existing graph‑based methods treat the pose graph as static and unweighted, assuming all edges contribute equally to keypoint localization. The proposed framework introduces two complementary innovations. First, it predicts edge weights for a user‑provided unweighted graph by learning a residual adjacency matrix ΔA. The predictor receives the prior adjacency A_prior together with support image features (F_s) and keypoint features (F_k^s). ΔA is computed as the pairwise cosine similarity between refined keypoint embeddings, which naturally encodes both direction and strength of relationships. To keep training stable, a learnable scalar c (initialized near zero) scales ΔA before it is added to A_prior and passed through a ReLU, yielding the final weighted adjacency A′ = ReLU(A_prior + c·ΔA). This residual formulation allows the model to keep the structural prior while adapting edge importance to each specific instance.
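The residual edge-weight prediction described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function name and the toy inputs are ours, and the refined keypoint embeddings (which in the paper come from a learned predictor conditioned on F_s and F_k^s) are simply passed in as an array.

```python
import numpy as np

def predict_edge_weights(A_prior, keypoint_emb, c=0.1):
    """Sketch of A' = ReLU(A_prior + c * ΔA), where ΔA is the
    pairwise cosine similarity between refined keypoint embeddings.
    `c` stands in for the learnable scalar (initialized near zero)."""
    norms = np.linalg.norm(keypoint_emb, axis=1, keepdims=True)
    unit = keypoint_emb / np.clip(norms, 1e-8, None)
    delta_A = unit @ unit.T              # cosine similarities, ΔA
    return np.maximum(A_prior + c * delta_A, 0.0)  # ReLU
```

With c = 0 the prior adjacency is returned unchanged (assuming it is non-negative), which is why initializing the learnable scalar near zero lets training start from the pure structural prior and drift toward instance-specific weights.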

Second, EdgeCape augments the transformer‑based graph decoder with a Markov Attention Bias. Standard self‑attention treats all node pairs equally, ignoring the graph‑theoretic distance (hop count) between them. By adding a bias term proportional to the hop distance (β·d_ij) to the attention logits, the model emphasizes information flow between nearby nodes and attenuates long‑range interactions, mirroring the decay of transition probabilities in a Markov chain. This bias is applied on top of the dual‑attention decoder, which now exchanges information bidirectionally between support image features and keypoint features of the same image. The dual‑attention mechanism enriches keypoint embeddings with global visual context, which is crucial when object orientation and appearance vary widely across categories.
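A minimal sketch of the distance-aware bias, under the assumption (ours, not stated explicitly above) that the β·d_ij term is subtracted from the attention logits so that attention decays with hop count. The hop distances come from a plain BFS over the unweighted pose graph; function names and β are illustrative.

```python
import numpy as np
from collections import deque

def hop_distances(adj):
    """All-pairs hop counts via BFS; unreachable pairs get n as a
    large finite stand-in."""
    n = len(adj)
    D = np.full((n, n), float(n))
    for s in range(n):
        D[s, s] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and D[s, v] > D[s, u] + 1:
                    D[s, v] = D[s, u] + 1
                    queue.append(v)
    return D

def biased_attention(logits, D, beta=1.0):
    """Row-softmax of (logits - beta * hop_distance): near nodes are
    emphasized, far nodes attenuated, echoing Markov-chain decay."""
    z = logits - beta * D
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

On a chain graph 0–1–2 with uniform logits, node 0 attends most strongly to itself, then to node 1, then to node 2, matching the intended decay with graph distance.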

The architecture is built on top of GraphCape, reusing its graph transformer decoder but extending it to process support image features alongside keypoint features. The residual edge predictor is lightweight, relying on cosine similarity rather than expensive MLPs or full attention heads, keeping computational overhead modest.

Experiments on the MP‑100 benchmark (100 categories, >20K images) demonstrate that EdgeCape achieves state‑of‑the‑art performance in the 1‑shot setting and leads among similar‑sized methods in the 5‑shot setting. It improves PCK@0.05 by roughly 4 percentage points over the unweighted GraphCape baseline, with the largest gains observed on categories with strong asymmetry or severe occlusions. Ablation studies confirm that (i) removing edge‑weight prediction degrades accuracy, (ii) omitting the Markov bias reduces the model’s ability to capture global spatial dependencies, and (iii) the residual formulation with a learnable scaling factor c is important for stable training. Alternative designs such as MLP‑based edge prediction or larger scaling factors provide marginal benefits at higher computational cost.
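For readers unfamiliar with the metric, PCK@0.05 counts a keypoint as correct when its prediction lies within 5% of the object's scale from the ground truth. A minimal sketch (ours; the exact normalization used on MP‑100 may differ, e.g. bounding-box side vs. diagonal):

```python
import numpy as np

def pck(pred, gt, bbox_size, thresh=0.05):
    """Fraction of keypoints whose Euclidean error is within
    thresh * bbox_size. pred/gt are (N, 2) arrays of (x, y)."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return float((errors <= thresh * bbox_size).mean())
```

So a 4-point PCK@0.05 gain means roughly 4 more keypoints per hundred landing within that 5%-of-scale radius.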

In summary, EdgeCape introduces a principled way to learn instance‑specific weighted pose graphs in a category‑agnostic setting and couples this with a distance‑aware attention bias. By preserving the user‑defined structural prior while allowing data‑driven refinement, it bridges the gap between prior knowledge and visual evidence, delivering more robust and precise keypoint localization across diverse object categories. This approach opens new avenues for applying CAPE in real‑world scenarios such as robotics, AR/VR, and medical imaging where flexible, accurate pose estimation is essential.

