MIDAS: Mosaic Input-Specific Differentiable Architecture Search


Differentiable Neural Architecture Search (NAS) provides efficient, gradient-based methods for automatically designing neural networks, yet its adoption remains limited in practice. We present MIDAS, a novel approach that modernizes DARTS by replacing static architecture parameters with dynamic, input-specific parameters computed via self-attention. To improve robustness, MIDAS (i) localizes the architecture selection by computing it separately for each spatial patch of the activation map, and (ii) introduces a parameter-free, topology-aware search space that models node connectivity and simplifies selecting the two incoming edges per node. We evaluate MIDAS on the DARTS, NAS-Bench-201, and RDARTS search spaces. In DARTS, it reaches 97.42% top-1 on CIFAR-10 and 83.38% on CIFAR-100. In NAS-Bench-201, it consistently finds globally optimal architectures. In RDARTS, it sets the state of the art on two of four search spaces on CIFAR-10. We further analyze why MIDAS works, showing that patchwise attention improves discrimination among candidate operations, and the resulting input-specific parameter distributions are class-aware and predominantly unimodal, providing reliable guidance for decoding.


💡 Research Summary

This paper introduces MIDAS (Mosaic Input‑Specific Differentiable Architecture Search), a novel extension of the DARTS framework that replaces the static, global architecture parameters with dynamic, input‑specific weights computed via a lightweight self‑attention mechanism. The core idea is to treat the set of candidate operation outputs at each node as “tokens” and to compute attention scores between a query derived from the node’s current inputs and keys derived from each candidate operation’s activation map. Unlike conventional DARTS, which uses a single scalar α per operation, MIDAS computes a probability distribution over operations for every spatial patch of the activation map (the “mosaic” approach). Each activation map is divided into P² non‑overlapping patches; within each patch the candidate maps are average‑pooled, projected to keys by a shallow two‑layer MLP, and combined with a query (also obtained by pooling the concatenated inputs) to produce patch‑level attention weights via a dot‑product followed by softmax. The patch‑level distributions are then averaged to obtain an image‑level distribution that guides the mixing of candidate operations. This patchwise design preserves local spatial information, improving discrimination among operations especially in early cells where features are highly localized.
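The patchwise attention described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the per-patch query, the two-layer ReLU MLPs (`Wk1`/`Wk2`, `Wq1`/`Wq2`), and the function name are assumptions filled in from the summary, with the dot-product-then-softmax and patch-averaging steps taken directly from the description.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mosaic_attention(candidates, node_input, Wk1, Wk2, Wq1, Wq2, P=2):
    """Sketch of MIDAS-style patchwise attention (names/details assumed).

    candidates: (K, C, H, W) activation maps of the K candidate operations
    node_input: (C, H, W)    input feature map used to build the query
    Wk*, Wq*:   (C, C)       weights of shallow two-layer key/query MLPs
    Returns an image-level probability distribution over the K operations.
    """
    K, C, H, W = candidates.shape
    ph, pw = H // P, W // P
    patch_probs = []
    for i in range(P):
        for j in range(P):
            rows = slice(i * ph, (i + 1) * ph)
            cols = slice(j * pw, (j + 1) * pw)
            # Average-pool each candidate's patch, project to keys via MLP.
            k_feat = candidates[:, :, rows, cols].mean(axis=(2, 3))   # (K, C)
            k = np.maximum(k_feat @ Wk1, 0.0) @ Wk2
            # Pool the node input over the same patch, project to a query.
            q_feat = node_input[:, rows, cols].mean(axis=(1, 2))      # (C,)
            q = np.maximum(q_feat @ Wq1, 0.0) @ Wq2
            # Scaled dot-product attention over the K candidates.
            scores = (k @ q) / np.sqrt(C)
            patch_probs.append(softmax(scores))
    # Average the P*P patch-level distributions into an image-level one.
    return np.mean(patch_probs, axis=0)
```

Because every patch distribution already sums to one, the averaged image-level vector is itself a valid probability distribution over the candidate operations.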

A second contribution is a parameter‑free topology search. DARTS requires selecting exactly two incoming edges per node, traditionally handled by separate topology parameters β. MIDAS instead enumerates all valid pairs of (input, operation) edges and computes a joint attention score for each pair as the sum of the two corresponding keys dotted with the query, scaled by √C. After softmax normalization, the top‑scoring pair of edges is selected for each node, eliminating the need for extra topology parameters while still respecting the two‑edge constraint.
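Under one reading of this pair-scoring scheme, the parameter-free topology step reduces to the sketch below. The function name is hypothetical; the summed-key dot product, the √C scaling, the softmax over pairs, and the top-pair selection follow the description above.

```python
import numpy as np
from itertools import combinations

def select_edge_pair(keys, query):
    """Sketch of MIDAS-style parameter-free topology selection (assumed form).

    keys:  (E, C) one key per candidate (input, operation) edge into a node
    query: (C,)   query derived from the node's inputs
    Returns the top-scoring pair of edge indices and the pair distribution.
    """
    E, C = keys.shape
    pairs = list(combinations(range(E), 2))   # all valid two-edge choices
    # Joint score for a pair = (k_a + k_b) . q, scaled by sqrt(C).
    scores = np.array([(keys[a] + keys[b]) @ query for a, b in pairs])
    scores /= np.sqrt(C)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over pairs
    return pairs[int(np.argmax(probs))], probs
```

Because scoring is done over pairs directly, the two-incoming-edges constraint is satisfied by construction rather than enforced through separate topology parameters β.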

Training follows the standard bilevel optimization used in DARTS: one split of the training data updates the network weights ω, while another split updates the attention projection matrices (the key and query MLPs) for each node. Because the attention parameters are node‑specific, the total parameter overhead scales linearly with the number of nodes, not with the size of the search space. After convergence, the input‑specific distributions are marginalized over a subset of training samples to produce a fixed architecture: the mean probabilities for each operation are computed, and the two strongest edges per node are retained.
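The decoding step at the end of this procedure (marginalizing the input-specific distributions over a sample subset, then keeping the strongest edges) is simple enough to state as code. This sketch assumes the per-sample distributions have already been collected into one array; the helper name is hypothetical.

```python
import numpy as np

def decode_architecture(per_sample_probs, top_k=2):
    """Sketch of the final decoding: average input-specific distributions
    over a subset of training samples, then retain the top-k edges per node.

    per_sample_probs: (N, E) one distribution over E candidate edges
                      for each of N sampled training inputs
    """
    mean_probs = per_sample_probs.mean(axis=0)          # marginalize over samples
    kept = np.argsort(mean_probs)[::-1][:top_k]         # strongest edges first
    return sorted(kept.tolist()), mean_probs
```

The paper's observation that the input-specific distributions are largely unimodal is what makes this plain averaging a reliable decoding rule: when most samples agree on the same mode, the mean preserves it.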

Extensive experiments validate MIDAS across three popular search spaces. On NAS‑Bench‑201, MIDAS consistently discovers the globally optimal architecture for CIFAR‑10, CIFAR‑100, and ImageNet‑16‑120, matching or surpassing all prior methods. In the original DARTS space, MIDAS achieves 97.42% top‑1 accuracy on CIFAR‑10 and 83.38% on CIFAR‑100, exceeding the previous best DARTS‑based results (97.35% and 83.18%). In the RDARTS benchmark (four search spaces S1‑S4), MIDAS sets state‑of‑the‑art performance on two of the four spaces (S2 and S4) while remaining competitive on the others. Ablation studies show that the patchwise attention dramatically sharpens the operation probability distribution, and that the resulting input‑specific distributions are class‑aware and largely unimodal, which explains the robustness of the simple averaging decoding step.

The authors discuss limitations: the attention cost grows linearly with the number of patches, so very high‑resolution inputs may increase memory and compute demands; also, the final architecture depends on the subset of samples used for marginalization, introducing slight variability. Future work could explore more efficient patch sampling, multi‑scale attention, or directly deploying input‑specific architectures without a final averaging step.

In summary, MIDAS advances differentiable NAS by (1) making architecture parameters responsive to each input via mosaic self‑attention, (2) integrating topology selection without extra parameters, and (3) delivering superior empirical results across multiple benchmarks while retaining the computational efficiency characteristic of DARTS.

