Bandit Learning in Matching Markets with Interviews
Two-sided matching markets rely on preferences from both sides, yet it is often impractical to evaluate those preferences fully in advance. Participants therefore conduct a limited number of interviews, which provide early, noisy impressions and shape final decisions. We study bandit learning in matching markets with interviews, modeling interviews as "low-cost hints" that reveal partial preference information to both sides. Our framework departs from existing work by allowing firm-side uncertainty: firms, like agents, may be unsure of their own preferences and can make early hiring mistakes by hiring less preferred agents. To handle this, we extend the firm's action space to allow strategic deferral (choosing not to hire in a round), enabling recovery from suboptimal hires and supporting decentralized learning without coordination. We design novel algorithms for (i) a centralized setting with an omniscient interview allocator and (ii) decentralized settings with two types of firm-side feedback. Across all settings, our algorithms achieve time-independent regret, a substantial improvement over the O(log T) regret bounds known for learning stable matchings without interviews. Moreover, in mildly structured markets, decentralized performance matches the centralized counterpart up to polynomial factors in the number of agents and firms.
💡 Research Summary
The paper tackles the problem of learning stable matchings in two‑sided markets where both agents (e.g., job seekers) and firms (e.g., employers) must discover their preferences over time. Unlike prior work that assumes firms have fixed, known rankings and that agents only receive bandit feedback after a match, this study introduces two realistic features: (1) low‑cost “interviews” that act as noisy hints about the underlying reward distributions, and (2) uncertainty on the firm side, meaning firms also have to learn which agents they prefer.
Model Overview
- There are n agents and m firms (n ≤ m). Each pair (a, f) has two bounded reward distributions: D_{a,f} for the agent and D_{f,a} for the firm, with means u_{a,f} and u_{f,a} respectively. These means induce strict total orderings, i.e., each side has a well‑defined preference list.
- In each round, every agent selects up to k firms to interview (k ≥ 2). An interview yields a stochastic signal drawn from the corresponding distribution, providing a noisy estimate of the mean but not counting as a reward. Firms also interview a subset of agents and receive analogous signals.
- After interviewing, each agent applies to exactly one of the firms it interviewed. A firm receives all applications, selects its most‑preferred applicant according to its current estimated ranking, or defers (i.e., remains vacant).
- At the end of the round, agents observe a public feedback set. Two feedback models are considered: V(t) – the set of firms that are vacant; and V⁺(t) – the union of vacant firms and firms whose hiring status changed since the previous round. No identities are revealed; the feedback is completely anonymous.
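The round structure above can be sketched in code. The following toy simulation is an illustration of the protocol only, with made-up reward means and a greedy interview/application rule; it is not the paper's algorithm, and the firm-side deferral option is omitted here for brevity.

```python
import random

random.seed(0)
n_agents, n_firms, k = 3, 3, 2  # toy market; k interviews per agent per round

# Hypothetical true mean rewards, unknown to participants
u_agent = [[0.9, 0.5, 0.1], [0.4, 0.8, 0.3], [0.2, 0.6, 0.7]]  # u_agent[a][f]
u_firm  = [[0.7, 0.3, 0.5], [0.2, 0.9, 0.4], [0.6, 0.1, 0.8]]  # u_firm[f][a]

# Running empirical means and sample counts from interview signals
est_a = [[0.0] * n_firms for _ in range(n_agents)]
est_f = [[0.0] * n_agents for _ in range(n_firms)]
cnt_a = [[0] * n_firms for _ in range(n_agents)]
cnt_f = [[0] * n_agents for _ in range(n_firms)]

def signal(mean):
    # Bounded noisy signal around the true mean (a hint, not a reward)
    return min(1.0, max(0.0, mean + random.uniform(-0.1, 0.1)))

def one_round():
    applications = {}  # firm -> list of applicants this round
    for a in range(n_agents):
        # Interview the k firms with the highest current estimates (greedy shortlist)
        shortlist = sorted(range(n_firms), key=lambda f: est_a[a][f], reverse=True)[:k]
        for f in shortlist:
            # Both sides observe a noisy signal from the same interview
            cnt_a[a][f] += 1
            est_a[a][f] += (signal(u_agent[a][f]) - est_a[a][f]) / cnt_a[a][f]
            cnt_f[f][a] += 1
            est_f[f][a] += (signal(u_firm[f][a]) - est_f[f][a]) / cnt_f[f][a]
        # Apply to exactly one firm: the best among those just interviewed
        best = max(shortlist, key=lambda f: est_a[a][f])
        applications.setdefault(best, []).append(a)
    # Each firm hires its top applicant under current estimates (deferral omitted)
    return {f: max(apps, key=lambda a: est_f[f][a]) for f, apps in applications.items()}

matching = one_round()  # firm -> hired agent for this round
```

Since each agent applies to exactly one firm per round, the hired agents are always distinct, matching the one-application-per-round rule in the model.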
Why Deferral Is Crucial
If firms were forced to hire every round, early estimation errors could lock the market into a sub‑optimal, unstable matching: a firm would keep hiring an agent it mistakenly believes to be best, while the truly preferred agent never gets a chance to be reconsidered. By allowing a firm to defer, it publicly signals vacancy through V(t) or V⁺(t), enabling previously rejected agents to re‑apply in later rounds. The paper provides a concrete example where deferral is the only mechanism that can break a deadlock caused by mutual mis‑estimation.
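One natural way to realize such a deferral rule (a sketch under standard confidence-interval assumptions, not the paper's exact algorithm) is for the firm to hire only when its top applicant's confidence interval is disjoint from every other applicant's:

```python
import math

def firm_decision(est, cnt, applicants, delta=0.05):
    """Hire the top applicant only if its confidence interval is disjoint
    from every other applicant's; otherwise defer and stay vacant.
    est[a], cnt[a]: empirical mean and interview count for applicant a."""
    def radius(a):
        # Hoeffding-style radius for rewards bounded in [0, 1]
        return math.sqrt(math.log(2 / delta) / (2 * max(cnt[a], 1)))
    top = max(applicants, key=lambda a: est[a])
    for a in applicants:
        if a != top and est[top] - radius(top) <= est[a] + radius(a):
            return None  # still ambiguous: defer, signalling vacancy via V(t)
    return top
```

With only a handful of interviews per applicant the intervals overlap and the firm defers (returns `None`); once enough signals accumulate, it commits to the estimated-best applicant.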
Algorithmic Contributions
- Centralized Interview Allocator (CIA) – A central planner decides which interviews occur each round, coordinating agents’ interview budgets. Building on the classic Deferred Acceptance algorithm, the authors design a two‑phase process that uses interview hints to update estimates and then matches agents to firms. The algorithm learns the agent‑optimal stable matching in O(n m²) rounds, regardless of whether firms are certain or uncertain.
- Decentralized Algorithms with Minimal Feedback
- Vacancy‑only feedback (V): Algorithm 2 introduces a coordination step where agents share limited information about which firms are vacant, allowing them to re‑target those firms. In markets that are α‑reducible (every sub‑market contains a fixed pair of mutually top‑choice participants), the regret per agent is O(n³ m²); in general markets it grows to O(n⁴ m²).
- Anonymous hiring‑change feedback (V⁺): Algorithm 3 eliminates any coordination phase. Using only the stronger V⁺ signal, agents can infer when a firm’s hiring status has changed and adjust their interview/application choices accordingly. In α‑reducible markets the regret is O(n³ m²). Extending to arbitrary markets requires three interviews per round (k = 3) and a mild lower bound on reward gaps, after which the regret becomes constant (O(1)).
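All of these routines ultimately match agents to firms via proposals on current estimated rankings, in the spirit of Deferred Acceptance. For reference, a minimal agent-proposing Deferred Acceptance on (estimated) preference lists, assuming for simplicity equal numbers of agents and firms and complete lists:

```python
def deferred_acceptance(agent_prefs, firm_prefs):
    """Agent-proposing Deferred Acceptance on preference lists.
    agent_prefs[a]: firms in decreasing preference; firm_prefs[f]: agents likewise.
    Returns dict agent -> firm (the agent-optimal stable matching for these lists)."""
    n = len(agent_prefs)
    # rank[f][a] = position of agent a in firm f's list (lower is better)
    rank = [{a: i for i, a in enumerate(p)} for p in firm_prefs]
    next_prop = [0] * n  # index of the next firm each agent will propose to
    held = {}            # firm -> agent tentatively held
    free = list(range(n))
    while free:
        a = free.pop()
        f = agent_prefs[a][next_prop[a]]
        next_prop[a] += 1
        if f not in held:
            held[f] = a
        elif rank[f][a] < rank[f][held[f]]:
            free.append(held[f])  # firm trades up, displacing its held agent
            held[f] = a
        else:
            free.append(a)        # proposal rejected; agent stays free
    return {a: f for f, a in held.items()}
```

In the learning algorithms, the preference lists fed to such a routine come from interview-based estimates, so the output matching is only provisional until those estimates stabilize.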
All three algorithm families achieve time‑independent regret: the cumulative regret does not grow with the horizon T, a stark improvement over the O(log T) bounds typical in bandit matching without interviews. The key insight is that two interviews per round already provide enough side information to drive the learning process to convergence, whereas comparable hint‑based models often need three hints per round to guarantee similar performance.
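The intuition behind time-independent regret can be made concrete with a standard Hoeffding calculation (generic constants, not the paper's exact bound): once a pair has been interviewed enough times relative to its reward gap, ordering mistakes on that pair stop with high probability, so the total number of mistake rounds is bounded independently of T.

```python
import math

def samples_to_separate(gap, delta=0.05):
    """Interviews per pair after which two options whose means differ by `gap`
    are correctly ordered with probability at least 1 - delta
    (rewards bounded in [0, 1]; standard Hoeffding bound)."""
    return math.ceil(2 * math.log(2 / delta) / gap ** 2)
```

For example, a mean gap of 0.5 needs on the order of 30 interviews at delta = 0.05; smaller gaps drive the count up quadratically, which is where the gap-dependence in the regret bounds comes from.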
Theoretical Foundations
- The paper formalizes α‑reducible markets, a structural condition guaranteeing a unique stable matching. This condition is weaker than serial dictatorship and includes many realistic settings (e.g., markets with a chain of top‑choice pairs).
- Regret analysis hinges on bounding the number of “mistake” rounds where an agent applies to a non‑optimal firm. Interviews reduce uncertainty quickly, and deferral ensures that a mistaken hire does not permanently block better matches.
- The authors derive explicit regret bounds that depend polynomially on n and m but are independent of T. They also prove lower‑bound arguments showing that without at least two interviews per round, constant regret cannot be achieved in the competitive two‑sided setting.
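The α-reducibility condition described above admits a direct check: repeatedly look for a mutually top-choice pair in the remaining sub-market and peel it off; the market is α-reducible exactly when this process never gets stuck. A sketch (assuming complete strict preference lists and equally many agents and firms):

```python
def is_alpha_reducible(agent_prefs, firm_prefs):
    """Check α-reducibility by repeatedly removing a mutually top-choice pair.
    agent_prefs[a] / firm_prefs[f]: complete lists in decreasing preference."""
    agents = set(range(len(agent_prefs)))
    firms = set(range(len(firm_prefs)))
    while agents and firms:
        pair = None
        for a in agents:
            f = next(x for x in agent_prefs[a] if x in firms)       # a's top remaining firm
            if next(x for x in firm_prefs[f] if x in agents) == a:  # f's top remaining agent
                pair = (a, f)
                break
        if pair is None:
            return False  # some sub-market has no fixed pair
        agents.discard(pair[0])
        firms.discard(pair[1])
    return True
```

A fully cyclic market (each agent's favorite firm favors a different agent) fails the check, while any market with a chain of top-choice pairs passes, matching the examples mentioned above.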
Practical Implications and Limitations
- The model captures real‑world practices such as career fairs (interviews) and hiring freezes or “hold” decisions (deferral). Hence, the algorithms could be implemented in online labor platforms, college admissions portals, or medical residency matching systems without requiring a central authority to dictate every interview.
- The analysis assumes bounded, independent reward distributions and strict total order preferences, which may not hold in settings with multi‑dimensional utilities or correlated outcomes.
- Empirical validation is absent; future work should test the algorithms on simulated markets and real datasets to assess robustness to noise, strategic manipulation, and dynamic entry/exit of participants.
Conclusion
By integrating low‑cost interview hints and a strategic deferral option for firms, the authors demonstrate that two‑sided markets can learn stable matchings with constant regret, even when both sides are learning simultaneously. The work bridges a gap between classic matching theory and modern bandit learning, offering both centralized and fully decentralized solutions that are theoretically sound and potentially applicable to a wide range of real‑world matching platforms.