Thompson Sampling-Based Learning and Control for Unknown Dynamic Systems


Thompson sampling (TS) is a Bayesian randomized exploration strategy that samples options (e.g., system parameters or control laws) from the current posterior and applies the option that is optimal for the task, thereby balancing exploration and exploitation; this makes TS well suited to active learning-based controller design. However, TS relies on finite parametric representations, which limits its applicability to the more general function spaces commonly encountered in control system design. To address this issue, this work proposes a parameterization method for control-law learning based on reproducing kernel Hilbert spaces and designs a data-driven active learning control approach. Specifically, the proposed method treats the control law as an element of a function space, allowing control laws to be designed without restrictions on the system structure or the form of the controller. A TS framework is proposed to reduce control costs through online exploration and exploitation, and convergence guarantees are provided for the learning process. Theoretical analysis shows that the proposed method learns the relationship between control laws and closed-loop performance metrics at an exponential rate, and an upper bound on the control regret is derived. Furthermore, the closed-loop stability of the proposed learning framework is analyzed. Numerical experiments on controlling unknown nonlinear systems validate the effectiveness of the proposed method.


💡 Research Summary

The paper “Thompson Sampling‑Based Learning and Control for Unknown Dynamic Systems” tackles a fundamental limitation of existing Thompson Sampling (TS) approaches in control: they are confined to finite‑dimensional parametric representations, which makes them unsuitable for many modern control problems where the controller must be designed in an infinite‑dimensional function space. To overcome this, the authors propose a novel framework that treats the control law as an element of a reproducing kernel Hilbert space (RKHS). By embedding the controller in an RKHS, the method inherits the inner‑product structure, completeness, and the ability to represent highly nonlinear mappings through kernel basis functions, while still allowing rigorous analysis.

The methodology proceeds as follows. An initial controller \(\hat g\) (often obtained from legacy design or empirical tuning) is used to construct a candidate function space \(\widehat{\mathcal G}\). This space is defined as the convex hull of a finite set of kernel basis functions induced by a chosen kernel (e.g., Gaussian, polynomial). The convex hull ensures that any admissible controller can be expressed as a linear combination of these bases with non‑negative coefficients that sum to one, preserving feasibility and enabling convex optimization techniques.
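As a concrete illustration, a controller in this convex hull can be evaluated as a convex combination of kernel basis functions. The sketch below is not taken from the paper: the Gaussian kernel, the basis centers, and the uniform weights are all illustrative choices.

```python
import numpy as np

def gaussian_kernel(x, c, width=1.0):
    """Gaussian kernel basis phi_i(x) = exp(-||x - c||^2 / (2 * width^2))."""
    return np.exp(-np.sum((x - c) ** 2) / (2 * width ** 2))

def controller(x, centers, theta, width=1.0):
    """Control law g(x) = sum_i theta_i * phi_i(x), with theta on the simplex
    (theta_i >= 0, sum_i theta_i = 1), i.e. an element of the convex hull."""
    phis = np.array([gaussian_kernel(x, c, width) for c in centers])
    return float(theta @ phis)

# Hypothetical example: five basis centers on a 1-D state space,
# uniform convex weights.
centers = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
theta = np.full(5, 0.2)
u = controller(np.array([0.3]), centers, theta)
```

Because the kernel values lie in (0, 1] and the weights are convex, the resulting control input here is also bounded in (0, 1]; in general the simplex constraint keeps the controller inside the hull of the chosen bases.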

The control objective is to minimize a cost functional \(J(g)\) (quadratic, risk‑sensitive, or any Lipschitz‑continuous performance metric) over \(\widehat{\mathcal G}\). Because the true dynamics \(f(\cdot)\) are unknown, the cost cannot be evaluated analytically; instead, it is estimated online from trajectory data. The authors adopt a segment‑based learning scheme: the time horizon is divided into segments of length \(K\); at the beginning of each segment a controller \(g_t\) is fixed, data from the segment are collected, and then used to update the posterior distribution of the unknown cost function.
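The segment-based scheme can be sketched as follows. The dynamics `step`, the quadratic stage cost, and all numeric choices below are hypothetical stand-ins (in the paper the true dynamics are unknown and only trajectory data are observed):

```python
def run_segment(g, x0, step, K):
    """Apply a fixed control law g for K steps of the dynamics `step`,
    returning the visited states and an empirical average cost."""
    xs, cost = [x0], 0.0
    x = x0
    for _ in range(K):
        u = g(x)
        x = step(x, u)       # one step of the (in practice unknown) plant
        xs.append(x)
        cost += x ** 2 + 0.1 * u ** 2   # example quadratic stage cost
    return xs, cost / K

# Illustrative rollout with toy scalar dynamics x' = 0.8 x + u and a
# linear feedback law; both are placeholders, not the paper's setup.
xs, avg_cost = run_segment(lambda x: -0.5 * x,
                           x0=1.0,
                           step=lambda x, u: 0.8 * x + u,
                           K=10)
```

The empirical average cost returned at the end of each segment is the data point that feeds the Bayesian update described next.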

The Bayesian update is performed using a conjugate prior (e.g., Gaussian process or Gaussian‑inverse‑Gamma) that yields a tractable posterior over the cost function. Thompson Sampling is then applied: a sample of the cost function is drawn from the posterior, the controller that minimizes this sampled cost within \(\widehat{\mathcal G}\) is computed (often via convex optimization), and the resulting controller is deployed in the next segment. This procedure naturally balances exploration (sampling from uncertain regions of the posterior) and exploitation (using the currently best‑estimated controller).
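A minimal sketch of one Thompson-sampling round, assuming (for simplicity) a finite discretization of the convex hull and an independent conjugate Gaussian posterior over each candidate's cost. The paper's posterior is over the cost function itself (e.g., a Gaussian process), so this is a deliberately simplified stand-in, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discretization: each row is a convex-weight vector theta
# defining one candidate controller in the hull of the kernel bases.
candidates = rng.dirichlet(np.ones(5), size=20)

# Conjugate Gaussian model per candidate: prior N(mu_i, 1/prec_i) over the
# unknown cost J(g_i), with known observation noise precision.
mu = np.zeros(20)           # posterior means
prec = np.full(20, 1e-2)    # posterior precisions (1 / variance)
noise_prec = 1.0

def ts_select():
    """Thompson step: sample a cost for every candidate, pick the minimizer."""
    samples = rng.normal(mu, 1.0 / np.sqrt(prec))
    return int(np.argmin(samples))

def update(i, observed_cost):
    """Gaussian conjugate update after running a segment with candidate i."""
    mu[i] = (prec[i] * mu[i] + noise_prec * observed_cost) / (prec[i] + noise_prec)
    prec[i] += noise_prec

# One round: select by sampling, deploy, observe an empirical segment cost.
i = ts_select()
update(i, observed_cost=0.7)
```

Sampling from the posterior (rather than using the mean) is what drives exploration: candidates with wide posteriors occasionally draw optimistic samples and get tried, while well-estimated good candidates are exploited.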

Four main theoretical contributions are provided. Theorem 1 quantifies the performance loss caused solely by the RKHS parameterization, showing that the regret contributed by the function‑space approximation is bounded by a constant that depends on the kernel choice and the number of basis functions. Theorem 2 proves exponential convergence of the posterior estimate of the cost function to the true cost, i.e., the mean‑square error decays as \(\mathcal O(e^{-\alpha t})\). Theorem 3 derives an overall regret bound for the closed‑loop system: the cumulative regret consists of a constant term (from Theorem 1) plus an exponentially decaying term (from Theorem 2). Finally, Theorem 4 establishes mean‑square boundedness of the closed‑loop state trajectory, guaranteeing that the learning‑driven controller does not destabilize the plant despite the continual policy updates.
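Schematically, the regret decomposition of Theorem 3 combines the two preceding bounds. In the sketch below, \(C_{\mathrm{approx}}\), \(C'\), and \(\alpha > 0\) are placeholder constants (the paper's exact constants depend on the kernel and the number of basis functions):

```latex
\[
R_T
\;\le\;
\underbrace{C_{\mathrm{approx}}}_{\text{Theorem 1: parameterization loss}}
\;+\;
\underbrace{\sum_{t=1}^{T} C' e^{-\alpha t}}_{\text{Theorem 2: posterior convergence}}
\;\le\;
C_{\mathrm{approx}} + \frac{C'}{e^{\alpha} - 1}.
\]
```

Because the geometric series converges, the cumulative regret stays bounded by a constant rather than growing with the horizon \(T\).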

Numerical experiments validate the theory. Two benchmark nonlinear systems—a two‑link robotic arm and a nonlinear oscillator—are controlled using the proposed RKHS‑TS method, a traditional TS method that operates on a finite set of parametric controllers, and a UCB‑based active learning controller. Results show that RKHS‑TS achieves significantly lower cumulative regret, faster convergence of the cost, and superior tracking performance. Importantly, the method requires no prior knowledge of the system structure; the kernel choice alone suffices to capture the necessary richness of the controller class.

The paper also discusses extensions, such as handling non‑stationary reward distributions, reducing computational complexity via low‑rank kernel approximations, and applying the framework to multi‑agent or distributed settings. Overall, the work bridges Bayesian reinforcement learning and functional‑analysis‑based control design, offering a general, theoretically grounded, and practically effective solution for data‑driven control of unknown, highly nonlinear dynamical systems.

