Latency-aware Human-in-the-Loop Reinforcement Learning for Semantic Communications

Peizheng Li, Xinyi Lin, Adnan Aijaz
Bristol Research and Innovation Laboratory, Toshiba Europe Ltd., U.K.
Email: {peizheng.li, xinyi.lin, adnan.aijaz}@toshiba.eu

Abstract—Semantic communication promises task-aligned transmission but must reconcile semantic fidelity with stringent latency guarantees in immersive and safety-critical services. This paper introduces a time-constrained human-in-the-loop reinforcement learning (TC-HITL-RL) framework that embeds human feedback, semantic utility, and latency control within a semantic-aware Open Radio Access Network (RAN) architecture. We formulate semantic adaptation driven by human feedback as a constrained Markov decision process (CMDP) whose state captures semantic quality, human preferences, queue slack, and channel dynamics, and solve it via a primal–dual proximal policy optimization algorithm with action shielding and latency-aware reward shaping. The resulting policy preserves PPO-level semantic rewards while tightening the variability of both air-interface and near-real-time RAN intelligent controller processing budgets. Simulations over point-to-multipoint links with heterogeneous deadlines show that TC-HITL-RL consistently meets per-user timing constraints, outperforms baseline schedulers in reward, and stabilizes resource consumption, providing a practical blueprint for latency-aware semantic adaptation.

Index Terms—6G, AI, human-in-the-loop, radio access network, reinforcement learning, semantic communication

I. INTRODUCTION

Semantic communication (SemCom) shifts the design focus from bit-level fidelity to task- or meaning-level utility, transmitting only task-relevant information and enabling joint design of physical, link, and inference layers for improved spectral and energy efficiency as well as reduced latency [1]–[3]. Specifically, deep learning–based SemCom systems, often realized via joint source–channel coding (JSCC) [4], [5], have demonstrated strong robustness to channel impairments and notable performance gains. However, most existing designs treat semantic models as static once trained and therefore struggle to maintain alignment when wireless conditions, user preferences, or task objectives evolve over time. From a service perspective, adaptive mechanisms are essential to keep semantic fidelity aligned with user intent and application context.

The rapid progress of generative AI and reinforcement learning from human feedback (RLHF) [6] underscores the value of learning directly from human preferences. Human-in-the-Loop Reinforcement Learning (HITL-RL) incorporates subjective feedback into the reward design and policy updates [7]. It has been successfully applied in robotics, preference learning, and controllable text generation, and has recently been advocated for SemCom to align models with user-perceived utility [8], [9]. Bringing HITL-RL into networked SemCom loops, however, introduces domain-specific challenges. In wireless systems, human feedback travels over bandwidth- and latency-limited links, and semantic model updates must meet strict timing constraints. In point-to-multipoint deployments with heterogeneous users, feedback delays and reconfiguration latencies can render otherwise beneficial updates infeasible for a subset of users. Ignoring these temporal effects leads to per-user deadline violations and degraded quality of experience (QoE).
Time-aware decision mechanisms are therefore required to couple semantic utility with the realities of scheduling and deployment. Meanwhile, the granularity of model updates (e.g., partial refresh vs. full retraining) should be carefully chosen to balance semantic gains against latency overhead.

Constrained Markov decision processes (CMDPs) [10] provide a principled way to enforce latency or safety budgets via Lagrangian or primal–dual methods [11]. Proximal Policy Optimization (PPO) [12], known for stability and sample efficiency, can be endowed with cost critics and dual variables to form constrained PPO (PPO-C), and recent work has brought such RL ideas to RIC optimization [13], [14]. However, prior studies focus either on average QoS or on resource slicing, without incorporating human preference signals or the per-frame feasibility mechanisms introduced here.

In this work, we introduce a time-constrained HITL-RL framework for semantic adaptation in point-to-multipoint settings. We formulate semantic adaptation as a CMDP with per-user deadline budgets and latency-aware reward shaping, and we solve it with a primal–dual PPO algorithm augmented with an action shield that enforces instantaneous feasibility during both training and deployment. To our knowledge, this is among the first integrations of HITL-RL with SemCom under explicit real-time constraints, bridging preference-driven learning with implementable timing control. The main contributions are:

• Latency-aware CMDP. We couple human-aligned semantic utility with near-RT RIC budgets and per-user deadlines, yielding a tractable CMDP abstraction for semantic broadcasting under latency guarantees.
• TC-PPO with shielding. A primal–dual PPO variant with cost critics, adaptive multipliers, and an action shield enforces both average and instantaneous feasibility.
• Implementation and evidence. We map the framework to an NR-like slot structure and show on JSCC-enabled multi-user links that TC-PPO preserves PPO-level reward while tightening resource dispersion; targeted ablations highlight the role of each component.

Fig. 1: Illustration of the system model (Open RAN architecture with SMO, non-RT RIC, near-RT RIC, CU/DU/RU, the DRL agent, semantic model dispatch, UE feedback, semantic service and evaluation, and time-sensitive feedback paths).

II. SYSTEM MODEL

We consider an AI-driven next-generation RAN where a semantic-aware base station (gNB) serves latency-heterogeneous user equipments (UEs) $\mathcal{N} = \{1, \ldots, N\}$ over a shared downlink. As illustrated in Fig. 1, the architecture follows the Open RAN functional split [2], [15]: the near-RT RIC hosts the HITL-RL agent, while distributed units (DUs) and radio units (RUs) handle physical-layer connectivity. Semantic models operate as encoder–decoder pairs, with the encoder at the gNB and UE-specific decoders at the terminals. Human operators evaluate reconstructed semantics and send preference feedback to the RIC, which fuses these signals, updates the models, and disseminates configuration deltas under strict timing budgets.

The control loop is discretized into gNB scheduling sub-frames indexed by $t \in \{0, 1, \ldots\}$, and each sub-frame comprises slot grants (akin to NR mini-slot allocations) that are dynamically carved out for semantic adaptation.
Every UE $i$ belongs to a service class $k(i)$ with a deadline $d_i$ representing the maximum allowable time between observing semantic degradation and deploying a refreshed decoder.

A. Semantic Delivery Pipeline

At frame $t$ the gNB ingests source features $x_t \in \mathbb{R}^{n_s}$ together with historical context $m_t$. The encoder parameters $\phi_t$ generate a latent embedding

$$z_t = f_{\phi_t}(x_t, m_t), \tag{1}$$

which is mapped onto a complex-valued symbol block $s_t = E(z_t) \in \mathbb{C}^{n_c}$ satisfying $\|s_t\|_2^2 \le n_c P_{\max}$ for transmission, where $P_{\max}$ denotes the per-symbol power budget. The RU simultaneously delivers $s_t$ over a block-fading MIMO channel,

$$y_{i,t} = H_{i,t} s_t + n_{i,t}, \tag{2}$$

where $H_{i,t}$ captures path loss and small-scale fading toward UE $i$ and $n_{i,t} \sim \mathcal{CN}(0, \sigma_i^2 I)$. Each UE maintains a personalized decoder $\psi_{i,t}$ that incorporates local side information $c_{i,t}$ (e.g., service context information or sensor snapshots),

$$\hat{x}_{i,t} = g_{\psi_{i,t}}(y_{i,t}, c_{i,t}). \tag{3}$$

The reconstruction is assessed through a task loss $\ell_i(\hat{x}_{i,t}, x_t)$ and its complementary quality score $q_{i,t} = 1 - \ell_i(\hat{x}_{i,t}, x_t)$, which is reported to the near-RT RIC.

B. Human Feedback Acquisition

UE-side human evaluators send scalar or vector feedback $F_{i,t}$ through uplink control or data bearers configured by the gNB. The specific PHY channels used for this purpose depend on deployment choices and are therefore abstracted in this study. Our focus is on how the near-RT RIC acquires, aggregates, and exploits the feedback received at the gNB. Within the Open RAN framework, the SMO and near-RT RIC modules fuse these signals and map them into normalized preference scores,

$$\tilde{U}_{i,t} = (1 - \eta_i)\, U_i(q_{i,t}, p_{i,t}) + \eta_i\, H_{\mathrm{pref}}(F_{i,t}), \tag{4}$$

where $U_i$ encodes objective task KPIs $p_{i,t}$ (e.g., detection accuracy) and $H_{\mathrm{pref}}$ captures subjective satisfaction, while $\eta_i \in [0, 1]$ balances machine and human inputs. The near-RT RIC maintains an exponentially weighted moving average, $\bar{U}_{i,t+1} = (1 - \alpha_i)\bar{U}_{i,t} + \alpha_i \tilde{U}_{i,t}$, with smoothing factor $\alpha_i \in (0, 1]$, which becomes part of the RL state and anchors long-term semantic alignment.

C. Latency Budget Decomposition

Executing an adaptation action $u_t \in \mathcal{U}$, ranging from lightweight decoder-statistic refreshes, through mid-tier feature refinement, to full retraining, cached rollback, or a no-operation (NoOp), triggers an operation sequence in the Open RAN stack. For UE $i$, the end-to-end latency is modeled as

$$C_{i,t}(u_t) = C^{\mathrm{fb}}_{i,t} + C^{\mathrm{RIC}}_{i,t}(u_t) + C^{\mathrm{tx}}_{i,t}(u_t) + C^{\mathrm{reconf}}_{i,t}(u_t), \tag{5}$$

where $C^{\mathrm{fb}}_{i,t}$ denotes feedback acquisition (human response and uplink), $C^{\mathrm{RIC}}_{i,t}$ covers analytics and decision making at the near-RT RIC (including queuing), $C^{\mathrm{tx}}_{i,t}$ accounts for fronthaul/backhaul dissemination of the new model parameters, and $C^{\mathrm{reconf}}_{i,t}$ describes UE-side deployment and warm-start. Hard timing constraints require

$$C_{i,t}(u_t) \le d_i, \quad \forall i \in \mathcal{N}. \tag{6}$$

Service classes $k \in \mathcal{K}$ organize UEs by application, yielding representative budgets aligned with 3GPP delay targets in the standardized 5G QoS Identifier (5QI):

$$d_i = T^{\mathrm{fb}}_{k(i)} + T^{\mathrm{RIC}}_{k(i)} + T^{\mathrm{tx}}_{k(i)} + T^{\mathrm{reconf}}_{k(i)}. \tag{7}$$

The RIC exposes the residual slack $\Delta_{i,t} = d_i - C_{i,t}(u_t)$ and the normalized deadline debt $\delta_{i,t} = [-\Delta_{i,t}]_+ / d_i$, where $[x]_+ \triangleq \max\{x, 0\}$, to the learning agent for scheduling. Intuitively, $\Delta_{i,t} \ge 0$ indicates the remaining time budget before the deadline $d_i$, whereas $\Delta_{i,t} < 0$ quantifies lateness; the normalized debt $\delta_{i,t} \in [0, 1]$ expresses any lateness as a fraction of $d_i$.
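To make the latency bookkeeping concrete, the following minimal Python sketch evaluates the decomposition (5), the residual slack, and the normalized deadline debt for one UE. The function name and the sample component values are illustrative assumptions, not part of the paper's released code.

```python
def latency_slack(c_fb, c_ric, c_tx, c_reconf, deadline):
    """Bookkeeping of Eqs. (5)-(7) for a single UE (all values in ms).

    Returns the end-to-end latency C, the residual slack Delta = d - C,
    and the normalized deadline debt delta = max(-Delta, 0) / d.
    """
    c_total = c_fb + c_ric + c_tx + c_reconf   # Eq. (5)
    slack = deadline - c_total                 # Delta_{i,t}
    debt = max(-slack, 0.0) / deadline         # delta_{i,t} in [0, 1]
    return c_total, slack, debt

# Hypothetical numbers: a mid-tier update checked against an 8 ms deadline.
C, slack, debt = latency_slack(c_fb=1.2, c_ric=2.8, c_tx=0.7, c_reconf=0.9,
                               deadline=8.0)
# C = 5.6 ms, slack = 2.4 ms, debt = 0.0, so the hard constraint (6) is met.
```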
D. Time Resource Coupling

Within each 10 ms gNB frame we assume a fixed numerology $\mu \in \{0, 1, 2\}$, which sets the slot duration $T_{\mathrm{slot}}(\mu) = 1\,\mathrm{ms} / 2^{\mu}$ and symbol duration $T_{\mathrm{sym}}(\mu) = T_{\mathrm{slot}}(\mu) / 14$. Semantic adaptation then occupies mini-slot-like grants of $n_{\mathrm{sym},t} \in \{2, 4, 7\}$ symbols that are dynamically provisioned to the semantic slice. If $\kappa_t$ such grants are allocated and $T_{\mathrm{ctrl},t}$ captures control overhead, the available processing window is $T_{\mathrm{avail},t} = \kappa_t\, n_{\mathrm{sym},t}\, T_{\mathrm{sym}}(\mu) - T_{\mathrm{ctrl},t}$.

Let $b_{i,t} \in \{0, 1\}$ indicate whether UE $i$ is scheduled for adaptation in frame $t$. The aggregate time spent by the selected UEs must satisfy

$$\sum_{i=1}^{N} b_{i,t}\, C^{\mathrm{RIC}}_{i,t}(u_t) \le T_{\mathrm{avail},t}. \tag{8}$$

Each per-UE near-RT RIC queue integrates arrivals and service,

$$Q_{i,t+1} = \left[ Q_{i,t} + A_{i,t} - b_{i,t}\, C^{\mathrm{RIC}}_{i,t}(u_t) \right]_+, \tag{9}$$

where $A_{i,t}$ is an arrival process capturing newly arrived adaptation jobs (e.g., fresh feedback or analytics-triggered updates). Maintaining $Q_{i,t} \le Q^{\max}_i$ mitigates deadline violations and provides additional state information to the CMDP agent.

III. CMDP-BASED CONSTRAINED POLICY OPTIMIZATION

A. CMDP Problem Setup

We formulate the time-constrained semantic adaptation problem as a CMDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, c, \gamma)$ whose components reflect the semantics, human feedback, and latency dynamics developed in Sec. II.

State space. At the beginning of frame $t$, the near-RT RIC observes the state

$$s_t = \left( q_t, \bar{U}_t, \Delta_t, \delta_t, Q_t, H_t, T_{\mathrm{avail},t} \right), \tag{10}$$

whose elements are:
• $q_t = [q_{1,t}, \ldots, q_{N,t}]^{\top}$: instantaneous semantic quality seen by the UEs.
• $\bar{U}_t = [\bar{U}_{1,t}, \ldots, \bar{U}_{N,t}]^{\top}$: human-aligned utility estimates maintained by the RIC.
• $\Delta_t = [\Delta_{1,t}, \ldots, \Delta_{N,t}]^{\top}$ and $\delta_t = [\delta_{1,t}, \ldots, \delta_{N,t}]^{\top}$: residual slack and normalized deadline debt per UE.
• $Q_t = [Q_{1,t}, \ldots, Q_{N,t}]^{\top}$: RIC queue backlogs of pending adaptation jobs.
• $H_t = \{H_{i,t}\}_{i=1}^{N}$: effective channel matrices influencing shared-link reliability.
• $T_{\mathrm{avail},t}$: mini-slot-like budget granted to the semantic slice in frame $t$.

Action space. The agent chooses a composite action $a_t = (u_t, b_t)$, where $u_t \in \mathcal{U} = \{$LightAdapt, FeatRefine, FullRetrain, DeployCached, NoOp$\}$ selects the adaptation primitive, and $b_t = [b_{1,t}, \ldots, b_{N,t}]^{\top} \in \{0, 1\}^N$ schedules the UEs that will execute the primitive in frame $t$. In practice, LightAdapt refreshes decoder statistics or adapters with minimal latency, FeatRefine performs moderate-strength fine-tuning (e.g., LoRA layers), FullRetrain deploys a heavy knowledge-base update, and DeployCached rolls back or reinstates a cached stable model, while NoOp leaves the current configuration untouched. The instantaneous feasibility region is

$$\mathcal{A}_{\mathrm{feas}}(s_t) = \left\{ (u, b) \;\middle|\; \sum_{i=1}^{N} b_i\, C^{\mathrm{RIC}}_{i,t}(u) \le T_{\mathrm{avail},t},\;\; C_{i,t}(u) \le d_i \text{ if } b_i = 1 \right\}, \tag{11}$$

which ensures that the near-RT RIC processing window and every UE deadline remain satisfied.
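As a concrete reading of the feasibility region (11), the short sketch below tests a candidate $(u, b)$ pair against the RIC processing window and the per-UE deadlines. The array layout and sample values (taken from the LightAdapt latencies reported later in Table I) are illustrative inputs that a near-RT RIC implementation would estimate online.

```python
import numpy as np

def is_feasible(b, c_ric, c_e2e, deadlines, t_avail):
    """Membership test for the feasibility region of Eq. (11).

    b         : (N,) 0/1 scheduling mask for the chosen primitive u
    c_ric     : (N,) per-UE near-RT RIC processing time of u (ms)
    c_e2e     : (N,) per-UE end-to-end latency C_{i,t}(u) (ms)
    deadlines : (N,) per-UE deadlines d_i (ms)
    t_avail   : scalar processing window T_avail (ms)
    """
    window_ok = b @ c_ric <= t_avail                        # Eq. (8) budget
    deadline_ok = np.all((b == 0) | (c_e2e <= deadlines))   # Eq. (6) for scheduled UEs
    return bool(window_ok and deadline_ok)

# Hypothetical check for N = 4 UEs and a 3.5 ms window (LightAdapt-like costs).
b = np.array([1, 1, 0, 1])
print(is_feasible(b,
                  c_ric=np.array([1.1, 1.1, 1.1, 1.1]),
                  c_e2e=np.array([2.4, 2.4, 2.4, 2.4]),
                  deadlines=np.array([6.0, 8.0, 6.0, 12.0]),
                  t_avail=3.5))   # True: 3.3 ms <= 3.5 ms and all deadlines hold
```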
Transition kernel. Given $(s_t, a_t)$, the next state $s_{t+1}$ is sampled from $P(s_{t+1} \mid s_t, a_t)$. The kernel encompasses: (i) JSCC reconstructions through the fading channels $H_{i,t}$; (ii) human-feedback fusion producing $\bar{U}_{i,t}$; and (iii) the latency decomposition that updates $\Delta_{i,t}$ and $Q_{i,t}$ via the components of $C_{i,t}(u)$ detailed in Sec. II.

Reward and constraint signals. The per-frame reward balances semantic improvement with the computational overhead of large updates:

$$r(s_t, a_t) = \sum_{i=1}^{N} w_i\, \bar{U}_{i,t+1} - \beta_u\, \chi(u_t) - \beta_{\delta} \sum_{i=1}^{N} \delta_{i,t+1}, \tag{12}$$

where $w_i$ encode service priorities, $\chi(u_t)$ quantifies the compute cost at the gNB/RIC, and $\beta_u, \beta_{\delta} \ge 0$ modulate the trade-off between semantic gain and deadline stress. We adopt post-transition rewards that depend on $s_{t+1}$ (via $\bar{U}_{i,t+1}$ and $\delta_{i,t+1}$) to couple the learning targets with observed outcomes. Two cost signals monitor resource feasibility:

$$c^{(1)}(s_t, a_t) = \sum_{i=1}^{N} b_{i,t}\, C^{\mathrm{RIC}}_{i,t}(u_t), \tag{13}$$

$$c^{(2)}(s_t, a_t) = \sum_{i=1}^{N} \left[ C_{i,t}(u_t) - d_i \right]_+, \tag{14}$$

representing the near-RT RIC processing time consumed in frame $t$ and the aggregate deadline overshoot.

Optimization objective. Let $\pi_{\theta}(a \mid s)$ be a stationary stochastic policy with parameters $\theta$ and discount factor $\gamma \in (0, 1)$. The CMDP seeks

$$\max_{\pi_{\theta}} \;\; \mathbb{E}_{\pi_{\theta}}\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \right] \tag{15a}$$

$$\text{s.t.} \;\; \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi_{\theta}}\!\left[ \sum_{t=0}^{T-1} c^{(j)}(s_t, a_t) \right] \le d^{(j)}, \quad j = 1, 2, \tag{15b}$$

where $d^{(1)} \triangleq \mathbb{E}[T_{\mathrm{avail},t}]$ denotes the long-term RIC processing budget and $d^{(2)} \triangleq 0$ enforces zero average deadline violations (a relaxed $d^{(2)} > 0$ encodes a tolerable violation probability). Solving (15) produces a policy that maximizes human-aligned semantic utility while respecting both resource and latency constraints.

B. Primal–Dual PPO Surrogate

We solve (15) using a primal–dual variant of PPO that maintains the clipped surrogate structure while introducing Lagrange multipliers $\lambda = [\lambda_1, \lambda_2]^{\top}$ for the constraints. The stochastic Lagrangian is

$$\mathcal{L}(\theta, \lambda) = \mathbb{E}_{\pi_{\theta}}\!\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) - \sum_{j=1}^{2} \lambda_j \left( c^{(j)}(s_t, a_t) - d^{(j)} \right) \right) \right]. \tag{16}$$

A clipped mini-batch surrogate for gradient ascent is

$$\mathcal{L}_{\mathcal{B}}(\theta, \lambda) = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} \left[ \min\!\left( \rho_t \hat{A}^r_t,\; \mathrm{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon)\, \hat{A}^r_t \right) - \sum_{j=1}^{2} \lambda_j\, \hat{A}^{c^{(j)}}_t \right], \tag{17}$$

where $\rho_t = \pi_{\theta}(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and $\epsilon$ controls the clipping region.

C. Critic Learning and Advantage Estimation

Low-variance generalized advantage estimates (GAE) are obtained from value baselines. The reward critic $V^r_{\psi}$ minimizes

$$J_r(\psi) = \frac{1}{2|\mathcal{B}|} \sum_{t \in \mathcal{B}} \left( V^r_{\psi}(s_t) - \hat{R}_t \right)^2, \tag{18}$$

with targets $\hat{R}_t = \sum_{\ell=0}^{L-1} \gamma^{\ell} r_{t+\ell} + \gamma^L V^r_{\psi}(s_{t+L})$. Two cost critics $V^{c^{(j)}}_{\nu_j}$ are trained analogously using discounted cost rollouts. Advantages use the GAE recursion, e.g.,

$$\hat{A}^r_t = \sum_{\ell=0}^{L-1} (\gamma \lambda_{\mathrm{GAE}})^{\ell}\, \delta^r_{t+\ell}, \qquad \delta^r_t = r_t + \gamma V^r_{\psi}(s_{t+1}) - V^r_{\psi}(s_t), \tag{19}$$

and similarly for $\hat{A}^{c^{(j)}}_t$. These baselines embed all state variables so that gradients reflect the coupling between human utility, slack, and queuing delays.

D. Dual Updates and Long-Term Guarantees

After each policy update, the dual variables ascend along constraint gradients:

$$\lambda_j \leftarrow \left[ \lambda_j + \eta_{\lambda} \left( \hat{c}^{(j)}_{\mathcal{B}} - d^{(j)} \right) \right]_+, \qquad \hat{c}^{(j)}_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} c^{(j)}(s_t, a_t). \tag{20}$$

Projection onto $\mathbb{R}_+$ enforces non-negativity, and exponential smoothing filters stochastic noise. Convergence of the primal–dual iterates implies that the learned policy satisfies the long-term resource and latency budgets.
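A minimal PyTorch-style sketch of one policy step under (17) and (20) is given below; the tensor shapes, function names, and EMA handling are assumptions for illustration rather than the authors' released training code.

```python
import torch

def tc_ppo_policy_loss(logp_new, logp_old, adv_r, adv_c, lam, eps=0.2):
    """Clipped primal-dual surrogate of Eq. (17), returned as a loss to minimize.

    logp_new/logp_old : (B,) log-probabilities of the sampled actions
    adv_r             : (B,) reward advantages A^r_t
    adv_c             : (2, B) cost advantages A^{c(j)}_t
    lam               : (2,) Lagrange multipliers
    """
    rho = torch.exp(logp_new - logp_old)                  # importance ratio rho_t
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps)
    reward_term = torch.min(rho * adv_r, clipped * adv_r)
    cost_term = (lam[:, None] * adv_c).sum(dim=0)         # sum_j lam_j * A^{c(j)}_t
    return -(reward_term - cost_term).mean()              # ascend L_B -> minimize

def dual_update(lam, c_batch_mean, d, lam_smooth, eta=1e-3, ema=0.9):
    """Projected dual ascent of Eq. (20) with EMA smoothing of the multipliers."""
    lam = torch.clamp(lam + eta * (c_batch_mean - d), min=0.0)  # project onto R_+
    lam_smooth = ema * lam_smooth + (1.0 - ema) * lam           # filter noise
    return lam, lam_smooth
```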
E. Action Shielding for Instantaneous Feasibility

It should be noted that average constraints cannot guarantee per-frame safety, so the near-RT RIC implements an action shield that maps tentative outputs of $\pi_{\theta}$ to a feasible pair in $\mathcal{A}_{\mathrm{feas}}(s_t)$ via a discrete projection: keep the primitive $u_t$ whenever feasible and greedily prune the scheduling mask $b_t$; if no feasible mask exists, back off to the next lighter primitive, repeating until feasibility or NoOp. Concretely, a latency-aware knapsack routine prunes the mask based on residual slack or queue urgency; if still infeasible, the shield backs off to a lighter primitive or NoOp. This ensures per-frame real-time feasibility during both training and deployment.

Algorithm 1 summarizes the training pipeline, highlighting how rollouts, critic learning, dual updates, and action shielding interact within the TC-HITL-RL framework.

Algorithm 1: TC-HITL-RL Constrained PPO
1: Initialize policy parameters θ, reward critic ψ, cost critics ν₁, ν₂, dual multipliers λ₁, λ₂, and an empty buffer D.
2: for each training iteration do
3:   Collect L-frame rollouts using Π_{A_feas}(π_θ) to ensure feasible actions and store transitions in D.
4:   Compute advantages Â^r, Â^{c(j)} and returns R̂, Ĉ^{(j)} from the buffered trajectories.
5:   Update the reward and cost critics via gradient descent on J_r and J_{c(j)}.
6:   Update θ by maximizing L_B(θ, λ) in (17) with PPO-style clipped gradients.
7:   Ascend the dual variables λ_j via (20), followed by exponential moving average (EMA) smoothing.
8:   Refresh the safety-layer latency predictors using new latency observations, if available.
9: end for
10: Deploy the final policy with online shielding for inference.
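The discrete projection can be read as a two-level greedy loop, sketched below under simplifying assumptions: UEs are dropped from the mask in order of increasing urgency, and primitives are tried from the policy's choice down to NoOp. The urgency rule and data layout are illustrative stand-ins for the latency-aware knapsack routine described above.

```python
import numpy as np

# Primitives ordered heaviest to lightest; NoOp is always feasible.
FALLBACK = ["FullRetrain", "FeatRefine", "LightAdapt", "DeployCached", "NoOp"]

def shield(u, b, c_ric, c_e2e, deadlines, t_avail, urgency):
    """Project a tentative (u, b) onto A_feas(s_t), cf. Sec. III-E.

    c_ric/c_e2e map a primitive name to (N,) latency arrays (ms);
    urgency is an (N,) score, higher = kept in the mask longer.
    """
    for prim in FALLBACK[FALLBACK.index(u):]:       # back off toward NoOp
        mask = b & (c_e2e[prim] <= deadlines)       # drop UEs that would miss d_i
        for i in np.argsort(urgency):               # prune least urgent UEs first
            if mask @ c_ric[prim] <= t_avail:       # Eq. (8) window satisfied?
                return prim, mask
            mask[i] = 0
        if mask @ c_ric[prim] <= t_avail:           # re-check after final pruning
            return prim, mask
    return "NoOp", np.zeros_like(b)                 # always-feasible fallback
```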
IV. SIMULATION AND RESULTS

A. Simulation Setup

We simulate a single semantic-aware gNB serving $N \in \{8, 16\}$ UEs with heterogeneous deadlines and backlog. Each 10 ms interval samples a numerology $\mu \in \{0, 1, 2\}$, grants mini-slot resources, and exposes the CMDP state variables of Sec. II; all radio, latency, and learning hyperparameters are fixed across runs and listed in Table I, with configuration files released alongside the code.

The agent selects among the primitives in Sec. III-A (FullRetrain, FeatRefine, LightAdapt, DeployCached, NoOp), while the safety shield of Sec. III-E prunes infeasible UE masks and falls back along that order until per-user timing constraints are met. Latency components follow Sec. II but inject stochastic congestion (backlog-driven) and fading penalties, and human feedback is drawn from a noisy oracle that rewards semantic quality yet penalises tardy updates, shaping the CMDP reward/cost landscape.

For learning, the constrained PPO agent employs two shared-hidden-layer MLPs for the policy and critics; the associated optimization hyperparameters follow the values listed in Table I.

Fig. 2: Training reward trajectories (mean ± std; cumulative reward per rollout vs. iteration) for DQN, PPO, Random, and TC-PPO. (a) corresponds to N = 8, (b) to N = 16.

Fig. 3: Average communication overhead and RIC processing budgets (ms/frame vs. iteration) consumed during training. (a)–(b) correspond to N = 8 and (c)–(d) to N = 16.

TABLE I: Simulation parameters

Topology: N ∈ {8, 16} UEs with deadlines d_i ∈ [6, 12] ms.
Radio resources: scenario-specific numerology µ ∈ {0, 1, 2}; mini-slot-like grants with n_sym ∈ {2, 4, 7} symbols; Gaussian T_ctrl for HARQ/CSI overhead.
Primitive latencies (ms): RIC {5.0, 2.8, 1.1, 1.5, 0}; end-to-end {8.4, 5.0, 2.4, 3.1, 0.1} for (FullRetrain, FeatRefine, LightAdapt, DeployCached, NoOp).
Semantic gain: means {0.028, 0.019, 0.011, 0.014, 0} for the above primitives; compute penalties χ(u) ∈ {2.7, 1.6, 0.7, 0.9, 0}.
Learning setup: two shared hidden layers (256/128, ReLU); learning rate 3 × 10⁻⁴; ε = 0.2; λ_GAE = 0.95; γ = 0.99; dual step size 10⁻³ with EMA 0.9; rollout length L = 64; mini-batch size 256; 120 updates.
Seeds: five seeds per agent (42–46).

We benchmark TC-PPO against three baselines implemented in the shared simulation harness: (i) an unconstrained PPO variant with deactivated multipliers, (ii) a discrete-action DQN scheduler whose actions index primitive–mask templates built from slack/backlog statistics, and (iii) a random feasible scheduler. Unless noted, all results aggregate five seeds per agent (seeds 42–46) and report mean trajectories with standard-error bands. To keep the DQN baseline from collapsing into near-idle schedules, we introduce a light under-utilisation penalty that subtracts reward whenever the reported communication or RIC budgets fall below preset service targets, ensuring that every agent is evaluated under comparable minimum service levels.
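A minimal sketch of how such a regulariser can be wired into the baseline's reward is shown below; the target values and the 0.1 weight are illustrative assumptions rather than the paper's tuned settings.

```python
def under_utilisation_penalty(reward, comm_ms, ric_ms,
                              comm_target=1.0, ric_target=0.5, weight=0.1):
    """Charge any shortfall of per-frame budgets below preset service targets.

    Prevents a scheduler from scoring well simply by idling: if the reported
    communication or RIC usage falls below its target, the gap is subtracted
    from the reward (all budgets in ms/frame).
    """
    shortfall = max(comm_target - comm_ms, 0.0) + max(ric_target - ric_ms, 0.0)
    return reward - weight * shortfall
```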
B. Results

Fig. 2 compares the mean training reward for $N \in \{8, 16\}$ users over five seeds. The unconstrained PPO and the constrained TC-PPO converge within approximately 200 iterations and sustain the highest semantic utility. The DQN baseline, which lacks dual updates and only selects among predefined schedule templates, plateaus below the PPO variants even though it now activates multi-user masks. The random scheduler remains well below the learning-based methods. These traces emphasise that the DQN regulariser enforces comparable service levels, so the observed reward gap stems from the absence of CMDP guidance rather than from artificially low utilisation.

Fig. 3 reports the associated air-interface overhead $C^{\mathrm{fb}} + C^{\mathrm{tx}}$ and the near-RT RIC processing time $C^{\mathrm{RIC}}$. When $N = 8$, PPO/TC-PPO opportunistically alternate between FullRetrain and lightweight primitives whenever slack permits, producing larger error bars but unlocking higher reward. At $N = 16$, the same agents must remain in heavy-update mode and the overhead curves stabilise. DQN, driven by the utilisation penalty, now consumes communication time comparable to TC-PPO while still lagging in reward, whereas the random scheduler remains the only option with consistently low overhead. The right-hand panels further show that TC-PPO tracks PPO's average RIC consumption with visibly tighter bands, indicating that enforcing latency constraints also stabilises near-RT compute demand.

To quantify the deployment stability of each agent, Fig. 4 aggregates 30 inference episodes after training and visualizes the variability of the per-episode metrics. TC-PPO matches the reward of PPO at both user densities while keeping air-interface and RIC overhead within comparable or tighter dispersion bands; all episodes satisfy the deadline constraints (hit rate = 1). DQN and Random deliver lower reward/utility, and the former expends resource budgets similar to TC-PPO because of the minimum-service regulariser. These results confirm that TC-PPO attains the semantic benefit of PPO with predictable resource usage during deployment.

Fig. 4: Inference-phase variability across 30 evaluation episodes (mean, SE, and 95th percentile) for N ∈ {8, 16}. (a) Communication and RIC resource dispersion; (b) reward and utility. TC-PPO matches PPO's reward while stabilizing resource usage.

Ablation study. We further dissect the framework by disabling four key components: (i) removing the safety shield and relying solely on the average CMDP constraints, (ii) removing the cost critics so that timing penalties are injected directly into the reward, (iii) freezing the dual multipliers to a constant penalty, and (iv) reversing the shield fallback order (Light → Feat → Full). Fig. 5 summarises the evaluation-phase averages (five seeds, N = 8). Eliminating the shield notably reduces communication/RIC budgets but yields the lowest reward because the dual controller alone cannot prevent aggressive updates. Penalty-only learning converges but exhibits higher resource variance than the dual formulation, whereas fixed multipliers over-constrain the policy. Reversing the shield order prioritises lightweight updates and trades reward for reduced overhead, underscoring the complementary nature of average constraints and instantaneous feasibility.

Fig. 5: Ablation comparison for N = 8 (mean over five seeds): reward, air-interface overhead, and RIC processing across the TC-PPO, No Shield, No Cost Critics, Fixed Penalty, and Shield Order variants.

V. DISCUSSION AND CONCLUSION

TC-HITL-RL shows that semantic adaptation can satisfy both average and per-frame latency requirements without sacrificing reward, yet several assumptions merit future work. Reliable and low-latency uplink feedback is presumed; deploying in congested environments will require prioritized bearers, lightweight compression, and deadline-aware buffering. Our evaluation focuses on a single cell, so multi-cell and cooperative-edge scenarios, with coupled RICs and shared fronthaul limits, remain open problems.
Finally, we collect scalar human preferences; richer multi-dimensional feedback with confidence scores could further sharpen the CMDP state. Addressing these limitations, along with hardware-in-the-loop validation on NR testbeds, forms our next research agenda.

Within this scope we presented a TC-HITL-RL framework that casts semantic broadcasting as a CMDP and solves it with a primal–dual PPO agent plus an action shield. The resulting policy maintains PPO-level reward while tightening the dispersion of air-interface and RIC resources; ablations confirm the complementary impact of dual updates, shielding, and primitive design. These findings suggest that principled CMDP control is a promising path toward deployable, latency-aware semantic communication.

ACKNOWLEDGMENT

This work is supported by the 6G-GOALS project under the 6G SNS-JU Horizon program, no. 101139232.

REFERENCES

[1] Q. Lan et al., "What is Semantic Communication? A View on Conveying Meaning in the Era of Machine Intelligence," Journal of Communications and Information Networks, vol. 6, no. 4, pp. 336–371, 2021.
[2] E. C. Strinati et al., "Goal-Oriented and Semantic Communication in 6G AI-Native Networks: The 6G-GOALS Approach," in Proc. of 2024 EuCNC/6G Summit, 2024, pp. 1–6.
[3] X. Lin, P. Li, and A. Aijaz, "RL-Driven Semantic Compression Model Selection and Resource Allocation in Semantic Communication Systems," IEEE PIMRC 2025, arXiv preprint arXiv:2506.18660, 2025.
[4] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, "Deep Joint Source-Channel Coding for Wireless Image Transmission," IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 3, pp. 567–579, 2019.
[5] D. Gündüz et al., "Joint Source–Channel Coding: Fundamentals and Recent Progress in Practical Designs," Proceedings of the IEEE, 2024.
[6] N. Lambert, "Reinforcement Learning from Human Feedback," arXiv preprint arXiv:2504.12501, 2025.
[7] C. O. Retzlaff et al., "Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities," Journal of Artificial Intelligence Research, vol. 79, pp. 359–415, 2024.
[8] P. Li, X. Lin, and A. Aijaz, "Building the Self-Improvement Loop: Error Detection and Correction in Goal-Oriented Semantic Communications," in Proc. of IEEE CSCN, 2024, pp. 294–300.
[9] P. Li and A. Aijaz, "Task-Oriented Connectivity for Networked Robotics with Generative AI and Semantic Communications," in Proc. of IEEE INFOCOM Workshops (INFOCOM WKSHPS), 2025, pp. 1–6.
[10] E. Altman, Constrained Markov Decision Processes. Routledge, 2021.
[11] J. Achiam et al., "Constrained Policy Optimization," in Proc. of the International Conference on Machine Learning (ICML). PMLR, 2017, pp. 22–31.
[12] J. Schulman et al., "Proximal Policy Optimization Algorithms," arXiv preprint arXiv:1707.06347, 2017.
[13] P. Li et al., "RLOps: Development Life-Cycle of Reinforcement Learning Aided Open RAN," IEEE Access, vol. 10, pp. 113808–113826, 2022.
[14] H. Li et al., "Toward Practical Operation of Deep Reinforcement Learning Agents in Real-World Network Management at Open RAN Edges," IEEE Communications Magazine, 2025.
[15] P. Li and A. Aijaz, "Open RAN Meets Semantic Communications: A Synergistic Match for Open, Intelligent, and Knowledge-Driven 6G," in Proc. of IEEE CSCN, 2023, pp. 87–93.