SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning
Chenyu Ge
University of Southern California, Los Angeles, CA 90089, USA (gechenyu@usc.edu)

Abstract

Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process in which an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on the Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.

Keywords: Neural Radiance Fields, Reinforcement Learning, Soft Actor-Critic, Adaptive Sampling

1 Introduction

Neural Radiance Fields (NeRF) [1] have revolutionized the field of novel view synthesis by representing three-dimensional scenes as continuous 5D functions that map spatial coordinates and viewing directions to volumetric density and view-dependent color.
Through differentiable volumetric rendering and coordinate-based neural networks with positional encoding, NeRF achieves photorealistic quality that surpasses traditional geometry-based and image-based rendering methods. This breakthrough has enabled numerous downstream applications, including robotics navigation, scene understanding for autonomous driving, virtual and augmented reality content creation, and digital preservation of cultural heritage.

However, the computational efficiency of NeRF remains a critical challenge that severely limits its practical deployment. The volumetric rendering integral must be approximated through numerical quadrature, requiring the neural network to be evaluated at 192-384 sample points along each camera ray. For a single 800×800 image with 200 samples per ray, the original NeRF MLP (8 layers × 256 hidden units) requires on the order of $10^{11}$ multiply-accumulate operations, taking several seconds even on modern GPUs. This computational bottleneck makes real-time rendering infeasible and prevents NeRF from being used in interactive applications or deployed on resource-constrained edge devices.

The fundamental inefficiency stems from NeRF's sampling strategy. The original NeRF employs a coarse-to-fine hierarchical sampling approach: first sampling uniformly along the ray, then refining sample positions based on coarse density predictions. While this two-stage strategy concentrates more samples in high-density regions, it still relies on hand-crafted heuristics rather than learning scene-specific characteristics. Following [4], we define a sample's contribution as its rendering weight $w_i = T_i \alpha_i$. Analysis on Synthetic-NeRF shows that a majority of samples have $w_i < 0.01$ and contribute negligibly to the final rendered color: they either fall in empty space with near-zero density or lie beyond occluding surfaces where the accumulated transmittance has decayed.
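To make the low-weight observation concrete, the following is a minimal NumPy sketch of the rendering weight $w_i = T_i \alpha_i$; the density profile is an illustrative toy example (not data from the paper), showing how samples in empty space and behind an occluding surface receive negligible weight:

```python
import numpy as np

def rendering_weights(sigma, delta):
    """Per-sample volume rendering weights w_i = T_i * alpha_i for one ray."""
    alpha = 1.0 - np.exp(-sigma * delta)                   # termination prob. at each sample
    # T_i = prod_{j<i} (1 - alpha_j): transmittance surviving up to sample i
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    return T * alpha

# Toy ray: three samples in empty space, a dense surface, then occluded samples.
sigma = np.array([0.0, 0.0, 0.0, 50.0, 50.0, 0.5, 0.5, 0.5])
delta = np.full(8, 0.1)
w = rendering_weights(sigma, delta)
low = int((w < 0.01).sum())   # samples contributing negligibly to the pixel color
```

On this profile, only the first sample inside the surface receives substantial weight; the empty-space and occluded samples all fall below the 0.01 threshold.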
This observation suggests that learned adaptive sampling policies could concentrate computational resources where they provide the most value.

Recent acceleration efforts have pursued several complementary directions. Explicit data structure approaches [11, 12] replace MLPs with multi-resolution hash grids or voxel grids, achieving substantial speedups through efficient spatial queries. 3D Gaussian splatting methods [8-10] abandon volumetric rendering entirely in favor of explicit point-based representations with fast rasterization. While these methods achieve impressive rendering speeds, they sacrifice NeRF's elegant continuous representation and often require substantial memory. An orthogonal direction is to optimize the sampling process itself: reducing the number of network evaluations required per ray while maintaining rendering quality.

We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies for efficient neural radiance field rendering. Unlike prior heuristic approaches that rely on hand-designed rules [5, 13, 14], our method learns sampling strategies end-to-end through direct interaction with the rendering process. We formulate adaptive sampling as a Markov Decision Process (MDP) in which an RL agent sequentially decides where to place samples along each ray based on observed scene properties and rendering objectives. We employ Soft Actor-Critic (SAC) [16], a state-of-the-art off-policy RL algorithm, incorporating recent algorithmic improvements [18] that enhance stability and sample efficiency in continuous action spaces.

Our approach introduces several key technical innovations that make RL-based adaptive sampling practical and effective.
First, we extend NeRF's color prediction with a Gaussian mixture distribution model that naturally provides uncertainty estimates: regions with high predicted variance indicate where additional samples could reduce rendering uncertainty. Second, we design a multi-component reward function that balances three objectives: maximizing rendering quality (PSNR), minimizing unnecessary samples, and maintaining spatial smoothness in sample placement. Third, we develop an enhanced state representation that combines local features from current samples with global geometric priors, enabling the policy to make informed decisions. Finally, we employ a two-stage training strategy that first pre-trains the NeRF model to stabilize the environment, then trains the RL policy with a fixed NeRF backbone, addressing the challenging non-stationarity that arises from jointly optimizing both components.

Our key contributions are:

• SAC-NeRF Framework: an RL-based adaptive sampling system for neural radiance fields, demonstrating learning capability and potential efficiency gains over heuristic baselines.

• Mixture Distribution Color Model: a principled extension of NeRF that outputs Gaussian mixture distributions, providing uncertainty quantification to guide intelligent sampling decisions (Section 3.3).

• Multi-Component Reward Design: a reward function that balances rendering quality, sampling efficiency, and spatial consistency, enabling stable policy learning (Section 3.5).

• Two-Stage Training Strategy: a training methodology that addresses environment non-stationarity by decoupling NeRF pre-training and policy optimization (Section 3.6).

• Comprehensive Evaluation: experiments on the Synthetic-NeRF and LLFF datasets with ablation studies validating each component's contribution (Section 4).
Our work demonstrates that reinforcement learning provides a viable approach to adaptive sampling in neural rendering, opening research directions for applying learned optimization strategies to other computationally intensive components of neural scene representations.

2 Related Work

2.1 Neural Radiance Fields and Acceleration

The original NeRF [1] represents scenes using multi-layer perceptrons (MLPs) with sinusoidal positional encoding to map 5D coordinates (3D position plus 2D viewing direction) to volume density and view-dependent color. This coordinate-based representation enables continuous scene modeling and achieves photorealistic novel view synthesis through differentiable volumetric rendering. However, rendering a single image requires millions of network evaluations, taking several seconds even on modern GPUs.

Recent research has pursued multiple complementary acceleration strategies. Explicit data structures replace or augment MLPs with spatially organized features for efficient queries. Instant-NGP and its variants [11] achieve real-time rendering through multi-level hash feature grids combined with lightweight MLPs. NGP-RT [11] further optimizes this design with fused hash features and attention mechanisms, achieving 108 fps. Compression studies [12] explore the limits of quantizing and pruning these hash-based representations while preserving quality. FrugalNeRF [7] enables fast convergence with few-shot inputs by leveraging geometric priors and regularization.

Explicit point-based representations abandon implicit volumetric rendering entirely. 3D Gaussian Splatting methods represent scenes as collections of oriented 3D Gaussian primitives that can be efficiently rasterized. DashGaussian [8] achieves full scene optimization in approximately 200 seconds through carefully designed initialization and optimization strategies.
FlashGS [9] scales Gaussian splatting to large-scale, high-resolution rendering with hierarchical culling and level-of-detail management. 3D-HGS [10] introduces half-Gaussian kernels to better model geometric discontinuities and sharp edges. While these explicit methods achieve impressive speeds, they require substantial memory for millions of primitives and sacrifice the compact continuous representation that makes NeRF appealing. Our work takes an orthogonal approach, optimizing the sampling process itself: we maintain NeRF's continuous representation while reducing network evaluations through learned adaptive policies.

2.2 Sampling Optimization for Neural Rendering

Sampling strategies directly determine NeRF's computational cost, making sampling optimization a critical research direction. The original hierarchical sampling in NeRF uses a coarse network to predict importance weights, then concentrates fine samples in high-density regions. While effective, this still requires evaluating two separate networks at numerous points.

DONeRF [4] proposes a depth oracle network that predicts ray sample locations with a single network evaluation, achieving up to 48× inference cost reduction. AdaNeRF [5] introduces a dual-network architecture that learns to reduce sample counts through joint training of sampling and shading networks. NerfAcc [6] provides a unified framework investigating multiple sampling approaches under the concept of transmittance estimators, achieving 1.5-20× training speedups. More recent work has proposed additional sampling heuristics. The probability-guided sampler [13] models density probability distributions in 3D projection space, enabling more targeted ray sampling by avoiding empty regions. MBS-NeRF [14] addresses motion blur in neural rendering through depth-constrained adaptive sampling. MFNeRF [15] introduces memory-efficient mixed-feature hash tables that reduce storage while maintaining quality.
All of these methods rely on hand-crafted heuristics based on geometric or statistical assumptions. In contrast, our approach learns sampling policies end-to-end through reinforcement learning, allowing the system to discover strategies adapted to specific scene characteristics and rendering objectives without manual design.

2.3 Reinforcement Learning for Continuous Control

Soft Actor-Critic (SAC) [16] is a state-of-the-art off-policy RL algorithm for continuous control that maximizes both expected return and policy entropy. The entropy regularization encourages exploration and improves robustness. Recent advances have addressed various limitations of the original SAC algorithm. Corrected SAC [18] identifies and fixes action distribution distortion issues in the original formulation, improving sample efficiency and final performance. Bayesian SAC [19] incorporates directed acyclic strategy graphs for better credit assignment in complex tasks. Bidirectional SAC [20] leverages both forward and reverse KL divergence for more stable policy updates. The Broad Critic framework [17] combines broad learning systems with deep networks to improve value function approximation.

We adapt these algorithmic advances to the neural rendering domain, particularly leveraging Corrected SAC's improvements for stable policy learning. However, adaptive sampling for NeRF presents unique challenges: the environment (the NeRF network) is non-stationary during joint training, rewards are computed from high-dimensional rendering outputs, and the state space must encode both local sample information and global geometric context.

2.4 From Discrete Feature Selection to Continuous Budget Allocation

A recurring theme in machine learning is budget-aware selection: identifying which parts of the input deserve computation and which can be ignored with minimal loss.
Feature selection provides a classic instantiation of this idea, typically combining ranking criteria (e.g., Fisher score) with iterative elimination schemes (e.g., recursive feature elimination) to remove redundant dimensions in high-dimensional models while preserving predictive performance [21, 22].

We draw a direct analogy to NeRF rendering: ray samples can be viewed as "features" used to estimate the rendered color. However, unlike standard feature selection over discrete variables, sampling decisions in NeRF take place over a continuous one-dimensional domain (ray depth). Reinforcement learning provides a natural mechanism for continuous, sequential budget allocation, deciding where additional network evaluations are most valuable and where they are wasteful, without relying on hand-designed heuristics.

3 Method

3.1 Preliminaries: Neural Radiance Fields

NeRF represents a 3D scene as a continuous volumetric function $F_\Theta : (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma)$ implemented by a multi-layer perceptron (MLP) with parameters $\Theta$. This function maps a 3D spatial position $\mathbf{x} \in \mathbb{R}^3$ and a 2D viewing direction $\mathbf{d} \in \mathbb{S}^2$ to an RGB color $\mathbf{c} \in \mathbb{R}^3$ and a volume density $\sigma \in \mathbb{R}^+$. The density $\sigma(\mathbf{x})$ represents the differential probability of a ray terminating at an infinitesimal particle at position $\mathbf{x}$, while the color $\mathbf{c}(\mathbf{x}, \mathbf{d})$ captures view-dependent appearance effects such as specularities.

To render the color observed along a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ (where $\mathbf{o}$ is the camera origin and $t$ the distance parameter), NeRF uses classical volume rendering. The expected color is computed by integrating color and density along the ray:

    C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt,    (1)

where $t_n$ and $t_f$ are the near and far bounds of the scene, and $T(t)$ is the accumulated transmittance, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any particle:

    T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right).    (2)
In practice, this continuous integral must be approximated numerically. Standard NeRF uses stratified sampling to select $N$ discrete points $\{t_i\}_{i=1}^{N}$ along the ray, then applies quadrature:

    \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i, \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j), \quad \alpha_i = 1 - \exp(-\sigma_i \delta_i),    (3)

where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples, $\alpha_i$ is the probability of the ray terminating at sample $i$, and $\mathbf{c}_i = \mathbf{c}(\mathbf{r}(t_i), \mathbf{d})$ and $\sigma_i = \sigma(\mathbf{r}(t_i))$ are the queried color and density.

The rendering weight $w_i = T_i \alpha_i$ represents the contribution of sample $i$ to the final color. High-quality rendering requires many samples (the original NeRF uses 64 samples for the coarse network and 128 for the fine network, 192 in total), but analysis shows that most samples have negligible weights ($w_i \approx 0$), motivating our adaptive sampling approach.

3.2 MDP Formulation for Adaptive Sampling

We formulate the adaptive sampling problem as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ in which an RL agent iteratively refines sample positions along each ray to maximize rendering quality while minimizing computation.

State Space $\mathcal{S}$: The state $s_t$ at iteration $t$ must encode all information relevant to deciding where to place samples. We design a rich state representation comprising:

• Per-sample features: for each of the $N$ current sample positions $t_i$, we include (1) the normalized position $t_i / (t_f - t_n)$ along the ray, (2) the predicted RGB color $\mathbf{c}_i \in \mathbb{R}^3$, (3) the volume density $\sigma_i \in \mathbb{R}^+$, (4) the accumulated transmittance $T_i$, indicating how much light reaches this sample, and (5) the rendering weight $w_i = T_i \alpha_i$, this sample's contribution to the final color.

• Ray geometry: global features including the ray origin $\mathbf{o}$, direction $\mathbf{d}$, near/far bounds $[t_n, t_f]$, and the pixel coordinates $(u, v)$ in the image.
• Uncertainty estimates: the predicted variance from our mixture distribution model (Section 3.3), indicating regions where additional samples could reduce uncertainty.

• Historical context: to enable temporal reasoning, we maintain an aggregated representation of sample placements from previous iterations, using a lightweight attention mechanism that pools features from earlier states.

The full state has dimension $|\mathcal{S}| = N \times 9 + 11 + K \times 3$, where $K = 3$ is the number of mixture components.

Action Space $\mathcal{A}$: We use continuous actions $\mathbf{a} \in [-1, 1]^N$, where each component $a_i$ specifies a relative adjustment to sample position $t_i$. Actions are mapped to valid positions via

    t_i^{\mathrm{new}} = \mathrm{clip}(t_i + a_i \cdot \Delta_{\max},\; t_n,\; t_f),    (4)

where $\Delta_{\max} = 0.1 \cdot (t_f - t_n)$ limits the maximum adjustment per step. To maintain valid sample ordering, we apply post-hoc sorting, $\{t_i^{\mathrm{new}}\} \leftarrow \mathrm{sort}(\{t_i^{\mathrm{new}}\})$, and enforce a minimum spacing $\delta_{\min} = 0.001 \cdot (t_f - t_n)$ between adjacent samples. The sorting operation is not differentiable, but because SAC uses stochastic policies with entropy regularization, the non-differentiability at the clip and sort boundaries has minimal impact on training stability.

Transition Dynamics $\mathcal{P}$: Given the NeRF parameters $\Theta$ (fixed during policy training in Stage 2), the transition is deterministic: new sample positions deterministically yield new network outputs $(\mathbf{c}_i^{\mathrm{new}}, \sigma_i^{\mathrm{new}})$ through forward passes. However, from the policy's perspective during joint training, the environment appears non-stationary as $\Theta$ evolves, motivating our two-stage approach.

Reward Function $\mathcal{R}$: Designed to balance quality and efficiency (detailed in Section 3.5).

Discount Factor $\gamma$: Set to 0.99 to balance immediate rendering improvements against long-term sample efficiency. This choice reflects a preference for policies that make steady progress rather than optimizing only for immediate gains.
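The action-to-position mapping of Eq. (4), followed by sorting and minimum-spacing enforcement, can be sketched as follows; the function and variable names are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def apply_action(t, a, t_n, t_f, dmax_frac=0.1, dmin_frac=0.001):
    """Map policy actions a in [-1, 1]^N to valid, ordered sample positions."""
    span = t_f - t_n
    t_new = np.clip(t + a * dmax_frac * span, t_n, t_f)   # bounded relative shift (Eq. 4)
    t_new = np.sort(t_new)                                # restore monotone ordering
    d_min = dmin_frac * span
    # Enforce minimum spacing between adjacent samples (may be relaxed at the far bound).
    for i in range(1, len(t_new)):
        t_new[i] = min(max(t_new[i], t_new[i - 1] + d_min), t_f)
    return t_new

t = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
a = np.array([0.5, -0.5, 0.0, 0.0, 1.0, -1.0, 0.2, -0.2])
t2 = apply_action(t, a, t_n=2.0, t_f=6.0)
```

Note that clamping at the far bound can, in the worst case, compress the last few intervals below the minimum spacing; a production implementation would need a policy for that edge case.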
3.3 Mixture Distribution Color Model

Standard NeRF outputs deterministic point estimates for colors, providing no quantification of prediction uncertainty. However, uncertainty information is crucial for guiding adaptive sampling: regions with high uncertainty benefit most from additional samples. To address this, we extend NeRF to output probabilistic color predictions via a Gaussian mixture model (GMM). Specifically, instead of outputting a single RGB color $\mathbf{c} \in \mathbb{R}^3$, our modified NeRF network predicts a distribution over colors:

    p(\mathbf{c} \mid \mathbf{x}, \mathbf{d}) = \sum_{k=1}^{K} \pi_k(\mathbf{x}, \mathbf{d})\, \mathcal{N}\!\left(\mathbf{c};\; \mu_k(\mathbf{x}, \mathbf{d}),\; \mathrm{diag}(\sigma_k^2(\mathbf{x}, \mathbf{d}))\right),    (5)

where $K$ is the number of mixture components, $\pi_k$ are mixture weights (satisfying $\sum_k \pi_k = 1$, $\pi_k \geq 0$), $\mu_k \in \mathbb{R}^3$ are mean colors, and $\sigma_k^2 \in \mathbb{R}^3$ are per-channel variances.

Implementing this requires the network to output $K \times 7$ additional parameters per sample: 3 for $\mu_k$, 3 for $\sigma_k^2$, and 1 for the (pre-softmax) weight. We use a softplus activation for the variances to ensure positivity and a softmax over the mixture weights to ensure valid probabilities. The network adds a small auxiliary head ($K \times 7$ output neurons) alongside the standard color and density outputs, introducing minimal computational overhead. The expected color and total uncertainty are

    E[\mathbf{c}] = \sum_{k=1}^{K} \pi_k \mu_k, \qquad \mathrm{Var}[\mathbf{c}] = \underbrace{\sum_{k=1}^{K} \pi_k \sigma_k^2}_{\text{aleatoric}} + \underbrace{\sum_{k=1}^{K} \pi_k \left(\mu_k - E[\mathbf{c}]\right)^2}_{\text{epistemic}}.    (6)

The variance decomposes into aleatoric uncertainty (inherent randomness within each component) and epistemic uncertainty (disagreement between components). High-uncertainty regions, such as texture edges, specular highlights, or uncertain geometry, indicate where additional samples could most improve rendering accuracy. During rendering, we use the expected color $E[\mathbf{c}]$ for the final image, while the variance $\mathrm{Var}[\mathbf{c}]$ enters the RL state to guide sampling.
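The moment computation in Eq. (6) is straightforward to implement; a minimal NumPy sketch with illustrative mixture parameters (not values from the paper):

```python
import numpy as np

def gmm_color_moments(pi, mu, var):
    """Expected color and total variance of a K-component diagonal Gaussian mixture.

    pi: (K,) mixture weights; mu: (K, 3) mean colors; var: (K, 3) per-channel variances.
    """
    mean = (pi[:, None] * mu).sum(axis=0)                      # E[c]
    aleatoric = (pi[:, None] * var).sum(axis=0)                # within-component variance
    epistemic = (pi[:, None] * (mu - mean) ** 2).sum(axis=0)   # between-component spread
    return mean, aleatoric + epistemic

pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.8, 0.2, 0.1],
               [0.7, 0.3, 0.1],
               [0.2, 0.2, 0.9]])
var = np.full((3, 3), 1e-3)
mean, total_var = gmm_color_moments(pi, mu, var)
```

Here the red channel, where the third component disagrees strongly with the others, picks up a large epistemic term, which is exactly the signal the sampler uses.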
We found that $K = 3$ components provide a good balance between expressiveness and computational cost. Training uses a negative log-likelihood loss that encourages the mixture to fit the observed colors while regularizing against overly confident predictions.

3.4 SAC-based Policy Learning

We employ Soft Actor-Critic (SAC) [16], a state-of-the-art off-policy RL algorithm designed for continuous control, to learn our adaptive sampling policy. SAC is particularly well suited to our problem due to its sample efficiency, stability, and principled exploration through entropy regularization.

The SAC objective maximizes not only expected cumulative reward but also policy entropy, encouraging exploration and preventing premature convergence to suboptimal deterministic policies:

    J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \right],    (7)

where $\rho_\pi$ is the state-action distribution induced by policy $\pi$, $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$ is the policy entropy, and $\alpha > 0$ is the temperature parameter controlling the exploration-exploitation tradeoff.

The policy $\pi_\phi$ with parameters $\phi$ outputs Gaussian distribution parameters:

    a \sim \pi_\phi(\cdot \mid s) = \mathcal{N}\!\left(\mu_\phi(s),\; \mathrm{diag}(\sigma_\phi^2(s))\right),    (8)

where the mean $\mu_\phi(s)$ and standard deviation $\sigma_\phi(s)$ are computed by a neural network. Actions are sampled using the reparameterization trick, $a = \mu_\phi(s) + \sigma_\phi(s) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, enabling backpropagation through sampling.

To improve stability, SAC maintains twin Q-networks $Q_{\theta_1}$ and $Q_{\theta_2}$ and uses their minimum for bootstrapping, reducing overestimation bias:

    y = r + \gamma \left( \min_{i=1,2} Q_{\theta_i'}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right),    (9)

where $a' \sim \pi_\phi(\cdot \mid s')$ is sampled from the current policy and $\theta_i'$ are target network parameters updated via an exponential moving average, $\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i'$ with $\tau = 0.005$.
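The soft Bellman target of Eq. (9) and the target-network update can be sketched as follows (scalar inputs for clarity; a real implementation would operate on batched tensors):

```python
import numpy as np

def sac_target(r, gamma, q1_next, q2_next, logp_next, alpha):
    """Soft Bellman target y = r + gamma * (min_i Q_i(s', a') - alpha * log pi(a'|s'))."""
    soft_q = np.minimum(q1_next, q2_next) - alpha * logp_next  # twin-Q min + entropy bonus
    return r + gamma * soft_q

def ema_update(target_param, online_param, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return tau * online_param + (1.0 - tau) * target_param

y = sac_target(r=1.0, gamma=0.99, q1_next=5.0, q2_next=4.0, logp_next=-1.2, alpha=0.2)
```

Taking the minimum of the two Q-estimates makes the target pessimistic, while the $-\alpha \log \pi$ term rewards staying stochastic.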
The Q-networks are trained to minimize the Bellman error

    L_Q(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( Q_{\theta_i}(s, a) - y \right)^2 \right],    (10)

where $\mathcal{D}$ is the experience replay buffer storing transitions from interaction with the environment. The policy is updated to maximize the expected Q-value while maintaining high entropy:

    L_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi} \left[ \alpha \log \pi_\phi(a \mid s) - \min_{i=1,2} Q_{\theta_i}(s, a) \right].    (11)

We use automatic temperature adjustment [16] to maintain a target entropy, adapting $\alpha$ during training to balance exploration and exploitation.

Network Architecture: The policy network $\pi_\phi$ is a 3-layer MLP with hidden dimensions [256, 256], ReLU activations, and layer normalization. The output layer splits into two heads: one for the means $\mu_\phi(s)$ (tanh activation) and one for the log standard deviations (unbounded). The twin Q-networks $Q_{\theta_1}, Q_{\theta_2}$ have identical architectures, taking the concatenation $[s; a]$ as input and outputting scalar Q-values. All networks use Xavier initialization and are trained with the Adam optimizer (learning rate $3 \times 10^{-4}$).

3.5 Multi-Component Reward Design

Designing an effective reward function is crucial for RL-based adaptive sampling. The reward must balance multiple competing objectives: maximizing rendering quality, minimizing unnecessary computation, and maintaining reasonable sample distributions. We propose a multi-component reward combining three carefully designed terms:

    R = \lambda_q R_{\mathrm{quality}} + \lambda_e R_{\mathrm{efficiency}} + \lambda_c R_{\mathrm{consistency}}.    (12)

Quality Reward $R_{\mathrm{quality}}$: The primary objective is rendering fidelity. We reward improvements in Peak Signal-to-Noise Ratio (PSNR) relative to the ground truth:

    R_{\mathrm{quality}} = \mathrm{PSNR}_{\mathrm{curr}} - \mathrm{PSNR}_{\mathrm{prev}},    (13)

where $\mathrm{PSNR}_{\mathrm{curr}}$ is computed from the current sample positions and $\mathrm{PSNR}_{\mathrm{prev}}$ from the previous iteration. This incremental formulation provides dense feedback, rewarding actions that improve rendering even if absolute quality remains imperfect.
During evaluation, we compute PSNR against held-out test views.

Efficiency Reward $R_{\mathrm{efficiency}}$: To encourage computational efficiency, we penalize samples that contribute negligibly to the final rendering:

    R_{\mathrm{efficiency}} = -\lambda_{\mathrm{eff}} \sum_{i=1}^{N} \mathbb{I}[w_i < \tau_w],    (14)

where $w_i = T_i \alpha_i$ is the rendering weight of sample $i$ (its contribution to the final pixel color), and $\tau_w = 0.01$ is a threshold below which samples are considered wasteful. This term directly incentivizes the policy to eliminate low-contribution samples, either by moving them to high-density regions or by implicitly reducing the sample count through concentrated placement. We set $\lambda_{\mathrm{eff}} = 0.1$ to balance quality and efficiency.

Consistency Reward $R_{\mathrm{consistency}}$: Without regularization, the policy might place samples erratically, creating large gaps that miss important geometric features. We encourage spatially smooth sample distributions via

    R_{\mathrm{consistency}} = -\sum_{i=1}^{N-2} \left( \delta_{i+1} - \delta_i \right)^2,    (15)

where $\delta_i = t_{i+1} - t_i$ is the inter-sample distance (defined for $i = 1, \ldots, N-1$). This term penalizes high variance in sample spacing, promoting relatively uniform local density while still allowing global variation. The coefficient $\lambda_c = 0.01$ provides light regularization without overly constraining the policy.

The final weighting $\lambda_q = 1.0$, $\lambda_e = 0.1$, $\lambda_c = 0.01$ was determined through a grid search on a validation scene, prioritizing quality while incorporating the efficiency and smoothness objectives. These weights proved robust across the scenes in our experiments.

3.6 Two-Stage Training

To address environment non-stationarity, we adopt a two-stage training procedure.

Stage 1 (NeRF Pre-training): Train NeRF with the mixture model for 100K iterations using the standard photometric loss plus a distribution regularizer:

    L_{\mathrm{NeRF}} = \left\| \hat{C}(\mathbf{r}) - C_{\mathrm{gt}} \right\|^2 + \lambda_{\mathrm{reg}}\, \mathrm{KL}(p \,\|\, p_{\mathrm{prior}}).    (16)
Stage 2 (Policy Optimization): Freeze the NeRF backbone and train only the SAC policy for 200K iterations with experience replay (buffer size $10^6$, batch size 256). This decoupling ensures that the RL agent learns in a stationary environment.

Training-Inference Gap: During training, the quality reward $R_{\mathrm{quality}}$ is computed using ground-truth images from the training set only. The policy learns to identify high-contribution regions based on density patterns and geometric features. At inference time, the learned policy generalizes to novel views without requiring ground truth: it has learned a sampling heuristic from the supervised training signal.

4 Experiments

4.1 Experimental Setup

Datasets:

• Synthetic-NeRF [1]: 8 synthetic scenes, 800×800 resolution, 100 train / 200 test views

• LLFF [2]: 8 real forward-facing scenes

Table 1: Sampling efficiency comparison on Synthetic-NeRF (averaged over 8 scenes). Speedup is theoretical, based on sample count reduction.

| Method          | Samples/Ray | Reduction | Effective Rate | Theoretical Speedup |
|-----------------|-------------|-----------|----------------|---------------------|
| NeRF [1]        | 192         | -         | 18.3%          | 1.0×                |
| DONeRF [4]      | 48          | 75%       | 34.7%          | 2.1×                |
| NerfAcc [6]     | 32          | 83%       | 41.2%          | 2.8×                |
| AdaNeRF [5]     | 64          | 67%       | 38.5%          | 1.9×                |
| SAC-NeRF (Ours) | 100-125     | 35-48%    | 52.3%          | 1.5-1.9×            |

Table 2: Rendering quality comparison. ↑: higher is better; ↓: lower is better.

| Method   | Synthetic-NeRF PSNR↑ | SSIM↑ | LPIPS↓ | LLFF PSNR↑ | SSIM↑ | LPIPS↓ |
|----------|----------------------|-------|--------|------------|-------|--------|
| NeRF     | 31.01                | 0.947 | 0.050  | 26.50      | 0.811 | 0.250  |
| DONeRF   | 29.85                | 0.931 | 0.068  | 25.72      | 0.789 | 0.285  |
| NerfAcc  | 30.21                | 0.938 | 0.061  | 26.01      | 0.798 | 0.268  |
| AdaNeRF  | 30.45                | 0.940 | 0.058  | 26.18      | 0.802 | 0.260  |
| SAC-NeRF | 30.68                | 0.943 | 0.054  | 26.22      | 0.805 | 0.258  |

Baselines: NeRF [1], DONeRF [4], NerfAcc [6], AdaNeRF [5]. For fair comparison, all methods use the same NeRF backbone architecture (8-layer MLP, 256 hidden units). NerfAcc uses a $128^3$ occupancy grid with update frequency 16. AdaNeRF and DONeRF use their official implementations with default hyperparameters.

Metrics: PSNR, SSIM, and LPIPS for rendering quality; samples/ray for sampling efficiency.
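For reference, PSNR for images normalized to $[0, 1]$ follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; a minimal sketch:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.zeros((4, 4, 3))
pred = gt + 0.1        # uniform error of 0.1 gives MSE = 0.01
value = psnr(pred, gt)  # -> 20.0 dB
```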
Speedup is reported as theoretical speedup based on sample count reduction, since actual wall-clock time depends on implementation details. The policy network forward pass adds approximately 0.8 ms of overhead per 1024 rays on an RTX 3090.

Implementation: PyTorch on an NVIDIA RTX 3090. NeRF: Adam optimizer, learning rate $5 \times 10^{-4}$, 100K iterations. SAC: learning rate $3 \times 10^{-4}$, 200K iterations, batch size 256, replay buffer size $10^6$.

Training Protocol: Each scene requires independent training (scene-specific policy). Total training time is about 8 hours per scene (5 h NeRF pre-training + 3 h policy optimization).

4.2 Sampling Efficiency Results

Table 1 shows sampling efficiency results. SAC-NeRF achieves a 35-48% sample reduction with a 52.3% effective sampling rate (vs. 18.3% for the baseline). While more aggressive methods such as NerfAcc achieve higher reduction through explicit occupancy grids, our learned policy achieves its gains without such explicit structures.

4.3 Rendering Quality

Table 2 shows rendering quality results. SAC-NeRF achieves 30.68 dB on Synthetic-NeRF (0.33 dB below the dense baseline) and 26.22 dB on LLFF (0.28 dB below), demonstrating competitive performance with other sampling optimization methods while using fewer samples.

Table 3: Ablation study on Synthetic-NeRF (Lego scene).

| Configuration            | PSNR (dB)     | Samples/Ray | Convergence |
|--------------------------|---------------|-------------|-------------|
| Full SAC-NeRF            | 30.82         | 108         | Stable      |
| w/o Mixture Model        | 30.65 (-0.17) | 118         | Stable      |
| w/o R_efficiency         | 30.79 (-0.03) | 152         | Stable      |
| w/o R_consistency        | 30.58 (-0.24) | 95          | Unstable    |
| w/o Two-stage Training   | 30.12 (-0.70) | 125         | Unstable    |
| PPO instead of SAC       | 30.51 (-0.31) | 115         | Moderate    |
| TD3 instead of SAC       | 30.72 (-0.10) | 112         | Stable      |

Figure 1: Qualitative comparison showing that SAC-NeRF maintains visual quality while reducing samples. From left to right: ground truth, NeRF (192 samples), SAC-NeRF (108 samples). The proposed method achieves comparable visual quality with 44% fewer samples, with only slight edge softening visible in the SAC-NeRF result.
4.4 Ablation Studies

Table 3 presents ablation results validating our design choices:

• Mixture Model: provides +0.17 dB and 10 fewer samples through uncertainty guidance.

• Efficiency Reward: critical for sample reduction (152 samples/ray without it vs. 108 with it).

• Consistency Reward: prevents sampling gaps and improves training stability.

• Two-stage Training: essential for convergence (+0.70 dB).

• SAC vs. alternatives: SAC provides the best quality-stability trade-off.

Reward Weight Sensitivity: We vary $\lambda_e \in \{0.05, 0.1, 0.2, 0.5\}$ while fixing $\lambda_q = 1.0$ and $\lambda_c = 0.01$. The method is robust for $\lambda_e \in [0.05, 0.2]$, with PSNR varying by only 0.14 dB. Setting $\lambda_e > 0.3$ causes excessive sample reduction and quality degradation.

4.5 Qualitative Results

Figures 1 and 2 illustrate qualitative results. SAC-NeRF learns to concentrate samples near surfaces while reducing density in empty regions, achieving efficiency without sacrificing visual quality.

Figure 2: Learned sampling distributions showing adaptive concentration near scene geometry. Top: uniform sampling baseline. Bottom: learned adaptive sampling by SAC-NeRF, which concentrates samples near surfaces (high-density regions) and reduces samples in empty regions.

Figure 3: Training curve showing stable convergence of the RL policy. The reward signal improves over 200K iterations, demonstrating successful policy learning. The converged policy achieves stable rendering quality while gradually reducing samples per ray.

4.6 Training Analysis

Figure 3 shows the training dynamics. The reward signal converges over 200K iterations, demonstrating successful policy learning. PSNR stabilizes while samples/ray gradually decreases.

5 Discussion

Methodological Positioning: SAC-NeRF uses reinforcement learning to learn a sampling heuristic from supervised signals (ground-truth images). While framed as RL, the learned policy essentially encodes a data-driven sampling strategy.
The advantage over hand-crafted heuristics is automatic adaptation to scene-specific characteristics; the limitation is the need for per-scene training.

Comparison to Heuristic Methods: Unlike probability-guided samplers [13] or depth-oracle approaches [4] that use auxiliary networks with fixed architectures, SAC-NeRF learns sampling behavior end-to-end. However, methods like NerfAcc [6] with occupancy grids achieve higher sample reduction with simpler implementations.

Generalization: Current policies are scene-specific, requiring 3 hours of policy training per scene after NeRF pre-training. Cross-scene transfer shows degraded performance (approximately a 2 dB PSNR drop), suggesting the policy overfits to scene-specific density distributions. Future work could explore meta-learning for rapid adaptation.

Computational Trade-offs: The policy network adds ~0.8 ms of overhead per 1024 rays. For a full 800×800 image, this amounts to ~500 ms of total policy overhead, partially offsetting the gains from reduced NeRF evaluations.

Limitations: (1) Scene-specific training limits practical deployment; (2) the RL framework adds complexity compared to simpler heuristics; (3) sample reduction is more modest than with explicit acceleration structures; (4) the method requires ground truth for training rewards.

6 Conclusion

We presented SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies for efficient Neural Radiance Field rendering. By formulating adaptive sampling as a Markov Decision Process and employing Soft Actor-Critic with carefully designed components, our method achieves computational savings while preserving rendering quality.
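The ~500 ms per-image overhead figure in the trade-off discussion follows from the per-batch cost. A quick arithmetic check:

```python
# Arithmetic behind the policy-overhead estimate: ~0.8 ms per 1024-ray
# batch, applied over every ray of a full 800x800 image.

RAYS_PER_BATCH = 1024
OVERHEAD_MS_PER_BATCH = 0.8
W, H = 800, 800

n_rays = W * H                          # 640,000 rays per image
n_batches = n_rays / RAYS_PER_BATCH     # 625 batches
total_overhead_ms = n_batches * OVERHEAD_MS_PER_BATCH

print(total_overhead_ms)  # -> 500.0
```

This matches the ~500 ms quoted in the text, confirming the overhead scales linearly with image resolution.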
Our key technical contributions include: (1) a Gaussian mixture distribution color model that provides uncertainty quantification to guide sampling decisions, (2) a multi-component reward function balancing quality, efficiency, and spatial consistency, (3) an enhanced state representation encoding both local sample features and global geometric context, and (4) a two-stage training strategy that addresses environment non-stationarity by decoupling NeRF pre-training from policy optimization.

Experiments on the Synthetic-NeRF and LLFF datasets demonstrate that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. Our work demonstrates the viability of RL-based adaptive sampling for neural rendering, showing that learned policies can discover efficient strategies that would be difficult to encode through manual heuristics. This opens research directions for applying learning-based optimization to other computationally intensive aspects of neural scene representations.

Future work could explore: (1) cross-scene generalization through meta-learning, (2) integration with modern NeRF variants such as hash-based representations or 3D Gaussian splatting, (3) extension to dynamic scenes, and (4) application to other rendering modalities.

References

[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," in Proc. European Conference on Computer Vision (ECCV), 2020, pp. 405-421.
[2] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines," ACM Trans. Graph., vol. 38, no. 4, pp. 1-14, 2019.
[3] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5828-5839.
[4] T. Neff, P. Stadlbauer, M. Parger, A. Kurz, J. H. Mueller, C. R. A. Chaitanya, A. Kaplanyan, and M. Steinberger, "DONeRF: Towards real-time rendering of compact neural radiance fields using depth oracle networks," Computer Graphics Forum, vol. 40, no. 4, pp. 45-59, 2021.
[5] A. Kurz, T. Neff, Z. Lv, M. Zollhöfer, and M. Steinberger, "AdaNeRF: Adaptive sampling for real-time rendering of neural radiance fields," in Proc. European Conference on Computer Vision (ECCV), 2022, pp. 254-270.
[6] R. Li, H. Gao, M. Tancik, and A. Kanazawa, "NerfAcc: Efficient sampling accelerates NeRFs," in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 18537-18546.
[7] C.-Y. Lin, C.-H. Wu, C.-H. Yeh, S.-H. Yen, C. Sun, and Y.-L. Liu, "FrugalNeRF: Fast convergence for extreme few-shot novel view synthesis without learned priors," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 11227-11238.
[8] Y. Chen, J. Jiang, K. Jiang, X. Tang, Z. Li, X. Liu, and Y. Nie, "DashGaussian: Optimizing 3D Gaussian splatting in 200 seconds," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 11146-11155.
[9] G. Feng, S. Chen, R. Fu, Z. Liao, Y. Wang, T. Liu, Z. Pei, H. Li, X. Zhang, and B. Dai, "FlashGS: Efficient 3D Gaussian splatting for large-scale and high-resolution rendering," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 26652-26662.
[10] H. Li, J. Liu, M. Sznaier, and O. Camps, "3D-HGS: 3D half-Gaussian splatting," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[11] Y. Hu, X. Guo, Y. Xiao, J. Huang, and Y.-J. Liu, "NGP-RT: Fusing multi-level hash features with lightweight attention for real-time novel view synthesis," in Proc. European Conference on Computer Vision (ECCV), 2024, pp. 153-170.
[12] Y. Chen, Q. Wu, W. Zheng, and J. Cai, "How far can we compress Instant-NGP-based NeRF?," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10349-10358.
[13] G. D. Pais, M. Chatterjee, and P. Miraldo, "A probability-guided sampler for neural implicit surface rendering," in Proc. European Conference on Computer Vision (ECCV), LNCS 15080, 2024, pp. 164-181.
[14] C. Gao, Q. Sun, and J. Zhu, "MBS-NeRF: Reconstruction of sharp neural radiance fields from motion-blurred sparse images," Scientific Reports, vol. 15, p. 5275, 2025.
[15] Y. Lee, L. Yang, and D. Fan, "MFNeRF: Memory efficient NeRF with mixed-feature hash table," in Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 2686-2695.
[16] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proc. International Conference on Machine Learning (ICML), 2018, pp. 1861-1870.
[17] S. Thalagala, P. K. Wong, X. Wang, and T. Sun, "Broad critic deep actor reinforcement learning for continuous control," arXiv preprint, 2024.
[18] Y. Chen, X. Zhang, X. Wang, Z. Xu, X. Shen, and W. Zhang, "Corrected soft actor critic for continuous control," arXiv preprint, 2024.
[19] Q. Yang, "Bayesian soft actor-critic: A directed acyclic strategy graph based deep reinforcement learning," in Proc. ACM/SIGAPP Symposium on Applied Computing (SAC), 2024, pp. 1-8.
[20] Y. Zhang, H. Tang, C. Wei, and W. Ding, "Bidirectional soft actor-critic: Leveraging forward and reverse KL divergence for efficient reinforcement learning," arXiv preprint, 2025.
[21] C. Ge, L. Luo, J. Zhang, X. Meng, and Y. Chen, "FRL: An integrative feature selection algorithm based on the Fisher score, recursive feature elimination, and logistic regression to identify potential genomic biomarkers," BioMed Research International, vol. 2021, p. 4312850, 2021.
[22] C. Ge, "Selection of potential cancer biomarkers based on feature selection method," in Proc. Third International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI), 2023, pp. 1-5.
[23] C. Ge, "Research on medical image classification based on active learning and vision-language models," Master's thesis, Shandong University, Shandong, 2025.