RF-GPT: Teaching AI to See the Wireless World


Authors: **Hang Zou** (Khalifa University), **Yu Tian** (Khalifa University), **B. Wang** (Khalifa University), **Lina Bariah** (Khalifa University), **Mérouane Debbah** (Khalifa University), **Bohao Wang** (Zhejiang University), **Chongwen Huang** (Zhejiang University), **Samson Lasaulce** (Université de Lorraine, CNRS, CRAN). *(Affiliations and contact details are as stated in the paper.)*

Abstract—Large language models (LLMs) and multimodal models have become powerful general-purpose reasoning systems. However, radio-frequency (RF) signals, which underpin wireless systems, are still not natively supported by these models. Existing LLM-based approaches for telecom focus mainly on text and structured data, while conventional RF deep-learning models are built separately for specific signal-processing tasks, highlighting a clear gap between RF perception and high-level reasoning. To bridge this gap, we introduce RF-GPT, a radio-frequency language model (RFLM) that utilizes the visual encoders of multimodal LLMs to process and understand RF spectrograms. In this framework, complex in-phase/quadrature (IQ) waveforms are mapped to time–frequency spectrograms and then passed to pretrained visual encoders. The resulting representations are injected as RF tokens into a decoder-only LLM, which generates RF-grounded answers, explanations, and structured outputs. To train RF-GPT, we perform supervised instruction fine-tuning of a pretrained multimodal LLM using a fully synthetic RF corpus. Standards-compliant waveform generators produce wideband scenes for six wireless technologies, from which we derive time–frequency spectrograms, exact configuration metadata, and dense captions. A text-only LLM then converts these captions into RF-grounded instruction–answer pairs, yielding roughly 12,000 RF scenes and 0.625 million instruction examples without any manual labeling. Across benchmarks for wideband modulation classification, overlap analysis, wireless-technology recognition, WLAN user counting, and 5G NR information extraction, RF-GPT achieves strong multi-task performance, whereas general-purpose VLMs with no RF grounding largely fail.
Index Terms—Radio frequency language models, vision-language models, radio frequency signals, modulation classification, spectrograms, RF-GPT

I. INTRODUCTION

Large language models (LLMs) have significantly advanced natural language processing, enabling powerful capabilities in long-form generation, code synthesis, tool use, and multi-step reasoning. Multimodal systems now extend these capabilities beyond text, where multimodal and vision–language models (VLMs), such as GPT-4o [1], Gemini [2], LLaVA [3], Qwen-VL [4], and InternVL [5], integrate image and text inputs to support visual reasoning, captioning, and visual question answering, among other tasks. In the audio domain, large transformer-based models such as Whisper [6], Qwen2-Audio [7], and MiniMax-Speech [8] demonstrate that massive unlabeled speech and sound corpora can be leveraged to learn robust representations for transcription, generation, and audio understanding. Despite these advances in text, vision, and audio, radio-frequency (RF) signals, which represent the physical layer for wireless communications, radar sensing, and integrated sensing-and-communications (ISAC), have not yet been integrated into these foundation-model frameworks.

(H. Zou, Y. Tian, B. Wang, L. Bariah, and M. Debbah are with the Research Institute for Digital Future, Khalifa University, 127788 Abu Dhabi, UAE (e-mails: {hang.zou, yu.tian, lina.bariah, merouane.debbah}@ku.ac.ae). B. Wang and C. Huang are with the College of Information Science and Electronic Engineering, Zhejiang University, 310027 Hangzhou, China (e-mail: {bohao.wang, chongwen.huang}@zju.edu.cn). S. Lasaulce is with Université de Lorraine, CNRS, CRAN, F-54000 Nancy, France (e-mail: samson.lasaulce@univ-lorraine.fr).)
Existing machine-learning-driven RF intelligence is mainly built from narrow, task-specific models, e.g., for automatic modulation classification, channel estimation, beam selection, interference identification, and spectrum sensing [9], to name a few. These models are typically trained on small, heterogeneous datasets under constrained assumptions about channel models, hardware impairments, and traffic patterns, and they are evaluated on task-specific metrics. While such models can achieve high accuracy, they exhibit several limitations. First, each task requires its own architecture, dataset, and training pipeline, resulting in limited reuse across tasks. Second, building diverse and well-labeled RF datasets is expensive and usually requires expert annotation, making large-scale supervision difficult. Third, models trained under particular setups experience degraded performance when deployed under different SNR ranges, channel conditions, or hardware scenarios. Finally, most RF models produce only labels or regression outputs, without explanations or a natural interface for human interaction. Within this context, a language-model-based approach changes this picture in two important ways. First, it allows multiple RF tasks to be handled within a single model through instructions, rather than through separate architectures. Second, it introduces an interface for reasoning and interaction, where the model can describe what it observes, justify its predictions, and respond to follow-up questions. Instead of training a new neural network for each RF objective, tasks can be expressed as prompts that are executed over a shared representation.
Despite recent progress on wireless and RF foundation models, such as WFM [10], LWM [11], and WirelessGPT-like channel models [12], most existing approaches still rely on task-specific output heads or per-task fine-tuning to achieve competitive performance on various downstream tasks. In practice, each new application (e.g., channel estimation, localization, sensing, or RF classification) requires its own prediction head, loss design, and fine-tuning pipeline on labeled data, significantly limiting the potential of an RF foundation model. At the same time, many of these models are relatively small and optimized for a small set of benchmarks, making it difficult to balance performance across heterogeneous tasks or to scale them according to the well-known scaling laws [13]. In parallel, the 6G research roadmap envisions AI-native networks that integrate sensing, communication, computing, and control, with autonomous, intent-driven operation across the radio access and core network [14]–[16]. Within this vision, LLMs have been proposed as unified interfaces for knowledge access, reasoning, tool orchestration, and policy optimization, enabling automated fault diagnosis, configuration generation, and closed-loop network management. Early work on LLM4Telecom has explored network- and service-management assistants, domain-specialized instruction-tuned models, and agentic workflows that connect LLMs to monitoring systems, operations support systems (OSS), and business support systems (BSS) data [17]–[21]. Telecom-specific LLMs, such as TelecomGPT [22], integrate domain knowledge directly into the language model to improve its reasoning capabilities over alarms, KPIs, logs, and configuration data. However, it is important to emphasize that the integration of LLMs into telecom networks is largely text-centric.
Existing telecom LLMs are primarily designed to process human-readable data, including tickets, log messages, configuration tables, and structured KPIs. They do not directly process observations pertinent to the physical layer, such as RF waveforms, spectrograms, or channel estimates. This modality gap prevents the use of LLMs in advanced RF-based tasks that require direct access to the RF spectrum, such as identifying coexisting technologies, detecting interference, analyzing occupancy patterns, or validating standard compliance at the signal level. Moreover, future AI-native networks are not envisaged to support reasoning over text and structured data only, but will also require deep RF perception capabilities that are explainable, queryable, and integrated with higher-level decision making.

These limitations motivate the development of radio-frequency language models (RFLMs), foundation models that can process RF data and respond in natural language. In principle, an RFLM is a conditional language model grounded on RF tokens. Given an RF recording, it should be able to answer questions such as "Which modulations and technologies are present in this band?", "Are there overlapping transmissions and what is their time–frequency relationship?", or "Is this behavior compliant with the relevant wireless standard?". Beyond simple classification, an RFLM can provide textual explanations, structured JSON summaries for downstream controllers, and interactive dialogue with RF engineers or higher-level LLM agents.
This language interface is motivated by practical considerations: it provides a unified instruction-following interface that covers a wide range of RF downstream tasks; it facilitates the integration of domain knowledge (e.g., ITU/3GPP standards, link-budget formulas, antenna constraints) and human feedback; and it allows human operators and autonomous agents to query, inspect, and manage RF behavior using plain language. In this sense, RF models become flexible and interactive, rather than isolated predictors. RF systems inherently generate large amounts of data (wideband monitoring, SDR deployments, cellular logs), and realistic simulators can produce vast amounts of labeled or semi-labeled signals under controlled conditions. This makes RF a promising candidate for self-supervised and synthetic pretraining. Using realistic waveform generators, we can generate signal–metadata samples with accurate ground truth, such as modulation type, SNR, bandwidth, or resource allocation. Such a pipeline reduces the reliance on expert annotations and enables large-scale pretraining and instruction tuning. A pretrained RFLM can then be compressed or adapted for deployment across different RF scenarios, and can serve as an RF reasoning module in O-RAN, non-terrestrial networks, and ISAC settings, among others.

However, it is worth emphasizing that realizing such an RFLM is challenging. Unlike images and audio signals, RF waveforms are time-series, complex-valued signals sampled at very high rates, and understanding them depends on understanding time–frequency structure, protocol standards, and propagation effects. To the best of the authors' knowledge, large-scale datasets of real RF signals in diverse environments with expert textual annotations do not exist in the literature. Building such a dataset requires long-term RF measurements, specialized hardware, and domain experts for labeling.
Over-the-air data collection also raises privacy concerns, and typical RF environments are highly imbalanced, with rare but critical scenarios, such as extreme interference or unusual coexistence, being difficult to capture.

In this paper, we propose RF-GPT, a radio-frequency language model that integrates RF spectrograms into a multimodal LLM. To achieve our objectives, we treat RF spectrograms as visual inputs and reuse the visual components of advanced vision–language models. Specifically, we first convert complex in-phase/quadrature (IQ) samples into time–frequency spectrograms using the short-time Fourier transform (STFT). These spectrograms are mapped to pseudo-RGB or grayscale images and encoded by a pretrained vision encoder to produce RF tokens. The tokens are then projected into the language-model embedding space through a lightweight adapter, after which a decoder-only LLM generates RF-grounded text conditioned on this RF prefix. To overcome the lack of RF–text pairs, we build a large synthetic spectrogram–caption dataset using realistic waveform generators for wireless technologies, record full configuration metadata, convert it into technical captions using a deterministic captioning pipeline, and then use a strong text-only LLM to synthesize diverse instruction–answer pairs (descriptions, counting, overlap analysis, structured JSON). We then perform RF-grounded supervised fine-tuning of multimodal LLM backbones (e.g., Qwen2.5-VL [4]) on this synthetic RF instruction set, without any human-annotated spectrogram labels. At inference time, RF-GPT takes an RF waveform/spectrogram and a natural-language query and returns explanations, predictions, or structured outputs that reflect RF-aware reasoning. We evaluate the model on a benchmark suite covering wideband modulation classification, overlap analysis, wireless-technology recognition, WLAN user counting, and 5G NR information extraction.
The main contributions of this paper are summarized as follows:

• We formulate the notion of a radio-frequency language model (RFLM) as a conditional language model grounded on RF tokens, and realize this concept through RF-GPT, which integrates RF spectrograms into a multimodal LLM via a vision encoder and a lightweight modality adapter implemented as a simple linear projection.

• We design standards-compliant pipelines to synthesize a large RF spectrogram corpus for wideband settings and six wireless technologies, including 5G NR, LTE, UMTS, WLAN, DVB-S2, and Bluetooth, and introduce deterministic captioning schemes that convert signal-level attributes into structured textual descriptions.

• We propose an automatic instruction-synthesis framework that converts RF captions into a diverse set of RF-grounded instruction–answer pairs, including explanation-style questions, quantitative queries (e.g., counts, overlaps), and structured JSON outputs, and we construct benchmarks that evaluate component recognition, overlap reasoning, technology identification, user counting, and NR-specific attribute extraction.

• We fine-tune pretrained multimodal LLMs on these synthetic RF instructions to obtain RF-GPT and demonstrate that RF grounding enables a range of RF understanding tasks, including modulation classification, technology recognition, overlap analysis, and free-form RF question answering, while generic VLMs without RF grounding fail on the same benchmarks.

II. PROBLEM FORMULATION AND RF-GPT ARCHITECTURE

In this section, we present the RF-GPT architecture adopted in this work. Let x ∈ C^T denote a complex baseband IQ sequence of length T samples, and let y = (y_1, ..., y_N) denote a text sequence of N tokens from a vocabulary V. This text sequence is generally relevant to the RF signal, e.g., a description of its characteristics or intents associated with it.
We denote by φ_RF an RF encoder that maps the raw IQ sequence into a sequence of M RF tokens in a latent space,

φ_RF : C^T → R^{M×d}, (1)

where d is the embedding dimension of each RF token. In our formulation, an RFLM is a conditional language model that generates a token sequence y given an RF input x that has been encoded into a sequence of RF tokens φ_RF(x). The model defines the autoregressive conditional distribution as

P_Θ(y | φ_RF(x)) = ∏_{t=1}^{N} P_Θ(y_t | y_{<t}, φ_RF(x)), (2)

where y_{<t} = (y_1, ..., y_{t−1}) and Θ denotes the full set of model parameters.

A. RF Spectrogram Encoding

The IQ sequence is first mapped to a time–frequency representation via the STFT,

S[m, k] = Σ_n x[n] w[n − mR] e^{−j2πkn/K}, (3)

where w[·] is an analysis window, R is the hop size, and K is the FFT length, and the magnitude is converted to a logarithmic (dB) scale,

A_dB[m, k] = 20 log10(|S[m, k]| + ε), (4)

where ε > 0 is a small constant for numerical stability. The resulting time–frequency matrix is normalized and mapped to an image I ∈ R^{H×W×C}, where H and W are the image height and width and C is the number of channels. We either consider A_dB as a single-channel grayscale image (C = 1) or apply a fixed colormap (e.g., viridis) to obtain a pseudo-RGB image (C = 3). This yields an RF spectrogram image that can be processed by a standard vision encoder E_v. We define the complete RF encoder φ_RF as the composition of the spectrogram pipeline and the vision pipeline,

φ_RF(x) = E_v(I) = E_v(Spec(x)), (5)

where Spec(·) denotes the spectrogram pipeline (STFT + magnitude/dB + colormap).

Patch embedding. The spectrogram image I is partitioned into a regular grid of non-overlapping patches of size P × P pixels, where both H and W are divisible by P. The number of patches (and thus RF tokens) is M = (H/P) · (W/P). Let x_i ∈ R^{P²C} denote the vector obtained by flattening the i-th patch (row-major order), for i = 1, ..., M. Each patch is linearly projected to a d-dimensional embedding using a learnable matrix E ∈ R^{d×(P²C)},

e_i = E x_i ∈ R^d, i = 1, ..., M. (6)

This patchification and linear projection step acts as an RF tokenizer, turning each spectrogram into a sequence of M RF tokens that the language model can consume.
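To make the Spec(·) pipeline and the patch tokenizer concrete, the following is a minimal NumPy sketch. The Hann window, FFT length, and hop size are illustrative assumptions, since the paper does not fix these hyperparameters.

```python
import numpy as np

def iq_to_spectrogram_db(x, n_fft=256, hop=128, eps=1e-10):
    """Eqs. (3)-(4): STFT log-magnitude of a complex baseband IQ sequence x.

    Hann window and fftshift (zero frequency centered) are illustrative choices.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = x[start:start + n_fft] * window
        frames.append(np.fft.fftshift(np.fft.fft(seg)))
    S = np.stack(frames, axis=1)               # (n_fft, n_frames) STFT matrix
    return 20.0 * np.log10(np.abs(S) + eps)    # A_dB, eps for numerical stability

def patchify(img, P=16):
    """Split an (H, W, C) image into M = (H/P)*(W/P) flattened P x P patches."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches                             # (M, P*P*C), row-major patch order
```

Multiplying each row by the learnable matrix E of Eq. (6) then yields the M RF tokens consumed by the encoder.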
We also use a learnable positional embedding p_i ∈ R^d for each patch index i, which encodes its 2D location on the spectrogram grid (time–frequency position). The input token sequence to the transformer encoder is then given by

z_i^(0) = e_i + p_i, i = 1, ..., M, (7)

and we stack these vectors row-wise into the matrix

Z^(0) = [ (z_1^(0))^⊤ ; ... ; (z_M^(0))^⊤ ] ∈ R^{M×d}. (8)

Unlike some ViT variants [24], [25], we do not prepend a special classification token; instead, all M patch tokens are treated as RF tokens and later passed to the language model.

Fig. 1: Basic structure of RF-GPT, comprising a vision-based radio-frequency (RF) encoder (implemented by a vision encoder on RF spectrograms), an RF adapter (linear projection) that projects RF embeddings to the LLM dimension, and a decoder-only LLM. STFT stands for short-time Fourier transform.

Transformer encoder layers. The matrix Z^(0) is processed by L stacked transformer encoder blocks. Each block consists of multi-head self-attention (MHA) followed by a position-wise multi-layer perceptron (MLP), both wrapped with residual connections and layer normalization. Given Z^(ℓ) ∈ R^{M×d} as the input to layer ℓ (ℓ = 0, ..., L − 1), the output Z^(ℓ+1) is obtained as

Z̃^(ℓ) = Z^(ℓ) + MHA(LN(Z^(ℓ))), (9)
Z^(ℓ+1) = Z̃^(ℓ) + MLP(LN(Z̃^(ℓ))), (10)

where LN(·) is layer normalization and the MLP is a two-layer feed-forward network (FFN) with nonlinearity, applied row-wise as

MLP(z) = σ(z W_1 + b_1) W_2 + b_2, (11)

for learnable weights W_1, W_2, biases b_1, b_2, and activation function σ(·) (e.g., GELU). The multi-head self-attention (MHA) operates as follows.
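Before detailing MHA, the pre-normalized block recursion of Eqs. (9)–(11) can be sketched in NumPy; the attention sub-layer is passed in as a callable so the block reads independently of Eqs. (12)–(14), and the tanh-based GELU approximation is an illustrative choice.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LN(.) over the feature dimension (learned scale/shift omitted for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """tanh approximation of the GELU nonlinearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(Z, W1, b1, W2, b2):
    """Eq. (11): row-wise two-layer FFN with GELU."""
    return gelu(Z @ W1 + b1) @ W2 + b2

def encoder_block(Z, attn, W1, b1, W2, b2):
    """Eqs. (9)-(10): pre-norm residual attention and MLP sub-layers."""
    Z = Z + attn(layer_norm(Z))                  # Eq. (9)
    Z = Z + mlp(layer_norm(Z), W1, b1, W2, b2)   # Eq. (10)
    return Z
```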
For a given input U ∈ R^{M×d}, the h-th attention head computes

Q_h = U W_h^Q, K_h = U W_h^K, V_h = U W_h^V, (12)
A_h = softmax(Q_h K_h^⊤ / √d_k), H_h = A_h V_h, (13)

where W_h^Q, W_h^K, W_h^V ∈ R^{d×d_k} are learnable projection matrices (query, key, and value for head h), d_k is the head dimension, and the softmax is applied row-wise over the attention scores. The outputs of all heads are concatenated and linearly projected, as follows:

MHA(U) = [H_1 ∥ H_2 ∥ ... ∥ H_H] W^O, (14)

where W^O ∈ R^{(H d_k)×d} is the projection matrix and H is the number of attention heads. After L layers, the final patch embeddings are

h_i = z_i^(L) ∈ R^d, i = 1, ..., M. (15)

These vectors form the RF-grounded latent sequence h_{1:M} used for downstream conditioning.

Remark. In our framework, the vision encoder serves as the RF encoder by treating the spectrogram as an image and encoding time–frequency patterns such as modulation structure, resource allocation, interference, and coexistence into a sequence of latent RF tokens. There are two main motivations for transforming a vision encoder into an RF encoder, namely, (i) RF signals span very different carrier frequencies, bandwidths, and sampling rates across technologies, making it difficult to design a single raw-IQ tokenizer that is simultaneously efficient and robust for narrowband, wideband, and multi-standard scenarios, and (ii) the STFT serves as an effective RF feature extractor, while a vision encoder pretrained on large-scale image datasets can extract universal semantic structure from images, including spectrograms.

A key design decision in RF-GPT is to operate on magnitude spectrograms rather than raw IQ signals. From a time–frequency analysis viewpoint, the spectrogram is a smoothed energy distribution associated with the signal, and can be seen as a windowed version of Wigner–Ville-type representations [26].
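Returning to the encoder internals, Eqs. (12)–(14) admit a direct NumPy sketch; per-head weight triples are passed as a list for readability, whereas a practical implementation would fuse them into single matrices.

```python
import numpy as np

def softmax(s):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(U, heads, WO):
    """Eqs. (12)-(14): `heads` is a list of (WQ_h, WK_h, WV_h) triples."""
    outs = []
    for WQ, WK, WV in heads:
        Q, K, V = U @ WQ, U @ WK, U @ WV        # Eq. (12)
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))     # Eq. (13): attention weights
        outs.append(A @ V)                      # H_h = A_h V_h
    return np.concatenate(outs, axis=-1) @ WO   # Eq. (14): concat + projection
```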
As such, many RF attributes of interest, e.g., occupied bandwidth, center frequency and Doppler shifts, burst timing, time–frequency sparsity patterns, and even modulation structure, are encoded geometrically in the spectrogram and are therefore accessible to a vision encoder.

It is important to distinguish between two regimes. First, true phase retrieval aims to reconstruct the entire complex baseband waveform from magnitude-only measurements (e.g., the STFT magnitude). This problem is nontrivial, but the phase-retrieval literature shows that, under suitable redundancy and support assumptions, a signal can be uniquely determined (up to a global phase) and stably recovered from such magnitude data; see, e.g., work on STFT phase retrieval [27] and convex formulations such as PhaseLift [28]. Second, in our setting we do not attempt to reconstruct x[n] itself, but only to infer task-relevant attributes such as modulation family, technology, SNR regime, or overlap structure. These tasks are many-to-one mappings of the waveform and depend primarily on the time–frequency energy patterns, not on the exact sample-wise phase. RF-GPT therefore uses magnitude-only spectrograms as a rich but lossy front-end representation. While they are not sufficient for all possible RF tasks (e.g., those requiring absolute phase information or precise multi-antenna phase relationships), they retain enough structure to support the perception-oriented tasks considered in this work. In Sec. IV we empirically validate that this spectrogram-based approach allows the model to recover the RF attributes required by our benchmarks with high accuracy.

With the RF encoding pipeline established, we now describe how the resulting RF tokens are used to condition the language model and ground its predictions in RF information.
B. Language Model Architecture and RF Conditioning

The backbone of RF-GPT is a decoder-only Transformer language model with parameters Θ_LM, which are part of the overall parameter set Θ. The LLM operates on a sequence of token embeddings u_{1:(M+N)} ∈ R^{(M+N)×d_LM}, where d_LM is the model dimension of the LLM. We add standard positional embeddings, e.g., rotary position embeddings (RoPE [29]), to u_{1:(M+N)} before passing them to the decoder layers.

First, the RF encoder produces M visual tokens h_{1:M} = E_v(I), with h_i ∈ R^d as defined in the previous subsection. An RF adapter, implemented as a simple linear projection, maps these to the LLM dimension,

r_i = W_proj h_i + b_proj ∈ R^{d_LM}, i = 1, ..., M, (16)

where W_proj ∈ R^{d_LM×d} and b_proj ∈ R^{d_LM} are trainable parameters. This linear layer plays the role of a visual–language adapter, similar to the projection modules used in LLaVA-style models [3], mapping the vision-encoder features (which encode RF information) into the LLM embedding space. Text tokens y_t are embedded via a standard lookup table E_tok ∈ R^{|V|×d_LM}:

e_t = E_tok[y_t] ∈ R^{d_LM}, t = 1, ..., N, (17)

with |V| the vocabulary size of the LLM backbone. We then form a single joint sequence by concatenating RF and text embeddings:

u_{1:(M+N)} = (r_1, ..., r_M, e_1, ..., e_N). (18)

Let U^(0) ∈ R^{(M+N)×d_LM} be the matrix whose rows are u_1, ..., u_{M+N}. The sequence is processed by L_LM stacked decoder layers. Each layer follows a modern pre-normalized Transformer design with root-mean-square (RMS) normalization, grouped-query attention (GQA) with a causal mask, and a gated Up/Down MLP, as adopted in recent Qwen-series LLMs.
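Eqs. (16)–(18), the linear RF adapter and prefix construction, amount to a few lines; the sketch below uses illustrative shapes.

```python
import numpy as np

def build_joint_sequence(h, W_proj, b_proj, E_tok, token_ids):
    """Eqs. (16)-(18): project RF tokens to d_LM and prepend them to text embeddings."""
    r = h @ W_proj.T + b_proj         # Eq. (16): (M, d) -> (M, d_LM) RF prefix
    e = E_tok[token_ids]              # Eq. (17): embedding lookup, (N, d_LM)
    return np.concatenate([r, e], 0)  # Eq. (18): joint input U^(0), (M+N, d_LM)
```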
For layer ℓ = 0, ..., L_LM − 1, we write

Ũ^(ℓ) = U^(ℓ) + GQA_causal(RMSNorm(U^(ℓ))), (19)
U^(ℓ+1) = Ũ^(ℓ) + MLP_gated(RMSNorm(Ũ^(ℓ))), (20)

where RMSNorm(·) is root-mean-square layer normalization [30], GQA_causal is grouped-query self-attention [31] with a causal mask, and MLP_gated is a gated feed-forward block (Up/Down MLP) [32]. Since decoder-only transformers and these building blocks are not yet standard in RF modeling, we briefly recall their definitions.

RMS normalization. For a token vector x ∈ R^{d_LM}, RMSNorm is defined as

RMSNorm(x) = γ ⊙ x / √( (1/d_LM) ∥x∥²₂ + ε ), (21)

where γ ∈ R^{d_LM} is a learned scale, ε > 0 is a small constant, and ⊙ denotes element-wise multiplication. RMSNorm stabilizes activations while being slightly simpler than standard LayerNorm.

Grouped-query attention (GQA). Let U ∈ R^{(M+N)×d_LM} denote a generic layer input. Grouped-query attention uses H_q query heads and H_k key–value heads with H_q ≥ H_k. We first compute

Q = U W^Q, K = U W^K, V = U W^V, (22)

where W^Q ∈ R^{d_LM×(H_q d_k)} and W^K, W^V ∈ R^{d_LM×(H_k d_k)} are learnable projection matrices, and d_k is the head dimension. To facilitate grouped-query computation, we reshape Q into H_q separate heads and K, V into H_k groups. Formally, let Q_h ∈ R^{(M+N)×d_k} denote the h-th query head for h ∈ {1, ..., H_q}, and let K_{g(h)}, V_{g(h)} ∈ R^{(M+N)×d_k} denote the key and value heads shared by the group associated with query head h. The grouping is determined by the index function g(h) = ⌊(h − 1)/r⌋ + 1, where r = H_q/H_k is the group size, mapping the h-th query head to its corresponding key–value group index. With a causal attention mask M_causal ∈ R^{(M+N)×(M+N)} (entries 0 for allowed positions and −∞ otherwise), GQA computes

A_h = softmax( Q_h K_{g(h)}^⊤ / √d_k + M_causal ), H_h = A_h V_{g(h)}. (23)
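Eq. (21) and the head-to-group mapping g(h) are small enough to state directly; the causal-mask helper below is an illustrative addition for Eq. (23).

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Eq. (21): scale features by their root mean square, then a learned gain."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def kv_group(h, H_q, H_k):
    """g(h) = floor((h-1)/r) + 1 with r = H_q / H_k (1-based head indices)."""
    r = H_q // H_k
    return (h - 1) // r + 1

def causal_mask(T):
    """M_causal: 0 on/below the diagonal, -inf above (future positions blocked)."""
    return np.where(np.tril(np.ones((T, T), dtype=bool)), 0.0, -np.inf)
```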
The heads are then concatenated and linearly projected back to d_LM:

GQA_causal(U) = [H_1 ∥ ... ∥ H_{H_q}] W^O, (24)

with W^O ∈ R^{(H_q d_k)×d_LM}. The causal mask enforces that position t attends only to positions ≤ t, ensuring autoregressive generation.

Gated Up/Down MLP. The feed-forward block uses a gated Up/Down structure (often instantiated with a SwiGLU-style activation) instead of a plain two-layer MLP. For an input matrix X ∈ R^{T×d_LM}, we compute

U_up = X W_up + b_up, (25)
U_gate = X W_gate + b_gate, (26)

where W_up, W_gate ∈ R^{d_LM×d_ff} project into a larger hidden dimension d_ff, and b_up, b_gate ∈ R^{d_ff}. A pointwise nonlinear gating (e.g., SwiGLU) is then applied:

G = φ_SwiGLU(U_gate) ⊙ U_up, (27)
MLP_gated(X) = G W_down + b_down, (28)

with W_down ∈ R^{d_ff×d_LM} and b_down ∈ R^{d_LM}. The gating mechanism improves expressivity at similar or lower computational cost than a standard two-layer MLP.

Finally, the last decoder layer produces U^(L_LM) ∈ R^{(M+N)×d_LM}. We take the final N positions corresponding to the text tokens (indices M + 1 to M + N) and apply a linear output head and softmax to obtain the conditional distribution P_Θ(y | φ_RF(x)) in (2). Apart from the RF tokens being concatenated as a prefix, the LLM follows a modern architecture with grouped-query attention, RMSNorm, and gated MLPs (e.g., Qwen- and LLaMA-style models).

C. RF-Grounded Supervised Fine-Tuning

With the model structure of RF-GPT in place, we now discuss how to inject knowledge of RF signals into it. To adapt a generic VLM into an RFLM, we perform RF-grounded supervised fine-tuning (SFT) on a synthetic instruction dataset built from standards-compliant waveform generators. We assume access to a dataset

D_RF = { (x^(i), q^(i), y^(i)) }_{i=1}^{K}, (29)

where x^(i) is an RF waveform (IQ samples) that can be converted into a spectrogram image as described in Sec. II-A,
q^(i) is a natural-language instruction or question about the corresponding spectrogram (e.g., "Describe the signal types and overlaps in this RF scene."), and y^(i) is the desired answer (e.g., a caption, explanation, or JSON summary). These triplets are constructed in two stages, namely, RF spectrogram captioning and RF instruction synthesis (see Sec. III). Given this dataset, RF-GPT is trained to minimize the standard autoregressive cross-entropy loss over the answer tokens, conditioned on both the RF tokens and the instruction:

L(Θ) = − Σ_{(x,q,y)∈D_RF} Σ_{t=1}^{|y|} log P_Θ( y_t | y_{<t}, q, φ_RF(x) ). (30)
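For completeness, the gated feed-forward block of Eqs. (25)–(28) (here with a SiLU gate, the common SwiGLU instantiation) and the masked SFT objective L(Θ) can be sketched together. The per-position answer mask is an implementation detail assumed here, since only answer tokens y_t contribute to the loss; the sketch computes the per-example contribution to L(Θ).

```python
import numpy as np

def silu(x):
    """SiLU/Swish nonlinearity, a common choice for phi_SwiGLU."""
    return x / (1.0 + np.exp(-x))

def mlp_gated(X, W_up, b_up, W_gate, b_gate, W_down, b_down):
    """Eqs. (25)-(28): gated Up/Down feed-forward block."""
    G = silu(X @ W_gate + b_gate) * (X @ W_up + b_up)  # Eqs. (25)-(27)
    return G @ W_down + b_down                          # Eq. (28)

def sft_answer_loss(logits, targets, answer_mask):
    """Per-example cross-entropy over answer tokens only, as in Eq. (30).

    logits: (T, V) next-token logits; targets: (T,) target token ids;
    answer_mask: (T,) 1.0 where the target is an answer token y_t, else 0.0
    (RF-prefix and instruction positions contribute no loss).
    """
    z = logits - logits.max(axis=-1, keepdims=True)     # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]       # -log P(y_t | .)
    return float((nll * answer_mask).sum())
```

With uniform logits over a vocabulary of size V, each unmasked position contributes log V, which gives a quick sanity check on the implementation.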
