Towards unified brain-to-text decoding across speech production and perception
Zhizhang Yuan 1†, Yang Yang 1*†, Gaorui Zhang 1, Baowen Cheng 2, Zehan Wu 3, Yuhao Xu 3, Xiaoying Liu 3, Liang Chen 3, Ying Mao 3, Meng Li 2*

1 Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China.
2 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China.
3 Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, China.

*Corresponding author(s). E-mail(s): yangya@zju.edu.cn; li.meng@mail.sim.ac.cn
Contributing authors: zhizhangyuan@zju.edu.cn; gaoruizhang@zju.edu.cn; chengbaowen23@mails.ucas.ac.cn; zhwu08@fudan.edu.cn; 23211220055@m.fudan.edu.cn; 7631935@qq.com; hschenliang@fudan.edu.cn; maoying@fudan.edu.cn
†These authors contributed equally to this work.

Abstract

Speech production and perception constitute two fundamental and distinct modes of human communication. Prior brain-to-text decoding studies have largely focused on a single modality and on alphabetic languages. Here, we present a unified brain-to-sentence decoding framework for both speech production and perception in Mandarin Chinese. The framework exhibits strong generalization ability, enabling sentence-level decoding when trained only on single-character data and supporting characters and syllables unseen during training. In addition, it allows direct and controlled comparison of neural dynamics across modalities. We collected neural data from 12 participants implanted with depth electrodes and achieved full-sentence decoding across multiple participants, with best-case Chinese character error rates of 14.71% for spoken sentences and 21.80% for heard sentences. Mandarin speech is decoded by first classifying syllable components in Hanyu Pinyin, namely initials and finals, from neural signals, followed by a post-trained large language model (LLM) that maps sequences of toneless Pinyin syllables to Chinese sentences. To enhance LLM decoding, we designed a three-stage post-training and two-stage inference framework based on a 7-billion-parameter LLM, achieving overall performance that exceeds larger commercial LLMs with hundreds of billions of parameters or more. In addition, several characteristics were observed in Mandarin speech production and perception: speech production involved neural responses across broader cortical regions than auditory perception; channels responsive to both modalities exhibited similar activity patterns, with speech perception showing a temporal delay relative to production; and decoding performance was broadly comparable across hemispheres. Our work not only establishes the feasibility of a unified decoding framework but also provides insights into the neural characteristics of Mandarin speech production and perception. These advances contribute to brain-to-text decoding in logosyllabic languages and pave the way toward neural language decoding systems supporting multiple modalities.

Introduction

Language is a uniquely human system for representing and communicating information, expressed through several communicative modalities such as speaking, listening, reading, and writing. Among these modalities, speaking and listening constitute the primary means of everyday communication.
Previous work has demonstrated that linguistic content can be decoded from brain activity elicited during speaking [1-13] or listening [14, 15] by establishing mappings between neural signals and the corresponding linguistic information. Most existing studies focus on one of these modalities, each adopting its own experimental paradigm for data collection and a corresponding decoding pipeline. For data acquisition, many high-accuracy language decoding studies rely on recording methods such as microelectrode arrays [3, 5], which capture local neural activity with extremely high temporal and spatial resolution. Nevertheless, their spatial coverage is limited to highly localized functional areas, which makes multimodal decoding challenging. Regarding decoding framework design, models have been developed for different tasks, including classifying isolated words [7], inferring sentences from acoustic features [2, 3, 5, 6], and reconstructing speech waveforms [2, 4, 14]. Meanwhile, the vast majority of these advances have been achieved in alphabetic language systems [2-5, 8-15], such as English and Dutch. In contrast, brain-to-text decoding studies on logosyllabic languages, most notably Chinese, the language with the largest number of native speakers worldwide, remain comparatively limited [1, 6, 7, 16].

Addressing this gap, we present a unified brain-to-sentence decoding framework for Mandarin Chinese that operates seamlessly across both speaking and listening modalities. We employed stereo-electroencephalography (sEEG), implanting multiple depth electrodes across various functional brain regions in each participant, establishing the foundation for multimodal neural decoding. Our approach unifies the two modalities from the level of behavioral paradigms through the entire decoding pipeline, enabling not only sentence decoding under either modality but also a direct comparison of the neural dynamics evoked by speech production and perception. Mandarin Chinese employs logographic characters, which number in the tens of thousands [17], making direct character-level decoding from neural activity impractical. We instead leverage Hanyu Pinyin, the standardized phonetic system composed of initials, finals, and tones [18], covering over 1200 possible tonal syllables, where each syllable maps to multiple Chinese characters. Our decoding framework utilizes only initials and finals to form toneless syllables, as tone decoding was less robust than initial and final decoding, and incorporating tone information led to suboptimal decoding performance, as examined in subsequent analyses. The decoding pipeline consists of three stages. First, the initial and final of each Chinese character are classified from neural signals. Then, the resulting classification probabilities are searched to generate multiple candidate syllable sequences. Finally, the correct Chinese sentence is inferred based on these candidates.

The framework exhibits generalization ability (shown in Fig. 1g). First, it provides hierarchical generalization, as training solely on neural responses to isolated spoken or heard characters is sufficient for decoding full sentences. Second, it achieves character generalization, as the decoded sentences may include Chinese characters that never appeared during training.
Third, it enables syllable generalization, allowing the model to decode Chinese Pinyin syllables that were not present in the training set. Together, these properties establish a general, modality-unified, and linguistically grounded decoding framework for Mandarin Chinese.

A major technical challenge in the decoding process arises in the final stage, which aims to infer the correct sentence from the searched candidate set. In alphabetic languages such as English, acoustic features are represented at the phoneme level, and a phoneme sequence strongly constrains the lexicon, typically mapping to a unique word or only a few candidate words. As a result, in English speech decoding, the search process effectively performs a phoneme-to-word mapping. In contrast, decoding from a toneless syllable to a logographic character in Mandarin Chinese involves a two-step one-to-many mapping: first from the toneless syllable to the tonal syllable, and then from the tonal syllable to the logographic character, resulting in a single toneless syllable potentially corresponding to dozens of different characters. Similarly, at the word level, toneless syllable sequences still provide insufficient contextual constraints, resulting in persistent one-to-many mappings analogous to those observed at the character level. Consequently, character- or word-level mappings are inherently highly ambiguous, introducing cumulative errors that may mislead subsequent correction steps.

Although toneless syllables are highly ambiguous in isolation, sentence context often provides strong constraints. Therefore, we designed an end-to-end approach that leverages large language models (LLMs) to decode toneless syllable sequences into Chinese sentences directly, eliminating intermediate translation steps while enabling the model to exploit contextual information within and across sequences. Nevertheless, LLMs with several billion parameters struggle to accomplish this task effectively, likely due to insufficient training on this modality (toneless syllable sequences) and closely related inference tasks. Although commercial LLMs with hundreds of billions to trillions of parameters demonstrate better performance, their inference incurs substantial computational costs and is impractical for local deployment. To address this issue, we developed a three-stage post-training and two-stage inference scheme based on a 7-billion-parameter model (shown in Fig. 1e,f), which demonstrated superior performance to some commercial LLMs on this task. Our results demonstrate that, in neural language decoding tasks, continued post-training enables LLMs to process diverse input types and tackle more complex decoding tasks, moving beyond their traditional role of merely correcting or refining final decoded sentences.

We collected neural data from 12 participants implanted with depth electrodes, across whom we achieved reliable decoding of speech at multiple levels. Initial and final decoding performance was significantly above chance level for all participants in the speaking task (mean initial accuracy = 59.54%; mean final accuracy = 50.17%) and for 10 participants in the listening task (mean initial accuracy = 58.92%; mean final accuracy = 48.05%).
Building on these results, sentence-level decoding showed that 6 participants in the speaking task and 5 participants in the listening task achieved average character error rates (CERs) below 50% (mean speaking CER = 31.52%; mean listening CER = 37.28%), with the best CERs reaching 14.71% and 21.80%, respectively. Among these, 4 participants demonstrated reliable sentence decoding across both modalities (mean speaking CER = 32.15%; mean listening CER = 36.80%). Furthermore, we identified several additional observations characterizing the neural features of speech production and perception. First, neural responses evoked by speech production spanned a broader set of cortical regions than those elicited during auditory perception. Second, for channels that were highly responsive to both speech production and perception, the activity patterns across the two modalities were strongly correlated, with perception responses exhibiting a clear temporal delay relative to production. Third, decoding performance was comparable between the left and right hemispheres across both speaking and listening tasks. In conclusion, our work (1) demonstrates the feasibility of a unified decoding framework applicable across both speaking and listening modalities, (2) exhibits hierarchical, character, and syllable generalization abilities, (3) employs a post-training and inference framework for LLMs to resolve the highly ambiguous one-to-many mapping from toneless syllables to logographic characters, improving sentence-level reconstruction, and (4) provides insights into the neural differences and similarities between Mandarin speech production and perception.

Results

This section provides an overview of our decoding framework, data collection process, contribution analysis, and the decoding results. Detailed statistical procedures for all experimental analyses are provided in the Methods.

Brain-to-sentence decoding framework

Our brain-to-sentence decoding framework consists of three components: a brain decoder that classifies the initials and finals of each Chinese character; a beam search module that generates multiple possible toneless syllable sequence candidates from the logits (i.e., output probabilities) produced by the brain decoder; and an LLM-based syllable-to-sentence decoder that derives the final sentence from the selected candidates.

Fig. 1 Overview of our brain-to-sentence framework. a,b, Data acquisition paradigm. a, Task alternating between listening to and speaking individual Chinese characters. b, Task alternating between listening to and speaking complete Mandarin sentences. c,e, Training pipeline of the decoding framework. c, The brain decoder is trained to classify syllable components (initials and finals) from neural signals. We adopted NeuroSketch, a 2D-CNN-based neural decoder developed in our concurrent work. e, The LLM is post-trained to generate full sentences from syllable sequences. d,f, Inference pipeline of the decoding framework. d, Beam search is applied to the probabilistic outputs of the brain decoder to generate multiple syllable sequence candidates, from which the top-20 candidates are retained. f, The top-20 candidates are fed into the post-trained LLM, which performs a two-stage inference process to produce the final decoded sentence. g, The proposed framework exhibits hierarchical, character-level, and syllable-level generalization abilities.
Specifically, the brain decoder is composed of two identical modules, each trained in a supervised manner to discriminate initials and finals from the input brain signals. We adopted NeuroSketch [19], an effective 2D-CNN neural decoder developed in our concurrent work, as the default classifier for initials and finals from neural recordings. Then, for each sentence, we arranged the logits of the initials and finals for every character as predicted by the brain decoder, and performed beam search to retrieve the most probable decoding paths, each corresponding to a syllable sequence candidate. All candidates were ranked in descending order of their scores, and the top-20 candidates were finally retained.
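To make the search step concrete, the following is a minimal sketch of this kind of beam search, not the authors' released code. It assumes per-character probability matrices from the two classifiers and a small, purely illustrative table of valid initial-final combinations; the real inventory covers all legal Mandarin syllables.

```python
import numpy as np

# Illustrative subsets; the full framework uses all Mandarin initials/finals.
INITIALS = ["b", "f", "h", "j", "m", "n"]
FINALS = ["a", "ang", "en", "ian", "uan", "uo"]
VALID_SYLLABLES = {("f", "ang"), ("j", "ian"), ("h", "en"),
                   ("n", "uan"), ("h", "uo"), ("m", "a"), ("b", "a")}

def beam_search(initial_probs, final_probs, beam_width=20):
    """initial_probs/final_probs: (n_chars, n_classes) probability arrays.
    Returns up to beam_width toneless syllable sequences ranked by summed
    log-probability over characters."""
    beams = [([], 0.0)]
    for ini_p, fin_p in zip(initial_probs, final_probs):
        expanded = []
        for seq, score in beams:
            for ini, fin in VALID_SYLLABLES:
                p = ini_p[INITIALS.index(ini)] * fin_p[FINALS.index(fin)]
                expanded.append((seq + [ini + fin], score + np.log(p + 1e-12)))
        expanded.sort(key=lambda b: b[1], reverse=True)  # best paths first
        beams = expanded[:beam_width]
    return beams

# Toy usage: random classifier outputs for a 5-character sentence.
rng = np.random.default_rng(0)
ini = rng.dirichlet(np.ones(len(INITIALS)), size=5)
fin = rng.dirichlet(np.ones(len(FINALS)), size=5)
print(beam_search(ini, fin)[0])  # highest-scoring candidate sequence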
Finally, we post-trained an LLM to generate the Chinese sentence from the top-20 selected candidates. As shown in Fig. 1e, before training, we expanded the LLM's vocabulary to cover all toneless syllables in Mandarin Chinese. The model was then subjected to a three-stage supervised fine-tuning process, comprising a translation task that translated a syllable sequence into the corresponding Chinese sentence, a listwise ranking task that selected the three candidates closest to the correct syllable sequence from the top-20 candidates, and a correction task that generated the correct sentence based on the three best candidates. During inference, we employed a two-step decoding procedure (shown in Fig. 1f): the top-20 candidates were fed into the post-trained LLM to select the top three, which were then input into the model again to generate the correct sentence.
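The following is a minimal sketch of how supervised examples for the three post-training stages could be assembled. The prompt wording and helper names are hypothetical, not the paper's exact templates, and the vocabulary-expansion lines assume a HuggingFace-style tokenizer/model interface.

```python
# Hypothetical example builders for the three post-training stages.

def translation_example(syllables, sentence):
    # Stage 1: translate one toneless syllable sequence into its sentence.
    return {"prompt": f"拼音: {' '.join(syllables)}\n句子:", "target": sentence}

def ranking_example(top20, best3):
    # Stage 2: list-wise ranking over the 20 beam-search candidates.
    listing = "\n".join(f"{i + 1}. {' '.join(c)}" for i, c in enumerate(top20))
    return {"prompt": f"候选:\n{listing}\n最接近的3个:",
            "target": "\n".join(" ".join(c) for c in best3)}

def correction_example(best3, sentence):
    # Stage 3: generate the correct sentence from the three best candidates.
    listing = "\n".join(" ".join(c) for c in best3)
    return {"prompt": f"候选:\n{listing}\n正确句子:", "target": sentence}

# Vocabulary expansion (sketch): register each toneless syllable as one
# token, then enlarge the embedding matrix accordingly.
# tokenizer.add_tokens(sorted(ALL_TONELESS_SYLLABLES))
# model.resize_token_embeddings(len(tokenizer))
```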
Data collection

For data collection, intracranial recordings were obtained from 12 patients (denoted S1 to S12; see Supplementary Table 1 for participant information) with drug-resistant epilepsy (7 males, 5 females; age range: 13-56 years). Electrode implantation covered both hemispheres in 6 patients, the left hemisphere in 3, and the right hemisphere in 3. Each patient was implanted with 7-17 electrodes, resulting in 107-205 bipolar-referenced channels per subject.

Our experimental paradigm was designed with two key considerations. First, all participants were patients with drug-resistant epilepsy, which limits the feasible duration for data collection. To obtain an adequate amount of data and ensure a training dataset balanced across syllables within a short time, we divided the data acquisition procedure for each participant into two parts (Fig. 1a,b). The first part consisted of a single-character listening and speaking task, which served as the source of training data. The second part comprised a sentence-level listening and speaking task, which was used exclusively for evaluation. Second, neural signals drift over time [20, 21], making it challenging to directly compare brain activity elicited by listening and speaking. To minimize temporal confounds, we interleaved the listening and speaking tasks, such that each character or sentence was spoken a few seconds after being heard, ensuring close temporal alignment between the two conditions and allowing a more reliable comparison of the corresponding brain responses.

We constructed two distinct corpora (Supplementary 4), each containing separate training and test sets. The first corpus consisted of a training set comprising 49 Chinese characters, which included 11 initials, 15 finals, 22 toneless syllables, and 44 tonal syllables. The test set consisted of 22 sentences, each ranging from 4 to 11 characters in length, totaling 61 unique characters. Among these, 14 characters overlapped with the training set, and all syllables in the test sentences were derived from the 22 toneless syllables present in the training set. The second corpus featured a training set of 161 characters, covering 11 initials, 15 finals, 60 toneless syllables, and 159 tonal syllables. The test set consisted of 31 sentences, each ranging from 6 to 16 characters, containing 128 unique characters and 66 toneless syllables. In this test set, 25 characters overlapped with the training set, while 17 toneless syllables were not present in the training set. Participants S1 to S10 used the first corpus; participants S11 and S12 used the second corpus. Specifically, in the first corpus, each toneless syllable in the training set was presented between 20 and 60 times, whereas in the second corpus, each toneless syllable was repeated 15 times. Additionally, each test sentence in both corpora was presented twice.

Contribution analysis

Across the 12 subjects in our dataset, a total of 146 depth electrodes were implanted, yielding 1902 channels after bipolar re-referencing, with 1057 located in the left hemisphere and 845 in the right. We first investigated the decoding contributions of these channels for speaking and listening to identify those with significant contributions. Only each participant's training data was used, split into training and validation sets at a 4:1 ratio. Each channel was individually used to classify initials and finals with our brain decoder during both speaking and listening tasks, and the best validation performance was recorded.

Channel contribution. In Extended Data Fig. A.1, we mapped the channel coordinates of all subjects onto a standard brain template and visualized the F1 scores of the channels in the classification tasks as heatmaps. Meanwhile, we estimated the chance level of the classification task by randomly sampling labels according to the class distribution as weights and repeating the process 5000 times. Then, we computed the relative improvement of each channel's performance over the chance level to quantify its contribution. In Fig. 2a, we highlighted the channels corresponding to 0%, 50%, and 100% improvement over the chance level, which represent low, medium, and high contributions, respectively. The results show that, across all contribution levels, the proportion of channels in the speaking task was consistently higher than that in the listening task, indicating that brain activity associated with speech production is overall more widespread. Fig. 2b illustrates the spatial distribution of high-contribution channels during the speaking and listening tasks. Channels highly responsive to speech production were found across broader cortical regions, demonstrating that speech production engages a wider set of brain areas.
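As a concrete illustration, a minimal sketch of this chance-level estimate is given below, assuming scikit-learn for the F1 computation; the macro averaging is our assumption, as the paper does not specify the variant. Labels are resampled from the empirical class distribution, and a channel's contribution is its relative improvement over the resulting baseline.

```python
import numpy as np
from sklearn.metrics import f1_score

def chance_level_f1(labels, n_resamples=5000, seed=0):
    """Estimate chance-level F1 by sampling predictions from the empirical
    class distribution, repeated n_resamples times (5000 in the text)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts / counts.sum()
    scores = [f1_score(labels,
                       rng.choice(classes, size=len(labels), p=weights),
                       average="macro")  # averaging scheme assumed
              for _ in range(n_resamples)]
    return float(np.mean(scores))

def relative_improvement(channel_f1, chance_f1):
    # 0%, 50%, and 100% improvement mark low-, medium-, and
    # high-contribution channels, respectively.
    return (channel_f1 - chance_f1) / chance_f1
```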
Comparison between speech production and perception. After characterizing the spatial distribution of neural responses associated with speech production and perception, we next focused on channels that were responsive to both modalities. To this end, we identified a subset of channels (n = 38) exhibiting high contributions in both speaking and listening tasks, which were primarily distributed across the superior temporal and insular regions, and compared their neural response patterns during speech production and perception. Fig. 2c compares the decoding performance of these channels in the speaking and listening tasks. In the initial decoding, listening achieved higher overall accuracy than speaking (p = 0.025), whereas in the final decoding, the performance of the two tasks was comparable (p = 0.21).

Fig. 2 Channel contribution analysis. a, Histograms showing the distribution of relative improvement over chance level for all channels in the speaking and listening tasks. Three improvement thresholds, 0%, 50%, and 100%, are indicated, corresponding to low-, medium-, and high-contribution channels, respectively. b, Spatial distributions of high-contribution channels identified during speech production and perception. c, Violin plots comparing the decoding performance between speaking and listening tasks for channels that were highly responsive to both speech production and perception. d, For the subset of channels in (c) exhibiting stable temporal relationships between modalities, a heatmap illustrating the maximum correlation between speaking- and listening-evoked neural signals and the corresponding time lags. e, Box plots showing the distribution of time lags across different syllables for each channel with stable speaking-listening delays. f, Violin plots comparing decoding performance between left and right hemispheric channels across different contribution thresholds.

Motivated by these results, we further investigated the similarities and differences in brain activity evoked by speech production and perception. Inspired by the analysis in Hamilton et al. [22], for each channel, we computed the maximum correlations and corresponding time lags between responses to the same syllables in speaking and listening conditions. We retained channels with stable time delays across samples (standard deviation less than 100 ms), resulting in 20 channels.
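A minimal sketch of this lag analysis is shown below: it scans candidate lags, normalizes the two response traces, and returns the peak correlation and its lag. The +/-400 ms window and the equal-length, fs-Hz trace format are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def best_lag(speak, listen, fs, max_lag_ms=400):
    """Peak correlation and time lag (ms) between two equal-length response
    traces for the same syllable; a negative lag means the listening
    response trails the speaking response."""
    max_lag = int(max_lag_ms * fs / 1000)  # traces assumed longer than this
    sp = (speak - speak.mean()) / speak.std()
    li = (listen - listen.mean()) / listen.std()
    best_r, best_tau = -np.inf, 0
    for tau in range(-max_lag, max_lag + 1):
        if tau >= 0:
            a, b = sp[tau:], li[:len(li) - tau]
        else:
            a, b = sp[:tau], li[-tau:]
        r = float(np.corrcoef(a, b)[0, 1])
        if r > best_r:
            best_r, best_tau = r, tau
    return best_r, 1000.0 * best_tau / fs
```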
Fig. 2d shows the relationship between the correlations and corresponding time lags, in which the correlations of these channels were notably high (mean = 0.7175, 90% CI [0.4828, 0.8678]), indicating similar neural response patterns to identical syllables across speaking and listening. A similar phenomenon was revealed in the context of English speech by Chen et al. [23]. Additionally, the neural signals revealed a consistent latency in the listening condition relative to speaking. Fig. 2e illustrates the distribution of time lags for each channel, showing consistent response latencies during listening compared to speaking (mean = -106.5 ms, 90% CI [-249.4, 23.05]). This delay is consistent with established neural processing mechanisms of self-generated speech and perception of others [24]. Collectively, these results suggest that channels responsive to both modalities exhibit similar response patterns with a clear temporal lag in speech perception relative to speech production.

Comparison between left and right hemispheres. The left hemisphere of the brain is generally considered the dominant hemisphere for language and speech processing [25], and thus many studies on speech decoding have utilized neural signals recorded from the left hemisphere [2, 3, 5]. Chen et al. [4] found in their experiments that decoding of speech production could be accomplished using signals from either hemisphere, with no significant difference observed between the two. Taking this finding a step further, we performed comparisons between the left and right hemispheres for both speech production and perception based on progressively more selective channel subsets. Starting from all channels, we subsequently examined three additional subsets exceeding the low-, medium-, and high-contribution thresholds, respectively. For each (sub)set, we grouped channels by hemisphere and compared the initial and final decoding performance between the left and right hemispheres. As shown in Fig. 2f, we compared the performance distributions of the left and right hemispheres and calculated their Cliff's delta (δ) effect sizes [26]. Across both the speaking and listening tasks, we did not observe any substantial differences between the left and right hemispheres. First, under all thresholds, the number of selected channels did not differ significantly between the two hemispheres. Moreover, among the 16 pairwise comparisons, 11 exhibited a Cliff's delta with an absolute value below 0.147, indicating a small effect size [27]. Therefore, both the left and right hemispheres are potential targets for electrode implantation in the neural decoding of speech production and perception.
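For reference, Cliff's delta compares two samples by pairwise ordering; a minimal sketch with toy numbers:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-sample pairs.
    An absolute value below 0.147 is read as a small effect, the
    convention applied in the text."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Example: F1 scores of left- vs right-hemisphere channels (toy numbers).
print(cliffs_delta([0.42, 0.38, 0.50], [0.40, 0.45, 0.39]))
```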
Decoding of initials and finals

Based on the results of the contribution analysis, we conducted decoding experiments using high-contribution channels, i.e., channels whose activity patterns showed strong correlation with the decoding targets. Due to variability in electrode implantation locations and data quality across participants, two subjects (S3 and S9) did not have any high-contribution channels in the listening task. To focus on reliable decoding, we conducted our analysis on participants with available high-contribution channels in the corresponding tasks. Complete results for all participants are provided in the Supplementary Tables.

Fig. 3 Decoding of initials and finals. a, Bar plots showing the initial and final classification accuracies of NeuroSketch and three other brain decoders in the speaking and listening tasks. For each task, results are shown for participants with available high-contribution channels (12 out of 12 for speaking, 10 out of 12 for listening). Complete results for all participants are provided in the Supplementary Tables. b, Bar plots illustrating syllable generalization performance of NeuroSketch. Classification accuracies of initials and finals are compared between in-domain (ID) and out-of-domain (OOD) syllables for participants evaluated on the second corpus.

Results overview. In the speaking task, we achieved average initial and final accuracies of 59.54% (95% CI [24.23%, 84.10%], representing a relative improvement of 394.9% over chance level) and 50.17% (95% CI [14.94%, 79.87%], representing a relative improvement of 412.0% over chance level), respectively, with the highest accuracies reaching 85.06% for initials and 80.19% for finals. In the listening task, we achieved mean initial and final accuracies of 58.92% (95% CI [17.69%, 86.17%], representing a relative improvement of 389.7% over chance level) and 48.05% (95% CI [14.29%, 82.96%], representing a relative improvement of 406.6% over chance level), respectively, with top accuracies of 86.76% and 83.77%. At the individual-participant level, substantial dissociations between speaking and listening performance were observed, most commonly manifesting as markedly higher decoding accuracy during speech production than during perception. For example, participant S1 achieved initial and final decoding accuracies of 78.64% (95% CI [77.60%, 79.55%]) and 79.38% (95% CI [78.97%, 79.87%]) in the speaking task, whereas the corresponding accuracies dropped to 18.35% (95% CI [17.53%, 19.16%]) and 14.87% (95% CI [14.03%, 15.91%]) in the listening task. Similarly, participant S9 showed high decoding performance during speaking (mean = 64.93%, 95% CI [63.96%, 65.84%] for initials and mean = 70.81%, 95% CI [69.45%, 71.75%] for finals) but did not exhibit any high-contribution channels in the listening task, resulting in initial decoding accuracy close to chance level during perception (p = 0.92). For some participants, decoding performance in the listening task exceeded that in the speaking task.
However, the resulting performance gaps were smaller than those observed when speaking markedly outperformed listening.

Comparison of different brain decoders. We compared NeuroSketch (2D-CNN-based) with three other architectures encompassing mainstream deep neural network paradigms: ModernTCN [28] (1D-CNN-based), Medformer [29] (Transformer-based), and MultiResGRU [30] (RNN-based). The bar plots in Fig. 3a and Extended Data Fig. A.3 summarize the classification performance of the four brain decoders. Across the four brain decoders, NeuroSketch achieved the highest overall accuracy. Specifically, in the speaking task, NeuroSketch attained average accuracy improvements of 6.44% and 7.66% (both p < 0.001) over the second-best model (MultiResGRU) in initial and final classification, respectively. In the listening task, NeuroSketch showed 4.59% and 7.61% higher average accuracies than the second-best models, MultiResGRU for initials and ModernTCN for finals, respectively (both p < 0.001). MultiResGRU demonstrated a modest overall advantage over ModernTCN on initial classification, achieving 1.32% and 1.51% higher accuracies in the speaking and listening tasks (both p < 0.001). In contrast, the two models showed comparable performance on final classification (p = 0.47 in the speaking task and p = 0.08 in the listening task). Among the four brain decoders, Medformer showed relatively lower performance compared to the other architectures, with the difference being statistically significant (p < 0.001) across both tasks.

Out-of-domain syllable generalization. A practical challenge for neural speech decoding in Mandarin Chinese is the large size of its syllable inventory, which includes over 400 toneless syllables and more than 1200 tonal syllables. Achieving reliable decoding for the full syllable inventory requires extensive data collection. For instance, Qian et al. [1] reported a 13-day recording protocol for a single participant, with 2-3 hours of recording per day, during which the participant produced 394 toneless syllables, each repeated approximately 30 times. In contrast, our data collection typically lasted only 2-3 hours for most participants within a single day, which made it impossible to sample a large number of syllables extensively. This constraint highlights the importance of decoding syllables absent from the training set, enabling the model to extend from a limited subset of observed syllables to the broader syllabic space.

We evaluated the out-of-domain (OOD) syllable generalization performance of NeuroSketch by dividing the test set of the second corpus into in-domain (ID) and OOD syllables. On OOD syllables, we achieved average accuracies of 71.42% (95% CI [58.58%, 79.40%]) for initials and 40.60% (95% CI [34.40%, 48.28%]) for finals, representing relative improvements over their respective chance levels of 571.2% and 333.3%. These results demonstrate that our framework can generalize to OOD syllables and accurately classify their initials and finals. Fig. 3b compares the accuracies of the initials and finals on the two subsets. The average accuracy for initials on OOD syllables decreased by 5.08% compared to that on ID syllables, while finals exhibited a larger drop of 14.80%. This phenomenon suggests that finals were more difficult to decode when generalizing to previously unseen syllables. One possible explanation is that finals exhibit greater acoustic and articulatory variability across syllables. For example, the duration of the final a differs between the syllables pa (shorter and more abrupt) and ma (typically longer and more nasalized). In addition, the articulatory configuration of the final i varies with the preceding initial. For instance, the i in shi is a retroflex vowel produced with the tongue curled back, whereas in ji it corresponds to a high front vowel. As a preliminary exploration, these results demonstrate that an initial-final-based decoding paradigm can achieve OOD syllable generalization, providing empirical evidence that Mandarin syllable decoding does not necessarily require full coverage of the syllabic inventory.
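The ID/OOD split itself is straightforward; a minimal sketch (with hypothetical variable names) of how per-subset accuracies can be computed:

```python
def id_ood_accuracy(predictions, targets, training_syllables):
    """predictions/targets: lists of (initial, final) pairs for the test set;
    training_syllables: set of (initial, final) pairs seen during training.
    Returns accuracy separately for in-domain and out-of-domain syllables."""
    buckets = {"ID": [], "OOD": []}
    for pred, true in zip(predictions, targets):
        key = "ID" if true in training_syllables else "OOD"
        buckets[key].append(pred == true)
    return {k: sum(v) / len(v) if v else float("nan")
            for k, v in buckets.items()}
```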
Sentence decoding

Quality of syllable sequence candidates. Next, we analyzed the syllable sequence candidates generated by beam search, which was performed on the predicted probabilities output by the brain decoder. To assess the quality of the candidates from beam search, we utilized the syllable error rate (SER) distribution of the top-20 candidates and the exact match probability (EMP), defined as the probability that the top-20 candidates contain a perfectly correct syllable sequence. Fig. 4a shows the SER distribution of the beam search results. In the speaking task, across all subjects, the proportion of high-quality candidates (SER < 0.3) and the EMP produced by NeuroSketch were 28.23% and 27.65%, respectively. These values are approximately twice those of the second-best model (MultiResGRU), which achieved 15.66% high-quality candidates and an EMP of 12.86%. Furthermore, we investigated the impact of changes in initial-final classification accuracy on the EMP. Using the performance of NeuroSketch as a reference point, we calculated the improvement ratios in initial-final classification accuracy and EMP compared to other models. As shown in Fig. 4b, the improvement ratio in the EMP significantly exceeded that of the initial-final classification accuracy (p < 0.001). When the classification accuracy improved by approximately 20%, the corresponding EMP could increase by over 80% for some samples, indicating a substantial enhancement in candidate quality. These results demonstrate that the advantage in initial-final classification was amplified during the beam search stage, indicating that a well-performing brain decoder is crucial for decoding performance.
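Both candidate-quality measures reduce to simple computations over syllable sequences; a minimal sketch, assuming SER is Levenshtein distance over syllable tokens normalized by reference length:

```python
def levenshtein(a, b):
    # Single-row dynamic-programming edit distance over syllable tokens.
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, y in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1,       # deletion
                                       row[j - 1] + 1,   # insertion
                                       prev + (x != y))  # substitution
    return row[-1]

def ser(candidate, reference):
    """Syllable error rate of one candidate sequence against the reference."""
    return levenshtein(candidate, reference) / len(reference)

def emp(top20_sets, references):
    """Exact match probability: fraction of sentences whose top-20 candidate
    set contains the fully correct syllable sequence."""
    hits = sum(ref in cands for cands, ref in zip(top20_sets, references))
    return hits / len(references)
```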
Syllable-to-sentence decoding using LLM. The final step of the decoding framework leverages an LLM to generate the spoken or heard sentence from multiple toneless syllable sequence candidates. To tackle this challenging problem, we decomposed it into two subtasks. As shown in the preceding analyses, errors in initial and final decoding were amplified during sequence-level decoding. Consequently, when initial and final decoding accuracy is insufficiently high, the beam-search candidate sets can deviate substantially from the correct sequences. In most such cases, all candidates within the beam exhibited very high SERs, making it impossible to reconstruct the original sentences and thereby hindering effective evaluation of sentence decoding performance. Therefore, we focused our analysis on participants whose average proportion of high-quality candidates (SER < 0.3) across the decoding corpus exceeded 10%, which served as a minimal quality threshold below which reliable sentence reconstruction was rarely possible. Complete sentence decoding results of our model for all participants are available in the Supplementary Tables.

Fig. 4c presents the sentence-level decoding performance of these participants. In the speaking task, we achieved an average CER of 35.92% across 7 participants (95% CI [14.19%, 62.58%]), and in the listening task, the average CER across 6 participants was 40.31% (95% CI [20.28%, 57.06%]). Among these participants, 5 achieved reliable sentence decoding in both the speaking and listening tasks, with average CERs of 38.18% (95% CI [14.03%, 62.64%]) and 37.28% (95% CI [20.12%, 51.68%]), respectively. In the first corpus, we achieved best CERs of 14.71% (95% CI [13.27%, 16.97%]) and 21.80% (95% CI [19.56%, 23.58%]) for the speaking and listening tasks. In the second corpus, where the test sentences included syllables not present in the training set, the best CERs for the speaking and listening tasks were 36.06% (95% CI [33.83%, 38.51%]) and 31.10% (95% CI [28.98%, 33.61%]), respectively. In the following results, we tested the performance of other LLMs as well as variants of our own model. Given that neural decoding typically requires rapid response times, we disabled the deep-thinking mode during inference for all models, using standard inference exclusively.

Fig. 4 Sentence decoding. a, Proportion distributions of SER for the top-20 syllable-sequence candidates generated by NeuroSketch and the three other brain decoder architectures in the speaking (12 participants) and listening (10 participants) tasks. Complete results of our model for all participants are provided in the Supplementary Tables. b, Scatter plot illustrating the relationship between the relative improvement in initial and final classification accuracies and the relative improvement in EMP among the top-20 candidates. c,d, Results for sentence-level decoding. To ensure that the input candidate sets contain a minimal threshold of reliable information for sentence decoding, we focused on participants whose average proportion of high-quality candidates (SER < 0.3) across the decoding corpus exceeded 10% (7 out of 12 for speaking, 6 out of 12 for listening). Complete results of our model for all participants are provided in the Supplementary Tables. c, Bar plots comparing the CER of our LLM and other LLMs on the syllable-to-sentence decoding task, including a small LLM (Qwen2.5-7B-Instruct), a medium-sized LLM (Qwen2.5-72B-Instruct), and large-scale commercial LLMs (Qwen-3 Max, Deepseek-v3.2-exp, Doubao-1.6, GPT-5-chat-latest, Grok-4-fast, Llama-3.1). d, Bar plots presenting ablation results of our syllable-to-sentence decoding framework. The Baseline method used an IME-style decoder based on lexicon- and language-model-driven search. The Top-1, Top-3, and Top-20 correction groups performed direct correction of the top-k beam search candidates. Auxiliary component ablations, labeled "w/o translation" and "w/o vocabulary," were created by removing the respective components from the full framework.
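The CER reported throughout is a character-level edit distance normalized by reference length; one convenient way to compute it, assuming the third-party jiwer package (an illustration, not necessarily the authors' tooling):

```python
# pip install jiwer
from jiwer import cer

reference = "房间很暖和"   # the spoken/heard sentence
hypothesis = "房间很暖火"  # a decoded sentence with one wrong character
print(cer(reference, hypothesis))  # 0.2 = 1 substitution / 5 characters
```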
As our model was based on Qwen2.5-7B-Instruct [31], we first tested the performance of the original Qwen2.5-7B-Instruct on the syllable-to-sentence task. It was observed that the average CERs of Qwen2.5-7B-Instruct were 104.84% (95% CI [70.25%, 146.25%]) and 99.21% (95% CI [71.66%, 129.44%]) in the speaking and listening tasks, respectively, indicating that directly applying a 7B model to this task is entirely infeasible. For medium-sized models, we tested Qwen2.5-72B-Instruct and found a significant improvement over Qwen2.5-7B-Instruct (p < 0.001), with the average CER in the speaking task (mean = 55.38%, 95% CI [26.58%, 95.77%]) reduced to approximately half that of Qwen2.5-7B-Instruct. Furthermore, we evaluated six large-scale commercial LLMs, including the Chinese-oriented LLMs Qwen-3 Max [32], Deepseek-v3.2-exp [33], and Doubao-1.6 [34], as well as the English-oriented LLMs GPT-5-chat-latest [35], Grok-4-fast [36], and Llama-3.1 [37]. The results show that most of these large-scale LLMs significantly outperformed small- or medium-sized LLMs (p < 0.001). Among the evaluated LLMs, GPT-5-chat-latest, Doubao-1.6, and Deepseek-v3.2-exp showed comparable leading performance, with pairwise significance tests between these models yielding p-values greater than 0.05 in both the speaking and listening tasks. Even though the task primarily involves understanding Hanyu Pinyin, Chinese-oriented LLMs did not demonstrate a significant performance advantage over English-oriented LLMs. In addition, we compared our model with these commercial LLMs. Across both speaking and listening tasks, our model achieved significantly lower average CERs than all commercial LLMs (p < 0.001). Specifically, it outperformed the best commercial models by an average margin of 4.97% in the speaking task (compared with Deepseek-v3.2-exp) and 3.10% in the listening task (compared with GPT-5-chat-latest). These results demonstrate that decomposing the task into two subtasks and applying appropriate post-training can substantially enhance LLM performance.
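To make the decomposition concrete, here is a minimal sketch of the two-stage inference loop, with `llm_generate` standing in for a call to the post-trained model and prompt strings mirroring the illustrative training templates sketched earlier (the exact wording is an assumption):

```python
def decode_sentence(top20_candidates, llm_generate):
    """Stage 1: list-wise ranking selects the three most plausible
    candidates; stage 2: the model generates the sentence from the trio."""
    listing = "\n".join(f"{i + 1}. {' '.join(c)}"
                        for i, c in enumerate(top20_candidates))
    top3_text = llm_generate(f"候选:\n{listing}\n最接近的3个:")
    return llm_generate(f"候选:\n{top3_text}\n正确句子:")
```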
We also conducted an ablation study to evaluate the effectiveness of the proposed LLM-based syllable-to-sentence decoding framework, examining key design choices related to LLM post-training, beam-search candidate retention, explicit task decomposition, and auxiliary components. Fig. 4d shows the sentence decoding results of our model and the ablation groups. We began by constructing a minimal baseline in which only the top-1 beam-search syllable sequence was retained and converted into a Chinese sentence using an input method editor (IME)-style decoder based on lexicon- and language-model-driven search, representing a conventional syllable-to-text decoding strategy. Under this setting, the baseline achieved an average CER of 49.31% (95% CI [33.97%, 71.13%]) in the speaking task and 52.42% (95% CI [34.30%, 62.90%]) in the listening task. We then replaced the IME-style decoder with a post-trained correction LLM while keeping all other components unchanged. This modification led to an average CER reduction of 6.86% (p < 0.001) in the speaking task and 5.38% (p < 0.001) in the listening task, indicating that LLM-based correction exhibits greater potential to resolve syllable-to-sentence decoding than conventional lexicon- and language-model-based decoding. Next, we examined the effect of retaining multiple beam-search candidates. When the number of candidates was increased from one to three, decoding performance improved by 2.80% (p < 0.001) and 1.76% (p < 0.001) in the speaking and listening tasks, respectively. However, further increasing the candidate set to 20 did not lead to consistent additional improvements, with performance remaining comparable to that of the top-3 setting (p = 0.27 in the listening task). This saturation effect indicates that, under direct correction, expanding the candidate set increases the proportion of low-quality hypotheses, which can obscure the correct solution and limit further performance gains. Notably, in our framework, when direct correction was decomposed by first selecting the most promising candidates from the larger set using a listwise ranking strategy before applying correction, performance continued to improve with statistical significance (p < 0.001), underscoring the importance of structured task decomposition for effectively leveraging large and noisy candidate sets. Finally, we evaluated two auxiliary components of the framework: extending the LLM vocabulary to fully cover toneless syllables and performing a preliminary translation post-training step. Removing either component resulted in a performance degradation (p < 0.001) in both the speaking and listening tasks, confirming that these design choices are integral to the overall decoding pipeline. In conclusion, these ablation results demonstrate that the proposed syllable-to-sentence decoding framework benefits from both high-level framework design choices and fine-grained component-level optimizations.

Fig. 5 Decoding of tones. a, Bar plots showing tone classification accuracy in the speaking and listening tasks. b, Proportion distributions of syllable error rates for the top-20 syllable sequence candidates with and without tone information. c, Bar plots comparing exact match probability under decoding with and without tone information.
Decoding of tones

Hanyu Pinyin is composed of three fundamental elements: initials, finals, and tones. The previous results demonstrated that sentence-level decoding could be achieved using only initials and finals, combined with the ability of an LLM. In this section, we explain why tone information is not decoded in our framework. To investigate this, we trained our brain decoder to classify tones and applied beam search to generate tone-bearing syllable candidates using the predicted probability distributions over initials, finals, and tones. Fig. 5a and Extended Data Fig. A.4 report the results of tone classification. Overall, the average tone classification accuracy was 58.01% (95% CI [35.14%, 92.50%]) in the speaking task and 54.36% (95% CI [25.97%, 87.35%]) in the listening task. We found that, although tone has only four categories, the classification accuracies of many subjects (4 subjects in the speaking task and 4 subjects in the listening task) were lower than those of initials and finals, which have more than ten categories, indicating that the neural signal patterns for tones are less distinct than those for initials and finals.

Meanwhile, the relatively high error rate in tone classification is further amplified during beam search. Fig. 5b,c compare the quality of the beam search candidates generated without and with tone information. In the speaking task, incorporating tones substantially degraded candidate quality: the average proportion of high-quality candidates dropped from 28.23% to 10.09%, falling to approximately one third of the original level. A similar trend was observed in the listening task, where the high-quality proportion decreased from 25.94% to 9.74%. Consistently, the average EMP declined sharply when tones were included: in the speaking task from 27.65% to 7.69%, and in the listening task from 25.23% to 6.95%, both falling to less than one third of the original performance. These results demonstrate that introducing tones drastically reduces the quality of the decoded syllable-sequence candidates. Moreover, our sentence decoding experiments showed that both current commercial LLMs and our model can successfully infer correct sentences from toneless syllable sequences, demonstrating that tone information is not necessary for sentence decoding. Therefore, tones were not incorporated in our decoding framework.

Discussion

In this work, we introduce a unified approach to brain-to-sentence decoding for both speaking and listening in Mandarin Chinese. By integrating the two modalities within a unified experimental and modeling framework and leveraging the broad spatial coverage of sEEG electrodes, we successfully decoded full sentences in both modalities, demonstrating the potential of multimodal brain-language decoding. A notable aspect of our decoding framework is its generalization ability. Even when trained only on single-character speaking or listening data, the framework can decode full sentences. It also handles characters and syllables that never appeared during training, demonstrating broad generalization across linguistic units. To address the high ambiguity inherent in mapping Mandarin syllables to characters and words, we propose an LLM-based syllable-to-sentence decoding framework that performs end-to-end mapping directly from a set of syllable sequence candidates to the corresponding Chinese sentence.
Given the difficulty of this task for LLMs, we introduce a principled approach leveraging task decomposition alongside continued post-training of the LLM. Specifically, we develop a three-stage post-training and two-stage inference framework, which enables a 7-billion-parameter LLM to outperform several much larger commercial models on this challenging decoding task.

Using this unified framework, we not only achieved sentence decoding but also enabled direct comparison between the two modalities. Our key observations include the following. First, neural responses during speech production were distributed across broader cortical regions than those observed during speech perception. Second, channels jointly responsive to speech production and perception showed highly similar neural dynamics across the two modalities, with perceptual responses consistently lagging behind those observed during production. Third, decoding performance was comparable between the left and right hemispheres for both speaking and listening tasks. These results highlight shared and distinct neural characteristics of speech production and perception.

Despite the promising results achieved in this study, several limitations remain. First, our LLM-based syllable-to-sentence decoding framework requires the full sentence to be completed before generating outputs, which prevents online decoding of each character as it is being spoken or heard. Second, while our results demonstrate that initials maintain relatively high accuracy in OOD syllable generalization, finals exhibit a larger performance drop. Third, although we achieved decoding performance that was consistently and substantially above chance level, the confidence intervals reveal pronounced inter-subject variability. This variability primarily arises from inherent challenges in clinical neural recordings, including differences in electrode implantation sites, signal quality, available training samples, and participant engagement. How to mitigate such data heterogeneity at the decoding level, particularly for subjects with less informative neural signals, and further extend brain-to-text decoding toward robust cross-subject generalization with sEEG remains an open and challenging problem that warrants future investigation.

Future work will focus on addressing these challenges. To improve online decoding, future work may explore frameworks that support incremental syllable-to-character conversion, allowing continuous output updates based on partial input and historical decoding context. Regarding OOD syllable generalization, inspired by Feng et al. [6], we will further explore more fine-grained acoustic feature-based segmentation of finals, explicitly distinguishing different realizations of the same final to guide the model toward learning more transferable representations. With respect to cross-subject decoding, Singh et al. [11] demonstrated successful reconstruction of English phoneme sequences using a model with a shared sub-component across participants, highlighting the potential of cross-subject brain-to-sentence decoding with sEEG. To mitigate sEEG data heterogeneity and facilitate more robust and generalizable brain-to-text decoding, future work may explore complementary directions at both the data-analysis and model-design levels.
From a data perspective, we aim to investigate intrinsic properties of sEEG signals by identifying neural features that are shared across subjects under specific conditions, such as within homologous cortical regions. If such shared representations prove to be sufficiently generalizable and informative, they may enable training strategies that leverage pooled data across participants. In parallel, at the model-design level, we plan to explicitly encourage the learning of these shared neural representations while suppressing subject-specific or task-irrelevant variations, thereby providing a potential pathway toward more robust and generalizable brain-to-text decoding.

Beyond the immediate technical contributions, our work carries broader implications across multiple domains. Within the field of artificial intelligence, although instantiated with Mandarin Pinyin initials and finals, this decoding paradigm can be broadly applied to any sensor-to-symbol pipeline capable of generating an n-best hypothesis set, thus opening avenues for versatile neural decoding across different modalities and languages. Moreover, the syllable-to-sentence decoding framework we developed demonstrates the great potential of LLMs for neural language decoding. Unlike traditional approaches that treat LLMs merely as downstream error-correction tools applied to decoded sentences, our framework enables LLMs to directly process more diverse input forms and perform more complex tasks with greater flexibility and effectiveness. In the context of brain-computer interfaces (BCI), these advances support progress toward neural decoding systems that handle multiple language input and output modalities. From a neuroscience perspective, our observations offer insights into the shared and distinct neural processes underlying language production and perception, which may guide future investigations into the neural basis of multimodal language processing.

Methods

Data acquisition

Participants. Twelve participants with drug-resistant epilepsy were enrolled from three tertiary hospitals in China: Huashan Hospital (Fudan University), Chongqing Xinqiao Hospital, and the First Affiliated Hospital of Fujian Medical University. All underwent clinical sEEG monitoring as part of their pre-surgical evaluation. Participants were native Mandarin speakers, and written informed consent was obtained before data collection, in accordance with institutional ethics guidelines and the Declaration of Helsinki. The study was approved by the Huashan Hospital Institutional Review Board of Fudan University (HIRB, KY2019-518).

Signal recording. The electrode implantation plan, encompassing the location and number of implants, was devised solely for the treatment of epilepsy. Neural signals were recorded using a multi-channel electrophysiological recording system (EEG-1200C, Nihon Kohden, Tokyo, Japan) at a sampling rate of 2000 Hz. Concurrently, audio was acquired via a microphone placed in proximity to the participant to ensure clear and reliable sound recording, sampled at 44.1 kHz. Both neural and audio signals were synchronized and recorded in real time using the BCI2000 software platform (https://www.bci2000.org).

Electrode anatomy localization. Individual brain reconstructions were performed using preoperative T1-weighted MRI with FreeSurfer [38] (version 7.3.2).
Postoperative CT was co-registered to the preoperative MRI using VeraView (United Imaging Healthcare), a clinical imaging platform that enables rigid fusion of multimodal volumes through intensity-based alignment. Intracranial electrode contacts were manually annotated on the coregistered CT–MRI overlay within VeraView, with spatial localization guided by anatomical landmarks. The resulting electrode coordinates in native space were subsequently used for surface-based mapping and further transformation into standard MNI space. Supplementary Fig. 1 shows the electrode locations for each participant.

Acoustic contamination. We conducted a contamination analysis to rule out the possibility that our neural signals are contaminated by acoustic noise, utilizing the method proposed in Roussel et al. [39]. First, we calculated the correlations between frequency components in the audio and the neural signals to obtain the contamination matrix. The mean value on the diagonal was computed to obtain a contamination index. Then, a distribution of surrogate indices was built by computing the mean diagonal 10,000 times on as many shuffled versions of the contamination matrix. Each time, we shuffled the contamination matrix by permuting either its rows or its columns. Finally, we calculated the proportion P of surrogate indices that exceeded the original index, where P > 0.05 indicates insufficient evidence to reject the null hypothesis of no contamination. Supplementary Fig. 2a presents an example contamination matrix and the corresponding distribution of surrogate indices for participant S6. As shown in Supplementary Fig. 2b, all participants exhibited P values greater than 0.05, suggesting no evidence of acoustic contamination in our neural recordings.
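To make the surrogate procedure concrete, the following minimal numpy sketch implements the permutation test described above, assuming a precomputed frequency-by-frequency contamination matrix; the function name and shapes are illustrative, not taken from the authors' code.

```python
import numpy as np

def contamination_p_value(contam: np.ndarray, n_surrogates: int = 10_000,
                          seed: int = 0) -> float:
    """Permutation test on a (freq x freq) audio-neural contamination matrix.

    The observed index is the mean of the diagonal; surrogates are built by
    permuting rows or columns and recomputing the mean diagonal.
    """
    rng = np.random.default_rng(seed)
    observed = np.mean(np.diag(contam))
    n = contam.shape[0]
    surrogate = np.empty(n_surrogates)
    for i in range(n_surrogates):
        perm = rng.permutation(n)
        # Shuffle rows or columns with equal probability.
        shuffled = contam[perm, :] if rng.random() < 0.5 else contam[:, perm]
        surrogate[i] = np.mean(np.diag(shuffled))
    # Proportion of surrogate indices exceeding the observed index;
    # P > 0.05 indicates no evidence of acoustic contamination.
    return float(np.mean(surrogate > observed))
```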
Experimental paradigm

Data collection for training. Each participant was guided by visual cues to perform trial tasks. Each trial began with a white fixation cross presented centrally on a black background for 2 seconds, followed by a 1-second auditory cue, a 1-second inter-stimulus interval, and a 3.5-second articulation window. The articulation window began with a 1-second ready period, followed by a 2.5-second dynamic horizontal progress bar that guided sustained vocalization until its completion. Due to clinical constraints and variability in recording stability across patients, the experimental duration was limited to 2–3 hours, yielding 5–15 sessions for each participant.

Data collection for evaluation. For the first corpus, each experimental session lasted approximately 13 minutes, including one presentation of each sentence. For the second corpus, completing one full pass over all test sentences required two sessions. Participants were guided by visual cues to perform trial tasks. Each trial began with a 2-second preparatory period, followed by an auditory prompt containing the full sentence. Within the audio prompt, each character was presented for 1 second, with 1-second inter-character intervals. Following a 1-second inter-stimulus interval, a dynamic horizontal progress bar appeared on the screen to guide sentence articulation. Each character within the sentence was allocated 3 seconds, which comprised a 1-second inter-character pause and 2 seconds of sustained vocalization.

Data preprocessing

Neural signals were first visually inspected to exclude channels exhibiting excessive noise or absent activity. The remaining data were downsampled to 1000 Hz, band-pass filtered between 0.5 and 200 Hz, and notch filtered at 50 Hz to remove power-line interference. Bipolar re-referencing was then applied to enhance spatial specificity, and each channel was subsequently normalized via z-score transformation. We segmented neural signals based on characters. In the listening task, each character's auditory cue lasted 1 second, so we used a uniform 1-second window for segmentation. In the speaking task, individual character articulation involved a 3.5-second window, beginning with a 1-second ready period followed by a 2.5-second progress bar that guided sustained vocalization. Accordingly, we segmented these trials using a 2.5-second window. For sentence-level speech, each character was allocated 3 seconds, consisting of a 1-second inter-character pause and 2 seconds of sustained vocalization. We segmented these with a 2-second window and applied zero-padding before and after to match the 2.5-second window length used for individual characters.

Response latency analysis

Here we describe the implementation details of the response latency analysis. In the analysis, we selected channels that were highly responsive to both speech production and perception, and compared their neural signals under the two modalities. We utilized the data in the training set, in which participants either listened to or produced individual characters. Neural signals here were aligned to the actual behavioral onset, defined as the audio playback onset in the listening task and the speech onset in the speaking task. For each trial, we extracted a 1-second neural segment starting from the corresponding onset. Signals associated with the same syllable were then averaged separately for the speaking and listening conditions.

To quantify the temporal similarity and relative delay between neural responses evoked by speech production and perception, we performed a cross-correlation analysis on the neural signals recorded from individual channels. Given two single-channel neural signals x(t) and y(t), corresponding to the same syllable under speaking and listening conditions, respectively, we first removed their mean values to eliminate direct-current components:

\[
\tilde{x}(t) = x(t) - \frac{1}{N}\sum_{t'=1}^{N} x(t'), \qquad
\tilde{y}(t) = y(t) - \frac{1}{N}\sum_{t'=1}^{N} y(t'), \tag{1}
\]

where N denotes the length of the two signals. We then computed the cross-correlation function between the mean-centered signals as:

\[
R_{xy}(\tau) = \sum_{t} \tilde{x}(t)\,\tilde{y}(t+\tau), \tag{2}
\]

where the lag τ spans the full range from −(N−1) to (N−1). To focus on physiologically plausible temporal offsets, the analysis was restricted to lags satisfying |τ| ≤ τ_max. To facilitate comparison across channels and conditions, the cross-correlation function was normalized to yield a correlation coefficient:

\[
\rho_{xy}(\tau) = \frac{R_{xy}(\tau)}{\sqrt{\sum_{t}\tilde{x}(t)^{2}\,\sum_{t}\tilde{y}(t)^{2}}}, \tag{3}
\]

which bounds the correlation values between −1 and 1. For each channel, we identified the maximum correlation coefficient

\[
\rho_{\max} = \max_{\tau}\, \rho_{xy}(\tau), \tag{4}
\]

along with the corresponding lag τ*. The value ρ_max was used to quantify the similarity between speaking- and listening-evoked neural responses, while τ* was interpreted as the relative temporal delay between the two conditions.
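As a concrete illustration of Eqs. (1)–(4), the following numpy sketch computes the normalized cross-correlation and the best lag for a single channel; the lag bound tau_max and variable names are illustrative.

```python
import numpy as np

def latency_from_xcorr(x: np.ndarray, y: np.ndarray, tau_max: int):
    """Return (rho_max, tau_star) for speaking (x) vs. listening (y) signals.

    Implements Eqs. (1)-(4): mean-centering, cross-correlation over lags
    |tau| <= tau_max, normalization to a correlation coefficient, and the
    argmax over lags. A positive tau_star means y lags behind x.
    """
    x = x - x.mean()                              # Eq. (1)
    y = y - y.mean()
    n = len(x)
    norm = np.sqrt(np.sum(x**2) * np.sum(y**2))   # denominator of Eq. (3)
    lags = np.arange(-tau_max, tau_max + 1)
    rho = np.empty(len(lags))
    for i, tau in enumerate(lags):
        if tau >= 0:                              # Eq. (2): sum_t x(t) y(t+tau)
            r = np.sum(x[: n - tau] * y[tau:])
        else:
            r = np.sum(x[-tau:] * y[: n + tau])
        rho[i] = r / norm                         # Eq. (3)
    best = int(np.argmax(rho))                    # Eq. (4)
    return float(rho[best]), int(lags[best])
```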
Brain decoder

In this section, we provide detailed descriptions of NeuroSketch and three other brain decoders evaluated in our work. NeuroSketch is described in a separate work evaluated on public datasets. The present study is independent in terms of data and experimental setting, and focuses on a unified multimodal brain-to-sentence decoding framework and LLM-based inference for Mandarin. The schematic diagrams of the architectures for the four brain decoders are presented in Extended Data Fig. A.2. The input to each decoder can be defined as X ∈ R^{C×L}, where C denotes the number of channels and L denotes the length of the neural recording acquired during speech production or perception.

NeuroSketch. NeuroSketch [19] is a 2D-CNN-based neural decoding architecture developed in our concurrent work. Given the input X ∈ R^{C×L}, the model first reshapes it into X′ ∈ R^{B×1×3C×⌊L/3⌋}. The reshaped two-dimensional representation is subsequently processed by a stem stage comprising four sequential stem blocks. Each block consists of a two-dimensional convolutional layer, followed by batch normalization and a ReLU activation function. The convolutional layers in all four blocks utilize a kernel size of 3, padding of 1, and strides of 2, 1, 1, and 2, respectively. The input and output feature dimensions for the four blocks are as follows: from 1 to 64, 64 to 32, 32 to 64, and 64 to 96, respectively. Following the stem stage, the network consists of four feature extraction stages, each comprising four NeuroSketch blocks. The first three blocks maintain constant output feature dimensions of 96, 192, 384, and 768 for the four stages, respectively. The final block in each stage increases the output feature dimensions to 128, 256, 512, and 1024, respectively. Each block consists of two key components: a patch embedding module and a convolution module. Within the patch embedding module, when it is located at the beginning of the last three stages, an average pooling layer with a kernel size and stride of 2 is first applied to downsample the feature map. Furthermore, for the modules in the first and last blocks of each stage, a linear projection is used to adjust the feature dimensions, followed by batch normalization to stabilize the feature distribution. For all other cases, the module performs an identity mapping. The convolution module applies grouped 3×3 convolutions, where the number of groups is set to the output feature dimensions divided by 16. It is followed by batch normalization, Mish activation [40], and a 1×1 convolution for feature fusion. The output is added back to the patch-embedded input via a residual connection. After the four-stage feature extraction, we then apply GeM pooling [41] to aggregate the representation along the temporal dimension, which is finally passed through a linear layer to produce the class probabilities.
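For readers unfamiliar with GeM pooling, the following PyTorch sketch shows the generalized-mean aggregation used at the end of the decoder; the module structure and the initialization p = 3 follow the common formulation in [41] and are not taken from the authors' code.

```python
import torch
import torch.nn as nn

class GeMPool1d(nn.Module):
    """Generalized-mean pooling over the temporal dimension.

    GeM computes (mean(x^p))^(1/p); p = 1 recovers average pooling and
    p -> infinity approaches max pooling. p is learned with the network.
    """
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features, time) -> (batch, features)
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=-1).pow(1.0 / self.p)
```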
ModernTCN. ModernTCN [28] is a temporal convolutional architecture designed for time series data. The model is organized into three hierarchical stages, each containing an embedding layer and three ModernTCN blocks. Across these stages, the feature dimension progressively increases from 32 to 64 and then to 128. The embedding layer in the first stage differs from those in the subsequent stages. Specifically, it performs patch embedding independently for each channel using a one-dimensional convolutional layer with an output feature dimension of 32, a kernel size of 50, and a stride of 50. In contrast, the embedding layers at the beginning of the second and third stages apply linear projection operations to increase the feature dimension for each channel. Subsequently, the channel and feature dimensions are concatenated and fed into a ModernTCN block. Each ModernTCN block comprises three main components: a convolutional module, a feature-wise feed-forward network (FFN), and a channel-wise FFN. The convolution module employs large-kernel depthwise convolutions with kernel sizes of 21, 17, and 13 in the first, second, and third stages, respectively, complemented by auxiliary small-kernel branches with a kernel size of 5 to enhance local feature modeling. The feature-wise FFN groups convolutions by the number of channels C, while the channel-wise FFN groups them by the number of features. Both networks expand the hidden dimension four times via a linear projection, GELU activation [42], dropout, and residual connections, before projecting it back to the original size. Finally, the extracted features are averaged along the temporal dimension, and the channel and feature dimensions are concatenated and fed into a linear classification head for discrete category prediction.

Medformer. Medformer [29] is a multi-granularity patching Transformer architecture specifically developed for medical time-series classification. The model initiates processing with a multi-scale patch embedding module that segments the input signals into non-overlapping temporal patches of lengths 5, 10, 20, and 50, thereby capturing both fine- and coarse-grained temporal dynamics. Each patch is projected into a 384-dimensional embedding space using a set of cross-channel token embedding layers, to which learnable contextual tokens and sinusoidal positional encodings are added to preserve the channel and temporal order. The embeddings derived from the various patch scales are then independently processed by six Medformer encoder blocks, each comprising intra-scale and inter-scale self-attention mechanisms as well as an FFN. The intra-scale self-attention is applied separately to the patch sequences within each temporal scale, enabling the model to capture scale-specific temporal dependencies. Conversely, the inter-scale self-attention operates on a set of router tokens, which are obtained by extracting the last token from each intra-scale output. These router tokens interact through self-attention to facilitate the exchange of information across different temporal scales. The updated router representations then replace the original tokens in each scale sequence. The FFN expands the hidden dimension four times via a linear projection, ReLU activation, dropout, and layer normalization, before projecting it back to the original size. Residual skip connections are implemented around both attention and FFN to stabilize training. Finally, the router tokens from all scales are concatenated and passed through a linear projection layer to produce the class probabilities.

MultiResGRU. MultiResGRU [30] is a hierarchical recurrent neural network designed to capture temporal dependencies in time series data.
The architecture begins with a linear embedding layer that projects the input into a 512-dimensional latent space, followed by layer normalization and a ReLU activation function. Subsequently, the model incorporates a stack of three residual bidirectional GRU blocks, each comprising a bidirectional GRU layer and a two-layer FFN. Within each block, the FFN initially expands the hidden representation to 1536 via a linear projection, followed by a ReLU activation, dropout, and layer normalization, and then projects the features back to the original hidden size using the same structure. Finally, the features are averaged along the temporal dimension and passed to a linear classification head for discrete category prediction.

Beam search

Beam search was used to decode the predicted probability distributions of the model on the initials and finals into candidate syllable sequences. Specifically, inference was first performed on the test sentences using the trained brain decoder to obtain the probability distributions of the initial and final components for each character. These probabilities were then organized according to the initial–final structure to construct a probability matrix that represents the likelihoods of different initial–final combinations for each character. In addition, a separator symbol ("+") with a fixed probability of 1 was inserted between consecutive initial–final pairs, ensuring that the beam search treated each complete syllable as a distinct decoding unit.

A lexicon-constrained beam search was subsequently applied to generate valid syllable sequences from the probability matrix. The lexicon was constructed by segmenting the decoding corpora into words and converting each word into its corresponding syllable sequence, which was organized as a prefix tree for efficient lookup. During decoding, the beam width was set to 100, a top-k filter of 50 was applied, and the 20 most probable candidate sequences were retained. At each step of the beam search, up to 100 hypotheses were maintained. These hypotheses were expanded by querying the prefix tree with the top-50 most probable syllables from the model's output distribution, ensuring that only syllable sequences corresponding to valid prefixes of lexicon entries were considered. Finally, the 20 complete hypotheses with the highest cumulative log-probabilities were selected, yielding the most likely decoding candidates.
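The following Python sketch illustrates the lexicon-constrained search, assuming per-character log-probabilities over whole syllables have already been assembled from the initial–final probability matrix; the prefix-tree layout, the restart-at-root convention, and all names are illustrative simplifications, not the authors' implementation.

```python
import heapq

def build_prefix_tree(lexicon_words):
    """lexicon_words: iterable of syllable tuples, e.g. ('ni', 'hao')."""
    root = {}
    for word in lexicon_words:
        node = root
        for syl in word:
            node = node.setdefault(syl, {})
        node["<end>"] = True
    return root

def beam_search(char_logprobs, tree, beam=100, topk=50, n_best=20):
    """char_logprobs: list over characters; each maps syllable -> log-prob.

    Hypotheses are (score, syllables, tree_node); expansion only follows
    syllables that are valid continuations in the prefix tree, and a
    completed lexicon word restarts matching at the root.
    """
    hyps = [(0.0, (), tree)]
    for dist in char_logprobs:
        top_syls = heapq.nlargest(topk, dist.items(), key=lambda kv: kv[1])
        new_hyps = []
        for score, syls, node in hyps:
            for syl, lp in top_syls:
                nxt = node.get(syl)
                if nxt is None:
                    continue  # invalid prefix: prune this extension
                follow = tree if nxt.get("<end>") else nxt
                new_hyps.append((score + lp, syls + (syl,), follow))
        hyps = heapq.nlargest(beam, new_hyps, key=lambda h: h[0])
    return heapq.nlargest(n_best, hyps, key=lambda h: h[0])
```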
Syllable-to-sentence decoding framework

After obtaining the beam search candidates, we employed an LLM to generate the correct sentences. We observed that directly using the LLM for inference yielded suboptimal results, especially with small- or medium-scale LLMs, which sometimes produced outputs completely unrelated to the ground truth. Although very large commercial LLMs can generate reasonable results, their inference demands significant computational resources. Moreover, due to the extreme sensitivity of clinical data, hospital data storage environments are typically isolated from external networks. As a result, decoding models must be deployed locally in practical decoding scenarios, a requirement that commercial LLMs often struggle to meet. Therefore, our goal was to enable small-scale LLMs to effectively handle this task.

Despite selecting the top-20 candidates based on beam search scores, the quality of these candidates was highly variable in practice, often including many samples with extremely high error rates that could mislead the LLM. To address this issue, we decomposed the task into two simpler subtasks. First, the LLM selects the three best candidates from all beam search outputs. Then, it infers the correct sentence based on these three selected candidates. There are two reasons for choosing three candidates: first, the best candidate is often not unique; second, the candidates may complement each other. For example, a certain syllable may be incorrect in the first candidate but correct in the second. Allowing the LLM to consider multiple high-quality candidates enables it to integrate complementary information and produce a more accurate result. In our implementation, we utilized a publicly available Chinese-oriented LLM, Qwen2.5-7B-Instruct [31], as the base model.

Vocabulary expansion. There are a total of 416 toneless syllables in Hanyu Pinyin [43], of which the vocabulary of Qwen2.5-7B-Instruct covers 202 (Supplementary 5). Therefore, we extended the vocabulary to include all syllables.

Corpus for post-training. Because our recorded training set contains only individual-character data, post-training the LLM requires constructing synthetic data from publicly available datasets. We built our corpus from two publicly available datasets: NLPCC18 [44] and SIGHAN15 [45], which consist of relatively simple Chinese sentences. We filtered these datasets by retaining only sentences composed exclusively of Chinese characters. The filtered sentences and their corresponding toneless syllable sequences formed the post-training corpus.

Post-training task 1: translation. The translation task involves translating toneless syllable sequences into corresponding Chinese sentences, which serves as a preliminary task for the post-training process. We designed this task to allow the LLM to build a connection between unfamiliar knowledge (syllable sequences) and familiar knowledge (the Chinese language). We denote the model fine-tuned with this task as Qwen2.5-7B-Translate.

Post-training task 2: listwise ranking. The listwise ranking task requires the LLM to select the three candidates closest to the correct syllable sequence from a set of 20. Because the post-training corpus only contained the correct syllable sequences of the sentences, we needed to construct candidate syllable sequences from the original sequences to simulate those obtained from beam search. We describe the steps that construct one candidate from a correct syllable sequence with n syllables, as sketched below. First, we decided on the error rate r of the candidate. There were three error rate ranges: (0, 0.3), (0.3, 0.6), and (0.6, 0.9), from which one range was randomly selected according to probabilities drawn from a Dirichlet distribution with weights of 2, 2, and 1. The error rate r was uniformly sampled within the range. In addition, if the selected range was (0, 0.3), the error rate r was set to 0 with a probability of 10%. Second, we performed ⌊n × r⌋ random replacements. In each replacement, we randomly selected a syllable. If the selected syllable corresponded to a single Chinese character in the original sentence, it was replaced with another randomly chosen syllable. If the syllable corresponded to a character that formed a word together with other characters, then the entire word was replaced, ensuring that the syllable sequence of the substitute word differed from that of the original word by exactly one syllable. This replacement strategy enables the distribution of the generated data to more closely match the distribution of actual beam search outputs. Since different replacement operations might affect the same syllable, we recalculated the error rate of the candidate relative to the original syllable sequence after all replacement steps were completed to make corrections. By independently repeating the above steps 20 times, we obtained a candidate set for each sample. The fine-tuning of listwise ranking was based on Qwen2.5-7B-Translate. When training the model with the constructed data, we randomly permuted the input candidates and asked the model to find the three candidates with the lowest error rates. After training, we obtained Qwen2.5-7B-Rank.
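The numpy sketch below illustrates the error-rate sampling and substitution steps of the candidate construction; the word-level replacement is simplified to single-syllable substitutions, and the placeholder syllable inventory and names are illustrative.

```python
import numpy as np

RANGES = [(0.0, 0.3), (0.3, 0.6), (0.6, 0.9)]
SYLLABLES = ["ni", "hao", "ma", "shi", "de"]  # placeholder for the 416 syllables

def corrupt_sequence(syllables, rng=None):
    """Build one noisy candidate from a correct toneless syllable sequence.

    Simplified to per-syllable substitution; the paper additionally replaces
    whole words when a syllable belongs to a multi-character word.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Sample range-selection probabilities from Dirichlet(2, 2, 1), pick a range.
    probs = rng.dirichlet([2.0, 2.0, 1.0])
    idx = rng.choice(3, p=probs)
    lo, hi = RANGES[idx]
    r = rng.uniform(lo, hi)
    if idx == 0 and rng.random() < 0.10:  # 10% chance of an exact candidate
        r = 0.0
    candidate = list(syllables)
    n = len(candidate)
    for _ in range(int(n * r)):
        pos = rng.integers(n)
        choices = [s for s in SYLLABLES if s != candidate[pos]]
        candidate[pos] = choices[rng.integers(len(choices))]
    # Recompute the realized error rate, since replacements may collide.
    realized = sum(a != b for a, b in zip(candidate, syllables)) / n
    return candidate, realized
```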
Post-training task 3: correction. In the correction task, the model was trained to predict the correct Chinese sentence based on the three selected candidate syllable sequences. We employed Qwen2.5-7B-Rank to infer the data from the listwise ranking task, deriving the training data for correction. Using the previously trained model for inference, rather than directly selecting the three candidates with the lowest error rates, was motivated by the fact that the model-inferred candidates may not always be optimal. Incorporating relatively suboptimal candidates helps enhance the robustness of the model. The correction task was fine-tuned based on Qwen2.5-7B-Rank, and we denote our correction model as Qwen2.5-7B-Correct.

Two-stage inference. We employed Qwen2.5-7B-Correct as our inference model. First, the top-20 candidates output by the beam search process were fed into the model to select the top three. Then, the selected candidates were input into the model again to generate the sentence.
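Schematically, the two-stage loop amounts to two calls to the same post-trained model, as in the sketch below; the generate() wrapper and the prompt wording are purely illustrative (the actual tuning instructions are given in Supplementary 6).

```python
def two_stage_inference(candidates, generate):
    """candidates: list of 20 toneless syllable strings from beam search.
    generate: callable mapping a prompt string to the model's text output,
    e.g. a thin wrapper around a locally served Qwen2.5-7B-Correct.
    """
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))

    # Stage 1: listwise ranking -- pick the three most plausible candidates.
    rank_prompt = (
        "From the following 20 pinyin syllable sequences, select the three "
        f"closest to the intended sentence:\n{numbered}"
    )
    top3 = generate(rank_prompt)

    # Stage 2: correction -- fuse the three candidates into one sentence.
    correct_prompt = (
        "Based on these three pinyin candidates, output the correct Chinese "
        f"sentence:\n{top3}"
    )
    return generate(correct_prompt)
```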
Ablation study. In the ablation study, we constructed a series of ablated models to evaluate the technical contributions of our syllable-to-sentence decoding framework. The IME-style decoder, implemented using the Python Pinyin2Hanzi package, relies on lexicon- and language-model-driven search for syllable-to-sentence conversion. For the direct correction approach, both the top-1 and top-3 candidate settings utilized inference via Qwen2.5-7B-Correct to control variables consistently. Since Qwen2.5-7B-Correct does not support direct correction of 20 candidates simultaneously, we trained a separate post-trained model based on Qwen2.5-7B-Translate for the top-20 candidate setting. Ablations of the vocabulary expansion and translation post-training components were performed by removing these modules from the original framework.

Training and evaluation details

In this section, we present the details for all the training and evaluation in our decoding framework.

Data augmentation. In the classification of syllable components, we employed a comprehensive data augmentation pipeline comprising five techniques to enrich data diversity and enhance model robustness. First, a random shift was applied with a probability of 0.5, which shifted the input sequence randomly within ±10% of its length, where positive and negative values indicate forward and backward shifts, respectively. Second, additive noise was applied with a probability of 0.1, where zero-mean Gaussian noise was adaptively scaled according to the input's standard deviation to yield a signal-to-noise ratio randomly sampled between 15 and 30 dB. Third, channel masking was applied with a probability of 0.5, where each channel had an independent masking chance of 0.2. Fourth, time masking was applied with a probability of 0.5, in which four consecutive temporal segments, each covering 5% of the total sequence length, were masked along the time axis. Finally, mixup [46] was used with a probability of 0.5, where two samples and their corresponding labels were linearly combined using a mixing coefficient λ drawn from a Beta(α, α) distribution with α = 0.4, thus smoothing the decision boundaries and improving generalization.
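A compact numpy sketch of the five augmentations applied to one training sample follows; the probabilities and magnitudes match the text, while the helper structure (and the placement of the four time masks) is an illustrative assumption.

```python
import numpy as np

def augment(x, y_onehot, other=None, rng=None):
    """x: (channels, time) sEEG segment; y_onehot: label vector.
    other: optional second (x, y) pair used for mixup.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = x.astype(float).copy()
    C, T = x.shape
    if rng.random() < 0.5:  # random temporal shift within +/-10% of length
        x = np.roll(x, rng.integers(-T // 10, T // 10 + 1), axis=1)
    if rng.random() < 0.1:  # additive Gaussian noise at 15-30 dB SNR
        snr_db = rng.uniform(15, 30)
        sigma = x.std() / (10 ** (snr_db / 20))
        x = x + rng.normal(0, sigma, size=x.shape)
    if rng.random() < 0.5:  # channel masking: each channel dropped w.p. 0.2
        x = x * (rng.random(C) >= 0.2)[:, None]
    if rng.random() < 0.5:  # time masking: four segments of 5% length each
        seg = max(1, T // 20)
        for start in rng.integers(0, T - seg, size=4):
            x[:, start:start + seg] = 0
    if other is not None and rng.random() < 0.5:  # mixup with Beta(0.4, 0.4)
        lam = rng.beta(0.4, 0.4)
        x2, y2 = other
        x = lam * x + (1 - lam) * x2
        y_onehot = lam * y_onehot + (1 - lam) * y2
    return x, y_onehot
```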
Brain decoder training. All models were initialized with a random seed of 42 to ensure reproducibility. We performed training with a batch size of 64 and evaluation with a batch size of 128, using binary cross-entropy (BCE) loss. Optimization followed the Muon algorithm [47], with a weight decay coefficient of 0.05 serving as regularization. During the warm-up stage, which covered 10% of the total training steps, the learning rate was linearly increased to 3 × 10⁻⁴. After that, a cosine decay schedule was applied, maintaining the minimum learning rate at zero to promote smoother convergence and more stable optimization dynamics.

For channel contribution analysis, we used only each participant's training data, which were further split into training and validation sets at a 4:1 ratio. We trained the model for 100 epochs and retained the checkpoint corresponding to the best validation performance. For syllable component decoding with the four brain decoders, models were trained on the full training set for 500 epochs. To enhance model generalization, we applied Stochastic Weight Averaging (SWA) [48] during training, from the 250th to the 500th epoch. During evaluation, Test-Time Augmentation (TTA) [49] was employed to improve prediction stability. For each test sample, the original input was retained along with two additional versions generated via random shifts. The model performed inference on all three variants, and their output logits were averaged to produce the final prediction. Moreover, to ensure evaluation reliability and reduce randomness, we performed inference under ten different random seeds, and the averaged results were reported.

LLM post-training. For all post-training tasks of the LLM, we performed supervised fine-tuning (SFT) using LoRA [50] with a rank of r = 16 and a scaling factor of α = 32. Each task was trained for one epoch and optimized with the AdamW optimizer. We adopted a linear warm-up over 5% of the training steps to reach a peak learning rate of 5 × 10⁻⁵, followed by a cosine scheduler that decayed the learning rate to zero. To further improve computational efficiency, BFloat16 precision [51] and FlashAttention-2 [52] were employed throughout the fine-tuning process. The detailed tuning instructions are summarized in Supplementary 6.

Statistical analysis

In this section, we describe the statistical analyses used for all experimental results reported in this study. Different statistical tests were applied depending on the comparison setting, as detailed below. For comparisons between speech production and speech perception within the same set of highly responsive channels (Fig. 2c), performance differences between speaking and listening tasks were assessed using two-sided Wilcoxon signed-rank tests. For comparisons of decoding performance between the left and right hemispheres (Fig. 2f), effect sizes were quantified using Cliff's delta, which is appropriate for non-parametric comparisons of independent samples. For initial and final classification performance, one-sided Wilcoxon signed-rank tests were conducted to assess whether each participant's decoding accuracy significantly exceeded chance level. Comparisons between different brain decoders (Fig. 3a) were conducted across all participants using two-sided Wilcoxon signed-rank tests. To assess the relationship between the improvement ratio in the EMP and that in the initial–final classification performance (Fig. 4b), we applied a one-sided Wilcoxon signed-rank test to examine whether the distribution of their paired differences was significantly greater than zero. For sentence-level decoding performance (Fig. 4c,d), comparisons among different LLMs, as well as between our proposed LLM and its ablated variants, were performed across all participants using two-sided Wilcoxon signed-rank tests.
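For reference, paired comparisons of this kind can be reproduced with scipy as sketched below; the per-participant score arrays are placeholders, not values from the study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-participant decoding scores for two paired conditions.
speaking = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59])
listening = np.array([0.58, 0.51, 0.65, 0.50, 0.60, 0.54])

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_two_sided = wilcoxon(speaking, listening, alternative="two-sided")

# One-sided variant, e.g. testing accuracy against a known chance level.
chance = 0.40
stat1, p_greater = wilcoxon(speaking - chance, alternative="greater")
print(p_two_sided, p_greater)
```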
Data availability

Because of ethical restrictions, the dataset cannot be publicly archived. However, it is available on request from the senior corresponding author. The source data accompanying this paper include all quantitative results reported in the manuscript, compiled into Supplementary Tables.xlsx, along with a README.md file that describes the contents of each sheet.

Code availability

The source code supporting this study is made available with the paper. Detailed implementation and usage information can be found in the accompanying README.md file.

Use of LLMs

In addition to the LLMs that we post-trained and evaluated within our own experiments, we made limited use of publicly available LLMs during the preparation of this manuscript. Specifically, an LLM was used to assist with minor writing refinements, such as improving grammatical accuracy and textual clarity. In some cases, the LLM was also employed for small-scale code completion tasks, including generating boilerplate functions or suggesting minor syntax corrections. All core ideas, experimental designs, analyses, and conclusions were completely conceived, implemented, and validated by the authors.

Acknowledgments

This work was supported by the National Science and Technology Major Project (2025ZD0215100) and the National Natural Science Foundation of China (Nos. 32595492 and 82272116). We also gratefully acknowledge the support of the iBRAIN (Intracranial Brain Recording/Activation/Inhibition Network) Data Alliance. The iBRAIN Alliance is a multi-center collaborative initiative dedicated to establishing a large-scale, standardized intracranial EEG database to advance research in brain-computer interfaces and clinical neuroscience.

Author Contributions

Z.Y., Y.Y., and M.L. conceived the study. Z.Y., G.Z., and B.C. designed the data acquisition paradigms. L.C. and X.L. were responsible for clinical electrode implantation. Z.W. managed the clinical trials. B.C. conducted data acquisition and preprocessing. Y.X. assisted with clinical data acquisition. Y.M. supervised the overall data acquisition. Z.Y. and G.Z. designed the model framework. G.Z. implemented the model and performed validation experiments. Z.Y., B.C., and M.L. designed the analysis experiments. Z.Y. conducted the analysis experiments. Z.Y. and G.Z. wrote the manuscript. Y.Y. revised the manuscript. Y.Y. and M.L. supervised the project.

Competing Interests

The authors declare no competing interests.

References

[1] Qian, Y., Liu, C., Yu, P., Ran, X., Li, S., Yang, Q., Liu, Y., Xia, L., Wang, Y., Qi, J., Zhou, E., Lu, J., Li, Y., Tao, T.H., Zhou, Z., Wu, J.: Real-time decoding of full-spectrum Chinese using brain-computer interface. Science Advances 11(45), 9968 (2025)

[2] Metzger, S.L., Littlejohn, K.T., Silva, A.B., Moses, D.A., Seaton, M.P., Wang, R., Dougherty, M.E., Liu, J.R., Wu, P., Berger, M.A., Zhuravleva, I., Tu-Chan, A., Ganguly, K., Anumanchipalli, G.K., Chang, E.F.: A high-performance neuroprosthesis for speech decoding and avatar control. Nature 620(7976), 1037–1046 (2023)

[3] Willett, F.R., Kunz, E.M., Fan, C., Avansino, D.T., Wilson, G.H., Choi, E.Y., Kamdar, F., Glasser, M.F., Hochberg, L.R., Druckmann, S., Shenoy, K.V., Henderson, J.M.: A high-performance speech neuroprosthesis. Nature 620(7976), 1031–1036 (2023)

[4] Chen, X., Wang, R., Khalilian-Gourtani, A., Yu, L., Dugan, P., Friedman, D., Doyle, W., Devinsky, O., Wang, Y., Flinker, A.: A neural speech decoding framework leveraging deep learning and speech synthesis. Nature Machine Intelligence 6(4), 467–480 (2024)

[5] Card, N.S., Wairagkar, M., Iacobacci, C., Hou, X., Singer-Clark, T., Willett, F.R., Kunz, E.M., Fan, C., Nia, M.V., Deo, D.R., Srinivasan, A., Choi, E.Y., Glasser, M.F., Hochberg, L.R., Henderson, J.M., Shahlaie, K., Stavisky, S.D., Brandman, D.M.: An accurate and rapidly calibrating speech neuroprosthesis. The New England Journal of Medicine 391(7), 609–618 (2024)

[6] Feng, C., Cao, L., Wu, D., Zhang, E., Wang, T., Jiang, X., Chen, J., Wu, H., Lin, S., Hou, Q., Zhu, J., Yang, J., Sawan, M., Zhang, Y.: Acoustic inspired brain-to-sentence decoder for logosyllabic language. Cyborg and Bionic Systems 6, 0257 (2025)

[7] Zheng, H., Wang, H., Jiang, W., Chen, Z., He, L., Lin, P., Wei, P., Zhao, G., Liu, Y.: Du-IN: Discrete units-guided mask modeling for decoding speech from intracranial neural signals. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

[8] Duraivel, S., Rahimpour, S., Chiang, C.-H., Trumpis, M., Wang, C., Barth, K., Harward, S.C., Lad, S.P., Friedman, A.H., Southwell, D.G., Sinha, S.R., Viventi, J., Cogan, G.B.: High-resolution neural recordings improve the accuracy of speech decoding. Nature Communications 14(1), 6938 (2023)

[9] Kunz, E.M., Abramovich Krasa, B., Kamdar, F., Avansino, D.T., Hahn, N., Yoon, S., Singh, A., Nason-Tomaszewski, S.R., Card, N.S., Jude, J.J., Jacques, B.G., Bechefsky, P.H., Iacobacci, C., Hochberg, L.R., Rubin, D.B., Williams, Z.M., Brandman, D.M., Stavisky, S.D., AuYong, N., Pandarinath, C., Druckmann, S., Henderson, J.M., Willett, F.R.: Inner speech in motor cortex and implications for speech neuroprostheses. Cell 188(17), 4658–4673.e17 (2025)

[10] Anumanchipalli, G.K., Chartier, J., Chang, E.F.: Speech synthesis from neural decoding of spoken sentences.
Nature 568(7753), 493–498 (2019)

[11] Singh, A., Thomas, T., Li, J., Hickok, G., Pitkow, X., Tandon, N.: Transfer learning via distributed brain recordings enables reliable speech decoding. Nature Communications 16(1), 8749 (2025)

[12] Luo, S., Angrick, M., Coogan, C., Candrea, D.N., Wyse-Sookoo, K., Shah, S., Rabbani, Q., Milsap, G.W., Weiss, A.R., Anderson, W.S., Tippett, D.C., Maragakis, N.J., Clawson, L.L., Vansteensel, M.J., Wester, B.A., Tenore, F.V., Hermansky, H., Fifer, M.S., Ramsey, N.F., Crone, N.E.: Stable decoding from a speech BCI enables control for an individual with ALS without recalibration for 3 months. Advanced Science (Weinheim, Baden-Württemberg, Germany) 10(35), 2304853 (2023)

[13] Moses, D.A., Metzger, S.L., Liu, J.R., Anumanchipalli, G.K., Makin, J.G., Sun, P.F., Chartier, J., Dougherty, M.E., Liu, P.M., Abrams, G.M., Tu-Chan, A., Ganguly, K., Chang, E.F.: Neuroprosthesis for decoding speech in a paralyzed person with anarthria. New England Journal of Medicine 385(3), 217–227 (2021)

[14] Défossez, A., Caucheteux, C., Rapin, J., Kabeli, O., King, J.-R.: Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence 5(10), 1097–1107 (2023)

[15] Fodor, M.A., Csapó, T.G., Arthur, F.V.: Towards decoding brain activity during passive listening of speech. Beszédtudomány - Speech Science, 158–184 (2024)

[16] Zhang, D., Wang, Z., Qian, Y., Zhao, Z., Liu, Y., Hao, X., Li, W., Lu, S., Zhu, H., Chen, L., Xu, K., Li, Y., Lu, J.: A brain-to-text framework for decoding natural tonal sentences. Cell Reports 43(11), 114924 (2024)

[17] Xu, Z. (ed.): Hanyu Da Zidian. Shanghai Lexicographical Publishing House (1986)

[18] Li, P., Yip, M.C.: Context effects and the processing of spoken homophones. Reading and Writing 10(3), 223–243 (1998)

[19] Zhang, G., Yuan, Z., Yang, J., Chen, J., Meng, L., Yang, Y.: NeuroSketch: An effective framework for neural decoding via systematic architectural optimization (2025)

[20] Saha, S., Ahmed, K.I.U., Mostafa, R., Hadjileontiadis, L., Khandoker, A.: Evidence of variabilities in EEG dynamics during motor imagery-based multiclass brain-computer interface. IEEE Transactions on Neural Systems and Rehabilitation Engineering 26(2), 371–382 (2018)

[21] Christensen, J.C., Estepp, J.R., Wilson, G.F., Russell, C.A.: The effects of day-to-day variability of physiological data on operator functional state classification. NeuroImage 59(1), 57–63 (2012)

[22] Hamilton, L.S., Oganian, Y., Hall, J., Chang, E.F.: Parallel and distributed encoding of speech across human auditory cortex. Cell 184(18), 4626–4639.e13 (2021)

[23] Chen, C., Dupré La Tour, T., Gallant, J.L., Klein, D., Deniz, F.: The cortical representation of language timescales is shared between reading and listening. Communications Biology 7(1), 284 (2024)

[24] Magrassi, L., Aromataris, G., Cabrini, A., Annovazzi-Lodi, V., Moro, A.: Sound representation in higher language areas during language generation. Proceedings of the National Academy of Sciences 112(6), 1868–1873 (2015)

[25] Hickok, G., Poeppel, D.: The cortical organization of speech processing. Nature Reviews Neuroscience 8(5), 393–402 (2007)

[26] Cliff, N.: Dominance statistics: Ordinal analyses to answer ordinal questions.
Psychological Bulletin 114(3), 494–509 (1993)

[27] Wan, Z., Xia, X., Lo, D., Murphy, G.C.: How does machine learning change software development practices? IEEE Transactions on Software Engineering 47(09), 1857–1871 (2021)

[28] Luo, D., Wang, X.: ModernTCN: A modern pure convolution structure for general time series analysis. In: The Twelfth International Conference on Learning Representations (2024)

[29] Wang, Y., Huang, N., Li, T., Yan, Y., Zhang, X.: Medformer: A multi-granularity patching transformer for medical time-series classification. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

[30] Zinxira: TLVMC Parkinson's FOG Prediction 4th Place Solution (2024). https://github.com/Zinxira/tlvmc-parkinsons-fog-prediction-4th-place-solution

[31] Qwen Team: Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., Qiu, Z.: Qwen2.5 Technical Report (2025)

[32] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[33] Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al.: DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

[34] ByteDance: Doubao-Seed-1.6 (2025). https://console.volcengine.com/ark/region:ark+cn-beijing/model/detail?Id=doubao-seed-1-6

[35] OpenAI: GPT-5 Chat Model Card (2025). https://platform.openai.com/docs/models/gpt-5-chat-latest

[36] xAI: Grok 4 Fast Model Card (2025). https://data.x.ai/2025-09-19-grok-4-fast-model-card.pdf

[37] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

[38] Fischl, B.: FreeSurfer. NeuroImage 62(2), 774–781 (2012)

[39] Roussel, P., Le Godais, G., Bocquelet, F., Palma, M., Hongjie, J., Zhang, S., Giraud, A.-L., Mégevand, P., Miller, K., Gehrig, J., Kell, C., Kahane, P., Chabardès, S., Yvert, B.: Observation and assessment of acoustic contamination of electrophysiological brain signals during speech production and sound perception. Journal of Neural Engineering 17(5), 056028 (2020)

[40] Misra, D.: Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681 (2019)

[41] Berman, M., Jégou, H., Vedaldi, A., Kokkinos, I., Douze, M.: MultiGrain: A unified image embedding for classes and instances. arXiv preprint (2019)

[42] Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)

[43] Comparison of Hanyu Pinyin romanization systems. https://www.pinyin.info/romanization/compare/hanyu.html

[44] NLPCC 2018 (2018). http://tcci.ccf.org.cn/conference/2018/taskdata.php

[45] Tseng, Y.-H., Lee, L.-H., Chang, L.-P., Chen, H.-H.: Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In: Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pp.
32–37 (2015)

[46] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)

[47] Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., et al.: Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982 (2025)

[48] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 (2018)

[49] Shanmugam, D., Blalock, D., Balakrishnan, G., Guttag, J.: Better aggregation in test-time augmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1214–1223 (2021)

[50] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[51] Kalamkar, D., Mudigere, D., Mellempudi, N., Das, D., Banerjee, K., Avancha, S., Vooturi, D.T., Jammalamadaka, N., Huang, J., Yuen, H., et al.: A study of BFLOAT16 for deep learning training. arXiv preprint arXiv:1905.12322 (2019)

[52] Dao, T.: FlashAttention-2: Faster attention with better parallelism and work partitioning. In: The Twelfth International Conference on Learning Representations (2024)

Appendix A Extended Data

A.1 Overall spatial distribution of channel contributions

[Figure A1: electrode heatmaps over three brain perspectives for the speaking and listening tasks; color scales range from 0 to 0.8.]
Fig. A1 Spatial distribution of sEEG electrodes and their contributions across the brain. Electrodes from 12 subjects are visualized from three brain perspectives. Heatmaps indicate each channel's F1 score in the speaking and listening tasks during initial, final, and tone decoding.

A.2 Architectures of brain decoders

[Figure A2: block diagrams of ModernTCN, Medformer, MultiResGRU, and NeuroSketch, annotated with per-stage tensor shapes.]
Fig. A2 Architectures of the four brain decoders. The input to each brain decoder is an sEEG signal with C channels and L time steps. For each stage of the decoder, we indicate the number of output channels, the number of time steps, and the corresponding number of blocks.

A.3 F1 scores of initial-final classification

[Figure A3: bar plots of initial and final F1 scores per subject (S1–S12) for NeuroSketch, ModernTCN, Medformer, and MultiResGRU in the speaking and listening tasks.]
Fig. A3 Initial-final classification results. Bar plots illustrate the initial and final classification F1 scores achieved by four different brain decoders for speaking and listening tasks.

A.4 F1 scores of tone classification

[Figure A4: bar plots of tone classification F1 scores per subject for the speaking and listening tasks.]
Fig. A4 Tone classification results. Bar plots illustrate the tone classification F1 scores achieved by four different brain decoders for speaking and listening tasks.