[b] = [d] − [t] + [p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

Kwanghee Choi¹, Eunjung Yeo¹, Cheol Jun Cho², David Harwath¹, David R. Mortensen³
¹UT Austin, ²UC Berkeley, ³CMU

Abstract

Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic.

1 Introduction

The same year that word2vec (Mikolov et al., 2013a), a self-supervised method for distributional dense word representations, was introduced, Mikolov et al. (2013b) showed that these representations encode linguistically meaningful relations through vector arithmetic (Figure 1), i.e.,

  v_king − v_man + v_woman ≃ v_queen,    (1)

providing a ready explanation for how word2vec represents semantics. (Many have argued that the analogy test used in Mikolov et al. (2013b), which differs from ours, can be fragile in practice; see §A.1 for details.) In this paper, we ask: Do self-supervised speech models (S3Ms) represent phonology in an analogous compositional manner?

Figure 1: Comparing analogies for text and speech. Word representations capture semantic relations (word analogies borrowed from Ethayarajh et al. (2019)), while speech representations capture phonological relations. Such analogies (§3) can be used to control speech synthesis in a phonologically grounded manner (§4).

S3Ms, a class of neural networks trained on large quantities of unlabeled speech, have demonstrated strong performance across various downstream tasks, including speech recognition, synthesis, and spoken language understanding (Baevski et al., 2020; Hsu et al., 2021; Chen et al., 2022). Consequently, many studies have sought to understand what properties of S3M representations support this performance.

Early analyses primarily investigated what information S3M representations encode (Pasad et al., 2021). Empirical studies have shown that S3Ms organize speech according to relative distances that reflect acoustic similarity (Choi and Yeo, 2022), and that these representations form clusters corresponding to phonetic units (Wells et al., 2022; Choi et al., 2025). However, how this information is structured is still underexplored.
Building on prior observations and linguistic intuition, we aim to refine the current understanding of S3Ms. In §3, we propose a hypothesis: Phonological features are represented linearly within S3M representations, allowing phonological analogies to emerge. For example, consider the phone quadruplet [b], [p], [d], and [t]. [b] and [p] form a voiced-voiceless bilabial plosive pair, and [d] and [t] form a voiced-voiceless alveolar plosive pair. (Voicing refers to the vibration of the vocal folds. Bilabial and alveolar describe place of articulation (POA), with bilabials produced by bringing both lips together and alveolars by placing the tongue tip against the alveolar ridge.) This yields two symmetric phonological analogies:

  [b] : [p] = [d] : [t]    (voicing)    (2)
  [b] : [d] = [p] : [t]    (POA)        (3)

These analogies yield the following approximations in each corresponding vector r extracted from S3Ms (details in §3.1.4):

  r_[b] ≃ r_[p] + (r_[d] − r_[t])    (voicing)    (4)
       = r_[d] + (r_[p] − r_[t])    (POA)        (5)

which yield two symmetric compositional phonological vectors: a voicing vector v_voi. = r_[d] − r_[t] in eq. (4) and a change-of-POA vector v_POA = r_[p] − r_[t] in eq. (5) (which could also be called an alveolar-to-bilabial vector or a negative coronality vector). To empirically evaluate this hypothesis, we tested 19 phonological features from PanPhon (Mortensen et al., 2016): syllabic, sonorant, continuant, delayed release, lateral, nasal, strident, voice, spread glottis, anterior, coronal, distributed, labial, high, low, back, round, tense, and long. We find that analogies based on all 19 phonological features consistently hold in S3M representations.

In §4, we further explore the scale of each phonological vector. In detail, we introduce a scalar λ ∈ R into eq. (4):

  r_[b] ≃ r_[p] + λ · (r_[d] − r_[t]).    (6)

We hypothesize that the scale λ controls the acoustic characteristics associated with a phonological vector in a continuous manner. For instance, we expect the scale λ of the voicing vector v_voi. to correspond to the degree of voicing of the segment. To validate this hypothesis, we train a vocoder f⁻¹ to approximate the inverse of the S3M f:

  R = f(x)        (7)
  x ≃ f⁻¹(R),     (8)

where x is the input speech signal and R ∈ R^{T′×F} its S3M representation before pooling (details in §3.1.4). We modify the representation R by adding scaled phonological vectors (eq. (6)) and resynthesize the audio through the vocoder f⁻¹ (eq. (8)). We selected eight phonological features that can be directly quantified through acoustic measurements on x: height (high and low), backness, roundness, nasality, sonority, stridency, and voicing. We observe that acoustic measurements strongly correlate with the scale λ, for both interpolation (|λ| ≤ 1) and extrapolation (|λ| > 1) of phonological vectors. These results suggest that S3M representations encode phonological features not as purely binary distinctions but as a continuum through specific vector directions and scales.

In summary, our contributions to advancing the understanding of S3M representations are twofold:

• Direction: We show that S3M representations exhibit phonological vector arithmetic, i.e., the existence of compositional vectors that align with phonological features (§3).
• Scale: We show that the scale of phonological vectors corresponds to the degree of their associated phonological features, leading to interpretable control of speech synthesis (§4).
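To make the arithmetic in eqs. (4)–(6) concrete, the following is a minimal sketch (not the authors' released code) of how one might test such an analogy given mean phone representations; the phone_reps dictionary of pooled S3M vectors and its file layout are hypothetical stand-ins for the extraction pipeline described in §3.1.4.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical mean S3M representations per phone (pooled per segment, then averaged).
phone_reps = {p: np.load(f"reps/{p}.npy") for p in ["b", "p", "d", "t"]}  # assumed file layout

# Voicing vector from the [d]-[t] pair, as in eq. (4).
v_voicing = phone_reps["d"] - phone_reps["t"]

# Item-based analogy check: does [p] plus the voicing vector land near [b]?
approx_b = phone_reps["p"] + v_voicing
print("cos(r_[b], r_[p] + v_voi.) =", cosine(phone_reps["b"], approx_b))

# Scaled version, as in eq. (6): lambda controls the degree of voicing.
for lam in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(lam, cosine(phone_reps["b"], phone_reps["p"] + lam * v_voicing))
```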
2 Datasets

We use two phonetically transcribed and manually segmented datasets, TIMIT and VoxAngeles, to evaluate both English-specific and cross-linguistic generalizability of our findings.

TIMIT (Garofolo et al., 1993) contains utterances and their phonetic segmentations for 630 English speakers with balanced phonetic and dialectal coverage. To avoid ambiguities in extracting phonological features, we exclude diphthongs from our analysis. In addition, plosive closures are merged with their subsequent releases (plosives in TIMIT are separately segmented into closure and burst, differing from IPA definitions).

VoxAngeles (Chodroff et al., 2024) provides phonetic segmentations for audio from the UCLA Phonetics Archive covering 95 languages across 21 language families, providing broader phonetic diversity. As the dataset has no predefined train/test splits, we use the full set. Because VoxAngeles excludes English, it enables evaluation of whether S3Ms generalize to phones that do not occur in English.

3 Experiment 1: Direction of Phonological Vectors

In this section, we test the first hypothesis: whether linear phonological vectors that satisfy phonological analogies exist within S3M representations, by observing their directions (eqs. (4) and (5)).

3.1 Method

3.1.1 Creating phonological analogies

Phonological features capture fundamental properties of speech sounds (Trubetzkoy, 1949; Jakobson et al., 1951), such as voicing, place of articulation (e.g., bilabial, coronal, velar), and manner of articulation (e.g., plosive, fricative, nasal). To comprehensively analyze such features, we use PanPhon (Mortensen et al., 2016) to extract discrete phonological features for each phone p: h′_p = PanPhon(p) ∈ {−1, 0, 1}^21. PanPhon yields 21 phonological features with values + (present), 0 (not applicable), or − (absent). For example, /b/ can be represented as [+voice, +labial, −nasal, +anterior, 0 tense, ···]. To binarize the features, we expand each value +, 0, and − to [1, 0], [0, 0], and [0, 1], producing h_p = extend(h′_p) ∈ {0, 1}^42.

A quadruplet of phones p = (p1, p2, p3, p4) yields two symmetric analogies, e.g., eqs. (4) and (5). We construct the quadruplet set Q by filtering every possible quadruplet drawn from the phone vocabulary p ∈ V (meaning that all phones p1, p2, p3, p4 belong to V):

  Q = {p ∈ V | h_p1 − h_p2 = h_p3 − h_p4},    (9)

such that each quadruplet p ∈ Q denotes two symmetric analogies p1 : p2 = p3 : p4 and p1 : p3 = p2 : p4, or, equivalently, r_p1 ≃ r_p2 + r_p3 − r_p4. Note that the analogies are not required to be minimal pairs and may differ along multiple phonological features; we did not observe any systematic influence of minimal vs. non-minimal pairs (§B.6).
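Below is a small illustrative sketch of this construction, assuming the panphon package exposes a FeatureTable whose word_to_vector_list(..., numeric=True) returns the ±1/0 feature values described above (if the API differs, the same logic applies to any phone-to-feature mapping); the phone inventory shown is a toy example.

```python
from itertools import permutations
import numpy as np
import panphon  # assumed available: pip install panphon

ft = panphon.FeatureTable()

def binarized_features(phone: str) -> np.ndarray:
    """Map a phone to PanPhon features, then expand +/0/- to [1,0]/[0,0]/[0,1] (h_p in §3.1.1)."""
    values = ft.word_to_vector_list(phone, numeric=True)[0]  # assumed call; one segment expected
    expanded = []
    for value in values:
        if value > 0:
            expanded.extend([1, 0])
        elif value < 0:
            expanded.extend([0, 1])
        else:
            expanded.extend([0, 0])
    return np.array(expanded)

vocab = ["b", "p", "d", "t", "g", "k"]  # toy phone inventory
h = {p: binarized_features(p) for p in vocab}

# Eq. (9): keep quadruplets whose binarized feature differences match, yielding symmetric analogies.
quadruplets = [
    (p1, p2, p3, p4)
    for p1, p2, p3, p4 in permutations(vocab, 4)
    if np.array_equal(h[p1] - h[p2], h[p3] - h[p4])
]
print(len(quadruplets), quadruplets[:3])
```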
We obtain 236 and 468 quadruplets (hence twice as many analogies) from the TIMIT test set and the full VoxAngeles dataset, respectively. We excluded quadruplets for which any of the phones has no PanPhon mapping, or for which any of the phones has fewer than 50 occurrences within the dataset, to ensure reliable cosine similarity estimates; this reduced the number of phones from 47 to 43 for TIMIT and from 567 to 57 for VoxAngeles. No quadruplets were obtained for the "constricted glottis" and "consonantal" features in either dataset (low, constricted glottis, spread glottis, long, and consonantal were missing in TIMIT, and constricted glottis, lateral, and consonantal were missing in VoxAngeles), resulting in 21 − 2 = 19 phonological features being tested.

3.1.2 Measuring cosine similarities

Through the preparation steps described above, we now have a vector representation for each phone: D = {(p_i, r_i)}. To quantitatively measure whether each quadruplet p = (p1, p2, p3, p4) is consistent with the underlying representations, we calculate the average cosine similarity:

  cos(p) = E[cos(r_p1, r_p2 + r_p3 − r_p4)],    (10)

where the average is calculated by randomly sampling each phone representation from D.

To quantify the uncertainty of the similarity estimate, we use bootstrapping, similar to Choi et al. (2024b). We randomly sample 1000 phones each with replacement and compute the averaged cosine similarity. In our preliminary experiments, we found that each bootstrap estimate is stable, resulting in very small across-replicate variance. As such, we construct the 99% confidence interval (CI) using 10 replicates, to avoid excessive computation.

3.1.3 Comparing cosine similarities

We compare the above cosine similarities with two baselines. The same-phone baseline provides an upper bound by measuring the similarity between the same phone p1 drawn from different utterances:

  cos+(p) = E[cos(r_p1, r′_p1)],    (11)

where r′_p1 is another random sample of p1 from D. Similarly, the different-phone baseline measures the similarity between p1 and a random phone that is not p1:

  cos−(p) = E[cos(r_p1, r_not p1)].    (12)

For a well-structured representation, we expect the similarities to satisfy the ordering:

  cos−(p) < cos(p) < cos+(p).    (13)

We assess whether this ordering holds by comparing the confidence intervals (CIs) of the estimates, verifying that the upper CI of the left term is below the lower CI of the right term. Equation (13) is analogous to an ABX test (Chaabouni et al., 2017), where the target representation is expected to be more similar to another instance of the same phone (eq. (11)) than to its approximation (eq. (10)), and more similar to its approximation (eq. (10)) than to representations of other phones (eq. (12)).

To summarize the behavior over all quadruplets, we define the success rate, the proportion of quadruplets whose similarity scores satisfy the ordering in eq. (13) such that phonological analogies hold:

  S(Q) = (1/|Q|) Σ_{p ∈ Q} 1[cos−(p) < cos(p) < cos+(p)],    (14)

where 1 denotes the indicator function, returning 1 if the condition holds and 0 otherwise. We additionally evaluate performance using an alternative metric, described in §A.1.
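As an illustration of eqs. (10)–(14), the sketch below computes bootstrapped similarities and the success criterion for one quadruplet. It is a simplified reading of the procedure above, not the released evaluation code: D is assumed to be a dict mapping each phone to a list of its pooled segment vectors, and the sample counts mirror the settings described in §3.1.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def cos(a, b):
    """Row-wise cosine similarity between two (n, F) arrays."""
    return np.sum(a * b, axis=-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def bootstrap_ci(D, phones, combine, n_sample=1000, n_replicates=10):
    """99% CI of the mean cosine similarity, resampling segment vectors with replacement."""
    replicate_means = []
    for _ in range(n_replicates):
        draws = [np.stack([D[p][i] for i in rng.integers(len(D[p]), size=n_sample)]) for p in phones]
        replicate_means.append(cos(draws[0], combine(*draws[1:])).mean())
    return np.quantile(replicate_means, 0.005), np.quantile(replicate_means, 0.995)

def quadruplet_success(D, p1, p2, p3, p4):
    other = rng.choice([p for p in D if p != p1])  # a random phone that is not p1
    lo_a, hi_a = bootstrap_ci(D, (p1, p2, p3, p4), lambda r2, r3, r4: r2 + r3 - r4)  # eq. (10)
    lo_s, hi_s = bootstrap_ci(D, (p1, p1), lambda r: r)                              # eq. (11)
    lo_d, hi_d = bootstrap_ci(D, (p1, other), lambda r: r)                           # eq. (12)
    return hi_d < lo_a and hi_a < lo_s  # ordering of eq. (13), checked via non-overlapping CIs

# Success rate over all quadruplets, as in eq. (14):
# S = np.mean([quadruplet_success(D, *q) for q in quadruplets])
```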
3.1.4 Speech representations

We use spectral representations as baselines. We employ log mel spectrograms (MelSpec) and mel-frequency cepstral coefficients (MFCC), using the default parameters of librosa (McFee et al., 2015). We compare these baselines to three monolingual S3Ms trained on English: wav2vec 2.0 (Baevski et al., 2020), HuBERT (Hsu et al., 2021), and WavLM (Chen et al., 2022), using the LARGE configuration. We extract 25 layerwise representations per model. We defer additional details to §A.2.

Given a speech utterance x ∈ R^T of length T, the speech representation is a matrix R = f(x) ∈ R^{T′×F}, where the temporal dimension (number of frames) is reduced to T′ ≃ T/s with the model's stride size s (the evaluated S3Ms use s = 320 for 16 kHz input audio). Our goal is to extract a single vector r ∈ R^F for each phone segment, given its start and end times (t_s, t_e). Following Pasad et al. (2021, 2023), we conduct average pooling after applying feature slicing. §B.2 presents additional comparisons of different slicing methods.

3.2 Results

3.2.1 S3Ms vs. Spectral representations

Figure 2 compares the success rates of S3Ms and spectral representations on phonological analogies using TIMIT. The last layer of HuBERT (94%) and WavLM (92%), as well as the middle layer of wav2vec 2.0 (61%), substantially outperform the spectral features MFCC (19%) and MelSpec (0%).

Figure 2: Comparing S3Ms with spectral representations on TIMIT (top) and VoxAngeles (bottom). Each panel plots success rate against layer index for wav2vec 2.0, HuBERT, and WavLM.

Each S3M exhibits distinct layerwise behavior: wav2vec 2.0 peaks in the middle layers, whereas HuBERT and WavLM peak in the last layer. This behavior is consistent with prior observations that measure layerwise phonetic information through probing (Pasad et al., 2023; Choi et al., 2025). Our results extend these findings by showing that S3Ms exhibit phonological compositionality.

Further, we observe that a greater number of analogies hold in the middle or final layers compared to layer index 0. We hypothesize that the need for deeper layers suggests that S3Ms benefit from increased contextualization when forming abstract phonological vectors. We explore this hypothesis further in §3.2.3 and §B.2. Additionally, we compare phone recognizers fine-tuned from S3Ms to assess the impact of the phone recognition task on the phonological analogies (§B.4). We also find that alternative evaluation metrics can yield layerwise trends different from those in Figure 2 (§B.1).

3.2.2 Do S3Ms generalize to unseen phones?

We examine whether phonological analogies hold for phones from unseen languages using the VoxAngeles dataset (Figure 2). Of the 468 analogies, 316 (68%) contain at least one phone that does not exist in the English (TIMIT) phone set.

Consistent with the trends observed in §3.2.1, S3Ms continue to achieve higher success rates than spectral representations. In particular, WavLM, HuBERT, and wav2vec 2.0 achieve success rates of 93%, 45%, and 39%, respectively, compared to 19% for MFCC and 0% for MelSpec. This indicates that English-only S3Ms capture phonological structure beyond English-specific phones.

Figure 3: Comparing consonant-only and vowel-only quadruplets on TIMIT (top) and VoxAngeles (bottom) for WavLM. The number within parentheses denotes the number of quadruplets. We exclude cases where a quadruplet contains both consonants and vowels.
3.2.3 Vowels vs. Consonants

We analyze WavLM representations, which achieve the highest overall success rates (Figure 2). We observe three prominent peaks in success rate: (1) a first intermediate peak between layers 0–10, (2) a second intermediate peak between layers 10–20, and (3) the highest peak at the final layer.

To further investigate the origin of the different peaks, we separate phonological analogies into vowels and consonants. As shown in Figure 3, vowels are associated with the first intermediate peak across both datasets. In contrast, consonants exhibit more complex behavior: in TIMIT, consonants contribute to both intermediate peaks, whereas in VoxAngeles they are associated with the second intermediate peak. Nevertheless, both vowels and consonants peak at the final layer.

We suggest that these differences may stem from the distinct acoustic-temporal properties of vowels and consonants. Vowel cues tend to be temporally localized, whereas consonantal cues are often distributed across surrounding segments. (For example, the aspirated plosive [pʰ] in the word apply can be inferred from multiple cues spanning a broader temporal context, including formant transitions in the preceding vowel, burst energy at release, subsequent aperiodic noise, and partial devoicing of the following [l].) Given that deeper layers are more likely to leverage broader contextual information, we speculate that phonological features requiring less temporal context tend to peak earlier in the network (e.g., vowels), and vice versa (e.g., consonants).

In summary, these findings indicate that the strong performance of S3Ms is closely tied to their ability to leverage contextual information. This conclusion is further supported by experiments that limit the temporal window size (§B.2). Moreover, the results suggest that different temporal complexities are preferentially contextualized at different layers, while the final layer gathers them into a unified representation.

We additionally tested other factors beyond the consonant-vowel distinction that may dictate such layerwise trends. However, we found that neither individual phonological features (§B.5) nor phonological distances between analogies (§B.6) substantially influence these layerwise patterns, leaving the underlying causes for future investigation.

4 Experiment 2: Scale of Phonological Vectors

In this section, we test the second hypothesis: whether the scale λ of the phonological vectors (eq. (6)) correlates with acoustic measurements associated with phonological features, by training a vocoder to invert S3Ms.

4.1 Method

4.1.1 Modifying representations through phonological vectors

We define each phonological vector v using the PanPhon features h from §3.1.1. We define the phonological vector as the difference between the mean representations of phones with and without feature i (not necessarily minimal pairs):

  v_i = E_{h[i]=+1}[r] − E_{h[i]=−1}[r].    (15)

Motivated by the analysis of §3.2.3, we compute consonants and vowels separately, using consonant-derived vectors for consonants, and vice versa. For example, the voicing vector is obtained by subtracting the averaged representation r of all unvoiced consonants (h[voi.] = −1) from that of all voiced consonants (h[voi.] = +1). We additionally analyze the sample efficiency of eq. (15) (§B.9) and compare with single-phone-pair constructions (§B.8).
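As an illustration of eq. (15), the following minimal sketch (not from the paper's codebase) derives a feature vector from pooled segment representations; segments is assumed to be a list of (phone, vector) pairs, and feature_value and is_consonant are hypothetical helpers returning the PanPhon value of feature i and the consonant/vowel class of a phone.

```python
import numpy as np

def phonological_vector(segments, feature_value, is_consonant=None, consonants_only=True):
    """Eq. (15): mean representation of [+feature] phones minus mean of [-feature] phones.

    segments: list of (phone, np.ndarray) pairs, one pooled vector per phone segment.
    feature_value: callable mapping a phone to +1, 0, or -1 for the chosen feature.
    is_consonant: optional callable used to restrict the pool to consonants (or vowels).
    """
    plus, minus = [], []
    for phone, r in segments:
        if consonants_only and is_consonant is not None and not is_consonant(phone):
            continue
        value = feature_value(phone)
        if value == +1:
            plus.append(r)
        elif value == -1:
            minus.append(r)
        # value == 0 (feature not applicable) is skipped
    return np.mean(plus, axis=0) - np.mean(minus, axis=0)

# Example with hypothetical helpers: voicing vector estimated from consonant segments only.
# v_voi = phonological_vector(segments, lambda p: panphon_value(p, "voi"), is_consonant=is_consonant)
```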
As mentioned in eq. (6), we apply the vector v to the frames corresponding to the target phone p. Given its start and end frame indices (t′_s, t′_e) (§3.1.4), the modified representation R̃ is defined as:

  R̃_t = R_t + λv   if t′_s ≤ t < t′_e,
  R̃_t = R_t        otherwise,    (16)

where the scaling factor λ ∈ R controls the strength of the modification and R_t denotes the representation at timestep t. Finally, using the vocoder model f⁻¹, we reclaim the expected speech through resynthesis: x̃ = f⁻¹(R̃). We use the final-layer representations from WavLM, as they have been shown to be effective for both TIMIT and VoxAngeles (§3.2).

4.1.2 Analyzing modified representations through acoustic measurements

To assess whether the scaling factor λ reflects the degree of realization of the associated phonological feature, we extract acoustic measurements from the modified representation R̃_t. Specifically, we use the vocoder f⁻¹ to resynthesize speech x̃ from R̃, and then compute the acoustic measurements for the targeted phonological feature. These measurements quantify how the degree of feature realization in the resynthesized speech varies with respect to λ. For example, increasing λ for the voicing vector is expected to yield a higher degree of voicing in the resynthesized target phone.

To quantify these effects, we compare acoustic measurements before and after modifying representations with scaled phonological vectors. We assess the relationship between the scale λ and the change in acoustic measurements Δ using Spearman's rank correlation coefficient (ρ).

Table 1: Summary of phonological features, their associated acoustic measurements, and expected correlation signs, denoted as + (positive) or − (negative). Five types of acoustic measurements are used: first formant (F1), second formant (F2), first-formant bandwidth (F1BW), harmonics-to-noise ratio (HNR), and center of gravity (COG). Further details are provided in §A.3.

  Phonological feature:      High  Low  Back  Round  Nasal  Sonorant  Strident  Voice
  Acoustic measurement:      F1    F1   F2    F2     F1BW   HNR       COG       COG
  Expected correlation sign: −     +    −     −      −      +         +         −

Figure 4: Comparison between the phonological vector scale λ and the acoustic measurements (§A.3) on TIMIT. ρ denotes Spearman's rank correlation coefficient. Blue and orange plots indicate vowels and consonants, respectively. The empirically observed correlation signs match the theoretical expectations shown in Table 1. Further, the plots demonstrate the linearity of phonological vectors, resulting in monotonic (but not necessarily linear) changes in acoustic measurements.
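To make eq. (16) and the measurement loop concrete, here is a minimal sketch under stated assumptions: wavlm_encode and vocos_decode are hypothetical wrappers around the final-layer WavLM encoder and the trained Vocos-style vocoder, and measure is a hypothetical acoustic-measurement function (e.g., the COG or formant extraction described in §A.3).

```python
import numpy as np
from scipy.stats import spearmanr

def modify(R: np.ndarray, v: np.ndarray, lam: float, t_start: int, t_end: int) -> np.ndarray:
    """Eq. (16): add the scaled phonological vector to the frames of the target phone."""
    R_mod = R.copy()
    R_mod[t_start:t_end] += lam * v
    return R_mod

def lambda_sweep(wav, segment, v, wavlm_encode, vocos_decode, measure, rng, n=100):
    """Sample lambda ~ U(-5, 5), resynthesize, and correlate lambda with the acoustic change."""
    R = wavlm_encode(wav)                         # R = f(x), final-layer representation (T' x F)
    t_start, t_end = segment                      # frame indices of the target phone
    baseline = measure(vocos_decode(R), segment)  # measurement on the unmodified resynthesis
    lams, deltas = [], []
    for _ in range(n):
        lam = rng.uniform(-5.0, 5.0)
        x_tilde = vocos_decode(modify(R, v, lam, t_start, t_end))   # x~ = f^-1(R~)
        lams.append(lam)
        deltas.append(measure(x_tilde, segment) - baseline)         # change in the measurement
    rho, _ = spearmanr(lams, deltas)              # Spearman's rank correlation (rho)
    return rho
```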
4.1.3 Vocoder training

We train two vocoder models based on the Vocos vocoder (Siuzdak, 2024): an English model using LibriTTS (Zen et al., 2019), and a multilingual model using FLEURS-R (Ma et al., 2024). Vocos mitigates synthesis artifacts commonly introduced by aggressive upsampling and demonstrates robustness to out-of-distribution inputs (Siuzdak, 2024), making it well suited for analyzing S3M representations. Details of vocoder training can be found in §A.4. We quantitatively evaluate resynthesis quality using acoustic measurements in §B.11.

4.1.4 Phonological vectors

We test eight phonological vectors with well-established corresponding acoustic measurements: high, low, back, and round for vowels, and nasal, sonorant, strident, and voice for consonants. We summarize the acoustic measurements in §A.3 and the expected correlation signs in Table 1. We additionally analyze the non-orthogonal relationships between these phonological vectors in §B.9.

To estimate the phonological vector v, we use the TIMIT train split and a fixed, randomly selected subset of languages from VoxAngeles. For each vector, we modify and resynthesize 3000 utterances drawn from the remaining data, i.e., the TIMIT test split and the remaining VoxAngeles languages. For each utterance, we sample λ ∈ U(−5, 5) from the uniform distribution and a phone segment (with replacement), modify its representation using λv, and resynthesize using the vocoder f⁻¹.

Figure 5: Applying the round vector to the front vowel [i]; English has no front rounded vowel. Orange and blue arrows denote F2 and F3, respectively, which both decrease for λ > 0.

Figure 6: Applying the voicing vector to the phone [b]. Orange arrows denote the onset of voicing in the subsequent vowel. For increasing values of λ, the voicing onset moves earlier, extending into the closure segment for large values of λ.

Figure 7: Applying the strident vector to the phone [b]. Orange arrows point to the burst, which increasing λ removes. Blue arrows denote the increasing energy around 4–8 kHz.

Figure 8: Applying the nasal vector to the phone [b]. Orange arrows point to the burst, which increasing λ removes. Blue arrows point to the low-frequency murmur introduced with increasing nasalization.
4.2 Results

4.2.1 Quantitative analyses

Figure 4 compares the acoustic measurements of the original and the modified-resynthesized speech on TIMIT. Across all features, the correlation signs observed in our experiments exactly match the theoretically expected signs (Table 1). Further, we observe consistent and monotonic relationships between the vector scale λ and the resulting acoustic measurements. Our results confirm that the phonological vectors behave in accordance with their intended interpretation as linear directions in the learned representation space, inducing monotonic but not necessarily linear changes in acoustic measurements. Figure 24 in §B.10 further shows that the same trends hold for VoxAngeles, demonstrating generalization to phones unseen during training.

Further, we find that the effects of the phonological vectors are continuous rather than binary. For example, increasing the scale of the voicing vector does not abruptly toggle voicing on or off. Instead, it produces smooth shifts in COG, reflecting a graded change in the degree of voicing. This suggests that S3M representations encode phonological features not merely as categorical contrasts, but as continuous directions. This property enables fine-grained control over acoustic variation along individual phonological dimensions.

We also observe robust extrapolation well beyond the interpolation range |λ| ≤ 1: |λ| > 1 still yields acoustically interpretable outputs, further supporting the linear structure of phonological vectors within the representation space. There were three exceptions for extrapolation. For sonorance, the curve is effectively saturated for already-sonorant segments, reflecting the fact that sonorants cannot easily be made "more sonorant." Similarly, voiced consonants and non-stridents also show comparable saturation effects.

4.2.2 Qualitative analyses

We complement the quantitative results with a qualitative inspection of spectrograms. For each phonological feature, we resynthesize audio using λ = (−5, −2, −1, 0, 1, 2, 5). We visualize the modified phone with 500 ms of context on both sides to show both the local effect and potential coarticulation (Figures 5 to 8).

The rounding vector is applied to the high front unrounded vowel [i] in Figure 5. For λ > 0, all formants are lowered together, consistent with an acoustic hallmark of lip rounding. Note that English has no front rounded vowels, demonstrating that the rounding vector generalizes to unseen phones.

The voicing vector is applied to the voiced bilabial plosive [b] in Figure 6. For λ < 0, the onset time of voicing after the plosive is delayed. For λ > 0, the model decreases the voice onset time (VOT). For large values, the voicing of the subsequent vowel extends into the closure of the plosive, exhibiting negative VOT.

The strident vector is applied to [b] in Figure 7. For λ > 0, frication above 4 kHz increasingly emerges, matching the spectral signature of strident fricatives. Notably, it also removes the burst characteristics of the plosive as stridency increases. This indicates that S3M representations encode not only static spectral envelopes but also internal temporal structure (burst vs. frication), and modify them coherently along the phonological dimension.

The nasal vector is applied to [b] in Figure 8.
Increasing λ introduces nasal acoustic cues: weakening of the burst and introduction of a low-frequency murmur. As with stridents, these modifications affect both the temporal and spectral structure associated with the underlying manner of articulation, not merely its coarse spectral profile.

Taken together, our analyses show that phonological vectors induce phonologically coherent changes in speech representations. These effects follow expected trajectories for individual phonological features, vary continuously rather than categorically, and in some cases extrapolate in linguistically interpretable ways. Moreover, the vectors modulate not only spectral envelopes but also temporal cues, indicating that S3M representations encode rich internal structure. These findings further connect S3M representations to scalar and multi-valued features in phonological theory (see Gnanadesikan (1997) for a survey).

5 Related works

Word analogies. Mikolov et al. (2013b) demonstrated linear analogies in word embeddings, showing syntactic and semantic relations through vector arithmetic. Kim and de Marneffe (2013) extended this to continuous semantic scales for adjectives, which is analogous to scaling phonological vectors with λ. Also, Pennington et al. (2014) use analogical tasks to compare embeddings, and Ethayarajh et al. (2019) provide a mathematical explanation for linear analogies, motivating our work on phonological analogies. Additionally, Levy and Goldberg (2014) raised concerns about word analogy evaluation, which motivated Fournier et al. (2020) to propose alternative methods based on comparing relational directions rather than representations directly.

Linear representation hypothesis (LRH). The LRH suggests that human-interpretable features are linearly represented within models' hidden representations (Elhage et al., 2022; Park et al., 2024; Modell et al., 2025), theoretically supporting the existence of phonological vectors. The LRH is also closely related to steering vectors (Subramani et al., 2022; Turner et al., 2023), which control model behavior by adding interpretable vectors to model representations, motivating our controllable speech synthesis.

Speech model interpretability. Prior work mainly focused on what information is encoded, i.e., spectral (Choi and Yeo, 2022), phonetic (Choi et al., 2024a), articulatory (Cho et al., 2023, 2024a), syllabic (Baade et al., 2025; Cho et al., 2025), lexical (Peng and Harwath, 2022; Pasad et al., 2024), syntactic (Shen et al., 2023), and semantic information (Pasad et al., 2024). More closely related to our work, several studies examined how information is encoded: layerwise analyses revealed a linguistic hierarchy (Pasad et al., 2021, 2023), representations store information within similarities (Choi and Yeo, 2022; Choi et al., 2024b), and there is hierarchy among similarities (Abdullah et al., 2023; Choi et al., 2025).

Phonological analogies. For S3Ms, Gauthier et al. (2025) showed that morphological inflection induces linear geometry, and Nakamura et al. (2025) identified certain phonological axes within S3Ms. Li et al. (2021) also analyzed phone recognition models and identified voicing and aspiration vectors. Chaabouni et al. (2017) used phonological analogies and ABX tasks as evaluation tools for multimodal speech representations.
Others have also leveraged phonological analogies in text settings: Silfverberg et al. (2018) showed that phonemic text embeddings can learn phonological vectors without supervision, and Zouhar et al. (2024) used phonological analogies to evaluate them. In contrast, our work provides a large-scale cross-lingual analysis on S3Ms and further explores the scale of phonological vectors.

Controllable speech synthesis approaches often adopt interpretable features, including phonetic posteriorgrams (Zhao et al., 2019; Morrison et al., 2024), articulatory features (Cho et al., 2024b; Krug et al., 2025), or phonological features (Staib et al., 2020; Tånnander et al., 2024), to enable interpretable control. While they rely on explicitly designed representations, our phonological vectors are derived from self-supervision. While we do not directly compare synthesis performance, we expect our emergent phonological vectors from S3Ms can be leveraged in future work on such applications.

6 Conclusion

We show that S3Ms, trained only on speech without phonological supervision, learn linearly composable and scalable phonological vectors. These findings advance both speech processing and linguistics by clarifying the internal structure of S3M representations and refining our understanding of phonological features. In speech processing, our findings enable intuitive interpretations of S3M representations and fine-grained control of speech synthesis along phonological dimensions. In linguistics, they provide empirical evidence that phonological features can emerge from acoustic regularities and motivate a view of phonological features as continuous rather than strictly binary.

Limitations

This work explored only a limited region of the space of possibilities. Only a small number of S3Ms were investigated. Different models behaved differently, and it is not possible to isolate the causes of these differences on the basis of the results reported here. Although there can be multiple possible systems of articulatory and acoustic features, our studies only investigated the feature system assumed by PanPhon. This makes it difficult to draw firm conclusions about whether what is important, in extracting phonological vectors, is identifying a coherent feature that minimally delineates a phonological natural class or simply identifying a consistent phonetic difference. Finally, synthesis results are influenced not only by the S3Ms but also by the vocoder. Since we evaluate only a single vocoder, some observed behaviors may reflect vocoder-specific characteristics rather than properties of the S3Ms alone.

Ethics statement

Our work focuses on phonological understanding of self-supervised speech representations through the lens of vector arithmetic. All of our experiments are conducted using publicly available datasets (§2), which were collected and released under licenses appropriate for research use. We do not collect new data, nor do we include personally identifiable information beyond what is already present in these datasets.

While our speech synthesis experiments (§4) are intended solely for scientific analysis, they may have broader societal implications if misused. We do not evaluate or claim applicability to the generation of misleading or deceptive content. We release code, demos, and models for reproducibility and research purposes only, and we encourage future work to consider appropriate safeguards when applying similar techniques in downstream or user-facing systems.
Use of AI assistants

AI assistants were used in the preparation of this manuscript. Specifically, they were employed primarily for code auto-completion, minor text editing, and grammar polishing. Nevertheless, all scientific content, code implementations, results, and analyses were conceived, verified, and finalized by the authors.

References

Badr M Abdullah, Mohammed Maqsood Shaik, Bernd Möbius, and Dietrich Klakow. 2023. An information-theoretic analysis of self-supervised discrete representations of speech. In Interspeech.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common Voice: A massively-multilingual speech corpus. In LREC.

Alan Baade, Puyuan Peng, and David Harwath. 2025. SyllableLM: Learning coarse semantic units for speech language models. In ICLR.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS.

Christina Jean Bjorndahl. 2018. A Story of /v/: Voiced Spirants in the Obstruent-Sonorant Divides. Cornell University.

Paul Boersma and David Weenink. 2025. Praat: doing phonetics by computer. https://www.fon.hum.uva.nl/praat/. Version 6.4 [Computer program].

Rahma Chaabouni, Ewan Dunbar, Neil Zeghidour, and Emmanuel Dupoux. 2017. Learning weakly supervised multimodal phoneme embeddings. In Interspeech.

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Interspeech.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, and 1 others. 2022. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16.

Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan Black, and Gopala Anumanchipalli. 2025. Sylber: Syllabic embedding representation of speech from raw audio. In ICLR.

Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, and Gopala K Anumanchipalli. 2024a. Self-supervised models of speech infer universal articulatory kinematics. In ICASSP.

Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, and Gopala K Anumanchipalli. 2023. Evidence of vocal tract articulation in self-supervised learning of speech. In ICASSP.

Cheol Jun Cho, Peter Wu, Tejas S Prabhune, Dhruv Agarwal, and Gopala K Anumanchipalli. 2024b. Coding speech through vocal tract kinematics. IEEE Journal of Selected Topics in Signal Processing, 18.

Eleanor Chodroff, Blaž Pažon, An Baker, and Steven Moran. 2024. Phonetic segmentation of the UCLA Phonetics Lab Archive. In LREC-COLING.

Kwanghee Choi, Jee-weon Jung, and Shinji Watanabe. 2024a. Understanding probe behaviors through variational bounds of mutual information. In ICASSP.

Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, and Shinji Watanabe. 2024b. Self-supervised speech representations are more phonetic than semantic. In Interspeech.

Kwanghee Choi and Eun Jung Yeo. 2022. Opening the black box of wav2vec feature encoder. arXiv preprint arXiv:2210.15386.
Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, and David R Mortensen. 2025. Leveraging allophony in self-supervised speech models for atypical pronunciation assessment. In NAACL.

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised cross-lingual representation learning for speech recognition. In Interspeech.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, and 1 others. 2022. Toy models of superposition. arXiv preprint.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In EMNLP.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Towards understanding linear word analogies. In ACL.

Louis Fournier, Emmanuel Dupoux, and Ewan Dunbar. 2020. Analogies minus analogy test: measuring regularities in word embeddings. In CoNLL.

Mark JF Gales, Kate M Knill, Anton Ragni, and Shakti P Rath. 2014. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED. In 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014).

John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, and Nancy L. Dahlgren. 1993. DARPA TIMIT: Acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1.

Jon Gauthier, Canaan Breiss, Matthew K Leonard, and Edward F Chang. 2025. Emergent morphophonological representations in self-supervised speech models. In EMNLP.

Amalia Elisabeth Gnanadesikan. 1997. Phonology with ternary scales. University of Massachusetts Amherst.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29.

Yannick Jadoul, Bill Thompson, and Bart De Boer. 2018. Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71.

Roman Jakobson, C Gunnar Fant, and Morris Halle. 1951. Preliminaries to speech analysis: The distinctive features and their correlates.

Jacob Kahn, Morgane Riviere, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, and 1 others. 2020. Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP.

Joo-Kyung Kim and Marie-Catherine de Marneffe. 2013. Deriving adjectival scales from continuous space word representations. In EMNLP.

Masahiko Komatsu, Shinichi Tokuma, Won Tokuma, and Takayuki Arai. 2002. Multi-dimensional analysis of sonority: perception, acoustics, and phonology. In Interspeech.

Paul Konstantin Krug, Christoph Wagner, Peter Birkholz, and Timo Stich. 2025. Precisely controllable neural speech synthesis. In ICASSP.

Peter Ladefoged. 1996. Elements of acoustic phonetics. University of Chicago Press.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In CoNLL.

Xinjian Li, Juncheng Li, Florian Metze, and Alan W Black. 2021. Hierarchical phone recognition with compositional phonetics. In Interspeech.
Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, and Michiel Bacchiani. 2024. FLEURS-R: A restored multilingual speech corpus for generation tasks. In Interspeech.

Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In SciPy.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In NAACL.

Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. 2025. The origins of representation manifolds in large language models. arXiv preprint arXiv:2505.18235.

Max Morrison, Cameron Churchwell, Nathan Pruyne, and Bryan Pardo. 2024. Fine-grained and interpretable neural speech editing. In Interspeech.

David R. Mortensen, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, and Lori Levin. 2016. PanPhon: A resource for mapping IPA segments to articulatory feature vectors. In COLING.

Tomohiko Nakamura, Kwanghee Choi, Keigo Hojo, Yoshiaki Bando, Satoru Fukayama, and Shinji Watanabe. 2025. Discrete speech unit extraction via independent component analysis. In International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE.

Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. The linear representation hypothesis and the geometry of large language models. In ICML.

Ankita Pasad, Chung-Ming Chien, Shane Settle, and Karen Livescu. 2024. What do self-supervised speech models know about words? Transactions of the Association for Computational Linguistics, 12.

Ankita Pasad, Ju-Chieh Chou, and Karen Livescu. 2021. Layer-wise analysis of a self-supervised speech representation model. In ASRU.

Ankita Pasad, Bowen Shi, and Karen Livescu. 2023. Comparative layer-wise analysis of self-supervised speech models. In ICASSP.

Puyuan Peng and David Harwath. 2022. Word discovery in visually grounded, self-supervised speech models. In Interspeech.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A large-scale multilingual dataset for speech research. In Interspeech.

Tarun Pruthi and Carol Y Espy-Wilson. 2007. Acoustic parameters for the automatic detection of vowel nasalization. In Interspeech.

Gaofei Shen, Afra Alishahi, Arianna Bisazza, and Grzegorz Chrupała. 2023. Wave to syntax: Probing spoken language models for syntax. In Interspeech.

Miikka Silfverberg, Lingshuang Jack Mao, and Mans Hulden. 2018. Sound analogies with phoneme embeddings. In Proceedings of the Society for Computation in Linguistics (SCiL) 2018.

Hubert Siuzdak. 2024. Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis. In ICLR.

Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S Ram Mohan, Lorenzo Foglianti, Raphael Lenain, and Jiameng Gao. 2020. Phonological features for 0-shot multilingual speech synthesis. In Interspeech.

Peter Strevens. 1960. Spectra of fricative noise in human speech. Language and Speech, 3.

Nishant Subramani, Nivedita Suresh, and Matthew E Peters. 2022. Extracting latent steering vectors from pretrained language models. In ACL.
Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, and David Chiang. 2023. Universal automatic phonetic transcription into the International Phonetic Alphabet. In Interspeech.

Christina Tånnander, Shivam Mehta, Jonas Beskow, and Jens Edlund. 2024. Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In Interspeech.

Nikolai Trubetzkoy. 1949. Grundzüge der Phonologie, volume 7. Travaux du Cercle Linguistique de Prague.

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. 2023. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In ACL.

Dan Wells, Hao Tang, and Korin Richmond. 2022. Phonetic analysis of self-supervised representations of English speech. In Interspeech.

Qiantong Xu, Alexei Baevski, and Michael Auli. 2022. Simple and effective zero-shot cross-lingual phoneme recognition. In Interspeech.

Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In Interspeech.

Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna. 2019. Foreign accent conversion by synthesizing speech from phonetic posteriorgrams. In Interspeech.

Vilém Zouhar, Kalvin Chang, Chenxuan Cui, Nate B Carlson, Nathaniel Romney Robinson, Mrinmaya Sachan, and David R Mortensen. 2024. PWESuite: Phonetic word embeddings and tasks they facilitate. In LREC-COLING.

A Additional details

A.1 Item- and offset-based analogy tests

For the word analogy a : b = a′ : b′, Mikolov et al. (2013b) evaluated using item-based analogy tests, i.e., testing whether b′ ≃ a′ − a + b. Subsequent work has shown that offset-based tests, which assess whether a − a′ ≃ b − b′, provide a more robust measure of linguistic regularity (Levy and Goldberg, 2014; Fournier et al., 2020). In this work, we adopt both perspectives: we primarily use item-based tests in §3, and employ offset-based tests as a secondary analysis in §B.1.

The choice of the item-based analogy test (success rates) in §3 is motivated by the characteristics of speech, where the number of paired phones associated with a phonological feature is much smaller than the number of words within a word relation. As a result, offset-based tests on TIMIT and VoxAngeles discard many analogies (36 quadruplets compared to 236 in TIMIT, and 112 compared to 468 in VoxAngeles), which prevents further analyses such as comparisons between vowels and consonants (§3.2.3). Further, unlike in text, each phone in speech is realized through multiple utterances. Accordingly, quantities such as a′ − a + b do not correspond to a single point estimate but rather to a distribution, for which we report confidence intervals on success rates.

A.2 Details on S3Ms

In §3.2.1, we compared three widely used S3Ms. We use the LARGE configuration for all models, consisting of 7 layers of 1D CNNs followed by 24 transformer blocks, for a total of approximately 300M parameters.
We extracted representations from each of the 24 transformer layers (indices 1–24) as well as one from the CNNs (index 0), yielding 25 layerwise representations per model.

wav2vec 2.0 (Baevski et al., 2020) is trained on 60k hours of English read speech from the LibriLight audiobook dataset (Kahn et al., 2020). The model is trained using a contrastive loss with on-the-fly learnable codebooks. HuBERT (Hsu et al., 2021) is also trained on LibriLight but with a different training strategy: it uses a predictive loss, where the targets are k-means cluster assignments of either MFCCs or the previous training iteration of HuBERT-base. WavLM (Chen et al., 2022) additionally incorporates GigaSpeech (Chen et al., 2021) and VoxPopuli (Wang et al., 2021), which include both read and spontaneous speech. Its loss is similar to HuBERT's, but it includes an additional speech denoising objective.

A.3 Acoustic measurements

In §4, we compare eight phonological features and their corresponding acoustic measurements. We use Parselmouth (Jadoul et al., 2018), a Python interface to Praat (Boersma and Weenink, 2025), to compute the measurements.

Formants, i.e., resonant frequencies of the vocal tract, are used for measuring vowels (Ladefoged, 1996). The first formant (F1) is used for vowel height (high or low), such that high vowels show lower F1, and vice versa. The second formant (F2) is used for backness and roundness, such that back vowels and round vowels have lower F2.

Bandwidth of F1 (F1BW), i.e., the frequency range around F1 with significant energy, is used for measuring nasality (Pruthi and Espy-Wilson, 2007). Nasal sounds exhibit increased damping from the nasal cavities, spreading out the formant energy and yielding a broader F1 bandwidth.

Center of gravity (COG), i.e., the amplitude-weighted average frequency of the spectrum, is used for measuring voicing (Bjorndahl, 2018) and stridency (Strevens, 1960). For voiced sounds, COG decreases due to the presence of the voicing bar, whereas it increases for stridents due to frication in the upper frequency range.

Harmonics-to-noise ratio (HNR), i.e., the ratio of periodic (harmonic) energy to aperiodic (noise) energy, is used for measuring sonorance (Komatsu et al., 2002). Sonorant sounds tend to exhibit higher HNR due to their periodic structure.
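As a rough illustration of how such measurements can be obtained with Parselmouth, the sketch below extracts F1/F2, the F1 bandwidth, COG, and HNR for a given segment via Praat commands. It is a minimal example of the general approach rather than the exact analysis script, and the Praat parameter values shown are illustrative defaults.

```python
import parselmouth
from parselmouth.praat import call

def segment_measurements(wav_path: str, t_start: float, t_end: float) -> dict:
    """Acoustic measurements (§A.3) for one phone segment, computed through Parselmouth/Praat."""
    snd = parselmouth.Sound(wav_path).extract_part(from_time=t_start, to_time=t_end)
    t_mid = 0.5 * (snd.xmin + snd.xmax)

    # Formants via Burg's method (illustrative Praat defaults: 5 formants up to 5500 Hz).
    formant = call(snd, "To Formant (burg)", 0.0, 5, 5500, 0.025, 50)
    f1 = call(formant, "Get value at time", 1, t_mid, "Hertz", "Linear")         # vowel height
    f2 = call(formant, "Get value at time", 2, t_mid, "Hertz", "Linear")         # backness / rounding
    f1_bw = call(formant, "Get bandwidth at time", 1, t_mid, "Hertz", "Linear")  # nasality

    # Center of gravity of the spectrum (voicing / stridency).
    spectrum = call(snd, "To Spectrum", "yes")
    cog = call(spectrum, "Get centre of gravity", 2)

    # Harmonics-to-noise ratio (sonorance).
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)

    return {"F1": f1, "F2": f2, "F1BW": f1_bw, "COG": cog, "HNR": hnr}
```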
A.4 Details on vocoder training

To train the neural vocoder f⁻¹, we follow the overall setup of Vocos (Siuzdak, 2024), with modifications to the network architecture and training configuration. We slightly modify the network architecture dimensions so that either WavLM or MFCC representations can be used as input. For faster convergence, we increase both the batch size and the learning rate by a factor of eight relative to the original configuration. We use two speech synthesis datasets to train the vocoders: LibriTTS (Zen et al., 2019), a multi-speaker English dataset with 585 h of read speech, and FLEURS-R (Ma et al., 2024), which contains 1.3 kh of enhanced read speech in 102 languages.

Figure 9: Comparing pairing consistency score (PCS) of S3Ms and spectral representations using TIMIT (upper) and VoxAngeles (lower).

B Additional results

B.1 Layerwise analysis via offset-based test

B.1.1 Settings

For the offset-based test, we use the pairing consistency score (PCS) (Fournier et al., 2020). PCS measures the separability between offsets drawn from the same relation and offsets drawn from mismatched relations using a binary classification criterion.

We construct PCS categories by grouping phone pairs that share the same phonological feature differences. For each category, we compute representation offsets between phone pairs belonging to the same category (correct offsets). These are contrasted with mismatched offsets, which are formed by shuffling the second phone in each pair. We use averaged representations per phone to compute their offsets.

PCS is computed by treating the offset similarity as a binary classification problem, where correct offsets are labeled as positive instances and mismatched offsets as negative instances. We report the area under the receiver operating characteristic (ROC) curve as the evaluation metric. A random baseline with no relational consistency yields a PCS of 0.5.

B.1.2 Results

Figure 9 presents the layerwise PCS results for S3Ms and spectral representations, and we compare these findings with the success rate results in §3. Consistent with §3.2.1, the PCS results confirm the superiority of S3Ms over spectral representations across layers. Moreover, PCS scores generally increase in later layers relative to layer 0, suggesting increased contextualization in deeper representations. In line with §3.2.2, we also observe non-negligible performance on unseen phones, although performance remains lower than that achieved on seen phones.

Despite these overall consistencies, the detailed layerwise trends differ from those observed using success rates. Unlike the patterns reported in §3.2.1, PCS exhibits relatively similar behavior across layers, with peak performance typically occurring in intermediate layers rather than at the final layer. This suggests that the most informative representations for offset-based relational consistency may reside in the middle layers rather than the last layer, which shows a sudden peak in success rates. We leave a deeper investigation of the discrepancies between different evaluation metrics for future work.

B.2 Feature vs. Audio slicing

B.2.1 Settings

We compare two common pooling methods for obtaining a phone vector r from the representation matrix R = f(x).

Feature slicing (Pasad et al., 2021, 2023) performs temporal average pooling on a slice of the representation matrix:

  r = avgpool(f(x)[t′_s : t′_e]),    (17)

where t′_s = ⌊t_s/s⌋ and t′_e = ⌈t_e/s⌉ are the corresponding indices after temporal downsampling. As transformer layers have a global receptive field, this yields a contextualized representation, even though prior work shows that it tends to be locally dominated (Baevski et al., 2020; Hsu et al., 2021).

Audio slicing (Choi et al., 2024a,b), in contrast, slices the waveform directly:

  r = avgpool(f(x[t_s : t_e])).    (18)

This method restricts the temporal receptive field strictly to the phone segment, removing the neighboring context. Empirically, this has been shown to lead to clearer separation in cosine similarity comparisons (Choi et al., 2024b). If the model requires a minimum window size w > (t_e − t_s), we ensure the sliced audio is at least of length w by adding equal margins around the segment.
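A minimal sketch contrasting the two pooling strategies in eqs. (17) and (18): encode is a hypothetical stand-in for a layerwise S3M forward pass returning a (T′ × F) matrix, the stride follows the LARGE models discussed above, and the minimum-window constant is an assumed placeholder.

```python
import numpy as np

STRIDE = 320          # samples per frame for the evaluated S3Ms (16 kHz input)
MIN_SAMPLES = 400     # illustrative minimum input length for the CNN front-end (assumption)

def feature_slice(encode, x: np.ndarray, t_start: int, t_end: int) -> np.ndarray:
    """Eq. (17): run the full utterance, then average-pool the frames of the phone segment."""
    R = encode(x)                                             # (T', F), contextualized over the utterance
    f_start, f_end = t_start // STRIDE, -(-t_end // STRIDE)   # floor and ceil of the frame indices
    return R[f_start:f_end].mean(axis=0)

def audio_slice(encode, x: np.ndarray, t_start: int, t_end: int) -> np.ndarray:
    """Eq. (18): slice the waveform first, so the receptive field is limited to the segment."""
    segment = x[t_start:t_end]
    if len(segment) < MIN_SAMPLES:                            # pad with equal margins if too short
        margin = (MIN_SAMPLES - len(segment) + 1) // 2
        lo, hi = max(0, t_start - margin), min(len(x), t_end + margin)
        segment = x[lo:hi]
    return encode(segment).mean(axis=0)
```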
If the model requires a minimum window size w > (t_e − t_s), we ensure the sliced audio is at least of length w by adding equal margins around the segment.

B.2.2 Results

In Figures 10 and 11, we observe that feature slicing is more effective than audio slicing for S3Ms, whereas the opposite holds for spectral representations. Overall, feature-sliced WavLM last-layer representations show the strongest performance, which is why we primarily use them for all experiments. The comparison between feature and audio slicing for S3Ms further supports our conclusion in § 3.2.3: audio slicing removes contextualization by limiting the size of the temporal receptive field.

[Figure 10: Comparing success rates of S3Ms (wav2vec 2.0, HuBERT, WavLM) with spectral representations (MFCC, MelSpec) using TIMIT. We denote feature and audio slicing as feat and audio, respectively.]

[Figure 11: Comparing success rates of feature-sliced S3Ms with spectral representations using VoxAngeles. We denote feature and audio slicing as feat and audio, respectively.]

In contrast, audio slicing is beneficial for spectral representations. In particular, MFCC reaches a surprisingly high success rate of 67% and 50% on TIMIT and VoxAngeles, respectively. We suspect the difference comes from spectrum magnitude normalization: with feature slicing, the spectral representations' magnitude is normalized per utterance, whereas audio slicing normalizes the magnitude for each phone segment, likely leading to the performance improvement.

However, as we show in § B.3, MFCC representations' cosine similarities tend to collapse toward 1.0, making them substantially harder to utilize and leading to worse synthesis performance in § B.12.

[Figure 12: Comparing averaged similarities C for feature-sliced S3Ms (WavLM, wav2vec 2.0) with audio-sliced MFCC on TIMIT (upper) and VoxAngeles (lower). We calculate the 99% CI over quadruplet-wise cosine averages.]

B.3 S3Ms avoid anisotropic collapse

B.3.1 Settings

To observe the absolute similarity values, we define the averaged similarity

C(A) = (1/|A|) Σ_{p∈A} cos(p),  (19)

and estimate a confidence interval (CI) over the set of quadruplet-wise similarities.

B.3.2 Results

Figure 12 shows the averaged cosine similarities for WavLM, wav2vec 2.0, and MFCC. For the later layers of wav2vec 2.0, all similarities approach 1.0, reflecting anisotropic collapse (Ethayarajh, 2019) and resulting in the reduced success rates in Figure 2. MFCCs behave similarly, with much smaller margins compared to the last layer of WavLM.
On the other hand, WavLM does not show this collapsing behavior, likely leading to more easily usable phonological vectors for synthesis in § 4.

B.4 Impact of fine-tuning with a phone recognition task

B.4.1 Settings

We compare XLSR-53, a multilingual S3M, and its fine-tuned variants for phone recognition, Wav2vec2Phoneme and MultIPA. To compare these three models, we use the same settings as in § 3.2.1.

XLSR-53 (Conneau et al., 2021) is trained with the wav2vec 2.0 contrastive objective on multilingual datasets spanning 53 languages from CommonVoice (Ardila et al., 2020), Multilingual LibriSpeech (Pratap et al., 2020), and BABEL (Gales et al., 2014), for a total of 56k hours.

Wav2vec2Phoneme (Xu et al., 2022) fine-tunes XLSR-53 for phone recognition using a CTC loss. The model is trained on the same datasets, using automatically generated phonemic transcriptions from multilingual G2P systems.

MultIPA (Taguchi et al., 2023) is trained in a similar manner to Wav2vec2Phoneme, but uses a smaller subset consisting of seven languages.

[Figure 13: Comparing the pre-trained S3M (XLSR-53) and fine-tuned phone recognition models (Wav2vec2Phoneme and MultIPA) on TIMIT.]

B.4.2 Results

Figure 13 compares XLSR-53 with its fine-tuned phone recognizer counterparts. Both fine-tuned models improve success rates relative to XLSR-53 across nearly all layers, indicating that fine-tuning for phone recognition generally strengthens phonological structure. However, the two models differ in their layerwise behavior. Because the final layer precedes the CTC head, it is optimized to produce representations that linearly separate phones. Wav2vec2Phoneme shows a sharp increase in success rate in the deeper layers, suggesting that it develops more abstract phonological structure, whereas MultIPA's improvements are concentrated in earlier layers. This contrast may be related to differences in language coverage: as Wav2vec2Phoneme is trained on substantially more languages, it may be encouraged to learn more abstract, general phonological structure, while MultIPA's smaller language set may reduce the pressure for such abstraction. An interesting avenue for future work is to explore phone recognition as a post-training strategy for S3Ms, promoting the emergence of abstract phonological vectors.

[Figure 14: Phonological feature-wise success rates for consonants on TIMIT using WavLM representations. Per-feature panels with quadruplet counts: syl (4), son (64), cont (116), delrel (32), lat (4), nas (36), strid (112), voi (176), ant (96), cor (128), distr (152), lab (104), hi (48), back (72), total (204).]
[Figure 15: Phonological feature-wise success rates for vowels on TIMIT using WavLM representations. Per-feature panels with quadruplet counts: syl (4), ant (4), lab (4), hi (20), back (20), round (12), tense (20), total (24).]

B.5 Different phonological features do not lead to different layerwise trends

B.5.1 Settings

We observe the success rates of individual phonological features on both TIMIT and VoxAngeles. For each feature, we consider only the quadruplets for which the phonological vector in eq. (9) is nonzero, i.e., where either h_{p_1}[i] ≠ h_{p_2}[i] or h_{p_1}[i] ≠ h_{p_3}[i] for the feature i under consideration. Following § 3.2.3, we evaluate consonants and vowels separately when computing feature-wise success rates, excluding quadruplets that contain a mixture of consonants and vowels from the analysis.

B.5.2 Results

For both TIMIT (Figures 14 and 15) and VoxAngeles (Figures 16 and 17), the layerwise trends are largely consistent across phonological features, with no clear feature-specific differences. One exception is the syllabic (syl) and lateral (lat) features in Figure 14. However, their four samples are permutations of the quadruplet [l], [n], [l̩], and [n̩], where the syllabic consonants behave similarly to vowels in terms of their reduced temporal variation.

[Figure 16: Phonological feature-wise success rates for consonants on VoxAngeles using WavLM representations. Per-feature panels with quadruplet counts: son (108), cont (144), delrel (92), nas (44), strid (140), voi (236), sg (44), ant (128), cor (168), distr (216), lab (144), hi (136), lo (64), back (80), total (280).]

[Figure 17: Phonological feature-wise success rates for vowels on VoxAngeles using WavLM representations. Per-feature panels with quadruplet counts: syl (8), ant (8), lab (40), hi (84), lo (56), back (116), round (112), tense (80), long (72), total (152).]

B.6 Phonological distances do not lead to different layerwise trends

B.6.1 Settings

We observe the success rates on both TIMIT and VoxAngeles per phonological distance. The phonological distance between two phones is defined as the number of differing phonological features.
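As an illustration of this distance, a minimal sketch is given below. It operates on ternary feature vectors with entries in {−1, 0, +1}; the commented PanPhon lookup is only an assumption about that library's interface, and all names are illustrative.

import numpy as np

def phonological_distance(h1, h2):
    # Number of differing phonological features between two phones,
    # given their ternary feature vectors h1 and h2.
    return int(np.sum(np.asarray(h1) != np.asarray(h2)))

# The feature vectors themselves can be taken from PanPhon (assumed interface):
#   import panphon
#   ft = panphon.FeatureTable()
#   h_p = ft.word_to_vector_list("p", numeric=True)[0]
#   h_b = ft.word_to_vector_list("b", numeric=True)[0]
#   phonological_distance(h_p, h_b)  # small distance, e.g., 1 for a pure voicing contrast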
For each quadruplet, we compute the distances between (p_2, p_3) and (p_2, p_4), and assign the quadruplet to a distance bin based on the maximum of these two values. This ensures that each quadruplet is grouped according to its most phonologically divergent contrast.

[Figure 18: PanPhon feature distance-wise success rates on TIMIT using WavLM representations. Distance bins with quadruplet counts: 2 (48), 3 (44), 4 (52), 5 (40), 6 (36), 7 (12), 8 (4), total (236).]

[Figure 19: PanPhon feature distance-wise success rates on VoxAngeles using WavLM representations. Distance bins with quadruplet counts: 1 (28), 2 (76), 3 (100), 4 (92), 5 (60), 6 (32), 7 (24), 8 (32), 9 (16), 10 (4), 11 (4), total (468).]

B.6.2 Results

For both TIMIT (Figure 18) and VoxAngeles (Figure 19), the layerwise trends remain largely consistent across different phonological distances, with no systematic variation observed as distance increases. One exception occurs in distance bins containing only four quadruplets. However, these cases arise from permutations of a single phone set, similar to § B.5.

B.7 Measuring sample efficiency for phonological vector extraction

B.7.1 Settings

In this experiment, we empirically evaluate how many phone representations are needed to extract accurate phonological vectors. To measure the accuracy of extracted phonological vectors, we treat the vector extracted from the full training dataset in § 4.1.1 as the ground truth. We then approximate this vector using smaller subsets of the training data, and quantify approximation quality using cosine similarity.

We evaluate the eight phonological vectors defined in § 4.1.4. For each positive and negative averaged representation in eq. (15), we randomly sample N representations with replacement. We repeat the approximation 1000 times with random samples and visualize the resulting cosine similarities using histograms. We evaluate N ∈ {1, 4, 16, 64, 256} sampled representations.

B.7.2 Results

For both TIMIT and VoxAngeles (Figure 20), cosine similarity consistently improves as N increases across all phonological vectors. When N = 256, cosine similarity approaches 1.0, suggesting that a few hundred samples suffice for an accurate approximation.

B.8 Using a single phone pair for phonological vectors

B.8.1 Settings

We compare phonological vectors extracted using the full dataset with those obtained using only a single phone pair for the positive and negative averaged representations in § 4.1.1. For example, for the voicing phonological vector, we evaluate whether using only the ([b], [p]) pair yields a good approximation. We use the results from § B.7 as baselines for comparison. As before, the vector obtained from the full training dataset is treated as the ground truth.
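For concreteness, a minimal sketch of this comparison, covering both the N-sample approximation of § B.7 and the single-pair variant evaluated here, is given below. It assumes the per-phone representations are available as (samples × dim) arrays grouped into positive and negative sets for a given feature; all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def phonological_vector(pos_reps, neg_reps):
    # Difference between the averaged positive and negative representations (cf. eq. (15)).
    return pos_reps.mean(axis=0) - neg_reps.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def subsampled_similarities(pos_reps, neg_reps, n, trials=1000):
    # Approximate the full-data vector from n samples drawn with replacement,
    # and collect cosine similarities against the ground-truth vector (§ B.7).
    truth = phonological_vector(pos_reps, neg_reps)
    sims = []
    for _ in range(trials):
        pos = pos_reps[rng.integers(0, len(pos_reps), size=n)]
        neg = neg_reps[rng.integers(0, len(neg_reps), size=n)]
        sims.append(cosine(phonological_vector(pos, neg), truth))
    return np.array(sims)

The single-pair setting of this section corresponds to replacing the sampled sets with the representations of one fixed phone pair, e.g., all [b] tokens as the positive set and all [p] tokens as the negative set for the voicing vector.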
For each phonological vector, we evaluate all possible phone pairs and report the resulting distribution of cosine similarities with the ground truth via histograms.

B.8.2 Results

Figure 21 shows the distribution of cosine similarities, using § B.7 as the baseline. In TIMIT, each phone has a few hundred to more than a thousand samples. However, the cosine similarity remains around 0.5, similar to using only four samples in the subsampling experiment. These results indicate that using a fixed phone pair produces different vectors from those obtained using diverse phones, despite maintaining positive cosine similarity with the ground truth. This is likely because the two phones in a pair often differ in multiple phonological features, causing the resulting vector to capture mixed contrasts.

[Figure 20: Approximating various phonological vectors on TIMIT (upper) and VoxAngeles (lower) with different sample sizes N ∈ {1, 4, 16, 64, 256}. Each histogram depicts the distribution of cosine similarities.]

[Figure 21: Approximating various phonological vectors on TIMIT (upper) and VoxAngeles (lower) using a fixed phone pair. Each histogram depicts the distribution of cosine similarities. Figure 20 is overlaid for comparison.]

B.9 Comparing similarities between phonological vectors

B.9.1 Settings

We investigate the relationships among different phonological vectors by measuring their pairwise cosine similarities. Specifically, we compute cosine similarities between the eight phonological vectors defined in § 4.1.4.

B.9.2 Results

Figures 22 and 23 visualize the pairwise cosine similarities between phonological vectors for TIMIT and VoxAngeles. We mainly discuss cases with absolute cosine similarity greater than 0.5. Across both datasets, we observe several phonologically interpretable patterns. First of all, vowel-related and consonant-related phonological vectors exhibit near-orthogonal similarities.

For vowels, the high and low vectors show strong negative cosine similarity, reflecting their opposing acoustic properties. Additionally, for VoxAngeles, the round vector shows positive similarity with high and back, and negative similarity with low. This is acoustically consistent with rounding, which lowers both formants (F1 and F2): high vowels are characterized by low F1, low vowels by higher F1, and back vowels by lower F2.

For consonants, the nasal, sonorant, and voice vectors exhibit positive similarity, as nasals are always sonorants and sonorants are almost always voiced. Also, the strident and sonorant vectors show negative similarity, as stridents are not sonorant and vice versa.
Overall, these results demonstrate that cosine similarities between extracted phonological vectors capture meaningful phonological relationships.

[Figure 22: Cosine similarities between the different phonological vectors (hi, lo, back, round, nas, son, strid, voi) drawn from TIMIT.]

[Figure 23: Cosine similarities between the different phonological vectors drawn from VoxAngeles.]

B.10 Resynthesis works for unseen languages

B.10.1 Settings

Following § 4.2.1, we use the same settings as § 4.1, using representations from the final layer of WavLM. As mentioned in § 4.1.2, we construct train-test splits by randomly selecting languages for evaluation. For resynthesis, as described in § 4.1.3, we use the vocoder trained on FLEURS-R, a multilingual speech dataset.

B.10.2 Results

A comparison between Figure 24 and Figure 4 shows that VoxAngeles with the FLEURS-R-trained vocoder exhibits trends that are nearly identical to those observed on TIMIT with the LibriTTS-trained vocoder. These include monotonic relationships between the acoustic measurements and the phonological vector scale, along with smooth interpolation and extrapolation behavior. The close correspondence between seen and unseen language settings indicates that WavLM representations generalize effectively to unseen languages through phonological vectors.

B.11 Acoustic measurements remain stable after resynthesis

B.11.1 Settings

To assess the stability of the trained vocoder and the acoustic measurements, we compare the acoustic measurements computed from the original waveform x and from the resynthesized speech x̃ = f⁻¹(f(x)). We set λ = 0, corresponding to identity resynthesis without modifying the WavLM representations.

B.11.2 Results

We visualize the differences in acoustic measurements using density plots in Figure 25. Across all tested consonants and vowels, the distributions are tightly concentrated around zero, indicating that the resynthesis and acoustic measurement pipeline leads to reliable analyses.

B.12 Phonological vectors from MFCCs are ineffective for resynthesis

B.12.1 Settings

Following § 4.2.1 and § B.10, we adopt the same settings as § 4.1. We evaluate both audio-sliced and feature-sliced MFCC representations, as § B.2 demonstrated that audio slicing can improve the effectiveness of MFCCs.

B.12.2 Results

Figures 26 and 27 show the results for audio-sliced and feature-sliced MFCCs, respectively.
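These results, like those in § B.10 and § B.11, are obtained by scaling a phonological vector by a weight λ, adding it to the representations, and resynthesizing with the trained vocoder f⁻¹. A minimal sketch follows; the vocoder call and all names are placeholders, and exactly which frames receive the shift follows the setup of § 4.1 (here every frame of the selected segment is shifted, purely for illustration).

import numpy as np

def shift_and_resynthesize(reps, v, lam, vocoder):
    # reps: (frames × dim) representations (e.g., last-layer WavLM) of the target segment;
    # v: an extracted phonological vector; lam: the weight swept in Figures 24, 26, and 27.
    # With lam = 0 this reduces to the identity resynthesis used in § B.11.
    shifted = reps + lam * np.asarray(v)[None, :]
    return vocoder(shifted)  # placeholder for the trained vocoder f^{-1} of § A.4

Sweeping λ and re-running the acoustic measurements of § A.3 on the resynthesized audio yields the comparisons shown in Figures 24, 26, and 27.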
Across both slicing strategies, the acoustic measurements exhibit little to no correlation with the scale of the corresponding phonological vectors. The only exceptions are weak correlations observed for back vowels with the audio-sliced MFCCs and for stridents with the feature-sliced MFCCs. However, these effects are inconsistent, particularly given that no corresponding correlations are observed on TIMIT. Moreover, the observed correlations are substantially weaker than those of WavLM, as shown in Figures 4 and 24. Taken together, these results indicate that MFCC-derived representations are ineffective as phonological vectors for synthesis experiments.

[Figure 24: Comparing the phonological vector weight λ with acoustic measurements on VoxAngeles using WavLM. ρ indicates Spearman's rank correlation coefficient. Blue and orange plots indicate vowels and consonants, respectively.]

[Figure 25: Comparing the acoustic measurements of original and synthesized speech (λ = 0) on TIMIT (upper two rows) and VoxAngeles (lower two rows). We use the same x-axis ranges as the y-axis ranges in Figures 4 and 24. The differences are highly concentrated around zero, confirming the stability of resynthesis through the vocoder.]

[Figure 26: Comparing the phonological vector weight λ with acoustic measurements on TIMIT (upper two rows) and VoxAngeles (lower two rows) using phonological vectors from audio-sliced MFCC. ρ indicates Spearman's rank correlation coefficient. Blue and orange plots indicate vowels and consonants, respectively. There is little to no controllability, except for a weak correlation for the back vowel feature on VoxAngeles.]

[Figure 27: Comparing the phonological vector weight λ with acoustic measurements on TIMIT (upper two rows) and VoxAngeles (lower two rows) using phonological vectors from feature-sliced MFCC. ρ indicates Spearman's rank correlation coefficient. Blue and orange plots indicate vowels and consonants, respectively. Similar to Figure 26, there is little to no controllability.]
