END-TO-END TRAINING APPROACHES FOR DISCRIMINATIVE SEGMENTAL MODELS

Hao Tang, Weiran Wang, Kevin Gimpel, Karen Livescu
Toyota Technological Institute at Chicago
{haotang, weiranwang, kgimpel, klivescu}@ttic.edu

ABSTRACT

Recent work on discriminative segmental models has shown that they can achieve competitive speech recognition performance, using features based on deep neural frame classifiers. However, segmental models can be more challenging to train than standard frame-based approaches. While some segmental models have been successfully trained end to end, there is a lack of understanding of their training under different settings and with different losses. We investigate a model class based on recent successful approaches, consisting of a linear model that combines segmental features based on an LSTM frame classifier. Similarly to hybrid HMM-neural network models, segmental models of this class can be trained in two stages (frame classifier training followed by linear segmental model weight training), end to end (joint training of both frame classifier and linear weights), or with end-to-end fine-tuning after two-stage training. We study segmental models trained end to end with hinge loss, log loss, latent hinge loss, and marginal log loss. We consider several losses for the case where training alignments are available as well as where they are not. We find that in general, marginal log loss provides the most consistent strong performance without requiring ground-truth alignments. We also find that training with dropout is very important in obtaining good performance with end-to-end training. Finally, the best results are typically obtained by a combination of two-stage training and fine-tuning.

Index Terms— Discriminative segmental models, end-to-end training

1. INTRODUCTION

End-to-end training has proved to be successful, for example, in connectionist temporal classification (CTC) [1], encoder-decoders [2], hidden Markov model (HMM) based hybrid systems [3], deep segmental neural networks (DSNN) [4], and segmental recurrent neural networks (SRNN) [5]. All of these models have a feature encoder and an output model for generating label sequences. The feature encoder can be a recurrent or a feedforward neural network, and the output model can be a recurrent neural decoder, such as a long short-term memory network (LSTM), or a probabilistic graphical model, such as an HMM, a conditional random field (CRF), or a semi-Markov CRF. The actual definition of end-to-end training is rarely made explicit in the literature. In this work, we define end-to-end training as optimizing the encoder parameters and the output model parameters jointly. The alternative, which we refer to as two-stage training, optimizes the feature encoder and output model parameters separately in two stages.

These two families of training approaches differ in terms of annotation requirements, computational and learning efficiency, and the loss functions customarily used for each. Two-stage training typically requires frame-level labels for the first stage, but may therefore require fewer samples to learn from [6]. End-to-end training avoids the cascading errors of pipelines, but results in hard-to-optimize objectives that are sensitive to initialization. It is also possible to perform end-to-end fine-tuning after two-stage training, which has been found useful in past work [7].

In this work, we study training approaches for segmental models. Segmental models have been shown to be successful when trained end to end from scratch [5].
We focus on a particular class of segmental models, with LSTMs as encoders and linear segmental models as output models. For models trained in two stages, there is often an extra restriction on the representation of the encoded features. For example, they may be log probabilities of triphone states in HMM hybrid systems [8]. Systems trained end to end (encoder-decoders, DSNNs, and SRNNs) are not so constrained. To enable fair comparison, we use model architectures that seamlessly permit both kinds of training without requiring any change to the model parameterization. The only difference is that two-stage training leads to interpretable encoded features, but the functional architectures are identical.¹

In order to thoroughly compare two-stage and end-to-end training, we consider a variety of loss functions and training settings. When end-to-end systems were first proposed, such as CTC-LSTMs, encoder-decoders, and SRNNs, they were tied to specific loss functions, such as CTC, per-output cross entropy, and marginal log loss. However, these systems can be trained with different loss functions; e.g., encoder-decoder systems can be trained with hinge loss [9]. It is thus important to isolate the effect of training loss functions from models. For our model class, the definition of encoder and output model is completely independent of the definition of loss functions. This allows us to compare training losses while keeping everything else fixed.

Two-stage training typically uses fine-grained labels for training the first stage, such as segmentations. For some datasets, such as TIMIT, we have the luxury of using manually annotated segmentations, but for most datasets, we do not.

¹We note that though our model class is suitable for studying end-to-end systems in various aspects, using better encoders, such as SRNNs, might lead to better absolute performance.
If needed, segmentations are typically inferred by force-aligning labels to frames. For our model class, the system can be trained with or without segmentations depending on the choice of loss function.

In the following sections, we explicitly define our model class and loss functions, in particular, hinge loss and log loss for cases where we have ground-truth segmentations, and latent hinge loss and marginal log loss when we do not. We perform experiments studying two-stage and end-to-end training in different settings with different losses. On a phoneme recognition task, we show that end-to-end training from scratch with marginal log loss achieves the best result in the setting without ground-truth segmentations, while two-stage training followed by end-to-end fine-tuning with log loss achieves the best result in the setting with ground-truth segmentations. We also find that dropout is crucial for combating overfitting.

2. DISCRIMINATIVE SEGMENTAL MODELS

Speech recognition, or sequence prediction in general, can be formulated as a search problem. The search space is a set of paths, each of which is composed of segments. Each segment is associated with a weight, and in turn each path is associated with a weight. Prediction becomes finding the highest-weighted path in the search space. We formalize this below.

Let X be the input space, a set of sequences of frames, e.g., MFCCs or Mel filter bank outputs. Let L be the label set, e.g., a phone set for phoneme recognition. A segment is a tuple (s, t, y), where s is the start time, t is the end time, and y ∈ L is the label. Two segments e₁, e₂ are connected if the end time of e₁ is the same as the start time of e₂. A path is a sequence of connected segments. A path p = ((s₁, t₁, y₁), ..., (sₙ, tₙ, yₙ)) can also be seen as a label sequence y = (y₁, ..., yₙ) and a segmentation z = ((s₁, t₁), ..., (sₙ, tₙ)), or simply p = (y, z). Let E be the set of all possible segments.

A segmental model is a tuple (θ, Λ, φ_Λ), where θ ∈ R^d is a parameter vector and φ_Λ : X × E → R^d is a feature function that uses a feature encoder parameterized by the set of parameters Λ. We will give definitions of feature encoders and feature functions in later sections. With a slight abuse of notation, for a path p = (y, z), let φ_Λ(x, p) = φ_Λ(x, y, z) = Σ_{e∈p} φ_Λ(x, e). Prediction can be formulated as

    argmax_{p∈P} θ^⊤ φ_Λ(x, p) = argmax_{p∈P} Σ_{e∈p} θ^⊤ φ_Λ(x, e),    (1)

where P is the set of all paths. Though the output contains both a label sequence and a segmentation, the segmentation is often disregarded during evaluation.

Learning a segmental model amounts to finding parameters θ and Λ that minimize a specified loss function. Learning can be divided into two cases, one with access to ground-truth segmentations, and one without. When we have ground-truth segmentations, we receive a dataset S = {(x₁, y₁, z₁), ..., (x_m, y_m, z_m)} and learning aims to solve

    argmin_{θ,Λ} (1/m) Σ_{i=1}^m ℓ(θ, Λ; x_i, y_i, z_i).    (2)

When we do not have ground-truth segmentations, we have a dataset S = {(x₁, y₁), ..., (x_m, y_m)}, and learning becomes solving

    argmin_{θ,Λ} (1/m) Σ_{i=1}^m ℓ(θ, Λ; x_i, y_i).    (3)

3. LOSS FUNCTIONS

Since segmental models fall under structured prediction, any general loss function for structured prediction is applicable to segmental models. In particular, we investigate hinge loss and log loss for the case with ground-truth segmentations, and latent hinge loss and marginal log loss for the case without ground-truth segmentations. All loss definitions ℓ(θ, Λ) below are given in terms of a single training sample (x, y, z) where x ∈ X, (y, z) ∈ P.
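As an illustrative sketch (ours, not the paper's implementation; a toy random score array stands in for the learned segment weights θ^⊤φ_Λ(x, e), and all names are assumptions), the search in Eq. (1) can be carried out by dynamic programming over segment end times, with a maximum segment duration:

```python
import numpy as np

def decode(T, labels, score, max_dur):
    """Viterbi-style search over segments (s, t, y), 0 <= s < t <= T.

    `score(s, t, y)` stands in for theta^T phi_Lambda(x, (s, t, y)).
    Returns the best path as a list of (s, t, y) tuples and its weight.
    """
    best = np.full(T + 1, -np.inf)  # best[t]: weight of the best path ending at t
    best[0] = 0.0
    back = [None] * (T + 1)         # backpointer: (s, y) of the last segment
    for t in range(1, T + 1):
        for s in range(max(0, t - max_dur), t):
            if best[s] == -np.inf:
                continue
            for y in labels:
                w = best[s] + score(s, t, y)
                if w > best[t]:
                    best[t], back[t] = w, (s, y)
    path, t = [], T
    while t > 0:                    # follow backpointers to recover the path
        s, y = back[t]
        path.append((s, t, y))
        t = s
    return path[::-1], best[T]

rng = np.random.default_rng(0)
scores = rng.normal(size=(20, 21, 3))   # toy segment scores for T = 20, |L| = 3
path, w = decode(20, range(3), lambda s, t, y: scores[s, t, y], max_dur=5)
```

The recovered path is a sequence of connected segments covering the whole utterance, matching the definition of P above.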
Hinge loss is defined as

    max_{(y′,z′)∈P} [ cost((y, z), (y′, z′)) − θ^⊤ φ_Λ(x, y, z) + θ^⊤ φ_Λ(x, y′, z′) ],    (4)

where cost is a function that measures the distance between two paths. Log loss is defined as

    −log p(y, z | x),    (5)

where

    p(y, z | x) = (1/Z) exp(θ^⊤ φ_Λ(x, y, z))    (6)

and Z = Σ_{(y′,z′)∈P} exp(θ^⊤ φ_Λ(x, y′, z′)). Both hinge loss and log loss require segmentations. Hinge loss has an explicit cost function, while log loss does not. In fact, during prediction, hinge loss is always an upper bound of the cost function. Hinge loss is non-smooth due to the max operation, while log loss is smooth. Both hinge loss and log loss are convex in θ, yet non-convex in Λ if a neural network is used.

Latent hinge loss is defined as

    max_{(y′,z′)∈P} [ cost((y, z̃), (y′, z′)) − max_{z′′} θ^⊤ φ_Λ(x, y, z′′) + θ^⊤ φ_Λ(x, y′, z′) ],    (7)

where z̃ = argmax_{z′′∈Z(y)} θ^⊤ φ_Λ(x, y, z′′) and Z(y) is the set of possible segmentations of y. Marginal log loss is defined as

    −log p(y | x) = −log Σ_{z∈Z(y)} p(y, z | x).    (8)

Neither latent hinge loss nor marginal log loss requires ground-truth segmentations. During prediction, latent hinge loss is also an upper bound of the cost function. Latent hinge loss is non-smooth, while marginal log loss is smooth. Both latent hinge loss and marginal log loss are non-convex in both θ and Λ.

Hinge loss training for segmental models first appeared in [10], log loss in [11], and marginal log loss in [12]. For training first-pass segmental models, [13] is the first to use hinge loss, [11] is the first to use log loss, and [14] is the first to use marginal log loss. For training first-pass segmental models end to end, [4] is the first to use marginal log loss. Other loss functions, such as empirical Bayes risk and structured ramp loss, have been used in [15] for training segmental models.
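The smooth losses above can be computed exactly with forward (logsumexp) dynamic programs. The following sketch (ours; toy random scores stand in for θ^⊤φ_Λ, and the helper names are assumptions) computes the partition function Z over all paths and the sum over segmentations Z(y), giving the log loss (5) and marginal log loss (8). Since p(y | x) ≥ p(y, z | x) for the ground-truth z, the marginal log loss never exceeds the log loss:

```python
import numpy as np

def log_Z(T, L, score, max_dur):
    """log of the sum over all paths of exp(path weight): the partition
    function Z in the log loss.  `score(s, t, y)` plays the role of
    theta^T phi_Lambda(x, (s, t, y))."""
    alpha = np.full(T + 1, -np.inf)
    alpha[0] = 0.0
    for t in range(1, T + 1):
        for s in range(max(0, t - max_dur), t):
            for y in range(L):
                alpha[t] = np.logaddexp(alpha[t], alpha[s] + score(s, t, y))
    return alpha[T]

def log_Z_constrained(T, y_seq, score, max_dur):
    """Same sum restricted to paths with label sequence y_seq, i.e. a sum
    over the segmentations Z(y); used by the marginal log loss."""
    n = len(y_seq)
    alpha = np.full((n + 1, T + 1), -np.inf)
    alpha[0, 0] = 0.0
    for i in range(1, n + 1):
        for t in range(1, T + 1):
            for s in range(max(0, t - max_dur), t):
                alpha[i, t] = np.logaddexp(
                    alpha[i, t], alpha[i - 1, s] + score(s, t, y_seq[i - 1]))
    return alpha[n, T]

rng = np.random.default_rng(1)
S = rng.normal(size=(10, 11, 2))                  # toy scores, T = 10, |L| = 2
score = lambda s, t, y: S[s, t, y]
y_seq, z_seq = [0, 1, 0], [(0, 4), (4, 7), (7, 10)]  # a toy ground-truth path
path_w = sum(score(s, t, y) for (s, t), y in zip(z_seq, y_seq))
log_loss = log_Z(10, 2, score, 5) - path_w                                    # Eq. (5)
marginal_log_loss = log_Z(10, 2, score, 5) - log_Z_constrained(10, y_seq, score, 5)  # Eq. (8)
```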
The above loss functions can be optimized with stochastic gradient descent or its variants. We propagate gradients back through the feature function φ, allowing all parameters to be updated jointly.

4. FEATURE FUNCTIONS

Here we explicitly define the feature functions we will use in the experiments. These feature functions first appeared in [16]. We assume there is a feature encoder, for example, an LSTM, which produces h₁, ..., h_T given input x₁, ..., x_T. For any t ∈ {1, ..., T}, we project h_t to an |L|-dimensional vector and pass the resulting vector through a log-softmax layer to get k_t. In other words, k_{t,i} = Σ_j W_{ij} h_{t,j} − log Σ_ℓ exp(Σ_j W_{ℓj} h_{t,j}), where W is the projection matrix. In this case, the set of parameters Λ includes the projection matrix W and the parameters in the LSTM.

The following is a list of features; the final feature function produces a concatenation of the feature vectors produced by the individual feature functions. The average of frames over a segment is defined as

    φ_avg(x, (s, t, y)) = (1/(t−s)) Σ_{i=s}^{t−1} k_i ⊗ 1_y,    (9)

where ⊗ is the tensor product, and 1_y is an |L|-dimensional one-hot vector for the label y. The frame sample at the r-th percentile is defined as

    φ_at-r(x, (s, t, y)) = k_⌊s+rd⌋ ⊗ 1_y,    (10)

where d = t − s + 1. The frame at the left boundary is defined as

    φ_left-r(x, (s, t, y)) = k_{s−r} ⊗ 1_y,    (11)

and similarly, the frame at the right boundary is

    φ_right-r(x, (s, t, y)) = k_{t+r} ⊗ 1_y.    (12)

Additionally, we have features that do not depend on the feature encoder. The length score is defined as

    φ_len(x, (s, t, y)) = 1_d ⊗ 1_y,    (13)

where d = t − s + 1. Finally, there is a bias for each individual label:

    φ_bias(x, (s, t, y)) = 1_y.    (14)

Gradients are propagated through the vectors k₁, ..., k_T to the feature encoder.
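A minimal sketch of the feature computation (ours; the edge handling for boundary frames and the maximum-duration bucket for φ_len are assumptions not specified above, and one percentile sample r = 0.5 stands in for the full set):

```python
import numpy as np

def encode(H, W):
    """Log-softmax of projected encoder outputs: the vectors k_t.
    H is T x d_h (e.g. LSTM outputs), W is |L| x d_h."""
    A = H @ W.T
    return A - np.logaddexp.reduce(A, axis=1, keepdims=True)

def segment_features(K, s, t, y, L):
    """Concatenation of phi_avg, phi_at-0.5, phi_left-1, phi_right-1,
    phi_len, and phi_bias for one segment (s, t, y)."""
    one_hot = np.eye(L)[y]
    d = t - s + 1
    avg = K[s:t].mean(axis=0)                 # phi_avg, Eq. (9)
    mid = K[int(s + 0.5 * d)]                 # phi_at-r with r = 0.5, Eq. (10)
    left = K[max(s - 1, 0)]                   # phi_left-1, clipped at frame 0
    right = K[min(t + 1, len(K) - 1)]         # phi_right-1, clipped at T - 1
    length = np.eye(31)[min(d, 30)]           # phi_len, one-hot duration (cap 30)
    parts = [np.outer(v, one_hot).ravel() for v in (avg, mid, left, right, length)]
    return np.concatenate(parts + [one_hot])  # phi_bias last

rng = np.random.default_rng(2)
K = encode(rng.normal(size=(12, 8)), rng.normal(size=(3, 8)))  # T = 12, |L| = 3
phi = segment_features(K, 2, 6, 1, 3)
```

The tensor products with 1_y make each feature block label-specific, so a single linear θ scores every (segment, label) pair.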
Parameters of the entire segmental model, including the feature encoder, can be updated jointly.

5. EXPERIMENTS

We conduct phonetic recognition experiments on TIMIT, a 6-hour phonetically transcribed dataset. We follow the conventional setting, training models on the 3696-utterance training set and evaluating on the 192-utterance core test set. We use the rest of the 400 utterances in the test set as the development set. Following convention, we collapse the 61 phones down to 48 for training, and further collapse them to 39 phones for evaluation.

The feature encoder we use is a 3-layer bidirectional LSTM with 256 cells per layer. The outputs of the third layer are projected from 256 dimensions to 48 and passed through a log-softmax layer so that the final outputs are log probabilities. Inputs to the encoder are 39-dimensional MFCCs, normalized per dimension by subtracting the mean and dividing by the standard deviation calculated from the training set.

5.1. Two-Stage Training

Since TIMIT is phonetically transcribed, we have access to phone labels for each individual frame. We first train LSTM frame classifiers with cross-entropy loss at each frame. This LSTM will serve as our feature encoder later on, and training such an LSTM corresponds to the first stage of two-stage learning. LSTM parameters are initialized uniformly in the range [−0.1, 0.1]. Biases for forget gates are initialized to one [17], while other biases are initialized to zero. Dropout for LSTMs [18] is applied at all input layers and the last output layer with a dropout rate of 50%. We compare AdaGrad with step sizes in {0.01, 0.02, 0.04} and RMSProp with step size 0.001 and decay 0.9. Mini-batch size is always one utterance.

Table 1. Frame error rates for different encoder architectures.

                         feat        dev    test
  CNN                    MFCC+fbank  22.27  23.03
  LSTM 256x3             MFCC        22.60
  LSTM 256x3 +dropout    MFCC        21.09  21.36
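The two first-stage metrics above, per-frame cross entropy (the training loss) and frame error rate, can be computed from the classifier's log probabilities as follows (a sketch of ours with toy inputs; names are assumptions):

```python
import numpy as np

def frame_metrics(log_probs, labels):
    """Average per-frame cross entropy (the first-stage training loss)
    and frame error rate, given T x |L| log-probabilities."""
    T = len(labels)
    ce = -log_probs[np.arange(T), labels].mean()
    fer = (log_probs.argmax(axis=1) != labels).mean()
    return ce, fer

# toy check: a classifier that puts nearly all mass on the correct label
logits = np.full((5, 4), -10.0)
labels = np.array([0, 1, 2, 3, 0])
logits[np.arange(5), labels] = 10.0
log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
ce, fer = frame_metrics(log_probs, labels)   # near-zero loss, zero errors
```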
Both optimizers are run for 50 epochs. We choose the best-performing model according to the frame error rate on the development set, also known as early stopping. No gradient clipping is used during training. For comparison, following [13] we train a convolutional neural network (CNN) consisting of 5 convolutional layers followed by 3 fully connected layers. Frame classification results are shown in Table 1. We observe that the best-performing LSTM achieves a frame error rate comparable to the CNN's. With dropout, the frame error rate is further lowered.

After obtaining LSTM frame classifiers, we proceed to the second stage, training segmental models with features based on LSTM log probabilities. Segmental models are trained with the four loss functions for 50 epochs with early stopping. Overlap cost [15] is used in hinge loss and latent hinge loss. A maximum duration of 30 frames is imposed. We use the feature functions described in Section 4. No regularizer is used except early stopping. We compare AdaGrad with step sizes in {0.1, 0.2, 0.4} and RMSProp with step size 0.001 and decay 0.9.

Phonetic recognition results for hinge loss are shown in Table 2. We observe that LSTMs perform better in frame classification, but give little improvement over CNNs in phonetic recognition. Recognition results for the rest of the losses are in Table 3. Note that even though latent hinge loss and marginal log loss do not require segmentations during training, we do use ground-truth segmentations for training the frame classifier. This is not a common setting, and is done purely for comparison purposes. We observe that, except for latent hinge, the losses perform equally well, with log loss having a slight edge over the others.

Table 2. Phone error rates for segmental models trained with hinge loss using log probabilities generated from the various encoders in Table 1.

                         feat        dev   test
  CNN                    MFCC+fbank  21.4  22.5
  LSTM 256x3             MFCC        23.1
  LSTM 256x3 +dropout    MFCC        21.4  22.1

Table 3. Phone error rates for segmental models trained in two stages with different losses.

                     dev   test
  hinge              21.4  22.1
  log loss           21.2  21.9
  latent hinge       23.5  24.6
  marginal log loss  21.6  22.5

5.2. End-to-End Training with Warm Start

After two-stage training, we fine-tune the encoder and segmental model jointly to further lower the training loss. The four losses are compared with and without dropout. When dropout is used, a dropout rate of 50% is chosen to match the rate during frame classifier training. The input layers and the output layer are scaled by 0.5 when no dropout is used.

First, we initialize the models with the one trained with hinge loss above. We run AdaGrad with step size 0.001 for 10 epochs with early stopping. Results are shown in Table 4. We observe healthy reductions in phone error rates by fine-tuning the two-stage system across all loss functions. We also find that fine-tuning without dropout tends to be better than with dropout. Though fine-tuning with hinge loss leads to the most error reduction, we note that the two-stage system is trained with hinge loss. At least we are certain that the two-stage system trained with hinge loss is a decent initialization for other losses. Minimizing other losses from a model trained with hinge loss is less than ideal. We repeat the above experiments by warm-starting from a model trained with the loss function that we are going to minimize. Results are shown in Table 5. We observe significant gains for log loss and marginal log loss when initialized with the matching loss function. Similarly, the gains with dropout in these cases are smaller than without dropout.

Table 4. Phone error rates for segmental models trained end to end, initialized from the two-stage system trained with hinge loss.

                     dropout  dev   test
  hinge              0        19.4  20.7
                     0.5      20.8
  log loss           0        20.2  21.7
                     0.5      21.1
  latent hinge       0        19.3  21.0
                     0.5      20.8
  marginal log loss  0        20.7  22.2
                     0.5      20.9

Table 5. Phone error rates for segmental models trained end to end, initialized from two-stage systems trained with the corresponding loss functions.

                     dropout  dev   test
  hinge              0        19.4  20.7
                     0.5      20.8
  log loss           0        18.8  19.7
                     0.5      20.3
  latent hinge       0        20.0  21.2
                     0.5      22.1
  marginal log loss  0        19.2  20.8
                     0.5      21.0

5.3. End-to-End Training from Scratch

Next, we train the same architecture end to end from scratch. We make sure that all the models are initialized identically to the two-stage systems. The four losses are used for training with dropout rates in {0, 0.1, 0.2, 0.5}. Ground-truth segmentations are used when training with hinge loss and log loss, and are disregarded when training with latent hinge loss and marginal log loss. The optimizers we use here are SGD with step sizes in {0.1, 0.5}, momentum 0.9, and gradient clipping at norm 5; AdaGrad with step sizes in {0.01, 0.02, 0.04} and no clipping; and RMSProp with step size 0.001, decay 0.9, and no clipping. We run each optimizer for 50 epochs with early stopping. Results are shown in Table 6.

First, all optimizers above fail to minimize latent hinge loss. All of them get stuck in local optima, and fail to produce reasonable forced alignments. Even though all loss functions in end-to-end training are non-convex, latent hinge loss is more sensitive to initialization than the other losses. The second observation is that adding dropout improves performance. However, using the same dropout rate as the two-stage system results in worse performance.
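The 0.5 scaling applied when dropout is turned off is the usual test-time correction for 50% dropout: scaling by the keep probability matches the expected value of the randomly masked activations. A minimal numerical check (ours, with toy activations):

```python
import numpy as np

rng = np.random.default_rng(3)
h = rng.normal(size=8)                  # a toy activation vector
p_keep = 0.5                            # dropout rate 50% => keep probability 0.5

# training-time dropout: multiply by a Bernoulli(p_keep) mask;
# average over many masks to estimate the expected activation
masks = rng.random(size=(200000, 8)) < p_keep
train_time = (masks * h).mean(axis=0)

# test-time: no sampling, just scale by the keep probability
test_time = p_keep * h
```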
Finally, though behind the best fine-tuned model, marginal log loss with dropout 0.2 slightly edges out the other losses.

Table 6. Phone error rates for segmental models trained end to end with dropout.

                     dropout  dev   test
  hinge              0        23.1
                     0.1      22.4
                     0.2      22.3  23.7
                     0.5      28.9
  log loss           0        24.8
                     0.1      22.4
                     0.2      20.8  22.2
                     0.5      22.3
  latent hinge       failed
  marginal log loss  0        25.3
                     0.1      22.1
                     0.2      20.0  22.0
                     0.5      22.0

6. DISCUSSION

We have seen that end-to-end training initialized with a two-stage system leads to the best results. Since in end-to-end training the meaning of the intermediate representations is no longer enforced, it is unclear how far the intermediate representations deviate from the learned ones. To answer this, we measure per-frame cross entropy for the LSTM frame classifier after end-to-end training. Results are shown in Table 7.

Table 7. Average cross entropy over frames before and after end-to-end fine-tuning.

  LSTM                   train CE  dev CE  dev PER
  256x3 (best train)     0.0569    2.2395
  256x3 (best dev)       0.4179    0.9442
  256x3 +dropout         0.4595    0.7466  21.4
  256x3 +dropout +e2e    0.3864    0.6928  19.4

First, the per-frame cross entropy for the best-performing LSTM on the training set can be as low as 0.06, which shows that a 3-layer bidirectional LSTM with 256 cells per layer is able to essentially memorize the entire TIMIT dataset. However, it is severely overfitting. Early stopping and dropout help balance the cross entropies on the training set and development set. In addition, the cross entropies on both sets drop after end-to-end training. This shows that the meaning of the intermediate representations is still maintained by the LSTMs after end-to-end training.
Next, since the system trained with marginal log loss does not use the ground-truth segmentations, and since the evaluation measure (phone error rate) does not consider segmentations, we do not know whether the system is able to discover reasonable phone boundaries without supervision. We approach this question by aligning the label sequences to the acoustics and comparing the resulting segmentations against the manually annotated segmentations. The alignment quality for different tolerance values is shown in Table 8. Though the results are behind models trained specifically to align [19], the segmental model trained with marginal log loss is not supervised with any ground-truth segmentations. Limiting the maximum duration to 30 frames also affects the alignment performance.

Table 8. Forced alignment quality on the test set as the percentage of correctly positioned phone boundaries within a predefined tolerance, measured with the best-performing segmental model trained with marginal log loss.

  t ≤ 10 ms  t ≤ 20 ms  t ≤ 30 ms  t ≤ 40 ms
  64.5       86.8       94.7       96.7

Since most speech datasets do not have manually annotated segmentations, it is desirable to train without manual alignments. As we now know, the alignments produced by our system trained with marginal log loss are of good quality. Therefore, we can use the forced alignments to train a two-stage system followed by end-to-end fine-tuning. We follow the exact same procedure as in the previous two-stage experiments, training an LSTM frame classifier with the forced alignments, followed by training a segmental model with hinge loss. The frame error rate of the LSTM classifier on the development set is 21.68% against the forced alignments and 28.91% against the ground-truth segmentations.
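The tolerance-based alignment measure reported in Table 8 can be computed as below (a sketch of ours; the boundary times and the helper name are illustrative, not the paper's code):

```python
import numpy as np

def boundary_accuracy(pred, ref, tol):
    """Fraction of reference phone boundaries that have a predicted
    boundary within `tol` (same time units, e.g. milliseconds)."""
    pred = np.asarray(pred)
    return np.mean([np.abs(pred - b).min() <= tol for b in ref])

ref  = [0, 120, 250, 400, 520]    # toy manually annotated boundaries (ms)
pred = [0, 130, 245, 430, 520]    # toy forced-alignment boundaries (ms)
acc20 = boundary_accuracy(pred, ref, 20)   # one boundary off by 30 ms misses
acc40 = boundary_accuracy(pred, ref, 40)   # all boundaries within 40 ms
```

Accuracy is monotone in the tolerance, which is why the columns of Table 8 increase from left to right.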
Though the frame error rate is significantly worse than when training with ground-truth segmentations, this two-stage system achieves a phone error rate of 21.0% on the development set. We then fine-tune the entire system with hinge loss. The final system achieves an 18.6% phone error rate on the development set and 20.1% on the test set, a significant improvement over the model trained end to end with marginal log loss, while not relying on ground-truth segmentations.

In terms of training efficiency, all four losses require forward-backward-like algorithms for computing gradients. Hinge loss requires one pass over the entire search space; log loss requires two passes over the entire search space; latent hinge requires one pass over the entire search space and one over the segmentation space; and marginal log loss requires two passes over the entire search space and two passes over the segmentation space. The average number of hours per epoch spent computing gradients, excluding LSTM computations, is shown in Table 9. To put these numbers into context, feeding forward and backpropagating through the LSTMs takes 1.65 hours per epoch. The timing is done on a single 3.4 GHz four-core CPU. The number of hours is consistent with the number of passes required to compute gradients. Note that the time spent on LSTMs can be halved without incurring a performance loss by applying frame skipping [20, 21], as shown for segmental models in [22].

Table 9. Average number of hours per epoch spent computing gradients, excluding LSTM computations.

  hinge  log loss  latent hinge  marginal log loss
  0.52   1.08      0.73          2.10

7. CONCLUSION

In this work, we study end-to-end training in the context of segmental models. The model class of choice includes a 3-layer bidirectional LSTM as feature encoder and a segmental model that uses the features to produce label sequences.
This model class is suitable for studying end-to-end training due to its flexibility to be trained either in a two-stage manner or end to end. The hypothesis is that training such systems in two stages is easier than end-to-end training from scratch. On the other hand, end-to-end training can better optimize the loss function, but it might be sensitive to initialization.

Our model definition is separated from the definition of loss functions, giving us the flexibility to choose loss functions based on the training settings. We consider two common training settings, one with ground-truth segmentations and one without. Hinge loss and log loss require segmentations by definition, while latent hinge loss and marginal log loss do not.

We show that in the case where we have ground-truth segmentations, two-stage training followed by end-to-end training is significantly better than two-stage training alone (improving upon it by 10% relative) and than end-to-end training from scratch. In addition, we find that end-to-end training with marginal log loss from scratch achieves competitive results. As a byproduct, the system is able to generate high-quality forced alignments. To remove the dependency on ground-truth segmentations, we train another model on the forced alignments in two stages followed by end-to-end fine-tuning, improving upon end-to-end training from scratch by 8.6% relative. The final product is a strong system trained end to end without requiring ground-truth segmentations.

8. ACKNOWLEDGEMENTS

This research was supported by a Google faculty research award and NSF grant IIS-1433485. The opinions expressed in this work are those of the authors and do not necessarily reflect the views of the funding agency. The GPUs used for this research were donated by NVIDIA.

9. REFERENCES

[1] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.

[2] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[3] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Annual Conference of the International Speech Communication Association, 2016.

[4] Ossama Abdel-Hamid, Li Deng, Dong Yu, and Hui Jiang, "Deep segmental neural networks for speech recognition," in Annual Conference of the International Speech Communication Association, 2013, pp. 1849–1853.

[5] Liang Lu, Lingpeng Kong, Chris Dyer, Noah A. Smith, and Steve Renals, "Segmental recurrent neural networks for end-to-end speech recognition," in Annual Conference of the International Speech Communication Association, 2016.

[6] Shai Shalev-Shwartz and Amnon Shashua, "On the sample complexity of end-to-end training vs. semantic abstraction training," CoRR, vol. abs/1604.06915, 2016.

[7] Karel Veselý, Arnab Ghoshal, Lukáš Burget, and Daniel Povey, "Sequence-discriminative training of deep neural networks," in Annual Conference of the International Speech Communication Association, 2013.

[8] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 273–278.

[9] Sam Wiseman and Alexander M. Rush, "Sequence-to-sequence learning as beam-search optimization," CoRR, vol. abs/1606.02960, 2016.
[10] Shi-Xiong Zhang and Mark Gales, "Structured SVMs for automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 3, pp. 544–555, 2013.

[11] Sunita Sarawagi and William W. Cohen, "Semi-Markov conditional random fields for information extraction," in Advances in Neural Information Processing Systems, 2004, pp. 1185–1192.

[12] Geoffrey Zweig and Patrick Nguyen, "A segmental CRF approach to large vocabulary continuous speech recognition," in IEEE Workshop on Automatic Speech Recognition & Understanding, 2009, pp. 152–157.

[13] Hao Tang, Weiran Wang, Kevin Gimpel, and Karen Livescu, "Discriminative segmental cascades for feature-rich phone recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2015.

[14] Geoffrey Zweig, "Classification and recognition with direct segment models," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 4161–4164.

[15] Hao Tang, Kevin Gimpel, and Karen Livescu, "A comparison of training approaches for discriminative segmental models," in Annual Conference of the International Speech Communication Association, 2014.

[16] Yanzhang He and Eric Fosler-Lussier, "Efficient segmental conditional random fields for phone recognition," in Annual Conference of the International Speech Communication Association, 2012, pp. 1898–1901.

[17] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever, "An empirical exploration of recurrent network architectures," in International Conference on Machine Learning, 2015.

[18] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent neural network regularization," CoRR, vol. abs/1409.2329, 2014.
[19] Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer, and Dan Chazan, "A large margin algorithm for speech-to-phoneme and music-to-score alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2373–2382, 2007.

[20] Yajie Miao, Jinyu Li, Yongqiang Wang, Shixiong Zhang, and Yifan Gong, "Simplifying long short-term memory acoustic models for fast training and decoding," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.

[21] Vincent Vanhoucke, Matthieu Devin, and Georg Heigold, "Multiframe deep neural networks for acoustic modeling," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7582–7585.

[22] Hao Tang, Weiran Wang, Kevin Gimpel, and Karen Livescu, "Efficient segmental cascades for speech recognition," in Annual Conference of the International Speech Communication Association, 2016.