MLP GMM Agreement

From SpeechWiki

1 Smoothed GMMs vs MLPs and GMM accuracy
2 Smoothing MLP p(q|o)
3 Choosing the smoothing filter order
- 3.1 Based on the support log(p(o))
- 3.2 Based on the phone identity
4 Conclusions
5 Frame errors introduced as a result of segment replacement
6 how often the q in max p(o,q) frame in segment differs from max q in p(q|o)'s
7 The experiments that should be done
8 References

Smoothed GMMs vs MLPs and GMM accuracy

To compute the phone posterior given the observation, p(q|o), we can use either MLP activations or GMMs. If both classifiers were correct, their outputs would equal each other and the target. In reality they agree only somewhat. The GMMs tend to have sharp differences between neighboring frames, while the MLPs tend to be smoother. By eyeballing p(q|o) where the GMMs and MLPs differ, it appears that the benefits of frame replacement on segments determined from <math>p_{mlp}(q|o)</math> disappear. For example, a typical correction is /D/ -> /N/ (/N/ is the correct label). And here is what p(q|o) looks like:

[Figure: p(q|o) over the frames of this example. Only the end of the caption survives: "... (q|o) gets it wrong."]

Filtering the GMM p(q|o) with a Hamming window improves the accuracy of the GMMs and their agreement with the MLPs.

[Figure: p(q|o) after filtering. Only the end of the caption survives: "... (q|o) gets it right, unlike the unfiltered case above."]
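The filtering step above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the unit-sum normalization, the truncate-and-renormalize edge handling, the restriction to odd window lengths, and all function names are assumptions.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n (n = 1 gives [1.0])."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def smooth_track(track, n):
    """Filter one phone's p(q|o) track over time with a Hamming window of odd length n.

    Assumption: at the utterance edges the window is truncated and renormalized.
    """
    w = hamming(n)
    half = n // 2
    out = []
    for t in range(len(track)):
        num = den = 0.0
        for k, wk in enumerate(w):
            i = t + k - half
            if 0 <= i < len(track):
                num += wk * track[i]
                den += wk
        out.append(num / den)
    return out

def smooth_posteriors(post, n):
    """post[t][q] -> smoothed posteriors; each phone column is filtered independently."""
    n_phones = len(post[0])
    cols = [smooth_track([frame[q] for frame in post], n) for q in range(n_phones)]
    return [[cols[q][t] for q in range(n_phones)] for t in range(len(post))]

def decide(post):
    """Per-frame argmax over phones."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in post]

# Toy example: phone 0 throughout, with a one-frame glitch towards phone 1.
post = [[0.6, 0.4], [0.6, 0.4], [0.45, 0.55], [0.6, 0.4], [0.6, 0.4]]
raw_labels = decide(post)                            # glitch survives
smooth_labels = decide(smooth_posteriors(post, 5))   # glitch is smoothed away
```

With a single window length for all phones the normalization does not change the argmax; it only matters once different phones get different lengths.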

Cohen's kappa is used to measure the agreement between the p(q|o) generated by the GMMs and by the MLPs, as well as the agreement of each with the true target phonetic label for each frame. We can also look only at the frames that have sufficient support in the model, i.e. those frames with log(p(o)) > T, where T is some threshold. Presumably the GMM should be more accurate on those frames. p(o) has a surprisingly normal distribution, perhaps because it is a sum of 48 somewhat independent p(o,q) terms.

[Figure: p(log(p(o))), the distribution of the log-likelihood of the observations o.]
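As a rough sketch of how the support log(p(o)) and the GMM posterior can be obtained from the per-phone joint log-likelihoods log(p(o,q)): the function names are hypothetical, and it is an assumption here that the phone prior is already folded into p(o,q).

```python
import math

def log_support(log_joint):
    """log p(o) = log sum_q p(o,q), computed stably from the values log p(o,q)."""
    m = max(log_joint)
    return m + math.log(sum(math.exp(v - m) for v in log_joint))

def posteriors(log_joint):
    """p(q|o) = p(o,q) / p(o) for every phone q."""
    lp = log_support(log_joint)
    return [math.exp(v - lp) for v in log_joint]

def supported(frames, threshold=-33.0):
    """Indices of the frames with log p(o) > threshold."""
    return [t for t, lj in enumerate(frames) if log_support(lj) > threshold]

# Toy example with two phones instead of 48.
frames = [[-30.0, -31.0], [-50.0, -60.0]]
well_supported = supported(frames)   # only the first frame passes log p(o) > -33
```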

The following statistics were gathered on the first 233 utterances of the training set that was already timeshrunk with tau=.9 (31402 frames).

                              all frames           frames o such that log(p_{GMM}(o)) > -33
   n in hamming(n) kappa(mlp,gmm) kappa(gmm,target) kappa(mlp,gmm) kappa(gmm,target)
   1               0.3939            0.3511            0.4878        0.3570
   3               0.3986            0.3546            0.4950        0.3639
   5               0.4203            0.3701            0.5226        0.3859
   7               0.4334            0.3803            0.5258        0.3935
   9               0.4428            0.3861            0.5363        0.4079
  11               0.4500            0.3920            0.5427        0.4147
  13               0.4550            0.3962            0.5509        0.4270
  15               0.4584            0.3982            0.5600        0.4445
  17               0.4599            0.3979            0.5531        0.4496
  19               0.4593            0.3966            0.5696        0.4609
  21               0.4579            0.3962            0.5637        0.4596
  23               0.4553            0.3933            0.5651        0.4572

When using the unsmoothed MLP, kappa(mlp,target) = 0.2921. In this case we are using all frames, since p(o) cannot be computed from the MLP activations. The accuracy rate (|mlp == target|/|all frames|) for the MLP on the same data is 0.3250, so kappa is a bit more conservative in declaring agreement.
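To make the relation between accuracy and kappa concrete, here is a small sketch of Cohen's kappa on per-frame labels. The label sequences are made up for illustration and are not taken from the data above.

```python
from collections import Counter

def accuracy(a, b):
    """Fraction of frames on which the two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when both sequences use one and the same label.
    """
    n = len(a)
    p_obs = accuracy(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Made-up frame labels.
target = ["n", "n", "n", "d", "d", "sil", "sil", "sil"]
mlp = ["n", "d", "n", "d", "n", "sil", "sil", "n"]
acc = accuracy(mlp, target)        # 5 of 8 frames agree
kappa = cohens_kappa(mlp, target)  # lower than acc, since chance agreement is removed
```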

This accuracy is lower than the accuracy of .5707 computed on all of the non-timeshrunk training data reported in MLP Activation Features, partly because 6% of the most accurate frames were time-shrunk, and partly because the first 233 utterances were probably tougher speech than the rest of the training data.

Smoothing MLP p(q|o)

Filtering the MLP p(q|o) with a Hamming window also helps, but not enough to beat the p(q|o) from the GMMs:

   n in hamming(n)      kappa(target,mlp)
   1                    0.2921
   3                    0.2945
   5                    0.3019
   7                    0.3089
   9                    0.3164
  11                    0.3232
  13                    0.3305
  15                    0.3374
  17                    0.3438
  19                    0.3485
  21                    0.3526
  23                    0.3548
  25                    0.3569
  27                    0.3564
  29                    0.3557
  31                    0.3543
  33                    0.3518
  35                    0.3482

Choosing the smoothing filter order

Based on the support log(p(o))

Frames with higher likelihood p(o) tend to do better with longer smoothing filters. High likelihood frames also tend to belong to vowel-like phonemes. In any case the hamming window should have a length between 13 and 47.

Plot computed on 2500 utterances (463286 frames), instead of the 233 used in the discussion, but none of the numbers change significantly.

Based on the phone identity

We can also choose the filter order based on the phone.

Plot computed on 2500 utterances (463286 frames), instead of 233 as used in discussion

Correlation between filter length and agreement where frames are binned by log(p(o)) (on the central part -80:-24, where we have enough examples): 0.56

Correlation between filter length and agreement where frames are binned by phone: 0.61

Agreement where we choose the best filter length for each phoneme, weighted by the number of frames in each target phoneme: .4081

Agreement where we choose the best filter length for each p(o) bin, weighted by the number of frames in each unfiltered p(o) bin: .3996

So, in fact choosing the filter length based on the phone is about the same as choosing a filter length based on p(o).

Plot computed on 2500 utterances (463286 frames), instead of 233 as used in discussion

If we take the filter length that gives the maximum agreement for each phone (as in the picture above), we get kappa = 0.3964 for all frames and kappa = 0.4997 for frames with log(p(o)) > -33.

Choosing the filters this way assumes that the discriminant functions p(q|o) don't interact, but of course they do. Nevertheless this gives a decent improvement in agreement.
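A minimal sketch of this per-phone selection follows. Several things here are assumptions rather than the procedure actually used: agreement inside a phone bin is measured as plain accuracy on the frames whose target is that phone, ties go to the shortest length, windows are normalized to unit sum, and all names are hypothetical.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n (n = 1 gives [1.0])."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def smooth_track(track, n):
    """Unit-sum Hamming smoothing of one track; the window is truncated at the edges."""
    w, half = hamming(n), n // 2
    out = []
    for t in range(len(track)):
        pairs = [(wk, track[t + k - half]) for k, wk in enumerate(w)
                 if 0 <= t + k - half < len(track)]
        out.append(sum(wk * x for wk, x in pairs) / sum(wk for wk, _ in pairs))
    return out

def decide(tracks):
    """tracks[q][t] -> per-frame argmax phone."""
    n_frames = len(tracks[0])
    return [max(range(len(tracks)), key=lambda q: tracks[q][t]) for t in range(n_frames)]

def best_length_per_phone(tracks, target, lengths):
    """For each phone, pick the length that maximizes accuracy on that phone's frames."""
    decisions = {n: decide([smooth_track(tr, n) for tr in tracks]) for n in lengths}
    best = {}
    for q in range(len(tracks)):
        frames = [t for t, y in enumerate(target) if y == q]
        if frames:
            best[q] = max(lengths,
                          key=lambda n: sum(decisions[n][t] == q for t in frames))
    return best

def decide_per_phone(tracks, best, default=1):
    """Smooth each phone's track with its own length, then take the argmax."""
    return decide([smooth_track(tr, best.get(q, default)) for q, tr in enumerate(tracks)])

# Toy example: two phones, one glitch frame inside each segment.
target = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
tracks = [[0.6, 0.6, 0.45, 0.6, 0.6, 0.4, 0.4, 0.55, 0.4, 0.4],
          [0.4, 0.4, 0.55, 0.4, 0.4, 0.6, 0.6, 0.45, 0.6, 0.6]]
best = best_length_per_phone(tracks, target, [1, 3, 5])
labels = decide_per_phone(tracks, best)
```

Because each length is picked with all other tracks smoothed by the same length, the final decision with mixed lengths is not guaranteed to reach the per-phone optimum, which is the interaction mentioned above.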

Learning the filter lengths only on the frames where log(p(o)) > -33 gives crazy results (.2701 agreement on all frames, .4696 on log(p(o)) > -33), probably because there is not enough training data for good statistics (735 frames spread unevenly among 48 phonemes).

Conclusions

                                             perphone-filt          unfiltered
log(p(o)) Bin  frames                    GMM            MLP            GMM
--------------------------------------------------------------------------------------
        -Inf   0                           NaN          NaN            NaN
        -100   85                          0.25908      0.051997     0.17328
         -90   416                         0.30971      0.15592      0.26986
         -80   2239                        0.34609      0.18001      0.30075
         -70   8725                        0.36134      0.2363       0.33371
         -60   12478                       0.39291      0.26886      0.3538
         -50   5531                        0.43779      0.38974      0.36987
         -40   1450                        0.50398      0.51657      0.41312
         -30   398                         0.48553      0.58487      0.3573
         -20   71                          0.34144      0.61763      0.20793
         -10   9                           0.47059      0.47059      0.47059
         Inf   0                           NaN           NaN            NaN

- Filtered GMM and filtered MLP are both better than unfiltered.
- Filtered GMM with per-phone filter lengths is better than a single fixed-length filter for all phones (tried with filter lengths 13 and 9, not shown in the table above).
- MLPs outperform GMMs 'close to civilization', where p(o) is high.
- Filtered GMM does better than MLP for log(p(o)) below -40, which is actually most frames.
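A table of this shape can be produced by binning the frames on log(p(o)) and scoring each bin separately. The sketch below assumes that each bin is labelled by its lower edge and covers a width of 10, which is only inferred from the table. The names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cohens_kappa(a, b):
    """Cohen's kappa of two label sequences; NaN when chance agreement is total."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return float("nan") if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

def kappa_by_support(log_po, pred, target, width=10.0):
    """Map each bin's lower edge to (frame count, kappa(pred, target) in that bin)."""
    bins = defaultdict(list)
    for lp, p, y in zip(log_po, pred, target):
        bins[math.floor(lp / width) * width].append((p, y))
    return {edge: (len(pairs),
                   cohens_kappa([p for p, _ in pairs], [y for _, y in pairs]))
            for edge, pairs in sorted(bins.items())}

# Made-up frames: four poorly supported, four well supported.
log_po = [-65, -62, -61, -68, -35, -31, -38, -33]
pred = ["a", "b", "a", "b", "a", "a", "b", "b"]
target = ["a", "b", "b", "b", "a", "a", "b", "b"]
table = kappa_by_support(log_po, pred, target)
```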

On 2500 utterances:

kappa(gmm,target) unfiltered 0.35589
kappa(mlp,target) unfiltered 0.32619
kappa(gmm,target) best filter length conditioned by phone: .39733
- only on log(p(o))>-33: .45707

                                           perphone-filt          unfiltered
 log(p(o)) Bin  frames                   GMM          MLP            GMM
 --------------------------------------------------------------------------------------

        -Inf          277                0.1249       0.074682     0.090823
         -99         1533                0.26311      0.16551      0.27024
         -89        10166                0.29999      0.21491      0.27513
         -79        52196                0.33079      0.22387      0.29931
         -69   1.3867e+05                0.36925      0.26009      0.32804
         -59   1.5156e+05                0.40638      0.3173       0.36313
         -49        70282                0.44807      0.43212      0.39767
         -39        25938                0.46277      0.53956      0.41381
         -29         9672                0.42837      0.56169      0.39937
         -19         2507                0.49423      0.5931       0.47115
          -9          485                0.31279      0.29432      0.41931
         Inf            0                NaN          NaN          NaN

On 2000 dev-set utterances (these can be used as the test set for this purpose):

                               perphone-filt           unfiltered
 log(p(o)) Bin  frames       GMM          MLP         GMM        MLP
 ----------------------------------------------------------------------
   -100         3719      0.40192       0.38186      0.32858     0.33073                     
    -90        20128      0.40242       0.36689      0.34442     0.30818                     
    -80        87978      0.40683       0.36715      0.35045     0.30725                     
    -70   2.0201e+05      0.42151        0.3894      0.36066     0.32734                     
    -60   2.0084e+05      0.43364       0.43082      0.37067     0.3731                      
    -50   1.0783e+05      0.44002       0.49902      0.38035     0.45515                     
    -40        63801      0.36286       0.47732      0.31299     0.45692                     
    -30        40773      0.25776       0.35737      0.22021     0.35467                     
    -20        20617      0.14172       0.19363      0.13572     0.1965                      
    -10            0          NaN           NaN          NaN     NaN                         
 ----------------------------------------------------------------------
    all                   0.44242       0.45157      0.38798     0.40265          
All, Ignoring errors among sil<->eow<->non-speech:
                          0.5275        0.51813      0.4747      0.4699

Filter lengths are optimized independently per phone on <math>p_{GMM}(q|o)</math> ignoring the sil<->eow<->non-speech errors.

The same filter lengths are also used for filtered MLP.
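Ignoring the sil<->eow<->non-speech errors amounts to collapsing those labels into one class before scoring. A small sketch of that idea follows. The label strings, in particular "nonspeech", are hypothetical placeholders for whatever symbols the phone set uses.

```python
# Hypothetical label names for the classes whose mutual confusions are ignored.
IGNORED = {"sil", "eow", "nonspeech"}

def collapse(label):
    """Map every silence-like label to one shared class."""
    return "SIL" if label in IGNORED else label

def agreement(pred, target, ignore_confusions=False):
    """Frame accuracy, optionally ignoring confusions among the silence-like labels."""
    if ignore_confusions:
        pred = [collapse(p) for p in pred]
        target = [collapse(y) for y in target]
    return sum(p == y for p, y in zip(pred, target)) / len(target)

# Made-up example: two of the three errors are only sil/eow swaps.
pred = ["sil", "eow", "n", "d"]
target = ["eow", "sil", "n", "n"]
strict = agreement(pred, target)
lenient = agreement(pred, target, ignore_confusions=True)
```

The same collapsing can be applied to the label sequences before computing kappa.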

Frame errors introduced as a result of segment replacement

We can estimate how much damage this frame replacement does by looking at the frame classification errors it introduces. This is not a lower limit on the damage, because many of the errors have to do only with the boundaries while the sequence of phonemes remains the same. It is also not an upper limit, because we don't know how the frame error rate translates into WER.

Dashed lines use a Hamming window of length 13 for all phones; solid lines use the tuned filter lengths described above.

So even at <math>\tau=.84</math> only .3% of the frames are made incorrect. Hopefully this won't offset the better models we hope to get by selecting more certain representative frames.

how often the q in max p(o,q) frame in segment differs from max q in p(q|o)'s

It's apparently a bad idea to pick the frame with max p(o,q) as the representative frame, since p(o) appears quite noisy. If the <math> max p(q,o)</math> predicts some <math>q_1</math>, and it's at the edge of a whole sequence of frames where <math>max p(q|o)=q_2</math>, <math>q_2</math> is probably right. When there is a difference between p(q,o) and segment label (ignoring the eow-sil confusion), p(q,o) is wrong on about 93% of the frames, and segment label is wrong on about 60% of the frames.

The minimum required by GEM is that <math>\sum_{i \in S} \log p(o_i,q_i^*) < |S| \log p(o_b,q^*_b)</math>, where <math>b\in S</math> is our representative frame from the segment. So it's better to let <math>q^*_b</math> be the state that most frames vote for, and not the one that a single frame votes for most strongly. How to quantify this robustness goal?

This seems like a better thing to do, as the purple and yellow lines in the next pictures demonstrate. They are the only thing different between the two plots.

The plots also demonstrate the advantage of roughly tuning the filter lengths to minimize the in-segment frame error rate, which ignores the EOW<->SIL<->non-speech-sounds difference. The search was done using the Nelder-Mead algorithm (a.k.a. amoeba, a.k.a. fminsearch() in MATLAB) with the parameters rounded to integers, which almost certainly made it work worse. The right thing to do would be simply to run an iterative integer grid search around the current vector of filter lengths.
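The integer grid search suggested above could look roughly like the sketch below. It was not run in these experiments. The cost function is a stand-in for the in-segment frame error rate, and the step of 2 (to keep lengths odd) and the bounds of 1 and 47 are assumptions.

```python
def integer_grid_search(cost, start, step=2, lo=1, hi=47, max_sweeps=100):
    """Coordinate-wise search over integer vectors.

    Each sweep tries moving every coordinate by -step and +step and keeps any
    move that lowers the cost. The search stops when a full sweep finds no
    improvement.
    """
    x = list(start)
    best = cost(x)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(x)):
            for delta in (-step, step):
                cand = x[i] + delta
                if lo <= cand <= hi:
                    trial = x[:i] + [cand] + x[i + 1:]
                    c = cost(trial)
                    if c < best:
                        x, best, improved = trial, c, True
        if not improved:
            break
    return x, best

# Stand-in cost with a known optimum; the real cost would be the frame error
# rate obtained with the given per-phone filter lengths.
optimum = [13, 5, 21]

def toy_cost(lengths):
    return sum((a - b) ** 2 for a, b in zip(lengths, optimum))

found, found_cost = integer_grid_search(toy_cost, [9, 9, 9])
```

Unlike rounding inside Nelder-Mead, every point this search evaluates is a valid vector of filter lengths, though it can still stop in a local minimum.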

Stats on 463286 frames. Solid lines are on amoeba-optimized filter lengths, dashed lines are on pre-amoeba filter lengths. The frame with maximum p(o,q) is chosen as the representative frame in the segment.

[Figure caption, beginning truncated in the source:] "... (q|o) such that its <math>p(o,q)</math> is higher than the geometric average over the <math>p(o,q)</math>'s of the segment is chosen as the representative frame for the segment."
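One possible reading of the two selection rules compared in this section is sketched below. It is an interpretation, not the code used for the plots. The assumptions are that the segment label q* is the majority vote of the per-frame argmax, that every frame is scored against q*, that >= replaces the strict inequality so at least one frame always qualifies, and that the qualifying frame with the highest p(q*|o) is the one picked.

```python
import math
from collections import Counter

def log_sum_exp(values):
    """Stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def max_joint_frame(log_joint):
    """Baseline rule: the (frame, state) pair with the single highest log p(o,q)."""
    return max(((t, q) for t, frame in enumerate(log_joint) for q in range(len(frame))),
               key=lambda tq: log_joint[tq[0]][tq[1]])

def representative_frame(log_joint):
    """Voting rule: log_joint[t][q] = log p(o_t, q) for one segment -> (b, q_star)."""
    votes = Counter(max(range(len(f)), key=f.__getitem__) for f in log_joint)
    q_star = votes.most_common(1)[0][0]
    scores = [f[q_star] for f in log_joint]
    mean_score = sum(scores) / len(scores)      # log of the geometric mean
    candidates = [t for t, s in enumerate(scores) if s >= mean_score]
    if not candidates:                          # guard against float rounding
        candidates = [scores.index(max(scores))]
    b = max(candidates, key=lambda t: log_joint[t][q_star] - log_sum_exp(log_joint[t]))
    return b, q_star

# Toy segment: three frames vote for state 0, one frame votes strongly for state 1.
segment = [[-30.0, -33.0], [-32.0, -34.0], [-40.0, -20.0], [-31.0, -32.0]]
naive = max_joint_frame(segment)        # follows the single loud frame
voted = representative_frame(segment)   # follows the majority
```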

The experiments that should be done

smoothing
- smoothing with a constant-length filter (say 9 or something conservative) (maybe not this one)
- smoothing with best length filters for each phoneme

tau
- use <math>\tau=.9</math>. This will check the iteration more cleanly: the only difference between initialization and iteration is mlp vs plp.
- use <math>\tau=.84</math>, since the error rate at that tau seems to be the same as at tau=.9 (see picture) and we are affecting more frames (4.2% vs 2.7%).

Filtered GMMs actually have a strictly lower frame error rate: .070 for GMM vs .114 for MLP. This could be good. Let's hope.

drop low-conf frames by ^0

Do the experiments on the last iteration, do the best one on the whole thing.

figure out how to incorporate p(o) into criteria. just threshold or ignore or what?

References
