MLP GMM Agreement

Smoothed GMMs vs MLPs and GMM accuracy

To compute the probability of a phone given the observation, p(q|o), we can use either MLP activations or GMMs. If both classifiers were correct, their outputs would be equal to each other and to the target. In reality they agree only somewhat. The GMMs tend to have sharp differences between neighboring frames, while the MLPs tend to be smoother. Eyeballing p(q|o) where the GMMs and MLPs differ suggests that the benefit of frame replacement on segments determined from <math>p_{mlp}(q|o)</math> disappears. For example, a typical correction is /D/ -> /N/ (/N/ is the correct label). Here is what p(q|o) looks like:

(Figure: the unfiltered GMM <math>p_{gmm}(q|o)</math> gets it wrong.)


Filtering the GMM p(q|o) with a Hamming window improves the accuracy of the GMMs and their agreement with the MLPs.

(Figure: the Hamming-filtered GMM <math>p_{gmm}(q|o)</math> gets it right, unlike the unfiltered case above.)
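Concretely, the smoothing amounts to convolving each phone's posterior track over time with a normalized Hamming window and renormalizing each frame. Here is a minimal sketch in Python/numpy; the array name post and its (frames x phones) layout are assumptions, not something specified on this page:

   import numpy as np

   def smooth_posteriors(post, n):
       """Smooth p(q|o) over time with a length-n Hamming window.
       post: (T, Q) array of phone posteriors, one row per frame (assumed layout)."""
       win = np.hamming(n)
       win /= win.sum()                  # normalize so the filter preserves probability mass
       smoothed = np.empty_like(post)
       for q in range(post.shape[1]):    # filter each phone's track over time
           smoothed[:, q] = np.convolve(post[:, q], win, mode="same")
       # renormalize each frame so it is again a distribution over phones
       return smoothed / smoothed.sum(axis=1, keepdims=True)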

Cohen's kappa is used to measure agreement between the p(q|o) generated by the GMMs and the MLPs, as well as between each of these and the true target phonetic label for each frame. We can also look only at the frames that have sufficient support in the model, i.e. those frames with log(p(o)) > T, where T is some threshold. Presumably the GMM should be more accurate on those frames. p(o) has a surprisingly normal distribution, perhaps because it is a sum of 48 somewhat independent p(o,q) terms.

(Figure: p(log(p(o))), the distribution of the log-likelihood of the observations o.)
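For concreteness, here is a minimal numpy sketch of the agreement measure and the support threshold; the function and variable names are illustrative, not from this page:

   import numpy as np

   def cohens_kappa(a, b):
       """Cohen's kappa between two equal-length integer label sequences."""
       a, b = np.asarray(a), np.asarray(b)
       p_obs = np.mean(a == b)           # observed frame-level agreement
       # chance agreement: sum over labels of the product of the two marginal frequencies
       p_exp = sum(np.mean(a == l) * np.mean(b == l) for l in np.union1d(a, b))
       return (p_obs - p_exp) / (1.0 - p_exp)

   def kappa_above_threshold(a, b, log_p_o, thresh=-33.0):
       """Kappa restricted to well-supported frames, i.e. log(p(o)) > thresh."""
       keep = np.asarray(log_p_o) > thresh
       return cohens_kappa(np.asarray(a)[keep], np.asarray(b)[keep])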

The following statistics were gathered on the first 233 utterances of the training set, which had already been timeshrunk with tau=0.9 (31402 frames).

                              all frames           frames o such that log(p_{GMM}(o)) > -33
   n in hamming(n) kappa(mlp,gmm) kappa(gmm,target) kappa(mlp,gmm) kappa(gmm,target)
   1               0.3939            0.3511            0.4878        0.3570
   3               0.3986            0.3546            0.4950        0.3639
   5               0.4203            0.3701            0.5226        0.3859
   7               0.4334            0.3803            0.5258        0.3935
   9               0.4428            0.3861            0.5363        0.4079
  11               0.4500            0.3920            0.5427        0.4147
  13               0.4550            0.3962            0.5509        0.4270
  15               0.4584            0.3982            0.5600        0.4445
  17               0.4599            0.3979            0.5531        0.4496
  19               0.4593            0.3966            0.5696        0.4609
  21               0.4579            0.3962            0.5637        0.4596
  23               0.4553            0.3933            0.5651        0.4572


When using the unsmoothed MLP, kappa(mlp,target) = 0.2921. In this case we use all frames, since p(o) cannot be computed from the MLP activations. The accuracy rate (|mlp == target|/|all frames|) for the MLP on the same data is 0.3250, so kappa is a bit more conservative in declaring agreement.
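(Recall that kappa corrects raw agreement for chance: <math>\kappa = \frac{p_o - p_e}{1 - p_e}</math>, where <math>p_o</math> is the observed agreement and <math>p_e</math> the agreement expected from the two labelers' marginal label frequencies. Solving with <math>p_o = 0.3250</math> and <math>\kappa = 0.2921</math> gives an implied chance agreement of <math>p_e \approx 0.046</math>, which is why kappa reads lower than raw accuracy.)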

This accuracy is lower than the accuracy of 0.5707 computed on all of the non-timeshrunk training data reported in MLP Activation Features, partly because 6% of the most accurate frames were time-shrunk, and partly because the first 233 utterances were probably tougher speech than the rest of the training data.

Smoothing MLP p(q|o)

Filtering the MLP's p(q|o) with a Hamming window also helps, but not enough to beat the p(q|o) from the GMMs:

   n in hamming(n)      kappa(target,mlp)
   1                    0.2921
   3                    0.2945
   5                    0.3019
   7                    0.3089
   9                    0.3164
  11                    0.3232
  13                    0.3305
  15                    0.3374
  17                    0.3438
  19                    0.3485
  21                    0.3526
  23                    0.3548
  25                    0.3569
  27                    0.3564
  29                    0.3557
  31                    0.3543
  33                    0.3518
  35                    0.3482
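The numbers in these tables come from a straightforward sweep: smooth with each window length, take the argmax phone per frame, and score kappa. A sketch reusing the helpers from the earlier snippets (post, targets, and log_p_o are assumed inputs):

   import numpy as np

   for n in range(1, 37, 2):             # odd Hamming-window lengths 1..35, as in the table
       labels = np.argmax(smooth_posteriors(post, n), axis=1)
       print(n,
             cohens_kappa(labels, targets),                    # all frames
             kappa_above_threshold(labels, targets, log_p_o))  # log(p(o)) > -33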

Choosing the smoothing filter order

Frames with higher likelihood p(o) tend to do better with longer smoothing filters. High-likelihood frames also tend to belong to vowel-like phonemes. In any case, the Hamming window should have a length between 13 and 47.

(Figure: KappaFilterOrderVsSupport.png — agreement (kappa) versus smoothing filter order, by support log(p(o)).)

Correlation between filter length and agreement where frames are binned by log(p(o)) (on the central part -80:-24, where we have enough examples): 0.56

Correlation between filter length and agreement where frames are binned by phone: 0.61

Agreement where we chose the best filter length for each phoneme, weighted by the number of frames in each target phoneme: 0.4081

Agreement where we chose the best filter length for p(o), weighted by the number of frames in each unfiltered p(o) bin: 0.3996

So, in fact choosing the filter length based on the phone is about the same as choosing a filter length based on p(o).

If we take the filter length that gives the maximum agreement for each phone, we get kappa = 0.3964 for all frames and kappa = 0.4997 for frames with log(p(o)) > -33.

Choosing the filters this way assumes that the discriminant functions p(q|o) don't interact, but of course they do. Nevertheless this gives a decent improvement in agreement.

Learning the filter lengths only on the frames where log(p(o)) > -33 gives erratic results (0.2701 agreement on all frames, 0.4696 on log(p(o)) > -33), probably because there is not enough training data for good statistics (735 frames spread unevenly among 48 phonemes).
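A sketch of the per-phone selection described above: for each target phoneme, pick the window length that agrees best on that phoneme's frames, then relabel each frame using its target phone's chosen filter. Per-phone agreement is scored as raw accuracy here, since kappa is not well defined for a single class; as noted, this ignores interactions between the per-phone tracks:

   import numpy as np

   def best_length_per_phone(post, targets, lengths=range(1, 49, 2)):
       # argmax labels for every candidate window length
       labeled = {n: np.argmax(smooth_posteriors(post, n), axis=1) for n in lengths}
       # for each phone, the length with the best agreement on that phone's frames
       best = {q: max(lengths, key=lambda n: np.mean(labeled[n][targets == q] == q))
               for q in np.unique(targets)}
       # relabel each frame using the filter chosen for its target phone
       labels = np.array([labeled[best[q]][t] for t, q in enumerate(targets)])
       return best, labels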

The experiments that should be done

  • smoothing with a constant length filter (say 9 or something conservative)
  • smoothing with best length filters for each phoneme
  • check how often the q in max p(o,q) frame in segment differs from max q in p(q|o)'s
  • drop low-confidence frames by flattening them with ^0 (see the sketch below)
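For the last item, a sketch under one possible reading of "^0": low-confidence frames are neutralized by raising their posteriors to the zeroth power, i.e. replaced with a uniform distribution; both the reading and the threshold are assumptions:

   import numpy as np

   def flatten_low_confidence(post, conf_thresh=0.5):   # threshold value is an assumption
       flat = post.copy()
       low = post.max(axis=1) < conf_thresh             # frames where no phone dominates
       flat[low] = 1.0 / post.shape[1]                  # p(q|o)**0 renormalized = uniform
       return flat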

Do the experiments on the last iteration; then do the best one on the whole training set.

Figure out how to incorporate p(o) into the criteria. Just a threshold, or something more refined?

