MLP Activation Features


MLP classifiers

In addition to the PLP features, we train a Multi-Layer Perceptron (MLP) to classify each frame as one of the possible phonemes. To classify frame i of an utterance, the inputs are the PLP features for the 9 consecutive frames [i-4 ... i+4] centered on frame i, so the input layer has 39*9 = 351 nodes. The output layer has one node per phoneme; our phonebet has 48 phonemes, so there are 48 output nodes.

The number of hidden nodes is a black art. I set it at 3500 (although 350000 would probably be better). See the config file for some discussion of it.
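For concreteness, here is a minimal sketch of an equivalent network in PyTorch. We actually train with QuickNet, so the window-stacking helper, the edge clamping, and the choice of sigmoid hidden units below are illustrative assumptions, not our scripts.

<pre>
import torch
import torch.nn as nn

NUM_PLP = 39       # PLP features per frame
CONTEXT = 9        # frames i-4 ... i+4
NUM_PHONES = 48    # size of our phonebet
NUM_HIDDEN = 3500  # see the config file

# Hypothetical helper: stack the 9-frame context window around frame i.
# plp is a (num_frames, 39) matrix for one utterance; edge frames are
# clamped (how QuickNet pads utterance edges is not specified here).
def stack_context(plp: torch.Tensor, i: int, half: int = 4) -> torch.Tensor:
    idx = torch.clamp(torch.arange(i - half, i + half + 1), 0, plp.shape[0] - 1)
    return plp[idx].reshape(-1)  # (39*9,) = (351,)

mlp = nn.Sequential(
    nn.Linear(NUM_PLP * CONTEXT, NUM_HIDDEN),
    nn.Sigmoid(),                       # hidden non-linearity (assumed)
    nn.Linear(NUM_HIDDEN, NUM_PHONES),
    nn.Softmax(dim=-1),                 # output non-linearity (see Post Processing)
)

plp = torch.randn(200, NUM_PLP)         # fake utterance: 200 frames of PLP
posteriors = mlp(stack_context(plp, i=50))
assert posteriors.shape == (NUM_PHONES,)
</pre>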

The whole process closely follows the <ref name="Frankel2006Articulatory">Frankel2006Articulatory</ref> paper and uses the aptly named QuickNet package. Simple as rakes, but effective.

The scripts for the whole thing are in http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/mlp.

The MLPs are trained on the training set of the Fisher Corpus and cross-validated on the development set, minus the first 10000 utterances of the dev set.

Frame Accuracies

Training was done on a 4-core 2.4 GHz CPU, compiled with Intel's compiler and Intel's math library. Cross-validation takes around 3 hours 40 minutes, and training around 3 days, at roughly 3 hours 30 minutes per epoch.

{| class="wikitable"
|-
! rowspan="2" | Epoch
! rowspan="2" | Learning rate
! colspan="3" | Accuracy
|-
! Train set !! CV set !! Entire corpus
|-
| 1 || 0.0004 || 55.97% || 54.54% ||
|-
| 2 || 0.0004 || ?? || ?? ||
|-
| 3 || 0.0004 || ?? || ?? ||
|-
| 6 (Final) || 0.000050 || 59.39% || 57.07% || 57.56%
|}

Final entire corpus stats:

Recognition time: 117697.96 secs (32 hours, 41 mins, 37 secs).
Recognition speed: 8832.58 MCPS, 6308.8 frames/sec.
Recognition accuracy: 427433316 right out of 742529486, 57.56% correct.
Program stop: Wed May 27 19:49:26 2009

Post Processing

The output-node non-linearity is a softmax, and the frame-by-frame MLP outputs are very spiky: one node is highly active while the others are close to 0. Since we want to fit this data with mixtures of Gaussians, we compress the activations by applying a log. Since we want to use diagonal covariances for the Gaussians, we decorrelate the data with PCA, and also select (project onto) the minimum number of decorrelated features that keeps the Frobenius norm of the projected covariance matrix within 0.05 of that of the original covariance matrix.

The covariance for the PCA is computed only on the training portion of the Fisher Corpus, but the entire corpus is post-processed.
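A NumPy sketch of this pipeline, under the assumption that "within 0.05" means the projected covariance retains at least 95% of the original Frobenius norm; the epsilon inside the log, the mean centering, and the eigh-based PCA are likewise my assumptions, not the actual post-processing script.

<pre>
import numpy as np

def fit_postprocessor(train_acts, tol=0.05):
    """train_acts: (num_train_frames, 48) softmax activations (training portion only)."""
    # Compress the spiky softmax outputs (epsilon guards exact zeros).
    logged = np.log(train_acts + 1e-10)
    mean = logged.mean(axis=0)
    cov = np.cov(logged, rowvar=False)
    # PCA: eigendecomposition of the training covariance, sorted descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # The projected covariance is diag(top-k eigenvalues), so its Frobenius
    # norm is sqrt(sum of those eigenvalues squared).  Keep the smallest k
    # whose norm is within tol of the full covariance's norm.
    full_norm = np.linalg.norm(eigvals)
    k = next(k for k in range(1, len(eigvals) + 1)
             if np.linalg.norm(eigvals[:k]) >= (1 - tol) * full_norm)
    return mean, eigvecs[:, :k]

def postprocess(acts, mean, basis):
    # Applied to the *entire* corpus, using the training-set statistics.
    return (np.log(acts + 1e-10) - mean) @ basis
</pre>

Because the covariance of the projected data is diagonal by construction, the resulting features are a better match for diagonal-covariance Gaussians than the raw activations.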


Bugs

Some utterances have multiple </S> </S> ... end-of-utterance tags at the end of their transcriptions (due to my careless forced-alignment code), and so the MLP models were trained with these transcriptions. The good news is that these multiple </S> </S> tags seem to just cover some noisy silence at the end of an utterance.

References

<references/>
