Fisher Baseline Experiments


A set of traditional phone-based baselines against which to compare our mixed-unit systems. An implementation using GMTK and the associated scripts is here.


Multi-pronunciation Monophone Model

The initial model is a monophone model, with the number of states per phone specified in phone.state and 64 Gaussians per state.

The dictionary used allows multiple pronunciations (up to 6). See more details here.

Training on the entire Fisher training set (1,775,831 utterances, specified in Fisher Corpus) takes an exceedingly long time: more than 15 hours for a single EM iteration on a 32-CPU cluster, or about 20 cluster compute-days to reach 64-Gaussian mixtures, doubling the number of mixture components after convergence (about 6 EM iterations per doubling).
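For concreteness, here is a rough sketch of the split-and-retrain schedule implied by those numbers; the constants come from the text above, while run_em_iteration is a hypothetical placeholder and not part of the actual training scripts:

  HOURS_PER_EM_ITERATION = 15   # one pass over 1,775,831 utterances on 32 CPUs
  ITERATIONS_PER_STAGE = 6      # EM iterations to converge before each split

  def run_em_iteration(num_components):
      """Placeholder for a single EM pass with num_components Gaussians per state."""
      pass

  total_hours = 0
  num_components = 1
  while num_components <= 64:
      for _ in range(ITERATIONS_PER_STAGE):
          run_em_iteration(num_components)
          total_hours += HOURS_PER_EM_ITERATION
      num_components *= 2   # split: double the mixture size, then re-converge

  # With these assumptions this comes to roughly 26 compute-days, the same
  # ballpark as the ~20 cluster compute-days quoted above.
  print(total_hours / 24.0, "cluster compute-days")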

Possible solutions are:

  • Do training on data where word boundaries are observed. (Thanks Chris) For this we need to force-align the entire fisher corpus using some existing recognizer.
  • Use a smaller corpus.
  • Use TeraGrid (our allocation of 30,000 CPU-hours is only about 40 days on our cluster - almost not worth the effort of porting)
  • Fool around with better triangulations (not much hope for improvement, since the RVs within each frame are densely connected in our DBN)

Training on word-aligned data

To speed up training, I've used the above model to obtain word boundaries via forced alignment (see the forceAlign.pl script). The word boundaries are in wordObservationsWordAligned.scp. Some of the utterances were impossible to word-align to their transcriptions; transcription and scp files omitting those bad utterances are also generated.

The model with observed word boundaries trains about 6 times faster, and took about 3-4 days to train. The model from word-aligned training is in the mpWAmodel dir.
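A minimal sketch of the filtering step, assuming one utterance per line with the utterance ID as the first whitespace-separated field; all file names other than wordObservationsWordAligned.scp are hypothetical placeholders:

  # Drop utterances that failed forced alignment from the scp and transcription files.
  failed_ids = set(line.strip() for line in open("failedAlignments.list"))  # hypothetical list

  def filter_file(in_path, out_path):
      """Copy in_path to out_path, skipping lines whose utterance ID failed to align."""
      with open(in_path) as fin, open(out_path, "w") as fout:
          for line in fin:
              fields = line.split()
              if fields and fields[0] in failed_ids:
                  continue
              fout.write(line)

  filter_file("wordObservationsWordAligned.scp", "wordObservationsWordAligned.filtered.scp")
  filter_file("transcriptions.txt", "transcriptions.filtered.txt")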


Decoding and Scoring

Decoding is done by GMTKViterbiNew and scoring by sclite.
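For reference, the WER that sclite reports is (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of that core computation follows; sclite itself additionally applies the mapping and optional-deletion rules described in the scoring sections below:

  def wer(reference, hypothesis):
      """Word error rate via Levenshtein alignment of whitespace-split word strings."""
      ref, hyp = reference.split(), hypothesis.split()
      # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
      d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          d[i][0] = i
      for j in range(len(hyp) + 1):
          d[0][j] = j
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
              d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
      return d[len(ref)][len(hyp)] / float(len(ref))

  print(wer("yeah i think so", "yeah uh i think it so"))  # 0.5 (two insertions, four reference words)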

Preliminary Results

Tested on the first 500 utterances of the Dev Set, using the following language models. LM_SCALE and LM_PENALTY were not tuned. The language model and scoring treat all word fragments, filled pauses, and non-speech events as distinct words, and so the WER reported is conservative.

Experiments with language model and vocab choices
Row | Vocab | Lang Model | Config    | WER
1   | 10000 | bigram     | config 0  | 67.3%
2   | 10000 | trigram    | config 9  | 67.2%
3   | 5000  | trigram    | config 10 | 67.7%
4   | 1000  | trigram    | config 11 | 69.5%
5   | 500   | trigram    | config 12 | 71.6%

Testing the word-aligned model quality

Tested on the first 500 utterances of the Dev Set, using the 10k-vocab 3-gram LM with no word fragments and no <unk>, mapping filled pauses (uh, huh, etc.) and non-speech events (e.g. [LAUGH]) to an optionally deletable %hesitation, and allowing word fragments to match the prefix or suffix of a hypothesized word (identical to row 2 in the next table, with the bigram replaced by a trigram and mpModel replaced by mpWAModel). Using the model in mpWAModel yields 65.6% WER, about 0.9% absolute worse than the model trained without observed word boundaries.
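A minimal sketch of this scoring normalization; the filled-pause and non-speech token sets below are illustrative, not the exact lists used in the experiments:

  FILLED_PAUSES = {"uh", "um", "huh", "hm"}          # illustrative
  NON_SPEECH = {"[LAUGH]", "[NOISE]", "[COUGH]"}     # illustrative

  def normalize(token):
      """Map filled pauses and non-speech events to the optionally deletable %hesitation."""
      if token.lower() in FILLED_PAUSES or token in NON_SPEECH:
          return "%hesitation"
      return token.lower()

  def fragment_matches(fragment, word):
      """True if a fragment like 'commu-' or '-tion' matches the hypothesized word."""
      if fragment.endswith("-"):
          return word.startswith(fragment[:-1])
      if fragment.startswith("-"):
          return word.endswith(fragment[1:])
      return fragment == word

  print(normalize("uh"))                             # %hesitation
  print(fragment_matches("commu-", "communities"))   # True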


Decoding and Scoring as NIST does it

Tested on the first 500 utterances of the Dev Set, using the 10k-vocab 2-gram LM. LM_SCALE and LM_PENALTY were not tuned (fixed at 10 and -1, respectively).


Experiments with different scoring rules
Row | LM | Post-Viterbi processing and sclite scoring rules | Config | WER
1 | Same as row 1 in previous section (for comparison purposes) | Filled pauses and word fragments treated as distinct words; identical to the scoring in the previous table | config 0 | 67.3%
2 | Same as in previous section, but the vocab contains no word fragments (filled in with more whole words to reach 10k) and no <unk> | Word fragments match any word with a common prefix/suffix; filled pauses and non-speech sounds mapped to %hesitation and optionally deletable | config 0 | 64.7%
3 | Same as row 1 in previous section | Word fragments deleted in post-Viterbi processing and made optionally deletable in the reference transcription | config 17 | 66.6%
4 | Same as row 1 in previous section | Word fragments and <unk> deleted in post-Viterbi processing (but <unk> is allowed in the language model) and made optionally deletable in the reference transcription | config 17 | 64.9%
5 | 20k vocab, 3-gram, GT smoothed, pruned, no word fragments, no <unk> | Word fragments match a common prefix/suffix; filled pauses and non-speech sounds mapped to %hesitation and optionally deletable | config 13 | 64.9%

Single Pronunciation Dictionary Experiment

This is the first experiment supporting my thesis.

A single pronunciation dictionary fisherPhoneticSpronDict.txt was generated from the multiple pronunciation dictionary fisherPhoneticMpronDict.txt by selecting, for each word, the highest-probability pronunciation, where the pronunciation probabilities are learned with Baum-Welch in the mpModel.
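A minimal sketch of the dictionary collapse, assuming a "word probability phone phone ..." line format; the actual layout of fisherPhoneticMpronDict.txt may differ:

  # Keep only the highest-probability pronunciation for each word.
  best = {}  # word -> (probability, pronunciation)

  with open("fisherPhoneticMpronDict.txt") as f:
      for line in f:
          fields = line.split()
          if len(fields) < 3:
              continue
          word, prob, phones = fields[0], float(fields[1]), " ".join(fields[2:])
          if word not in best or prob > best[word][0]:
              best[word] = (prob, phones)

  with open("fisherPhoneticSpronDict.txt", "w") as f:
      for word in sorted(best):
          f.write("%s %s\n" % (word, best[word][1]))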

In this experiment we use the post-Viterbi processing and sclite scoring that gave the best score in the previous section (row 2).

Single vs. multi-pronunciation dictionary
Row | LM | Dictionary | Config | WER | Compute time
1 (identical to row 2 in previous table) | 10k vocab bigram, GT smoothed, pruned | Multi-pron | config 0 | 64.7% | 2:56 hours
2 | 10k vocab bigram, GT smoothed, pruned | Single-pron | config 0 | 63.9% | 2:00 hours
3 | 10k vocab trigram, GT smoothed, pruned | Single-pron | config 9 | 64.0% | 2:00 hours

Single-pronunciation Triphone Model

I've also used the single pronunciation dictionary to train a triphone model. The triphone clustering questions can be found in the genTriUnitParams.py script.
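For readers unfamiliar with triphone clustering, the questions are typically of the form "does the left (or right) context phone belong to phonetic class X". A minimal illustrative sketch follows; the real question set lives in genTriUnitParams.py and the classes below are assumptions:

  QUESTIONS = {
      "L_Nasal": ("left",  {"m", "n", "ng"}),
      "R_Nasal": ("right", {"m", "n", "ng"}),
      "L_Stop":  ("left",  {"p", "t", "k", "b", "d", "g"}),
      "R_Vowel": ("right", {"aa", "ae", "ah", "eh", "ih", "iy", "uw"}),
  }

  def answers(triphone):
      """Answer each question for a triphone written as 'left-center+right', e.g. 'k-ae+t'."""
      left, rest = triphone.split("-")
      center, right = rest.split("+")
      return {name: (left if side == "left" else right) in phones
              for name, (side, phones) in QUESTIONS.items()}

  print(answers("n-ae+t"))   # {'L_Nasal': True, 'R_Nasal': False, 'L_Stop': False, 'R_Vowel': False}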

The structure of the graphical model is as follows:

[Image: triphoneTrain.png - structure for a cross-word triphone GM]

You can also open it in dia format (http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/triphone/PARAM/triphoneTrain.dia) to show/hide the right-context/left-context layers.

WER as a function of number of components

Now that we have more RAM in our cluster, we can try using more components per mixture and see where the WER bottoms out. All the experiments are done using the word-internal triphone config 5, and tested with the 10k-vocab trigram LM (row 3 in the previous table).

WER vs. number of components per mixture
Row | Num components | Description                                   | WER
1   | 1              |                                               | 75.9%
2   | 2              |                                               | 69.6%
3   | 4              |                                               | 65.9%
4   | 8              |                                               | 63.9%
5   | 16             |                                               | ??%
6   | 32             |                                               | ??%
7   | 64             |                                               | ??%
8   | 128            |                                               | ??%
9   | 256            |                                               | ??%
10  | ??             | Row 9 with low-probability components removed | ??%
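A minimal sketch of what "low-probability components removed" (row 10) could look like, assuming a simple threshold on the mixture weights; the actual pruning is presumably done through GMTK's own component-removal mechanism rather than a script like this:

  def prune_mixture(weights, threshold=1e-3):
      """Drop components whose weight is below threshold and renormalize the rest."""
      kept = [i for i, w in enumerate(weights) if w >= threshold]
      total = sum(weights[i] for i in kept)
      return kept, [weights[i] / total for i in kept]

  kept, new_weights = prune_mixture([0.40, 0.35, 0.2495, 0.0004, 0.0001])
  print(kept)          # [0, 1, 2]
  print(new_weights)   # remaining weights renormalized to sum to 1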

Cross-word left context

Does not slow things down too much; the results are below. Testing is done as in row 3 of the table in #Single Pronunciation Dictionary Experiment.

Cross-word vs. word-internal left context
Row | Train config | WER
1   | config 5     | ??
2   | config 3     | ??

Cross-word right context

Does not get efficiently triangulated (yet - Jeff's looking at it) and slows training by about 30 times over the word-internal right context. We'll try it when we can speed it up some.
