Fisher Corpus
From SpeechWiki
Revision as of 22:13, 24 April 2008
The Fisher corpus is still relatively new and rough; this page is meant to help people quickly build a basic speech recognizer with it.
Train/Devel/Test partition
I've split the entire Fisher corpus into Train/Devel/Test partitions of 80/10/10 percent.
The utterance ID file is filelists/uttIds.txt, and the splits are as follows:
- Conversation 00001A to 09360B : training set
- Conversation 09361A to 10530B : devel set
- Conversation 10531A to 11699B : test set
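The ranges above can be applied mechanically. The following is a minimal sketch, assuming each conversation side is identified by a five-digit conversation number followed by A or B (as in 00001A). The function name `partition_of` is hypothetical, and the exact line format of filelists/uttIds.txt is not described here, so extracting the side ID from an utterance ID is left out.

```python
import re

# Inclusive conversation-number ranges taken from the list above.
PARTITIONS = [
    ("train", 1, 9360),
    ("devel", 9361, 10530),
    ("test", 10531, 11699),
]

# Assumed ID shape: five digits plus a side letter, e.g. "00001A".
_SIDE_ID = re.compile(r"(\d{5})([AB])")


def partition_of(side_id):
    """Return 'train', 'devel' or 'test' for a conversation-side ID."""
    match = _SIDE_ID.fullmatch(side_id)
    if match is None:
        raise ValueError(f"unrecognized conversation-side ID: {side_id!r}")
    number = int(match.group(1))
    for name, first, last in PARTITIONS:
        if first <= number <= last:
            return name
    raise ValueError(f"conversation {number} is outside all partitions")
```

Keeping the ranges in one table makes it easy to check that the boundaries match the list above and that no conversation falls between partitions.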
Vocabulary
total non-empty utterances | 2223159 | |
---|---|---|
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) ) | 283935 | |
total word tokens in corpus (including uncertain words) | 21905137 | 100% |
total non-speech markers enclosed in [] (e.g. [LAUGH]) | 559629 | 2.555% |
total partial words (starting or ending in -) | 153098 | 0.6990% |
total partial words that could be repaired | 101550 | 0.4636% |

total unique words | 64924 | 100% |
---|---|---|
unique words occurring once in the corpus | 23192 | 35.72% |
unique words occurring once or twice in the corpus | 31272 | 48.17% |
corpus coverage if vocab does not include words occurring once in the corpus | 99.894% | |
corpus coverage if vocab does not include words occurring once or twice in the corpus | 99.857% | |
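The coverage rows give the fraction of word tokens that remain in the vocabulary after rare words are dropped. For words occurring once, this is 1 - 23192/21905137, or about 99.894%. A minimal sketch of the same calculation on arbitrary word counts follows; the function name `coverage` and the toy token list are illustrative assumptions, not the script used to produce the table.

```python
from collections import Counter


def coverage(counts, min_count):
    """Fraction of tokens covered by words occurring at least min_count times."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty corpus")
    kept = sum(c for c in counts.values() if c >= min_count)
    return kept / total


# Toy data for illustration only, not Fisher transcripts.
tokens = "yeah i know yeah right i mean yeah okay".split()
counts = Counter(tokens)

# Drop words seen only once, then measure how many tokens are still covered.
toy_coverage = coverage(counts, min_count=2)
```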
Language Model
Acoustic Model
There are two sets of PLP feature vectors created for the entire corpus.
PLPs for MLP classifiers
These PLPs are created in exactly the same way as the MLP training data described in J. Frankel et al., “Articulatory feature classifiers trained on 2000 hours of telephone speech,” ICASSP, 2007.
The HCopy config file for the MLP PLPs is:
    # PLP base coefficient config courtesy of AMI eval RT05s
    #the commented out lines and their replacements by Arthur 4/24/08
    #MAXTRYOPEN=10
    MAXTRYOPEN=3
    SOURCEKIND = WAVEFORM
    #SOURCEFORMAT = NIST
    SOURCEFORMAT = WAV
    TARGETFORMAT = HTK
    TARGETKIND = PLP_E
    TARGETRATE = 100000.0
    HPARM: SAVECOMPRESSED = T
    HPARM: SAVEWITHCRC = T
    HPARM: ZMEANSOURCE = T
    HPARM: WINDOWSIZE = 250000.0
    HPARM: USEHAMMING = T
    HPARM: PREEMCOEF = 0.97
    HPARM: NUMCHANS = 24
    HPARM: LPCORDER = 12
    HPARM: COMPRESSFACT = 0.3333333
    HPARM: NUMCEPS = 12
    HPARM: CEPLIFTER = 22
    HPARM: ESCALE = 1.0
    HPARM: ENORMALISE = T
    HPARM: SILFLOOR = 50.0
    HPARM: USEPOWER = T
    HPARM: CEPSCALE = 10
    # required to read fisher/switchboard wavefiles as HTK does not support shorten compression
    #HWAVEFILTER = 'w_decode -o pcm $ -'
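HTK expresses time parameters in units of 100 ns, so the TARGETRATE and WINDOWSIZE values in this config correspond to a 10 ms frame shift and a 25 ms analysis window. A small sketch of the conversion, with the helper name `htk_to_ms` chosen here for illustration:

```python
HTK_UNITS_PER_MS = 10000.0  # 1 ms = 10,000 units of 100 ns


def htk_to_ms(value):
    """Convert an HTK time value (100 ns units) to milliseconds."""
    return value / HTK_UNITS_PER_MS


frame_shift_ms = htk_to_ms(100000.0)  # TARGETRATE
window_ms = htk_to_ms(250000.0)       # WINDOWSIZE
frames_per_second = 1000.0 / frame_shift_ms
```

With these settings each second of audio yields 100 feature frames, which is the figure needed when mapping transcript times to frame indices.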
PLPs for Gaussian mixtures
These are the same PLPs as above, except band-limited to 125 Hz to 3800 Hz and extended with deltas and delta-deltas. The cepstral mean is normalized on the PLPs, and the cepstral variance is normalized on the PLPs, deltas and delta-deltas. Normalization is done per conversation side, as suggested in The 1998 HTK System For Transcription of Conversational Telephone Speech.
The PLPs over which Gaussians will be built are generated with this HCopy config:
    #modified from the "PLP base coefficient config courtesy of AMI eval RT05s"
    #the original param is commented and it's replacement is right next to it, if there was a change
    #-arthur 4/24/08
    #MAXTRYOPEN=10
    MAXTRYOPEN=3
    SOURCEKIND = WAVEFORM
    #SOURCEFORMAT = NIST
    SOURCEFORMAT = WAV
    TARGETFORMAT = HTK
    #TARGETKIND = PLP_E
    TARGETKIND = PLP_E_D_A
    TARGETRATE = 100000.0
    HPARM: SAVECOMPRESSED = T
    HPARM: SAVEWITHCRC = T
    HPARM: ZMEANSOURCE = T
    HPARM: WINDOWSIZE = 250000.0
    HPARM: USEHAMMING = T
    HPARM: PREEMCOEF = 0.97
    HPARM: NUMCHANS = 24
    HPARM: LPCORDER = 12
    HPARM: COMPRESSFACT = 0.3333333
    HPARM: NUMCEPS = 12
    HPARM: CEPLIFTER = 22
    HPARM: ESCALE = 1.0
    HPARM: ENORMALISE = T
    HPARM: SILFLOOR = 50.0
    HPARM: USEPOWER = T
    HPARM: CEPSCALE = 10
    #ADDED BY ARTHUR
    #cut-off frequencies for telephone speech
    #FIXME: may be rerun with 125hz-3800khz as in [hain1999Htk]
    HPARM: LOFREQ = 125
    HPARM: HIFREQ = 3800
    #cepstral mean and variance will be normalized after the utterances are extracted
At this point we extract only those observations corresponding to transcribed speech. There is still one file per conversation side.
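As a rough illustration of this extraction step, the sketch below keeps only the frames that fall inside transcribed segments. It assumes segments are given as (start, end) times in seconds and that frames are 10 ms apart, as set by TARGETRATE. The function names and the rounding convention are assumptions, not the procedure actually used to build the files.

```python
def segment_to_frames(start_s, end_s, frame_shift_s=0.01):
    """Map a time span in seconds to a half-open range of frame indices."""
    first = int(round(start_s / frame_shift_s))
    end = int(round(end_s / frame_shift_s))
    return first, max(first, end)


def select_frames(frames, segments, frame_shift_s=0.01):
    """Concatenate the frames that lie inside any of the given segments."""
    kept = []
    for start_s, end_s in segments:
        first, end = segment_to_frames(start_s, end_s, frame_shift_s)
        kept.extend(frames[first:end])  # slicing clips at the end of the file
    return kept


# Toy example: frame "vectors" are just their own indices.
toy_frames = list(range(1000))
toy_segments = [(0.26, 1.47), (2.0, 2.05)]
toy_kept = select_frames(toy_frames, toy_segments)
```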
Cepstral mean and variance statistics are collected per conversation side with HCompV -c, and the global variance is estimated on the training set.
Finally, each conversation side is zero-meaned, and its variance is rescaled to match the global variance of the training set.
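The normalization described here can be sketched in a few lines. This is a minimal pure-Python illustration, assuming a conversation side is a list of equal-length feature vectors and that the global training-set variance is already available as one value per dimension. The function names, the use of population variance, and the variance floor are assumptions. The actual statistics come from HCompV.

```python
import math


def mean_and_variance(frames):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dims)
    ]
    return means, variances


def normalize_side(frames, global_variance, floor=1e-10):
    """Zero-mean one conversation side and rescale it to the global variance."""
    means, variances = mean_and_variance(frames)
    # Scale each dimension so its variance becomes the global variance.
    scales = [
        math.sqrt(gv / max(v, floor)) for gv, v in zip(global_variance, variances)
    ]
    return [[(x - m) * s for x, m, s in zip(f, means, scales)] for f in frames]


# Toy example with two-dimensional features.
toy_side = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
toy_normalized = normalize_side(toy_side, global_variance=[4.0, 1.0])
```

After normalization every side has zero mean and the same per-dimension variance, so differences between channels and speakers in overall level and spread are removed before the Gaussians are trained.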
The end result is in plpcut.