From SpeechWiki

Revision as of 03:31, 20 May 2008 by Arthur


The Fisher corpus is still relatively new and rough. This page is meant to help people quickly build a basic speech recognizer with it.

Train/Devel/Test partition

I've split the entire Fisher corpus into Train/Devel/Test partitions of 80/10/10 percent.

The utterance id file is in filelists/uttIds.txt, and the splits are as follows:

Set	Conversation Sides	Lines in uttIds.txt
Training	00001A to 09360B
Devel	09361A to 10530B
Test	10531A to 11699B
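
A minimal sketch of how the partition table above might be applied in code, assuming a conversation side ID is a five-digit conversation number followed by A or B, as in the table. The function name is hypothetical and not part of any existing script; the format of the lines in uttIds.txt is not assumed here.

```python
def partition_for_side(side_id: str) -> str:
    """Map a conversation side ID such as '09361A' to its partition.

    Boundaries are taken from the table above; the ID format is an assumption.
    """
    number, channel = side_id[:5], side_id[5:]
    if len(side_id) != 6 or not number.isdigit() or channel not in ("A", "B"):
        raise ValueError(f"unexpected conversation side ID: {side_id!r}")
    conversation = int(number)
    if not 1 <= conversation <= 11699:
        raise ValueError(f"conversation number out of range: {side_id!r}")
    if conversation <= 9360:
        return "train"
    if conversation <= 10530:
        return "devel"
    return "test"


if __name__ == "__main__":
    for side in ("00001A", "09360B", "09361A", "10530B", "10531A", "11699B"):
        print(side, partition_for_side(side))
```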

Dictionaries

Some Corpus Statistics
total non-empty utterances	2223159
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) )	283935
total word tokens in corpus (including uncertain words)	21905137	100%
total non-speech markers enclosed in [] (e.g. [LAUGH])	559629	2.555%
total partial words (starting or ending in -)	153098	0.6990%
total partial words that could be repaired	101550	0.4636%

Some vocab statistics
total unique words	64924	100%
unique words occurring once in the corpus	23192	35.72%
unique words occurring once or twice in the corpus	31272	48.17%
corpus coverage if vocab does not include words occurring once in the corpus	99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus	99.820%
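
The coverage figures follow from the counts above. The sketch below assumes that every word occurring once removes one token from the covered corpus, every word occurring twice removes two, and the total token count (including uncertain words) is the denominator. The function name is illustrative only.

```python
TOTAL_TOKENS = 21905137        # total word tokens in corpus
TYPES_ONCE = 23192             # unique words occurring once
TYPES_ONCE_OR_TWICE = 31272    # unique words occurring once or twice


def coverage_percent(total_tokens, types_once, types_once_or_twice, drop_twice):
    """Percent of tokens still covered after dropping rare words from the vocab."""
    dropped_tokens = types_once  # each singleton accounts for exactly one token
    if drop_twice:
        types_twice = types_once_or_twice - types_once
        dropped_tokens += 2 * types_twice  # each of these words has two tokens
    return 100.0 * (1.0 - dropped_tokens / total_tokens)


if __name__ == "__main__":
    for drop_twice in (False, True):
        pct = coverage_percent(TOTAL_TOKENS, TYPES_ONCE, TYPES_ONCE_OR_TWICE, drop_twice)
        label = "once or twice" if drop_twice else "once"
        print(f"dropping words seen {label}: {pct:.3f}% coverage")
```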

In Fisher, partial words (those starting or ending with a '-') often have the complete word nearby (within 6 words on the same conversation side). I've replaced the '-' with the missing part of the word, taken from the nearby word that shares the same non-'-' part. Statistics for this new vocabulary are below:

Statistics of the Corpus with Repaired Partial Words
total unique words	79742	100%
unique words occurring once in the corpus	32703	41.01%
unique words occurring once or twice in the corpus	42967	53.88%
words not in the cmudict 0.6 dictionary	18588	23.3%
Corpus coverage by words not in the cmudict 0.6 dictionary	??
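
The partial-word repair described above could look roughly like the sketch below. It is not the script actually used. It assumes that one conversation side is available as a list of tokens, that the window extends 6 words in both directions, that the nearest matching complete word wins, and that the restored part is written in [] brackets. The bracket placement and the function name are guesses made for illustration.

```python
def repair_partial_words(tokens, window=6):
    """Complete partial words ('TH-' or '-ING') from a nearby full word.

    Unmatched partial words are left unchanged.
    """
    repaired = []
    for i, tok in enumerate(tokens):
        is_prefix = len(tok) > 1 and tok.endswith("-") and not tok.startswith("-")
        is_suffix = len(tok) > 1 and tok.startswith("-") and not tok.endswith("-")
        if not (is_prefix or is_suffix):
            repaired.append(tok)
            continue
        stem = tok.strip("-")
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Visit neighbours from nearest to farthest.
        neighbours = sorted((j for j in range(lo, hi) if j != i),
                            key=lambda j: abs(j - i))
        replacement = tok
        for j in neighbours:
            other = tokens[j]
            # Skip other partial words and words too short to complete the stem.
            if other.startswith("-") or other.endswith("-") or len(other) <= len(stem):
                continue
            if is_prefix and other.startswith(stem):
                replacement = stem + "[" + other[len(stem):] + "]"
                break
            if is_suffix and other.endswith(stem):
                replacement = "[" + other[:-len(stem)] + "]" + stem
                break
        repaired.append(replacement)
    return repaired


if __name__ == "__main__":
    print(repair_partial_words("I WAS TH- THINKING ABOUT IT".split()))
```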

Phonetic Dictionary

Each word's pronunciation was derived using the Phonetic Transcription Tool, in this order of preference:

If a word is in the dictionary, use the dictionary definition.
If a word contains numbers, spell out the single digits.
If a word contains underscores, treat it as an acronym and spell out the single letters.
If a word is partial (has [] brackets) but the whole word is in the dictionary, do forced alignment.
Otherwise, do Viterbi decoding.
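
The order of preference above can be read as a simple cascade. The sketch below only picks which rule applies; it does not call the Phonetic Transcription Tool. The function name, the strategy labels, and the way the whole word is recovered from a partial word (by deleting the brackets) are assumptions made for illustration.

```python
def pronunciation_strategy(word, dictionary):
    """Return the name of the first rule in the preference list that applies.

    `dictionary` is any container of known words (e.g. a set of cmudict entries).
    """
    if word in dictionary:
        return "dictionary"
    if any(ch.isdigit() for ch in word):
        return "spell_digits"
    if "_" in word:
        return "spell_letters"
    if "[" in word and "]" in word:
        # Assumed convention: removing the brackets yields the whole word.
        whole_word = word.replace("[", "").replace("]", "")
        if whole_word in dictionary:
            return "forced_alignment"
    return "viterbi_decoding"


if __name__ == "__main__":
    known = {"THINKING", "HELLO"}
    for w in ("HELLO", "MP3", "F_B_I", "TH[INKING]", "ZZYZX"):
        print(w, "->", pronunciation_strategy(w, known))
```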

The Phonetic Transcription Tool still could not handle some of the words. I transcribed these by hand, and they are listed here.

Mixed Unit Dictionary

Language Model

Acoustic Model

Two sets of PLP feature vectors were created for the entire corpus.

PLPs for MLP classifiers

These PLPs were created in exactly the same way as the training data for the MLPs described in J. Frankel et al., “Articulatory feature classifiers trained on 2000 hours of telephone speech,” ICASSP, 2007.

The HCopy config file for the MLP PLPs is:

# PLP base coefficient config courtesy of AMI eval RT05s

#the commented out lines and their replacements by Arthur 4/24/08
#MAXTRYOPEN=10
MAXTRYOPEN=3
SOURCEKIND = WAVEFORM
#SOURCEFORMAT = NIST
SOURCEFORMAT = WAV
TARGETFORMAT = HTK
TARGETKIND = PLP_E
TARGETRATE = 100000.0
HPARM: SAVECOMPRESSED = T
HPARM: SAVEWITHCRC = T
HPARM: ZMEANSOURCE = T
HPARM: WINDOWSIZE = 250000.0
HPARM: USEHAMMING = T
HPARM: PREEMCOEF = 0.97
HPARM: NUMCHANS = 24
HPARM: LPCORDER = 12
HPARM: COMPRESSFACT = 0.3333333
HPARM: NUMCEPS = 12
HPARM: CEPLIFTER = 22
HPARM: ESCALE = 1.0
HPARM: ENORMALISE = T
HPARM: SILFLOOR = 50.0
HPARM: USEPOWER = T
HPARM: CEPSCALE = 10

# required to read fisher/switchboard wavefiles as HTK does not support shorten compression
#HWAVEFILTER    = 'w_decode -o pcm $ -'
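
For readers unfamiliar with HTK conventions, the sketch below works out what the timing and TARGETKIND settings mean. HTK expresses times in units of 100 ns, so TARGETRATE = 100000.0 is a 10 ms frame shift and WINDOWSIZE = 250000.0 is a 25 ms window. The helper names are hypothetical, and feature_dim deliberately handles only the _E, _D and _A qualifiers.

```python
def htk_time_to_ms(htk_units: float) -> float:
    """Convert an HTK time value (units of 100 ns) to milliseconds."""
    return htk_units / 10_000


def feature_dim(target_kind: str, num_ceps: int) -> int:
    """Vector size for a parameter kind such as 'PLP_E' or 'PLP_E_D_A'."""
    qualifiers = set(target_kind.split("_")[1:])
    unsupported = qualifiers - {"E", "D", "A"}
    if unsupported:
        raise ValueError(f"qualifiers not handled by this sketch: {sorted(unsupported)}")
    static = num_ceps + (1 if "E" in qualifiers else 0)  # _E appends log energy
    # _D and _A each append one more copy of the static vector (deltas, accelerations).
    streams = 1 + (1 if "D" in qualifiers else 0) + (1 if "A" in qualifiers else 0)
    return static * streams


if __name__ == "__main__":
    print("frame shift (ms):", htk_time_to_ms(100000.0))
    print("window size (ms):", htk_time_to_ms(250000.0))
    print("PLP_E dimension:", feature_dim("PLP_E", 12))
    print("PLP_E_D_A dimension:", feature_dim("PLP_E_D_A", 12))
```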

PLPs for gaussian mixtures

PLPs, same as above, except band-limited to 125 Hz-3800 Hz, along with deltas and delta-deltas. Cepstral mean normalization is applied to the PLPs, and cepstral variance normalization is applied to the PLPs, deltas and delta-deltas. Normalization is done per conversation side. This is suggested in The 1998 HTK System For Transcription of Conversational Telephone Speech.

The PLPs over which Gaussians will be built are generated with this HCopy config:

#modified from the "PLP base coefficient config courtesy of AMI eval RT05s"
#the original param is commented out and its replacement is right next to it, if there was a change
#-arthur 4/24/08

#MAXTRYOPEN=10
MAXTRYOPEN=3
SOURCEKIND = WAVEFORM
#SOURCEFORMAT = NIST
SOURCEFORMAT = WAV

TARGETFORMAT = HTK
#TARGETKIND = PLP_E
TARGETKIND = PLP_E_D_A
TARGETRATE = 100000.0
HPARM: SAVECOMPRESSED = T
HPARM: SAVEWITHCRC = T
HPARM: ZMEANSOURCE = T
HPARM: WINDOWSIZE = 250000.0
HPARM: USEHAMMING = T
HPARM: PREEMCOEF = 0.97
HPARM: NUMCHANS = 24
HPARM: LPCORDER = 12
HPARM: COMPRESSFACT = 0.3333333
HPARM: NUMCEPS = 12
HPARM: CEPLIFTER = 22
HPARM: ESCALE = 1.0
HPARM: ENORMALISE = T
HPARM: SILFLOOR = 50.0
HPARM: USEPOWER = T
HPARM: CEPSCALE = 10

#ADDED BY ARTHUR
#cut-off frequencies for telephone speech
#FIXME: may be rerun with 125Hz-3800Hz as in [hain1999Htk]
HPARM: LOFREQ = 125
HPARM: HIFREQ = 3800

#cepstral mean and variance will be normalized after the utterances are extracted

At this point we extract only those observations corresponding to transcribed speech. There is still one file per conversation side. Cepstral mean and variance statistics are collected per conversation side with HCompV -c, and the global variance is estimated on the training set. Finally, each conversation side is zero-meaned, and its variance is renormalized to match the global variance on the training set. The end result is in the plpcut dir.
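
The normalization step can be illustrated with a small pure-Python sketch. It is not what HCompV does internally. It assumes each conversation side is a list of equal-length feature vectors, and that the global variance is the frame-weighted average of the per-side variances over the training set, which is one plausible reading of the description above. All function names are hypothetical.

```python
import math


def side_stats(frames):
    """Per-dimension mean and (population) variance for one conversation side."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def global_variance(training_sides):
    """Frame-weighted average of per-side variances (an assumed definition)."""
    dim = len(training_sides[0][0])
    total_frames = sum(len(frames) for frames in training_sides)
    pooled = [0.0] * dim
    for frames in training_sides:
        _, var = side_stats(frames)
        for d in range(dim):
            pooled[d] += var[d] * len(frames)
    return [p / total_frames for p in pooled]


def normalize_side(frames, target_var, eps=1e-12):
    """Zero-mean one side and rescale it so its variance matches target_var."""
    mean, var = side_stats(frames)
    dim = len(mean)
    scale = [math.sqrt(target_var[d] / max(var[d], eps)) for d in range(dim)]
    return [[(f[d] - mean[d]) * scale[d] for d in range(dim)] for f in frames]


if __name__ == "__main__":
    train = [[[1.0], [3.0]], [[10.0], [14.0]]]  # two toy one-dimensional sides
    target = global_variance(train)
    for side in train:
        print(normalize_side(side, target))
```

After normalization, every side has zero mean and the same per-dimension variance as the training-set target, so speaker and channel offsets no longer shift the features.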

Fisher Corpus
