Fisher Corpus

From SpeechWiki


=Dictionaries=
[[Fisher Dictionaries]]

{| class="wikitable" style="text-align:center"
|+Corpus Statistics
|-
! total non-empty utterances
| 2223159
|-
! total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) )
| 283935
|-
! total word tokens in corpus (including uncertain words)
| 21905137 || 100%
|-
! total non-speech markers enclosed in [] (e.g. [LAUGH])
| 559629 || 2.555%
|-
! total partial words (starting or ending in -)
| 153098 || 0.6990%
|-
! total partial words that could be repaired
| 101550 || 0.4636%
|}

{| class="wikitable" style="text-align:center"
|+Vocab statistics on the raw corpus
! total unique words
| 64924 || 100%
|-
! unique words occurring once in the corpus
| 23192 || 35.72%
|-
! unique words occurring once or twice in the corpus
| 31272 || 48.17%
|-
! corpus coverage if vocab does not include words occurring once in the corpus
| 99.894%
|-
! corpus coverage if vocab does not include words occurring once or twice in the corpus
| 99.857%
|}
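The coverage figures above can be reproduced from raw frequency counts. Below is a small illustrative Python sketch; the token list is made up, not the real corpus:

```python
from collections import Counter

# Toy example: compute what fraction of corpus tokens is still covered
# when words occurring at most max_count times are dropped from the vocab.
tokens = ["yeah", "yeah", "i", "i", "i", "know", "know", "telephony", "uh"]
counts = Counter(tokens)
total = sum(counts.values())

def coverage_excluding(max_count):
    # Count tokens of words whose corpus frequency exceeds max_count
    kept = sum(c for c in counts.values() if c > max_count)
    return kept / total

print(coverage_excluding(1))  # coverage after dropping singletons
print(coverage_excluding(2))  # coverage after dropping words seen once or twice
```

On the real corpus the same computation yields the 99.894% and 99.857% figures in the table.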

In Fisher, partial words (those starting or ending with '-') often have the complete word nearby: within 6 words on the same side of the conversation. I've replaced the '-' with the missing part of the word, completed from the nearby word that shares the same non-'-' part, and enclosed the restored part in [] brackets. Statistics for this new vocabulary are below:

{| class="wikitable"
|+ Vocab statistics on the corpus with repaired partial words
|-
! total unique words or word segments
| 79742
| 100%
|-
! total unique words
| 57036
|
|-
! unique words or fragments occurring once in the corpus
| 32703
| 41.01%
|-
! unique words or fragments occurring once or twice in the corpus
| 42967
| 53.88%
|-
! whole words not in the [http://www.speech.cs.cmu.edu/cgi-bin/cmudict cmudict 0.6 dictionary]
| 16652
|
|-
! Corpus coverage (token count) by whole words not in the cmudict 0.6 dictionary
| 0.47% (103244)
|
|-
! Corpus coverage (token count) by words, word fragments and non-speech sounds not in the cmudict 0.6 dictionary
| 3.76% (822146)
|
|}
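The repair step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual script: it handles only the trailing-'-' case, while the 6-word window matches the text.

```python
# Illustrative sketch of the partial-word repair: for a token ending in '-',
# look up to 6 words ahead/behind on the same conversation side for a
# complete word starting with the same prefix, then mark the recovered
# remainder in [] brackets (e.g. RA- -> RA[THER]).
def repair_partials(tokens, window=6):
    repaired = []
    for i, tok in enumerate(tokens):
        if tok.endswith("-"):
            prefix = tok[:-1]
            neighborhood = tokens[max(0, i - window): i + window + 1]
            match = next((w for w in neighborhood
                          if w != tok and not w.endswith("-")
                          and w.startswith(prefix)), None)
            if match is not None:
                repaired.append(prefix + "[" + match[len(prefix):] + "]")
                continue
        repaired.append(tok)
    return repaired

print(repair_partials(["i", "ra-", "i", "mean", "rather", "not"]))
# 'ra-' becomes 'ra[ther]' using the nearby complete word
```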


Note that there are two uses for the [] brackets:
# A complete word in [] brackets denotes a non-speech event, e.g. [LAUGH] or [SIGH]
# A word with only part of it enclosed in [] brackets denotes a partial word, with the part in [] brackets missing (e.g. RA[THER] => R AE).
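The two conventions can be told apart mechanically. An illustrative Python check follows; the category names are mine, not from the corpus documentation:

```python
import re

# Distinguish the two [] conventions described above.
def classify(token):
    if re.fullmatch(r"\[[A-Z]+\]", token):
        return "non-speech"      # e.g. [LAUGH], [SIGH]
    if re.search(r"\[[A-Z]+\]", token):
        return "partial-word"    # e.g. RA[THER]; the part in [] was not spoken
    return "word"

print(classify("[LAUGH]"))   # non-speech
print(classify("RA[THER]"))  # partial-word
print(classify("RATHER"))    # word
```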

==Phonetic Dictionary==

===The phone set===
The set of phones, along with the number of states per phone, is [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/phone.state phone.state].
It is essentially the CMUDICT phone set. It is used because the dictionary built on these phones is available, and the pronounce tool is trained on that dictionary. Additionally, it contains some non-speech sounds which are transcribed in the Fisher corpus.
The number of states is taken from the JHU06 phone set, with the following differences:
# Of multiple variants (e.g. oy1 and oy2), the one with more states is kept
# The plosives are merged with their closures (p 1 and pcl 2 become p 3)
# The following phones (and their number of states) are missing: -ax 3 -axr 3 -dx 3 -en 3

===The phonetic transcriptions===

A word pronunciation was derived using the [[Phonetic Transcription Tool]], in this order of preference:

# If a word is in the dictionary, use the dictionary definition.
# If a word contains numbers, spell out the individual letters and digits.
# If a word contains underscores, treat it as a compound word (or an acronym) and concatenate the dictionary pronunciations of the parts between the underscores (e.g. I_B_MAT => AY B IY M AE T)
# If it is a partial word (has [] brackets) but the whole word is in the dictionary, do forced alignment.
# Otherwise, do Viterbi decoding.
# The Phonetic Transcription Tool still could not handle some words. These I transcribed by hand; they are listed in [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/manualDict.txt manualDict.txt].
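The preference order amounts to a simple dispatch. A schematic Python sketch follows; the forced-alignment and Viterbi helpers are placeholders standing in for the real Phonetic Transcription Tool, and the toy dictionary entries are illustrative:

```python
def forced_alignment(word, dictionary):
    raise NotImplementedError  # placeholder for the real tool

def viterbi_decode(word):
    raise NotImplementedError  # placeholder for the real tool

def pronounce(word, dictionary):
    if word in dictionary:
        return dictionary[word]                         # 1. dictionary lookup
    if any(ch.isdigit() for ch in word):
        return " ".join(dictionary[ch] for ch in word)  # 2. spell out letters/digits
    if "_" in word:
        # 3. compound word or acronym: concatenate part pronunciations
        return " ".join(dictionary[p] for p in word.split("_"))
    if "[" in word:
        return forced_alignment(word, dictionary)       # 4. partial word
    return viterbi_decode(word)                         # 5. fallback

# Toy check of the compound rule (pronunciations are illustrative):
d = {"I": "AY", "B": "B IY", "MAT": "M AE T"}
print(pronounce("I_B_MAT", d))  # AY B IY M AE T
```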

The final dictionary, containing every word in the repaired Fisher corpus, is in [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/dict/fisherPhonemicDict.txt fisherPhonemicDict.txt].

Some words found in the Fisher corpus have multiple pronunciations in the CMUDICT. These alternative pronunciations have been added to the single-pronunciation [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/dict/fisherPhonemicDict.txt fisherPhonemicDict.txt] to obtain the multiple-pronunciation [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/dict/fisherPhoneticMpronDict.txt fisherPhoneticMpronDict.txt].

Both dictionaries have the start- and end-of-utterance markers <nowiki><S></nowiki> and <nowiki></S></nowiki>, pronounced as SIL.

====<span style="color: Red">Known bugs in the dictionary:</span>====
# Acronyms containing A are transcribed with AH (A_S_A => AH EH S AH). This affects about 4 thousand tokens.
# Some known acronyms are not spelled out with underscores, and the model has a hard time with them (MSNBC => M IH Z N IH K).

You can see the whole words missing from CMUDict in [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/dict/wholeWordsMissingFromCMUDictDefs.txt wholeWordsMissingFromCMUDictDefs.txt].

These bugs will not be corrected for my current set of experiments, to keep things consistent.

===Word perplexity given a permitted sequence of phones===
The dictionary contains different words which are pronounced as an identical sequence of phonemes. Naturally the speech recognizer will have trouble with such words, relying solely on the language model to choose among them.
For the Fisher corpus, the perplexity of word <math>W</math> given phone sequence <math>S</math> is calculated as follows.

Let <math>H(W|S=s)</math> be the entropy of <math>W</math> given a particular phone sequence <math>s</math>. This can be obtained from the corpus statistics. Then <math>H(W|S)=E[H(W|S=s)]_{p(S)}</math>, where <math>p(S)</math> is the count of tokens pronounced as <math>S</math> divided by the total number of tokens in the corpus.
I've calculated <math>H(W|S)</math> using the partial-words-repaired corpus and the multi-pronunciation dictionary. Since we only have word transcriptions and not phonetic transcriptions, we need to guess which pronunciation was used for each token of each multi-pronunciation word. If a word has M pronunciations and occurs in the corpus N times, we assume each pronunciation was used <math> \left \lfloor\frac{N}{M}\right \rfloor</math> times. This is clearly wrong, as pronunciation frequencies tend to follow Zipf's law (or perhaps an exponential, but certainly not a uniform distribution), but we have no basis for a better guess.
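The computation can be illustrated with a toy Python sketch; the words, pronunciations and counts below are made up, not taken from the real dictionary:

```python
from collections import Counter
from math import log2

# Toy sketch of H(W|S): group word tokens by their phone string, compute the
# entropy of the word distribution within each group, then average the
# per-group entropies weighted by p(S).
tokens = ["two", "too", "two", "to", "read", "red", "read"]
pron = {"two": "T UW", "too": "T UW", "to": "T UW",
        "read": "R EH D", "red": "R EH D"}

groups = Counter()   # tokens per phone string (numerator of p(S))
words_in = {}        # word counts within each phone string
for w in tokens:
    s = pron[w]
    groups[s] += 1
    words_in.setdefault(s, Counter())[w] += 1

total = sum(groups.values())
H = 0.0
for s, n in groups.items():
    probs = [c / n for c in words_in[s].values()]
    H_s = -sum(p * log2(p) for p in probs)   # H(W|S=s)
    H += (n / total) * H_s                   # expectation under p(S)

print(round(H, 4))       # conditional entropy in bits
print(round(2 ** H, 4))  # corresponding perplexity
```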

The code for the following statistics is in [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/wordPerplexity/calcWordPerlexity.m calcWordPerlexity.m].

{| class="wikitable" style="text-align:center"
|+Word perplexity and related statistics
! Total Tokens
| 26351455 || 100%
|-
! Tokens lost to flooring
| 2949 || 0.01119%
|-
! Tokens affected by the multiple pronunciation redistribution
| 7866152 || 29.85%
|-
! <math>H(W|S)</math> ignoring the <nowiki><S></nowiki> and <nowiki></S></nowiki> markers
| 0.2275 bits
|-
! The corresponding perplexity for the above <math>H(W|S)</math>
| 1.1708 words per unique phoneme string
|}

This means that even if the phoneme string and its boundaries were recognized perfectly, the language model would still have to choose among 1.17 candidate words on average.

==Mixed Unit Dictionary==

=Language Model=
There is a lot to say about the [[Fisher Language Model]]s, so they get their own page.
=Front End=
[[Fisher Front End]]

There are two sets of PLP feature vectors created for the entire corpus.

==PLPs for MLP classifiers==
PLPs are created in exactly the same way as the training data for the MLPs described in
<ref name="frankel2007articulatory">[http://www.cstr.ed.ac.uk/downloads/publications/2007/Cetin_icassp07_tandem.pdf J. Frankel et al., “Articulatory feature classifiers trained on 2000 hours of telephone speech,” ICASSP, 2007]</ref>.
The hcopy config file to generate PLP features for MLP input is [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/hcopy.wav2plpForMlp.config here].
This way, we can use the MLPs presented in the above paper for segmenting the speech for [[timeshrinking]] experiments.

==Mean and variance normalized, ARMAed PLPs for Gaussian mixtures==
The second set of features is used to construct the Gaussian mixture models. The features are PLPs, deltas and accelerations generated with [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/hcopy.wav2plp.config this] hcopy config. The following aspects are slightly non-standard:
* The mel-frequency filter bank is constructed only over the band of 125 Hz-3800 Hz, not over the entire telephone speech range of 0-4000 Hz. Some slight benefit to this was found in <ref name="MVA">[http://ssli.ee.washington.edu/people/chiaping/mva.html MVA: a noise-robust feature processing scheme]</ref>, although in <ref name="Hain1998Htk"/> band-limiting has an ambiguous effect on accuracy.
* The 0th cepstral coefficient is used instead of the log-energy, again due to experiments in <ref name="MVA"/>.

At this point, only the frames which correspond to transcribed audio are extracted, and the following steps are performed only on those frames. The features are still stored in one file per conversation side.

===Normalization===
* The cepstral coefficients, the deltas and the accelerations are each normalized to zero mean and unit variance, as in <ref name="MVA"/>. This differs from the HTK book, which normalizes only the coefficients and takes the deltas and accelerations afterwards (so deltas and accelerations are not re-normalized). Normalization is done per conversation side, as recommended in <ref name="Hain1998Htk">[http://citeseer.ist.psu.edu/267590.html Hain 1998, The 1998 HTK System For Transcription of Conversational Telephone Speech]</ref>.

* Finally, an order-2 ARMA filter is applied. The whole thing is made easy by the [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/MVA.cc MVA program] written by Chia-ping Chen.
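A per-dimension sketch of this normalization pipeline is below. This is illustrative pure Python, not the MVA program itself; the ARMA form shown is the symmetric one from the MVA scheme, and the toy data is made up.

```python
from statistics import mean, pstdev

# Zero-mean/unit-variance normalization of one feature track
# (applied alike to statics, deltas and accelerations).
def mvn(x):
    m, s = mean(x), pstdev(x) or 1.0
    return [(v - m) / s for v in x]

# Symmetric order-M ARMA smoothing, MVA-style:
#   y[t] = (y[t-M..t-1] + x[t..t+M]) / (2M + 1); edge frames copied through.
def arma(x, order=2):
    y = list(x)
    for t in range(order, len(x) - order):
        past = sum(y[t - order:t])         # already-filtered outputs
        present = sum(x[t:t + order + 1])  # raw inputs
        y[t] = (past + present) / (2 * order + 1)
    return y

c0 = [5.0, 6.0, 4.5, 5.5, 7.0, 6.5, 5.0, 4.0]  # fake cepstral track
smoothed = arma(mvn(c0))
print(len(smoothed))  # same number of frames in as out
```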

<references/>

Revision as of 17:40, 2 October 2008

The Fisher corpus is still relatively new and rough, and this page is meant to help people quickly build a basic speech recognizer with it.

=Train/Devel/Test partition=

I've split the entire Fisher corpus 80/10/10 percent into Train/Devel/Test partitions.

The utterance id file is in filelists/uttIds.txt, and the splits are as follows:

{| class="wikitable" style="text-align:center"
! Set !! Conversation Sides !! Lines in uttIds.txt
|-
! Training
| 00001A to 09360B || 1 to 1775831
|-
! Devel
| 09361A to 10530B || 1775832 to 1991965
|-
! Test
| 10531A to 11699B || 1991965 to 2223159
|}
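The split of a given conversation side can be looked up from the boundaries in the table above. A tiny illustrative helper follows; it assumes the ids are a five-digit conversation number plus an A/B side letter:

```python
# Map a conversation-side id (e.g. "09361A") to its partition,
# using the conversation-number boundaries from the table above.
def partition(side_id):
    conv = int(side_id[:5])
    if conv <= 9360:
        return "train"
    if conv <= 10530:
        return "devel"
    return "test"

print(partition("00001A"))  # train
print(partition("09361A"))  # devel
print(partition("10531B"))  # test
```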
