From SpeechWiki

Transcription Statistics

Corpus Statistics
total non-empty utterances	2223159
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) )	283935
total word tokens in corpus (including uncertain words)	21905137	100%
total non-speech markers enclosed in [] (e.g. [LAUGH])	559629	2.555%
total partial words (starting or ending in -)	153098	0.6990%
total partial words that could be repaired	101550	0.4636%

Vocab statistics on the raw corpus
total unique words	64924	100%
unique words occurring once in the corpus	23192	35.72%
unique words occurring once or twice in the corpus	31272	48.17%
corpus coverage if vocab does not include words occurring once in the corpus	99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus	99.857%
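
The statistics above can be sketched as follows. This is a minimal illustration, not the script that produced the table; the function name `vocab_stats` and the key names are hypothetical, and it assumes coverage is measured as the fraction of word tokens that remain in the vocabulary.

```python
from collections import Counter

def vocab_stats(tokens):
    """Count word types and the token coverage left after dropping rare words."""
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    once = [w for w, c in counts.items() if c == 1]
    once_or_twice = [w for w, c in counts.items() if c <= 2]
    # Assumption: coverage is counted in tokens, so every occurrence
    # of a dropped word counts as lost.
    lost_once = sum(counts[w] for w in once)
    lost_once_or_twice = sum(counts[w] for w in once_or_twice)
    return {
        "unique_words": len(counts),
        "unique_once": len(once),
        "unique_once_or_twice": len(once_or_twice),
        "coverage_without_once": 1 - lost_once / total_tokens,
        "coverage_without_once_or_twice": 1 - lost_once_or_twice / total_tokens,
    }
```

For example, `vocab_stats("A A A B B C".split())` reports 3 unique words, 1 of them occurring once and 2 occurring once or twice.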

In Fisher, partial words (those starting or ending with a '-') often have the complete word in the vicinity (within 6 words on the same conversation side). I replaced the '-' with the missing part of the word, taken from the nearby word that shares the same non-'-' part, and enclosed that missing part in [] brackets. Statistics for this new vocabulary are below:

Vocab statistics on the corpus with repaired partial words
total unique words or word segments	79742	100%
total unique words	57036
unique words or fragments occurring once in the corpus	32703	41.01%
unique words or fragments occurring once or twice in the corpus	42967	53.88%
whole words not in the cmudict 0.6 dictionary	16652
Corpus coverage (token count) by whole words not in the cmudict 0.6 dictionary	0.47% (103244)
Corpus coverage (token count) by words, word fragments and non-speech sounds not in the cmudict 0.6 dictionary	3.76% (822146)

Note that there are two uses for the [] brackets:

A complete word in [] brackets denotes a non-speech event, e.g. [LAUGH] or [SIGH].
A word with only part of it enclosed in [] brackets denotes a partial word; the part in [] brackets is the missing portion (e.g. RA[THER] => R AE).
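
The repair of partial words might look roughly like the sketch below. It is not the original repair script: the function name `repair_partial`, the nearest-neighbour search order, and the rule that candidates containing '-' or '[' are skipped are all assumptions made for illustration. Only the 6-word window and the bracket notation come from the description above.

```python
def repair_partial(tokens, window=6):
    """Complete partial words from a nearby whole word on the same side.

    tokens: the words of one conversation side, in order.
    "RA-" next to "RATHER" becomes "RA[THER]"; "-THER" becomes "[RA]THER".
    """
    repaired = list(tokens)
    for i, tok in enumerate(tokens):
        # A partial word has a '-' at exactly one end.
        if tok.startswith("-") == tok.endswith("-"):
            continue
        cut_end = tok.endswith("-")  # True when the end of the word is missing
        stem = tok[:-1] if cut_end else tok[1:]
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Assumption: try the closest neighbours first.
        neighbours = sorted((j for j in range(lo, hi) if j != i),
                            key=lambda j: abs(j - i))
        for j in neighbours:
            cand = tokens[j]
            # Assumption: only plain whole words longer than the stem qualify.
            if "-" in cand or "[" in cand or len(cand) <= len(stem):
                continue
            if cut_end and cand.startswith(stem):
                repaired[i] = stem + "[" + cand[len(stem):] + "]"
                break
            if not cut_end and cand.endswith(stem):
                repaired[i] = "[" + cand[:-len(stem)] + "]" + stem
                break
    return repaired
```

Partial words with no matching neighbour inside the window are left unchanged, which is consistent with only some of the partial words in the corpus being repairable.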

Phonetic Dictionary

The phone set

The set of phones, along with the number of states per phone, is in phone.state. It is essentially the CMUDICT phoneset. It is used because a dictionary using these phones is available, and the pronounce tool is trained on this dictionary. Additionally, it contains some non-speech sounds which are transcribed in the Fisher corpus. The number of states is taken from the JHU06 phoneset, with the following differences:

Of multiple variants (e.g. oy1 and oy2), the one with more states is kept.
The plosives are merged with their closures (p 1 and pcl 2 become p 3).
The following phones (and their number of states) are missing: -ax 3 -axr 3 -dx 3 -en 3
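
The three differences above can be sketched as a small transformation of a phone-to-states table. This is only an illustration under several assumptions: the function name `merge_phone_states` is hypothetical, variants are assumed to be named with a trailing digit and to collapse onto the name without it, closures are assumed to be named by appending "cl" to the plosive, and the state counts in the usage example are made up except for p 1 and pcl 2.

```python
import re

def merge_phone_states(jhu_states, missing=("ax", "axr", "dx", "en")):
    """Derive a phone -> number-of-states table from a JHU06-style table."""
    merged = {}
    for phone, n_states in jhu_states.items():
        # Assumption: variants such as oy1/oy2 share the base name "oy".
        base = re.sub(r"\d+$", "", phone)
        merged[base] = max(merged.get(base, 0), n_states)
    # Fold each closure into its plosive, adding the state counts.
    closures = [p for p in merged if p.endswith("cl") and p[:-2] in merged]
    for closure in closures:
        merged[closure[:-2]] += merged.pop(closure)
    # Drop the phones that are absent from the final phone set.
    for phone in missing:
        merged.pop(phone, None)
    return merged
```

With the toy input `{"oy1": 2, "oy2": 3, "p": 1, "pcl": 2, "ax": 3}` this returns `{"oy": 3, "p": 3}`.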

The phonetic transcriptions

Each word's pronunciation was derived using the Phonetic Transcription Tool, applying the following rules in order of preference:

If a word is in the dictionary, use the dictionary definition.
If a word contains digits, spell out the single letters and digits.
If a word contains underscores, treat it as a compound word (or an acronym) and concatenate the dictionary pronunciations of the parts between the underscores (e.g. I_B_MAT => AY B IY M AE T).
If a word is a partial word (has [] brackets) and the whole word is in the dictionary, do forced alignment.
Otherwise, do Viterbi decoding.
The Phonetic Transcription Tool still could not handle some of the words. These I transcribed by hand, and they are listed in manualDict.txt.
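
The order of preference above can be sketched as a simple dispatcher. This is not the Phonetic Transcription Tool itself: the function name `derive_pronunciation`, the rule labels, and the toy dictionary are hypothetical, and the steps that need an external tool (spelling out, forced alignment, Viterbi decoding) only return a label here. Falling through to Viterbi decoding when a compound has a part missing from the dictionary is also an assumption.

```python
def derive_pronunciation(word, cmudict):
    """Return (rule, phones); phones is None where an external tool is needed.

    cmudict: mapping from word to a space-separated phone string.
    """
    if word in cmudict:
        return "dictionary", cmudict[word]
    if any(ch.isdigit() for ch in word):
        return "spell-out", None
    if "_" in word:
        parts = word.split("_")
        if all(part in cmudict for part in parts):
            return "compound", " ".join(cmudict[part] for part in parts)
    # A partial word has only part of it in brackets, unlike [LAUGH].
    fully_bracketed = word.startswith("[") and word.endswith("]")
    if "[" in word and not fully_bracketed:
        whole = word.replace("[", "").replace("]", "")
        if whole in cmudict:
            return "forced-alignment", None
    return "viterbi", None


# Toy dictionary for illustration only.
toy_dict = {"I": "AY", "B": "B IY", "MAT": "M AE T", "RATHER": "R AE DH ER"}
```

With this toy dictionary, `derive_pronunciation("I_B_MAT", toy_dict)` returns `("compound", "AY B IY M AE T")`, matching the example in the list above.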

The final dictionary containing every word in the repaired Fisher corpus is in fisherPhoneticDict.txt.

Some words found in the Fisher corpus have multiple pronunciations in the CMUDICT. These alternative pronunciations have been added into the single-pronunciation fisherPhoneticDict.txt, to obtain the multiple pronunciation fisherPhoneticMpronDict.txt.

Running forced alignment with fisherPhoneticMpronDict.txt, I collected statistics on which pronunciation variant was used more frequently, and generated another single pronunciation dictionary fisherPhoneticSpronDict.txt, containing only the single most frequent variant.
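
Picking the most frequent variant per word could be done along these lines. This is a sketch rather than the script actually used; the function name `single_pron_dict` and the input format (one `(word, pronunciation)` pair per aligned token) are assumptions.

```python
from collections import Counter

def single_pron_dict(aligned_tokens):
    """Keep, for each word, the pronunciation chosen most often by alignment.

    aligned_tokens: iterable of (word, pronunciation) pairs, one per token.
    """
    counts = {}
    for word, pron in aligned_tokens:
        counts.setdefault(word, Counter())[pron] += 1
    # most_common(1) returns the single highest-count (pron, count) pair.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}
```

For instance, if THE is aligned twice as DH AH and once as DH IY, the resulting dictionary keeps only DH AH.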

All these dictionaries have the start and end of utterance markers <S> and </S>, pronounced as SIL.

The words used in Fisher but missing from the CMU dictionary are in wholeWordsMissingFromCMUDictDefs.txt. The word fragments (also missing from CMU dict) are not included in this list.

Known bugs in the dictionary:

In acronyms, the letter A is transcribed as AH rather than as the letter name EY (A_S_A => AH EH S AH). This affects about 4 thousand tokens.
Some known acronyms are not spelled out with underscores, and the model has a hard time with them (MSNBC => M IH Z N IH K).

You can see the whole words missing from CMUDict in wholeWordsMissingFromCMUDictDefs.txt.

These bugs will not be corrected for my current set of experiments to keep things consistent.

Word perplexity given a permitted sequence of phones

The dictionary contains different words which are pronounced as an identical sequence of phonemes. Naturally the speech recognizer will have problems with those words, relying only on the language model to make the decision. For the Fisher corpus, the perplexity of word <math>W</math> given phone sequence <math>S</math> is calculated as follows.

Let <math>H(W|S=s)</math> be the entropy of <math>W</math> given a particular phone sequence <math>s</math>. This can be obtained from the corpus statistics. Then <math>H(W|S)=E_{p(S)}[H(W|S=s)]=\sum_s p(S=s)\,H(W|S=s)</math>, where <math>p(S=s)</math> is the count of tokens pronounced as <math>s</math> divided by the total number of tokens in the corpus. I've calculated <math>H(W|S)</math> using the partial-words-repaired corpus and the multi-pronunciation dictionary. Since we only have word transcriptions and don't have phonetic transcriptions, we need to guess which of the pronunciations was used for each token of each multi-pronunciation word. If a word has M pronunciations and occurs in the corpus N times, we assume that each pronunciation was used <math> \left \lfloor\frac{N}{M}\right \rfloor</math> times. This is clearly wrong, as the pronunciations per word tend to follow Zipf's law (or perhaps an exponential distribution, but certainly not a uniform one), but we have no basis to make a better guess.
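
As a rough illustration of this calculation (not a port of the author's MATLAB code), the sketch below computes the conditional entropy from word counts and a multi-pronunciation dictionary. The function name `word_entropy_given_phones` and the input formats are assumptions; the floor-based split of tokens across pronunciations follows the description above.

```python
import math
from collections import defaultdict

def word_entropy_given_phones(word_counts, prons):
    """Return H(W|S) in bits.

    word_counts: word -> number of tokens N in the corpus.
    prons: word -> list of its M pronunciations (phone strings).
    Each pronunciation of a word is credited with floor(N / M) tokens.
    """
    by_phones = defaultdict(dict)  # phone string -> {word: token count}
    for word, n in word_counts.items():
        variants = prons[word]
        share = n // len(variants)  # tokens left over are lost to flooring
        if share == 0:
            continue
        for s in variants:
            by_phones[s][word] = by_phones[s].get(word, 0) + share
    total = sum(sum(words.values()) for words in by_phones.values())
    entropy = 0.0
    for words in by_phones.values():
        n_s = sum(words.values())
        # Entropy of the word given this particular phone string.
        h_s = -sum((c / n_s) * math.log2(c / n_s) for c in words.values())
        entropy += (n_s / total) * h_s  # weight by p(S = s)
    return entropy
```

Raising 2 to the returned entropy gives the perplexity, i.e. the average number of words competing for a phone string; for example, `2 ** 0.2275` is about 1.1708.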

The code for the following statistics is in calcWordPerlexity.m.

Word perplexity and related statistics
Total Tokens	26351455	100%
Tokens lost to flooring	2949	0.01119%
Tokens affected by the multiple pronunciation redistribution	7866152	29.85%
Entropy <math>H(W|S)</math> ignoring the <S> and </S> markers	0.2275 bits
Perplexity <math>2^{H(W|S)}</math>	1.1708 words per unique phoneme string

This means that even if the phoneme string and its boundaries were recognized perfectly, the language model would still have to choose among 1.17 words on average.

Mixed Unit Dictionary

Fisher Dictionaries
