Revision as of 00:52, 26 March 2008 by Arthur (Talk | contribs)

The Fisher corpus is still relatively new and rough. This page is meant to help people quickly build a basic speech recognizer with it.

Vocabulary

Some Corpus Statistics
total utterances	2223159
total word tokens in corpus	21905137	100%
total non-speech markers enclosed in [] (e.g. [LAUGH])	559629	2.555%
total partial words (starting or ending in -)	154130	0.7036%
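
The counts above can be reproduced with a short script. The following is a minimal sketch, assuming each utterance is already available as a plain string of whitespace-separated tokens and that every non-speech marker is a single token; the function name `token_stats` is made up for illustration and is not part of any Fisher tooling.

```python
from collections import Counter


def token_stats(utterances):
    """Count utterances, tokens, non-speech markers and partial words.

    Assumes whitespace tokenization; a marker is any token wrapped in [],
    and a partial word is any other token starting or ending in '-'.
    """
    counts = Counter()
    for utt in utterances:
        counts["utterances"] += 1
        for tok in utt.split():
            counts["tokens"] += 1
            if tok.startswith("[") and tok.endswith("]"):
                counts["non_speech"] += 1
            elif tok.startswith("-") or tok.endswith("-"):
                counts["partial"] += 1
    return counts


# Two invented utterances, used only to exercise the function.
stats = token_stats(["i [LAUGH] wa- was there", "-cause yeah"])
```

Dividing `stats["non_speech"]` and `stats["partial"]` by `stats["tokens"]` gives the percentages in the last column.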

Some Vocab Statistics
total unique words	64924	100%
unique words occurring once in the corpus	23192	35.72%
unique words occurring once or twice in the corpus	31272	48.17%
corpus coverage if vocab does not include words occurring once in the corpus	99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus	99.820%
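
Coverage figures of this kind can be derived from the counts in the two tables. The sketch below is one way to do it, under the assumption that coverage means the percentage of word tokens that remain in the vocabulary; the function and variable names are illustrative only. Note that a word occurring twice removes two tokens from the covered set, not one.

```python
def coverage_without_rare(freq_of_freq, total_tokens, max_count):
    """Percent of tokens still covered after dropping rare words.

    freq_of_freq maps an occurrence count to the number of unique words
    with exactly that count. Words with count <= max_count are dropped,
    and each dropped word removes `count` tokens from the covered set.
    """
    dropped = sum(count * n_words
                  for count, n_words in freq_of_freq.items()
                  if count <= max_count)
    return 100.0 * (1 - dropped / total_tokens)


# Counts taken from the tables above.
TOTAL_TOKENS = 21905137
ONCE = 23192
ONCE_OR_TWICE = 31272

# Words occurring exactly twice = (once or twice) - (once).
freq_of_freq = {1: ONCE, 2: ONCE_OR_TWICE - ONCE}

cov_drop_once = coverage_without_rare(freq_of_freq, TOTAL_TOKENS, 1)
cov_drop_once_or_twice = coverage_without_rare(freq_of_freq, TOTAL_TOKENS, 2)
```

The same function extends to higher cutoffs if the number of words occurring three times, four times, and so on is added to `freq_of_freq`.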
