Fisher Corpus
From SpeechWiki
Revision as of 08:11, 28 March 2008
The Fisher corpus is still relatively new and rough; this page is meant to help people quickly build a basic speech recognizer with it.
Vocabulary
Statistic | Count | Percentage of word tokens |
---|---|---|
total non-empty utterances | 2223159 | |
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) ) | 283935 | |
total word tokens in corpus (including uncertain words) | 21905137 | 100% |
total non-speech markers enclosed in [] (e.g. [LAUGH]) | 559629 | 2.555% |
total partial words (starting or ending in -) | 153098 | 0.6990% |
total partial words that could be repaired | 101550 | 0.4636% |
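
The categories in the table above can be tallied with a short script. The following is a minimal sketch, not the script used to produce these numbers: the function name `classify_tokens` is hypothetical, and it assumes that transcripts are whitespace-tokenized, that `((` and `))` appear as separate tokens (as in the `(( NO WAY ))` example), and that non-speech markers count toward the token total.

```python
import re
from collections import Counter

# A non-speech marker is a single token wrapped in square brackets, e.g. [LAUGH].
NONSPEECH = re.compile(r"^\[[^\]]+\]$")


def classify_tokens(utterance):
    """Tally token categories for one transcript line.

    Assumes whitespace tokenization and that the uncertainty
    delimiters "((" and "))" stand alone as tokens.
    """
    counts = Counter()
    for tok in utterance.split():
        if tok == "((":
            # Each opening delimiter starts one uncertain word or phrase.
            counts["uncertain_spans"] += 1
            continue
        if tok == "))":
            continue
        counts["tokens"] += 1
        if NONSPEECH.match(tok):
            counts["nonspeech"] += 1
        elif len(tok) > 1 and (tok.startswith("-") or tok.endswith("-")):
            # Partial words start or end with a hyphen, e.g. TH- or -ING.
            counts["partial"] += 1
    return counts


example = classify_tokens("(( NO WAY )) [LAUGH] I TH- THINK SO")
print(dict(example))
```

Summing the returned counters over all utterances gives corpus-level totals comparable to the rows above, provided the tokenization assumptions hold for the transcripts at hand.
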
Statistic | Value | Percentage of unique words |
---|---|---|
total unique words | 64924 | 100% |
unique words occurring once in the corpus | 23192 | 35.72% |
unique words occurring once or twice in the corpus | 31272 | 48.17% |
corpus coverage if vocab does not include words occurring once in the corpus | 99.894% | |
corpus coverage if vocab does not include words occurring once or twice in the corpus | 99.857% | |
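
As an illustration of how rare-word and coverage figures of this kind can be derived, here is a minimal sketch. The function name `vocab_stats` and the toy token list are hypothetical. It assumes coverage is defined as the fraction of word tokens that remain in the vocabulary after dropping every word whose count is at most `max_rare_count`.

```python
from collections import Counter


def vocab_stats(tokens, max_rare_count=1):
    """Rare-word share and token coverage for a list of word tokens.

    Words occurring at most `max_rare_count` times are treated as
    out-of-vocabulary; coverage is the share of tokens still covered.
    """
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    rare = {w: c for w, c in counts.items() if c <= max_rare_count}
    return {
        "unique_words": len(counts),
        "rare_type_fraction": len(rare) / len(counts),
        "coverage": 1 - sum(rare.values()) / total_tokens,
    }


# Toy corpus: A occurs 3 times, B twice, C and D once each.
toy = "A B A C A B D".split()
print(vocab_stats(toy, max_rare_count=1))
print(vocab_stats(toy, max_rare_count=2))

# The singleton coverage row follows from the table's own counts:
# each word occurring once removes exactly one token from coverage.
singleton_coverage = 100 * (1 - 23192 / 21905137)
print(f"{singleton_coverage:.3f}%")
```

Under this definition, a word that occurs twice removes two tokens from coverage, so the token loss grows faster than the number of word types dropped.
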