Fisher Corpus

From SpeechWiki

Revision as of 08:11, 28 March 2008

The Fisher corpus is still relatively new and rough. This page is meant to help people quickly build a basic speech recognizer with it.

Vocabulary

Some Corpus Statistics

Statistic                                                                   Count      % of word tokens
total non-empty utterances                                                  2223159
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) )     283935
total word tokens in corpus (including uncertain words)                     21905137   100%
total non-speech markers enclosed in [] (e.g. [LAUGH])                      559629     2.555%
total partial words (starting or ending in -)                               153098     0.6990%
total partial words that could be repaired                                  101550     0.4636%
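
The marker conventions used in the statistics above ((( )) around uncertain words, [] around non-speech events, a leading or trailing - on partial words) can be tallied per utterance with a short script. The following is a minimal Python sketch, not the script that produced the figures on this page. The function name count_markers, the whitespace tokenization, and the decision to count non-speech markers and uncertain words as tokens are all assumptions.

```python
import re
from collections import Counter

# Assumed marker shapes, taken from the examples on this page.
UNCERTAIN = re.compile(r"\(\(\s*(.*?)\s*\)\)")  # e.g. (( NO WAY ))
NON_SPEECH = re.compile(r"\[[^\]]+\]")          # e.g. [LAUGH]


def count_markers(utterance: str) -> Counter:
    """Count tokens and marker types in one transcript utterance."""
    counts = Counter()
    counts["uncertain"] = len(UNCERTAIN.findall(utterance))
    # Drop only the (( )) delimiters so uncertain words still count as tokens.
    text = utterance.replace("((", " ").replace("))", " ")
    for token in text.split():
        counts["tokens"] += 1
        if NON_SPEECH.fullmatch(token):
            counts["non_speech"] += 1
        elif len(token) > 1 and (token.startswith("-") or token.endswith("-")):
            counts["partial"] += 1
    return counts


example = count_markers("i (( NO WAY )) th- think [LAUGH] so")
print(dict(example))
```

For the made-up utterance in the example, this counts 7 tokens, 1 uncertain span, 1 non-speech marker and 1 partial word. Summing the counters over all utterances would give corpus-level totals comparable in kind to the ones above, although the exact numbers depend on the tokenization rules actually used.
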
Some Vocab Statistics

Statistic                                                                               Count   Percentage
total unique words                                                                      64924   100%
unique words occurring once in the corpus                                               23192   35.72%
unique words occurring once or twice in the corpus                                      31272   48.17%
corpus coverage if vocab does not include words occurring once in the corpus                    99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus           99.857%
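
The percentages above can be rederived from the listed counts. The sketch below is one way to do that in Python, assuming coverage means the share of word tokens that remain in vocabulary and that a word occurring twice accounts for two tokens. The variable names are illustrative. Under these assumptions the once-only coverage matches the listed 99.894%, while the once-or-twice coverage comes out at about 99.820%. The listed 99.857% is what results from subtracting one token per excluded word, so that figure may have been computed that way or under a different definition.

```python
# Counts copied from the statistics on this page.
TOTAL_TOKENS = 21905137
TOTAL_TYPES = 64924
TYPES_ONCE = 23192
TYPES_ONCE_OR_TWICE = 31272


def pct(part: int, whole: int, digits: int) -> float:
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


# Words occurring exactly twice, and the tokens lost by excluding each group.
types_twice = TYPES_ONCE_OR_TWICE - TYPES_ONCE       # 8080
tokens_lost_once = TYPES_ONCE                        # one token per word
tokens_lost_once_or_twice = TYPES_ONCE + 2 * types_twice

share_once = pct(TYPES_ONCE, TOTAL_TYPES, 2)                    # 35.72
share_once_or_twice = pct(TYPES_ONCE_OR_TWICE, TOTAL_TYPES, 2)  # 48.17

coverage_once = pct(TOTAL_TOKENS - tokens_lost_once, TOTAL_TOKENS, 3)  # 99.894
coverage_once_or_twice = pct(
    TOTAL_TOKENS - tokens_lost_once_or_twice, TOTAL_TOKENS, 3
)  # 99.82

print(share_once, share_once_or_twice)
print(coverage_once, coverage_once_or_twice)
```

Either way, dropping the rarest words removes close to half of the vocabulary while giving up less than 0.2% of the corpus tokens.
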

Language Model
