Units Paper
From SpeechWiki
Outline
- Intro
- Unit Selection
- Mistake instance
- Unit
- Replacement
- Multwords
- Baseline Description
- Vocab: use the single most frequent pronunciation of each word from a multi-pronunciation dictionary (this works better than keeping multiple pronunciations)
- Results
- What to emphasize? Ideally, units+DTs will beat DTs alone for every number of components. Even if we cannot grow the number of components until the improvement bottoms out, there should at least be a trend.
- Conclusion
- Future work: consider context during unit selection (right now the unit is context-free: the same unit is used in every context where a replacement took place).
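The baseline vocabulary choice in the outline (keeping only the most frequent pronunciation of each word) can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiments: the function name, the input format (a list of word/pronunciation pairs, for instance from a forced alignment) and the toy data are all assumptions.

```python
from collections import Counter

def most_frequent_pronunciations(observations):
    """Map each word to its single most frequent pronunciation.

    observations: iterable of (word, pronunciation) pairs. This input
    format is hypothetical; it stands in for whatever counts are available.
    """
    counts = {}
    for word, pron in observations:
        counts.setdefault(word, Counter())[pron] += 1
    # most_common(1) returns [(pronunciation, count)]; on ties the
    # pronunciation seen first is kept.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Toy data, invented for illustration only.
toy = [
    ("the", "dh ah"), ("the", "dh iy"), ("the", "dh ah"),
    ("and", "ae n d"), ("and", "ah n"), ("and", "ah n"),
]
single_pron_dict = most_frequent_pronunciations(toy)
print(single_pron_dict)
```

The result is an ordinary single-pronunciation dictionary, so the rest of the pipeline does not need to know that a multi-pronunciation dictionary was the starting point.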
Tests for units paper
compPer | units | monophone states | Mix | totalComp | WER | Test WER | Important |
---|---|---|---|---|---|---|---|
512 | 1 | | 503 | 256k | TR | | |
256 | 1 | | 1000 | 256k | TR | | |
64 | 1 | | 3854 | 256k | 49.3 | | |
64 | 2 | | 2000 | 256k | ? | | |
32 | 4 | | 2000 | 256k | ? | | |
32 | 2 | | 4000 | 256k | ? | | |
alternatively | | | | | | | |
256 | 48 | 137 | 503 | 127971 | 53.0 | | |
128 | 48 | 137 | 1033 | 131185 | 50.9 | | |
32 | 48 | 137 | 3845 | 122907 | 51.4 | | |
64 | 112 | 615 | 2024 | ~128k | TR | | |
16 | 4 | | 2000 | 128k | ? | | |
16 | 112 | 615 | 4000 | 128k | ? | | |
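As a sanity check on the component budget, the rows that list compPer, Mix and an exact totalComp can be compared against the product compPer × Mix. The sketch below assumes that totalComp is meant to be roughly that product; the page does not state this, and the variable names are made up for the example.

```python
# (compPer, Mix, totalComp) copied from the three rows of the table above
# that give an exact total.
rows = [
    (256, 503, 127971),
    (128, 1033, 131185),
    (32, 3845, 122907),
]

def component_budget(comp_per, mixtures):
    """Upper bound on the component count, assuming every mixture
    keeps all comp_per components (an assumption, not stated on the page)."""
    return comp_per * mixtures

ratios = []
for comp_per, mixtures, listed in rows:
    budget = component_budget(comp_per, mixtures)
    ratios.append(listed / budget)
    print(f"{comp_per:4d} x {mixtures:5d} = {budget:7d}, listed {listed:7d} "
          f"({100.0 * listed / budget:.2f}% of budget)")
```

In all three rows the listed total is slightly below the product. That would be consistent with a few components being dropped during training, but that explanation is a guess.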
The units make it worse
with LM_PENALTY=0
DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 89.2% ( 446)
with substitions 72.4% ( 362)
with deletions 26.2% ( 131)
with insertions 74.4% ( 372)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 99.1% (5107)
Percent Correct = 28.6% (1476)
Percent Substitution = 66.2% (3411)
Percent Deletions = 5.1% ( 265)
Percent Insertions = 27.8% (1431)
Percent Word Accuracy = 0.9%
Ref. words = (5152)
Hyp. words = (6318)
Aligned words = (6583)
CONFUSION PAIRS Total (2790)
With >= 1 occurances (2790)
with LM_PENALTY=-1
test2kUtt/config16Disaster/test0/accuracy/out.nosil.trn.dtl
DETAILED OVERALL REPORT FOR THE SYSTEM: test2kUtt/config16/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 88.2% ( 441)
with substitions 72.8% ( 364)
with deletions 32.8% ( 164)
with insertions 68.6% ( 343)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 94.6% (4857)
Percent Correct = 27.6% (1417)
Percent Substitution = 65.4% (3358)
Percent Deletions = 7.0% ( 357)
Percent Insertions = 22.3% (1142)
Percent Word Accuracy = 5.4%
Ref. words = (5132)
Hyp. words = (5917)
Aligned words = (6274)
LM_PENALTY = -2
DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 87.2% ( 436)
with substitions 72.8% ( 364)
with deletions 38.8% ( 194)
with insertions 62.8% ( 314)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 91.4% (4674)
Percent Correct = 26.9% (1374)
Percent Substitution = 64.0% (3271)
Percent Deletions = 9.2% ( 468)
Percent Insertions = 18.3% ( 935)
Percent Word Accuracy = 8.6%
Ref. words = (5113)
Hyp. words = (5580)
Aligned words = (6048)
CONFUSION PAIRS Total (2699)
With >= 1 occurances (2699)
LM_PENALTY = -3
DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 87.0% ( 435)
with substitions 72.6% ( 363)
with deletions 40.6% ( 203)
with insertions 56.4% ( 282)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 88.9% (4535)
Percent Correct = 25.4% (1298)
Percent Substitution = 63.0% (3215)
Percent Deletions = 11.5% ( 588)
Percent Insertions = 14.4% ( 732)
Percent Word Accuracy = 11.1%
Ref. words = (5101)
Hyp. words = (5245)
Aligned words = (5833)
CONFUSION PAIRS Total (2671)
With >= 1 occurances (2671)
LM_PENALTY = -4
DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 86.6% ( 433)
with substitions 72.2% ( 361)
with deletions 45.8% ( 229)
with insertions 49.8% ( 249)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 87.2% (4446)
Percent Correct = 24.5% (1249)
Percent Substitution = 60.4% (3078)
Percent Deletions = 15.1% ( 770)
Percent Insertions = 11.7% ( 598)
Percent Word Accuracy = 12.8%
Ref. words = (5097)
Hyp. words = (4925)
Aligned words = (5695)
CONFUSION PAIRS Total (2590)
With >= 1 occurances (2590)
LM_PENALTY = -5
DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 86.4% ( 432)
with substitions 72.2% ( 361)
with deletions 49.2% ( 246)
with insertions 46.2% ( 231)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 87.0% (4423)
Percent Correct = 23.1% (1176)
Percent Substitution = 58.9% (2998)
Percent Deletions = 17.9% ( 912)
Percent Insertions = 10.1% ( 513)
Percent Word Accuracy = 13.0%
Ref. words = (5086)
Hyp. words = (4687)
Aligned words = (5599)
CONFUSION PAIRS Total (2533)
With >= 1 occurances (2533)
LM_PENALTY = -6
DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 87.0% ( 435)
with substitions 72.6% ( 363)
with deletions 53.4% ( 267)
with insertions 40.4% ( 202)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 86.6% (4394)
Percent Correct = 21.5% (1088)
Percent Substitution = 56.8% (2883)
Percent Deletions = 21.7% (1101)
Percent Insertions = 8.1% ( 410)
Percent Word Accuracy = 13.4%
Ref. words = (5072)
Hyp. words = (4381)
Aligned words = (5482)
CONFUSION PAIRS Total (2472)
With >= 1 occurances (2472)
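The LM_PENALTY sweep above can be summarized by recomputing the total error from the raw counts, using the usual definition WER = (substitutions + deletions + insertions) / reference words. The counts below are copied from the reports; the variable and function names are just for this sketch.

```python
# (LM_PENALTY, substitutions, deletions, insertions, ref. words, reported % total error)
sweep = [
    (0, 3411, 265, 1431, 5152, 99.1),
    (-1, 3358, 357, 1142, 5132, 94.6),
    (-2, 3271, 468, 935, 5113, 91.4),
    (-3, 3215, 588, 732, 5101, 88.9),
    (-4, 3078, 770, 598, 5097, 87.2),
    (-5, 2998, 912, 513, 5086, 87.0),
    (-6, 2883, 1101, 410, 5072, 86.6),
]

def wer(subs, dels, ins, ref_words):
    """Word error rate in percent."""
    return 100.0 * (subs + dels + ins) / ref_words

computed = {p: round(wer(s, d, i, n), 1) for p, s, d, i, n, _ in sweep}
best_penalty = min(computed, key=computed.get)

for p, s, d, i, n, reported in sweep:
    print(f"LM_PENALTY={p:3d}  WER={computed[p]:5.1f}%  (reported {reported}%)  "
          f"del={d:4d}  ins={i:4d}")
print("lowest WER at LM_PENALTY =", best_penalty)
```

Across the sweep, insertions fall and deletions rise as the penalty decreases, and the total error is still improving slightly at -6, so the sweep may not have reached its minimum yet.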
Clearly something is wrong with the monophone model: a triunit Gaussian model performs about the same as a monophone Gaussian model, when I would expect it to be better by about 10% WER.
Monophone tests
Trying to track down where the error is coming from:
Standard monophone, converged once, WER 86.8%:
DETAILED OVERALL REPORT FOR THE SYSTEM: trTest/config19/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 83.4% ( 417)
with substitions 69.4% ( 347)
with deletions 52.8% ( 264)
with insertions 33.6% ( 168)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 86.8% (4426)
Percent Correct = 18.9% ( 961)
Percent Substitution = 54.9% (2796)
Percent Deletions = 26.3% (1340)
Percent Insertions = 5.7% ( 290)
Percent Word Accuracy = 13.2%
Ref. words = (5097)
Hyp. words = (4047)
Aligned words = (5387)
The Units monophone:
DETAILED OVERALL REPORT FOR THE SYSTEM: monoUnitTest/config66/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 86.8% ( 434)
with substitions 72.2% ( 361)
with deletions 43.6% ( 218)
with insertions 51.2% ( 256)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 87.7% (4497)
Percent Correct = 25.2% (1291)
Percent Substitution = 61.8% (3166)
Percent Deletions = 13.0% ( 668)
Percent Insertions = 12.9% ( 663)
Percent Word Accuracy = 12.3%
Ref. words = (5125)
Hyp. words = (5120)
Aligned words = (5788)
CONFUSION PAIRS Total (2569)
With >= 1 occurances (2569)
So there are probably two problems:
- one with monophones (more units make things worse?!),
- and one with triunits (adding context does not make things better).
I will dig into the monophones first.
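To compare the standard and units monophone reports side by side, the error counts can be pulled out of the report text with a regular expression. This is only a sketch: the pattern is written against the lines as they appear on this page, not against any official description of the report format, and the helper name is made up.

```python
import re

# Matches e.g. "Percent Deletions = 26.3% (1340)" or "... = 5.7% ( 290)".
FIELD = re.compile(
    r"Percent (Substitution|Deletions|Insertions)\s*=\s*([\d.]+)%\s*\(\s*(\d+)\)"
)

def error_counts(report_text):
    """Return {'Substitution': n, 'Deletions': n, 'Insertions': n}."""
    return {name: int(count) for name, _pct, count in FIELD.findall(report_text)}

# Excerpts copied from the two monophone reports above.
standard_report = ("Percent Substitution = 54.9% (2796) "
                   "Percent Deletions = 26.3% (1340) "
                   "Percent Insertions = 5.7% ( 290)")
units_report = ("Percent Substitution = 61.8% (3166) "
                "Percent Deletions = 13.0% ( 668) "
                "Percent Insertions = 12.9% ( 663)")

standard = error_counts(standard_report)
units = error_counts(units_report)

for label, counts in (("standard", standard), ("units", units)):
    print(label, counts, "deletions - insertions =",
          counts["Deletions"] - counts["Insertions"])
```

The two runs end up at similar total error but split it very differently: the standard monophone has far more deletions than insertions, while the units monophone has them nearly balanced.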
Config 27: units with --maxStates 500 --unitType wordInternalOnly --subUnits asBefore --growUnitSet 0, LM_PENALTY=-1
DETAILED OVERALL REPORT FOR THE SYSTEM: testc67OnBreakdownModelc26/config67/test0/accuracy/out.nosil.trn
SENTENCE RECOGNITION PERFORMANCE
sentences 500
with errors 88.6% ( 443)
with substitions 73.8% ( 369)
with deletions 36.8% ( 184)
with insertions 56.0% ( 280)
WORD RECOGNITION PERFORMANCE
Percent Total Error = 91.1% (4679)
Percent Correct = 24.5% (1260)
Percent Substitution = 65.1% (3340)
Percent Deletions = 10.4% ( 534)
Percent Insertions = 15.7% ( 805)
Percent Word Accuracy = 8.9%
Ref. words = (5134)
Hyp. words = (5405)
Aligned words = (5939)
CONFUSION PAIRS Total (2693)
Problems fixed so far
- All previous tests were run with the bigram model! All interesting tests should be rerun.
- Only 1 subunit on each boundary was clustered, a disadvantage relative to traditional units, where the center unit was also clustered. Now up to two subunits on each boundary are untied and clustered.
- Many unused GMs were left in the trainable params. This probably didn't affect accuracy but slowed everything down.
- SUB_PHONE_COUNTER_CARD was used where WORDSTATE_COUNTER_CARD should have been used. Any word with more than 15 substates was given 0 probability?! Another reason to redo all the monophone tests.
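The cardinality mix-up in the last item can be illustrated with a toy model. Everything numeric here is an assumption chosen only to reproduce the symptom described (words with more than 15 substates become impossible): the page does not give the actual cardinalities, and the function is a simplified stand-in, not the real model code.

```python
# Assumed values, for illustration only.
SUB_PHONE_COUNTER_CARD = 16   # assumed: counter can hold values 0..15
WORDSTATE_COUNTER_CARD = 64   # assumed: large enough for the longest word

def word_is_representable(num_substates, counter_card):
    """Toy rule: the counter must be able to reach num_substates, and a
    variable with cardinality counter_card only holds 0..counter_card-1."""
    return num_substates <= counter_card - 1

# Toy substate counts per word, invented for the example.
word_substates = {"a": 3, "seven": 15, "unfortunately": 33}

lost_with_wrong_card = [w for w, n in word_substates.items()
                        if not word_is_representable(n, SUB_PHONE_COUNTER_CARD)]
lost_with_right_card = [w for w, n in word_substates.items()
                        if not word_is_representable(n, WORDSTATE_COUNTER_CARD)]

print("unreachable with the wrong cardinality:", lost_with_wrong_card)
print("unreachable with the right cardinality:", lost_with_right_card)
```

Under these assumptions only long words are affected, which would show up as systematic errors on long words rather than a uniform degradation.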
Finally, at least something reasonable, and (hopefully) a 2% WER improvement in the baseline. So, redoing the monophone tests:
trainConfig | testConfig | Descr | States | WER | WER 2k | Comments |
---|---|---|---|---|---|---|
triUnitsModel/config28 | testc67triUnitsModelc28/config67 | baseline but using the new testing stuff: --maxStates 0 --unitType wordInternalOnly --subUnits asBefore --growUnitSet 0 | 137 | 84.3% | 86.0% | |
moreDTsModel/config20 | trTestTrigramFixed/config19 | baseline | 137 | 84.3% | | Same as above but using older code |
triUnitsModel/config27 | testc67TriOnBreakdownModelc27/config67 | --maxStates 500 --unitType wordInternalOnly --subUnits asBefore --growUnitSet 0 | 627 | 87.2 | 87.6 | |
triUnitsModel/config26 | testc67TriOnBreakdownModelc26/config67 | --maxStates 500 --unitType wordInternalOnly --subUnits asBefore | 627 | 87.1 was 91.1% | TE | |
triUnitsModel/config25 | xx | --maxStates 500 --unitType wordInternalOnly | 636 | xx | 87.1 | |
triUnitsModel/config24 | xx | --maxStates 500 | 615 | xx | xx | |
triUnitsModel/config29 | xx | --maxStates 500 --unitType multiWordOnly | 637 | xx | 88.2 | There must be something wrong with this - the confusion pairs are not making sense and there are way too many deletions |
triUnitsModel/config30 | xx | --maxStates 500 --unitType wordInternalOnly --subUnits asBefore, initializeUnitsFromFile | 627 | xx | 86.0 | |