CMU Sphinx Resources: Full Text Dictation
Sphinx comes with documentation on creating new vocabularies, but it says little about dictating full text (such as you might use to write a letter to a friend) as opposed to using a limited command vocabulary (like the example turtle vocabulary that ships with Sphinx). To generate a full English vocabulary, I used the Hub4 open source language models, with some modifications. This setup uses the following files:
- The sphinx2-full script. This is a copy of the sphinx2-simple script that ships with Sphinx, changed to load the Hub4 models instead of the turtle vocabulary.
- full.dic.gz, the dictionary (based on cmudict). The sphinx2-full script expects an unzipped copy of this file at /usr/local/share/sphinx2/full.dic; if you put it anywhere else, be sure to modify the script to look in the correct location.
- The language model. The sphinx2-full script expects an unzipped copy of this file at /usr/local/share/sphinx2/full.lm. The file you download is named language_model.arpaformat.gz, so you will have to unzip it and rename the result to full.lm.
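The unzip-and-rename steps for the dictionary and the language model can be sketched as a small script. This is only an illustration, not part of Sphinx or of the sphinx2-full script: the function name install_model_files and the idea of a single download directory are assumptions made for the example, while the file names and the destination directory come from the list above.

```python
import gzip
import shutil
from pathlib import Path

# Directory where the sphinx2-full script looks for its files (from the list above).
DEFAULT_DEST = Path("/usr/local/share/sphinx2")


def install_model_files(download_dir, dest_dir=DEFAULT_DEST):
    """Unzip the two downloads into dest_dir under the names sphinx2-full expects.

    Hypothetical helper for illustration; it assumes both .gz files
    were saved into download_dir.
    """
    download_dir = Path(download_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Downloaded archive name -> unzipped file name the script expects.
    targets = {
        "full.dic.gz": "full.dic",
        "language_model.arpaformat.gz": "full.lm",  # renamed while unzipping
    }

    installed = []
    for archive_name, target_name in targets.items():
        target = dest_dir / target_name
        # Stream the decompressed bytes so a large language model is not held in memory.
        with gzip.open(download_dir / archive_name, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        installed.append(target)
    return installed
```

Calling install_model_files with the directory that holds the two downloads writes full.dic and full.lm into /usr/local/share/sphinx2. Writing there usually requires root privileges; if you pass a different dest_dir instead, remember to modify the sphinx2-full script to look in that location.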
Maintained by Jessica P. Hekman.