marytts.tools.newlanguage
Class LexiconCreator

java.lang.Object
  extended by marytts.tools.newlanguage.LexiconCreator
Direct Known Subclasses:
CMUDict2MaryFST

public class LexiconCreator
extends java.lang.Object

The LexiconCreator is the base class for creating the files needed to run the phonemiser component for a new language. From a list of phonetically transcribed words, the class will create a compressed lexicon FST file and a letter-to-sound prediction tree.

The input file is expected to contain data in the following format:

    grapheme | ' a l - l o - p h o n e s | (optional-part-of-speech)

The allophones must correspond to a defined allophone set, which is given in the constructor. The file's encoding is expected to be UTF-8. Subclasses of LexiconCreator can override prepareLexicon() to provide data in this format.
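To make the input format concrete, the following is a minimal sketch of how one line could be split into its fields. It is not the parsing code used by LexiconCreator. The class LexiconLine and its members are hypothetical names introduced for illustration. The sketch assumes that fields are separated by "|" with optional surrounding whitespace, that allophone symbols are separated by whitespace, and that the third field is kept verbatim when present.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of MaryTTS: holds the fields of one lexicon line.
final class LexiconLine {
    final String grapheme;
    final String transcription; // e.g. "' a l - l o - p h o n e s"
    final String partOfSpeech;  // null when the optional third field is absent

    private LexiconLine(String grapheme, String transcription, String partOfSpeech) {
        this.grapheme = grapheme;
        this.transcription = transcription;
        this.partOfSpeech = partOfSpeech;
    }

    // Splits a line on '|' and trims each field (assumed separator handling).
    static LexiconLine parse(String line) {
        String[] fields = line.split("\\|", -1);
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 fields: " + line);
        }
        String grapheme = fields[0].trim();
        String transcription = fields[1].trim();
        if (grapheme.isEmpty() || transcription.isEmpty()) {
            throw new IllegalArgumentException("Empty grapheme or transcription: " + line);
        }
        String pos = null;
        if (fields.length == 3 && !fields[2].trim().isEmpty()) {
            pos = fields[2].trim();
        }
        return new LexiconLine(grapheme, transcription, pos);
    }

    // Returns the whitespace-separated symbols of the transcription.
    List<String> symbols() {
        return Arrays.asList(transcription.split("\\s+"));
    }
}
```

A real check would additionally verify each symbol against the allophone set passed to the constructor; that step is omitted here because it depends on the AllophoneSet API.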

Author:
marc
See Also:
AllophoneSet

Field Summary
protected  AllophoneSet allophoneSet
           
protected  int context
           
protected  boolean convertToLowercase
           
protected  java.lang.String fstFilename
           
protected  java.lang.String lexiconFilename
           
protected  org.apache.log4j.Logger logger
           
protected  java.lang.String ltsFilename
           
protected  boolean predictStress
           
 
Constructor Summary
LexiconCreator(AllophoneSet allophoneSet, java.lang.String lexiconFilename, java.lang.String fstFilename, java.lang.String ltsFilename)
          Initialise a new lexicon creator.
LexiconCreator(AllophoneSet allophoneSet, java.lang.String lexiconFilename, java.lang.String fstFilename, java.lang.String ltsFilename, boolean convertToLowercase, boolean predictStress, int context)
          Initialise a new lexicon creator.
 
Method Summary
protected  void compileFST()
           
protected  void compileLTS()
           
 void createLexicon()
           
static void main(java.lang.String[] args)
           
protected  void prepareLexicon()
          This base implementation does nothing.
protected  void testFST()
           
protected  void testLTS()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected org.apache.log4j.Logger logger

allophoneSet

protected AllophoneSet allophoneSet

lexiconFilename

protected java.lang.String lexiconFilename

fstFilename

protected java.lang.String fstFilename

ltsFilename

protected java.lang.String ltsFilename

convertToLowercase

protected boolean convertToLowercase

predictStress

protected boolean predictStress

context

protected int context
Constructor Detail

LexiconCreator

public LexiconCreator(AllophoneSet allophoneSet,
                      java.lang.String lexiconFilename,
                      java.lang.String fstFilename,
                      java.lang.String ltsFilename)
Initialise a new lexicon creator. Letter-to-sound rules built with this lexicon creator will convert graphemes to lowercase before prediction, using the locale given in the allophone set. They will also predict stress, and will use a context of 2 characters to the left and to the right of the current character as predictive features.

Parameters:
allophoneSet - this specifies the set of phonetic symbols that can be used in the lexicon, and provides the locale of the lexicon
lexiconFilename - where to find the plain-text lexicon
fstFilename - where to create the compressed lexicon FST file
ltsFilename - where to create the letter-to-sound prediction tree.

LexiconCreator

public LexiconCreator(AllophoneSet allophoneSet,
                      java.lang.String lexiconFilename,
                      java.lang.String fstFilename,
                      java.lang.String ltsFilename,
                      boolean convertToLowercase,
                      boolean predictStress,
                      int context)
Initialise a new lexicon creator.

Parameters:
allophoneSet - this specifies the set of phonetic symbols that can be used in the lexicon, and provides the locale of the lexicon
lexiconFilename - where to find the plain-text lexicon
fstFilename - where to create the compressed lexicon FST file
ltsFilename - where to create the letter-to-sound prediction tree.
convertToLowercase - if true, letter-to-sound rules built with this lexicon creator will convert graphemes to lowercase before prediction, using the locale given in the allophone set.
predictStress - if true, letter-to-sound rules will predict stress.
context - the number of characters to the left and to the right of the current character that will be used as predictive features.
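
The effect of convertToLowercase and context can be illustrated with a small sketch. This is not the feature extraction used by the letter-to-sound trainer. The class ContextWindowSketch, the method window and the padding symbol "_" are assumptions made for illustration only. The sketch shows a window of context characters on each side of the current character, taken after optional locale-aware lowercasing.

```java
import java.util.Locale;

// Hypothetical illustration, not part of MaryTTS.
final class ContextWindowSketch {

    // Returns 2 * context + 1 entries centred on the character at 'position'.
    // Positions outside the word are filled with "_" (an assumed padding symbol).
    static String[] window(String word, int position, int context,
                           boolean convertToLowercase, Locale locale) {
        String w = convertToLowercase ? word.toLowerCase(locale) : word;
        if (position < 0 || position >= w.length()) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        String[] features = new String[2 * context + 1];
        for (int offset = -context; offset <= context; offset++) {
            int i = position + offset;
            features[offset + context] =
                (i >= 0 && i < w.length()) ? String.valueOf(w.charAt(i)) : "_";
        }
        return features;
    }

    public static void main(String[] args) {
        // Window around the first letter of "Cat" with a context of 2.
        System.out.println(String.join(" ", window("Cat", 0, 2, true, Locale.ENGLISH)));
    }
}
```

With a context of 2, each character is described by five entries: itself plus two neighbours on each side. Increasing context widens the window accordingly.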
Method Detail

prepareLexicon

protected void prepareLexicon()
                       throws java.io.IOException
This base implementation does nothing. Subclasses can override this method to prepare a lexicon in the expected format, which should then be found at lexiconFilename.
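
As a sketch of what an overriding prepareLexicon() might delegate to, the helper below writes entries to a UTF-8 file in the expected line format. The class LexiconWriterSketch and its methods are hypothetical and not part of MaryTTS. It assumes that entries are already available as grapheme, allophone string and optional part of speech, and that fields are written with " | " as the separator.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of MaryTTS.
final class LexiconWriterSketch {

    // Builds one line: grapheme | allophones, plus | pos when pos is given.
    static String formatEntry(String grapheme, String allophones, String pos) {
        String line = grapheme + " | " + allophones;
        return pos == null ? line : line + " | " + pos;
    }

    // Writes all entries to lexiconFile in UTF-8, one entry per line.
    // Each entry is {grapheme, allophones} or {grapheme, allophones, pos}.
    static void write(Path lexiconFile, List<String[]> entries) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(lexiconFile, StandardCharsets.UTF_8)) {
            for (String[] e : entries) {
                out.write(formatEntry(e[0], e[1], e.length > 2 ? e[2] : null));
                out.newLine();
            }
        }
    }
}
```

A subclass could call such a helper with the path stored in lexiconFilename, so that the file exists in the expected format before the FST and the letter-to-sound tree are built.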

Throws:
java.io.IOException

compileFST

protected void compileFST()
                   throws java.io.IOException
Throws:
java.io.IOException

testFST

protected void testFST()
                throws java.io.IOException
Throws:
java.io.IOException

compileLTS

protected void compileLTS()
                   throws java.io.IOException
Throws:
java.io.IOException

testLTS

protected void testLTS()
                throws java.io.IOException,
                       MaryConfigurationException
Throws:
java.io.IOException
MaryConfigurationException

createLexicon

public void createLexicon()
                   throws java.lang.Exception
Throws:
java.lang.Exception

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Parameters:
args -
Throws:
java.lang.Exception