marytts.language.tib
Class Tokeniser

java.lang.Object
  extended by marytts.modules.InternalModule
      extended by marytts.language.tib.Tokeniser
All Implemented Interfaces:
MaryModule

public class Tokeniser
extends InternalModule

This class tokenises the Wylie-transcribed Tibetan sentence in RAWMARYXML_tib into processed syllables in PARSEDSYL_tib, which is also in XML format.

Author:
Maria Staudte

Field Summary
 
Fields inherited from class marytts.modules.InternalModule
logger, state
 
Fields inherited from interface marytts.modules.MaryModule
MODULE_OFFLINE, MODULE_RUNNING
 
Constructor Summary
Tokeniser()
          Constructor without arguments; initialises the data inputTypes and outputTypes.
 
Method Summary
 void assignSeparators()
          Stores the punctuation items in the appropriate Strings used by the StringTokeniser.
 void buildAlphabet()
          Read in the generally admitted Wylie symbols, and the vowels and consonants in particular.
 void buildLexicon()
          Build the lexicon, i.e. read in all words in Wylie.
 void buildListMap()
          Store the lists in the definitions section of the xml file in listMap
 void buildParticleList()
          Build the particle list
 void checkPrefix(java.lang.String root, java.util.LinkedList syllable, org.w3c.dom.Element sylElement)
          This method gets the syllable and the root character so that it can extract the chars that precede the root. Those are checked for slot2 and slot1.
 void checkRoot(java.lang.String root, java.util.LinkedList syllable, org.w3c.dom.Element sylElement)
          This method checks for a given char whether it is the root or possibly a slot3-filler such that the actual root is before it.
 void checkSuffix(java.lang.String root, java.util.LinkedList syllable, org.w3c.dom.Element sylElement)
          This method gets the syllable and the vowel so that it can extract the chars that follow the vowel. Those are checked for slot4 (and slot4vowel) and slot5.
 int checkWord(java.lang.String syllable, java.lang.String[] syllArray, int sylPos)
          This method checks whether there are multi-syllabic words containing the given syllable (particle) and matches them against the whole syllable array
 java.util.HashMap getParaDelims(java.lang.String text)
          This method searches the given text for occurrences of the paragraph-delimiter regular expression and stores these in a hashmap
 java.util.LinkedList getSentenceDelims(java.lang.String paraText)
          This method searches the given text for occurrences of the sentence-delimiter regular expression and stores these in a list
 void loadSlotDefinitions()
          Read in the lists of the xml file and store them in slotList
 void parseParagraph(org.w3c.dom.Element paraElement, java.lang.String paraText, java.util.LinkedList regexpsEnd, java.lang.String paraStart, java.lang.String paraEnd)
          This method parses a given paragraph into syllables and calls the parseSyllable-method for each token
 void parseSentence(org.w3c.dom.Element sentenceElement, java.lang.String sentenceText, java.lang.String sentEnd, java.lang.String paraStart, java.lang.String paraEnd)
          This method parses a given sentence into syllables and calls the parseSyllable-method for each token
 void parseSyllable(org.w3c.dom.Element token, java.lang.String sylText)
          This method parses a given syllable into its slots 1-5 and shows the distribution of each character onto the slots.
 java.util.LinkedList preprocessSyll(java.lang.String sylText)
          This method preprocesses the syllable: the longest prefix/root letter is determined and all letters are stored in a list, which is then used for further processing.
 MaryData process(MaryData d)
          This method extracts the text to be parsed from the XML document and calls parseSentence and parseSyllable to obtain the parsed result of that sentence.
 void startup()
          Read in the data of the xml file
 
Methods inherited from class marytts.modules.InternalModule
getLocale, getState, inputType, name, outputType, powerOnSelfTest, shutdown
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tokeniser

public Tokeniser()
Constructor without arguments; initialises the data inputTypes and outputTypes.

Method Detail

startup

public void startup()
             throws java.lang.Exception
Read in the data of the xml file

Specified by:
startup in interface MaryModule
Overrides:
startup in class InternalModule
Throws:
java.lang.Exception

loadSlotDefinitions

public void loadSlotDefinitions()
                         throws javax.xml.parsers.FactoryConfigurationError,
                                javax.xml.parsers.ParserConfigurationException,
                                org.xml.sax.SAXException,
                                java.io.IOException,
                                NoSuchPropertyException
Read in the lists of the xml file and store them in slotList

Throws:
javax.xml.parsers.FactoryConfigurationError
javax.xml.parsers.ParserConfigurationException
org.xml.sax.SAXException
java.io.IOException
NoSuchPropertyException

buildAlphabet

public void buildAlphabet()
Read in the generally admitted Wylie symbols, and the vowels and consonants in particular.


buildListMap

public void buildListMap()
Store the lists in the definitions section of the xml file in listMap


buildLexicon

public void buildLexicon()
                  throws java.io.IOException
Build the lexicon, i.e. read in all words in Wylie.

Throws:
java.io.IOException - if there are problems reading the file

buildParticleList

public void buildParticleList()
                       throws java.io.IOException
Build the particle list

Throws:
java.io.IOException - if there are problems reading the file

assignSeparators

public void assignSeparators()
Stores the punctuation items in the appropriate Strings used by the StringTokeniser.


process

public MaryData process(MaryData d)
                 throws java.lang.Exception
This method extracts the text to be parsed from the XML document and calls parseSentence and parseSyllable to obtain the parsed result of that sentence. The result is in XML form and is then appended to the MaryData result.

Specified by:
process in interface MaryModule
Overrides:
process in class InternalModule
Parameters:
d - XML-inputType
Returns:
result, new MaryData of outputType
Throws:
java.lang.Exception

getParaDelims

public java.util.HashMap getParaDelims(java.lang.String text)
This method searches the given text for occurrences of the paragraph-delimiter regular expression and stores these in a hashmap

Parameters:
text - to be searched
Returns:
HashMap containing the two delimiter maps: the start and end paragraph markers

getSentenceDelims

public java.util.LinkedList getSentenceDelims(java.lang.String paraText)
This method searches the given text for occurrences of the sentence-delimiter regular expression and stores these in a list

Parameters:
paraText - paragraph text to be searched
Returns:
LinkedList containing the found delimiters
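
The delimiter search described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the class name, the method name findDelims and the pattern are assumptions. The pattern treats one or more shad characters ("/" in Wylie) plus trailing whitespace as a sentence delimiter; the expression actually used by Tokeniser is not shown on this page.

```java
import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceDelimSketch {
    // Hypothetical delimiter pattern: a run of "/" followed by optional whitespace.
    // The real expression used by Tokeniser may differ.
    static final Pattern SENTENCE_DELIM = Pattern.compile("/+\\s*");

    // Collects every delimiter occurrence in the order it appears in the text.
    static LinkedList<String> findDelims(String paraText) {
        LinkedList<String> delims = new LinkedList<>();
        Matcher m = SENTENCE_DELIM.matcher(paraText);
        while (m.find()) {
            delims.add(m.group());
        }
        return delims;
    }

    public static void main(String[] args) {
        // Two delimiters are found: "/ " and "//"
        System.out.println(findDelims("bkra shis bde legs/ thugs rje che//"));
    }
}
```

Keeping the matched strings, rather than only their positions, lets a caller reattach the exact delimiter text to each sentence afterwards.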

parseParagraph

public void parseParagraph(org.w3c.dom.Element paraElement,
                           java.lang.String paraText,
                           java.util.LinkedList regexpsEnd,
                           java.lang.String paraStart,
                           java.lang.String paraEnd)
This method parses a given paragraph into syllables and calls the parseSyllable-method for each token

Parameters:
paraElement - to which syllables are appended
paraText -
regexpsEnd -
paraStart -
paraEnd -

parseSentence

public void parseSentence(org.w3c.dom.Element sentenceElement,
                          java.lang.String sentenceText,
                          java.lang.String sentEnd,
                          java.lang.String paraStart,
                          java.lang.String paraEnd)
This method parses a given sentence into syllables and calls the parseSyllable-method for each token

Parameters:
sentenceElement - to which syllables are appended
sentenceText -
sentEnd -
paraStart -
paraEnd -

parseSyllable

public void parseSyllable(org.w3c.dom.Element token,
                          java.lang.String sylText)
This method parses a given syllable into its slots 1-5 and shows the distribution of each character onto the slots. The resulting slot elements are appended to an xml-document.

Parameters:
token -
sylText -

checkRoot

public void checkRoot(java.lang.String root,
                      java.util.LinkedList syllable,
                      org.w3c.dom.Element sylElement)
This method checks for a given char whether it is the root or possibly a slot3-filler, in which case the actual root is before it. Once the root is identified, whatever precedes it (i.e. the prefix) is examined by checkPrefix.

Parameters:
root - The (suggested) root
syllable - LinkedList of Strings for further processing
sylElement -
Throws:
java.lang.IllegalArgumentException - for unparseable syllable structure

checkPrefix

public void checkPrefix(java.lang.String root,
                        java.util.LinkedList syllable,
                        org.w3c.dom.Element sylElement)
This method gets the syllable and the root character so that it can extract the chars that precede the root. Those are checked for slot2 and slot1.

Parameters:
root - the syllable root
syllable - LinkedList of Strings for further processing
sylElement -
Throws:
java.lang.IllegalArgumentException - for unparseable syllable structure

checkSuffix

public void checkSuffix(java.lang.String root,
                        java.util.LinkedList syllable,
                        org.w3c.dom.Element sylElement)
This method gets the syllable and the vowel so that it can extract the chars that follow the vowel. Those are checked for slot4 (and slot4vowel) and slot5.

Parameters:
root - The syllable root letter
syllable - LinkedList of Strings for further processing
sylElement -
Throws:
java.lang.IllegalArgumentException - for unparseable syllable structure
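
As a rough illustration of the suffix step, the sketch below locates the first vowel in a list of letters and assigns the letters that follow it to slot4 and slot5. The class name, the method assignSuffixSlots, the vowel list and the returned map are all assumptions made for this example. It does not model slot4vowel or the per-slot letter lists that the module loads from its XML file; it only shows the positional idea and the IllegalArgumentException for an unparseable structure.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SuffixSlotSketch {
    // Assumed vowel inventory for the illustration.
    static final List<String> VOWELS = Arrays.asList("a", "i", "u", "e", "o");

    // Maps the letters after the first vowel onto "slot4" and "slot5".
    static Map<String, String> assignSuffixSlots(List<String> letters) {
        int vowelPos = -1;
        for (int i = 0; i < letters.size(); i++) {
            if (VOWELS.contains(letters.get(i))) {
                vowelPos = i;
                break;
            }
        }
        if (vowelPos < 0) {
            throw new IllegalArgumentException("no vowel in " + letters);
        }
        List<String> tail = letters.subList(vowelPos + 1, letters.size());
        if (tail.size() > 2) {
            // More letters than the two suffix slots can hold.
            throw new IllegalArgumentException("unparseable suffix in " + letters);
        }
        Map<String, String> slots = new LinkedHashMap<>();
        if (tail.size() >= 1) {
            slots.put("slot4", tail.get(0));
        }
        if (tail.size() == 2) {
            slots.put("slot5", tail.get(1));
        }
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(assignSuffixSlots(Arrays.asList("b", "s", "g", "r", "u", "b", "s")));
    }
}
```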

preprocessSyll

public java.util.LinkedList preprocessSyll(java.lang.String sylText)
This method preprocesses the syllable: the longest prefix/root letter is determined and all letters are stored in a list, which is then used for further processing.

Parameters:
sylText -
Returns:
LinkedList containing Wylie letters (with longest possible letter combinations)
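
The "longest possible letter combinations" behaviour can be sketched as a greedy longest-match split. The class name, the method split and the letter set are assumptions for illustration; the module reads its admitted symbols from its own data (see buildAlphabet), and its handling of ambiguous sequences may differ from plain greedy matching.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

class WylieLetterSketch {
    // Illustrative letter inventory; not taken from the module's data files.
    static final Set<String> LETTERS = new HashSet<>(Arrays.asList(
            "k", "kh", "g", "ng", "c", "ch", "j", "ny", "t", "th", "d", "n",
            "p", "ph", "b", "m", "ts", "tsh", "dz", "w", "zh", "z", "'",
            "y", "r", "l", "sh", "s", "h", "a", "i", "u", "e", "o"));
    static final int MAX_LEN = 3; // longest entry above is "tsh"

    // Splits a syllable into letters, always preferring the longest match.
    static LinkedList<String> split(String sylText) {
        LinkedList<String> letters = new LinkedList<>();
        int pos = 0;
        while (pos < sylText.length()) {
            int len = Math.min(MAX_LEN, sylText.length() - pos);
            while (len > 0 && !LETTERS.contains(sylText.substring(pos, pos + len))) {
                len--;
            }
            if (len == 0) {
                throw new IllegalArgumentException(
                        "unknown symbol at position " + pos + " in " + sylText);
            }
            letters.add(sylText.substring(pos, pos + len));
            pos += len;
        }
        return letters;
    }

    public static void main(String[] args) {
        System.out.println(split("tshogs")); // [tsh, o, g, s]
    }
}
```

Trying the longest candidate first is what keeps "tsh" from being read as "ts" followed by "h".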

checkWord

public int checkWord(java.lang.String syllable,
                     java.lang.String[] syllArray,
                     int sylPos)
This method checks whether there are multi-syllabic words containing the given syllable (particle) and matches them against the whole syllable array

Parameters:
syllable - a syllable which is a particle
syllArray - all syllables of the sentence
sylPos - position of syllable in syllable array
Returns:
integer that marks how many syllables (preceding it) belong to the found word
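
One way to picture the matching is sketched below. It assumes that a multi-syllabic word ends at the particle and that the lexicon stores words as space-separated syllables; neither assumption is confirmed by this page. The class name, the method precedingSyllables and the two lexicon entries are illustrative only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ParticleWordSketch {
    // Illustrative lexicon entries, stored as space-separated syllables.
    static final Set<String> LEXICON = new HashSet<>(Arrays.asList("dge ba", "rgyal ba"));

    // Returns how many syllables before sylPos form a lexicon word together
    // with the particle at sylPos, trying the longest candidate first.
    // Returns 0 if no such word is found.
    static int precedingSyllables(String particle, String[] syllArray, int sylPos) {
        if (!syllArray[sylPos].equals(particle)) {
            throw new IllegalArgumentException("particle is not at position " + sylPos);
        }
        for (int k = sylPos; k >= 1; k--) {
            String candidate = String.join(" ",
                    Arrays.copyOfRange(syllArray, sylPos - k, sylPos + 1));
            if (LEXICON.contains(candidate)) {
                return k;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] sentence = {"nga", "dge", "ba", "yin"};
        System.out.println(precedingSyllables("ba", sentence, 2)); // 1
    }
}
```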