marytts.fst

## Class AlignerTrainer

• ```public class AlignerTrainer
extends Object```
This trains an alignment model between Strings. Applications are for example letter-to-sound rule training (see LTSTrainer) or transducer construction/minimization.

The basic idea is to perform a Levenshtein search for the cheapest path and read off an alignment from that. The costs used in the distance computation are not uniform but estimated in an iterative process, according to -log of the relative frequencies of the respective operations in the previous iteration. Perform several iterations (e.g. 4) of aligning in order to get stable estimates of the costs (and a good alignment in turn).

The algorithm, in its essence, is implemented after a description of Levenshtein distance as it can be found in Wikipedia (see below); consider the costs used in the pseudo-code:

``` d[i, j] := minimum
(
d[i-1, j] + 1,  // deletion
d[i, j-1] + 1,  // insertion
d[i-1, j-1] + 1 // substitution
)
```
In our implementation there are only two operations, corresponding to deletion and insertion. So, if you look at the matrices in the wiki article, you can only go down and to the right, but not diagonal. Second, the costs are not 1 but set as explained in the following (note that this is a heuristic that seems to work fine but not a derived EM-algorithm).

"insertion" menas in our case, to insert something for (dependent on) the current input symbol. The cost for this operation is lower if the two symbols were already aligned in the preceding iteration, they are set to -log P(output-symbol|"insertion",input-symbol).

"deletion" means in our case to go to the next input symbol. If a deletion operation is performed without an preceding insertion operation (i.e. two subsequent deletion operations) this is called a "skip" and will produce costs, going to the next symbol after an insertion is free (this is to avoid unaligned input symbols). The skip costs are estimated from the preceding iteration and set to -log P(skip|"deletion").

We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
therefore the three arrays for all information and the swapping statements in the align method. (note that what are rows in Wikipedia are columns here)
Author:
benjaminroth
Computing Levenshtein distance
• ### Field Summary

Fields
Modifier and Type Field and Description
`protected Set<String>` `graphemeSet`
`protected List<String[]>` `inSplit`
`protected org.apache.log4j.Logger` `logger`
`protected List<String>` `optInfo`
`protected List<String[]>` `outSplit`
• ### Constructor Summary

Constructors
Constructor and Description
`AlignerTrainer()`
New AlignerTrainer for pairs of different symbol sets with no optional info.
```AlignerTrainer(boolean inIsOutAlphabet, boolean hasOptInfo)```
• ### Method Summary

Methods
Modifier and Type Method and Description
`void` ```addAlreadySplit(List<String> inStr, List<String> outStr)```
`void` ```addAlreadySplit(List<String> inStr, List<String> outStr, String optionalInfo)```
`void` ```addAlreadySplit(String[] inStr, String[] outStr)```
`void` ```addAlreadySplit(String[] inStr, String[] outStr, String optionalInfo)```
`int[]` ```align(String[] istr, String[] ostr)```
This computes the alignment that has the lowest distance between two Strings.
`void` `alignIteration()`
One iteration of alignment, using adapted Levenshtein distance.
`StringPair[]` `getAlignment(int entryNr)`
gets an alignment of the graphemes to the phones of an entry.
`String[]` `getAlignmentString(int entryNr)`
`StringPair[]` `getInfoAlignment(int entryNr)`
gets an alignment of the graphemes to the phones of an entry.
`Set<String>` `getInputSyms()`
`int` `lexiconSize()`
`void` ```readLexicon(BufferedReader lexicon, String splitSym)```
This reads a lexicon where input and output strings are separated by a delimiter that can be specified (splitSym).
`void` ```splitAndAdd(String inStr, String outStr)```
This adds the input and output string in the most simple way: symbols are simply the characters of the strings - no phonemisation/syllabification or whatsoever is performed.
• ### Methods inherited from class java.lang.Object

`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`
• ### Field Detail

• #### optInfo

`protected List<String> optInfo`
• #### inSplit

`protected List<String[]> inSplit`
• #### outSplit

`protected List<String[]> outSplit`
• #### graphemeSet

`protected Set<String> graphemeSet`
• #### logger

`protected org.apache.log4j.Logger logger`
• ### Constructor Detail

• #### AlignerTrainer

```public AlignerTrainer(boolean inIsOutAlphabet,
boolean hasOptInfo)```
Parameters:
`inIsOutAlphabet` - boolean indicating as input and output strings should be considered as belonging to the same symbol sets (alignment between identical symbol is then cost-free)
`hasOptInfo` - has opt info
• #### AlignerTrainer

`public AlignerTrainer()`
New AlignerTrainer for pairs of different symbol sets with no optional info.
• ### Method Detail

```public void readLexicon(BufferedReader lexicon,
String splitSym)
throws IOException```
This reads a lexicon where input and output strings are separated by a delimiter that can be specified (splitSym). Strings are taken as they are no normalization (eg. stress/syllable symbol removal, lower-casing ...) is performed; if space characters are present in the output string, it is used as a separator. In a third row additional info (eg. part of speech) can be given. Strings are stored split into symbols.
Parameters:
`lexicon` - reader for lexicon
`splitSym` - symbol to split columns of lexicon
Throws:
`IOException` - IOException

```public void splitAndAdd(String inStr,
String outStr)```
This adds the input and output string in the most simple way: symbols are simply the characters of the strings - no phonemisation/syllabification or whatsoever is performed. If outStr contains space characters, it is used as a separator for splitting.
Parameters:
`inStr` - inStr
`outStr` - outStr

```public void addAlreadySplit(List<String> inStr,
List<String> outStr)```

```public void addAlreadySplit(String[] inStr,
String[] outStr)```

```public void addAlreadySplit(List<String> inStr,
List<String> outStr,
String optionalInfo)```

```public void addAlreadySplit(String[] inStr,
String[] outStr,
String optionalInfo)```
• #### alignIteration

`public void alignIteration()`
One iteration of alignment, using adapted Levenshtein distance. After the iteration, the costs between a grapheme and a phone are set by the log probability of the phone given the grapheme. Analogously, The deletion cost is set by the log of deletion probability. In the first iteration, all operations cost maxCost.
• #### lexiconSize

`public int lexiconSize()`
• #### getAlignment

`public StringPair[] getAlignment(int entryNr)`
gets an alignment of the graphemes to the phones of an entry. a StringPair array is returned, where every entry contains a grapheme together with the phone sequence it is mapped to. The phone String is just the concatenation of the symbols in the aligned sequence.
Parameters:
`entryNr` - nr of the lexicon entry
Returns:
listArray
• #### getAlignmentString

`public String[] getAlignmentString(int entryNr)`
• #### getInfoAlignment

`public StringPair[] getInfoAlignment(int entryNr)`
gets an alignment of the graphemes to the phones of an entry. a StringPair array is returned, where every entry contains a grapheme together with the phone sequence it is mapped to. The phone String is just the concatenation of the symbols in the aligned sequence. In addition, the extra info (eg. POS) is appended as one symbol on the input side.
Parameters:
`entryNr` - nr of the lexicon entry
Returns:
listArray
• #### getInputSyms

`public Set<String> getInputSyms()`
• #### align

```public int[] align(String[] istr,
String[] ostr)```
This computes the alignment that has the lowest distance between two Strings. There are three differences to the normal Levenshtein-distance: 1. Only insertions and deletions are allowed, no replacements (i.e. no "diagonal" transitions) 2. insertion costs are dependent on a particular phone on the input side (the one they are aligned to) 3. deletion is equivalent to a symbol on the input side that is not aligned. There are costs associated with that. The method returns for each input symbol the indix of the right alignment boundary. eg. for input ['a','b'] and output ['a','a','b'] a correct alignment would be: [2,3]
Parameters:
`istr` - the input string
`ostr` - the output string
Returns:
length of p_al[ostr]