sotaq - segmentation and stressing

# What is sotaq?

sotaq is a program that reads a collection of phrases and prints for each a decomposition into rhythmic segments, with secondary stresses, following a model based on Optimality Theory.

At least, that is what it is supposed to do. Actually, sotaq is very experimental, and may be based on a lot of misunderstanding, in particular misunderstanding of OT. So, to make its description more precise, we will define a few terms as they are used in this document, which may be at variance with normal use (this bug should be corrected in the future):

• syllable: a sequence of letters.
• phrase: a sequence of syllables.
• segment: a sequence of successive syllables in a phrase, with exactly one of them singled out as stressed. A lexical stress is a stress fixed on input, which must be respected throughout.
• segment decomposition: a sequence of disjoint segments covering all the syllables of a phrase.

To each segment decomposition an integer cost is assigned, and sotaq outputs the decompositions of minimum cost. The cost is the sum of the individual costs assigned to its segments, plus the sum of costs assigned to pairs of successive segments.

Each individual cost is a sum of contributions from criteria, each consisting of a value and a weight. The value is computed on each segment or pair of segments, and may take into account properties like length, position of the stress, its relation to lexical components of the phrase, and so on. The weight is just a number assigned to a criterion, and can be used to establish a hierarchy of preference among criteria.
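The cost model just described can be sketched in a few lines of Python. This is only an illustration, not sotaq's source: it assumes that each criterion contributes its value multiplied by its weight, it represents a segment as a hypothetical `(start, end, stress)` triple, and the three criteria and their weights are made up for the example.

```python
# A segment is (start, end, stress): it covers syllables start..end-1,
# and `stress` is the index of its stressed syllable (assumed representation).

def decomposition_cost(segments, seg_criteria, pair_criteria):
    """Sum weight * value over all segments and all pairs of successive segments."""
    total = 0
    for seg in segments:
        total += sum(weight * value(seg) for weight, value in seg_criteria)
    for left, right in zip(segments, segments[1:]):
        total += sum(weight * value(left, right) for weight, value in pair_criteria)
    return total


# Hypothetical 0-1 criteria, for illustration only.
def too_short(seg):
    return int(seg[1] - seg[0] < 2)

def stress_not_initial(seg):
    return int(seg[2] != seg[0])

def adjacent_stresses(left, right):
    return int(right[2] - left[2] == 1)


SEG_CRITERIA = [(10, too_short), (1, stress_not_initial)]
PAIR_CRITERIA = [(5, adjacent_stresses)]

# Two decompositions of a three-syllable phrase.
first = [(0, 2, 0), (2, 3, 2)]
second = [(0, 1, 0), (1, 3, 1)]
print(decomposition_cost(first, SEG_CRITERIA, PAIR_CRITERIA))   # 10
print(decomposition_cost(second, SEG_CRITERIA, PAIR_CRITERIA))  # 15
```

With these made-up weights the first decomposition is cheaper, so it would be the one preferred.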

# Relation to optimality theory

An OT-based model would have a hierarchy of conditions, and count violations of these, so that any number of violations of lower-ranked conditions is preferred over a single violation of a higher-ranked condition. Let us see an example:

Suppose we have four conditions, ranked in three levels:

SegMax,SegMin >> AlignI/L >> AlignW/L

where the symbol >> points from high rank to low. To make sotaq rank segment decompositions accordingly, one needs:

1. A criterion for each condition, supported internally in the program. The value of a segment according to each criterion is 1 if the segment violates the condition, and 0 otherwise.

2. Weights chosen to reflect the hierarchy. Typically one would get the desired results with weights 100, 10, 1. To be mathematically safe, each weight should be at least n+1 times the next one, where n is the number of syllables in the phrase. The weights are chosen when calling the program.
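The n+1 rule can be turned into a small helper. The sketch below is not part of sotaq; `strict_weights` is a hypothetical name, and it simply takes successive powers of n+1, which is the smallest choice satisfying the rule with a lowest weight of 1.

```python
def strict_weights(levels, n):
    """Weights for `levels` ranks, highest first, each n+1 times the next."""
    base = n + 1
    return [base ** (levels - 1 - k) for k in range(levels)]


# For a phrase of 9 syllables and three ranks this gives the weights
# mentioned above.
print(strict_weights(3, 9))  # [100, 10, 1]
```

The reasoning, under the assumption that every criterion is 0-1 and is counted at most once per segment: a phrase of n syllables has at most n segments, so all lower-ranked violations together cost at most n*(10 + 1) = 99 in this example, which is still less than a single violation at weight 100.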

# Installing sotaq

First of all, in order to run sotaq you need perl, version 5.x (any x). To check whether perl is installed on your computer, go to a command line and type the command
perl -v
The ensuing messages will be quite clear. If your computer hasn't got perl installed, there are two options: change computers, or install perl. The first option may be a real one in a lab; the second is feasible whatever computer you are using, but you'll probably need some help.

Now get sotaq, by saving this link. Your browser has an option for this, probably shift+some button.

This is it. You are ready to run sotaq. However, if your operating system is Unix or Linux, there is an optional step that will pay off in later convenience. First, execute

which perl
The computer will print something like /usr/local/bin/perl or /usr/bin/perl. Now, get sotaq into a text editor, and look at the first line:
#!/usr/local/bin/perl
Edit it so that what comes after the magic characters #! is exactly what which printed. For instance, on a Linux system you would have to remove the /local part. Now, save the file, and execute the following command:
chmod 755 sotaq
and that's it. You are ready to use sotaq.

# Using sotaq

sotaq is a filter, that is, it reads from standard input and writes to standard output. That means that you call it like this, from a shell or DOS window:
perl sotaq [options] < infile > outfile
or, in case you followed the optional installation step:
sotaq [options] < infile > outfile
here,
• outfile is the name of the file where the results will be saved. If the > outfile part is left out altogether, the results will be printed on your screen (I like to work in an emacs shell buffer for this).
• infile is the name of the input file, containing a collection of annotated phrases to be processed. The format of the input file is explained below.
• options
There are many different options, and most of them have the form
--name=N
meaning "assign weight N to the criterion dubbed name". For a full list of options, execute:
sotaq --help
A section below details the current options related to criteria. There is a further option --debug=N, that only concerns those that want to fiddle with the source code. The source explains it.
The output will consist of a listing of options, followed by, for each input phrase:
• A representation of its encoding, preceded by the string "I: ". It shows in a semi-graphical way the split into lexical items and lexical stresses.
• All minimum cost segment decompositions, each preceded by the string "O: ", in a semi-graphical way that resembles the presentation of the original data.
• The min cost value and solution count.

#### The input file

There are two types of input files:
Annotated phrases
Each phrase is presented as a collection of separated syllables, each syllable preceded by a number coding some properties; each property code is a number, and joint properties are coded by adding the corresponding numbers. The currently recognized properties are:
• Has lexical stress -- code 1.
• Starts a lexical item -- code 2.
Codes and syllables are separated by one or more spaces. A phrase can be split across several successive input lines, provided each line except the last ends with a backslash \ (no spaces after the \).
A sample input file can be copied from here.
Bit encoding
For an annotated phrase as above, keep just the codes; write each as a two bit binary number, and string those numbers into a single bit-string. This is maintained for compatibility with older programs, and may be deprecated in the future. A sample input file can be copied from here.
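The two input formats can be illustrated with a short Python sketch. It is not taken from sotaq: the function names are hypothetical, the sample phrase is made up, and it assumes that each code is written most significant bit first, so that code 2 (starts a lexical item) becomes 10 and code 1 (has lexical stress) becomes 01.

```python
def join_continued(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    phrases, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1] + " "
        else:
            phrases.append(pending + line)
            pending = ""
    if pending:
        phrases.append(pending)
    return phrases


def parse_annotated(phrase):
    """Turn 'code syllable code syllable ...' into a list of (code, syllable)."""
    tokens = phrase.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating codes and syllables")
    return [(int(code), syllable)
            for code, syllable in zip(tokens[0::2], tokens[1::2])]


def to_bits(pairs):
    """Keep just the codes, each written as a two-bit binary number."""
    return "".join(format(code, "02b") for code, _ in pairs)


# A made-up phrase split over two input lines (newlines already removed).
lines = ["2 o \\", "3 ga 0 to"]
pairs = parse_annotated(join_continued(lines)[0])
print(pairs)           # [(2, 'o'), (3, 'ga'), (0, 'to')]
print(to_bits(pairs))  # 101100
```

Here code 3 is the sum 1 + 2: the syllable both starts a lexical item and carries its lexical stress.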

#### The options

The following options give weights to criteria which test the violation of a condition on a segment. Thus, for each criterion, the value of the segment is 1 if the condition is violated, and 0 otherwise. Some of the implemented criteria come from linguistic modeling, others are just a programmer's fancy.
Option Condition
--ini=N stress at the first syllable of the segment
--max2=N segment has length at most 2
--max4=N segment has length at most 4
--min2=N segment has length at least 2
--integ=N segment is contained in a lexical item
--integ=N segment is contained in the union of a lexical item with a preceding functional word (unstressed monosyllable)
--acmono=N stress not on a monosyllable
--lapse=N adjacent segments with at most one syllable between stresses
--lapseint=N adjacent segments within a lexical item with at most one syllable between stresses

Some of those criteria suggest a measure of the degree of failure, not simply its occurrence. This leads to additional criteria, whose values are not just 0 or 1:
Option Value
--inidist=N distance of stress from beginning of segment, discounting the first syllable if it is a word
--bindist=N distance of segment length from 2
--integc=N one less than number of words touched by segment

So, for instance, if you want to rank the options as ini >> max2 >> integ, and the input is in the file example, you can get the best segment decompositions by calling:

sotaq --ini=1000 --max2=10 --integ=1 < example
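To make the difference between 0-1 values and numerical values concrete, here is a Python sketch of a few of the criteria listed above. It is a reading of the option tables, not sotaq's code: a segment is assumed to be a `(start, end, stress)` triple covering syllables start..end-1, the function names merely echo the option names, and criteria that need lexical information (such as integ or inidist) are left out.

```python
def length(seg):
    return seg[1] - seg[0]

# 0-1 criteria: 1 when the condition is violated, 0 otherwise.
def ini(seg):
    return int(seg[2] != seg[0])   # stress is not on the first syllable

def max2(seg):
    return int(length(seg) > 2)    # longer than 2

def max4(seg):
    return int(length(seg) > 4)    # longer than 4

def min2(seg):
    return int(length(seg) < 2)    # shorter than 2

# Numerical criterion: how far the length is from 2.
def bindist(seg):
    return abs(length(seg) - 2)


seg = (0, 5, 1)  # five syllables, stress on the second one
print(ini(seg), max2(seg), max4(seg), min2(seg))  # 1 1 1 0
print(bindist(seg))                               # 3

# Contribution of this segment under the weights --ini=1000 --max2=10,
# assuming each criterion adds weight * value.
print(1000 * ini(seg) + 10 * max2(seg))           # 1010
```

Note how max2 and max4 both report a single violation for this segment, while bindist records that it is three syllables away from the preferred length.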

## Further developments

The only syllable properties currently being considered are start of word and lexical stress. Maybe other properties of linguistic significance can be considered, so that other criteria can be applied in evaluating segments. For instance, it may be relevant if a syllable starts (or ends) in a vowel, if it is followed by a punctuation mark, if it can be muted, and so on.

Some initial tests involving phrases with annotated secondary stresses suggest that, at least with the already implemented criteria, just a plain ranking of boolean conditions cannot achieve the desired results. That has led to the introduction of numerical criteria. Besides, one can play with similar weights for different criteria. If that yields good results, it will be a tough act for the theory to follow.

In introducing new criteria, it is important to understand the following requirement of the model: the value of a segment should be computable using only information about the segment and syllable properties - it cannot depend on other segments. Similarly, the value of a pair of successive segments should be computable without reference to any other segments.
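One consequence of this locality requirement is that a minimum cost can be found by dynamic programming rather than by listing every decomposition. The Python sketch below shows the idea under stated assumptions; whether sotaq itself proceeds this way is not stated here. Segments are hypothetical `(start, end, stress)` triples, `seg_cost` and `pair_cost` stand for the already weighted sums of criteria, and lexical stresses are ignored.

```python
def segments_ending_at(end):
    """All (start, end, stress) segments whose last syllable is end-1."""
    return [(start, end, stress)
            for start in range(end)
            for stress in range(start, end)]


def min_cost(n, seg_cost, pair_cost):
    """Minimum cost of a segment decomposition of n >= 1 syllables."""
    # best[seg]: cheapest way to cover syllables 0..seg[1]-1 ending with seg.
    best = {}
    for end in range(1, n + 1):
        for seg in segments_ending_at(end):
            cost = seg_cost(seg)
            if seg[0] > 0:
                # Only the immediately preceding segment matters.
                cost += min(best[prev] + pair_cost(prev, seg)
                            for prev in segments_ending_at(seg[0]))
            best[seg] = cost
    return min(best[seg] for seg in segments_ending_at(n))


# Made-up costs: distance of length from 2 (weight 10), plus 1 if the
# stress is not initial; no cost on pairs.
def seg_cost(seg):
    return 10 * abs(seg[1] - seg[0] - 2) + int(seg[2] != seg[0])

def pair_cost(left, right):
    return 0

print(min_cost(4, seg_cost, pair_cost))  # 0
print(min_cost(3, seg_cost, pair_cost))  # 10
```

A criterion that looked at segments further away would break the inner step, since the cheapest prefix could no longer be chosen knowing only its last segment.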

Arnaldo Mandel <am@ime.usp.br>