sotaq - segmentation and stressing
What is sotaq?
sotaq is a program that reads a collection of phrases and prints for
each a decomposition into rhythmic segments, with secondary stresses,
following a model based on Optimality Theory.
At least, that is what it is supposed to do. Actually, sotaq is very
experimental, and may be based on a lot of misunderstanding, in particular
misunderstanding of OT. So, to make its description more precise, we will
define a few terms as they are used in this document, which may be at
variance with normal use (this bug should be corrected in the future):
- syllable
- a sequence of letters.
- phrase
- a sequence of syllables.
- segment
- a sequence of successive syllables in a phrase, with exactly one of them
singled out as stressed. A lexical stress is a stress
fixed on input, which must be respected throughout.
- segment decomposition
- a sequence of disjoint segments covering all the syllables of a phrase.
To each segment decomposition an integer cost is assigned, and
sotaq outputs the decompositions of minimum cost. The cost is the
sum of the individual costs assigned to its segments, plus the sum of costs
assigned to pairs of successive segments.
Each individual cost is a sum over criteria; each criterion has a
value and a weight, and contributes its value multiplied by its weight.
The value is computed on each segment or
pair of segments, and may take into account properties like length, position
of the stress, its relation to lexical components of the phrase, and so on.
The weight is just a number assigned to a criterion, and can be used to
establish a hierarchy of preference among criteria.
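The cost model just described can be sketched in a few lines of Python. This is only an illustration, not sotaq's own code: the representation of a segment as a (start, end, stress) triple of syllable indices, and the two criteria `stress_not_initial` and `stress_clash`, are assumptions made for the example.

```python
# A segment is modelled as (start, end, stress): syllable indices with
# start <= stress < end. This representation is an assumption of the sketch.

def decomposition_cost(segments, seg_criteria, pair_criteria):
    """Sum weight * value over all segments and all pairs of successive segments.

    seg_criteria:  list of (value_function(segment), weight)
    pair_criteria: list of (value_function(left, right), weight)
    """
    total = 0
    for seg in segments:
        total += sum(weight * value(seg) for value, weight in seg_criteria)
    for left, right in zip(segments, segments[1:]):
        total += sum(weight * value(left, right) for value, weight in pair_criteria)
    return total

# Two hypothetical 0/1 criteria, for illustration only.
def stress_not_initial(seg):
    start, _end, stress = seg
    return int(stress != start)

def stress_clash(left, right):
    return int(right[2] - left[2] == 1)

segments = [(0, 2, 0), (2, 5, 3)]
print(decomposition_cost(segments, [(stress_not_initial, 10)], [(stress_clash, 1)]))  # 10
```

Only the second segment has a non-initial stress, and the two stresses are three syllables apart, so the total is 10.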
Relation to optimality theory
An OT-based model would have a hierarchy of conditions and count violations
of them, so that any number of violations of lower-ranked conditions is preferred
over a single violation of a higher-ranked condition. Let us see an example:
Suppose we have four conditions, ranked on three levels as:
SegMax,SegMin >> AlignI/L >> AlignW/L
where the symbol >> points from high rank to low. To make sotaq rank
segment decompositions accordingly, one needs:
- A criterion for each condition, supported internally in the program.
The value of a segment according to each criterion is 1 if the segment
violates the condition, and 0 otherwise.
- Weights must be chosen to reflect the hierarchy. Typically one would
get the desired results with weights 100, 10, 1. To be mathematically
safe, each weight should be at least n+1 times
the next one, where n is the number of syllables in the phrase. The
weights can be chosen when calling the program.
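As a small worked example of the n+1 rule, the following Python sketch computes one weight per rank, from highest to lowest. The function name `strict_weights` is hypothetical; it simply restates the rule above and is not part of sotaq.

```python
def strict_weights(ranks, n_syllables):
    """Return weights from highest to lowest rank, each (n+1) times the next.

    ranks:       number of levels in the hierarchy
    n_syllables: number of syllables n in the longest phrase to be processed
    """
    factor = n_syllables + 1
    return [factor ** k for k in range(ranks - 1, -1, -1)]

print(strict_weights(3, 9))   # [100, 10, 1]
print(strict_weights(3, 14))  # [225, 15, 1]
```

By this rule, the typical weights 100, 10, 1 are safe for phrases of up to 9 syllables; longer phrases would call for a larger factor.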
Installing sotaq
First of all, to run sotaq you need perl, version 5.x (any
x). To check whether perl is installed on your computer, go to a command
line and type the command
perl -v
The ensuing messages will be quite clear. If your computer hasn't got perl
installed, there are two options: change computers, or install perl. The
first option may be a real one in a lab; the second is feasible whatever
computer you are using, but you'll probably need some help.
Now get sotaq, by saving this link. Your browser
has an option for this, probably shift+some button.
This is it. You are ready to run sotaq. However, if your operating system is
Unix or Linux, there is an optional step that will pay off in later
convenience.
First, execute
which perl
The computer will print something like /usr/local/bin/perl or
/usr/bin/perl. Now, get sotaq into a text editor, and
look at the first line:
#!/usr/local/bin/perl
edit it so that what comes after the magical characters #! is exactly
what you got from that which. For instance, on a Linux system you would have
to remove the /local part. Now, save the file, and execute the
following command:
chmod 755 sotaq
and that's it. You are ready to use sotaq.
Using sotaq
sotaq is a filter, that is, it reads from standard input, writes to
standard output. That means that you call it like this, from a shell or DOS
window:
perl sotaq [options] < infile > outfile
or, in case you followed the last installation steps:
sotaq [options] < infile > outfile
Here,
- outfile is the name of the file where the results will be
saved. If the > outfile part is left out altogether, the
results will be printed on your screen (I like to work in an emacs shell
buffer for this).
- infile is the name of the input file, containing a collection
of annotated phrases to be processed. The format of the input file is
explained below.
- options
There are many different options, and most of them have the form
--name=N
meaning "assign weight N to the criterion dubbed
name". For a full list of options, execute:
sotaq --help
A section below details the current options related to criteria. There
is a further option --debug=N, that only concerns those that
want to fiddle with the source code. The source explains it.
The output will consist of a listing of options, followed by, for each input
phrase:
- A representation of its encoding, preceded by the string
"I: ". It shows in a semi-graphical way the split into
lexical items and lexical stresses.
- All min cost segment decompositions, each preceded by the string
"O: ", in a semi-graphical way that resembles the
presentation of the original data.
- The min cost value and solution count.
The input file
There are two types of input files:
- Annotated phrases
- Each phrase is presented as a collection of separated syllables, each
syllable preceded by a number coding some properties; each property code
is a number, and joint properties are coded by adding the corresponding
numbers. The current recognized properties are:
- Has lexical stress -- code 1.
- Starts a lexical item -- code 2.
Codes and syllables are separated by one or more spaces. A
phrase can be split across several successive input lines,
provided each line but the last ends with a backslash
\ (with no spaces after the \ ).
A sample input file can be copied from here.
- Bit encoding
- For an annotated phrase as above, keep just the codes; write each as a
two bit binary number, and string those numbers into a single bit-string.
This is maintained for compatibility with older programs, and may be
deprecated in the future. A sample input file can be copied from here.
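To make the two formats concrete, here is a minimal Python sketch that reads annotated phrases and converts them to the bit encoding. It is an illustration under assumptions, not sotaq's parser: the function names are hypothetical, the sample phrase "2 o 3 ga 0 to" is made up, and each code is assumed to be written most significant bit first (so code 2 becomes "10").

```python
def join_continuations(lines):
    """Merge every line ending in a backslash with the line that follows it."""
    phrases, pending = [], ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            pending += line[:-1] + " "   # drop the backslash, keep reading
        else:
            pending += line
            if pending.strip():
                phrases.append(pending)
            pending = ""
    if pending.strip():                  # input ended on a continuation line
        phrases.append(pending)
    return phrases

def parse_phrase(text):
    """Turn 'code syl code syl ...' into a list of (code, syllable) pairs."""
    tokens = text.split()                # any nonzero number of spaces
    if len(tokens) % 2:
        raise ValueError("expected alternating codes and syllables")
    return [(int(code), syl) for code, syl in zip(tokens[0::2], tokens[1::2])]

def to_bits(pairs):
    """Keep only the codes and write each as a two-bit binary number."""
    return "".join(format(code, "02b") for code, _syl in pairs)

lines = ["2 o 3 ga \\", "0 to"]          # one phrase split over two lines
pairs = parse_phrase(join_continuations(lines)[0])
print(pairs)           # [(2, 'o'), (3, 'ga'), (0, 'to')]
print(to_bits(pairs))  # 101100
```

In the sample, code 3 = 1 + 2 marks a syllable that both starts a lexical item and carries lexical stress.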
The options
The following options give weights to criteria which test the violation of a
condition on a segment. Thus, for each criterion, the value of the segment is
1 if the condition is violated and 0 otherwise. Some of the implemented criteria come
from linguistic modeling; others are just a programmer's fancy.
| Option | Condition |
| --ini=N | stress at the first syllable of the segment |
| --max2=N | segment has length at most 2 |
| --max4=N | segment has length at most 4 |
| --min2=N | segment has length at least 2 |
| --integ=N | segment is contained in a lexical item |
| --integ=N | segment is contained in the union of a lexical item with a preceding functional word (unstressed monosyllable) |
| --acmono=N | stress not on a monosyllable |
| --clash=N | adjacent segments without adjacent stressed syllables |
| --clashint=N | adjacent segments within a lexical item without adjacent stressed syllables |
| --lapse=N | adjacent segments with at most one syllable between stresses |
| --lapseint=N | adjacent segments within a lexical item with at most one syllable between stresses |
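A few of the single-segment conditions in the table can be written as 0/1 values in Python, as sketched below. This is a reading of the table, not sotaq's source: the (start, end, stress) segment representation is an assumption, and "contained in a lexical item" is interpreted as "no syllable after the first one in the segment starts a lexical item" (the first of the two integ rows).

```python
WORD_START = 2  # property code "starts a lexical item"

def zero_one_values(seg, codes):
    """Return 1 for each condition the segment violates, 0 otherwise.

    seg:   (start, end, stress), syllable indices with start <= stress < end
    codes: property code of every syllable in the phrase
    """
    start, end, stress = seg
    length = end - start
    return {
        "ini": int(stress != start),     # stress at the first syllable
        "max2": int(length > 2),         # length at most 2
        "max4": int(length > 4),         # length at most 4
        "min2": int(length < 2),         # length at least 2
        # assumed reading: a word boundary strictly inside the segment
        "integ": int(any(codes[i] & WORD_START for i in range(start + 1, end))),
    }

codes = [2, 3, 0]
print(zero_one_values((0, 3, 1), codes))
# {'ini': 1, 'max2': 1, 'max4': 0, 'min2': 0, 'integ': 1}
```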
Some of those criteria suggest a measure of failure, not simply occurrence of
failure. So, this suggests additional criteria, whose values are not just 0-1:
| Option | Value |
| --inidist=N | distance of stress from beginning of segment, discounting the first syllable if it is a word |
| --bindist=N | distance of segment length from 2 |
| --integc=N | one less than number of words touched by segment |
| --integm=N | adjacent segments with adjacent stressed syllables |
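Two of these graded values are simple enough to sketch in Python. As before, the (start, end, stress) representation is an assumption, and a word is taken to begin at every syllable whose code includes 2, so the number of words touched is one plus the number of word starts strictly inside the segment.

```python
WORD_START = 2  # property code "starts a lexical item"

def bindist(seg):
    """Distance of the segment length from 2."""
    start, end, _stress = seg
    return abs((end - start) - 2)

def integc(seg, codes):
    """One less than the number of words touched by the segment."""
    start, end, _stress = seg
    return sum(1 for i in range(start + 1, end) if codes[i] & WORD_START)

codes = [2, 3, 0, 2, 0]
seg = (1, 5, 1)            # four syllables, spanning a word boundary at index 3
print(bindist(seg))        # 2
print(integc(seg, codes))  # 1
```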
So, for instance, if you want to rank the options as ini >> max2 >>
integ, and the input is in the file example, you can get the best segment
decompositions by calling:
- sotaq --ini=1000 --max2=10 --integ=1 < example
Further developments
The only syllable properties currently being considered are start of word and
lexical stress. Maybe other properties of linguistic significance can be
considered, so that other criteria can be applied in evaluating segments. For
instance, it may be relevant if a syllable starts (or ends) in a vowel, if it
is followed by a punctuation mark, if it can be muted, and so on.
Some initial tests involving phrases with annotated secondary stresses suggest
that, at least with the already implemented criteria, just a plain ranking of
boolean conditions cannot achieve the desired results. That has led to the
introduction of numerical criteria. Besides, one can play with similar
weights for different criteria. If that yields good results, it will be a
tough act for the theory to follow.
In introducing new criteria, it is important to understand the following
requirement of the model: the value of a segment should be computable using
only information about the segment and syllable properties - it cannot depend
on other segments. Similarly, the value of a pair of successive segments
should be computable without reference to any other segments.
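One consequence of this locality requirement is that a minimum cost decomposition can be found by dynamic programming over segments, without listing every decomposition. The Python sketch below illustrates the idea; it is not necessarily how sotaq works internally. The (start, end, stress) representation, the function names, and the reading of "respecting lexical stress" (a segment holds at most one lexically stressed syllable, which must then be its stress) are all assumptions. Unlike sotaq, it returns a single optimal decomposition rather than all of them.

```python
LEX_STRESS = 1  # property code "has lexical stress"

def candidate_segments(codes):
    """Yield every (start, end, stress) compatible with the lexical stresses."""
    n = len(codes)
    for start in range(n):
        for end in range(start + 1, n + 1):
            lexical = [i for i in range(start, end) if codes[i] & LEX_STRESS]
            if len(lexical) > 1:
                break  # any longer segment also holds two lexical stresses
            for stress in (lexical or range(start, end)):
                yield (start, end, stress)

def best_decomposition(codes, seg_cost, pair_cost):
    """Return (minimum cost, one decomposition achieving it)."""
    n = len(codes)
    if n == 0:
        return 0, []
    by_end = {e: [] for e in range(n + 1)}  # processed segments, by end index
    best = {}  # segment -> (cheapest cost of a prefix ending in it, predecessor)
    for seg in candidate_segments(codes):   # generated in order of start index
        start, end, _stress = seg
        own = seg_cost(seg)
        if start == 0:
            best[seg] = (own, None)
        else:
            # Only the immediately preceding segment matters, by locality.
            best[seg] = min((best[p][0] + pair_cost(p, seg) + own, p)
                            for p in by_end[start])
        by_end[end].append(seg)
    cost, seg = min((best[s][0], s) for s in by_end[n])
    path = []
    while seg is not None:
        path.append(seg)
        seg = best[seg][1]
    return cost, path[::-1]

# Hypothetical weights: ini=100, max2=10, min2=1, and clash=1 on pairs.
def seg_cost(seg):
    start, end, stress = seg
    length = end - start
    return 100 * (stress != start) + 10 * (length > 2) + 1 * (length < 2)

def pair_cost(left, right):
    return 1 * (right[2] - left[2] == 1)

print(best_decomposition([2, 1, 0, 0], seg_cost, pair_cost))
# (3, [(0, 1, 0), (1, 3, 1), (3, 4, 3)])
```

A criterion whose value depended on non-adjacent segments would break the step marked "by locality", which is why the requirement matters when adding new criteria.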
Arnaldo Mandel <am@ime.usp.br>
Last modified: Fri Aug 13 19:22:01 EST 1999