The definitive version of this article can be found on Springer Verlag's site, LNCS 1660
by Ricardo Ueda Karpischek
ueda@ime.usp.br
http://www.ime.usp.br/~ueda
|
We present biba, a package designed to deal with representations of large automata. It offers a library able to build, even on a modest computer, automata where the sum of the numbers of states and edges achieves one billion or more. Two applications that use this library are provided as examples. They build the reduced automaton for a given vocabulary, and the suffix automaton of a given word. New programs can be developed using this library. In order to overcome physical memory limitations, biba implements a paging scheme, in such a way that the automata really reside on disk, making possible their permanent storage. Through a simple interface suited for perl, small scripts can be easily written to use and extract informations from these automata. |
Large and complex automaton-based data structures may arise when
dealing with massive data sets. These automata can be viewed as
ways to extract information from those sets, as well as ways to
put and maintain them in a form suitable for specific
applications, like search engines or spelling checkers. In this
second case, the automaton is not a temporary need, but one
definitive format under which the data set (dictionary) will be
maintained.
Developed in a first moment to apply statistical tests in
biological sequences [10], biba is mainly a C
programming library, but it also contains programs built over
this library offered as applications. It is able to make
user-level memory paging, in the same sense that this term
is used in operating systems textbooks (see for instance Bach
[1]). However, biba's paging implementation is not only a
way to overcome physical memory limitations, but it is also and
mainly an efficient way to permanently store automata on disk.
Paging will be a very interesting option in those cases where the
automaton can compare to a large relational database, that
resides on disk but is continuously being consulted and changed
punctually along a large interval of time. An example we can
think about is an automaton-based index for a large dynamic
textual database, as glimpse [7] does.
In this paper we present the main features and characteristics of
biba. A complete technical description may be found along
the man pages and the source code, freely available in the URL
given at the end of the paper. It was tested by the author in
x86 and axp Linux, but the library will
compile and run with few or no changes in almost any 32 or 64-bit
platform for which a C compiler is available.
We introduce the features of biba, through some
examples. For now, the visible part of biba are just two
programs that integrate the package, reduce and
suffaut.
The reduced automaton that recognizes a set of words, explicited
one per line in a file, can be built using the reduce
tool, that implements the algorithm due to Revuz [8]:
The automaton just built resides in the pair of files
a.aut.st (the states) and a.aut.ed (the
edges). These are ordinary files, that you can copy, compress or
transfer. Because of practical limitations that the user may need
to deal with, we allow to split the files that contain an automaton
in many parts, so one can combine the available areas of many
filesystems, and take advantage of using disks in parallel. For
instance, the following command will divide a.aut.ed in
two parts, the first of them containing just 25 megabytes:
Next we present suffaut, a program that builds
the suffix automaton of a given word, using the algorithm due to
Blumer et alli [2]. In this same example we show how one paging
parameter (among others) can be controlled through options. In
this case we're specifying that the maximum ratio between total
virtual memory size and total physical (RAM) memory size is
5. Just like in the first example, the automaton will reside in
the files a.aut.st and a.aut.ed. We chose a
binary word (the file /bin/sh seen as a sequence of
bytes):
This command took 64 seconds to complete in a 486 100Mhz with 32
megabytes of main memory. By default suffaut allocates 5
megabytes of RAM before allowing the ratio between virtual and
physical memory becomes larger than 1. As the page size is 4096,
suffaut allocated 1300 RAM pages. The final sizes for
files a.aut.st and a.aut.ed were 5.5 and 3.2
megabytes, so suffaut allocated 13 bytes per state and 5
bytes per edge. These numbers will be explained in the next
section (note that the algorithm requires two additional 32-bit
fields per state).
Now we'll combine these two programs in order to test the
accuracy of both. Let's consider Good-de Bruijn words over
the alphabet {0,1}. The scprit good generates such a
word with parameter p, that is, a word w where each
element of {0,1}p occurs just once as a factor
of w. The number of states of the suffix automaton of
w can be theoretically computed as being
2p+1-1 [10]. We can build this automaton using
suffaut or reduce, and compare the results:
In order to generate the word, good uses the classical
construction based on Eulerian Tours. When called with the
-s switch, it generates the word and all its
suffixes. This script is available with the biba
distribution.
In the next two examples biba deals with natural language
dictionaries and biological vocabularies. In both cases we're
building reduced automata, in the first case for the entire
br.ispell dictionary for brazilian portuguese [11] (just
like done by Lucchesi and Kowaltowski in [5]), which contains
195539 entries (total size 2120562), and, in the second, for the
blocks protein database [4] (109660 entries, 3249189 total
size). The filter eb, which extracts the sequences from
the database distribution file removing comments, is provided in
the biba distribution.
The first command took just 240 seconds to complete in the
same 486 computer described. The second took 27 minutes. Note
that this time may vary largely depending on the choice of the
paging parameters. We're omitting the final size of the files
a.aut.st and a.aut.ed because in these cases
they're full of garbage. In fact, their sizes are not determined
by the automaton size, but by the vocabulary size, because the
algorithm first builds a trie (this behaviour may be partially
avoided using a command-line option). Anyway, at the end of the
construction, each state uses 9 bytes, and each edge uses 5
bytes.
The library can also be used for many other purposes, but develop
and link programs with the biba library is a
time-consuming exercise of C programming, so we offer a
simplistic alternative that opens partially the library's
internal functionality for scripts that can be quickly written
using perl or similar languages.
This trivial script reads the file specified as its first parameter,
reads each line, checks if it is or not spelled by the automaton
a.aut, and prints it in the affirmative case. Hence, this script
can be used as a filter. A more elaborate example is
contained in the biba distribution. It computes the
complexity of a biological sequence as define by Trifonov
in [9].
For the programmer we provide a short description of the C
API (for a detailed description please refer the man pages)..
Just like we said that automata are like files to the user, they
are like files to the programmer too. So deal with automata using
this API is like deal with files through the C language
library. Automata handles are provided, and they can be used to
create or specify an existing automaton, and to subsequently work
on it.
Contents
1. Introduction
2. An overview from the user's viewpoint
$ reduce words
$ biba -s 25 a.aut.ed a.aut-1.ed
$ suffaut -p 5 -f /bin/sh
word length: 300668
states: 428275 (14 finals) edges: 636565
ram pages: 1300 virtual pages: 2141
$ good 7 | suffaut
word length: 134
states: 255 (8 finals) edges: 381
$ good -s 7 | reduce
states: 255 (8 finals) edges: 381
$ ispell -e vocab -d br | reduce
states: 14119 edges: 36588
ram pages: 1300 virtual pages: 1303
$ eb | reduce -s 200000 -m 28 -c 28 -d 1000000
states: 1318407, edges: 1401826
ram pages: 6200 virtual pages: 6182
#!/usr/bin/perl
use FileHandle;
use IPC::Open2;
$pid = open2(\*Reader,\*Writer,"biba -w $ARGV[1]");
Writer->autoflush();
while (<stdin>) {
print (Writer "r\ns$_");
if (
3. Short API description
| create_aut | similar to create(2) |
|---|---|
| register_aut | similar to open(2) |
| close_aut | similar to close(2) |
In order to deal with automata states and edges, very simple services are provided. Each one require the specification of a handle, amongst other specific parameters. The complete list follows:
| new_state | alloc a new state |
|---|---|
| set_final | set (to a given value) the state final flag |
| is_final | read the state final flag |
| set_dest | set the destination of an edge |
| dest | read the destination of an edge |
| stdeg | read the output degree of a state |
| next_dest | visit all edges leaving a given state |
| clone | clone a state |
For low level configuration and session recovery support, the following services are provided:
| init_aut | initializes the API and specify paging parameters |
|---|---|
| stop_pgout | disallow page outs, in order to keep the consistency of the structures represented on disk. |
| flush_aut | synchronize the disk files with the contents of RAM memory |
States and edges reside on separate data structures, that grow as needed (currently they cannot shrink). The main goal of these data structures is to save memory, because we want to be able to represent large automata, so we need to try to minimize paging at the expense of a larger cost of in-memory operations.
The data structure that represents states is nothing more than a large array. Each entry of this array is a "state", composed of fixed fields (its output degree, a flag to identify if the state is final, and a pointer to the structure that represents its edges) and application-dependent fields, like flags and counters.
The data structure that represents the edges is an array too. It limits the alphabet size to 256, and currently does not support nondeterministic automata. In order to represent the edges that leave a state with output degree d, where d is in the range 0..256, it allocates just d contiguous entries of the array of edges. The allocation routines guarantee that these d entries reside on the same page, a fundamental property, needed to avoid performance degradation. Each edge is composed by a label field and a destination field which points to a state. Application-specific fields are not allowed.
To allocate a state is no more than increase by 1 the current size of the array of states, and manage the necessary changes in the data structure, which may include the allocation of a new page. Inserting an edge leaving a state demands the allocation of d contiguous entries of the array of edges (best-fit is used), where d is the resulting output degree. This is managed by the maintenance of free lists of free areas within this array, one per each possible free area size.
Both arrays reside on disk, but at each moment only some of their pages are loaded into memory, accordingly to the paging routines. Each state and each edge uses just 40 bits to be represented (the size of a state may be larger depending on the presence of application-specific fields). This number is due to current limit on the alphabet size, that cannot be larger than 256, plus the choice of a 31-bit addressing limit for states and edges. So if n is the sum of the number of states and edges of an automaton, then the total disk size needed to represent it will be 5n bytes.
Page replacement proceeds swapping out pages based on their
ages. The pages are circularly visited along the program
execution, just like a memory refresh circuit does. The age of a
page is the amount of time elapsed since the last time it was
acessed. Comparing the age of a page with the medium age computed
over all pages, one can decide if it will be swapped out or not.
The paging scheme by itself represents a performance penalty,
because of the additional memory and time comsumption needed to
manage and detect page faults and access memory. In a user-level
paging scheme like the one implemented in biba, this penalty
is more significant because we cannot use, as the operating
system kernel does, hardware features that make the paging more
efficient (and unportable).
In order to estimate the time penalty, we depend on more
programming effort. Using the suffix array tool in development
(details in the next section) we'll be able to make it.
On the other hand, regarding the memory usage, paging is not
critical. The tables and additional information needed to manage
paging by biba represent less than 1.3% of the total
amount of physical memory allocated (details on how this
percentage was computed can be found in the file mem.c of the
distribution).
Regarding edges representation, some enhancements are needed,
like support for nondeterministic automata. Changes are being
planned in order to allow the addition of application-specific
fields to edges, as well as the usage of larger alphabets and/or
support automata where the labels of the edges are words over a
given alphabet.
A search tool for WWW servers similar to glimpse [7] will
be released soon. It currently uses an index based on GNU
gdbm, but an automaton-based alternative is being added.
More general data structures besides automata can be represented
using biba. We plan to release soon a program that builds
the suffix array, and implements both the refinement algorithm
due to Manber and Myers [6] and the external sorting algorithm
due to Gonnet [3]. We wrote this program some time ago, and it is
currently being adapted to use the biba paging scheme.
A full access to the internal library capabilities though
perl scripts is being worked on. We currently do not have
a schedule for other automata or graph algorithms to be
implemented, but we're accepting suggestions.
We are thankful to Imre Simon by his continuous advice and
encouragement. We must acknowledge our master thesis examinators
Nivio Ziviani and Yoshiharu Kohayakawa. The biologists Romeu
Guimarães and Sérgio Mattioli pointed out biologic
resources. Carlos Juiti Watanabe and Eduardo Ermel Wronowski
helped to make the implementation. Thanks to Jean-Eric Pin,
Marcelo Oliveira, Marko Loparic and the referees for their
suggestions. This work was partly supported by Conselho Nacional
de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação
de Amparo à Pesquisa do Estado de São Paulo (FAPESP).
The sources are available under the terms of the GNU GPL, and can
be copied from
ftp://ftp.ime.usp.br/pub/ueda/biba. General, programming and
last-minute informations are available at
http://www.ime.usp.br/~ueda/biba.
[1] Maurice J. Bach. The Design of the Unix Operating
System. Prentice Hall, 1986.
[2] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, M. T. Chen
and J. Seiferas, The smallest automaton recognizing the subwords
of a text. Theoretical Computer Science, vol. 40, pp 31-55, 1985.
[3] Gaston H. Gonnet, Ricardo A. Baeza-Yates and
T. Snider. Lexicographical indices for text: inverted files
vs. PAT trees. TR OED-91-01, UW Centre for the New Oxford English
Dictionary, University of Waterloo, 1991.
[4] S Henikoff and JG Henikoff. Automated assembly of
protein blocks for database searching, Nucleic Acids
Res. 19:6565-6572, 1991.
[5] Cláudio L. Lucchesi and Tomasz Kowaltowski, Applications of
Finite Automata Representing Large Vocabularies, in Software
Practice and Experience, Vol. 23, pp. 15-30, 1993.
[6] Udi Manber and Gene Myers. Suffix arrays: A new method for
on-line string searches. SIAM Journal on Computing,
22(5):935-948, October 1993.
[7] Udi Manber and Sun Wu, GLIMPSE: A Tool to Search Through
Entire File Systems, Usenix Winter 1994 Technical Conference,
San Francisco pp. 23-32, 1994.
[8] Dominique Revuz, Minimization of Acyclic Deterministic
Automata in Linear Time. Theoretical Computer Science, vol. 92 pp
181-189, 1992.
[9] Edward N. Trifonov, Making sense of the human
genome. Structure & Methods, 1:69-77, 1990.
[10] Ricardo Ueda Karpischek. The Suffix Automaton (in
portuguese). Master Thesis, University of São Paulo, 1993.
[11] Ricardo Ueda Karpischek, br.ispell, a brazilian
portuguese dictionary for ispell,
http://www.ime.usp.br/~ueda/br.ispell/, 1995.
5. Limits and limitations
The limits imposed by addressing strategies or performance
requirements follow. We plan to relax some of them in future
releases of the library (see the section Future
Directions).
6. Notes on performance
7. Future Directions
8. Acknowledgements
9. Availability
10 Bibliography