The definitive version of this article can be found on Springer-Verlag's site, LNCS 1660


Paging Automata

by Ricardo Ueda Karpischek
ueda@ime.usp.br
http://www.ime.usp.br/~ueda


We present biba, a package designed to deal with representations of large automata. It offers a library able to build, even on a modest computer, automata where the sum of the numbers of states and edges reaches one billion or more. Two applications that use this library are provided as examples: they build the reduced automaton for a given vocabulary and the suffix automaton of a given word. New programs can be developed using this library. In order to overcome physical memory limitations, biba implements a paging scheme, in such a way that the automata really reside on disk, making their permanent storage possible. Through a simple interface suited for perl, small scripts can be easily written to use and extract information from these automata.


1. Introduction

Large and complex automaton-based data structures may arise when dealing with massive data sets. These automata can be viewed as ways to extract information from those sets, as well as ways to put and maintain them in a form suitable for specific applications, such as search engines or spelling checkers. In the second case, the automaton is not a temporary need, but the definitive format under which the data set (dictionary) will be maintained.

Initially developed to apply statistical tests to biological sequences [10], biba is mainly a C programming library, but it also contains programs built on top of this library, offered as applications. It is able to perform user-level memory paging, in the same sense in which this term is used in operating systems textbooks (see for instance Bach [1]). However, biba's paging implementation is not only a way to overcome physical memory limitations: it is also, and mainly, an efficient way to permanently store automata on disk.

Paging is a very interesting option in those cases where the automaton is comparable to a large relational database: it resides on disk but is continuously consulted and updated in small increments over a long period of time. One example is an automaton-based index for a large dynamic textual database, as glimpse [7] does.

In this paper we present the main features and characteristics of biba. A complete technical description may be found in the man pages and the source code, freely available at the URL given at the end of the paper. The package was tested by the author on x86 and axp Linux, but the library will compile and run with few or no changes on almost any 32 or 64-bit platform for which a C compiler is available.

2. An overview from the user's viewpoint

We introduce the features of biba through some examples. For now, the visible part of biba is just the two programs that make up the package, reduce and suffaut.

The reduced automaton that recognizes a set of words, listed one per line in a file, can be built using the reduce tool, which implements the algorithm due to Revuz [8]:

    $ reduce words

The automaton just built resides in the pair of files a.aut.st (the states) and a.aut.ed (the edges). These are ordinary files, which you can copy, compress or transfer. Because of practical limitations that the user may need to deal with, the files that contain an automaton may be split into many parts, so one can combine the available areas of many filesystems and take advantage of using disks in parallel. For instance, the following command will divide a.aut.ed in two parts, the first of them containing just 25 megabytes:

    $ biba -s 25 a.aut.ed a.aut-1.ed

Next we present suffaut, a program that builds the suffix automaton of a given word, using the algorithm due to Blumer et al. [2]. In this same example we show how one paging parameter (among others) can be controlled through options: in this case we're specifying that the maximum ratio between total virtual memory size and total physical (RAM) memory size is 5. Just like in the first example, the automaton will reside in the files a.aut.st and a.aut.ed. We chose a binary word (the file /bin/sh seen as a sequence of bytes):

    $ suffaut -p 5 -f /bin/sh
    word length: 300668
    states: 428275 (14 finals) edges: 636565
    ram pages: 1300 virtual pages: 2141

This command took 64 seconds to complete on a 100MHz 486 with 32 megabytes of main memory. By default suffaut allocates 5 megabytes of RAM before allowing the ratio between virtual and physical memory to become larger than 1. As the page size is 4096 bytes, suffaut allocated 1300 RAM pages. The final sizes of the files a.aut.st and a.aut.ed were 5.5 and 3.2 megabytes, so suffaut allocated 13 bytes per state and 5 bytes per edge. These numbers will be explained in the next section (note that the algorithm requires two additional 32-bit fields per state).

Now we'll combine these two programs in order to test the accuracy of both. Let's consider Good-de Bruijn words over the alphabet {0,1}. The script good generates such a word with parameter p, that is, a word w where each element of {0,1}^p occurs just once as a factor of w. The number of states of the suffix automaton of w can be theoretically computed as being 2^(p+1)-1 [10]; for p = 7 this gives 2^8-1 = 255. We can build this automaton using suffaut or reduce, and compare the results:

    $ good 7 | suffaut
    word length: 134
    states: 255 (8 finals) edges: 381

    $ good -s 7 | reduce
    states: 255 (8 finals) edges: 381

In order to generate the word, good uses the classical construction based on Eulerian tours. When called with the -s switch, it generates the word and all its suffixes. This script is available with the biba distribution.
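
For illustration, a binary Good-de Bruijn word can also be produced without building the Eulerian tour explicitly. The C sketch below uses Martin's classical greedy ("prefer one") construction; this is not the method used by good, but it yields a word of the same kind, and for p = 7 it prints a word of length 2^7 + 6 = 134, as in the run above.

    /* greedy (Martin) construction of a binary Good-de Bruijn word;
       illustrative only, the good script uses an Eulerian tour */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int p = (argc > 1) ? atoi(argv[1]) : 7;
        int n = 1 << p;               /* number of factors of length p */
        char *seen = calloc(n, 1);    /* factors already produced */
        int w = 0, i;                 /* last p symbols, as an integer */

        for (i = 0; i < p; i++)       /* start with p zeros */
            putchar('0');
        seen[0] = 1;

        for (i = 0; i < n - 1; i++) { /* each step adds one new factor */
            int next = ((w << 1) | 1) & (n - 1);
            if (seen[next])           /* '1' would repeat a factor: */
                next &= ~1;           /* append '0' instead */
            seen[next] = 1;
            putchar('0' + (next & 1));
            w = next;
        }
        putchar('\n');
        free(seen);
        return 0;
    }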

In the next two examples biba deals with natural language dictionaries and biological vocabularies. In both cases we're building reduced automata: in the first case for the entire br.ispell dictionary for Brazilian Portuguese [11] (just as done by Lucchesi and Kowaltowski in [5]), which contains 195539 entries (total size 2120562 bytes), and, in the second, for the blocks protein database [4] (109660 entries, 3249189 bytes total). The filter eb, which extracts the sequences from the database distribution file removing comments, is provided in the biba distribution.

    $ ispell -e vocab -d br | reduce
    states: 14119 edges: 36588
    ram pages: 1300 virtual pages: 1303

    $ eb | reduce -s 200000 -m 28 -c 28 -d 1000000
    states: 1318407, edges: 1401826
    ram pages: 6200 virtual pages: 6182

The first command took just 240 seconds to complete on the same 486 computer described above. The second took 27 minutes. Note that these times may vary greatly depending on the choice of the paging parameters. We're omitting the final sizes of the files a.aut.st and a.aut.ed because in these cases they're full of garbage. In fact, their sizes are not determined by the automaton size, but by the vocabulary size, because the algorithm first builds a trie (this behaviour may be partially avoided using a command-line option). Anyway, at the end of the construction, each state uses 9 bytes, and each edge uses 5 bytes.

The library can also be used for many other purposes, but developing and linking programs with the biba library is a time-consuming exercise of C programming, so we offer a simplistic alternative that partially exposes the library's internal functionality to scripts that can be quickly written in perl or similar languages.

    #!/usr/bin/perl

    use FileHandle;
    use IPC::Open2;

    # run biba as a coprocess on the automaton given as the
    # second parameter
    $pid = open2(\*Reader, \*Writer, "biba -w $ARGV[1]");
    Writer->autoflush();

    # read the word list given as the first parameter
    open(F, $ARGV[0]) or die "cannot open $ARGV[0]: $!";
    while (<F>) {
        print Writer "r\ns$_";
        if (<Reader> =~ /^1/) {
            print $_;
        }
    }
    close(F);

This trivial script opens the file specified as its first parameter, reads each of its lines, checks whether or not it is spelled by the automaton a.aut, and prints it in the affirmative case. Hence, this script can be used as a filter. A more elaborate example is contained in the biba distribution: it computes the complexity of a biological sequence as defined by Trifonov in [9].

3. Short API description

For the programmer we provide a short description of the C API (for a detailed description please refer to the man pages). Just as automata are like files to the user, they are like files to the programmer too, so dealing with automata through this API is like dealing with files through the C language library. Automata handles are provided, and they can be used to create or specify an existing automaton, and to subsequently work on it.

create_aut similar to creat(2)
register_aut similar to open(2)
close_aut similar to close(2)

In order to deal with automata states and edges, very simple services are provided. Each one requires the specification of a handle, amongst other specific parameters. The complete list follows (a short usage sketch is given after these lists):

new_state allocate a new state
set_final set the state's final flag to a given value
is_final read the state final flag
set_dest set the destination of an edge
dest read the destination of an edge
stdeg read the output degree of a state
next_dest visit all edges leaving a given state
clone clone a state

For low level configuration and session recovery support, the following services are provided:

init_aut initializes the API and specifies paging parameters
stop_pgout disallows page-outs, in order to keep the structures represented on disk consistent
flush_aut synchronizes the disk files with the contents of RAM
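
As an illustration of how these services fit together, the fragment below builds a two-state automaton with a single edge. The header name and the exact prototypes are assumptions made for the example (the real ones are documented in the man pages), so treat this as a sketch rather than code against the actual library.

    /* sketch: building a two-state automaton with the services above;
       the header name and signatures are assumptions for illustration,
       not the library's real prototypes (see the man pages) */
    #include "biba.h"

    int main(void)
    {
        int h, s0, s1;

        init_aut(5, 0);               /* assumed: paging parameters     */
        h = create_aut("a.aut");      /* like creat(2): a new automaton */

        s0 = new_state(h);
        s1 = new_state(h);
        set_dest(h, s0, 'a', s1);     /* edge s0 --a--> s1              */
        set_final(h, s1, 1);          /* mark s1 as final               */

        flush_aut(h);                 /* synchronize disk with RAM      */
        close_aut(h);                 /* like close(2)                  */
        return 0;
    }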

4. Internal representation of automata

States and edges reside in separate data structures, which grow as needed (currently they cannot shrink). The main goal of these data structures is to save memory: since we want to be able to represent large automata, we try to minimize paging, even at the expense of a larger cost for in-memory operations.

The data structure that represents states is nothing more than a large array. Each entry of this array is a "state", composed of fixed fields (its output degree, a flag to identify if the state is final, and a pointer to the structure that represents its edges) and application-dependent fields, like flags and counters.

The data structure that represents the edges is an array too. It limits the alphabet size to 256, and currently does not support nondeterministic automata. In order to represent the edges that leave a state with output degree d, where d is in the range 0..256, it allocates just d contiguous entries of the array of edges. The allocation routines guarantee that these d entries reside on the same page, a fundamental property needed to avoid performance degradation. Each edge is composed of a label field and a destination field which points to a state. Application-specific fields are not allowed.
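
To make the edge size concrete: with labels limited to 8 bits and destinations to a 31-bit addressing limit (see below), an edge fits in 5 bytes. The packing sketched here is consistent with the sizes quoted in this paper, but is not necessarily the layout actually used by biba.

    /* illustrative 5-byte edge: an 8-bit label plus a 31-bit
       destination (top bit left free for internal use); consistent
       with the sizes quoted in the text, not necessarily biba's
       actual layout */
    #include <stdint.h>

    typedef unsigned char edge_t[5];

    static void edge_pack(edge_t e, uint8_t label, uint32_t dest)
    {
        e[0] = label;
        e[1] = dest & 0xff;
        e[2] = (dest >> 8) & 0xff;
        e[3] = (dest >> 16) & 0xff;
        e[4] = (dest >> 24) & 0x7f;    /* 31-bit address limit */
    }

    static uint32_t edge_dest(const edge_t e)
    {
        return (uint32_t)e[1] | ((uint32_t)e[2] << 8)
             | ((uint32_t)e[3] << 16) | ((uint32_t)(e[4] & 0x7f) << 24);
    }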

Allocating a state is no more than increasing the current size of the array of states by 1 and managing the necessary changes in the data structure, which may include the allocation of a new page. Inserting an edge leaving a state demands the allocation of d contiguous entries of the array of edges (best fit is used), where d is the resulting output degree. This is managed through free lists of free areas within this array, one for each possible free-area size.
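
A sketch of this best-fit allocation over per-size free lists follows. The names and the linking scheme are illustrative, and the sketch ignores the same-page constraint mentioned above, which the real allocator must also enforce.

    /* sketch of best-fit allocation over per-size free lists;
       names and linking are illustrative, and the same-page
       constraint discussed above is ignored here */
    #define MAXDEG 256
    #define NIL    (-1L)

    static long freehead[MAXDEG + 1];  /* one list per size, init to NIL */
    static long area_next[1L << 20];   /* links areas by first entry */

    /* take the smallest free area of size >= d (best fit); if it is
       larger than d, put the leftover back on the proper list */
    long alloc_edges(int d)
    {
        int s;
        for (s = d; s <= MAXDEG; s++) {
            if (freehead[s] != NIL) {
                long area = freehead[s];
                freehead[s] = area_next[area];
                if (s > d) {
                    long rest = area + d;
                    area_next[rest] = freehead[s - d];
                    freehead[s - d] = rest;
                }
                return area;
            }
        }
        return NIL;   /* no fit: the array must grow by a new page */
    }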

Both arrays reside on disk, but at each moment only some of their pages are loaded into memory, according to the paging routines. Each state and each edge uses just 40 bits (the size of a state may be larger depending on the presence of application-specific fields). This number is due to the current limit on the alphabet size, which cannot be larger than 256, plus the choice of a 31-bit addressing limit for states and edges. So if n is the sum of the number of states and edges of an automaton, then the total disk size needed to represent it will be 5n bytes.

Page replacement proceeds by swapping out pages based on their ages. The pages are circularly visited along the program execution, just like a memory refresh circuit does. The age of a page is the amount of time elapsed since the last time it was accessed. Comparing the age of a page with the mean age computed over all pages, one can decide whether it will be swapped out or not.
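
In outline, the policy looks like the fragment below; field and function names are made up for the illustration. Since the maximum age is never below the mean, the circular scan always finds a victim.

    /* outline of the age-based replacement policy; field and
       function names are made up for this illustration */
    struct page { long last_access; int dirty; };

    /* advance the circular cursor until a page at least as old as
       the mean age is found; its frame is reused for the page-in */
    static int pick_victim(struct page *pg, int npages, int *cursor,
                           long now, long mean_age)
    {
        for (;;) {
            int i = *cursor;
            *cursor = (*cursor + 1) % npages;
            if (now - pg[i].last_access >= mean_age)
                return i;   /* swap out (write back if dirty) */
        }
    }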

5. Limits and limitations

The limits imposed by addressing strategies or performance requirements follow. The alphabet size cannot be larger than 256. States and edges are subject to a 31-bit addressing limit, so an automaton can have at most 2^31 states and 2^31 edges. Only deterministic automata are supported, edges cannot carry application-specific fields, and the data structures grow but cannot shrink. We plan to relax some of these limits in future releases of the library (see the section Future Directions).

6. Notes on performance

The paging scheme by itself represents a performance penalty, because of the additional memory and time consumption needed to manage paging, detect page faults and access memory. In a user-level paging scheme like the one implemented in biba, this penalty is more significant because we cannot use, as the operating system kernel does, hardware features that make paging more efficient (and unportable).

Estimating the time penalty requires more programming effort. Using the suffix array tool now in development (details in the next section) we'll be able to do so.

On the other hand, regarding memory usage, paging is not critical. The tables and additional information needed by biba to manage paging represent less than 1.3% of the total amount of physical memory allocated (details on how this percentage was computed can be found in the file mem.c of the distribution).

7. Future Directions

Regarding the representation of edges, some enhancements are needed, like support for nondeterministic automata. Changes are being planned in order to allow the addition of application-specific fields to edges, as well as the usage of larger alphabets and/or support for automata where the labels of the edges are words over a given alphabet.

A search tool for WWW servers similar to glimpse [7] will be released soon. It currently uses an index based on GNU gdbm, but an automaton-based alternative is being added.

More general data structures besides automata can be represented using biba. We plan to release soon a program that builds the suffix array, and implements both the refinement algorithm due to Manber and Myers [6] and the external sorting algorithm due to Gonnet [3]. We wrote this program some time ago, and it is currently being adapted to use the biba paging scheme.

Full access to the internal library capabilities through perl scripts is being worked on. We currently do not have a schedule for other automata or graph algorithms to be implemented, but we're accepting suggestions.

8. Acknowledgements

We are thankful to Imre Simon for his continuous advice and encouragement. We must acknowledge our master's thesis examiners Nivio Ziviani and Yoshiharu Kohayakawa. The biologists Romeu Guimarães and Sérgio Mattioli pointed out biological resources. Carlos Juiti Watanabe and Eduardo Ermel Wronowski helped with the implementation. Thanks to Jean-Eric Pin, Marcelo Oliveira, Marko Loparic and the referees for their suggestions. This work was partly supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP).

9. Availability

The sources are available under the terms of the GNU GPL, and can be copied from ftp://ftp.ime.usp.br/pub/ueda/biba. General, programming and up-to-date information is available at http://www.ime.usp.br/~ueda/biba.

10. Bibliography

[1] Maurice J. Bach. The Design of the UNIX Operating System. Prentice Hall, 1986.

[2] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, M. T. Chen and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, vol. 40, pp. 31-55, 1985.

[3] Gaston H. Gonnet, Ricardo A. Baeza-Yates and T. Snider. Lexicographical indices for text: inverted files vs. PAT trees. TR OED-91-01, UW Centre for the New Oxford English Dictionary, University of Waterloo, 1991.

[4] S. Henikoff and J. G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research, vol. 19, pp. 6565-6572, 1991.

[5] Cláudio L. Lucchesi and Tomasz Kowaltowski. Applications of Finite Automata Representing Large Vocabularies. Software Practice and Experience, vol. 23, pp. 15-30, 1993.

[6] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, vol. 22, no. 5, pp. 935-948, October 1993.

[7] Udi Manber and Sun Wu. GLIMPSE: A Tool to Search Through Entire File Systems. Usenix Winter 1994 Technical Conference, San Francisco, pp. 23-32, 1994.

[8] Dominique Revuz. Minimization of Acyclic Deterministic Automata in Linear Time. Theoretical Computer Science, vol. 92, pp. 181-189, 1992.

[9] Edward N. Trifonov. Making sense of the human genome. Structure & Methods, vol. 1, pp. 69-77, 1990.

[10] Ricardo Ueda Karpischek. The Suffix Automaton (in Portuguese). Master's thesis, University of São Paulo, 1993.

[11] Ricardo Ueda Karpischek. br.ispell, a Brazilian Portuguese dictionary for ispell. http://www.ime.usp.br/~ueda/br.ispell/, 1995.