Next: Some theory Up: Sequence Comparison: Some Theory Previous: Sequence Comparison: Some Theory

Sequence comparison

Sequence comparison is a deep and fascinating subject in Computer Science, both theoretical and practical. However, in our opinion, neither the theoretical nor the practical aspects of the problem are well understood and we feel that their mastery is a true challenge for Computer Science.

The central problem can be stated very easily: find an algorithm, as efficient and practical as possible, to compute a longest common subsequence (lcs for short) of two given sequences.

As usual, a subsequence of a sequence is another sequence obtained from it by deleting some (not necessarily contiguous) terms. Thus, both en pri and en pai are longest common subsequences of sequence comparison and theory and practice.

It turns out that this problem has many, many applications in many, many apparently unrelated fields, such as computer science, mathematics, molecular biology, speech recognition, gas chromatography, bird song analysis, etc. A comprehensive study of the role of the problem in these fields and how it is coped with in each of them can be found in the beautiful book of Sankoff and Kruskal [37].

In particular, in computer science the problem has at least two applications. The main one is a file comparison utility, nowadays universally called diff, after its popularization through the UNIX operating system. This tool is intensively used to discover differences between versions of a text file. In this role it is useful in keeping track of the evolution of a document or of a computer program. It is also used as a file compression utility, since many versions of a (long) file can be represented by storing one (long) version of it and many (short) scripts of transforming the stored version in the remaining ones. Another aplication in computer science is to approximate string matching used, for instance, in the detection of misspelled versions of names. It should be noted, however, that sequence comparison is not the main tool to solve this very important problem. For more details the reader is referred to [16].

An interesting aspect of the problem is that it can be solved by a simple and perhaps even intuitive `folklore' algorithm based on a dynamic programming approach. This appealing algorithm has been discovered many, many times. Indeed, it has been discovered by engineers, by biologists and by computer scientists, in Russia, Japan, United States, France and Canada in the period 1968 to 1975. The first publication of the algorithm seems to be, according [37], in a 1968 paper by the russian engineer Vintsyuk [45].

The big challenge to computer science comes from the complexity of the folklore algorithm. Indeed, it requires time proportional to the product of the lengths of the sequences, and no essentially better practical algorithm is known. The question is to search for a possible algorithm which is simultaneously efficient and practical.

As far as we know, the existence of a linear algorithm has not been ruled out. Neither has been found a practical algorithm which worst case time complexity is better than O(mn), where m and n are the lengths of the given sequences. There exists, however, an algorithm of time complexity for pairs of sequences of length n over a fixed finite alphabet, discovered by Masek and Paterson [29]. This algorithm is not suitable for practical purposes but its existence is a hint that better algorithms than the ones in current use must exist.

All told, a very good start would be to find out whether or not there exists a linear time algorithm to compute a longest common subsequence of two given sequences over two letters. We get a more modest but still very interesting start by replacing ``linear time'' for ``time complexity ''. Here, the time bounds should be measured on a random access machine under the unit-cost model.

A weaker version, already answered by Masek and Paterson, has been proposed by D. E. Knuth in a technical report coauthored with V. Chvátal and D. A. Klarner in 1972 [9]:

Problem 35.   Greatest common substrings.
  It is possible to find the longest common subsequence of two
sequences of a's and b's in a time proportional to the product
of their lengths. Can one do better? 
Note: aba is a subsequence of aabbbba.

Incidentally, this seems to be the first reference to the problem and to the folklore algorithm within Computer Science.

Next: Some theory Up: Sequence Comparison: Some Theory Previous: Sequence Comparison: Some Theory

Imre Simon
Wed May 28 19:09:09 EST 1997