Suffix arrays & LCP_落脚点_百度空间企业库|免费b2b网站

Suffix arrays & LCP_落脚点_百度空间

A suffix array is a simple data structure that enables lookup of any substring of a text and identification of repeated substrings. It is more compact than a suffix tree and is amenable to storage in secondary memory.

A suffix array is (notionally) a sorted list of the suffixes of a given string. (A suffix is a substring that extends to the end of the given string.) The sorted list is presented as an array of integers that identify the suffixes in order. The suffix array for the 11-letter text ABRACADABRA, with an endmark # that collates low, is given below, together with the corresponding suffixes.

11   #
10   A#
 7   ABRA#
 0   ABRACADABRA#
 3   ACADABRA#
 5   ADABRA#
 8   BRA#
 1   BRACADABRA#
 4   CADABRA#
 6   DABRA#
 9   RA#
 2   RACADABRA#

Together, the suffix array and string enable binary search for any substring, e.g. CAD.

A useful auxiliary data structure is an `LCP array', an array of lengths of the longest common prefix between each substring and its predecessor in the suffix array. The second column below gives the LCP array for the previous example. The third column shows that a suffix array may also be interpreted as an ordered list of circular shifts.

11   0   #ABRACADABRA
10   0   A#ABRACADABR
 7   1   ABRA#ABRACAD
 0   4   ABRACADABRA#
 3   1   ACADABRA#ABR
 5   1   ADABRA#ABRAC
 8   0   BRA#ABRACADA
 1   3   BRACADABRA#A
 4   0   CADABRA#ABRA
 6   0   DABRA#ABRACA
 9   0   RA#ABRACADAB
 2   2   RACADABRA#AB

The last column of letters, ARD#RCAAAABB, is the Burrows-Wheeler transform, which is central to B-W data compression. The archive contains C functions that compute suffix arrays efficiently by methods that build on work of Manber and Myers. The functions handle strings of integers, which may contain codes for characters, words, or other types of data. Contents:

sarray.h: Header file.
ssarray.c: A simplification of the Myers/Manber technique, by Peter McIlroy and Douglas McIlroy. This is the program to study, but sarray.c is the one for heavy lifting.
sarray.c: An extensively engineered hybrid of ssarray.c and quicksort, usually more than five times as fast as ssarray.c, by Sean Quinlan and Sean Dorward. Also contains a specialized interface, bsarray, tailored to byte-string data. A appeared in N. Jesper Larson's PhD thesis.
lcp.c: Computes an LCP array for a given suffix array, by a linear-time method due to Kasai et al.
scode.c: Encodes a string into a canonical form for input to ssarray or sarray.
SuffixArray.java and SuffixArray.c: Java interface to the C functions.
sarray.3: Unix-style man page, troff source, also available in , and .

Updated July 21, 2004

zz:http://www.cs.dartmouth.edu/~doug/sarray/

郑重声明：资讯【Suffix arrays & LCP_落脚点_百度空间】由发布，版权归原作者及其所在单位，其原创性以及文中陈述文字和内容未经(企业库qiyeku.com)证实，请读者仅作参考，并请自行核实相关内容。若本文有侵犯到您的版权，请你提供相关证明及申请并与我们联系（qiyeku # qq.com）或【在线投诉】，我们审核后将会尽快处理。

—— 相关资讯 ——