Initial checkin

This commit is contained in:
marty-emmanuel 2019-04-01 18:04:56 +02:00
commit e216b0c544
43 changed files with 6514 additions and 0 deletions

10
.gitignore vendored Normal file
View File

@ -0,0 +1,10 @@
bin
vault
packed_lzsa
packed_lz4
packed_lizard
packed_lz5
packed_zx7
packed_rnc
obj
lzsa

43
LICENSE.cc0.md Executable file
View File

@ -0,0 +1,43 @@
## creative commons
# CC0 1.0 Universal
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED HEREUNDER.
### Statement of Purpose
The laws of most jurisdictions throughout the world automatically confer exclusive Copyright and Related Rights (defined below) upon the creator and subsequent owner(s) (each and all, an "owner") of an original work of authorship and/or a database (each, a "Work").
Certain owners wish to permanently relinquish those rights to a Work for the purpose of contributing to a commons of creative, cultural and scientific works ("Commons") that the public can reliably and without fear of later claims of infringement build upon, modify, incorporate in other works, reuse and redistribute as freely as possible in any form whatsoever and for any purposes, including without limitation commercial purposes. These owners may contribute to the Commons to promote the ideal of a free culture and the further production of creative, cultural and scientific works, or to gain reputation or greater distribution for their Work in part through the use and efforts of others.
For these and/or other purposes and motivations, and without any expectation of additional consideration or compensation, the person associating CC0 with a Work (the "Affirmer"), to the extent that he or she is an owner of Copyright and Related Rights in the Work, voluntarily elects to apply CC0 to the Work and publicly distribute the Work under its terms, with knowledge of his or her Copyright and Related Rights in the Work and the meaning and intended legal effect of CC0 on those rights.
1. __Copyright and Related Rights.__ A Work made available under CC0 may be protected by copyright and related or neighboring rights ("Copyright and Related Rights"). Copyright and Related Rights include, but are not limited to, the following:
i. the right to reproduce, adapt, distribute, perform, display, communicate, and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or likeness depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work, subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data in a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, and under any national implementation thereof, including any amended or successor version of such directive); and
vii. other similar, equivalent or corresponding rights throughout the world based on applicable law or treaty, and any national implementations thereof.
2. __Waiver.__ To the greatest extent permitted by, but not in contravention of, applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and unconditionally waives, abandons, and surrenders all of Affirmer's Copyright and Related Rights and associated claims and causes of action, whether now known or unknown (including existing as well as future claims and causes of action), in the Work (i) in all territories worldwide, (ii) for the maximum duration provided by applicable law or treaty (including future time extensions), (iii) in any current or future medium and for any number of copies, and (iv) for any purpose whatsoever, including without limitation commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each member of the public at large and to the detriment of Affirmer's heirs and successors, fully intending that such Waiver shall not be subject to revocation, rescission, cancellation, termination, or any other legal or equitable action to disrupt the quiet enjoyment of the Work by the public as contemplated by Affirmer's express Statement of Purpose.
3. __Public License Fallback.__ Should any part of the Waiver for any reason be judged legally invalid or ineffective under applicable law, then the Waiver shall be preserved to the maximum extent permitted taking into account Affirmer's express Statement of Purpose. In addition, to the extent the Waiver is so judged Affirmer hereby grants to each affected person a royalty-free, non transferable, non sublicensable, non exclusive, irrevocable and unconditional license to exercise Affirmer's Copyright and Related Rights in the Work (i) in all territories worldwide, (ii) for the maximum duration provided by applicable law or treaty (including future time extensions), (iii) in any current or future medium and for any number of copies, and (iv) for any purpose whatsoever, including without limitation commercial, advertising or promotional purposes (the "License"). The License shall be deemed effective as of the date CC0 was applied by Affirmer to the Work. Should any part of the License for any reason be judged legally invalid or ineffective under applicable law, such partial invalidity or ineffectiveness shall not invalidate the remainder of the License, and in such case Affirmer hereby affirms that he or she will not (i) exercise any of his or her remaining Copyright and Related Rights in the Work or (ii) assert any associated claims and causes of action with respect to the Work, in either case contrary to Affirmer's express Statement of Purpose.
4. __Limitations and Disclaimers.__
a. No trademark or patent rights held by Affirmer are waived, abandoned, surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or warranties of any kind concerning the Work, express, implied, statutory or otherwise, including without limitation warranties of title, merchantability, fitness for a particular purpose, non infringement, or the absence of latent or other defects, accuracy, or the present or absence of errors, whether or not discoverable, all to the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons that may apply to the Work or any use thereof, including without limitation any person's Copyright and Related Rights in the Work. Further, Affirmer disclaims responsibility for obtaining any necessary consents, permissions or other rights required for any use of the Work.
d. Affirmer understands and acknowledges that Creative Commons is not a party to this document and has no duty or obligation with respect to this CC0 or use of the Work.

19
LICENSE.zlib.md Executable file
View File

@ -0,0 +1,19 @@
Copyright (c) 2019 Emmanuel Marty
This software is provided 'as-is', without any express or implied warranty. In
no event will the authors be held liable for any damages arising from the use of
this software.
Permission is granted to anyone to use this software for any purpose, including
commercial applications, and to alter it and redistribute it freely, subject to
the following restrictions:
1. The origin of this software must not be misrepresented; you must not claim
that you wrote the original software. If you use this software in a product,
an acknowledgment in the product documentation would be appreciated but is
not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

30
Makefile Executable file
View File

@ -0,0 +1,30 @@
CC=clang
CFLAGS=-O3 -fomit-frame-pointer -Isrc/libdivsufsort/include -Isrc -DHAVE_CONFIG_H
OBJDIR=obj
LDFLAGS=
STRIP=strip
$(OBJDIR)/%.o: src/../%.c
@mkdir -p '$(@D)'
$(CC) $(CFLAGS) -c $< -o $@
APP := lzsa
OBJS := $(OBJDIR)/src/main.o
OBJS += $(OBJDIR)/src/shrink.o
OBJS += $(OBJDIR)/src/expand.o
OBJS += $(OBJDIR)/src/libdivsufsort/lib/divsufsort.o
OBJS += $(OBJDIR)/src/libdivsufsort/lib/sssort.o
OBJS += $(OBJDIR)/src/libdivsufsort/lib/trsort.o
OBJS += $(OBJDIR)/src/libdivsufsort/lib/utils.o
all: $(APP)
$(APP): $(OBJS)
@mkdir -p ../../bin/posix
$(CC) $^ $(LDFLAGS) -o $(APP)
$(STRIP) $(APP)
clean:
@rm -rf $(APP) $(OBJDIR)

121
README.md Executable file
View File

@ -0,0 +1,121 @@
LZSA is a byte-aligned compression format that is specifically engineered for very fast decompression on 8-bit systems. It can compress files of any size by using blocks of a maximum size of 64 Kb with block-interdependent compression and up to 64 Kb of back-references for matches.
The LZSA compression tool uses an aggressive optimal packing strategy to try to find the sequence of commands that gives the smallest packed file that decompresses to the original while maintaining the maximum possible decompression speed.
Compression ratio comparison between LZSA and other optimal packers, for a workload composed of ZX Spectrum and C64 files:
ZX7 57,36% (entropy coding)
LZ5 1.4.1 59,82%
LZSA 60,84% <------ (single byte stream)
Lizard -29 64,14% (rep-match, 4 byte streams)
LZ4_HC -19 -B4 -BD 64,5% (single byte stream)
Uncompressed 100%
As an example of LZSA's simplicity, a size-optimized decompressor on 8088 has been implemented in 105 bytes.
The compressor is approximately 2X slower than LZ4_HC but compresses better while maintaining similar decompression speeds and decompressor simplicity.
The main differences with the LZ4 compression format are:
* The use of short (8-bit) match offsets where possible. The match-finder and optimizer cooperate to try and use the shortest match offsets possible.
* Shorter encoding of lengths. As blocks are maximum 64 Kb in size, lengths can only be up to 64 Kb.
* As a result of the smaller commands due to the possibly shorter match offsets, a minimum match size of 3 bytes instead of 4. The use of small matches is driven by the optimizer, and used where they provide gains.
Inspirations:
* [LZ4](https://github.com/lz4/lz4) by Yann Collet.
* [LZ5/Lizard](https://github.com/inikep/lizard) by Przemyslaw Skibinski and Yann Collet.
* The suffix array intervals in [Wimlib](https://wimlib.net/git/?p=wimlib;a=tree) by Eric Biggers.
License:
* The LZSA code is available under the Zlib license.
* The compressor (shrink.c) is available under the CC0 license due to using portions of code from Eric Bigger's Wimlib in the suffix array-based matchfinder.
# Stream format
The stream format is composed of:
* a header
* one or more frames
* a footer
# Header format
The header contains a signature and a traits byte:
0 1 2 3 4
0x7b 0x9e 0x0f 0xd7 0x00
<--- signature ---> <- traits ->
The traits are set to 0x00 for this version of the format.
# Frame format
Each frame contains a 3-byte length followed by block data that expands to up to 64 Kb of decompressed data.
0 1 2
DSZ0 DSZ1 U|DSZ2
* DSZ0 (length byte 0) contains bits 0-7 of the block data size
* DSZ1 (length byte 1) contains bits 8-15 of the block data size
* DSZ2 (bit 0 of length byte 2) contains bit 16 of the block data size
* U (bit 7 of length byte 2) is set if the block data is uncompressed, and clear if the block data is compressed.
* Bits 1..6 of length byte 2 are currently undefined and must be set to 0.
# Block data format
LZSA blocks are composed from consecutive commands. Each command follows this format:
* <token: O|LLL|MMMM>
* <optional extra literal length>
* <literal values>
* <match offset low>
* <optional match offset high>
* <optional extra match length>
**token**
The token byte is broken down into three parts:
7 6 5 4 3 2 1 0
O L L L M M M M
* O: set for a 2-byte match offset, clear for a 1-byte match offset
* L: 3-bit literals length (0-6, or 7 if extended). If the number of literals for this command is 0 to 6, the length is encoded in the token and no extra bytes are required. Otherwise, a value of 7 is encoded and extra bytes follow as 'optional extra literal length'
* M: 4-bit match length (0-14, or 15 if extended). Likewise, if the match length for this command is 0 to 14, it is directly encoded, otherwise 15 is stored and extra bytes follow as 'optional extra match length'.
**optional extra literal length**
If the literals length is 7 or more, the 'L' bits in the token form the value 7, and an extra byte follows here, with three possible types of value:
* 0-254: the value is added to the 7 stored in the token, to compose the final literals length. For instance a length of 206 will be stored as 7 in the token + a single byte with the value of 199, as 7 + 199 = 206.
* 254: a second byte follows. The final literals value is 7 + 254 + the second byte. For instance, a literals length of 499 is encoded as 7 in the token, a byte with the value of 254, and a final byte with teh value of 238, as 7 + 254 + 238 = 499.
* 255: a second and third byte follow, forming a little-endian 16-bit value. The final literals value is 7 + 255 + that 16-bit value. For instance, a literals length of 1024 is stored as 7 in the token, then byte values of 255, 250 and 2, as 7 + 255 + 250 + (2 * 256) = 1024.
**literal values**
Literal bytes, whose number is specified by the literals length, follow here. There can be zero literals in a command.
**match offset low**
The low 8 bits of the match offset follows.
**optional match offset high**
If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset follow, otherwise they are understood to be all set to 0.
**important note regarding match offsets: off by 1**
Note that the match offset is *off by 1*: a value of 0 refers to the byte preceding the current output index (N-1). A value of 1 refers to tow bytes before the current output index (N-2) and so on. This is so that match offsets up to 256 can be encoded as a single byte, for extra compression.
**optional extra match length**
If the match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.
* 0-254: the value is added to the 15 stored in the token.
* 254: a second byte follows. The final match length is 15 + 254 + the second byte.
* 255: a second and third byte follow, forming a little-endian 16-bit value. The final match length is 15 + 255 + that 16-bit value.
# Footer format
The stream ends with an empty frame: the 3 length bytes are set to 0 and no block data follows.

125
asm/8088/decompress_small.S Executable file
View File

@ -0,0 +1,125 @@
; decompress_small.S - space-efficient decompressor implementation for 8088
;
; Copyright (C) 2019 Emmanuel Marty
;
; This software is provided 'as-is', without any express or implied
; warranty. In no event will the authors be held liable for any damages
; arising from the use of this software.
;
; Permission is granted to anyone to use this software for any purpose,
; including commercial applications, and to alter it and redistribute it
; freely, subject to the following restrictions:
;
; 1. The origin of this software must not be misrepresented; you must not
; claim that you wrote the original software. If you use this software
; in a product, an acknowledgment in the product documentation would be
; appreciated but is not required.
; 2. Altered source versions must be plainly marked as such, and must not be
; misrepresented as being the original software.
; 3. This notice may not be removed or altered from any source distribution.
segment .text
bits 16
; ---------------------------------------------------------------------------
; Decompress LZSA frame
; inputs:
; * ds:si: LZSA frame
; * es:di: output buffer
; output:
; * ax: decompressed size
; ---------------------------------------------------------------------------
lzsa_decompress:
push di ; remember decompression offset
cld ; make string operations (lods, movs, stos..) move forward
lodsw ; grab first 2 bytes of block size
mov dx,ax ; keep block size in dx
lodsb ; grab last byte of block size / compression flag
test al,al ; if the compressed data is larger than 65,535 bytes, or if it is uncompressed, bail
jne .done_decompressing
add dx,si ; point at the end of the compressed data
mov bp,dx ; remember it
xor cx,cx
.decode_token:
mov ax,cx ; clear ah - cx is zero from above or from after rep movsw in .copy_match
lodsb ; read token byte: O|LLL|MMMM
mov dx,ax ; keep token in dl
and al,070H ; isolate literals length in token (LLL)
mov cl,4
shr al,cl ; shift match length into place
mov cx,ax ; copy literals length into cx
cmp al,07H ; LITERALS_RUN_LEN?
jne .copy_literals ; no, we have the full literals count from the token, go copy
call .get_varlen ; get complete literals length
.copy_literals:
rep movsb ; copy cx literals from ds:si to es:di
cmp si,bp ; did we reach the end of the compressed data?
je .done_decompressing ; we did, bail. the last token does not include a match copy
test dl,dl ; check match offset size in token (O bit, ie. bit 7, the sign bit)
js .get_long_offset
xor ah,ah ; Get 1-byte match offset
lodsb
jmp short .get_match_length
.get_long_offset:
lodsw ; Get 2-byte match offset
.get_match_length:
inc ax ; the match offset is stored off-by-1, increase it
xchg ax,dx ; dx: match offset ax: original token
and al,0FH ; isolate match length in token (MMMM)
mov cx,ax ; copy match length into cx
cmp al,0FH ; MATCH_RUN_LEN?
jne .copy_match ; no, we have the full match length from the token, go copy
call .get_varlen ; get complete match length
.copy_match:
add cx,3 ; add MIN_MATCH_SIZE to get the final match length to copy
push ds ; save ds:si (current pointer to compressed data)
xchg si,ax
push es
pop ds
mov si,di ; ds:si now points at back reference in output data
sub si,dx
rep movsb ; copy match
xchg si,ax ; restore ds:si
pop ds
jmp short .decode_token ; go decode another token
.done_decompressing:
pop ax ; retrieve the original decompression offset
xchg ax,di ; compute decompressed size
sub ax,di
ret ; done
.get_varlen:
lodsb ; grab extra length byte
add cx,ax ; add extra length byte to length from token
cmp al,0FFH ; 3-byte extra length?
je .large_varlen ; yes, go grab it
cmp al,0FEH ; 2-byte extra length?
jne .varlen_done ; no, we have the full length now, bail
lodsb ; grab extra length byte
jmp short .add_and_varlen_done ; go add it and bail
.large_varlen:
lodsw ; grab 16-bit extra length
.add_and_varlen_done:
add cx,ax ; add to length from token
.varlen_done:
ret

230
src/expand.c Executable file
View File

@ -0,0 +1,230 @@
/*
* expand.c - block decompressor implementation
*
* Copyright (C) 2019 Emmanuel Marty
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "format.h"
#include "expand.h"
#ifdef _MSC_VER
#define FORCE_INLINE __forceinline
#else /* _MSC_VER */
#define FORCE_INLINE __attribute__((always_inline))
#endif /* _MSC_VER */
static inline FORCE_INLINE int lzsa_expand_literals_slow(const unsigned char **ppInBlock, const unsigned char *pInBlockEnd, int nLiterals, unsigned char **ppCurOutData, const unsigned char *pOutDataEnd) {
const unsigned char *pInBlock = *ppInBlock;
unsigned char *pCurOutData = *ppCurOutData;
if (nLiterals == LITERALS_RUN_LEN) {
unsigned char nByte;
if (pInBlock >= pInBlockEnd) return -1;
nByte = *pInBlock++;
nLiterals += (int)((unsigned int)nByte);
if (nByte == 254) {
if (pInBlock >= pInBlockEnd) return -1;
nLiterals += (int)((unsigned int)*pInBlock++);
}
else if (nByte == 255) {
int nLargeLiterals;
if ((pInBlock + 1) >= pInBlockEnd) return -1;
nLargeLiterals = ((unsigned int)*pInBlock++);
nLargeLiterals |= (((unsigned int)*pInBlock++) << 8);
nLiterals += nLargeLiterals;
}
}
if (nLiterals != 0) {
if ((pInBlock + nLiterals) > pInBlockEnd ||
(pCurOutData + nLiterals) > pOutDataEnd) {
return -1;
}
memcpy(pCurOutData, pInBlock, nLiterals);
pInBlock += nLiterals;
pCurOutData += nLiterals;
}
*ppInBlock = pInBlock;
*ppCurOutData = pCurOutData;
return 0;
}
static inline FORCE_INLINE int lzsa_expand_match_slow(const unsigned char **ppInBlock, const unsigned char *pInBlockEnd, const unsigned char *pSrc, int nMatchLen, unsigned char **ppCurOutData, const unsigned char *pOutDataEnd, const unsigned char *pOutDataFastEnd) {
const unsigned char *pInBlock = *ppInBlock;
unsigned char *pCurOutData = *ppCurOutData;
if (nMatchLen == MATCH_RUN_LEN) {
unsigned char nByte;
if (pInBlock >= pInBlockEnd) return -1;
nByte = *pInBlock++;
nMatchLen += (int)((unsigned int)nByte);
if (nByte == 254) {
if (pInBlock >= pInBlockEnd) return -1;
nMatchLen += (int)((unsigned int)*pInBlock++);
}
else if (nByte == 255) {
int nLargeMatchLen;
if ((pInBlock + 1) >= pInBlockEnd) return -1;
nLargeMatchLen = ((unsigned int)*pInBlock++);
nLargeMatchLen |= (((unsigned int)*pInBlock++) << 8);
nMatchLen += nLargeMatchLen;
}
}
nMatchLen += MIN_MATCH_SIZE;
if ((pCurOutData + nMatchLen) > pOutDataEnd) {
return -1;
}
if ((pSrc + 1) == pCurOutData && nMatchLen >= 16) {
/* One-byte RLE */
memset(pCurOutData, *pSrc, nMatchLen);
pCurOutData += nMatchLen;
}
else {
/* Do a deterministic, left to right byte copy instead of memcpy() so as to handle overlaps */
int nMaxFast = nMatchLen;
if (nMaxFast > (pCurOutData - pSrc))
nMaxFast = (int)(pCurOutData - pSrc);
if ((pCurOutData + nMaxFast) > pOutDataFastEnd)
nMaxFast = (int)(pOutDataFastEnd - pCurOutData);
if (nMaxFast > 0) {
const unsigned char *pCopySrc = pSrc;
unsigned char *pCopyDst = pCurOutData;
const unsigned char *pCopyEndDst = pCurOutData + nMaxFast;
do {
memcpy(pCopyDst, pCopySrc, 16);
pCopySrc += 16;
pCopyDst += 16;
} while (pCopyDst < pCopyEndDst);
pCurOutData += nMaxFast;
pSrc += nMaxFast;
nMatchLen -= nMaxFast;
}
while (nMatchLen >= 4) {
*pCurOutData++ = *pSrc++;
*pCurOutData++ = *pSrc++;
*pCurOutData++ = *pSrc++;
*pCurOutData++ = *pSrc++;
nMatchLen -= 4;
}
while (nMatchLen > 0) {
*pCurOutData++ = *pSrc++;
nMatchLen--;
}
}
*ppInBlock = pInBlock;
*ppCurOutData = pCurOutData;
return 0;
}
int lzsa_expand_block(const unsigned char *pInBlock, int nBlockSize, unsigned char *pOutData, int nOutDataOffset, int nBlockMaxSize) {
const unsigned char *pInBlockEnd = pInBlock + nBlockSize;
const unsigned char *pInBlockFastEnd = pInBlock + nBlockSize - 16;
unsigned char *pCurOutData = pOutData + nOutDataOffset;
const unsigned char *pOutDataEnd = pCurOutData + nBlockMaxSize;
const unsigned char *pOutDataFastEnd = pOutDataEnd - 16;
/* Fast loop */
while (pInBlock < pInBlockFastEnd && pCurOutData < pOutDataFastEnd) {
const unsigned char token = *pInBlock++;
int nLiterals = (int)((unsigned int)((token & 0x70) >> 4));
if (nLiterals < LITERALS_RUN_LEN) {
memcpy(pCurOutData, pInBlock, 8);
pInBlock += nLiterals;
pCurOutData += nLiterals;
}
else {
if (lzsa_expand_literals_slow(&pInBlock, pInBlockEnd, nLiterals, &pCurOutData, pOutDataEnd))
return -1;
}
if ((pInBlock + 2) <= pInBlockEnd) { /* The last token in the block does not include match information */
int nMatchOffset;
nMatchOffset = ((unsigned int)*pInBlock++);
if (token & 0x80)
nMatchOffset |= (((unsigned int)*pInBlock++) << 8);
nMatchOffset++;
const unsigned char *pSrc = pCurOutData - nMatchOffset;
if (pSrc < pOutData)
return -1;
int nMatchLen = (int)((unsigned int)(token & 0x0f));
if (nMatchLen < (16 - MIN_MATCH_SIZE + 1) && (pSrc + MIN_MATCH_SIZE + nMatchLen) < pCurOutData) {
memcpy(pCurOutData, pSrc, 16);
pCurOutData += (MIN_MATCH_SIZE + nMatchLen);
}
else {
if (lzsa_expand_match_slow(&pInBlock, pInBlockEnd, pSrc, nMatchLen, &pCurOutData, pOutDataEnd, pOutDataFastEnd))
return -1;
}
}
}
/* Slow loop for the remainder of the buffer */
while (pInBlock < pInBlockEnd) {
const unsigned char token = *pInBlock++;
int nLiterals = (int)((unsigned int)((token & 0x70) >> 4));
if (lzsa_expand_literals_slow(&pInBlock, pInBlockEnd, nLiterals, &pCurOutData, pOutDataEnd))
return -1;
if ((pInBlock + 2) <= pInBlockEnd) { /* The last token in the block does not include match information */
int nMatchOffset;
nMatchOffset = ((unsigned int)*pInBlock++);
if (token & 0x80)
nMatchOffset |= (((unsigned int)*pInBlock++) << 8);
nMatchOffset++;
const unsigned char *pSrc = pCurOutData - nMatchOffset;
if (pSrc < pOutData)
return -1;
int nMatchLen = (int)((unsigned int)(token & 0x0f));
if (lzsa_expand_match_slow(&pInBlock, pInBlockEnd, pSrc, nMatchLen, &pCurOutData, pOutDataEnd, pOutDataFastEnd))
return -1;
}
}
return (int)(pCurOutData - (pOutData + nOutDataOffset));
}

28
src/expand.h Executable file
View File

@ -0,0 +1,28 @@
/*
* expand.h - block decompressor definitions
*
* Copyright (C) 2019 Emmanuel Marty
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#ifndef _EXPAND_H
#define _EXPAND_H
int lzsa_expand_block(const unsigned char *pInBlock, int nBlockSize, unsigned char *pOutData, int nOutDataOffset, int nBlockMaxSize);
#endif /* _EXPAND_H */

32
src/format.h Executable file
View File

@ -0,0 +1,32 @@
/*
* format.h - byte stream format definitions
*
* Copyright (C) 2019 Emmanuel Marty
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
#ifndef _FORMAT_H
#define _FORMAT_H
#define MIN_MATCH_SIZE 3
#define MIN_OFFSET 1
#define MAX_OFFSET 0xffff
#define LITERALS_RUN_LEN 7
#define MATCH_RUN_LEN 15
#endif /* _FORMAT_H */

32
src/libdivsufsort/.gitignore vendored Normal file
View File

@ -0,0 +1,32 @@
# Object files
*.o
*.ko
*.obj
*.elf
# Precompiled Headers
*.gch
*.pch
# Libraries
*.lib
*.a
*.la
*.lo
# Shared objects (inc. Windows DLLs)
*.dll
*.so
*.so.*
*.dylib
# Executables
*.exe
*.out
*.app
*.i*86
*.x86_64
*.hex
# CMake files/directories
build/

View File

@ -0,0 +1,21 @@
# libdivsufsort Change Log
See full changelog at: https://github.com/y-256/libdivsufsort/commits
## [2.0.1] - 2010-11-11
### Fixed
* Wrong variable used in `divbwt` function
* Enclose some string variables with double quotation marks in include/CMakeLists.txt
* Fix typo in include/CMakeLists.txt
## 2.0.0 - 2008-08-23
### Changed
* Switch the build system to [CMake](http://www.cmake.org/)
* Improve the performance of the suffix-sorting algorithm
### Added
* OpenMP support
* 64-bit version of divsufsort
[Unreleased]: https://github.com/y-256/libdivsufsort/compare/2.0.1...HEAD
[2.0.1]: https://github.com/y-256/libdivsufsort/compare/2.0.0...2.0.1

View File

@ -0,0 +1,99 @@
### cmake file for building libdivsufsort Package ###
cmake_minimum_required(VERSION 2.4.4)
set(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules")
include(AppendCompilerFlags)
## Project information ##
project(libdivsufsort C)
set(PROJECT_VENDOR "Yuta Mori")
set(PROJECT_CONTACT "yuta.256@gmail.com")
set(PROJECT_URL "https://github.com/y-256/libdivsufsort")
set(PROJECT_DESCRIPTION "A lightweight suffix sorting library")
include(VERSION.cmake)
## CPack configuration ##
set(CPACK_GENERATOR "TGZ;TBZ2;ZIP")
set(CPACK_SOURCE_GENERATOR "TGZ;TBZ2;ZIP")
include(ProjectCPack)
## Project options ##
option(BUILD_SHARED_LIBS "Set to OFF to build static libraries" ON)
option(BUILD_EXAMPLES "Build examples" ON)
option(BUILD_DIVSUFSORT64 "Build libdivsufsort64" OFF)
option(USE_OPENMP "Use OpenMP for parallelization" OFF)
option(WITH_LFS "Enable Large File Support" ON)
## Installation directories ##
set(LIB_SUFFIX "" CACHE STRING "Define suffix of directory name (32 or 64)")
set(CMAKE_INSTALL_RUNTIMEDIR "" CACHE PATH "Specify the output directory for dll runtimes (default is bin)")
if(NOT CMAKE_INSTALL_RUNTIMEDIR)
set(CMAKE_INSTALL_RUNTIMEDIR "${CMAKE_INSTALL_PREFIX}/bin")
endif(NOT CMAKE_INSTALL_RUNTIMEDIR)
set(CMAKE_INSTALL_LIBDIR "" CACHE PATH "Specify the output directory for libraries (default is lib)")
if(NOT CMAKE_INSTALL_LIBDIR)
set(CMAKE_INSTALL_LIBDIR "${CMAKE_INSTALL_PREFIX}/lib${LIB_SUFFIX}")
endif(NOT CMAKE_INSTALL_LIBDIR)
set(CMAKE_INSTALL_INCLUDEDIR "" CACHE PATH "Specify the output directory for header files (default is include)")
if(NOT CMAKE_INSTALL_INCLUDEDIR)
set(CMAKE_INSTALL_INCLUDEDIR "${CMAKE_INSTALL_PREFIX}/include")
endif(NOT CMAKE_INSTALL_INCLUDEDIR)
set(CMAKE_INSTALL_PKGCONFIGDIR "" CACHE PATH "Specify the output directory for pkgconfig files (default is lib/pkgconfig)")
if(NOT CMAKE_INSTALL_PKGCONFIGDIR)
set(CMAKE_INSTALL_PKGCONFIGDIR "${CMAKE_INSTALL_LIBDIR}/pkgconfig")
endif(NOT CMAKE_INSTALL_PKGCONFIGDIR)
## Build type ##
if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE "Release")
elseif(CMAKE_BUILD_TYPE STREQUAL "Debug")
set(CMAKE_VERBOSE_MAKEFILE ON)
endif(NOT CMAKE_BUILD_TYPE)
## Compiler options ##
if(MSVC)
append_c_compiler_flags("/W4" "VC" CMAKE_C_FLAGS)
append_c_compiler_flags("/Oi;/Ot;/Ox;/Oy" "VC" CMAKE_C_FLAGS_RELEASE)
if(USE_OPENMP)
append_c_compiler_flags("/openmp" "VC" CMAKE_C_FLAGS)
endif(USE_OPENMP)
elseif(BORLAND)
append_c_compiler_flags("-w" "BCC" CMAKE_C_FLAGS)
append_c_compiler_flags("-Oi;-Og;-Os;-Ov;-Ox" "BCC" CMAKE_C_FLAGS_RELEASE)
else(MSVC)
if(CMAKE_COMPILER_IS_GNUCC)
append_c_compiler_flags("-Wall" "GCC" CMAKE_C_FLAGS)
append_c_compiler_flags("-fomit-frame-pointer" "GCC" CMAKE_C_FLAGS_RELEASE)
if(USE_OPENMP)
append_c_compiler_flags("-fopenmp" "GCC" CMAKE_C_FLAGS)
endif(USE_OPENMP)
else(CMAKE_COMPILER_IS_GNUCC)
append_c_compiler_flags("-Wall" "UNKNOWN" CMAKE_C_FLAGS)
append_c_compiler_flags("-fomit-frame-pointer" "UNKNOWN" CMAKE_C_FLAGS_RELEASE)
if(USE_OPENMP)
append_c_compiler_flags("-fopenmp;-openmp;-omp" "UNKNOWN" CMAKE_C_FLAGS)
endif(USE_OPENMP)
endif(CMAKE_COMPILER_IS_GNUCC)
endif(MSVC)
## Add definitions ##
add_definitions(-DHAVE_CONFIG_H=1 -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS)
## Add subdirectories ##
add_subdirectory(pkgconfig)
add_subdirectory(include)
add_subdirectory(lib)
if(BUILD_EXAMPLES)
add_subdirectory(examples)
endif(BUILD_EXAMPLES)
## Add 'uninstall' target ##
CONFIGURE_FILE(
"${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules/cmake_uninstall.cmake.in"
"${CMAKE_CURRENT_BINARY_DIR}/CMakeModules/cmake_uninstall.cmake"
IMMEDIATE @ONLY)
ADD_CUSTOM_TARGET(uninstall
"${CMAKE_COMMAND}" -P "${CMAKE_CURRENT_BINARY_DIR}/CMakeModules/cmake_uninstall.cmake")

View File

@ -0,0 +1,38 @@
include(CheckCSourceCompiles)
include(CheckCXXSourceCompiles)
macro(append_c_compiler_flags _flags _name _result)
set(SAFE_CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS})
string(REGEX REPLACE "[-+/ ]" "_" cname "${_name}")
string(TOUPPER "${cname}" cname)
foreach(flag ${_flags})
string(REGEX REPLACE "^[-+/ ]+(.*)[-+/ ]*$" "\\1" flagname "${flag}")
string(REGEX REPLACE "[-+/ ]" "_" flagname "${flagname}")
string(TOUPPER "${flagname}" flagname)
set(have_flag "HAVE_${cname}_${flagname}")
set(CMAKE_REQUIRED_FLAGS "${flag}")
check_c_source_compiles("int main() { return 0; }" ${have_flag})
if(${have_flag})
set(${_result} "${${_result}} ${flag}")
endif(${have_flag})
endforeach(flag)
set(CMAKE_REQUIRED_FLAGS ${SAFE_CMAKE_REQUIRED_FLAGS})
endmacro(append_c_compiler_flags)
macro(append_cxx_compiler_flags _flags _name _result)
set(SAFE_CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS})
string(REGEX REPLACE "[-+/ ]" "_" cname "${_name}")
string(TOUPPER "${cname}" cname)
foreach(flag ${_flags})
string(REGEX REPLACE "^[-+/ ]+(.*)[-+/ ]*$" "\\1" flagname "${flag}")
string(REGEX REPLACE "[-+/ ]" "_" flagname "${flagname}")
string(TOUPPER "${flagname}" flagname)
set(have_flag "HAVE_${cname}_${flagname}")
set(CMAKE_REQUIRED_FLAGS "${flag}")
check_cxx_source_compiles("int main() { return 0; }" ${have_flag})
if(${have_flag})
set(${_result} "${${_result}} ${flag}")
endif(${have_flag})
endforeach(flag)
set(CMAKE_REQUIRED_FLAGS ${SAFE_CMAKE_REQUIRED_FLAGS})
endmacro(append_cxx_compiler_flags)

View File

@ -0,0 +1,15 @@
include(CheckCSourceCompiles)
macro(check_function_keywords _wordlist)
set(${_result} "")
foreach(flag ${_wordlist})
string(REGEX REPLACE "[-+/ ()]" "_" flagname "${flag}")
string(TOUPPER "${flagname}" flagname)
set(have_flag "HAVE_${flagname}")
check_c_source_compiles("${flag} void func(); void func() { } int main() { func(); return 0; }" ${have_flag})
if(${have_flag} AND NOT ${_result})
set(${_result} "${flag}")
# break()
endif(${have_flag} AND NOT ${_result})
endforeach(flag)
endmacro(check_function_keywords)

View File

@ -0,0 +1,109 @@
## Checks for large file support ##
include(CheckIncludeFile)
include(CheckSymbolExists)
include(CheckTypeSize)
macro(check_lfs _isenable)
set(LFS_OFF_T "")
set(LFS_FOPEN "")
set(LFS_FSEEK "")
set(LFS_FTELL "")
set(LFS_PRID "")
if(${_isenable})
set(SAFE_CMAKE_REQUIRED_DEFINITIONS "${CMAKE_REQUIRED_DEFINITIONS}")
set(CMAKE_REQUIRED_DEFINITIONS ${CMAKE_REQUIRED_DEFINITIONS}
-D_LARGEFILE_SOURCE -D_LARGE_FILES -D_FILE_OFFSET_BITS=64
-D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS)
check_include_file("sys/types.h" HAVE_SYS_TYPES_H)
check_include_file("inttypes.h" HAVE_INTTYPES_H)
check_include_file("stddef.h" HAVE_STDDEF_H)
check_include_file("stdint.h" HAVE_STDINT_H)
# LFS type1: 8 <= sizeof(off_t), fseeko, ftello
check_type_size("off_t" SIZEOF_OFF_T)
if(SIZEOF_OFF_T GREATER 7)
check_symbol_exists("fseeko" "stdio.h" HAVE_FSEEKO)
check_symbol_exists("ftello" "stdio.h" HAVE_FTELLO)
if(HAVE_FSEEKO AND HAVE_FTELLO)
set(LFS_OFF_T "off_t")
set(LFS_FOPEN "fopen")
set(LFS_FSEEK "fseeko")
set(LFS_FTELL "ftello")
check_symbol_exists("PRIdMAX" "inttypes.h" HAVE_PRIDMAX)
if(HAVE_PRIDMAX)
set(LFS_PRID "PRIdMAX")
else(HAVE_PRIDMAX)
check_type_size("long" SIZEOF_LONG)
check_type_size("int" SIZEOF_INT)
if(SIZEOF_OFF_T GREATER SIZEOF_LONG)
set(LFS_PRID "\"lld\"")
elseif(SIZEOF_LONG GREATER SIZEOF_INT)
set(LFS_PRID "\"ld\"")
else(SIZEOF_OFF_T GREATER SIZEOF_LONG)
set(LFS_PRID "\"d\"")
endif(SIZEOF_OFF_T GREATER SIZEOF_LONG)
endif(HAVE_PRIDMAX)
endif(HAVE_FSEEKO AND HAVE_FTELLO)
endif(SIZEOF_OFF_T GREATER 7)
# LFS type2: 8 <= sizeof(off64_t), fopen64, fseeko64, ftello64
if(NOT LFS_OFF_T)
check_type_size("off64_t" SIZEOF_OFF64_T)
if(SIZEOF_OFF64_T GREATER 7)
check_symbol_exists("fopen64" "stdio.h" HAVE_FOPEN64)
check_symbol_exists("fseeko64" "stdio.h" HAVE_FSEEKO64)
check_symbol_exists("ftello64" "stdio.h" HAVE_FTELLO64)
if(HAVE_FOPEN64 AND HAVE_FSEEKO64 AND HAVE_FTELLO64)
set(LFS_OFF_T "off64_t")
set(LFS_FOPEN "fopen64")
set(LFS_FSEEK "fseeko64")
set(LFS_FTELL "ftello64")
check_symbol_exists("PRIdMAX" "inttypes.h" HAVE_PRIDMAX)
if(HAVE_PRIDMAX)
set(LFS_PRID "PRIdMAX")
else(HAVE_PRIDMAX)
check_type_size("long" SIZEOF_LONG)
check_type_size("int" SIZEOF_INT)
if(SIZEOF_OFF64_T GREATER SIZEOF_LONG)
set(LFS_PRID "\"lld\"")
elseif(SIZEOF_LONG GREATER SIZEOF_INT)
set(LFS_PRID "\"ld\"")
else(SIZEOF_OFF64_T GREATER SIZEOF_LONG)
set(LFS_PRID "\"d\"")
endif(SIZEOF_OFF64_T GREATER SIZEOF_LONG)
endif(HAVE_PRIDMAX)
endif(HAVE_FOPEN64 AND HAVE_FSEEKO64 AND HAVE_FTELLO64)
endif(SIZEOF_OFF64_T GREATER 7)
endif(NOT LFS_OFF_T)
# LFS type3: 8 <= sizeof(__int64), _fseeki64, _ftelli64
if(NOT LFS_OFF_T)
check_type_size("__int64" SIZEOF___INT64)
if(SIZEOF___INT64 GREATER 7)
check_symbol_exists("_fseeki64" "stdio.h" HAVE__FSEEKI64)
check_symbol_exists("_ftelli64" "stdio.h" HAVE__FTELLI64)
if(HAVE__FSEEKI64 AND HAVE__FTELLI64)
set(LFS_OFF_T "__int64")
set(LFS_FOPEN "fopen")
set(LFS_FSEEK "_fseeki64")
set(LFS_FTELL "_ftelli64")
set(LFS_PRID "\"I64d\"")
endif(HAVE__FSEEKI64 AND HAVE__FTELLI64)
endif(SIZEOF___INT64 GREATER 7)
endif(NOT LFS_OFF_T)
set(CMAKE_REQUIRED_DEFINITIONS "${SAFE_CMAKE_REQUIRED_DEFINITIONS}")
endif(${_isenable})
if(NOT LFS_OFF_T)
## not found
set(LFS_OFF_T "long")
set(LFS_FOPEN "fopen")
set(LFS_FSEEK "fseek")
set(LFS_FTELL "ftell")
set(LFS_PRID "\"ld\"")
endif(NOT LFS_OFF_T)
endmacro(check_lfs)

View File

@ -0,0 +1,38 @@
# If the cmake version includes cpack, use it
IF(EXISTS "${CMAKE_ROOT}/Modules/CPack.cmake")
SET(CPACK_PACKAGE_DESCRIPTION_SUMMARY "${PROJECT_DESCRIPTION}")
SET(CPACK_PACKAGE_VENDOR "${PROJECT_VENDOR}")
SET(CPACK_PACKAGE_DESCRIPTION_FILE "${CMAKE_CURRENT_SOURCE_DIR}/README.md")
SET(CPACK_RESOURCE_FILE_LICENSE "${CMAKE_CURRENT_SOURCE_DIR}/LICENSE")
SET(CPACK_PACKAGE_VERSION_MAJOR "${PROJECT_VERSION_MAJOR}")
SET(CPACK_PACKAGE_VERSION_MINOR "${PROJECT_VERSION_MINOR}")
SET(CPACK_PACKAGE_VERSION_PATCH "${PROJECT_VERSION_PATCH}")
# SET(CPACK_PACKAGE_INSTALL_DIRECTORY "${PROJECT_NAME} ${PROJECT_VERSION}")
SET(CPACK_SOURCE_PACKAGE_FILE_NAME "${PROJECT_NAME}-${PROJECT_VERSION_FULL}")
IF(NOT DEFINED CPACK_SYSTEM_NAME)
SET(CPACK_SYSTEM_NAME "${CMAKE_SYSTEM_NAME}-${CMAKE_SYSTEM_PROCESSOR}")
ENDIF(NOT DEFINED CPACK_SYSTEM_NAME)
IF(${CPACK_SYSTEM_NAME} MATCHES Windows)
IF(CMAKE_CL_64)
SET(CPACK_SYSTEM_NAME win64-${CMAKE_SYSTEM_PROCESSOR})
ELSE(CMAKE_CL_64)
SET(CPACK_SYSTEM_NAME win32-${CMAKE_SYSTEM_PROCESSOR})
ENDIF(CMAKE_CL_64)
ENDIF(${CPACK_SYSTEM_NAME} MATCHES Windows)
IF(NOT DEFINED CPACK_PACKAGE_FILE_NAME)
SET(CPACK_PACKAGE_FILE_NAME "${CPACK_SOURCE_PACKAGE_FILE_NAME}-${CPACK_SYSTEM_NAME}")
ENDIF(NOT DEFINED CPACK_PACKAGE_FILE_NAME)
SET(CPACK_PACKAGE_CONTACT "${PROJECT_CONTACT}")
IF(UNIX)
SET(CPACK_STRIP_FILES "")
SET(CPACK_SOURCE_STRIP_FILES "")
# SET(CPACK_PACKAGE_EXECUTABLES "ccmake" "CMake")
ENDIF(UNIX)
SET(CPACK_SOURCE_IGNORE_FILES "/CVS/" "/build/" "/\\\\.build/" "/\\\\.svn/" "~$")
# include CPack model once all variables are set
INCLUDE(CPack)
ENDIF(EXISTS "${CMAKE_ROOT}/Modules/CPack.cmake")

View File

@ -0,0 +1,36 @@
IF(NOT EXISTS "@CMAKE_CURRENT_BINARY_DIR@/install_manifest.txt")
MESSAGE(FATAL_ERROR "Cannot find install manifest: \"@CMAKE_CURRENT_BINARY_DIR@/install_manifest.txt\"")
ENDIF(NOT EXISTS "@CMAKE_CURRENT_BINARY_DIR@/install_manifest.txt")
FILE(READ "@CMAKE_CURRENT_BINARY_DIR@/install_manifest.txt" files)
STRING(REGEX REPLACE "\n" ";" files "${files}")
SET(NUM 0)
FOREACH(file ${files})
IF(EXISTS "$ENV{DESTDIR}${file}")
MESSAGE(STATUS "Looking for \"$ENV{DESTDIR}${file}\" - found")
SET(UNINSTALL_CHECK_${NUM} 1)
ELSE(EXISTS "$ENV{DESTDIR}${file}")
MESSAGE(STATUS "Looking for \"$ENV{DESTDIR}${file}\" - not found")
SET(UNINSTALL_CHECK_${NUM} 0)
ENDIF(EXISTS "$ENV{DESTDIR}${file}")
MATH(EXPR NUM "1 + ${NUM}")
ENDFOREACH(file)
SET(NUM 0)
FOREACH(file ${files})
IF(${UNINSTALL_CHECK_${NUM}})
MESSAGE(STATUS "Uninstalling \"$ENV{DESTDIR}${file}\"")
EXEC_PROGRAM(
"@CMAKE_COMMAND@" ARGS "-E remove \"$ENV{DESTDIR}${file}\""
OUTPUT_VARIABLE rm_out
RETURN_VALUE rm_retval
)
IF(NOT "${rm_retval}" STREQUAL 0)
MESSAGE(FATAL_ERROR "Problem when removing \"$ENV{DESTDIR}${file}\"")
ENDIF(NOT "${rm_retval}" STREQUAL 0)
ENDIF(${UNINSTALL_CHECK_${NUM}})
MATH(EXPR NUM "1 + ${NUM}")
ENDFOREACH(file)
FILE(REMOVE "@CMAKE_CURRENT_BINARY_DIR@/install_manifest.txt")

21
src/libdivsufsort/LICENSE Normal file
View File

@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2003 Yuta Mori All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

140
src/libdivsufsort/README.md Normal file
View File

@ -0,0 +1,140 @@
# libdivsufsort
libdivsufsort is a software library that implements a lightweight suffix array construction algorithm.
## News
* 2015-03-21: The project has moved from [Google Code](http://code.google.com/p/libdivsufsort/) to [GitHub](https://github.com/y-256/libdivsufsort)
## Introduction
This library provides a simple and an efficient C API to construct a suffix array and a Burrows-Wheeler transformed string from a given string over a constant-size alphabet.
The algorithm runs in O(n log n) worst-case time using only 5n+O(1) bytes of memory space, where n is the length of
the string.
## Build requirements
* An ANSI C Compiler (e.g. GNU GCC)
* [CMake](http://www.cmake.org/ "CMake") version 2.4.2 or newer
* CMake-supported build tool
## Building on GNU/Linux
1. Get the source code from GitHub. You can either
* use git to clone the repository
```
git clone https://github.com/y-256/libdivsufsort.git
```
* or download a [zip file](../../archive/master.zip) directly
2. Create a `build` directory in the package source directory.
```shell
$ cd libdivsufsort
$ mkdir build
$ cd build
```
3. Configure the package for your system.
If you want to install to a different location, change the -DCMAKE_INSTALL_PREFIX option.
```shell
$ cmake -DCMAKE_BUILD_TYPE="Release" \
-DCMAKE_INSTALL_PREFIX="/usr/local" ..
```
4. Compile the package.
```shell
$ make
```
5. (Optional) Install the library and header files.
```shell
$ sudo make install
```
## API
```c
/* Data types */
typedef int32_t saint_t;
typedef int32_t saidx_t;
typedef uint8_t sauchar_t;
/*
* Constructs the suffix array of a given string.
* @param T[0..n-1] The input string.
* @param SA[0..n-1] The output array or suffixes.
* @param n The length of the given string.
* @return 0 if no error occurred, -1 or -2 otherwise.
*/
saint_t
divsufsort(const sauchar_t *T, saidx_t *SA, saidx_t n);
/*
* Constructs the burrows-wheeler transformed string of a given string.
* @param T[0..n-1] The input string.
* @param U[0..n-1] The output string. (can be T)
* @param A[0..n-1] The temporary array. (can be NULL)
* @param n The length of the given string.
* @return The primary index if no error occurred, -1 or -2 otherwise.
*/
saidx_t
divbwt(const sauchar_t *T, sauchar_t *U, saidx_t *A, saidx_t n);
```
## Example Usage
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <divsufsort.h>
int main() {
// intput data
char *Text = "abracadabra";
int n = strlen(Text);
int i, j;
// allocate
int *SA = (int *)malloc(n * sizeof(int));
// sort
divsufsort((unsigned char *)Text, SA, n);
// output
for(i = 0; i < n; ++i) {
printf("SA[%2d] = %2d: ", i, SA[i]);
for(j = SA[i]; j < n; ++j) {
printf("%c", Text[j]);
}
printf("$\n");
}
// deallocate
free(SA);
return 0;
}
```
See the [examples](examples) directory for a few other examples.
## Benchmarks
See [Benchmarks](https://github.com/y-256/libdivsufsort/blob/wiki/SACA_Benchmarks.md) page for details.
## License
libdivsufsort is released under the [MIT license](LICENSE "MIT license").
> The MIT License (MIT)
>
> Copyright (c) 2003 Yuta Mori All rights reserved.
>
> Permission is hereby granted, free of charge, to any person obtaining a copy
> of this software and associated documentation files (the "Software"), to deal
> in the Software without restriction, including without limitation the rights
> to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> copies of the Software, and to permit persons to whom the Software is
> furnished to do so, subject to the following conditions:
>
> The above copyright notice and this permission notice shall be included in all
> copies or substantial portions of the Software.
>
> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> SOFTWARE.
## Author
* Yuta Mori

View File

@ -0,0 +1,23 @@
set(PROJECT_VERSION_MAJOR "2")
set(PROJECT_VERSION_MINOR "0")
set(PROJECT_VERSION_PATCH "2")
set(PROJECT_VERSION_EXTRA "-1")
set(PROJECT_VERSION "${PROJECT_VERSION_MAJOR}.${PROJECT_VERSION_MINOR}")
set(PROJECT_VERSION_FULL "${PROJECT_VERSION_MAJOR}.${PROJECT_VERSION_MINOR}.${PROJECT_VERSION_PATCH}${PROJECT_VERSION_EXTRA}")
set(LIBRARY_VERSION "3.0.1")
set(LIBRARY_SOVERSION "3")
## Git revision number ##
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/.git")
execute_process(COMMAND git describe --tags HEAD
WORKING_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}"
OUTPUT_VARIABLE GIT_DESCRIBE_TAGS ERROR_QUIET)
if(GIT_DESCRIBE_TAGS)
string(REGEX REPLACE "^v(.*)" "\\1" GIT_REVISION "${GIT_DESCRIBE_TAGS}")
string(STRIP "${GIT_REVISION}" GIT_REVISION)
if(GIT_REVISION)
set(PROJECT_VERSION_FULL "${GIT_REVISION}")
endif(GIT_REVISION)
endif(GIT_DESCRIBE_TAGS)
endif(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/.git")

View File

@ -0,0 +1,11 @@
## Add definitions ##
add_definitions(-D_LARGEFILE_SOURCE -D_LARGE_FILES -D_FILE_OFFSET_BITS=64)
## Targets ##
include_directories("${CMAKE_CURRENT_SOURCE_DIR}/../include"
"${CMAKE_CURRENT_BINARY_DIR}/../include")
link_directories("${CMAKE_CURRENT_BINARY_DIR}/../lib")
foreach(src suftest mksary sasearch bwt unbwt)
add_executable(${src} ${src}.c)
target_link_libraries(${src} divsufsort)
endforeach(src)

View File

@ -0,0 +1,220 @@
/*
* bwt.c for libdivsufsort
* Copyright (c) 2003-2008 Yuta Mori All Rights Reserved.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*/
#if HAVE_CONFIG_H
# include "config.h"
#endif
#include <stdio.h>
#if HAVE_STRING_H
# include <string.h>
#endif
#if HAVE_STDLIB_H
# include <stdlib.h>
#endif
#if HAVE_MEMORY_H
# include <memory.h>
#endif
#if HAVE_STDDEF_H
# include <stddef.h>
#endif
#if HAVE_STRINGS_H
# include <strings.h>
#endif
#if HAVE_SYS_TYPES_H
# include <sys/types.h>
#endif
#if HAVE_IO_H && HAVE_FCNTL_H
# include <io.h>
# include <fcntl.h>
#endif
#include <time.h>
#include <divsufsort.h>
#include "lfs.h"
static
size_t
write_int(FILE *fp, saidx_t n) {
unsigned char c[4];
c[0] = (unsigned char)((n >> 0) & 0xff), c[1] = (unsigned char)((n >> 8) & 0xff),
c[2] = (unsigned char)((n >> 16) & 0xff), c[3] = (unsigned char)((n >> 24) & 0xff);
return fwrite(c, sizeof(unsigned char), 4, fp);
}
static
void
print_help(const char *progname, int status) {
fprintf(stderr,
"bwt, a burrows-wheeler transform program, version %s.\n",
divsufsort_version());
fprintf(stderr, "usage: %s [-b num] INFILE OUTFILE\n", progname);
fprintf(stderr, " -b num set block size to num MiB [1..512] (default: 32)\n\n");
exit(status);
}
int
main(int argc, const char *argv[]) {
FILE *fp, *ofp;
const char *fname, *ofname;
sauchar_t *T;
saidx_t *SA;
LFS_OFF_T n;
size_t m;
saidx_t pidx;
clock_t start,finish;
saint_t i, blocksize = 32, needclose = 3;
/* Check arguments. */
if((argc == 1) ||
(strcmp(argv[1], "-h") == 0) ||
(strcmp(argv[1], "--help") == 0)) { print_help(argv[0], EXIT_SUCCESS); }
if((argc != 3) && (argc != 5)) { print_help(argv[0], EXIT_FAILURE); }
i = 1;
if(argc == 5) {
if(strcmp(argv[i], "-b") != 0) { print_help(argv[0], EXIT_FAILURE); }
blocksize = atoi(argv[i + 1]);
if(blocksize < 0) { blocksize = 1; }
else if(512 < blocksize) { blocksize = 512; }
i += 2;
}
blocksize <<= 20;
/* Open a file for reading. */
if(strcmp(argv[i], "-") != 0) {
#if HAVE_FOPEN_S
if(fopen_s(&fp, fname = argv[i], "rb") != 0) {
#else
if((fp = LFS_FOPEN(fname = argv[i], "rb")) == NULL) {
#endif
fprintf(stderr, "%s: Cannot open file `%s': ", argv[0], fname);
perror(NULL);
exit(EXIT_FAILURE);
}
} else {
#if HAVE__SETMODE && HAVE__FILENO
if(_setmode(_fileno(stdin), _O_BINARY) == -1) {
fprintf(stderr, "%s: Cannot set mode: ", argv[0]);
perror(NULL);
exit(EXIT_FAILURE);
}
#endif
fp = stdin;
fname = "stdin";
needclose ^= 1;
}
i += 1;
/* Open a file for writing. */
if(strcmp(argv[i], "-") != 0) {
#if HAVE_FOPEN_S
if(fopen_s(&ofp, ofname = argv[i], "wb") != 0) {
#else
if((ofp = LFS_FOPEN(ofname = argv[i], "wb")) == NULL) {
#endif
fprintf(stderr, "%s: Cannot open file `%s': ", argv[0], ofname);
perror(NULL);
exit(EXIT_FAILURE);
}
} else {
#if HAVE__SETMODE && HAVE__FILENO
if(_setmode(_fileno(stdout), _O_BINARY) == -1) {
fprintf(stderr, "%s: Cannot set mode: ", argv[0]);
perror(NULL);
exit(EXIT_FAILURE);
}
#endif
ofp = stdout;
ofname = "stdout";
needclose ^= 2;
}
/* Get the file size. */
if(LFS_FSEEK(fp, 0, SEEK_END) == 0) {
n = LFS_FTELL(fp);
rewind(fp);
if(n < 0) {
fprintf(stderr, "%s: Cannot ftell `%s': ", argv[0], fname);
perror(NULL);
exit(EXIT_FAILURE);
}
if(0x20000000L < n) { n = 0x20000000L; }
if((blocksize == 0) || (n < blocksize)) { blocksize = (saidx_t)n; }
} else if(blocksize == 0) { blocksize = 32 << 20; }
/* Allocate 5blocksize bytes of memory. */
T = (sauchar_t *)malloc(blocksize * sizeof(sauchar_t));
SA = (saidx_t *)malloc(blocksize * sizeof(saidx_t));
if((T == NULL) || (SA == NULL)) {
fprintf(stderr, "%s: Cannot allocate memory.\n", argv[0]);
exit(EXIT_FAILURE);
}
/* Write the blocksize. */
if(write_int(ofp, blocksize) != 4) {
fprintf(stderr, "%s: Cannot write to `%s': ", argv[0], ofname);
perror(NULL);
exit(EXIT_FAILURE);
}
fprintf(stderr, " BWT (blocksize %" PRIdSAINT_T ") ... ", blocksize);
start = clock();
for(n = 0; 0 < (m = fread(T, sizeof(sauchar_t), blocksize, fp)); n += m) {
/* Burrows-Wheeler Transform. */
pidx = divbwt(T, T, SA, m);
if(pidx < 0) {
fprintf(stderr, "%s (bw_transform): %s.\n",
argv[0],
(pidx == -1) ? "Invalid arguments" : "Cannot allocate memory");