The project goal is to build a data decompressor using a simple, lossless, (de)compression algorithm. A lossless compression algorithm takes some data and produces an alternative representation using fewer bits than the original representation. It must be possible to recover the full original representation of the data from the compressed version. We will be using a simplified version of the popular LZ77 algorithm.
Your task is to build a decompression program. This program will take the name of a compressed file as an argument and produce its decompressed version. It must read the compressed file, decompress it and generate an output file as a result. The contents of the output file must be equal to the contents of the original file, previous to compression.
Here follows the compressed data format and the procedure to decompress them.
The input (compressed) file contains bytes. Those bytes represent a sequence of symbols, which must be processed by the decompressor in order to recover the original (uncompressed) data.
Compressed data is a sequence of symbols coded as bytes. There are two types of symbols:
Symbols must be decoded in order and one by one, to recover the original data.
As an example, let us say we read the following sequence of
symbols from a compressed file: literal(3)
,
literal(4)
, literal(5)
,
literal(6)
, literal(7)
,
literal(8)
, repetition(offset=4,
length=3)
. Each literal symbol will be decoded as itself,
until literal(8)
, the contents of the uncompressed
file should be byte 3, 4, 5, 6, 7 and 8. Byte 8 is the more
recent and byte 3 is the oldest. Then the repetition symbol
repetition(offset=4, length=3)
comes, representing
bytes 5, 6 and 7 (step back 4 bytes, and repeat the following 3
bytes from there). The final sequence of uncompressed bytes will
be: 3, 4, 5, 6, 7, 8, 5, 6 and 7.
As files in a disk are coded as a sequence of bytes, the symbols used to compress bytes have to be encoded as bytes themselves when stored in a file. Therefore, you will need to read the symbols as a sequence of bytes from a file. Each symbol is encoded as follows:
literal(45)
is
encoded as byte 45.
offset = 1 +
D0 + (256*D1)
. As an example, D0=7 and D1=3 means an
offset of 1 + 7 + (256*3) = 776.
As an example, let us say we have the following bytes (compressed data): 1, 2, 3, 4, 5, 255, 255, 6, 7, 254, 253, 255, 0, 5, 0, 255, 5, 9, 0. The corresponding symbols would be:
literal(1)
.
literal(2)
.
literal(3)
.
literal(4)
.
literal(5)
.
literal(255)
.
literal(6)
.
literal(7)
.
literal(254)
.
literal(253)
.
repetition(offset=6, length=5)
.
repetition(offset=10, length=10)
.
Decompressing those symbols would produce the following uncompressed data: 1, 2, 3, 4, 5, 255, 6, 7, 254, 253, 5, 255, 6, 7, 254, 255, 6, 7, 254, 253, 5, 255, 6, 7 and 254.
The maximum offset we can use is 65536 (1 + 255 + (256 * 255)). Therefore, it makes no sense to look for byte sequence repetitions longer than 65536 bytes back. Thus, the decompressor must only have a buffer of enough size as to store the last 65536 decoded bytes. Initially, before decompressing the first byte, the decompressor must assume that the last 65536 bytes have a value of 0.
Write a program that decompresses an existing file and stores the
result in a newly created file. There must be a class called
Decompressor
with a main
method. This
method will receive two arguments from the command line: the name
of the (compressed) input file and the name of the (decompressed)
output file. If the output file already exists, the program will
overwrite it. This is how a typical invocation would look like:
java Decompressor compressed.dat uncompressed.dat
If there is some input/output error while reading the input file
or writing the output file, the program will exit with an
IOException
. If the input file cannot be
decompressed due to bad format or corrupted data (for instance, a byte 255 that starts a multibyte symbol is read but there are no more available bytes in the file), the program will exit using an ad-hoc exception.
A good design of the program will help you to get higher grades. On one hand, the teachers will be glad to read simple, clean and well organized code with a nice object oriented touch. On the other hand, a high quality code will help you in the project exam, where you will be asked to modify your own code for extended functionality in a limited time. Here we propose some tips for submitting quality code. Feel free to ignore them if you think of better designs (the teachers will be glad to grade you higher if your design is good).
Generally speaking, we advise you to distribute functionality among classes, like this:
Decompressed data can be stored in its own class, using an array
of 65536 bytes (byte buffer[65536]
). This class can
have methods for adding new data and reading old data from an
offset and a length.
A circular buffer can be useful here, as the one you saw for the implementation of queues with an array. When adding new data fills the array you should continue adding in the positions at the beginning, overwritting them, as no more than 65536 bytes are needed for decompressing the current symbol.
To help you check that your buffer is working fine, we suggest you to test smaller buffers first (8 or 16 bytes length), as it will be much easier to fill them with accountable data in your tests.
You can use Java's byte
data type for your
implementation, but there are some important details you must be
aware of.
First, a byte
in Java is an 8 bits numbers with
sign, meaning it stores numbers between -128 and 127
(inclusively), using two's
complement representation. While reading bytes from a
file, bytes between 128 and 255 will be stored in Java as
negative values. As an example, a 255 byte in the file will be
stored in a Java byte as an -1, a 254 will be stored as -2, etc.,
until 128, which is stored as -128. Bytes between 0 and 127 will
be stored in Java bytes as themselves. This means that checking
for a 255 on a file will require you to compare with -1 in Java.
This have implications also in the offset calculation (1 +
D0 + (256 * D1)
). A way to solve this is using Java's
bitwise operators, like this:
byte d0 = ... byte d1 = ... int offset = 1 + (d0 & 0xFF) + ((d1 & 0xFF) << 8);
Another important detail is when calling methods that has
byte
as arguments; if you want to pass literal
values you can do something like this:
public void method(byte value) { (...) } (...) method((byte) 5);
Here the explicit casting to byte
is required, as
the compiler will think that 5
is an
int
instead of a byte
.
In the following table you will find some examples of compressed files and their uncompressed counterparts. You can use these files to test your decompressor program. Again we suggest you to begin with the smaller files. You can also create your own test files using an hex editor.
Description | Compressed | Uncompressed | ||
---|---|---|---|---|
File | Size | File | Size | |
Only simple literals | literals-1.in | 6 | literals-1.out | 6 |
Only literals including 255 at the beginning | literals-2.in | 8 | literals-2.out | 7 |
Only literals including 255 at the end | literals-3.in | 8 | literals-3.out | 7 |
Only literals including 255 in the middle | literals-4.in | 8 | literals-4.out | 7 |
Only literals including adjacent 255 | literals-5.in | 10 | literals-5.out | 8 |
Only literals 1 | literals-6.in | 18 | literals-6.out | 12 |
Only literals 2 | literals-7.in | 4 | literals-7.out | 4 |
Project example | example.in | 19 | example.out | 25 |
Single repetition 1 | repetition-1.in | 10 | repetition-1.out | 12 |
Double repetition | repetition-2.in | 14 | repetition-2.out | 18 |
Prefetch repetition | repetition-3.in | 6 | repetition-3.out | 7 |
Triple repetition | repetition-4.in | 18 | repetition-4.out | 24 |
Single repetition 2 | repetition-5.in | 16 | repetition-5.out | 18 |
Buffer overflow 1 | overflow-1.in | 65538 | overflow-1.out | 65538 |
Buffer overflow 2 | overflow-2.in | 196609 | overflow-2.out | 196609 |
Buffer overflow + repetition 1 | overflow-3.in | 65542 | overflow-3.out | 65543 |
Buffer overflow + repetition 2 | overflow-4.in | 65542 | overflow-4.out | 65543 |
Incorrect literal | incorrect-1.in | 3 | (error expected) | - |
Incorrect repetition 1 | incorrect-2.in | 4 | (error expected) | - |
Incorrect repetition 2 | incorrect-3.in | 5 | (error expected) | - |
Simple example with all symbols | simple.txt.psz | 53 | simple.txt | 58 |
Don Quijote de La Mancha | quijote.txt.psz | 2702 | quijote.txt | 3128 |
Sample photo (BMP format) | photo.bmp.psz | 2023506 | photo.bmp | 2359418 |
The majority of the files above are binary. It is better to examine them with an hex editor. For your convenience, you can download a ZIP file with all the tests above through this link.
You must work in this project in groups of two, except if your teacher told you otherwise. You must submit your group composition to your teacher in time. All group members must belong to the same small group (lab group).
One (and only one) of the members of the group must submit the project, using the submission link that will be available in your AulaGlobal small group web interface. The submission date will be announced as an AulaGlobal notice. Submissions after the due date wil be ignored.
You must submit a single ZIP file. It will only contain your
project source code (.java
files) and a plain text
readme.txt
file. You must not submit compiled classes
(.class
files) or any other kind of file.
The readme.txt
file will include the group members'
NIA, surnames and names. Optionally, it can also include a brief
description (4 paragraphs or less) of your
program design. Any relevant fact that your teacher must know
about your project must be included here. For instance, if any of
the members of your group has not worked at all it the project, or
if you are aware of certain bugs in your program that you were not
able to fix in time (grades can be improved by showing a good
understanding of your bugs and how do you think they can be fixed,
actually fixing them will be graded higher, of course).
The name of the ZIP file must be
group-XX-YYYYYYYYY-ZZZZZZZZZ.zip
, with
XX
being your small group (lab group) number,
YYYYYYYYY
the smallest NIA of your project group
members and ZZZZZZZZZ
the NIA of the other
member. For example:
group-95-188888888-199999999.zip
.
Your project source code must be your own original work, not someone else's. If the teachers detect any kind of plagiarism in your code (from other students outside your project group, third-party academy teachers...) the plagiarism protocol will be called (ask your teacher for an English version, if you need it) and you will be graded with a 0 (failure). The teachers will be using plagiarism checking tools on your submitted code, that will detect common patterns in code between all projects of all groups and courses. These tools will detect plagiarism even if the code is later modified by changing comments, variable and methods names, whitespace changes, switching loop types or moving pieces of code around.