Hello Guillem,
Post by Guillem Jover
TBH this smells like FUD. For example I've never heard of corruption in
.xz files due to non-robustness, I'd expect that corruption to come from
external forces, and that integrity would help or not detect it.
Sure it comes from external forces, but xz does something that no other
compressor does: even if the corruption does not affect the data and xz
is able to produce perfectly correct output, it will report "Compressed
data is corrupt" and will exit with non-zero status anyway. Just take
any xz file and append a null character to it. Bzip2, gzip and lzip
simply ignore the extra byte.
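This is easy to reproduce. Here is a minimal sketch in Python (it
assumes reasonably recent versions of the four command-line tools are
installed; 'sample' and 'damaged' are illustrative names, any test file
will do):

  import shutil
  import subprocess

  # Compress 'sample' with each tool (-k keeps the input, -f overwrites
  # any old output), append one null byte to a copy of the result, and
  # test the damaged copy. Only xz should exit with non-zero status.
  for tool, ext in [('xz', '.xz'), ('bzip2', '.bz2'),
                    ('gzip', '.gz'), ('lzip', '.lz')]:
      subprocess.run([tool, '-kf', 'sample'], check=True)
      damaged = 'damaged' + ext
      shutil.copyfile('sample' + ext, damaged)
      with open(damaged, 'ab') as f:
          f.write(b'\x00')                    # the extra trailing byte
      r = subprocess.run([tool, '-t', damaged], capture_output=True)
      print('%-5s exit status %d  %s' % (tool, r.returncode,
                                         r.stderr.decode().strip()))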
But not only that. Xz is the only format (of the four mentioned) whose
parts need to be aligned to a multiple of four bytes. The size of an xz
file must also be a multiple of four bytes. To achieve this, xz includes
padding everywhere; after headers, blocks, the index, and the whole
stream. The bad news is that if the (useless) padding is altered in any
way, "the decoder MUST indicate an error" according to the xz format
specification.
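The padding fragility is just as easy to demonstrate (a sketch again,
reusing 'sample.xz' from the sketch above; per my reading of the spec,
four null bytes are valid stream padding, while the same four bytes with
one of them altered must be rejected, even though no compressed data is
touched):

  import shutil
  import subprocess

  for tag, padding in [('valid-padding', b'\x00\x00\x00\x00'),
                       ('altered-padding', b'\x00\x00\x01\x00')]:
      dst = tag + '.xz'
      shutil.copyfile('sample.xz', dst)
      with open(dst, 'ab') as f:
          f.write(padding)                    # append the padding bytes
      r = subprocess.run(['xz', '-t', dst], capture_output=True)
      print('%-15s exit status %d  %s' % (tag, r.returncode,
                                          r.stderr.decode().strip()))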
This is especially bad when xz is used with tar, as it causes the whole
command to fail and the whole archive to be discarded as corrupt.
And this fragility is one of the perverse effects of the unbelievably
stupid design of xz: "It is possible that there is a new field present
which the decoder is not aware of, and can thus parse the Block Header
incorrectly[1]".
[1] http://tukaani.org/xz/xz-file-format.txt (see 3.1.6. Header Padding)
So yes, the xz format is objectively more fragile than the other three.
Post by Guillem Jover
In any case .xz supports CRC32, CRC64 and SHA-256 for integrity
checks, .lz only supports CRC32.
To begin with, the assertion that lzip "only supports CRC32" is false.
Lzip provides 4-factor integrity checking: CRC32, uncompressed size,
compressed size, and the value remaining in the range decoder after the
decoding of the end-of-stream marker.
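Three of those factors are plainly visible in the 20-byte member
trailer and can be read with a few lines of Python (a sketch; it reuses
'sample.lz' and 'sample' from the first sketch and assumes the file
contains a single member):

  import struct
  import zlib

  with open('sample.lz', 'rb') as f:
      member = f.read()
  # Trailer layout: CRC32 of the uncompressed data (4 bytes), size of
  # the uncompressed data (8 bytes), member size (8 bytes), all
  # little-endian.
  crc32, data_size, member_size = struct.unpack('<LQQ', member[-20:])
  print('stored CRC32      : %08X' % crc32)
  print('stored data size  : %d' % data_size)
  print('stored member size: %d (actual %d)' % (member_size, len(member)))
  with open('sample', 'rb') as f:
      print('actual CRC32      : %08X' % (zlib.crc32(f.read()) & 0xFFFFFFFF))
  # The fourth factor, the value remaining in the range decoder after
  # the end-of-stream marker, lives inside the decoder, not in the file.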
Do you know of any case where bzip2, gzip or lzip silently produced
invalid output because of any weakness in their integrity checking?
Have you considered that maybe lzip provides optimal integrity checking,
while xz is just throwing buzzwords at the naive, just like it did with
LZMA2[2]? Bigger does not always mean better.
[2] http://www.nongnu.org/lzip/lzip_benchmark.html (see Lzip vs xz)
Lzip is very good at detecting errors. You may have noticed that in case
of corruption, instead of the unhelpful "Compressed data is corrupt"
reported by xz, lzip says something like "Decoder error at pos 1234".
This leaves very little work for the CRC32 in the detection of errors.
Also, lzip reports mismatches in the four factors separately. This way,
if one factor fails but the other three are OK, the corruption most
probably affects the file trailer and the data can be considered intact.
Lzip's corruption detection is robust.
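You can see the difference in diagnostics with another small sketch: it
flips one bit in the middle of each file from the first sketch and
feeds the result to the tool on stdin:

  import subprocess

  for name, tool in [('sample.lz', 'lzip'), ('sample.xz', 'xz')]:
      data = bytearray(open(name, 'rb').read())
      data[len(data) // 2] ^= 0x01           # flip one bit mid-file
      r = subprocess.run([tool, '-t'], input=bytes(data),
                         capture_output=True)
      print('%-4s: %s' % (tool, r.stderr.decode(errors='replace').strip()))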
Let's run some tests. I have taken a small file (the COPYING file
distributed with the lzip source), compressed it, and then tried the
effect of all possible bit-flips on the decompression. (I used the
'unzcrash' tool distributed with lziprecover; a rough Python equivalent
of it is sketched after the listing below.)
-rw-r--r-- 1 18025 Jun 16 2014 COPYING
-rw-r--r-- 1 6150 Jun 16 2014 COPYING.bz2
-rw-r--r-- 1 6839 Jun 16 2014 COPYING.gz
-rw-r--r-- 1 6507 Jun 16 2014 COPYING.lz
-rw-r--r-- 1 6544 Jun 16 2014 COPYING.xz
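For those without lziprecover at hand, the essence of unzcrash fits in
a few lines of Python (a rough and slow sketch, not the real tool; it
assumes the decompressor accepts '-dc' and reads stdin):

  import re
  import subprocess
  from collections import Counter

  def bit_flip_test(tool, compressed_file, original_file):
      data = bytearray(open(compressed_file, 'rb').read())
      good = open(original_file, 'rb').read()
      tally = Counter()
      for i in range(len(data) * 8):
          data[i // 8] ^= 1 << (i % 8)       # flip one bit
          r = subprocess.run([tool, '-dc'], input=bytes(data),
                             capture_output=True)
          if r.returncode == 0:
              tally['correct output, status 0' if r.stdout == good
                    else 'WRONG output, status 0'] += 1
          else:
              # collapse numbers like "pos 1234" so messages tally together
              msg = re.sub(r'\d+', 'N', r.stderr.decode(errors='replace'))
              tally[msg.strip()] += 1
          data[i // 8] ^= 1 << (i % 8)       # restore the bit
      for msg, n in tally.most_common():
          print('%6d  %s' % (n, msg))

  bit_flip_test('lzip', 'COPYING.lz', 'COPYING')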
Bzip2 seems to depend entirely on the CRCs for the detection of errors,
but it provides 2 levels of CRCs: one for each block and one for the
whole stream. Of the 49200 bit-flips in COPYING.bz2, 29 were rejected
(bad magic), 49163 were caught by the CRC, and 8 produced correct
output with status 0.
Gzip has 2-factor integrity checking and some ability to detect format
violations. Of the 54712 bit-flips in COPYING.gz, 16 were rejected (bad
magic), 5 failed because of bad flags, 53952 were caught by the CRC,
26207 were caught by the uncompressed length, 540 were caught as format
violations, 44 failed because of unexpected EOF, 8 failed because of
unknown compression method, and 116 produced correct output with status 0.
Lzip has 4-factor integrity checking and an excellent ability to detect
data errors. Of the 52056 bit-flips in COPYING.lz, 32 were rejected (bad
magic), 51171 were caught by the decoder, 19 were caught by the value
remaining in the range decoder, 652 failed because of unexpected EOF, 7
failed because of unsupported format version, 3 failed because of
invalid dictionary size, 32 reported bad CRC, 64 reported bad
uncompressed size, 64 reported bad compressed size, and 12 produced
correct output with status 0.
Note that the bad CRCs and sizes reported by lzip correspond to
bit-flips in the CRCs and sizes themselves. In this particular file all
the bit-flips in the compressed data were caught by the decoder. The
integrity information was not even needed.
Xz's error messages are particularly unhelpful about how the corruption
was detected. Of the 52352 bit-flips in COPYING.xz, 48 were rejected
(bad magic), 52299 reported "Compressed data is corrupt", 5 failed
because of unexpected EOF, and not even one produced correct output with
status 0.
It is not at all clear that xz detects errors any better than lzip, but
xz has a point of failure not present in the other formats: if the
corruption affects the stream flags, xz won't know the size of the
checksum and won't be able to decode the stream.
Post by Guillem Jover
More over lzip was created to overcome limitations in the .lzma format,
.xz came later and fixed the limitations of the .lzma format too.
Lzip certainly overcame the limitations in the .lzma format, but in this
respect xz seems to have merely changed the false negatives of lzma into
false positives: from not reporting the corruption to reporting it "just
in case" even if the decoding went well.
Post by Guillem Jover
(And I could probably switch dpkg-deb's .xz integrity check to CRC64,
given that's the xz-utils command-line tool default.)
Have you verified whether this really helps or makes things worse? Doubling
the size of the checksum also doubles the probability of false positives
produced by the corruption of the checksum.
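A back-of-the-envelope calculation for a single random bit-flip in a
file the size of COPYING.xz illustrates the point (an illustration
only; real corruption is rarely one uniformly random bit):

  file_bits = 6544 * 8                      # COPYING.xz is 6544 bytes
  for name, check_bits in [('CRC32', 32), ('CRC64', 64), ('SHA-256', 256)]:
      p = 100.0 * check_bits / file_bits    # chance the flip hits the check
      print('%-7s  P(flip lands in the check field) = %.3f%%' % (name, p))

A flip that lands in the check field turns intact data into a reported
"corruption".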
Post by Guillem Jover
replacing xz with lzip on .deb or .dsc packages does not make any sense.
Why? Are .deb or .dsc packages using the filters of xz or something?
Post by Guillem Jover
Whenever considering to add a new compressor, all surrounding tools need
The future is long. You can save a lot of work in the long term by
adding lzip and deprecating the rest.
Post by Guillem Jover
Compressor formats are subject to network-effects like many other
file formats. In this case I think .xz "won" both because it was the
"official" successor from .lzma, and because it is superior to .lz.
You can repeat that .xz is superior to .lz as much as you want, but this
won't make it true. The xz format is so bad that it manages to be as bad
for long-term archiving as lzma-alone. Meanwhile lziprecover achieves
the unprecedented feat of repairing single-byte errors without the help
of any extra redundancy. Could you tell us in what aspect .xz is
superior to .lz?
I am not here to "win", but to help people keep their data safe.
Remember that I am also the author of GNU ddrescue (whose data recovery
capabilities nicely complement those of lziprecover). Given that this is
Debian, I hope that you will think of the public interest and
eventually replace xz with lzip.
It would be especially fitting for Debian to be the first distro to
deprecate such a bad format, given that xz is also not copylefted. Just
a few days ago we were discussing in GNU how easily non-copylefted
software can be rendered non-free by proprietary licenses like the
Android SDK anti-fork provision.
Best regards,
Antonio.