Hello Guillem,
Post by Guillem Jover
TBH this smells like FUD. For example I've never heard of corruption in
.xz files due to non-robustness, I'd expect that corruption to come from
external forces, and that integrity would help or not detect it.
Sure it comes from external forces, but xz does something that no other
compressor does: even if the corruption does not affect the data and xz
is able to produce perfectly correct output, it will report "Compressed
data is corrupt" and will exit with non-zero status anyway. Just take
any xz file and append a null character to it. Bzip2, gzip and lzip
simply ignore the extra byte.
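This is easy to reproduce. Here is a minimal sketch in Python (it
assumes reasonably recent versions of the four command-line tools are
installed; 'sample' and 'damaged' are illustrative names, any test file
will do):

  import shutil
  import subprocess

  # Compress 'sample' with each tool (-k keeps the input, -f overwrites
  # any old output), append one null byte to a copy of the result, and
  # test the damaged copy. Only xz should exit with non-zero status.
  for tool, ext in [('xz', '.xz'), ('bzip2', '.bz2'),
                    ('gzip', '.gz'), ('lzip', '.lz')]:
      subprocess.run([tool, '-kf', 'sample'], check=True)
      damaged = 'damaged' + ext
      shutil.copyfile('sample' + ext, damaged)
      with open(damaged, 'ab') as f:
          f.write(b'\x00')                    # the extra trailing byte
      r = subprocess.run([tool, '-t', damaged], capture_output=True)
      print('%-5s exit status %d  %s' % (tool, r.returncode,
                                         r.stderr.decode().strip()))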
But not only that. Xz is the only format (of the four mentioned) whose
parts need to be aligned to a multiple of four bytes. The size of an xz
file must also be a multiple of four bytes. To achieve this, xz includes
padding everywhere; after headers, blocks, the index, and the whole
stream. The bad news is that if the (useless) padding is altered in any
way, "the decoder MUST indicate an error" according to the xz format
specification.
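The padding fragility is just as easy to demonstrate (a sketch again,
reusing 'sample.xz' from the sketch above; per my reading of the spec,
four null bytes are valid stream padding, while the same four bytes with
one of them altered must be rejected, even though no compressed data is
touched):

  import shutil
  import subprocess

  for tag, padding in [('valid-padding', b'\x00\x00\x00\x00'),
                       ('altered-padding', b'\x00\x00\x01\x00')]:
      dst = tag + '.xz'
      shutil.copyfile('sample.xz', dst)
      with open(dst, 'ab') as f:
          f.write(padding)                    # append the padding bytes
      r = subprocess.run(['xz', '-t', dst], capture_output=True)
      print('%-15s exit status %d  %s' % (tag, r.returncode,
                                          r.stderr.decode().strip()))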
This is especially bad when xz is used with tar, as it causes the whole
command to fail and the whole archive to be discarded as corrupt.
And this fragility is one of the perverse effects of the unbelievably
stupid design of xz: "It is possible that there is a new field present
which the decoder is not aware of, and can thus parse the Block Header
incorrectly[1]".
[1] http://tukaani.org/xz/xz-file-format.txt (see 3.1.6. Header Padding)
So yes, the xz format is objectively more fragile than the other three.
Post by Guillem Jover
In any case .xz supports CRC32, CRC64 and SHA-256 for integrity
checks, .lz only supports CRC32.
To begin with, the assertion that lzip "only supports CRC32" is false.
Lzip provides 4-factor integrity checking: CRC32, uncompressed size,
compressed size, and the value remaining in the range decoder after the
decoding of the end-of-stream marker.
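Three of those factors are plainly visible in the 20-byte member
trailer and can be read with a few lines of Python (a sketch; it reuses
'sample.lz' and 'sample' from the first sketch and assumes the file
contains a single member):

  import struct
  import zlib

  with open('sample.lz', 'rb') as f:
      member = f.read()
  # Trailer layout: CRC32 of the uncompressed data (4 bytes), size of
  # the uncompressed data (8 bytes), member size (8 bytes), all
  # little-endian.
  crc32, data_size, member_size = struct.unpack('<LQQ', member[-20:])
  print('stored CRC32      : %08X' % crc32)
  print('stored data size  : %d' % data_size)
  print('stored member size: %d (actual %d)' % (member_size, len(member)))
  with open('sample', 'rb') as f:
      print('actual CRC32      : %08X' % (zlib.crc32(f.read()) & 0xFFFFFFFF))
  # The fourth factor, the value remaining in the range decoder after
  # the end-of-stream marker, lives inside the decoder, not in the file.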
Do you know of any case where bzip2, gzip or lzip silently produced
invalid output because of any weakness in their integrity checking?
Have you considered that maybe lzip provides optimal integrity checking,
while xz is just throwing buzzwords at the naive, just like it did with
LZMA2[2]? Bigger does not always mean better.
[2] http://www.nongnu.org/lzip/lzip_benchmark.html (see Lzip vs xz)
Lzip is very good at detecting errors. You may have noticed that in case
of corruption, instead of the unhelpful "Compressed data is corrupt"
reported by xz, lzip says something like "Decoder error at pos 1234".
This leaves very little work for the CRC32 in the detection of errors.
Also, lzip reports mismatches in the four factors separately. This way,
if one factor fails but the other three are OK, the corruption most
probably affects the file trailer and the data can be considered intact.
Lzip's corruption detection is robust.
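You can see the difference in diagnostics with another small sketch: it
flips one bit in the middle of each file from the first sketch and
feeds the result to the tool on stdin:

  import subprocess

  for name, tool in [('sample.lz', 'lzip'), ('sample.xz', 'xz')]:
      data = bytearray(open(name, 'rb').read())
      data[len(data) // 2] ^= 0x01           # flip one bit mid-file
      r = subprocess.run([tool, '-t'], input=bytes(data),
                         capture_output=True)
      print('%-4s: %s' % (tool, r.stderr.decode(errors='replace').strip()))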
Let's run some tests. I have taken a small file (the COPYING file
distributed with the lzip source), compressed it, and then tried the
effect of all possible bit-flips on the decompression. (I used the
'unzcrash' tool distributed with lziprecover; a rough Python equivalent
of it is sketched after the listing below.)
-rw-r--r-- 1 18025 Jun 16 2014 COPYING
-rw-r--r-- 1 6150 Jun 16 2014 COPYING.bz2
-rw-r--r-- 1 6839 Jun 16 2014 COPYING.gz
-rw-r--r-- 1 6507 Jun 16 2014 COPYING.lz
-rw-r--r-- 1 6544 Jun 16 2014 COPYING.xz
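For those without lziprecover at hand, the essence of unzcrash fits in
a few lines of Python (a rough and slow sketch, not the real tool; it
assumes the decompressor accepts '-dc' and reads stdin):

  import re
  import subprocess
  from collections import Counter

  def bit_flip_test(tool, compressed_file, original_file):
      data = bytearray(open(compressed_file, 'rb').read())
      good = open(original_file, 'rb').read()
      tally = Counter()
      for i in range(len(data) * 8):
          data[i // 8] ^= 1 << (i % 8)       # flip one bit
          r = subprocess.run([tool, '-dc'], input=bytes(data),
                             capture_output=True)
          if r.returncode == 0:
              tally['correct output, status 0' if r.stdout == good
                    else 'WRONG output, status 0'] += 1
          else:
              # collapse numbers like "pos 1234" so messages tally together
              msg = re.sub(r'\d+', 'N', r.stderr.decode(errors='replace'))
              tally[msg.strip()] += 1
          data[i // 8] ^= 1 << (i % 8)       # restore the bit
      for msg, n in tally.most_common():
          print('%6d  %s' % (n, msg))

  bit_flip_test('lzip', 'COPYING.lz', 'COPYING')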
Bzip2 seems to depend entirely on the CRCs for the detection of errors,
but it provides 2 levels of CRCs: one for each block and one for the
whole stream. Of the 49200 bit-flips in COPYING.bz2, 29 were rejected
(bad magic), 49163 were caught by the CRC, and 8 produced correct
output with status 0.
Gzip has 2-factor integrity checking and some ability to detect format
violations. Of the 54712 bit-flips in COPYING.gz, 16 were rejected (bad
magic), 5 failed because of bad flags, 53952 were caught by the CRC,
26207 were caught by the uncompressed length, 540 were caught as format
violations, 44 failed because of unexpected EOF, 8 failed because of
unknown compression method, and 116 produced correct output with status 0.
Lzip has 4-factor integrity checking and an excellent ability to detect
data errors. Of the 52056 bit-flips in COPYING.lz, 32 were rejected (bad
magic), 51171 were caught by the decoder, 19 were caught by the value
remaining in the range decoder, 652 failed because of unexpected EOF, 7
failed because of unsupported format version, 3 failed because of
invalid dictionary size, 32 reported bad CRC, 64 reported bad
uncompressed size, 64 reported bad compressed size, and 12 produced
correct output with status 0.
Note that the bad CRCs and sizes reported by lzip correspond to
bit-flips in the CRCs and sizes themselves. In this particular file all
the bit-flips in the compressed data were caught by the decoder. The
integrity information was not even needed.
Xz's error messages are particularly unhelpful about how the corruption
was detected. Of the 52352 bit-flips in COPYING.xz, 48 were rejected
(bad magic), 52299 reported "Compressed data is corrupt", 5 failed
because of unexpected EOF, and not even one produced correct output with
status 0.
It is not at all clear that xz detects errors any better than lzip, but
xz has a point of failure not present in the other formats: if the
corruption affects the stream flags, xz won't know the size of the
checksum and won't be able to decode the stream.
Post by Guillem Jover
More over lzip was created to overcome limitations in the .lzma format,
.xz came later and fixed the limitations of the .lzma format too.
Lzip certainly overcame the limitations in the .lzma format, but in this
respect xz seems to have merely changed the false negatives of lzma into
false positives: from not reporting the corruption to reporting it "just
in case" even if the decoding went well.
Post by Guillem Jover
(And I could probably switch dpkg-deb's .xz integrity check to CRC64,
given that's the xz-utils command-line tool default.)
Have you verified whether this really helps or makes things worse? Doubling
the size of the checksum also doubles the probability of false positives
produced by the corruption of the checksum.
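A back-of-the-envelope calculation for a single random bit-flip in a
file the size of COPYING.xz illustrates the point (an illustration
only; real corruption is rarely one uniformly random bit):

  file_bits = 6544 * 8                      # COPYING.xz is 6544 bytes
  for name, check_bits in [('CRC32', 32), ('CRC64', 64), ('SHA-256', 256)]:
      p = 100.0 * check_bits / file_bits    # chance the flip hits the check
      print('%-7s  P(flip lands in the check field) = %.3f%%' % (name, p))

A flip that lands in the check field turns intact data into a reported
"corruption".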
Post by Guillem Jover
replacing xz with lzip on .deb or .dsc packages does not make any sense.
Why? Are .deb or .dsc packages using the filters of xz or something?
Post by Guillem Jover
Whenever considering to add a new compressor, all surrounding tools need
The future is long. You can save a lot of work in the long term by
adding lzip and deprecating the rest.
Post by Guillem Jover
Compressor formats are subject to network-effects like many other
file formats. In this case I think .xz "won" both because it was the
"official" successor from .lzma, and because it is superior to .lz.
You can repeat that .xz is superior to .lz as much as you want, but this
won't make it true. The xz format is so bad that it manages to be as bad
for long-term archiving as lzma-alone. Meanwhile lziprecover achieves
the unprecedented feat of repairing single-byte errors without the help
of any extra redundancy. Could you tell us in what aspect .xz is
superior to .lz?
I am not here to "win", but to help people keep their data safe.
Remember that I am also the author of GNU ddrescue (whose data recovery
capabilities nicely complement those of lziprecover). Given that this is
Debian, I hope that you will think of the public interest and
eventually replace xz with lzip.
It would be especially fitting for Debian to be the first distro to
deprecate such a bad format, given that xz is also not copylefted. Just
a few days ago we were discussing in GNU how easily non-copylefted
software can be rendered non-free by proprietary licenses like the
Android SDK anti-fork provision.
Best regards,
Antonio.