Discussion:
A 2025 New Year present: make dpkg --force-unsafe-io the default?
Michael Tokarev
2024-12-24 10:00:01 UTC
Hi!

The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power cut in the middle of a filesystem metadata
operation (of which dpkg does a lot) might result in an inconsistent
filesystem state. This workaround slowed down dpkg operations
quite significantly (and has been criticised a lot for that; the
difference is really significant).

The workaround is to issue fsync() after almost every filesystem
operation, instead of after each transaction as dpkg did before.

Once again: dpkg has always been doing "safe io"; the workaround
was needed for ext2fs only - it was the filesystem that was
broken, not dpkg.

Today, doing an fsync() really hurts: with SSDs/flash it reduces
the lifetime of the storage, and for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
making all further operations less efficient.

How about turning this option - force-unsafe-io - on by default
in 2025? That would be a great present for the 2025 New Year! :)

Thanks,

/mjt
Julien Plissonneau Duquène
2024-12-24 10:20:01 UTC
Hi,
Post by Michael Tokarev
How about turning this option - force-unsafe-io - on by default
in 2025? That would be a great present for the 2025 New Year! :)
That sounds like a sensible idea to me.

Cheers, and best wishes,
--
Julien Plissonneau Duquène
Leandro Cunha
2024-12-24 11:50:01 UTC
Hi,
Post by Michael Tokarev
Hi!
The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power cut in the middle of a filesystem metadata
operation (of which dpkg does a lot) might result in an inconsistent
filesystem state. This workaround slowed down dpkg operations
quite significantly (and has been criticised a lot for that; the
difference is really significant).
The workaround is to issue fsync() after almost every filesystem
operation, instead of after each transaction as dpkg did before.
Once again: dpkg has always been doing "safe io"; the workaround
was needed for ext2fs only - it was the filesystem that was
broken, not dpkg.
Today, doing an fsync() really hurts: with SSDs/flash it reduces
the lifetime of the storage, and for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
making all further operations less efficient.
How about turning this option - force-unsafe-io - on by default
in 2025? That would be a great present for the 2025 New Year! :)
Thanks,
/mjt
It sounds like a really interesting idea.
Merry Christmas and a Happy New Year!

https://wiki.debian.org/Teams/Dpkg/FAQ#:~:text=Use%20the%20dpkg%20%2D%2Dforce,losing%20data%2C%20use%20with%20care.
--
Cheers,
Leandro Cunha
Hakan Bayındır
2024-12-24 12:10:01 UTC
Hi Michael,

That sounds like a neat idea. Especially with the proliferation of more complex filesystems like BTRFS, the penalty of calling fsync() a lot becomes very visible. I'm not a BTRFS user myself, but I always hear comments and discussions about it.

Removing this workaround can help dispel the myth that apt is slow.

Happy new year,

Cheers,

Hakan
Post by Michael Tokarev
Hi!
The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power cut in the middle of a filesystem metadata
operation (of which dpkg does a lot) might result in an inconsistent
filesystem state. This workaround slowed down dpkg operations
quite significantly (and has been criticised a lot for that; the
difference is really significant).
The workaround is to issue fsync() after almost every filesystem
operation, instead of after each transaction as dpkg did before.
Once again: dpkg has always been doing "safe io"; the workaround
was needed for ext2fs only - it was the filesystem that was
broken, not dpkg.
Today, doing an fsync() really hurts: with SSDs/flash it reduces
the lifetime of the storage, and for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
making all further operations less efficient.
How about turning this option - force-unsafe-io - on by default
in 2025? That would be a great present for the 2025 New Year! :)
Thanks,
/mjt
Simon Richter
2024-12-24 14:20:01 UTC
Hi,
Post by Michael Tokarev
The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power cut in the middle of a filesystem metadata
operation (of which dpkg does a lot) might result in an inconsistent
filesystem state.
The thing it protects against is a missing ordering between a write() to
the contents of an inode and a rename() updating the name referring to it.

These are unrelated operations even in other file systems, unless you
use data journaling ("data=journaled") to force all operations to the
journal, in order. Normally ("data=ordered") you only get the metadata
update marking the data valid after the data has been written, but with
no ordering relative to the file name change.

The order of operation needs to be

1. create .dpkg-new file
2. write data to .dpkg-new file
3. link existing file to .dpkg-old
4. rename .dpkg-new file over final file name
5. clean up .dpkg-old file

When we reach step 4, the data needs to be written to disk and the
metadata in the inode referenced by the .dpkg-new file updated,
otherwise we atomically replace the existing file with one that is not
yet guaranteed to be written out.

We get two assurances from the file system here:

1. the file will not contain garbage data -- the number of bytes marked
valid will be less than or equal to the number of bytes actually
written. The number of valid bytes will be zero initially, and only
after data has been written out, the metadata update to change it to the
final value is added to the journal.

2. creation of the inode itself will be written into the journal before
the rename operation, so the file never vanishes.

What this does not protect against is the file pointing to a zero-size
inode. The only way to avoid that is either data journaling, which is
horribly slow and creates extra writes, or fsync().
Post by Michael Tokarev
Today, doing an fsync() really hurts: with SSDs/flash it reduces
the lifetime of the storage, and for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
making all further operations less efficient.
This should not make any difference in the number of write operations
necessary, and only affect ordering. The data, metadata journal and
metadata update still have to be written.

The only way this could be improved is with a filesystem level
transaction, where we can ask the file system to perform the entire
update atomically -- then all the metadata updates can be queued in RAM,
held back until the data has been synchronized by the kernel in the
background, and then added to the journal in one go. I would expect that
with such a file system, fsync() becomes cheap, because it would just be
added to the transaction, and if the kernel gets around to writing the
data before the entire transaction is synchronized at the end, it
becomes a no-op.

This assumes that maintainer scripts can be included in the transaction
(otherwise we need to flush the transaction before invoking a maintainer
script), and that no external process records the successful execution
and expects it to be persisted (apt makes no such assumption, because it
reads the dpkg status, so this is safe, but e.g. puppet might become
confused if an operation it marked as successful is rolled back by a
power loss).

What could make sense is more aggressively promoting this option for
containers and similar throwaway installations where there is a
guarantee that a power loss will have the entire workspace thrown away,
such as when working in a CI environment.

However, even that is not guaranteed: if I create a Docker image for
reuse, Docker will mark the image creation as successful when the
command returns. Again, there is no ordering guarantee between the
container contents and the database recording the success of the
operation outside.

So no, we cannot drop the fsync(). :\

Simon
Michael Tokarev
2024-12-26 09:10:01 UTC
Hi,
Post by Michael Tokarev
The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power cut in the middle of a filesystem metadata
operation (of which dpkg does a lot) might result in an inconsistent
filesystem state.
Post by Simon Richter
The thing it protects against is a missing ordering between a write() to the contents of an inode and a rename() updating the name referring to it.
These are unrelated operations even in other file systems, unless you use data journaling ("data=journal") to force all operations to the journal,
in order. Normally ("data=ordered") you only get the metadata update marking the data valid after the data has been written, but with no ordering
relative to the file name change.
The order of operation needs to be
1. create .dpkg-new file
2. write data to .dpkg-new file
3. link existing file to .dpkg-old
4. rename .dpkg-new file over final file name
5. clean up .dpkg-old file
When we reach step 4, the data needs to be written to disk and the metadata in the inode referenced by the .dpkg-new file updated, otherwise we
atomically replace the existing file with one that is not yet guaranteed to be written out.
This brings up a question: how did dpkg work before ext2fs started showing this
zero-length-file behavior? IIRC it was rather safe, no?

What you're describing seems reasonable. But I wonder if we can do better here.

How about doing steps 1..3 for *all* files in the package, and only
after that doing a single fsync() and the remaining steps 4..5, again
for all files?

Thanks,

/mjt
Julien Plissonneau Duquène
2024-12-26 09:40:01 UTC
Hi,
Post by Simon Richter
This should not make any difference in the number of write operations
necessary, and only affect ordering. The data, metadata journal and
metadata update still have to be written.
I would expect that some reordering makes it possible for fewer actual
physical write operations to happen, i.e. writes to the same or neighbouring
blocks get merged/grouped (possibly by the hardware if not the kernel),
which would make a difference to both spinning-device performance (fewer
seeks) and solid-state device longevity (as these have larger physical
blocks), but I don't know if that's actually how it works in that case.

One way to know would be to bench what actually happens nowadays with
and without --force-unsafe-io to get some actual numbers to weigh in the
decision to make the change or not.

It would be surprising, though, for the dpkg man pages (among other
places) to talk about performance degradation if it were not real.
Post by Simon Richter
The only way this could be improved is with a filesystem level
transaction, where we can ask the file system to perform the entire
update atomically -- then all the metadata updates can be queued in
RAM, held back until the data has been synchronized by the kernel in
the background, and then added to the journal in one go. I would expect
that with such a file system, fsync() becomes cheap, because it would
just be added to the transaction, and if the kernel gets around to
writing the data before the entire transaction is synchronized at the
end, it becomes a no-op.
That sounds interesting. But — do we have filesystems on Linux that can
do that already, or is this still a wishlist item? Also worth noting, at
least one well-known implementation in another OS was deprecated [1]
citing complexity and lack of popularity as the reasons for that
decision, and the feature is missing in their next-gen FS. So maybe it's
not that great after all?

Anyway in the current toolbox besides --force-unsafe-io we also have:
- volume or FS snapshots, for similar or better safety but not the
automatic performance gains; probably not (yet?) available on most
systems
- the auto_da_alloc ext4 mount option that AIUI should do The Right
Thing in dpkg's use case even without the fsync, actual reliability and
performance impact unknown; appears to be set by default on trixie
- eatmydata
- io_uring that allows asynchronous file operations; implementation
would require important changes in dpkg; potential performance gains in
dpkg's use case are not yet evaluated AFAIK but it looks like the right
solution for that use case.

BTW for those interested in reading a bit more about the historical and
current context around this issue aka O_PONIES I'm adding a few links at
[2].
Post by Simon Richter
but e.g. puppet might become confused
Heh. Ansible wins again.
Post by Simon Richter
So no, we cannot drop the fsync(). :\
Nowadays, most machines are unlikely to be subject to power failures at
the worst time: laptops or other mobile devices that have batteries have
replaced desktop PCs in many workplaces and homes, and machines in
datacenters usually have redundant power supplies and
battery+generator backups. And the default filesystem for new
installations, ext4, is mounted with auto_da_alloc by default, which
should make this drop safe, but whether that will result in significant
performance gains is IMO something to be tested.

If the measured performance gain makes it interesting to drop the fsync,
maybe this could become a configuration item that is set automatically
in most cases by detecting the machine type (battery, dual PSU,
container, VM => drop fsync) and filesystem (safe fs and mount options
=> drop fsync) or by asking the user in other cases or in expert install
mode, defaulting to the safer --no-force-unsafe-io.

Cheers,


[1]:
https://learn.microsoft.com/en-us/windows/win32/fileio/deprecation-of-txf
[2]: https://lwn.net/Articles/351422/
https://lwn.net/Articles/322823/
https://lwn.net/Articles/1001770/
--
Julien Plissonneau Duquène
nick black
2024-12-26 11:20:02 UTC
Post by Julien Plissonneau Duquène
- io_uring that allows asynchronous file operations; implementation would
require important changes in dpkg; potential performance gains in dpkg's use
case are not yet evaluated AFAIK but it looks like the right solution for
that use case.
i've got a lot of experience with io_uring [0] [1], and have been
closely tracking it, and would be interested in helping out here if it's
thought that such an experiment would be useful and welcomed. be
aware that taking full advantage of io_uring, especially in a
program with concurrent computation, can require some pretty
substantial restructuring. if the issue could be solved with
e.g. parallel fsync operations, that might be a much quicker way
to pick up much of the potential performance advantages.

it seems outside dpkg's province to manipulate queue depth, but
that would be an important parameter for using this most
effectively.

chained operations will likely be useful here.
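As a rough illustration of what "parallel fsync operations" through
io_uring could look like: a sketch assuming liburing, with batch_fsync and
the queue sizing invented for the example; real code would have to deal
with submission-queue limits and partial batches more carefully.

#include <liburing.h>

/* Submit fsync requests for a batch of already-written fds in one go and
 * then reap the completions, letting the kernel work on them concurrently. */
static int batch_fsync(const int *fds, unsigned nfds)
{
    struct io_uring ring;
    unsigned queued = 0;
    int ret;

    if (io_uring_queue_init(nfds, &ring, 0) < 0)
        return -1;

    for (unsigned i = 0; i < nfds; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        if (!sqe)                               /* ring smaller than the batch */
            break;
        io_uring_prep_fsync(sqe, fds[i], 0);    /* 0 = full fsync, not fdatasync */
        queued++;
    }
    io_uring_submit(&ring);

    ret = (queued == nfds) ? 0 : -1;
    for (unsigned i = 0; i < queued; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) {
            ret = -1;
            break;
        }
        if (cqe->res < 0)                       /* this particular fsync failed */
            ret = -1;
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return ret;
}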

--nick

[0] https://nick-black.com/dankwiki/index.php/Io_uring
[1] https://nick-black.com/dankwiki/index.php/Io_uring_and_xdp_enter_2024
--
nick black -=- https://nick-black.com
to make an apple pie from scratch,
you need first invent a universe.
Julien Plissonneau Duquène
2024-12-26 11:40:01 UTC
So making any assumptions like we did with spinning drives is mostly
moot at this point, and the industry is very opaque about that layer.
That's one of the reasons why I think benchmarking would help here. I
would expect fewer but larger write operations to help with the wear
issue, though; most FTLs, especially the ones on cheaper media, are
probably not too smart and may end up erasing blocks more frequently
than is actually necessary with many scattered small writes.
Let's not forget that any server running with a RAID controller will
already have a battery-backed or non-volatile cache on the card, plus
new SSDs (esp. higher-end consumer (i.e. Samsung 6xx, 8xx, 9xx and
similar) and enterprise drives) have unexpected-power-loss mitigations
in hardware, be it supercapacitors or non-volatile caches.
Sure, but the issue at stake here is that in some cases the expected
data hasn't even been sent to the hardware when the power loss (or
system crash) occurs. So while the features above help improve
performance in general (which in turn may contribute to reducing the
window of time in which the system is vulnerable to a power loss), they
do not resolve the issue.

Cheers,
--
Julien Plissonneau Duquène
Simon Richter
2024-12-26 12:30:02 UTC
Hi,
Post by Julien Plissonneau Duquène
Post by Simon Richter
This should not make any difference in the number of write operations
necessary, and only affect ordering. The data, metadata journal and
metadata update still have to be written.
I would expect that some reordering makes it possible for fewer actual
physical write operations to happen, i.e. writes to the same or neighbouring
blocks get merged/grouped (possibly by the hardware if not the kernel),
which would make a difference to both spinning-device performance (fewer
seeks) and solid-state device longevity (as these have larger physical
blocks), but I don't know if that's actually how it works in that case.
On SSDs, it does not matter, both because modern media lasts longer than
the rest of the computer now, and because the load balancer will largely
ignore the logical block addresses when deciding where to put data into
the physical medium anyway.

On harddisks, it absolutely makes a noticeable difference, but so does
journaling.
Post by Julien Plissonneau Duquène
It would be surprising, though, for the dpkg man pages (among other
places) to talk about performance degradation if it were not real.
ext4's delayed allocations mainly mean that the window where the inode
is zero sized is larger (can be a few seconds after dpkg exits with
--force-unsafe-io), so the problem is more observable, while on other
file systems, you more often get lucky and your files are filled with
the desired data instead of garbage.

The delayed allocations, on the other hand, allow the file system to
merge the entire allocation for the file, instead of gradually extending
it (but that can be easily fixed by using fallocate(2) ).
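A sketch of what that could look like, with a hypothetical helper;
fallocate(2) is Linux-specific and error handling is trimmed.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Open the temporary file and reserve its final size up front, so the
 * allocation is not grown piecemeal as the data trickles in. */
static int open_preallocated(const char *tmp, off_t final_size)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    /* mode 0: allocate the blocks and extend the file size to final_size */
    if (final_size > 0 && fallocate(fd, 0, 0, final_size) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    return fd;  /* the caller write()s, fsync()s and rename()s as before */
}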

[filesystem level transactions]
Post by Julien Plissonneau Duquène
That sounds interesting. But — do we have filesystems on Linux that can
do that already, or is this still a wishlist item? Also worth noting, at
least one well-known implementation in another OS was deprecated [1]
citing complexity and lack of popularity as the reasons for that
decision, and the feature is missing in their next-gen FS. So maybe it's
not that great after all?
It is complex to the extent that it requires the entire file system to
be designed around it, including the file system API -- suddenly you get
things like isolation levels and transaction conflicts that programs
need to be at least vaguely aware of.

It would be easier to do in Linux than in Windows, certainly, because on
Windows, file contents bypass the file system drivers entirely, and
there are additional APIs like transfer offload that would interact
badly with a transactional interface, and that would be sorely missed by
people using a SAN as storage backend.
Post by Julien Plissonneau Duquène
- volume or FS snapshots, for similar or better safety but not the
automatic performance gains; probably not (yet?) available on most systems
Snapshots only work if there is a way to merge them back afterwards.

What the systemd people are doing with immutable images basically goes
in the direction of snapshots -- you'd unpack the files using "unsafe"
I/O, then finally create an image, fsync() that, and then update the OS
metadata which image to load at boot.
Post by Julien Plissonneau Duquène
- the auto_da_alloc ext4 mount option that AIUI should do The Right
Thing in dpkg's use case even without the fsync, actual reliability and
performance impact unknown; appears to be set by default on trixie
Yes, that inserts the missing fsync(). :>

I'd expect it to perform a little bit better than the explicit fsync()
though, because that does not impose an order of operation between
files. The downside is that it also does not force an order between the
file system updates and the rewrite of the dpkg status file.

What I could see working in dpkg would be delaying the fsync() call
until right before the rename(), which is in a separate "cleanup" round
of operations anyway for the cases that matter. The difficulty there is
that we'd have to keep the file descriptor open until then, which would
need careful management or a horrible hack so we don't run into the user
or system-wide limit for open file descriptors, and recover if we do.
Post by Julien Plissonneau Duquène
- eatmydata
That just neuters fsync().
Post by Julien Plissonneau Duquène
- io_uring that allows asynchronous file operations; implementation
would require important changes in dpkg; potential performance gains in
dpkg's use case are not yet evaluated AFAIK but it looks like the right
solution for that use case.
That would be Linux specific, though.
Post by Julien Plissonneau Duquène
Nowadays, most machines are unlikely to be subject to power failures at
Yes, but we have more people running nVidia's kernel drivers now, so it
all evens out.

The decision when it is safe to skip fsync() is mostly dependent on
factors that are not visible to the dpkg process, like "will the result
of this operation be packed together into an image afterwards?", so I
doubt there is a good heuristic.

My feeling is that this is becoming less and less relevant though,
because it does not matter with SSDs.

Simon
Julien Plissonneau Duquène
2024-12-26 15:30:02 UTC
Post by Simon Richter
On SSDs, it does not matter, both because modern media lasts longer
than the rest of the computer now, and because the load balancer will
largely ignore the logical block addresses when deciding where to put
data into the physical medium anyway.
I'm not so sure about that, especially on cheaper and smaller storage.
There are still recent reports of people being able to wear e.g. SD
cards to the point of failure in weeks or months though that's certainly
not with system updates alone. This matters more on embedded devices
where the storage is not always (easily) replaceable, and some of these
devices may have fairly long lifespans.

[transactional FS]
Post by Simon Richter
It would be easier to do in Linux than in Windows
... but it sounds very much like we are not anywhere near there yet,
while others had it working and are now running away from it.
Post by Simon Richter
Snapshots only work if there is a way to merge them back afterwards.
What the systemd people are doing with immutable images basically goes
in the direction of snapshots -- you'd unpack the files using "unsafe"
I/O, then finally create an image, fsync() that, and then update the OS
metadata which image to load at boot.
For integration with dpkg I think the reverse approach would work
better: the snapshot would only be used to perform a rollback while
rebooting after a system crash. In the nominal case it would just be
deleted automatically at the end of the update procedure, after
confirming that everything is actually written on the medium.

Anyway currently that option is unavailable on most installed systems.
Post by Simon Richter
Post by Julien Plissonneau Duquène
- io_uring
That would be Linux specific, though.
Not an issue IMO. On systems that can't have it or another similar API
dpkg could just fall back to using the good old synchronous API, with
the same performance we have today.
Post by Simon Richter
My feeling is that this is becoming less and less relevant though,
because it does not matter with SSDs.
A volunteer is still needed to bench a few runs of large system updates
on ext4/SSD with and without --force-unsafe-io to sort that out ;)

Cheers,
--
Julien Plissonneau Duquène
Chris Hofstaedtler
2024-12-26 16:30:01 UTC
Post by Simon Richter
My feeling is that this is becoming less and less relevant though, because
it does not matter with SSDs.
This might be true on SSDs backing a single system, but on
(otherwise well-dimensioned) SANs the I/O-spikes are still very much
visible. Same is true for various container workloads.

(And yeah, there are strategies to improve these scenarios, but it's
not "it just works" territory.)

Chris
Michael Stone
2024-12-26 18:30:01 UTC
Post by Simon Richter
My feeling is that this is becoming less and less relevant though,
because it does not matter with SSDs.
To summarize: this thread was started with a mistaken belief that the
current behavior is only important on ext2. In reality the "excessive"
fsync's are the correct behavior to guarantee atomic replacement of
files. You can skip all that and in 99.9% of cases you'll be fine, but
I've seen what happens in the last .1% if the machine dies at just the
wrong time--and it isn't pretty. In certain situations, with certain
filesystems, you can rely on implicit behaviors which will mitigate
issues, but that will fail in other situations on other filesystems
without the same implicit guarantees. The right way to get better
performance is to get a reliable SSD or NVRAM cache. If you have a slow
disk there are options you can use which will make things noticeably
faster, and you will get to keep all the pieces if the system blows up
while you're using those options. Each person should make their own
decision about whether they want that, and the out-of-box default should
be the most reliable option.

Further reading: look at the auto_da_alloc option in ext4. Note that it
says that doing the rename without the sync is wrong, but there's now a
heuristic in ext4 that tries to insert an implicit sync when that
anti-pattern is used (because so much data got eaten when people did the
wrong thing). By leaning on that band-aid dpkg might get away with
skipping the sync, but doing so would require assuming a filesystem for
which that implicit guarantee is available. If you're on a different
filesystem or a different kernel all bets would be off. I don't know
how much difference skipping the fsync's makes these days if they
get done implicitly.
Theodore Ts'o
2024-12-29 01:30:01 UTC
Post by Michael Stone
Further reading: look at the auto_da_alloc option in ext4. Note that it says
that doing the rename without the sync is wrong, but there's now a heuristic
in ext4 that tries to insert an implicit sync when that anti-pattern is used
(because so much data got eaten when people did the wrong thing). By leaning
on that band-aid dpkg might get away with skipping the sync, but doing so
would require assuming a filesystem for which that implicit guarantee is
available. If you're on a different filesystem or a different kernel all
bets would be off. I don't know how much difference skipping the fsync's
makes these days if they get done implicitly.
Note that it's not a sync, but rather, under certain circumstances, we
initiate writeback --- but we don't wait for it to complete before
allowing the close(2) or rename(2) to complete. For close(2), we will
initiate a writeback on a close if the file descriptor was opened
using O_TRUNC and truncate took place to throw away the previous
contents of the file. For rename(2), if you rename on top of a
previously existing file, we will initiate the writeback right away.
This was a tradeoff between safety and performance, and this was done
because there was an awful lot of buggy applications out there which
didn't use fsync, and the number of application programmers greatly
outnumbered the file system programmer. This was a compromise that
was discussed at a Linux Storage, File Systems, and Memory Management
(LSF/MM) conference many years ago, and I think other file systems
like btrfs and xfs had agreed in principle that this was a good thing
to do --- but I can't speak to whether they actually implemented it.

It's very likely though that file systems that didn't exist in that
time frame, or that were written by programmers who care a lot more about
absolute performance than, say, usability in real world circumstances,
wouldn't have implemented this workaround. And so both the fact that
it's not perfect (it narrows the window of vulnerability from 30
seconds to a fraction of a second, but it's certainly not perfect) and
the fact that not all file systems will implement this (I'd be shocked
if bcachefs had this feature) are good reasons not to depend on
it. Of course, if you use crappy applications, you as a user may very
well be depending on it without knowing it --- which is *why*
auto_da_alloc exists. :-)

That being said, there are things you could do to speed up dpkg which
are 100% safe; the trade-off, as always, is implementation
complexity. (The reason why many application programs opened with
O_TRUNC and rewrote a file was so they wouldn't have to copy over
extended attributes and Posix ACL's, because that was Too Hard and Too
Complicated.) So what dpkg could do is whenever there is a file
that dpkg would need to overwrite, to write it out to
"filename.dpkg-new-$pid" and keep a list of all the files. After all
of the files are written out, call syncfs(2) --- on Linux, syncfs(2)
is synchronous, although POSIX does not guarantee that the writes will
be written and stable at the time that syncfs(2) returns. But that
should be OK, since Debian GNU/kFreeBSD is no longer a thing. Only
after syncfs(2) returns, do you rename all of the dpkg-new files to
the final location on disk.

This is much faster, since you're not calling fsync(2) for each file,
but only forcing a file system commit operation just once. The cost
is more implementation complexity in dpkg. I'll let other people
decide how to trade off implementation complexity, performance, and
safety.
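A bare-bones sketch of that scheme, with invented names; it assumes all of
the files live on a single filesystem and that a descriptor on that
filesystem is at hand, and error handling is abbreviated.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

/* tmp[i] are the already-written "filename.dpkg-new-$pid" paths,
 * final[i] the names they should end up under. */
static int commit_files(int fd_on_target_fs,
                        const char *const *tmp, const char *const *final,
                        unsigned n)
{
    /* One filesystem-wide commit instead of one fsync(2) per file. */
    if (syncfs(fd_on_target_fs) < 0)
        return -1;

    /* Only after the commit has returned are the new names moved into place. */
    for (unsigned i = 0; i < n; i++)
        if (rename(tmp[i], final[i]) < 0)
            return -1;
    return 0;
}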

Cheers,

- Ted
Michael Stone
2024-12-29 17:00:02 UTC
Post by Theodore Ts'o
Note that it's not a sync, but rather, under certain circumstances, we
initiate writeback --- but we don't wait for it to complete before
allowing the close(2) or rename(2) to complete. For close(2), we will
initiate a writeback on a close if the file descriptor was opened
using O_TRUNC and truncate took place to throw away the previous
contents of the file. For rename(2), if you rename on top of a
previously existing file, we will initiate the writeback right away.
This was a tradeoff between safety and performance, and this was done
because there was an awful lot of buggy applications out there which
didn't use fsync, and the number of application programmers greatly
outnumbered the file system programmer. This was a compromise that
was discussed at a Linux Storage, File Systems, and Memory Management
(LSF/MM) conference many years ago, and I think other file systems
like btrfs and xfs had agreed in principle that this was a good thing
to do --- but I can't speak to whether they actually implemented it.
xfs is actually where I first encountered this issue with dpkg. I think
it was on an alpha system not long after xfs was released for linux,
which was not necessarily the most stable combination. The machine
crashed during a big dpkg run and on reboot the machine had quite a lot
of empty files where it should have had executables and libraries. I
think this was somewhat known in that time frame (1999/2000) but it was
written off as xfs being buggy and I don't recall it getting a lot of
attention. (Though I still run into people who insist xfs is prone to
file corruption based on experiences like this from 25 years ago.) Also,
xfs wasn't being used for / much, and was mostly found on the kind of
systems that didn't lose power and with the kind of apps that either
didn't care about partially written files or were more careful about how
they wrote--so the number of people affected was pretty small. When ext
started exhibiting similar behavior, it suddenly became a much bigger
deal.
Florian Weimer
2025-01-05 09:00:01 UTC
Post by Michael Stone
Further reading: look at the auto_da_alloc option in ext4. Note that it says
that doing the rename without the sync is wrong, but there's now a heuristic
in ext4 that tries to insert an implicit sync when that anti-pattern is used
(because so much data got eaten when people did the wrong thing). By leaning
on that band-aid dpkg might get away with skipping the sync, but doing so
would require assuming a filesystem for which that implicit guarantee is
available. If you're on a different filesystem or a different kernel all
bets would be off. I don't know how much difference skipping the fsync's
makes these days if they get done implicitly.
Post by Theodore Ts'o
Note that it's not a sync, but rather, under certain circumstances, we
initiate writeback --- but we don't wait for it to complete before
allowing the close(2) or rename(2) to complete. For close(2), we will
initiate a writeback on a close if the file descriptor was opened
using O_TRUNC and truncate took place to throw away the previous
contents of the file. For rename(2), if you rename on top of a
previously existing file, we will initiate the writeback right away.
This was a tradeoff between safety and performance, and this was done
because there was an awful lot of buggy applications out there which
didn't use fsync, and the number of application programmers greatly
outnumbered the file system programmer. This was a compromise that
was discussed at a Linux Storage, File Systems, and Memory Management
(LSF/MM) conference many years ago, and I think other file systems
like btrfs and xfs had agreed in principle that this was a good thing
to do --- but I can't speak to whether they actually implemented it.
As far as I know, XFS still truncates files with pending writes during
mount if the file system was not unmounted cleanly. This means that
renaming for atomic replacement does not work reliably without fsync.
(But I'm not a file system developer.)
Post by Theodore Ts'o
So what dpkg could do is whenever there is a file that dpkg
would need to overwrite, to write it out to "filename.dpkg-new-$pid"
and keep a list of all the files. After all of the files are
written out, call syncfs(2) --- on Linux, syncfs(2) is synchronous,
although POSIX does not guarantee that the writes will be written
and stable at the time that syncfs(2) returns. But that should be
OK, since Debian GNU/kFreeBSD is no longer a thing. Only after
syncfs(2) returns, do you rename all of the dpkg-new files to the
final location on disk.
Does syncfs work for network file systems?

Maybe a more targeted approach with a first pass of sync_file_range
with SYNC_FILE_RANGE_WRITE, followed by a second pass with fsync would
work?
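Roughly, that two-pass idea might look like the sketch below;
sync_file_range(2) is Linux-only, flush_batch is an invented name, and
error handling is minimal.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int flush_batch(const int *fds, unsigned n)
{
    /* Pass 1: start writeback for every file without waiting, so the
     * device can work on all of them at once. */
    for (unsigned i = 0; i < n; i++)
        if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE) < 0)
            return -1;

    /* Pass 2: fsync() waits for the (by now largely finished) writeback
     * and also commits each file's metadata and the journal. */
    for (unsigned i = 0; i < n; i++)
        if (fsync(fds[i]) < 0)
            return -1;
    return 0;
}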
Salvo Tomaselli
2025-01-14 23:10:01 UTC
Post by Simon Richter
On SSDs, it does not matter, both because modern media lasts longer than
the rest of the computer now,
That has not been my experience.
--
Salvo Tomaselli

"Io non mi sento obbligato a credere che lo stesso Dio che ci ha dotato di
senso, ragione ed intelletto intendesse che noi ne facessimo a meno."
-- Galileo Galilei

https://ltworf.codeberg.page/
Julian Andres Klode
2024-12-27 08:20:01 UTC
Post by Simon Richter
Hi,
Post by Michael Tokarev
The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power cut in the middle of a filesystem metadata
operation (of which dpkg does a lot) might result in an inconsistent
filesystem state.
The thing it protects against is a missing ordering between a write() to the
contents of an inode and a rename() updating the name referring to it.
These are unrelated operations even in other file systems, unless you use
data journaling ("data=journaled") to force all operations to the journal,
in order. Normally ("data=ordered") you only get the metadata update marking
the data valid after the data has been written, but with no ordering
relative to the file name change.
The order of operation needs to be
1. create .dpkg-new file
2. write data to .dpkg-new file
3. link existing file to .dpkg-old
4. rename .dpkg-new file over final file name
5. clean up .dpkg-old file
When we reach step 4, the data needs to be written to disk and the metadata
in the inode referenced by the .dpkg-new file updated, otherwise we
atomically replace the existing file with one that is not yet guaranteed to
be written out.
1. the file will not contain garbage data -- the number of bytes marked
valid will be less than or equal to the number of bytes actually written.
The number of valid bytes will be zero initially, and only after data has
been written out, the metadata update to change it to the final value is
added to the journal.
2. creation of the inode itself will be written into the journal before the
rename operation, so the file never vanishes.
What this does not protect against is the file pointing to a zero-size
inode. The only way to avoid that is either data journaling, which is
horribly slow and creates extra writes, or fsync().
Post by Michael Tokarev
Today, doing an fsync() really hurts: with SSDs/flash it reduces
the lifetime of the storage, and for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
making all further operations less efficient.
This should not make any difference in the number of write operations
necessary, and only affect ordering. The data, metadata journal and metadata
update still have to be written.
The only way this could be improved is with a filesystem level transaction,
where we can ask the file system to perform the entire update atomically --
then all the metadata updates can be queued in RAM, held back until the data
has been synchronized by the kernel in the background, and then added to the
journal in one go. I would expect that with such a file system, fsync()
becomes cheap, because it would just be added to the transaction, and if the
kernel gets around to writing the data before the entire transaction is
synchronized at the end, it becomes a no-op.
This assumes that maintainer scripts can be included in the transaction
(otherwise we need to flush the transaction before invoking a maintainer
script), and that no external process records the successful execution and
expects it to be persisted (apt makes no such assumption, because it reads
the dpkg status, so this is safe, but e.g. puppet might become confused if
an operation it marked as successful is rolled back by a power loss).
What could make sense is more aggressively promoting this option for
containers and similar throwaway installations where there is a guarantee
that a power loss will have the entire workspace thrown away, such as when
working in a CI environment.
However, even that is not guaranteed: if I create a Docker image for reuse,
Docker will mark the image creation as successful when the command returns.
Again, there is no ordering guarantee between the container contents and the
database recording the success of the operation outside.
So no, we cannot drop the fsync(). :\
I do have a plan, namely to merge the btrfs snapshot integration into apt;
if we took a snapshot, we run dpkg with --force-unsafe-io.

The cool solution would be to take the snapshot, run dpkg inside it,
and then switch to it, but one step after the other; that's still very
much WIP.
--
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer i speak de, en
Geert Stappers
2024-12-27 12:40:01 UTC
Post by Julian Andres Klode
Post by Simon Richter
....
So no, we cannot drop the fsync(). :\
I do have a plan, namely to merge the btrfs snapshot integration into apt;
if we took a snapshot, we run dpkg with --force-unsafe-io.
The cool solution would be to take the snapshot, run dpkg inside it,
and then switch to it, but one step after the other; that's still very
much WIP.
Hi Julian,
Hello All,
How would that work for non-BTRFS systems, and if not, will that make Debian
a BTRFS-only system?
I'm personally fine with "This works faster in BTRFS, because we implemented
X", but not with "Debian only works on BTRFS".
Yeah, it feels wrong that dpkg gets file system code, gets code for one
particular file system.

Most likely I don't understand the proposal of Julian
and hope for further information.
Cheers,
Hakan
@Hakan, please make reading in the discussion order possible.


Groeten
Geert Stappers
--
Silence is hard to parse
Jonathan Kamens
2024-12-27 18:10:01 UTC
Post by Geert Stappers
Yeah, it feels wrong that dpkg gets file system code, gets code for one
particular file system.
I disagree. If there is a significant optimization that dpkg can
implement that is only available for btrfs, and if enough people use
btrfs that there would be significant communal benefit in that
optimization being implemented, and if it is easiest to implement the
optimization within dpkg as seems to be the case here (indeed, it may
/only/ be possible to implement the optimization within dpkg), then it
is perfectly reasonable to implement the optimization in dpkg. Dpkg is a
low-level OS-level utility, it is entirely reasonable for it to have
OS-level optimizations implemented within it.

  jik
Hakan Bayındır
2024-12-27 20:30:02 UTC
Post by Geert Stappers
Yeah, it feels wrong that dpkg gets file system code, gets code for one
particular file system.
I disagree. If there is a significant optimization that dpkg can implement that is only available for btrfs, and if enough people use btrfs that there would be significant communal benefit in that optimization being implemented, and if it is easiest to implement the optimization within dpkg as seems to be the case here (indeed, it may only be possible to implement the optimization within dpkg), then it is perfectly reasonable to implement the optimization in dpkg. Dpkg is a low-level OS-level utility, it is entirely reasonable for it to have OS-level optimizations implemented within it.
I'm in the same boat as Geert. I don't think dpkg is the correct place to integrate fs-specific code. It smells like a clear boundary/responsibility violation and opens a big can of worms for the future of dpkg.

Maybe a wrapper (or, more appropriately, a pre/post hook) around dpkg which takes these snapshots, calls dpkg with the appropriate switches and makes the switch could be implemented, but integrating it into dpkg doesn't make sense.

In the ideal case, even that shouldn't be done, because preferential treatment and proliferation of edge cases make maintenance very hard and unpleasant, and dpkg is critical infrastructure for Debian.

Cheers,

Hakan
Aurélien COUDERC
2024-12-27 23:20:01 UTC
Post by Geert Stappers
Yeah, it feels wrong that dpkg gets file system code, gets code for one
particular file system.
I disagree. If there is a significant optimization that dpkg can implement that is only available for btrfs,
Julian was talking about apt, but that doesn't fundamentally change the argument.

and if enough people use btrfs that there would be significant communal benefit in that optimization being implemented, and if it is easiest to implement the optimization within dpkg as seems to be the case here (indeed, it may /only/ be possible to implement the optimization within dpkg), then it is perfectly reasonable to implement the optimization in dpkg.

Totally agreed : yes it would be extremely useful to have some snapshotting feature for apt operations, and no we're never going to get there if we wait for every single filesystem on every kernel to implement it. So if this has to start with btrfs then… great news and super cool !


--
Aurélien
Marc Haber
2024-12-28 09:50:01 UTC
On Sat, 28 Dec 2024 00:13:02 +0100, Aurélien COUDERC
Post by Aurélien COUDERC
Totally agreed : yes it would be extremely useful to have some snapshotting feature for apt operations, and no we're never going to get there if we wait for every single filesystem on every kernel to implement it. So if this has to start with btrfs then… great news and super cool !
Do we have data about how many of our installations would be eligible
to profit from this new invention? I might think it would be better to
spend time on features that all users benefit from (in the case of
dpkg, it would be for example a big overhaul of the conffile handling
code). But on the other hand, even our dpkg and apt developers are
doing splendid work and are still volunteers, so I think they'd get to
choose what they do with their babies.

If we had Technical Leadership, things would be different, but we
deliberately chose not to have one.

Greetings
Marc
--
----------------------------------------------------------------------------
Marc Haber | " Questions are the | Mailadresse im Header
Rhein-Neckar, DE | Beginning of Wisdom " |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402
David Kalnischkies
2024-12-28 15:40:01 UTC
On Sat, 28 Dec 2024 00:13:02 +0100, Aurélien COUDERC
Totally agreed : yes it would be extremely useful to have some snapshotting feature for apt operations, and no we're never going to get there if we wait for every single filesystem on every kernel to implement it. So if this has to start with btrfs then… great news and super cool !
Do we have data about how many of our installations would be eligible
to profit from this new invention? I might think it would be better to
fwiw this "new invention" isn't one at all.
Julian was talking about the more than a decade old
https://launchpad.net/apt-btrfs-snapshot

But yeah, most of the concerns Guillem has for dpkg apply to apt also,
as it would be kinda sad if a failed unattended upgrade in the
background resets your DebConf presentation slides to a previous
snapshot (aka: empty), so that kinda requires a particular setup
and configuration. Not something you can silently roll out on the
masses as the new and only default in Debian zurg.
spend time on features that all users benefit from (in the case of
Yeah… no. I am hard pressed to name a single feature that benefits "all
users". You might mean "that benefits folks similar to me" given your
example is conf files but that isn't even close to "all".

I would even suspect most apt runs being under the influence of
DEBCONF_FRONTEND=noninteractive and some --force-conf* given its
prevalence in containers and upgrade infrastructure and so your "all"
might not even mean "the majority" aka a minority use case


But don't worry, with some luck we might even work on your fringe use
cases some day. Sooner if you help, of course.


Fun fact: apt has code specifically for a single filesystem already:
JFFS2. The code makes it so that you can run apt on systems that used
that filesystem out of the box like the OpenMoko Freerunner.
(Okay, the code is not specific for that filesystem, it is just the only
known filesystem that lacks the feature apt uses otherwise: mapping
a file – its binary cache – into shared memory, see mmap(2)).
And yet, somehow, more than a decade later, people still use apt on other
filesystems (I kinda suspect "only" nowadays actually).


Best regards

David Kalnischkies
Guillem Jover
2024-12-28 14:30:01 UTC
Hi!
Post by Geert Stappers
Yeah, it feels wrong that dpkg gets file system code, gets code for one
particular file system.
I disagree. If there is a significant optimization that dpkg can implement
that is only available for btrfs, and if enough people use btrfs that there
would be significant communal benefit in that optimization being
implemented, and if it is easiest to implement the optimization within dpkg
as seems to be the case here (indeed, it may /only/ be possible to implement
the optimization within dpkg), then it is perfectly reasonable to implement
the optimization in dpkg. Dpkg is a low-level OS-level utility, it is
entirely reasonable for it to have OS-level optimizations implemented within
it.
Port-specific or hardware specific optimizations might make sense in
dpkg, but that depends on the type, semantics, testability and
intrusiveness, among other things.

In this case (filesystem snapshotting), I do think dpkg is (currently at
least) really the wrong place, for at least the following reasons:

* No overall transaction visibility:

Frontends, such as apt, split installation (and configuration) into
multiple dpkg invocations. Installation, at least AFAIR, into one per
Essential:yes package or pre-depended package (group?). So dpkg does
not currently have full visibility of the transaction going on.

* No filesystem layout awareness:

dpkg has no idea (and should not need to have it) of the current
filesystem layout, and would need to start scanning all current
mount points, discern on what filesystems it can use snapshotting
(as in where the feature is present), and then enable that only on
the ones where the .deb might end up writing into (before having
unpacked it!), and not enable that on the ones where only user data
might reside (say /home, if that is even on a different filesystem).
Consider dpkg needs to be able to operate on chroots.

Enabling filesystem snapshotting on all mount points that support
it seems potentially dangerous, as I don't think dpkg should be
placed in a position where it needs to decide whether to rollback
to get back into a good system state vs not rolling back to avoid
losing user data (say from /home, or given that this can be user
dependent, then check all current users on the system to check any
other user home location, where this is not a system user).

* No trivial testing:

Even if the above would be non-issues, I'd be very uncomfortable
having this code in dpkg for something this central to its operation,
where I personally would not be exercising it daily, and where
testing this would imply having to perform VM installations with
such filesystems, and then having to force system crashes, reboots,
etc. Which while this is all certainly doable, it's going to be
rather slow, and thus painful as some kind of integration tests
and CI pipeline.


But other types of optimizations do make sense in dpkg, even when they
are port or hardware specific. For example I've got queued a branch
to add a selectable feature to stop ordering database loads (the .list
files) based on physical offsets (currently through Linux's fiemap),
which no longer makes sense on non-mechanical disks. This will require
for now enabling/disabling it explicitly (depending on the intended
sense of the option) to not regress existing installations, but
perhaps the sense could be inverted in the future to assume by default
non-mechanical disks are in use.

Thanks,
Guillem
Nikolaus Rath
2024-12-30 20:50:01 UTC
Simon Richter <***@debian.org> writes:
The order of operation needs to be
Post by Simon Richter
1. create .dpkg-new file
2. write data to .dpkg-new file
3. link existing file to .dpkg-old
4. rename .dpkg-new file over final file name
5. clean up .dpkg-old file
When we reach step 4, the data needs to be written to disk and the metadata in
the inode referenced by the .dpkg-new file updated, otherwise we atomically
replace the existing file with one that is not yet guaranteed to be written out.
If a system crashed while dpkg was installing a package, then my
assumption has always been that it's possible that at least this package
is corrupted.

You seem to be saying that dpkg needs to make sure that the package is
installed correctly even when this happens. Is that right?

If so, is dpkg also doing something to prevent a partial update across
multiple files (i.e., some files in the package are upgraded while
others are not)? If not, then I wonder why having an empty file is worse
than having one with outdated contents?


Are these guarantees documented somewhere and I've just never read it?
Or is everyone else expecting more reliability from dpkg than I do by
default?


Best,
-Nikolaus
Julien Plissonneau Duquène
2024-12-30 21:30:02 UTC
Hi,
Post by Nikolaus Rath
If a system crashed while dpkg was installing a package, then my
assumption has always been that it's possible that at least this package
is corrupted.
The issue here is that without the fsync there is a risk that such
corruption occurs even if the system crashes _after_ dpkg has finished
(or finished installing a package).

What happens in that case is that the metadata (file/link creations,
renames, unlinks) can be written to the filesystem journal several
seconds before the data is written to its destination blocks. But for
security reasons the length of the created file is only updated after
the data is actually written. This is why instead of getting files with
random corrupted data you get truncated files if the crash or power loss
occurs between both writes.

There is no way to know which are the "not fully written" packages in
these cases, short of verifying all installed files of all
(re)installed/down/upgraded packages of recent runs of dpkg (which could
be a feature worth having on a recovery bootable image).

Cheers,
--
Julien Plissonneau Duquène
Nikolaus Rath
2025-01-01 15:20:01 UTC
Hi,
Post by Nikolaus Rath
If a system crashed while dpkg was installing a package, then my
assumption has always been that it's possible that at least this package
is corrupted.
The issue here is that without the fsync there is a risk that such corruption
occurs even if the system crashes _after_ dpkg has finished (or finished
installing a package).
That is not my understanding of the issue. The proposal was to disable
fsync after individual files have been unpacked, i.e. multiple times per
package. Not about one final fsync just before dpkg exits.

Best,
-Nikolaus
Michael Stone
2025-01-01 17:10:01 UTC
Post by Nikolaus Rath
Post by Nikolaus Rath
If a system crashed while dpkg was installing a package, then my
assumption has always been that it's possible that at least this package
is corrupted.
The issue here is that without the fsync there is a risk that such corruption
occurs even if the system crashes _after_ dpkg has finished (or finished
installing a package).
That is not my understanding of the issue. The proposal was to disable
fsync after individual files have been unpacked, i.e. multiple times per
package. Not about one final fsync just before dpkg exits.
You seem to be assuming that dpkg is only processing a single package at
a time? Doing so would not be an efficiency gain, IMO.
Julien Plissonneau Duquène
2025-01-02 08:00:02 UTC
Hi,
Post by Nikolaus Rath
That is not my understanding of the issue. The proposal was to disable
fsync after individual files have been unpacked, i.e. multiple times per
package. Not about one final fsync just before dpkg exits.
The way fsync works, that would still be multiple times per package
anyway, as fsync has to be called for every file to be written. The way
dpkg works, there is no such thing as "one final fsync before exit":
dpkg processes packages sequentially and commits the writes after
processing each package [1]. There are however already some existing
optimizations that have been reported in this thread [2], notably this
Post by Guillem Jover
* Then we reworked the code to defer and batch all the fsync()s for
a specific package after all the file writes, and before the
renames,
which was a bit better but not great.
The code in question is there [3] btw, if anyone wants to take a look.
After reading that and current ext4 features and default mount options
it now seems likely to me that little (if any) performance
improvement or write-amplification reduction is to be expected from
--force-unsafe-io alone. I'm now waiting for our very welcome volunteer
to come back with numbers that will hopefully end that cliffhanger.

Cheers,


[1] and there is potential for optimizations there, but getting them 1.
to just work and then 2. to be at least as safe as the current code is
not exactly going to be trivial.
[2]: https://lists.debian.org/debian-devel/2024/12/msg00597.html
[3]:
https://sources.debian.org/src/dpkg/1.22.11/src/main/archives.c/#L1159
--
Julien Plissonneau Duquène
Michael Stone
2024-12-31 14:10:01 UTC
Post by Nikolaus Rath
If a system crashed while dpkg was installing a package, then my
assumption has always been that it's possible that at least this package
is corrupted.
You seem to be saying that dpkg needs to make sure that the package is
installed correctly even when this happens. Is that right?
dpkg tries really hard to make sure the system is *still functional*
when this happens. If you skip the syncs you may end up with a system
where (for example) libc6 is effectively gone and the system won't boot
back up. There may certainly still be issues where an upgrade is in
progress and some of the pieces don't interact properly, but the intent
is that should result in a system which can be fixed by completing the
install, vs issues which prevent the system from getting to the point of
completing the install. Skip enough syncs and you may not even be able
to recover via a rescue image without totally reinstalling the system
because dpkg wouldn't even know the state of the packages.
Michael Stone
2024-12-31 17:20:02 UTC
Reply
Permalink
It feels wrong to me to justify such a heavy performance penalty this way if
Well, I guess we'd have to agree on the definition of "heavy performance
penalty". I have not one debian system where dpkg install time is a
bottleneck.
Soren Stoutner
2024-12-31 17:40:01 UTC
Reply
Permalink
Post by Michael Stone
It feels wrong to me to justify such a heavy performance penalty this way if
Well, I guess we'd have to agree on the definition of "heavy performance
penalty". I have not one debian system where dpkg install time is a
bottleneck.
On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted
ext4, dpkg runs really fast (and feels like it runs faster than it did a few
years ago on similar hardware). There has been much talk on this list about
performance penalties with dpkg’s current configuration, and some requests for
actual benchmark data showing those performance penalties. So far, nobody has
produced any numbers showing that those penalties exist or how significant they
are. As I don’t experience anything I could describe as a performance problem
on any of my systems, I think the burden of proof is on those who are
experiencing those problems to demonstrate them concretely before we need to
spend effort trying to figure out what changes should be made to address them.
--
Soren Stoutner
***@debian.org
Marc Haber
2024-12-31 17:50:01 UTC
Reply
Permalink
Post by Soren Stoutner
On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted
ext4, dpkg runs really fast (and feels like it runs faster than it did a few
years ago on similar hardware). There has been much talk on this list about
performance penalties with dpkg’s current configuration, and some requests for
actual benchmark data showing those performance penalties.
Doing fsyncs too often after tiny writes will also cause write
amplification on the SSD.

I should use eatmydata more often.

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Soren Stoutner
2024-12-31 18:00:01 UTC
Reply
Permalink
Post by Marc Haber
Post by Soren Stoutner
On my system, which has a Western Digital Black SN850X NVMe (PCIe 4)
formatted ext4, dpkg runs really fast (and feels like it runs faster than
it did a few years ago on similar hardware). There has been much talk on
this list about performance penalties with dpkg’s current configuration,
and some requests for actual benchmark data showing those performance
penalties.
Doing fsyncs to often after tiny writes will also cause write
amplification on the SSD.
I should use eatmydata more often.
As pointed out earlier in the thread, the answer to the question of how fsyncs
affect SSD lifespan can vary a lot across SSD hardware controllers, because not
every fsync results in actual writes to the flash storage, and the SSD hardware
controller is often the one that decides whether they do or not.

In the case of my SSD, total TB written so far over the lifetime of the drive
is 4.25 TB.

This drive is about 5 months old. It has a rated lifetime endurance of 1,200
TBW (Terabytes Written). So, assuming that rating is accurate, it can run
under the current load for 117 years.
--
Soren Stoutner
***@debian.org
Michael Stone
2024-12-31 20:50:01 UTC
Reply
Permalink
Post by Marc Haber
Post by Soren Stoutner
On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted
ext4, dpkg runs really fast (and feels like it runs faster than it did a few
years ago on similar hardware). There has been much talk on this list about
performance penalties with dpkg’s current configuration, and some requests for
actual benchmark data showing those performance penalties.
Doing fsyncs to often after tiny writes will also cause write
amplification on the SSD.
The two year old NVMe drive in my primary desktop (which follows sid and
is updated at least once per day--far more dpkg activity than any normal
system) reports 21TB written/3% of the drive's expected endurance. There
is no possibility that I will hit that limit before the drive becomes
completely obsolete e-waste.

For this to be an actual problem rather than a (questionable)
theoretical issue would require someone to be doing continuous dpkg
upgrades to a low-write-endurance SD card...which AFAIK isn't a thing
actual people do. dpkg simply isn't the kind of tool which will cause
issues on an ssd in any reasonable scenario. If this is really a concern
for you, look for tools doing constant syncs (a good example is older
browsers which constantly saved small changes to configuration databases
which could amount to 10s of GB per day); don't look at a tool which in
typical operation doesn't write more than a few megabytes per day.
Bálint Réczey
2025-01-01 20:40:01 UTC
Reply
Permalink
Hi,
Post by Marc Haber
Post by Soren Stoutner
On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted
ext4, dpkg runs really fast (and feels like it runs faster than it did a few
years ago on similar hardware). There has been much talk on this list about
performance penalties with dpkg’s current configuration, and some requests for
actual benchmark data showing those performance penalties.
Doing fsyncs to often after tiny writes will also cause write
amplification on the SSD.
I should use eatmydata more often.
I also use eatmydata time to time where it is safe, but sometimes I
forget, this is why I packaged the snippet to make all apt runs use
eatmydata automatically:
https://salsa.debian.org/debian/apt-eatmydata/-/blob/master/debian/control?ref_type=heads

I'll upload it when apt also gets a necessary fix to make removing the
snippet safe:
https://salsa.debian.org/apt-team/apt/-/merge_requests/419

There is an equivalent simple solution for GitHub Actions as well:
https://github.com/marketplace/actions/apt-eatmydata

I'll write a short blog post about those when apt-eatmydata gets
accepted to the archive.

Happy New Year!

Cheers,
Balint
Aurélien COUDERC
2025-01-01 18:30:02 UTC
Reply
Permalink
Post by Soren Stoutner
Post by Michael Stone
It feels wrong to me to justify such a heavy performance penalty this way
if
Post by Michael Stone
Well, I guess we'd have to agree on the definition of "heavy performance
penalty". I have not one debian system where dpkg install time is a
bottleneck.
So far, nobody has
produced any numbers showing that those penalties exist or how significant they
are. As I don’t experience anything I could describe as a performance problem
on any of my systems, I think the burden of proof is on those who are
experiencing those problems to demonstrate them concretely before we need to
spend effort trying to figure out what changes should be made to address them.
Here’s a quick « benchmark » in a sid Plasma desktop qemu VM where I had a snapshot of up-to-date sid from Nov 24th, upgrading to today’s sid :

Summary:
Upgrading: 658, Installing: 304, Removing: 58, Not Upgrading: 2
Download size: 0 B / 1 032 MB
Space needed: 573 MB / 9 051 MB available

# time apt -y full-upgrade

real 9m49,143s
user 2m16,581s
sys 1m17,361s

# time eatmydata apt -y full-upgrade

real 3m25,268s
user 2m26,820s
sys 1m16,784s

That’s close to a *3 times faster* wall clock time when run with eatmydata.

The measurements are done after first running apt --download-only and taking the VM snapshot to avoid network impact.
The VM installation is running plain ext4 with 4 vCPU / 4 GiB RAM.
The host was otherwise idle. It runs sid on btrfs with default mount options on top of LUKS with the discard flag set. The VM’s qcow2 file is flagged with the C / No_COW xattr. It’s a recent Ryzen system with plenty of free RAM / disk space.

While I don’t have a setup to quickly reproduce an upgrade on the bare metal host, in my experience I see comparable impacts. And I’ve experienced similar behaviours on other machines.


I won’t pretend I know what I’m doing, so I’m probably doing it wrong and my installs are probably broken in some obvious way. You were asking for data so here you go with a shiny data point. :)


Happy new year,
--
Aurélien
Soren Stoutner
2025-01-01 18:40:03 UTC
Reply
Permalink
Post by Soren Stoutner
Post by Michael Stone
It feels wrong to me to justify such a heavy performance penalty this way
if
Post by Michael Stone
Well, I guess we'd have to agree on the definition of "heavy performance
penalty". I have not one debian system where dpkg install time is a
bottleneck.
So far, nobody has
produced any numbers showing that those penalties exist or how significant
they are. As I don’t experience anything I could describe as a
performance
Post by Soren Stoutner
problem on any of my systems, I think the burden of proof is on those who
are experiencing those problems to demonstrate them concretely before we
need to spend effort trying to figure out what changes should be made to
address them.
Here’s a quick « benchmark » in a sid Plasma desktop qemu VM where I had a
Upgrading: 658, Installing: 304, Removing: 58, Not Upgrading: 2
Download size: 0 B / 1 032 MB
Space needed: 573 MB / 9 051 MB available
# time apt -y full-upgrade
real 9m49,143s
user 2m16,581s
sys 1m17,361s
# time eatmydata apt -y full-upgrade
real 3m25,268s
user 2m26,820s
sys 1m16,784s
That’s close to a *3 times faster* wall clock time when run with eatmydata.
The measurements are done after first running apt --download-only and taking
the VM snapshot to avoid network impact. The VM installation is running
plain
ext4 with 4 vCPU / 4 GiB RAM.
The host was otherwise idle. It runs sid on btrfs with default mount options
on top of LUKS with the discard flag set. The VM’s qcow2 file is flagged with
the C / No_COW xattr. It’s a recent Ryzen system with plenty of free RAM /
disk space.
While I don’t have a setup to quickly reproduce an upgrade on the bare metal
host in my experience I see comparable impacts. And I’ve experienced similar
behaviours on other machines.
I won’t pretend I know what I’m doing, so I’m probably doing it wrong and my
installs are probably broken in some obvious way. You were asking for data
so
here you go with a shiny data point. :)
That is an interesting data point. Could you also run with --force-unsafe-io
instead of eatmydata? I don’t know if there would be much of a difference
(hence the reason for the need of a good benchmark), but as the proposal here
is to enable --force-unsafe-io by default instead of eatmydata it would be
interesting to see what the results of that option would be.
--
Soren Stoutner
***@debian.org
Aurélien COUDERC
2025-01-02 00:20:01 UTC
Reply
Permalink
Post by Soren Stoutner
That is an interesting data point. Could you also run with --force-unsafe-io
instead of eatmydata? I don’t know if there would be much of a difference
(hence the reason for the need of a good benchmark), but as the proposal here
is to enable --force-unsafe-io by default instead of eatmydata it would be
interesting to see what the results of that option would be.
Sure but I wouldn’t know how to do that since I’m calling apt and force-unsafe-io seems to be a dpkg option ?


Thanks,
--
Aurélien
Soren Stoutner
2025-01-02 00:30:02 UTC
Reply
Permalink
Post by Soren Stoutner
That is an interesting data point. Could you also run with
--force-unsafe-io
instead of eatmydata? I don’t know if there would be much of a difference
(hence the reason for the need of a good benchmark), but as the proposal here
is to enable --force-unsafe-io by default instead of eatmydata it would be
interesting to see what the results of that option would be.
Sure but I wouldn’t know how to do that since I’m calling apt and
force-unsafe-io seems to be a dpkg option ?
Can’t you just take the list of packages you have already downloaded with apt
and install them with dpkg instead?

The speed differential you have demonstrated with eatmydata is significant. I
don’t know if --force-unsafe-io will produce the same speed differential or
not, but if it does then I think you have met the criteria for it being worth
our while to see if we can safely adopt at least some aspects of
--force-unsafe-io, at least on some file systems.
--
Soren Stoutner
***@debian.org
Julien Plissonneau Duquène
2025-01-02 07:50:01 UTC
Reply
Permalink
Hi,
Post by Aurélien COUDERC
Post by Soren Stoutner
That is an interesting data point. Could you also run with
--force-unsafe-io
Sure but I wouldn’t know how to do that since I’m calling apt and
force-unsafe-io seems to be a dpkg option ?
You could try adding -o dpkg::options::=--force-unsafe-io to the command
line.

Cheers,
--
Julien Plissonneau Duquène
Michael Tokarev
2025-01-02 14:20:01 UTC
Reply
Permalink
Post by Aurélien COUDERC
Sure but I wouldn’t know how to do that since I’m calling apt and force-unsafe-io seems to be a dpkg option ?
echo force-unsafe-io > /etc/dpkg/dpkg.conf.d/unsafeio

before upgrade.

/mjt
Ángel
2025-01-02 23:50:01 UTC
Reply
Permalink
Post by Michael Tokarev
Post by Aurélien COUDERC
Sure but I wouldn’t know how to do that since I’m calling apt and
force-unsafe-io seems to be a dpkg option ?
echo force-unsafe-io > /etc/dpkg/dpkg.conf.d/unsafeio
before upgrade.
/mjt
Beware: this should actually be

echo force-unsafe-io > /etc/dpkg/dpkg.cfg.d/unsafeio

:)
Michael Stone
2025-01-03 16:10:01 UTC
Reply
Permalink
Shared infrastructure of course. Note that this includes an update of
the initramfs, which is CPU bound and takes a bit on this system. You
can take around 45s off the clock for the initramfs regeneration in
each run. I did a couple of runs and the results were pretty
consistent.
This tracks with my experience: optimizing initramfs creation would
produce *far* more bang for the buck than fiddling with dpkg fsyncs...
especially since we tend to do that repeatedly on any major upgrade. :(
Bálint Réczey
2025-01-03 18:50:01 UTC
Reply
Permalink
Hi,
Post by Michael Stone
Shared infrastructure of course. Note that this includes an update of
the initramfs, which is CPU bound and takes a bit on this system. You
can take around 45s off the clock for the initramfs regeneration in
each run. I did a couple of runs and the results were pretty
consistent.
This tracks with my experience: optimizing initramfs creation would
produce *far* more bang for the buck than fiddling with dpkg fsyncs...
especially since we tend to do that repeatedly on any major upgrade. :(
Well, that depends on the system configuration and on whether the
upgrade triggers initramfs updates.
OTOH 45s seems quite slow. Bernhard, do you have zstd installed and
initramfs-tools configured to use it?
On my laptop 3 kernels are installed and an initramfs update round takes ~10s:

***@nano:~$ grep -m 1 "model name" /proc/cpuinfo
model name : 11th Gen Intel(R) Core(TM) i7-1160G7 @ 1.20GHz
***@nano:~$ grep COMPRESS /etc/initramfs-tools/initramfs.conf ;
sudo time update-initramfs -k all -u
# COMPRESS: [ gzip | bzip2 | lz4 | lzma | lzop | xz | zstd ]
COMPRESS=zstd
...
update-initramfs: Generating /boot/initrd.img-6.8.0-51-generic
... (2 more kernels)
5.63user 5.35system 0:10.48elapsed 104%CPU (0avgtext+0avgdata 29540maxresident)k
541534inputs+540912outputs (247major+1241330minor)pagefaults 0swaps

If I switch to gzip, the initramfs update takes ~19s:
***@nano:~$ grep COMPRESS /etc/initramfs-tools/initramfs.conf ;
sudo time update-initramfs -k all -u
# COMPRESS: [ gzip | bzip2 | lz4 | lzma | lzop | xz | zstd ]
COMPRESS=gzip
# COMPRESSLEVEL: ...
# COMPRESSLEVEL=1
update-initramfs: Generating /boot/initrd.img-6.8.0-51-generic
... (2 more kernels)
10.84user 8.31system 0:18.78elapsed 101%CPU (0avgtext+0avgdata
29556maxresident)k
541502inputs+530160outputs (246major+1225801minor)pagefaults 0swaps

Cheers,
Balint
Ahmad Khalifa
2025-01-01 19:20:01 UTC
Reply
Permalink
Post by Aurélien COUDERC
Upgrading: 658, Installing: 304, Removing: 58, Not Upgrading: 2
Download size: 0 B / 1 032 MB
Space needed: 573 MB / 9 051 MB available
# time apt -y full-upgrade
real 9m49,143s
user 2m16,581s
sys 1m17,361s
# time eatmydata apt -y full-upgrade
real 3m25,268s
user 2m26,820s
sys 1m16,784s
That’s close to a *3 times faster* wall clock time when run with eatmydata.
The second upgrade hasn't been written back at all. For up to 30
seconds, if your machine loses power and for example grub.{cfg,efi} or
shim.efi was updated, you can't boot at all. Not to mention kernel
panics still happen.

Surely, we can all agree dpkg should at least do a single 'sync' at the
end? Perhaps this would be a more realistic comparison?
# time (eatmydata apt -y full-upgrade; sync)
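
As an aside, a bare sync flushes every mounted filesystem, not just the
one being upgraded; on Linux, syncfs(2) can restrict the flush to the
filesystem behind a given file descriptor. A minimal sketch, purely
illustrative and not something apt or dpkg does today:

/* Minimal sketch: flush only the filesystem holding the root tree,
 * rather than every mounted filesystem as a bare sync would.
 * Purely illustrative -- not something apt or dpkg does today. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/", O_RDONLY | O_DIRECTORY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (syncfs(fd) < 0) {   /* Linux-specific; see syncfs(2) */
        perror("syncfs");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}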

I wrote something to compare a single `dpkg --install`. This is on a
stable chroot where I downloaded 2 packages. Installing and purging
with/without eatmydata, but with an explicit sync.
`dpkg --install something; sync`
vs.
`eatmydata dpkg --install something; sync`

The two .debs were
Post by Aurélien COUDERC
-rw-r--r-- 1 root root 138K Jan 1 18:29 fdisk_2.38.1-5+deb12u2_amd64.deb
-rw-r--r-- 1 root root 12M Jan 1 18:29 firmware-amd-graphics_20230210-5_all.deb
Warmup... Timing dpkg with fdisk
.05 .04 .04
Warmup... Timing eatmydata dpkg with fdisk
.03 .02 .02
Warmup... Timing dpkg with firmware-amd-graphics
.46 .47 .45
Warmup... Timing eatmydata dpkg with firmware-amd-graphics
.63 .63 .67
Warmup... Timing dpkg with fdisk
.09 .08 .11
Warmup... Timing eatmydata dpkg with fdisk
.05 .05 .05
Warmup... Timing dpkg with firmware-amd-graphics
3.46 3.58 3.46
Warmup... Timing eatmydata dpkg with firmware-amd-graphics
2.95 2.99 3.06
Not sure why eatmydata takes longer on the SSD, but it is consistently
faster on the mechanical drive, though still only a 15% improvement.

Script snippet here:
https://salsa.debian.org/-/snippets/765
--
Regards,
Ahmad
Aurélien COUDERC
2025-01-02 00:20:01 UTC
Reply
Permalink
Post by Ahmad Khalifa
Post by Aurélien COUDERC
Upgrading: 658, Installing: 304, Removing: 58, Not Upgrading: 2
Download size: 0 B / 1 032 MB
Space needed: 573 MB / 9 051 MB available
# time apt -y full-upgrade
real 9m49,143s
user 2m16,581s
sys 1m17,361s
# time eatmydata apt -y full-upgrade
real 3m25,268s
user 2m26,820s
sys 1m16,784s
That’s close to a *3 times faster* wall clock time when run with eatmydata.
The second upgrade hasn't been written back at all. For up to 30
seconds, if your machine loses power and for example grub.{cfg,efi} or
shim.efi was updated, you can't boot at all. Not to mention Kernel
panics still happen.
Yes but no. :)
It’s true that I should have included a final sync in the measurement but in practice it makes no difference because writing however many 100s of MB in a single sync is almost instantaneous.

Here’s a new run with a final sync :

# time eatmydata apt -y full-upgrade ; time sync

real 3m25,116s
user 2m26,947s
sys 1m17,962s

real 0m0,109s
user 0m0,002s
sys 0m0,000s
Post by Ahmad Khalifa
Surely, we can all agree dpkg should at least do a single 'sync' at the
end? Perhaps this would be a more realistic comparison?
# time (eatmydata apt -y full-upgrade; sync)
Yes, I never said otherwise; the kind of guarantees that dpkg and/or apt are providing are highly valuable.

It would be better if it was not coming at such a high cost.


Happy hacking,
--
Aurélien
Aaron Rainbolt
2024-12-24 15:10:01 UTC
Reply
Permalink
Post by Michael Tokarev
Hi!
The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power-cut in the middle of filesystem metadata
operation (which dpkg does a lot) might result in in unconsistent
filesystem state. This workaround slowed down dpkg operations
quite significantly (and has been criticised due to that a lot,
the difference is really significant).
The workaround is to issue fsync() after almost every filesystem
operation, instead of after each transaction as dpkg did before.
Once again: dpkg has always been doing "safe io", the workaround
was needed for ext2fs only, - it was the filesystem which was
broken, not dpkg.
Today, doing an fsync() really hurts, - with SSDs/flash it reduces
the lifetime of the storage, for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
resulting in all further operations becomes inefficient.
Holy moly, is *this* why apt is sometimes slow to the point where you
think you're going to die before the software upgrade finishes? :P I'm
kidding obviously, but I have had upgrades that were unbelievably slow
and there didn't seem to be much CPU activity to help me understand
why. I bet this is it.
Post by Michael Tokarev
How about turning this option - force-unsafe-io - to on by default
in 2025? That would be a great present for 2025 New Year! :)
I love this idea. Just in case there are some ext2 users still out
there though, maybe dpkg could do its fsyncs intelligently, detecting
any ext2 filesystems mounted on the system and issuing an fsync only
after filesystem operations that affect an ext2 filesystem?
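
A rough sketch of what such detection could look like, using fstatfs(2)
(just an assumption on my side, not anything dpkg currently does). One
caveat: ext2, ext3 and ext4 all report the same 0xEF53 f_type magic, so
the magic alone cannot single out ext2:

/* Rough sketch of filesystem-type detection via fstatfs(2).
 * Caveat: ext2, ext3 and ext4 all report the same 0xEF53 magic
 * (EXT4_SUPER_MAGIC), so the magic alone cannot single out ext2;
 * a real implementation would need to inspect feature flags too. */
#include <fcntl.h>
#include <linux/magic.h>
#include <stdio.h>
#include <sys/vfs.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/";
    struct statfs sfs;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstatfs(fd, &sfs) < 0) {
        perror(path);
        return 1;
    }
    if (sfs.f_type == EXT4_SUPER_MAGIC)     /* shared by ext2/3/4 */
        printf("%s: ext2/ext3/ext4 family -- keep the fsyncs\n", path);
    else
        printf("%s: other filesystem (magic 0x%lx)\n", path,
               (unsigned long)sfs.f_type);
    close(fd);
    return 0;
}
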
Post by Michael Tokarev
Thanks,
/mjt
Simon Josefsson
2024-12-26 16:50:01 UTC
Reply
Permalink
Did anyone benchmark if this makes any real difference, on a set of
machines and file systems?

Say typical x86 laptop+server, arm64 SoC+server, GitLab/GitHub shared
runners, across ext4, xfs, btrfs, across modern SSD, old SSD/flash and
spinning rust.

If eatmydata still results in a performance boost or reliability
improvement (due to reduced wear and tear) on any of those platforms,
maybe we can recommend that instead.

/Simon
Guillem Jover
2024-12-28 16:10:01 UTC
Reply
Permalink
Hi!

[ This was long ago, and the following is from recollection from the
top of my head and some mild «git log» crawling, and while I think
it's still an accurate description of past events, interested people
can probably sieve through the various long discussions at the time
in bug reports and mailing lists from references in the FAQ entry,
which BTW I don't think has been touched since, so might additionally
be in need of a refresh perhaps, don't know. ]
Post by Michael Tokarev
The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues,
The problem showed up with ext4 (not ext2 or ext3), AFAIR when Ubuntu
switched their default filesystem in their installer, and reports
started to come in droves about systems being broken.

For all of its existence (AFAIR) dpkg has performed safe and durable
operations for its own database (not for the database directories),
but it was not doing the same for the installed filesystem. That was
introduced at the time to fix the zero-length file behavior from newer
filesystems.
Post by Michael Tokarev
where a power-cut in the middle of filesystem metadata
operation (which dpkg does a lot) might result in in unconsistent
filesystem state. This workaround slowed down dpkg operations
quite significantly (and has been criticised due to that a lot,
the difference is really significant).
I do think the potential for the zero-length files behavior is a
misfeature of newer filesystems, but I do agree that the fsync()s
are the only way to guarantee the properties dpkg expects from the
filesystem. So I don't consider that a workaround at all.

My main objection was/is with how upstream Linux filesystem
maintainers characterized all this. Where it looked like they were
disparaging userland application writers in general for being
incompetent for not performing such fsync()s, but then when one adds
them, those programs become extremely slow, and then one would need
to start using filesystem or OS specific APIs and rather unnatural
code patterns to regain some semblance of the previous performance.
I don't think this upstream perspective has changed much, given that
the derogatory O_PONIES subject still comes up from time to time.
Post by Michael Tokarev
The workaround is to issue fsync() after almost every filesystem
operation, instead of after each transaction as dpkg did before.
Once again: dpkg has always been doing "safe io", the workaround
was needed for ext2fs only, - it was the filesystem which was
broken, not dpkg.
The above also seems quite confused. :) dpkg has always done fsync()
for both its status file and for every in-core status modification
via its own journaled support for it (in the /var/lib/dpkg/updates/
directory).

What was implemented at the time was to add missing fsync()s for
database directories, and fsync()s for the unpacked filesystem objects.

AFAIR:

* We first implemented that via fsync()s to individual files
immediately after writing them on unpack, which had acceptable
performance on filesystems such as ext3 (which I do recall using
at the time) but was pretty terrible on ext4.
* Then we reworked the code to defer and batch all the fsync()s for
a specific package after all the file writes, and before the renames,
which was a bit better but not great.
* Then after a while we tried to use a single sync(2) before the
package file renames, which implied system wide syncs and implied
terrible performance for unrelated filesystems (such as USB
drives or network mounts), which got subsequently disabled.
* Then --force-unsafe-io was added to cope with workloads where the
safety was not required, or for people who preferred performance
over safety, on those same new filesystems that required it, where
the trade-off was performance xor safety.
* Then, after suggestions from Linux filesystem developers we switched
to initiate asynchronous writebacks immediately after a file unpack
to not block (via Linux sync_file_range(SYNC_FILE_RANGE_WRITE)),
and then add a writeback barrier where the previous (disabled)
sync(2) was (via Linux sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)),
so that the subsequent fsync(2) would had already been done by that
time, and would only imply a synchronization point.
* Then for non-Linux instead of the SYNC_FILE_RANGE_WRITE, a
posix_fadvise(POSIX_FADV_DONTNEED) was added.
* Then after a bit the disabled sync(2) code got removed.
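
To make that sequence concrete, here is a heavily simplified sketch of
the final shape (write to a temporary name, start async writeback,
batched fsync()s, then the renames). File names are made up and error
handling is minimal; this is only an illustration, not the actual
archives.c code:

/* Heavily simplified sketch of the sequence described above -- write all
 * of a package's files to temporary names, kick off asynchronous
 * writeback, then do the batched fsync()s, and only then rename() the
 * files into place. Not dpkg's actual archives.c code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NFILES 2

int main(void)
{
    const char *tmp[NFILES] = { "a.dpkg-new", "b.dpkg-new" };
    const char *dst[NFILES] = { "a", "b" };
    int fds[NFILES];

    /* 1. Unpack every file of the package to a temporary name. */
    for (int i = 0; i < NFILES; i++) {
        fds[i] = open(tmp[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fds[i] < 0 || write(fds[i], "data\n", 5) != 5) {
            perror(tmp[i]);
            return 1;
        }
        /* Start writeback now without blocking (Linux-specific);
         * this is the SYNC_FILE_RANGE_WRITE step mentioned above. */
        sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);
    }

    /* 2. Batched fsync()s, after all the writes. */
    for (int i = 0; i < NFILES; i++) {
        if (fsync(fds[i]) < 0) {
            perror("fsync");
            return 1;
        }
        close(fds[i]);
    }

    /* 3. Only now rename into place, so a crash never leaves a
     *    zero-length file under the final name. */
    for (int i = 0; i < NFILES; i++) {
        if (rename(tmp[i], dst[i]) < 0) {
            perror("rename");
            return 1;
        }
    }
    return 0;
}
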
Post by Michael Tokarev
Today, doing an fsync() really hurts, - with SSDs/flash it reduces
the lifetime of the storage, for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
resulting in all further operations becomes inefficient.
How about turning this option - force-unsafe-io - to on by default
in 2025? That would be a great present for 2025 New Year! :)
Given that the mail is based on multiple incorrect premises, :) and
that I don't see any tests or data backing up that the fsync()s are
no longer needed for safety in general, I'm going to be extremely
reluctant to even consider disabling them by default on the main
system installation, TBH, and would ask for substantial proof that
this would not damage user systems, and even then I'd probably still
feel rather uneasy about it.

And in fact, AFAIR dpkg is still missing fsync()s for filesystem
directories, which I think might have been the cause of reported
leftover files (shared libraries specifically) that never got removed
and then caused problems. Still need to prep a testing rig for this
and try to reproduce that with the WIP branch I've got around.
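
For reference, the general POSIX pattern for making the directory entry
itself durable looks roughly like the sketch below; the paths are
hypothetical and this is not the WIP branch, just an illustration:

/* Minimal illustration of a directory fsync (general POSIX pattern,
 * not dpkg code; paths are made up). Without it, the rename or
 * unlink itself may not be durable even if the file data is. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int fsync_parent_dir(const char *dirpath)
{
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);

    if (dfd < 0)
        return -1;
    if (fsync(dfd) < 0) {
        close(dfd);
        return -1;
    }
    return close(dfd);
}

int main(void)
{
    /* Hypothetical example: replace a shared library atomically... */
    if (rename("/usr/lib/example/libfoo.so.dpkg-new",
               "/usr/lib/example/libfoo.so") < 0) {
        perror("rename");
        return 1;
    }
    /* ...and make the new directory entry itself durable. */
    if (fsync_parent_dir("/usr/lib/example") < 0) {
        perror("fsync dir");
        return 1;
    }
    return 0;
}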


OTOH what I also have queued is to add a new --force-reckless-io, to
suppress all fsync()s (including the ones for the database), which
would be ideal to be used on installers, chroots or containers (or for
people who prefer performance over safety, or have lots of backups and
are aware of the trade-offs :). But that has been kind of blocked on
adding database tainting support, because the filesystem contents can
always be checked via digests or can be reinstalled, but if your
database is messed up it's rather hard to know that. The problem is
that because installers would want to use that option, we'd end up
with tainted end systems which would be wrong. Well, or the taint would
need to be manually removed (making external programs have to reach
for the dpkg database). But the above and --force-unsafe-io _could_ be
easily enabled by default in chroot mode (--root) w/o tainting anything
(I've also got some code to make that possible). And I've got on my
TODO to add integrity tracking for the database so that damage can be
more easily detected, which could perhaps make the tainting less of an
issue.

(So I'm sorry, but it looks like you'll not get your 2025 present. :)

Regards,
Guillem
Gioele Barabucci
2024-12-28 19:30:01 UTC
Reply
Permalink
Post by Michael Tokarev
Today, doing an fsync() really hurts, - with SSDs/flash it reduces
the lifetime of the storage, for many modern filesystems it is a
costly operation which bloats the metadata tree significantly,
resulting in all further operations becomes inefficient.
How about turning this option - force-unsafe-io - to on by default
in 2025?  That would be a great present for 2025 New Year! :)
There is a related, but independent, optimization that has the
chance to significantly reduce dpkg install time by up to 90%.

There is PoC patch [1,2] to teach dpkg to reflink files from data.tar
instead of copying them. With no changes in semantics or FS operations,
the time to install big packages like linux-firmware goes down from 39
seconds to 6 seconds. The use of reflink would have no adverse
consequences for users of ext4, but it would greatly speed up package
installation on XFS, btrfs and (in some cases) ZFS.

[1] https://bugs.debian.org/1086976
[2] https://github.com/teknoraver/dpkg/compare/main...cow
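
For readers unfamiliar with reflinks: on Linux they are exposed through
the FICLONE/FICLONERANGE ioctls (the same mechanism cp --reflink uses).
The following is a rough sketch of the mechanism only, not of the patch
above; the offsets are hypothetical and must be block-aligned:

/* Sketch of the reflink mechanism itself (not the patch above):
 * share the extents of a byte range of an uncompressed data.tar with
 * a freshly created file instead of copying the bytes. Only works
 * where reflinks are supported (btrfs, XFS with reflink=1, ...). */
#include <fcntl.h>
#include <linux/fs.h>      /* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("data.tar", O_RDONLY);
    int dst = open("unpacked.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    struct file_clone_range fcr = {
        .src_fd      = src,
        .src_offset  = 512,    /* hypothetical: start of the member's data */
        .src_length  = 4096,   /* hypothetical: member size, block aligned */
        .dest_offset = 0,
    };

    if (ioctl(dst, FICLONERANGE, &fcr) < 0) {
        perror("FICLONERANGE");    /* e.g. EOPNOTSUPP on ext4 */
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}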

Regards,
--
Gioele Barabucci
Guillem Jover
2024-12-28 23:00:01 UTC
Reply
Permalink
Hi!
Post by Gioele Barabucci
There is a possible related, but independent, optimization that has the
chance to significantly reduce dpkg install's time up to 90%.
There is PoC patch [1,2] to teach dpkg to reflink files from data.tar
instead of copying them. With no changes in semantics or FS operations, the
time to install big packages like linux-firmware goes down from 39 seconds
to 6 seconds. The use of reflink would have no adverse consequences for
users of ext4, but it would greatly speed up package installation on XFS,
btrfs and (in some cases) ZFS.
I've not closed that bug report yet, because I've been meaning to
ponder whether there is something from the proposal there that could
be used to build upon. And whether supporting that special case makes
sense at all.

Unfortunately as it stands, that proposal requires for .debs to have
been fsync()ed beforehand (by the frontend or the user or something),
for the data.tar to not be compressed at all, and introduces a layer
violation which I think makes the .deb handling less robust, as I think
it would make the tar parser trip over appended ar members after the
data.tar, for example.

Part of the trick here is that the fsync()s are skipped, but I think
even if none of the above were problems, then we'd still need to
fsync() stuff to get the actual filesystem entries to make sense, so
the currently missing directory fsync()s might be a worse problem for
such reflinking than the proposed disabled file data fsync()s in the
patch. But I've not checked how reflinking interacts in general with
fsync()s, etc.

Thanks,
Guillem
Matteo Croce
2024-12-30 11:20:01 UTC
Reply
Permalink
Post by Guillem Jover
Part of the trick here is that the fsync()s are skipped, but I think
even if none of the above were problems, then we'd still need to >
fsync() stuff to get the actual filesystem entries to make sense, so >
the currently missing directory fsync()s might be a worse problem for >
such reflinking than the proposed disabled file data fsync()s in the >
patch. But I've not checked how reflinking interacts in general with >
fsync()s, etc. Hi, I've removed the fsync() just because when using
reflinks there isn't any file data to flush. If you reintroduce them,
the performances will not vary that much. Regards, -- Matteo Croce

perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay
Matteo Croce
2024-12-30 11:30:01 UTC
Reply
Permalink
Sorry, seems that the previous message was corrupted by the client.
Post by Guillem Jover
Part of the trick here is that the fsync()s are skipped, but I think
even if none of the above were problems, then we'd still need to
fsync() stuff to get the actual filesystem entries to make sense, so
the currently missing directory fsync()s might be a worse problem for
such reflinking than the proposed disabled file data fsync()s in the
patch. But I've not checked how reflinking interacts in general with
fsync()s, etc.
Hi, I've removed the fsync() just because when using reflinks there
isn't any file data to flush.
If you reintroduce them, the performances will not vary that much.

Regards,
--
Matteo Croce

perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay