Discussion:
Mandatory LC_ALL=C.UTF-8 during package building
Gioele Barabucci
2024-06-06 06:20:01 UTC
Hi,

setting LC_ALL=C.UTF-8 in d/rules is a common way to fix many
reproducibility problems. It is also, in general, a more sane way to
build packages, in comparison to using whatever locale settings happen
to be set during a build. However, sprinkling a variant of `export
LC_ALL=C.UTF-8` in every d/rules is error-prone and a waste of
maintainers' time.

Would it be possible to set in stone that packages are supposed to
always be built in an environment where LC_ALL=C.UTF-8, or, in other
words, that builders must set LC_ALL=C.UTF-8?

In which document should this rule be stated? Policy?

Regards,
--
Gioele Barabucci
Luca Boccassi
2024-06-06 09:20:01 UTC
Post by Gioele Barabucci
Hi,
setting LC_ALL=C.UTF-8 in d/rules is a common way to fix many
reproducibility problems. It is also, in general, a more sane way to
build packages, in comparison to using whatever locale settings happen
to be set during a build. However, sprinkling a variant of `export
LC_ALL=C.UTF-8` in every d/rules is error-prone and a waste of
maintainers' time.
Would it be possible to set in stone that packages are supposed to
always be built in an environment where LC_ALL=C.UTF-8, or, in other
words, that builders must set LC_ALL=C.UTF-8?
In which document should this rule be stated? Policy?
This makes sense to me, seems similar enough to SOURCE_DATE_EPOCH
Simon Richter
2024-06-06 09:50:01 UTC
Hi,
Post by Gioele Barabucci
Would it be possible to set in stone that packages are supposed to
always be built in an environment where LC_ALL=C.UTF-8, or, in other
words, that builders must set LC_ALL=C.UTF-8?
This would be the opposite of the current rule.

Setting LC_ALL=C in debian/rules is a one-liner.

If your package is not reproducible without it, then your package is
broken. It can go in with the workaround, but the underlying problem
should be fixed at some point.

The reproducible builds checker explicitly tests different locales to
ensure reproducibility. Adding this requirement would require disabling
this check, and thus hide an entire class of bugs from detection.
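
A minimal sketch of the kind of check described above, not the actual reproducible-builds tooling: run the same step under several values of LC_ALL and compare digests of what it prints. The helper name `varies_with_locale` and the list of locales are assumptions made for illustration; a real checker compares the resulting .deb files rather than stdout.

```python
import hashlib
import os
import subprocess
import sys


def varies_with_locale(cmd, locales=("C", "C.UTF-8", "fr_FR.UTF-8")):
    """Return True if cmd's stdout differs between the given LC_ALL values.

    Hypothetical helper for illustration only.
    """
    digests = set()
    for loc in locales:
        # Same command, same environment, only LC_ALL is varied.
        env = dict(os.environ, LC_ALL=loc)
        out = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True).stdout
        digests.add(hashlib.sha256(out).hexdigest())
    return len(digests) > 1


# A step that ignores the locale is stable across all of them ...
stable = varies_with_locale([sys.executable, "-c", "print('hello')"])
# ... while one that embeds the locale name in its output is not.
leaky = varies_with_locale(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
)
```

If builders always forced one locale, a step like the second one would never be flagged, which is the "hidden class of bugs" being argued about here.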

Simon
Johannes Schauer Marin Rodrigues
2024-06-06 10:10:01 UTC
Hi,

Quoting Simon Richter (2024-06-06 11:32:33)
Post by Simon Richter
Would it be possible to set in stone that packages are supposed to always
be built in an environment where LC_ALL=C.UTF-8, or, in other words, that
builders must set LC_ALL=C.UTF-8?
This would be the opposite of the current rule.
Setting LC_ALL=C in debian/rules is a one-liner.
If your package is not reproducible without it, then your package is
broken. It can go in with the workaround, but the underlying problem
should be fixed at some point.
The reproducible builds checker explicitly tests different locales to
ensure reproducibility. Adding this requirement would require disabling this
check, and thus hide an entire class of bugs from detection.
this is one facet of a much bigger discussion (which we've had before). You can
argue both ways, depending on how you look at this problem.

The question is which of these two models we want:

a) debian/rules is supposed to be runnable in a wide variety of environments.
If your package FTBFS in one specific environment, it is the job of d/rules
to normalize the environment to cater for the specific needs of the package.

b) debian/rules is supposed to be run in a well-defined environment. If your
package FTBFS in this normalized environment, then it is the job of d/rules to
add the specific needs of the package to d/rules.

So the question is whether you either want to have d/rules normalize
heterogeneous environments (a) or whether you want d/rules to make a normalized
environment specific to the build (b). This is of course a spectrum and I think
we are currently doing much more of (a).

A question that goes in a similar direction is whether every d/rules that needs
it should have to do this:

export DPKG_EXPORT_BUILDFLAGS=y
include /usr/share/dpkg/buildflags.mk

Or whether we should switch the default and require that d/rules is run in an
environment (for example as set-up by dpkg-buildpackage) where these variables
are set?

Going back to the example of LC_ALL=C.UTF-8 and reproducibility: whether or not
this "hides" problems depends on the definition of what things are allowed to
change between two builds. What constitutes these things has already changed
in the past, for example for the build path, which is *not* varied anymore
but instead recorded in the buildinfo. The same could be argued for
LC_ALL=C.UTF-8, and the environment variables are already part of the buildinfo.

So I do not think that there is an easy answer to this question.

Thanks!

cheers, josch
Johannes Schauer Marin Rodrigues
2024-06-06 11:00:01 UTC
Hi,

Quoting Hakan Bayındır (2024-06-06 12:32:27)
Post by Johannes Schauer Marin Rodrigues
Quoting Simon Richter (2024-06-06 11:32:33)
Post by Simon Richter
Would it be possible to set in stone that packages are supposed to always
be built in an environment where LC_ALL=C.UTF-8, or, in other words, that
builders must set LC_ALL=C.UTF-8?
This would be the opposite of the current rule.
Setting LC_ALL=C in debian/rules is a one-liner.
If your package is not reproducible without it, then your package is
broken. It can go in with the workaround, but the underlying problem
should be fixed at some point.
The reproducible builds checker explicitly tests different locales to
ensure reproducibility. Adding this requirement would require disabling this
check, and thus hide an entire class of bugs from detection.
this is one facet of a much bigger discussion (which we've had before). You can
argue both ways, depending on how you look at this problem.
a) debian/rules is supposed to be runnable in a wide variety of environments.
If your package FTBFS in one specific environment, it is the job of d/rules
to normalize the environment to cater for the specific needs of the package.
b) debian/rules is supposed to be run in a well-defined environment. If your
package FTBFS in this normalized environment, then it is the job of d/rules to
add the specific needs of the package to d/rules.
So the question is whether you either want to have d/rules normalize
heterogeneous environments (a) or whether you want d/rules to make a normalized
environment specific to the build (b). This is of course a spectrum and I think
we are currently doing much more of (a).
I agree with Simon here.
And, if I understand your reply correctly, you do not disagree with me either?
C, or C.UTF-8 is not a universal locale which works for all.
Yes. If we imagine a hypothetical switch to LC_ALL=C.UTF-8 for all source
packages by default, then there will be bugs. The question is, which bugs do we
want to fix: Bugs that happen because of a problem that occurs because we did
*not* set LC_ALL=C.UTF-8 (like reproducible builds problems) or problems that
occur because we *did* set LC_ALL=C.UTF-8 as in the example that you are
describing below.
While C.UTF-8 solves character representation part of
"The Turkish Test" [0], it doesn't solve capitalization and sorting issues.
In short, Turkish is the reason why some English text has "İ" and "ı" in
it, because in Turkish, they're all present (ı, i, I, İ), and their
capitalization rules are different (i becomes İ and ı becomes I; i.e.
no loss/gain of dot during case changes).
This creates tons of problems with software which are not aware of the
issue (Kodi completely breaks for example, and some software needs
forced/custom environments to run).
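
To make the capitalization rule above concrete, here is a small sketch. It relies only on the fact that Python's `str.upper()` and `str.lower()` use the locale-independent Unicode default mappings; the helper names `turkish_upper` and `turkish_lower` are hypothetical and are not taken from any of the software mentioned in this thread.

```python
# Python's default case mappings ignore the locale, so they give the
# "English" answer even for Turkish text.
default_upper_i = "i".upper()        # "I", where Turkish expects "İ"
default_upper_dotless = "ı".upper()  # "I", which is correct in Turkish


def turkish_upper(text):
    """Uppercase text so that no dot is gained or lost (hypothetical helper)."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text):
    """Lowercase text so that no dot is gained or lost (hypothetical helper)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


upper_example = turkish_upper("istanbul")
lower_example = turkish_lower("ISPARTA")
```

Software that is unaware of the issue typically does the equivalent of the first two lines, for example when comparing identifiers case-insensitively.
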
As I'm curious: if your software breaks depending on the LC_ALL setting, how do
you make it produce reproducible binaries? If it breaks with a certain LC_ALL,
then during the build you have to set LC_ALL (or one of its friends) to some
specific value, right?
So, all in all, if your software is expected to run in an international
environment, and its build/run behavior breaks in an environment that is not
to its liking, I also argue that the software is broken to begin with.
Because when this problem takes hold in a codebase, it is nigh
impossible to fix.
So, I think it's better to strive to evolve the software to be a better
international citizen rather than give all the software we build an
artificially sterile environment, which is iteratively harder and harder
to build and maintain.
Just to make sure I'm not misunderstood: I also am tending towards *not*
setting LC_ALL=C.UTF-8 (but probably not as strongly as I understood Simon's
mail) just because I like dumping my time into figuring out why my software
does something different in a very specific environment. Figuring this out
does uncover bugs that should be fixed most of the time.

At the same time though, I also get annoyed by copy-pasting d/rules snippets
from one of my packages to the next instead of making use of a few more
defaults in our package build environment.

Thanks!

cheers, josch
Simon Richter
2024-06-06 11:40:02 UTC
Hi,
Post by Johannes Schauer Marin Rodrigues
At the same time though, I also get annoyed by copy-pasting d/rules snippets
from one of my packages to the next instead of making use of a few more
defaults in our package build environment.
Same here -- I just think that such a workaround should be applied only
when the package fails to build reproducibly, so this is definitely
something that should not be cargo-culted in.

What we could also do (but that would be a bigger change) would be
another flag similar to "Rules-Requires-Root" that lists aspects of the
package that are known to affect reproducibility -- that would be
declarative, so the reproducible-builds project can disable the test and
get meaningful results for the remaining typical problems, and could be
checked and handled by dpkg-buildpackage as well.

Simon
Simon McVittie
2024-06-06 15:00:01 UTC
Post by Johannes Schauer Marin Rodrigues
If we imagine a hypothetical switch to LC_ALL=C.UTF-8 for all source
packages by default, then there will be bugs.
Do you mean: there will be bugs that break the build of certain packages,
which previously built successfully?

Or do you mean: there will be bugs in which a package does not work as
designed at runtime for users of certain locales, and those bugs would
previously have been detected at build-time by showing up as a FTBFS or
non-reproducibility, but are now only detected by users at runtime?

I'm not convinced that either of those is going to be true, and especially
the first one, because at least some (maybe all) of our official buildds
already export LC_ALL=C.UTF-8 for builds:
https://buildd.debian.org/status/fetch.php?pkg=flatpak&arch=amd64&ver=1.14.8-1&stamp=1714492944&raw=0

(Search for "Sufficient free space" and read down a few lines further;
and this is not at all specific to Flatpak, that's just an arbitrary
example of a package that I happen to know has a recent buildd log.)
Post by Johannes Schauer Marin Rodrigues
I like dumping my time into figuring out why my software
does something different in a very specific environment
That is of course fine, and you're welcome to do that, but the question
is in part about whether the benefit of expecting that every package
maintainer will do this exceeds its cost.

smcv
Simon McVittie
2024-06-06 14:40:01 UTC
C, or C.UTF-8 is not a universal locale which works
for all.
Sure, and I don't think anyone is arguing that you or anyone else should
set the locale for your interactive terminal session, your GUI desktop
environment, or even your servers to C.UTF-8.

But, this thread is about build environments for our packages, not about
runtime environments. We have two-and-a-half possible policies:

1. Status quo, in theory:

Packages cannot make any assumptions about build-time locales.

The benefits are:

- Diagnostic messages are in the maintainer's local language, and
potentially easier to understand.

- If a mass-QA effort wants to assess whether the program is broken by
a particular locale, they can easily try running its build-time tests
in that locale, **if** the tests do not already force a different
locale. (But this comes with some serious limitations: it's likely
to have a significant number of false-positive situations where the
program is actually working perfectly but the **tests** make assumptions
that are not true in all locales, and as a result many upstream
projects set their build-time tests to force specific locales
anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
might prefer in Debian.)

The costs are:

- Every program that might be run at build-time is expected to continue
to cope with running in non-UTF-8 locales, even if we strongly deprecate
non-UTF-8 locales for production use.

- Diagnostic messages from the reproducible-builds infrastructure are
in a random language chosen by the infrastructure, which the maintainer
does not necessarily understand. (If my package fails to build in a
Chinese locale, that's a valid bug, but if I'm expected to diagnose the
problem by reading Chinese error messages, as a non-Chinese-speaker I
am not going to get far.)

- If a program that is run during build intentionally has locale-specific
output, and its output ends up in the .deb, then the package maintainer
must go to additional effort to force that particular program to have
reproducible output, usually by running it in a specified locale.

2. What's being proposed in this thread:

Each package can assume that it's built in the C.UTF-8 locale.
If it needs a different locale during testing, it can set that itself
(as e.g. glib2.0 does for some tests), but unless it takes explicit
action, C.UTF-8 will be used.

The benefit is that packages that require a UTF-8 locale during build
or during testing (e.g. to process non-English strings in Python)
can assume that they have one, and an equivalence class of bugs
(packages where the content of the .deb can vary with the build-time
locale, or where e.g. build-time tests fail if UTF-8 output is not
possible) become non-bugs that we do not need to think about.

The costs are that we don't get the benefits from (1.) any more.

2½. Unwelcome compromise (increasingly the status quo):

Whenever a package is non-reproducible, fails to build or fails tests
in certain locales (for example legacy non-UTF-8 locales like C or
en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
move on.

This is just (2.) with extra steps, and has the same benefit and cost
for the affected packages as (2.) plus an additional cost (someone must
identify that the package is in this category and copy/paste the extra
line), and the same benefit and costs for unmodified packages as (1.).

2½ seems like the same boil-the-ocean pattern as any number of
manual-work-intensive transitions: Rules-Requires-Root, debhelper compat
levels, compiler hardening flags and so on. In situations where the
desired state is a backwards-compatibility break, the benefit of having
the transition be opt-in can exceed its (considerable!) cost, but we
shouldn't let that trick us into always paying the additional cost of an
opt-in transition, even in situations where it isn't worth it.
[Turkish dotted/dotless i]
creates tons of problems with software which are not aware of the
issue (Kodi completely breaks for example, and some software needs
forced/custom environments to run).
I agree that internationalization issues can be a serious problem **at
runtime**, and when our developers and users find such problems, they can
be reported as bugs downstream or upstream, and (hopefully!) fixed. What
I do not agree with is your suggestion that having the package build
occur in an undefined locale will solve this problem.

For example, let's imagine that we decide that perfect support for Turkish
is a release goal. Having reproducible-builds.org build packages in an
arbitrary language (in practice French is often used, I think?) doesn't
prove anything about whether they handle Turkish correctly, whatever
"correctly" might mean.

If someone wants to do a QA mass-rebuild in the tr_TR.UTF-8 locale,
that would come a little closer to having higher confidence about our
ability to run software in Turkish - but is it working *correctly*, or
are the tests making the wrong assertions, or are the code paths that
could go wrong in Turkish not even being tested? We probably won't know
any of those until a Turkish speaker investigates that specific piece
of software.

The fact that you say "Kodi completely breaks" also suggests to me that
fixing these problems is not trivial, because if it was easy, it would
have been fixed by now. And yet we ship Kodi in Debian, even knowing
that it has this bug, and it seems to work OK for most people.

Even if Kodi's problems with Turkish text are solved, **and** the
developer who solves those problems adds a build-time regression test
to avoid the bug coming back, I would expect the test to need to look
like this pseudocode:

    def test_turkish:
        old_locale = setlocale(LC_ALL, "tr_TR.UTF-8")

        if old_locale is null:
            skipTest("tr_TR.UTF-8 locale not available, try installing locales-all")

        try:
            do some stuff involving Turkish text
            assert that the right thing happens
        finally:
            setlocale(LC_ALL, old_locale)

... for which having the rest of the build happen in the tr_TR.UTF-8
locale isn't even useful!

(src:glib2.0 has several tests like this, and the packaging goes to some
lengths to make sure that the required locales are available.)
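
As a runnable counterpart to the pseudocode above, here is a minimal sketch using Python's standard `unittest` and `locale` modules. The class name is an assumption, and the assertion in the body is only a placeholder for whatever Turkish text handling a real test would exercise.

```python
import locale
import unittest


class TurkishLocaleTest(unittest.TestCase):
    def test_turkish(self):
        # Remember the current setting so it can be restored afterwards.
        old_locale = locale.setlocale(locale.LC_ALL)
        try:
            locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")
        except locale.Error:
            self.skipTest("tr_TR.UTF-8 locale not available, "
                          "try installing locales-all")
        try:
            # Placeholder: a real test would run the program's own
            # Turkish text handling here and check the result.
            self.assertIn("tr_TR", locale.setlocale(locale.LC_CTYPE))
        finally:
            locale.setlocale(locale.LC_ALL, old_locale)
```

The test switches locale by itself and skips cleanly when the locale is missing, so it behaves the same whatever locale the surrounding build runs in.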

A wider point here is that artificially elevating a certain class of bugs
to be de-facto release-critical by turning them into build failures is
not necessarily always going to improve the quality of Debian: we have
no shortage of bugs to work on, and a finite amount of volunteer time
available. Any time we make a class of bugs release-critical like this,
that's taking volunteer time away from identifying and fixing different
bugs that might have a larger impact on the overall quality of the
distribution, so we should only do this if we are sure that that class
of bugs is genuinely among our highest priorities.

Stepping back from the specifics of locales, I observe that operating
systems are extremely complicated and contain an overwhelming number
of choices and code paths. Obviously most of those choices are there
because someone needs them - but some are only there for historical
reasons or as an unintended side-effect of something more beneficial. If
we can make a simplifying assumption that will take an entire equivalence
class of bugs and make them into non-bugs, without losing significant
functionality or flexibility, then it's often good to do that instead.

(For example, a while ago we replaced "it is undefined whether /usr is
mounted or not during early boot" with the simplifying assumption "if
/usr is separate then it must be mounted by the initramfs", which turned a
whole class of bugs of the form "x is in /lib but depends on y which is in
/usr/lib" into non-bugs that do not need to be fixed or even identified.)

smcv
Jeremy Bícha
2024-06-06 14:50:01 UTC
I believe debhelper already sets LC_ALL=C.UTF-8 for the cmake, meson,
and ninja buildsystems; therefore many but definitely not all packages
are already built with LC_ALL=C.UTF-8.

Thank you,
Jeremy Bícha
Guillem Jover
2024-06-07 12:40:01 UTC
Hi!
Post by Simon McVittie
C, or C.UTF-8 is not a universal locale which works
for all.
Sure, and I don't think anyone is arguing that you or anyone else should
set the locale for your interactive terminal session, your GUI desktop
environment, or even your servers to C.UTF-8.
But, this thread is about build environments for our packages, not about
Packages cannot make any assumptions about build-time locales.
- Diagnostic messages are in the maintainer's local language, and
potentially easier to understand.
I think this is way more important than the relative space used to
mention it though. :) I'm a non-native speaker, who has been involved
in l10n for a long time, while at the same time I've pretty much
always run my systems with either LANG=C.UTF-8 or before that LANG=C,
LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.

And I think forcing a locale on buildds makes perfect sense, because
we want easy access to build logs. But forcing LC_ALL from the build
tools implies that no tool invoked will get translated messages at
all, and means that users (not just maintainers) might have a harder
time understanding what's going on, we make lots of l10n work rather
pointless, and if no one is running with different locales then l10n
bugs might easily creep in.
Post by Simon McVittie
- If a mass-QA effort wants to assess whether the program is broken by
a particular locale, they can easily try running its build-time tests
in that locale, **if** the tests do not already force a different
locale. (But this comes with some serious limitations: it's likely
to have a significant number of false-positive situations where the
program is actually working perfectly but the **tests** make assumptions
that are not true in all locales, and as a result many upstream
projects set their build-time tests to force specific locales
anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
might prefer in Debian.)
I consider locale sensitive misbehavior as a category of "upstream"
bugs (be that in the package upstream or the native Debian tools), that
deserve to be spotted and fixed. I can understand though the sentiment
of wanting to shrug this problem category off and wanting instead to
sweep it under the carpet, but that has accessibility consequences.
Post by Simon McVittie
- […] but if I'm expected to diagnose the
problem by reading Chinese error messages, as a non-Chinese-speaker I
am not going to get far.)
Just as an aside: while getting non-English messages makes bugs harder
to diagnose, I've never found it a big deal to deal with that kind of
bug report, as you can grep for (parts of) the translated message and
then get the original English string from the .po for example, or
translate the text back to know what it is talking about, or ask the
reporter to translate it for you.
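
The ".po lookup" step can be sketched roughly as follows. This is an illustration under loud assumptions: the catalogue is made up, the helper name `original_messages` is hypothetical, and the regular expression only handles single-line `msgid`/`msgstr` pairs, whereas real .po files also contain multi-line strings, plural forms and comments.

```python
import re

# Tiny made-up catalogue for illustration only.
PO_TEXT = '''
msgid "No such file or directory"
msgstr "Datei oder Verzeichnis nicht gefunden"

msgid "Permission denied"
msgstr "Keine Berechtigung"
'''

# Matches one single-line msgid immediately followed by its msgstr.
ENTRY = re.compile(r'msgid "(.*)"\nmsgstr "(.*)"')


def original_messages(po_text, fragment):
    """Return the English msgid of every entry whose translation contains fragment."""
    return [msgid for msgid, msgstr in ENTRY.findall(po_text) if fragment in msgstr]


matches = original_messages(PO_TEXT, "Verzeichnis")
```

In practice a plain `grep` over the package's po/ directory does the same job; the point is only that the translated string leads straight back to the English one.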
Post by Simon McVittie
Whenever a package is non-reproducible, fails to build or fails tests
in certain locales (for example legacy non-UTF-8 locales like C or
en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
move on.
This is just (2.) with extra steps, and has the same benefit and cost
for the affected packages as (2.) plus an additional cost (someone must
identify that the package is in this category and copy/paste the extra
line), and the same benefit and costs for unmodified packages as (1.).
I agree though, that if we end up with every debian/rules
unconditionally exporting LC_ALL, then there's not much point in not
making the build driver do it instead.


Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env option,
which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and
umask=0022.

But _iff_ we end up with dpkg-buildpackage being declared the only
supported entry point, _and_ there is consensus that we'd want to set
some kind of locale variable from the build driver, then I guess this
could be done as a Debian vendor-specific thing, or via the
dpkg-build-api(7) interface.

Thanks,
Guillem
Holger Levsen
2024-06-07 13:40:01 UTC
Post by Guillem Jover
And I think forcing a locale on buildds makes perfect sense, because
we want easy access to build logs. But forcing LC_ALL from the build
tools implies that no tool invoked will get translated messages at
all, and means that users (not just maintainers) might have a harder
time understanding what's going on, we make lots of l10n work rather
pointless, and if no one is running with different locales then l10n
bugs might easily creep in.
absolutely agreed & thanks for bringing up this aspect!
Post by Guillem Jover
Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and
umask=0022.
that's great news!
--
cheers,
Holger

⢀⣎⠟⠻⢶⣊⠀
⣟⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

The past is over.
Alexandre Detiste
2024-06-07 13:50:01 UTC
Maybe a compromise would be to at least mandate some UTF-8 locale.
Simon McVittie
2024-06-07 15:50:01 UTC
Post by Alexandre Detiste
Maybe a compromise would be to at least mandate some UTF-8 locale.
Having a UTF-8 locale available would be a good thing, but allowing
packages to rely on the active locale to be UTF-8 based reduces our testing
scope.
I'm not sure I follow. Are you suggesting that we should build each
package *n* times (in a UTF-8 locale, in a legacy locale, in locales
known to have unique quirks like Turkish and Japanese, ...), just for
its side-effect of *potentially* passing through those locales to the
upstream test suite?

If we want to run the test suite in each of those locales, then I think
instead we should do just that: run the test suite (and only the test
suite!) in each of those locales. dh_auto_test could grow a way to do
that, if there's demand. Repeating the whole compilation seems like a
sufficiently large waste of time and resources that, in practice, we
are not going to be able to do this routinely for more than a couple
of locales.

Or, better, we should provide packages with a way to guarantee that
certain locales are available[1], and then tests that are known to be
testing locale-sensitive things should explicitly switch into the locales
of interest, to make sure that they are tested every time, not just if
the builder's locale happens to be the interesting one. For example,
glib2.0's test suite temporarily switches to a Japanese locale in order to
test its handling of formatting dates with an era (Japanese is one of the
few locales where that concept exists), and it does this even when built
by a non-Japanese-speaking developer like me. If it relied on the current
locale for its test coverage, then we would never have discovered #1060735
unless it was coincidentally built by a Japanese developer who is using
a big-endian machine, which seems quite unlikely to happen by chance!

Or, when you say "testing", do you really mean "doing the build, for
the side-effect of seeing whether it succeeds or fails"? (That's not
really the same thing as running a test suite.)

Realistically, several important tools require a UTF-8 locale and will
not work reliably otherwise. Meson either is one of these, or was in
the past, as a result of Python's Unicode behaviour; so debhelper sets
LC_ALL=C.UTF-8 when it invokes Meson, ignoring any preference that might
have been expressed by the caller of dpkg-buildpackage.

[1] Build-Depends: locales-all does this, but is rather heavy.
debian/tests/run-with-locales in e.g. src:glib2.0 is another
implementation, but a more centralized version of this would probably
be better.
Basically, we need to define the severity of locale bugs
More than that, we need to define what is a locale bug and what is a
non-bug - ideally based on what is genuinely useful, rather than on
"this is something that could theoretically work". We should try to
solve bugs, because that benefits our users and Free Software, but we
should put zero effort into solving non-bugs.

What we say is a bug, and what we say is not a bug, is a policy decision
about our scope: we support some things and we do not support others.
There's nothing magical or set-in-stone about the set of things that we
do and don't support, and it can be varied if there is something close to
consensus that it ought to be. When we're deciding what is in-scope and
what is out-of-scope, we should make that decision based on comparing the
costs and benefits of a wider/narrower scope: "this is in-scope because
I say so" or "this is in-scope because we have traditionally said it is"
are considerably weaker arguments than "this is in-scope because otherwise
we can't access this benefit".

As an analogy: we have chosen to define in Policy that /bin/sh is anything
that supports the POSIX shell language, plus a few designated extensions
like `echo -n`. A consequence of that is that "foobar fails to build when
/bin/sh is bash" is considered to be a bug (which, in an ideal world,
we would solve), because bash is a POSIX shell; but "foobar fails to
build when /bin/sh is csh" is a non-bug (which we wouldn't even leave
open as wontfix, we would just close it), because csh isn't a POSIX shell.

In a different parallel universe, we might reasonably have declared
that /bin/sh is required to be bash (like it is in e.g. Fedora), which
would result in some things that are currently bugs becoming non-bugs -
that's a narrower scope than the one that Debian-in-this-universe has,
resulting in it being easier to maintain but less flexible.

Or, conversely, in a different parallel universe, we might have said that
/bin/sh can be literally any POSIX shell, which is a wider scope than
Debian-in-this-universe: "FTBFS when /bin/sh doesn't support echo -n"
is currently a non-bug, but in that hypothetical distribution it would
be a bug, making the distribution harder to maintain but more flexible.

I am, personally, a fan of setting a scope that makes some of our more
obscure or theoretical bugs into non-bugs, because that would let us
concentrate our attention on the remaining bugs (the ones that are more
likely to indicate a genuine problem for our users).

What Gioele proposed at the beginning of this thread can be rephrased as
declaring that "FTBFS when locale is not C.UTF-8" and "non-reproducible
when locale is varied" are non-bugs, and therefore they are not only
wontfix, but they should be closed altogether as being out-of-scope.
Of course, if we chose to have this be our policy, then it would be best
if dpkg-buildpackage and/or debhelper would force the C.UTF-8 locale, so
that builds with different locales simply can't happen - instead of
allowing the build to continue, but considering it to be not-a-bug for
it to fail or give different results. Fortunately, forcing a C.UTF-8
locale is very easy (set some environment variables before forking each
subprocess).
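
To illustrate "set some environment variables before forking each subprocess", here is a minimal sketch. It is not how dpkg-buildpackage or debhelper actually do it; the helper name `c_utf8_env` and the decision to drop the other locale variables are assumptions made for this example.

```python
import os
import subprocess
import sys


def c_utf8_env(base=None):
    """Return a copy of the environment that forces the C.UTF-8 locale.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base is None else base)
    for name in list(env):
        # LC_ALL overrides LANG and every LC_* category, so those are
        # redundant; LANGUAGE is dropped as well, because GNU gettext
        # consults it when choosing message translations.
        if name in ("LANG", "LANGUAGE") or name.startswith("LC_"):
            del env[name]
    env["LC_ALL"] = "C.UTF-8"
    return env


# Every subprocess started with this environment sees the same locale,
# whatever the caller's own settings were.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"],
    env=c_utf8_env(), stdout=subprocess.PIPE, text=True, check=True,
)
```

Because the caller's own environment is left untouched, an interactive user keeps their preferred locale outside the build.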

Or, Alexandre's "weaker" suggestion, to which you are replying, could
be rephrased as declaring that things like "FTBFS when locale is not
UTF-8" and "non-reproducible when one of the two builds is non-UTF-8" are
non-bugs, but "FTBFS when locale is ja_JP.UTF-8" and "non-reproducible
when the two builds are different UTF-8 locales" would still be bugs
under Alexandre's suggestion. Similarly, if we chose to have *this* be
our policy, then it would be best if dpkg-buildpackage and/or debhelper
would either detect a non-UTF-8 locale and error out early, or detect a
non-UTF-8 locale and quietly replace it with some UTF-8 locale (perhaps
C.UTF-8, or perhaps the closest equivalent UTF-8 locale, like replacing
ja_JP.EUC-JP with ja_JP.UTF-8).
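
The "quietly replace it with the closest equivalent UTF-8 locale" option could look roughly like this. It is a sketch under stated assumptions: the helper name `to_utf8_locale` is hypothetical, it only rewrites the locale name, and it does not check that the resulting locale is actually installed.

```python
import re

# Accepts the common spellings "UTF-8", "utf8", "UTF8", ...
_UTF8 = re.compile(r"utf-?8", re.IGNORECASE)


def to_utf8_locale(name):
    """Map a locale name to a UTF-8 one, e.g. ja_JP.EUC-JP -> ja_JP.UTF-8.

    Hypothetical helper for illustration only.
    """
    # Split "lang_TERRITORY.codeset@modifier" into its parts.
    base, _, modifier = name.partition("@")
    lang, _, codeset = base.partition(".")
    if lang in ("C", "POSIX"):
        return "C.UTF-8"
    if _UTF8.fullmatch(codeset):
        return name  # already a UTF-8 locale, leave it alone
    # Swap the codeset and keep any modifier such as @latin.
    return lang + ".UTF-8" + ("@" + modifier if modifier else "")


converted = to_utf8_locale("ja_JP.EUC-JP")
```

The "error out early" variant would use the same test on the codeset and abort instead of rewriting the name.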

I can remember several conversations in the past about potentially
dropping support for legacy non-UTF-8 locales like en_GB.ISO-8859-15
*completely* (not just de-supporting them for package builds, but
de-supporting their use on Debian under any circumstances), and
Alexandre's suggestion is a subset of that: leaving them available for
users who might still need them for whatever reason, but declaring that
they are not something we support at package-build time.
Besides locales, there are other things that might affect outcomes, and we
need to find some reasonable position between "packages should be
reproducible even if built from the maintainer's user account on their
personal machine" and "anything that is not a sterile systemd-nspawn
container with exactly the requested Build-Depends and no Recommended
packages causes undefined behaviour."
Yes. I think there is room for a more nuanced approach to this general
design principle: we can define some sources of variation as "possible
but not recommended", set them to a known value for official buildds,
make it as easy as possible to set them to a known value for local
test-builds, and consider FTBFS or non-reproducibility under those
variations to be a *low-severity* bug.

For instance, if a package is non-reproducible depending on whether I
happen to have libreally-obscure-dev installed, of course ideally that
should be fixed, but I would say that it's a much lower severity than
the package being non-reproducible depending on whether I have a
more commonly-required package like libglib2.0-dev which might be
difficult to remove non-disruptively.

Similarly, if a package FTBFS when built on a Tuesday, I'd say that's RC;
if it FTBFS when my locale is en_GB.UTF-8, under our current policies I'd
personally say that's annoying but non-RC (because if I'm debugging
the package, I could always grumble and work around that issue by
LC_ALL=C.UTF-8); and if it FTBFS when built on a machine where the
/nonexistent directory does, in fact, exist, then I would say that's
a non-bug.

(A concrete example of the latter: I'm pretty sure glib2.0 will fail
its test suite if /nonexistent exists, but if someone reported that as
a bug, I would be inclined to reply "/nonexistent shouldn't exist, the
clue's in the name" and close it.)

For locales and other facets of the execution environment that are
similarly easy to clear/reset/sanitize/normalize, we don't necessarily
need to be saying "if you do a build in this situation, you are doing
it wrong", because we could equally well be saying "if you do a build in
this situation, the build toolchain will automatically fix it for you" -
much more friendly towards anyone who is building packages interactively,
which seems to be the use-case that you're primarily interested in.
Personally my preference would be as close as possible to
[not needing a special build environment],
because if I ever need to work on someone else's package, the chance is high
that I will need incremental builds and a graphical debugger, and both of
these are a major hassle in containers.
I don't think this is an either/or, but more like a spectrum: the more
your build environment diverges from what we might consider to be our
reference build environment, the more likely it is that a package will
FTBFS, fail tests, be non-reproducible or otherwise misbehave. It's
up to us, as a project, where to draw the line for "this divergence is
completely normal so the bug is RC" and, conversely, "this divergence
is so strange that it's a non-bug".

For something like the locale, it's very easy: if we decide that certain
locales are out-of-scope, then the build toolchain (dpkg-buildpackage or
similar) could just not allow the out-of-scope situations, because it's
straightforward (and doesn't require privileges) to force the build into
an in-scope situation and continue from there.

smcv
Simon McVittie
2024-06-10 15:30:01 UTC
Permalink
Reproducibility outside of sterile environments is however a problem for us
as a distribution, because it affects how well people are able to contribute
to packages they are not directly maintaining
If our package-building entry point sets up aspects of its desired
normalized (or "sterile") environment itself, surely it's equally easy
for those contributors to build every package in that way, whether they
maintain this particular package or not?
if my package is not required to work outside
of a very controlled environment, that is also an impediment to
co-maintenance
I'm not sure that follows. If the only thing we require is that it works
in a controlled environment, and the controlled environment is documented
and easy to achieve, then surely every prospective co-maintainer is in
an equally good position to be contributing? That seems better, to me, than
having unwritten rules about what environment is "close enough" and what
environment doesn't actually work.

If I want to contribute to (let's say) both GNOME and KDE, but the GNOME
team expects me to be building in one controlled environment, and the KDE
team expects me to be building in a *different* controlled environment,
then sure, that would be a barrier to contribution: I'd have to do that
setup once per team, and maybe they'd be mutually incompatible. But that
isn't going to be the case if we're setting a policy for the whole distro,
which only needs to happen once?

We already do expect maintainers to be building in a specified
environment: Debian unstable (not stable, and not Ubuntu, for example).

I can see that if our policy was something like "must build in a schroot",
then that would be making us vulnerable to a lock-in where we can't
move to Podman or systemd-nspawn or Incus or whatever is the flavour of
the month because our policy says we use schroot, and then we end up
shackled to schroot's particular properties and limitations. (Indeed,
to an extent, we already have that problem by using schroot on official
buildds, and as a result being unable to gain much benefit from work
done on container technologies outside the Debian bubble.)

But that's not what was proposed by this thread: this thread is about
locales. Now that glibc has C.UTF-8 built-in and non-optional, you can
set a normalized or sterile locale regardless of whether you're building
on bare metal, in a VM, in a schroot, in Docker, or whatever, and it's
very easy to do that in a tool (or even an interactive shell) and have
it inherit down through the build? So I'm not sure I see the problem?

If you're making a wider point about use of containers etc. that is
orthogonal to setting the locale, then that would be a valid objection
to someone saying "we should standardize on building in Docker" (and I
would make a similar objection myself), but that's not this thread.

(I also do agree that it is an anti-pattern if we have a specified
environment where tests or QA will be run, and serious consequences for
failures in that environment, without it being reasonably straightforward
for contributors to repeat the testing in a sufficiently similar
controlled environment that they have a decent chance at being able to
reproduce the failure. But, again, that isn't this thread.)
a lot of the debates we've had in the past years is who gets to
decide what is in scope
Yes, that's always going to be the case for a community that doesn't
have an authority figure telling us "the scope is what I say it is". We
have debates when we don't all agree, and the scope of our collective
project is one of the foundations for all the other decisions we make,
so it's certainly something that we can't expect to be unanimous. (Insert
wise words from Russ Allbery about the difference between unanimity and
consensus here...)

I hope we can come close enough to a consensus that we're all generally
willing to accept it, though, even if that means sometimes accepting a
narrower or wider scope than I would personally prefer.
Post by Simon McVittie
What Gioele proposed at the beginning of this thread can be rephrased as
declaring that "FTBFS when locale is not C.UTF-8" and "non-reproducible
when locale is varied" are non-bugs, and therefore they are not only
wontfix, but they should be closed altogether as being out-of-scope.
Indeed -- however this class of bugs has already been solved because
reproducible-builds.org have filed bugs wherever this happened, and
maintainers have added workarounds where it was impossible to fix.
Someone (actually, quite a lot of someones) had to do that testing,
and those fixes or workarounds. Who did it benefit, and would they have
received the same benefit if we had said "building in a locale other than
C.UTF-8 is unsupported", or in some cases "building in a non-UTF-8 locale
is unsupported", and made it straightforward to build in the supported
locales?

I think there is a danger that we sink time and effort into doing work
that we are doing because our (written or unwritten) policy demands it,
even when it isn't clear that there is a real benefit from that work being
done. If that work is a fun and interesting puzzle and someone actively
wants to do it, then great!, but if it's something that a contributor
doesn't actually want to do, and is only doing because there is a rule
that demands it or a sanction that will be applied if it isn't done,
then we do need to consider whether the cost (imposing that work) is
justified by the benefit.
Turning this workaround into boilerplate code was a mistake already, so the
answer to the complaint about having to copy boilerplate code that should be
moved into the framework is "do not copy boilerplate code."
If you don't want package-specific code to be responsible for forcing
a "reasonable" locale where necessary, then what layer do you want to
be responsible for it? dpkg-buildpackage? debhelper? But then you go
on to say that you don't want those layers to set the locale either,
so I'm confused...

smcv
Simon Richter
2024-06-11 09:30:01 UTC
Permalink
Hi,
Post by Simon McVittie
Reproducibility outside of sterile environments is however a problem for us
as a distribution, because it affects how well people are able to contribute
to packages they are not directly maintaining
If our package-building entry point sets up aspects of its desired
normalized (or "sterile") environment itself, surely it's equally easy
for those contributors to build every package in that way, whether they
maintain this particular package or not?
Yes, but building the package is not the hard part in making a useful
contribution -- anything but trivial changes will need iterative
modifications and testing, and the package building entrypoint is
limited to "clean and build entire package" and "build package without
cleaning first", with the latter being untested and likely broken for a
lot of packages -- both meson and cmake utterly dislike being asked to
configure an existing build directory as if it were new.

For my own packages, I roughly know how far I can deviate from the clean
environment and still get meaningful test results, but for anything
else, I will still need to deep-dive into the build system to get
something that is capable of incremental builds.
Post by Simon McVittie
if my package is not required to work outside
of a very controlled environment, that is also an impediment to
co-maintenance
I'm not sure that follows. If the only thing we require is that it works
in a controlled environment, and the controlled environment is documented
and easy to achieve, then surely every prospective co-maintainer is in
an equally good position to be contributing? That seems better, to me, than
having unwritten rules about what environment is "close enough" and what
environment doesn't actually work.
I will need to deviate from the clean environment, because the clean
environment does not have vim installed. Someone else might need to
deviate further and have a graphical environment and a lot of dbus
services available because their preferred editor requires it.

Adding a global expectation about the environment that a package build
can rely on *creates* an unwritten per-package rule whether it is
permissible to deviate from this expectation during development.

I expect that pretty much no one uses the C.UTF-8 locale for their
normal login session, so adding this as a requirement to the build
environment creates a pretty onerous rule: "if you want to test your
changes, you need to remember to call make/ninja with LC_ALL=C.UTF-8."

Of course we know that this rule is bullshit, because the majority of
packages will build and test fine without it, but this knowledge is
precisely one of the "unwritten rules" that we're trying to avoid here.
Post by Simon McVittie
We already do expect maintainers to be building in a specified
environment: Debian unstable (not stable, and not Ubuntu, for example).
I develop mostly on Debian or Devuan stable, then do a pbuilder build
right before upload to make sure it also builds in a clean unstable
environment. The original requirement was mostly about uploading binary
packages, which we (almost) don't do anymore.
Post by Simon McVittie
(I also do agree that it is an anti-pattern if we have a specified
environment where tests or QA will be run, and serious consequences for
failures in that environment, without it being reasonably straightforward
for contributors to repeat the testing in a sufficiently similar
controlled environment that they have a decent chance at being able to
reproduce the failure. But, again, that isn't this thread.)
This is largely what I think is this thread -- narrowing the environment
where builds, tests and QA will be run, and narrowing what will be
considered a bug.
Post by Simon McVittie
Indeed -- however this class of bugs has already been solved because
reproducible-builds.org have filed bugs wherever this happened, and
maintainers have added workarounds where it was impossible to fix.
Someone (actually, quite a lot of someones) had to do that testing,
and those fixes or workarounds. Who did it benefit, and would they have
received the same benefit if we had said "building in a locale other than
C.UTF-8 is unsupported", or in some cases "building in a non-UTF-8 locale
is unsupported", and made it straightforward to build in the supported
locales?
I'd say that developers who don't have English as their first language
have directly benefited from this, and would not have benefited if it
was not seen as a problem if a package didn't build on their machines
without the use of a controlled environment.

I also think that we have indirectly benefited from better test coverage.
Post by Simon McVittie
Turning this workaround into boilerplate code was a mistake already, so the
answer to the complaint about having to copy boilerplate code that should be
moved into the framework is "do not copy boilerplate code."
If you don't want package-specific code to be responsible for forcing
a "reasonable" locale where necessary, then what layer do you want to
be responsible for it?
I want this to be package-specific, but applied only when necessary.

The original complaint was that having to copy this boilerplate code to
fix reproducibility issues to each new package was a waste of
maintainers' time and that it should be centralized into some framework,
and my response to that is to stop copying unnecessary code into
packages that don't need it.

At best, it does nothing because the package isn't broken, at worst it
manifests additional bugs while someone is modifying the package to fix
another problem.

If we are to move this into a framework, then this should take a
declarative form, like "Rules-Requires-Locale: C.UTF-8", and it should
be a goal to minimize use of this.

Simon
Guillem Jover
2024-07-02 01:50:01 UTC
Permalink
Hi!
Post by Alexandre Detiste
Maybe a compromise would be to at least mandate some UTF-8 locale.
Ah, good thinking! That would actually seem acceptable. I've prepared
the attached preliminary patch (missing better commit message, etc),
as a PoC for what this could look like. If there's consensus about
something like this, I'd be happy to merge into a future dpkg release.

I'm not sure, though, whether this would be enough to make it
possible to remove the hardcoded LC_ALL=C.UTF-8 in debhelper, which
seems counter to l10n work, or perhaps to switch it to a subset of
the locale settings. Niels?

Thanks,
Guillem
Simon McVittie
2024-07-02 09:00:01 UTC
Permalink
Post by Alexandre Detiste
Maybe a compromise would be to at least mandate some UTF-8 locale.
dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages
Allowing ASCII seems counterproductive: that puts us in the code path
where various tools and runtimes (especially Python) will refuse to
process or output anything outside the 0-127 range, which I believe is
exactly the problem that debhelper aims to solve by using C.UTF-8 for
some categories of package (in particular those that build with Meson).

To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
the C locale is not).

Or perhaps this pseudocode?

if (charset != UTF-8) {
emit a warning
export LC_ALL=C.UTF-8
unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE (etc.)
}
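
The pseudocode above could be rendered in POSIX shell roughly as
follows. This is a sketch under one loud assumption: the codeset is
inferred from the locale name found in the environment (LC_ALL, then
LC_CTYPE, then LANG) rather than by asking libc, so a name without an
explicit codeset is treated as non-UTF-8. Both function names are
hypothetical.

```shell
# Locale name that governs LC_CTYPE: LC_ALL wins, then LC_CTYPE,
# then LANG, and finally the implicit "C" locale.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then printf '%s\n' "$LANG"
    else printf '%s\n' C
    fi
}

force_utf8_locale() {
    case "$(effective_ctype)" in
        *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) return 0 ;;
    esac
    echo "warning: non-UTF-8 locale, falling back to C.UTF-8" >&2
    export LC_ALL=C.UTF-8
    # LC_ALL already overrides these; unsetting them only keeps the
    # environment tidy, as in the pseudocode.
    unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
}
```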

smcv
Guillem Jover
2024-07-02 12:40:02 UTC
Permalink
Hi!
Post by Simon McVittie
Post by Alexandre Detiste
Maybe a compromise would be to at least mandate some UTF-8 locale.
dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages
Allowing ASCII seems counterproductive: that puts us in the code path
where various tools and runtimes (especially Python) will refuse to
process or output anything outside the 0-127 range, which I believe is
exactly the problem that debhelper aims to solve by using C.UTF-8 for
some categories of package (in particular those that build with Meson).
To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
the C locale is not).
Err, you are right. I think I implemented this from my recollection of
the thread, trying to enforce as little as possible, and to try to let
users set "translations" to pure ASCII if desired, but that then defeats
the point brought up in the original mail, and the locale setting in
debhelper. I'll amend the PoC commit to only allow UTF-8.

(Also as long as LC_CTYPE is UTF-8 I think it should not matter whether
LC_MESSAGES is non-UTF-8 as the output codeset should still be UTF-8.)
Post by Simon McVittie
Or perhaps this pseudocode?
if (charset != UTF-8) {
emit a warning
export LC_ALL=C.UTF-8
unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE (etc.)
}
As it stands, I don't think this would be good enough, because it would
introduce an implicit setting in dpkg-buildpackage while it is
currently not the only supported entry point, so packages could still
not rely on this being always set, and it still disables translated
messages.

While erroring out (even when dpkg-buildpackage is still not the only
supported entry point) would not give a full guarantee that a package
build is always done in a UTF-8 locale, it at least forces the caller
(be that a tool or a human) to change the running environment, while
not forcing untranslated messages. I guess this could be made a stronger
guarantee if debhelper switched from unconditionally setting the locale
to performing a similar check and erroring out too (instead of simply
removing the locale setting).


But from your pseudocode, now I realize the check I implemented is
probably too naive, as it should probably at least also check whether
LC_COLLATE is also UTF-8. So I'll try to think how to make it more
robust.
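
A check along those lines, erroring out instead of overriding, might
be sketched as below. This is not the dpkg patch. The function names
are hypothetical, and the codeset is again guessed from the locale
name only. LC_MESSAGES is deliberately left unchecked, following the
remark above that its codeset should not matter as long as LC_CTYPE is
UTF-8.

```shell
# Effective value of one locale category ($1, e.g. LC_COLLATE):
# LC_ALL wins, then the category itself, then LANG, then "C".
effective_category() {
    eval "_own=\${$1:-}"
    if [ -n "${LC_ALL:-}" ]; then printf '%s\n' "$LC_ALL"
    elif [ -n "$_own" ]; then printf '%s\n' "$_own"
    elif [ -n "${LANG:-}" ]; then printf '%s\n' "$LANG"
    else printf '%s\n' C
    fi
}

# Return non-zero unless both LC_CTYPE and LC_COLLATE name a UTF-8
# locale; translated messages are not touched.
require_utf8_locale() {
    for _cat in LC_CTYPE LC_COLLATE; do
        case "$(effective_category "$_cat")" in
            *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) ;;
            *) echo "error: $_cat is not a UTF-8 locale" >&2
               return 1 ;;
        esac
    done
}
```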

But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
using --sanitize-env, right away though.

Thanks,
Guillem
Alexandre Detiste
2024-07-06 11:20:01 UTC
Permalink
Hi,

Thank you both.

One thing that could be fixed quite quickly is fixing the few
remaining official buildd workers that do not yet run with a UTF-8 locale.

If one is unlucky the build will mysteriously fail.

Adding export {LC_ALL|LANG|LC_CTYPE}=C.UTF-8
in every single d/rules by fear of this seems overkill.

Greetings

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074586
https://buildd.debian.org/status/package.php?p=xrayutilities
Post by Guillem Jover
Hi!
Post by Simon McVittie
Post by Alexandre Detiste
Maybe a compromise would be to at least mandate some UTF-8 locale.
dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages
Allowing ASCII seems counterproductive: that puts us in the code path
where various tools and runtimes (especially Python) will refuse to
process or output anything outside the 0-127 range, which I believe is
exactly the problem that debhelper aims to solve by using C.UTF-8 for
some categories of package (in particular those that build with Meson).
To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
the C locale is not).
But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
using --sanitize-env, right away though.
Thanks,
Guillem
Guillem Jover
2024-08-03 17:20:01 UTC
Permalink
Hi!
Post by Alexandre Detiste
Post by Guillem Jover
Post by Simon McVittie
Post by Alexandre Detiste
Maybe a compromise would be to at least mandate some UTF-8 locale.
dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages
Allowing ASCII seems counterproductive: that puts us in the code path
where various tools and runtimes (especially Python) will refuse to
process or output anything outside the 0-127 range, which I believe is
exactly the problem that debhelper aims to solve by using C.UTF-8 for
some categories of package (in particular those that build with Meson).
To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
the C locale is not).
But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
using --sanitize-env, right away though.
I did something like that as part of dpkg 1.22.7, with commit:

https://git.dpkg.org/cgit/dpkg/dpkg.git/commit/?id=df60765ed4bc6640b788c796dd0c627d7714f807

Which should guarantee a UTF-8 codeset and stable sorting, while
preserving any translated output messages (and other locale settings).
Post by Alexandre Detiste
One thing that could be fixed quite quickly is fixing the few
remaining official buildd workers that do not yet run with a UTF-8 locale.
Something I realized after adding the above change is that sbuild has
been running dpkg-buildpackage with --sanitize-env for a while now.
Checking now, I was told about this at the time, but I either didn't
piece together its consequences or forgot that the sbuild package is
nowadays used on the build daemons (instead of the old fork). :)
(BTW, not blaming josch! I think that change in sbuild makes sense on
its own; I guess I was just not expecting the option to be used that
way, and perhaps its documentation should have made that clearer. :)

I guess this is both good and "bad". It's good because now all build
daemons will have a guaranteed UTF-8 locale codeset already starting
with Debian trixie, as requested in this thread, and give us a more
uniform build environment. It's "bad" because part of the reason to
add this through a new --sanitize-env option was to make this behavior
and its guarantees opt-in, but if the official Debian builds are using
this, then it's in a way equivalent to having set this by default w/o
the option, but perhaps worse because people running local builds will
not have the same environment (although it's going to be easy to
replicate by passing that option, but a bit harder when calling
debian/rules directly which we still support).
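
For someone calling debian/rules directly, the settings this thread
attributes to --sanitize-env (LC_COLLATE=C.UTF-8 and umask 0022, plus
LC_CTYPE=C.UTF-8) could be approximated by hand. The following is only
a sketch based on that description, not a copy of dpkg's logic, and
the function name run_sanitized is hypothetical.

```shell
# Hypothetical wrapper: run a command in a subshell with a UTF-8
# codeset, stable sorting and a fixed umask.
run_sanitized() {
    (
        umask 0022
        # LC_MESSAGES and LANG are left alone, so translated output
        # is preserved. Assumption: LC_ALL is not set; if it is, it
        # overrides both categories and this sketch does not handle it.
        export LC_CTYPE=C.UTF-8 LC_COLLATE=C.UTF-8
        "$@"
    )
}
```

Usage would be along the lines of "run_sanitized debian/rules build";
the subshell keeps the calling shell's umask and locale unchanged.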

I'm not sure the current state is ideal, because we are back to
packages being able to rely on some stuff on build daemons, that are
not guaranteed by default for our supported build entry points, and if
the result of this is that we end up patching all dpkg-buildpackage
callers to pass --sanitize-env, then we could have as well simply
changed the default instead. I think a way forward could be to make
the sanitizing the default, and finally drop debian/rules as a
supported (user) build entry point, I had in mind re-proposing this
already, but the above kind of gives it more urgency, so I'll try to
do that soon.

This also means, I guess, that part of the previous freedom I thought
we had to modify the --sanitize-env behavior is kind of gone now (and
would be gone too if we move its behavior as the default one), and we
should apply similar care as if the default itself was being changed,
because it has the potential to break the archive (via build daemons).
I'm thinking that depending on the changes there, it might be better
to gate them via dpkg-build-api(7) levels. I should also document the
vendor specific behavior in some manual page, as it is currently
listed as unspecified "vendor specific".
Post by Alexandre Detiste
If one is unlucky the build will mysteriously fail.
Adding export {LC_ALL|LANG|LC_CTYPE}=C.UTF-8
in every single d/rules by fear of this seems overkill.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074586
https://buildd.debian.org/status/package.php?p=xrayutilities
I implemented the attached patch (also on the next/req-utf8 branch) to
force a locale with a UTF-8 codeset, which would be a no-op now when
using --sanitize-env, but I didn't merge that for now, because I'm not
sure of the potential fallout, given that other infrastructure things
might be running dpkg-buildpackage w/o passing --sanitize-env. So I
think those would need to be found and changed before deploying that
change.

<https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/log/?h=next/req-utf8>

But then, I guess whether merging that makes sense or not also depends
on how we want to proceed with the debian/rules build entry point, and
whether we are going to switch the default or transition to amend all
callers (which might still not catch private infra and similar).

Thanks,
Guillem
Simon McVittie
2024-08-03 18:20:01 UTC
Permalink
Post by Guillem Jover
I'm not sure the current state is ideal, because we are back to
packages being able to rely on some stuff on build daemons, that are
not guaranteed by default for our supported build entry points
This was already true, though: the official buildds all run sbuild
(which runs dpkg-buildpackage, not debian/rules), they're all set up in
whatever way that the Debian sysadmins prefer, they probably all run
as uid 'sbuild', they probably all use the same filesystem or one of
only a few filesystems for the build directory and /tmp (ext4? tmpfs?),
until recently they all ran under schroot, now they all run under either
schroot or unshare, and so on. In many ways that's a good thing: their
job is to build packages, not to validate that the packages are resilient
against unusual configurations.

Of course, if a package works on our infrastructure but FTBFS in a
reasonably typical build environment (like a contributor's laptop, or an
ordinary cloud VM image) then that's certainly inconvenient, and is a
bug that should ideally be reported and fixed. I don't think those bugs
always need to be RC: it depends on how "normal" the build environment
is, and how easy it is to work around the problem by building differently.

I don't think it should be a goal to make all of our packages build
successfully in "unreasonable" build environments (whatever we choose to
make that mean). For instance, I suspect that a significant proportion
of the archive will FTBFS if you try to build them on NTFS or SMB/CIFS,
and I'm not convinced that's even a bug: we can and should use a more
suitable filesystem instead.

smcv
Chris Hofstaedtler
2025-02-23 12:50:01 UTC
Permalink
Hi,
I think a way forward could be to make the sanitizing the default, and
finally drop debian/rules as a supported (user) build entry point, I had in
mind re-proposing this already, but the above kind of gives it more urgency, so
I'll try to do that soon.
Given I ran into this discrepancy today (in util-linux, buildd and
my local build are fine, salsa ci and pbuilder are not), I would
appreciate it if the default would change.

It's probably too late for trixie now, but maybe for forky?

Thanks,
Chris

Gioele Barabucci
2024-06-07 14:20:01 UTC
Permalink
Post by Guillem Jover
Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and
umask=0022.
That's great news!
Post by Guillem Jover
But _iff_ we end up with dpkg-buildpackage being declared the only
supported entry point, [...]
Personally, I really appreciate how dpkg-buildpackage more and more
provides a standardized API to/for building Debian packages.

However I would prefer to have this API explicitly described in Policy
rather than hidden and implicitly defined by the code of a specific program.

What I propose is a new section in Policy [1] that explicitly lists all
these environment requirements (umask, LC_*, SOURCE_DATE_EPOCH, TMPDIR,
/bin/sh = POSIX shell + -n, etc). Each builder would then be changed to
be conformant by default, with the option to steer away if desired (for
example `dpkg-buildpackage --with-env-var LC_ALL=fr_FR.UTF-8`). This
would create a uniform environment while preserving the ability to run
d/rules with user-specific settings.

[1] Or any other similarly "binding" document.

Regards,
--
Gioele Barabucci
Simon McVittie
2024-06-07 16:30:01 UTC
Permalink
Post by Guillem Jover
I'm a non-native speaker, who has been involved
in l10n for a long time, while at the same time I've pretty much
always run my systems with either LANG=C.UTF-8 or before that LANG=C,
LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.
So diagnostic messages in your non-English language are so important to
you that you ... set your locale environment variables to values that
result in you seeing diagnostic messages in English instead? I'm not
sure I understand the point you're making here :-)

If your point is that people-who-are-not-you place a higher value on
having diagnostic messages come out in their non-English language than
you personally do, then, yes, that's certainly a valid thing for those
people to want.

But I'm not sure that our current package set actually achieves that -
increasingly many of our packages overwrite the locale with "C.UTF-8"
in some layer of their build system, because they cannot guarantee that
the locale they inherit from the environment is anything reasonable (in
particular, it might be "C", which often breaks tools that want to work
with non-ASCII filenames, inputs or outputs). In the enumeration from
my earlier message, you want (1.), but increasingly, what you actually
get is (2½.) instead, and that results in neither you nor Gioele getting
the results you would hope for.

The compromise that Alexandre suggested elsewhere in the thread -
requiring the locale to be *something* UTF-8, but leaving it unspecified
exactly which UTF-8 locale, so that a French-speaking developer can ask
for fr_FR.UTF-8 and get their compiler warnings in French - seems like
something that might actually give you what you want in more cases than
the status quo does? If we mandate a UTF-8 locale, then stack layers like
debhelper's meson plugin could probably stop forcing C.UTF-8.
Post by Guillem Jover
we make lots of l10n work rather pointless
Surely only if that l10n work was done on tools that are only ever run
from package builds, and never interactively? A lot of localization is
done for end-user-facing tools (GUI, TUI or CLI) which are never relevant
during a package build anyway.

Even for compilers and similar non-interactive development tools, if
a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
their upstream development, they'll still benefit from its warnings being
localized into French, even if they would never see those same warnings
during a Debian package build of the same software.

(Analogous: I similarly benefit from gcc having ANSI colour highlights
in its output, even though my Debian package build logs don't have those.)
Post by Guillem Jover
and if no one is running with different locales then l10n
bugs might easily creep in
If no one is running (their interactive sessions) with a particular
locale, why do we even support that locale?

If a locale has users, and they find bugs, then of course those bugs are
something to be fixed (subject to triaging and prioritization, because
we have more bugs than time). But I'm not convinced that occasionally
doing package builds in arbitrary locales is something that will find
locale bugs more readily than real users' normal use of the software
that we ship.

The locale issues I've generally seen during package builds are more like
"I've set up this artificial situation, and now the consequences of what
I asked for are considered to be a bug", for instance "if I run this
tool that wants to output UTF-8 in an ASCII-only locale, it fails with
an error message" (well, of course it does, it's being put in a situation
where it can't do its job as-designed). Or building HTML documentation in
an arbitrary locale, and then having reproducible-builds act surprised
that one build mentions the French translation of "table of contents"
and the other mentions the German translation of "table of contents"
(well, of course it does - "you asked for it, you got it").
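The first of those cases can be reproduced in a few lines. This is only a minimal Python sketch, not taken from any particular package, and the helper name `emit_bytes` is made up for illustration: text containing a non-ASCII character cannot be encoded for an ASCII-only output, so failing is the expected behaviour rather than a bug.

```python
def emit_bytes(text, encoding):
    """Encode text the way a tool would before writing it to its output."""
    return text.encode(encoding)


title = "Table des matières"

# An output encoding of UTF-8 can represent the accented character.
utf8_bytes = emit_bytes(title, "utf-8")

# An ASCII-only output encoding (as in the plain "C" locale) cannot,
# so the tool has to fail, or silently lose data.
try:
    emit_bytes(title, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```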
Post by Guillem Jover
I can understand though the sentiment
of wanting to shrug this problem category off and wanting instead to
sweep it under the carpet, but that has accessibility consequences.
I am not advocating sweeping this problem category under the carpet!
I'm just not convinced that saying "we support building any package
with an arbitrary locale at entry to the build system" is actually a
good way to detect the sorts of locale issues that cause the sorts of
concrete end-user-facing problems that have accessibility consequences.

If we want to run test-suites under multiple locales, then we should
maybe consider doing that, rather than using the locale of the build
system as a proxy for the (single) locale in which tests will be run for
this particular build. Saying "it's a bug if your test suite fails in
tr_TR.UTF-8" doesn't do anything to guarantee that anyone will actually
ever try that particular build scenario.
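One way to picture "running test-suites under multiple locales" is a small driver that repeats the same command with different LC_ALL values. The following is a hedged Python sketch only: the function name `run_under_locales` and the example locale list are assumptions, not an existing Debian tool. Note that a locale that is not installed on the system is typically not an error for the child process, which simply stays in the "C" locale, so a real driver would also need to verify that the requested locales exist.

```python
import os
import subprocess


def run_under_locales(command, locales):
    """Run `command` once per locale and return {locale: exit status}."""
    results = {}
    for name in locales:
        # LC_ALL overrides every other LC_* variable and LANG.
        env = dict(os.environ, LC_ALL=name)
        completed = subprocess.run(command, env=env)
        results[name] = completed.returncode
    return results


# Hypothetical usage:
# run_under_locales(["make", "check"],
#                   ["C.UTF-8", "fr_FR.UTF-8", "tr_TR.UTF-8"])
```

Any locale whose exit status differs from the others is then a concrete, reproducible report, independent of whichever locale the build itself happened to run in.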

And, even if your test suite passes in tr_TR.UTF-8, that doesn't
necessarily mean that the right thing as expected by a Turkish speaker
is actually happening - as a non-Turkish-speaker, I'm certainly not
confident that I could write a unit test for whether dotted vs. dotless
i are being handled correctly, or even identify which component would
benefit from that unit test.
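To make the dotted/dotless i point concrete, here is a small Python sketch (the function name `title_for_sign` is invented for illustration). Python's `str.upper()` uses the default Unicode case mapping and ignores the process locale, so a test asserting the default result passes under any locale, tr_TR.UTF-8 included, even though the result is not what Turkish orthography expects.

```python
def title_for_sign(word):
    """Upper-case a word using Python's locale-independent mapping."""
    return word.upper()


result = title_for_sign("istanbul")

# Turkish pairs "i" with U+0130 (capital I with dot above),
# so a Turkish speaker would expect this spelling instead.
turkish_expected = "\u0130STANBUL"

matches_turkish = result == turkish_expected
```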

smcv
Guillem Jover
2024-08-03 13:30:01 UTC
Hi!

[ Mostly trying to clarify some of my earlier comments. ]
Post by Simon McVittie
Post by Guillem Jover
I'm a non-native speaker, who has been involved
in l10n for a long time, while at the same time I've pretty much
always run my systems with either LANG=C.UTF-8 or before that LANG=C,
LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.
So diagnostic messages in your non-English language are so important to
you that you ... set your locale environment variables to values that
result in you seeing diagnostic messages in English instead? I'm not
sure I understand the point you're making here :-)
Ah, sorry, I see how my sentence might not be easy to fully
unpack. :)

I know enough people in my locale surroundings who either do not have
a good enough command of English, so that output messages being in
English by default would be a significant barrier to entry, or who,
while having a good command of English, still feel more comfortable
with (or just prefer) output messages in their native locale (to
reduce mental load, for example). I've set my locale to C.UTF-8 or
variants (on most of my devices), for the most part as a locale
immersion device (so that I could improve my English skills), while at
the same time I'd consider myself an exception in my locale group. My
involvement in l10n has been to try to help those groups of people, in
addition to helping me retain some usage of my native language, and as
a side effect to spot weird or wrong constructs I might make in English
strings too, which tend to become obvious once you try to translate
them. :)
Post by Simon McVittie
If your point is that people-who-are-not-you place a higher value on
having diagnostic messages come out in their non-English language than
you personally do, then, yes, that's certainly a valid thing for those
people to want.
More or less, the point I was trying to make was that while emitting
messages by default in English would not really affect me, I still think
it would be a significant problem (not just a preference) or a barrier
to entry for a big enough group of people.
Post by Simon McVittie
But I'm not sure that our current package set actually achieves that -
increasingly many of our packages overwrite the locale with "C.UTF-8"
in some layer of their build system, because they cannot guarantee that
the locale they inherit from the environment is anything reasonable (in
particular, it might be "C", which often breaks tools that want to work
with non-ASCII filenames, inputs or outputs). In the enumeration from
my earlier message, you want (1.), but increasingly, what you actually
get is (2½.) instead, and that results in neither you nor Gioele getting
the results you would hope for.
The compromise that Alexandre suggested elsewhere in the thread -
requiring the locale to be *something* UTF-8, but leaving it unspecified
exactly which UTF-8 locale, so that a French-speaking developer can ask
for fr_FR.UTF-8 and get their compiler warnings in French - seems like
something that might actually give you what you want in more cases than
the status quo does? If we mandate a UTF-8 locale, then stack layers like
debhelper's meson plugin could probably stop forcing C.UTF-8.
Ideally, yes. I think the situation now is a bit better with the
recent dpkg uploads, but I'll expand in the thread from Alexandre's
suggestion.
Post by Simon McVittie
Post by Guillem Jover
we make lots of l10n work rather pointless
Surely only if that l10n work was done on tools that are only ever run
from package builds, and never interactively? A lot of localization is
done for end-user-facing tools (GUI, TUI or CLI) which are never relevant
during a package build anyway.
Even for compilers and similar non-interactive development tools, if
a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
their upstream development, they'll still benefit from its warnings being
localized into French, even if they would never see those same warnings
during a Debian package build of the same software.
(Analogous: I similarly benefit from gcc having ANSI colour highlights
in its output, even though my Debian package build logs don't have those.)
Sorry, right, my comment was specifically in the context of the dpkg
tooling (and surrounding scaffolding and helpers). If dpkg is always
forcing output messages in English from, say, dpkg-buildpackage, there
is going to be a set of tools that will pretty much never see any of
their output in localized form.
Post by Simon McVittie
Post by Guillem Jover
and if no one is running with different locales then l10n
bugs might easily creep in
If no one is running (their interactive sessions) with a particular
locale, why do we even support that locale?
This comment was in the context where the tooling forces a specific
locale, so users have no chance of using their own locale even if they
want to.

Thanks,
Guillem
Marco d'Itri
2024-06-06 16:50:01 UTC
Post by Johannes Schauer Marin Rodrigues
A question that goes in a similar direction is whether every d/rules that needs
export DPKG_EXPORT_BUILDFLAGS=y
include /usr/share/dpkg/buildflags.mk
Or whether we should switch the default and require that d/rules is run in an
environment (for example as set-up by dpkg-buildpackage) where these variables
are set?
Indeed, a few years ago I decided that it does not make any sense,
removed all these includes and started always using
dpkg-buildpackage/debuild to call debian/rules.
This is the resilient and future-proof option.
--
ciao,
Marco
Andrey Rakhmatullin
2024-06-06 17:00:01 UTC
Post by Johannes Schauer Marin Rodrigues
Or whether we should switch the default and require that d/rules is run in an
environment (for example as set-up by dpkg-buildpackage) where these variables
are set?
(a previous discussion on this:
https://lists.debian.org/debian-devel/2017/10/msg00317.html)
--
WBR, wRAR
Holger Levsen
2024-06-07 08:40:01 UTC
I would prefer that dpkg-buildpackage provides a "sane" build environment by
default (which I think includes a LC_ setting pointing at a .UTF-8 locale)
and fewer packages explicitly setting those things via debian/rules.
same here. like the rest of the world does in 2024.
Afaics, this would actually make efforts like reproducible builds *easier*
as settings provided by reproducible-builds wouldn't be overwritten by
debian/rules.
it would make a lot of things easier. :)
--
cheers,
Holger

⢀⣎⠟⠻⢶⣊⠀
⣟⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

No matter how many mistakes you make or how slow you progress, you are still
way ahead of everyone who isn't trying.
Daniel Gröber
2024-06-06 12:50:01 UTC
Hi,
Post by Simon Richter
If your package is not reproducible without it, then your package is
broken. It can go in with the workaround, but the underlying problem
should be fixed at some point.
It's easy to say "should be fixed" but finding the source of such build
problems is another matter.

I was debugging a hard-to-find locale repro bug with some people at
mDebConf Berlin and we had this thought: why don't we have a debugger for
this yet? Seems pretty simple to detect in principle:

At build-time, if a program doesn't call setlocale before using locale
dependent standard library functions it's probably a reproducibility
hazard.

Using the LD_PRELOAD hack like fakeroot/faketime we could make the program
crash or print a stack trace at the point it's trying to use the locale
from the environment. That should make it easier to figure out where these
problems even are.

I wonder if there are other repro things we could screen for in a similar
manner?

--Daniel
Simon McVittie
2024-06-06 13:40:01 UTC
Post by Daniel Gröber
Post by Simon Richter
If your package is not reproducible without it, then your package is
broken.
At build-time, if a program doesn't call setlocale before using locale
dependent standard library functions it's probably a reproducibility
hazard.
I think that's the wrong way round: if the program *does* call
setlocale(., "") then it's a potential reproducibility hazard, but
until/unless it calls setlocale or equivalent, it's documented in
setlocale(3) that it runs in the portable (but bad[1]) "C" locale.
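The "C locale until setlocale is called" behaviour can be observed from a script. This is a hedged sketch in Python rather than C: CPython only switches LC_CTYPE at startup, so querying LC_NUMERIC in a fresh interpreter still shows the documented default, whatever LC_ALL says. The helper name and the probe string are illustrative, and the locale passed in does not need to be installed.

```python
import os
import subprocess
import sys

# Query (not set) the LC_NUMERIC category in a freshly started interpreter.
PROBE = "import locale; print(locale.setlocale(locale.LC_NUMERIC))"


def numeric_locale_at_startup(lc_all):
    """Return LC_NUMERIC as seen by a new process started with LC_ALL=lc_all."""
    env = dict(os.environ, LC_ALL=lc_all)
    completed = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()
```

Only once the child calls the equivalent of setlocale(LC_ALL, "") does the value inherited from the environment start to matter, which is the point at which it becomes a potential reproducibility hazard.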

But if a program that is run during compilation does call setlocale, then
it's most likely doing so for a reason - most commonly so that it can emit
diagnostic messages in the user's locale, rather than in programmer-English
(and advocates of l10n would likely say that it's a bug for a program to
emit diagnostic messages *without* having called setlocale(., "") first).
It's only a reproducibility hazard if locale-dependent functions are
used to parse machine-readable input, or to emit output that ends up in
the .deb. Without further context, we cannot know whether locale-sensitive
functions are being used correctly or incorrectly, in the same way that we
can't tell without context whether a use of strcmp() is correct or
whether a related but different function like strcasecmp() was intended.

If we want programs to be locale-insensitive during build, there is a
well-defined interface for that - namely, setting LC_ALL to (C or) C.UTF-8.
If we don't do that, but instead leave locale environment variables set
to whatever arbitrary value has been inherited from the caller, then we
are effectively saying "we want programs to remain locale-sensitive", and
arguably it would be a (wishlist?) bug for those programs to *not*
respect the locale environment variables (at least for their diagnostic
output). It seems to me that this applies equally to programs that are
or aren't typically used during compilation.

If a program uses locale-sensitive functions to parse its configuration
file or format its output or something like that, then that's often a
bug, but it might equally well be working as designed/documented - again,
we can't tell which without domain-specific knowledge of the program.

smcv

[1] unable to output, or in some cases parse, any character outside the
1-127 ASCII range
Adrian Bunk
2024-06-17 22:40:01 UTC
Sorry for being late to this discussion, but there are a few points
and a suggestion I'd like to make:


1. Reproducibility is not a big concern

Quoting policy:

Packages should build reproducibly, which for the purposes of this
document means that given
...
- a set of environment variable values;
...
repeatedly building the source package
...
with ... exactly those environment variable values set
will produce bit-for-bit identical binary packages.

There is also the practical side that our buildds already set LC_ALL=C.UTF-8;
in main one can already assume that every package in a release has been
built in this environment.


2. RC is what does FTBFS on the buildds

Usually a FTBFS is RC only when it happens on the buildds.

FTBFS with non-C.UTF-8 locales itself is not RC,
just like FTBFS on single-core machines is not RC.

These are of course still bugs, especially if a different UTF-8 locale
results in test failures that indicate runtime issues.


3. Importance of build-time diversity

Less than 3 years ago, having build-arch/build-indep targets in
debian/rules was a use case important enough for some people that a MBF
with hundreds of RC bugs was done and many people (including myself)
spent time fixing this use case by adding build-arch/build-indep targets
to packages.

Calling the clean target manually is something I frequently do.

Doing a build test or autopkgtest with an Estonian or Turkish locale
is hard/impossible when something (no matter whether debian/rules or
debhelper or dpkg-buildpackage) enforces C.UTF-8.


4. C.UTF-8 or *some* UTF-8 locale?

The main problems are with non-UTF-8 locales, so it might be
uncontroversial to declare building with a non-UTF-8 locale
unsupported and make dpkg-buildpackage reject this with a message like:

Building with a non-UTF-8 locale is no longer supported, please do
LC_ALL=C.UTF-8 dpkg-buildpackage

This should be sufficient to address the root cause of all/most of the
current manual and tooling settings of C.UTF-8, and could actually
enable useful testbuilds for finding problems for Turkish users.
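A check of this kind could look roughly like the following. This is a Python sketch under stated assumptions, not how dpkg-buildpackage is implemented: the function names are invented, and judging the codeset from the locale *name* is only a heuristic (a name without an explicit codeset is treated as non-UTF-8 here, although some such locales do default to UTF-8). The precedence LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset, follows POSIX.

```python
def effective_ctype(env):
    """Return the locale name that governs LC_CTYPE for this environment."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # POSIX treats an empty value like an unset variable
            return value
    return "C"


def names_utf8_codeset(locale_name):
    """Heuristic: does language[_territory][.codeset][@modifier] name UTF-8?"""
    base = locale_name.split("@", 1)[0]
    if "." not in base:
        return False
    codeset = base.split(".", 1)[1]
    return codeset.replace("-", "").lower() == "utf8"


def check_build_locale(env):
    """Return None if the build may proceed, else the rejection message."""
    if names_utf8_codeset(effective_ctype(env)):
        return None
    return (
        "Building with a non-UTF-8 locale is no longer supported, please do\n"
        "  LC_ALL=C.UTF-8 dpkg-buildpackage"
    )
```

With this shape, `check_build_locale({"LANG": "tr_TR.UTF-8"})` lets a Turkish test build through, while an environment with `LC_ALL=C` is rejected even if LANG names a UTF-8 locale.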


cu
Adrian