Discussion:
Musings about Usernames in adduser and Debian
Marc Haber
2024-11-21 17:50:01 UTC
[writing this with my adduser hat on. I am also in touch with the
maintainers of src:shadow and base-passwd]

Hi,

recently, I have "taken over" the wiki page about UserAccounts and have
put in some history and general thoughts about what Debian thinks about
user names and name restrictions.

https://wiki.debian.org/UserAccounts

I fear that I have opened an especially nasty can of worms by beginning
to do sanity checks in adduser and being pointed towards user name
encoding in that process. Can you help me to bring some sense into this
mess?

I would like to hear your comments. Feel free to directly apply
corrections to the wiki page. I am especially interested in having clear
terminology regarding Unicode code points, UTF-8, character strings and
byte strings. It is vitally important to be consistent here to avoid
making the mess even worse.

For adduser's next release, I would like to discuss the following
things:

(1)
Should Debian allow UTF-8 user names in the first place or should we
restrict names for regular users to some us-ascii near set as well? (I
think yes, we should)

(2)
If the answer to (1) is "allow UTF-8", should we also do that for system
users? (I think no, we should not)

(2a)
Which UTF-8 subset / code point classes should we allow and which should
we reject? (I don't have an opinion about that)

(3)
I think that 32 characters/bytes (it's the same if we don't allow UTF-8)
is a good limitation for a system user name. But, should we increase
that for regular user names? (I think yes)

(4)
If we decide to relax some of our current requirements, where are the
borders between "normal" user name, one that requires --allow-bad-names
and finally one that requires --allow-all-names? Wouldn't it be
offensive to speakers of some languages that require --allow-bad-names
for their special characters to be allowed on a user name? (no opinion
here that would not break backwards compatibility)

(5)
Is it right to say "the user name in /etc/passwd is UTF-8 encoded" or
should I better say "the user name in /etc/passwd can be UTF-8 encoded"?

(6)
Does it still make sense to give non-UTF-8-locales special handling
(which one?), or can adduser safely assume that any non-ascii locale is
UTF-8? Or must I check for locale and reject UTF-8 user names on
non-UTF-8 locales? (I hope that we can safely assume UTF-8)

(7)
Do the general restrictions for both kinds of user names make sense?
Going forward with this would mean to reject user names that we used to
accept before. (I think we should come close to systemd's ideas)

(8)
I think that our current way to restrict system account names is fine.
Any objections/additions here?

(9)
Should some of this language be in Policy instead of some random wiki
page? Policy is quite short about user names (chapter 9.2) (I think yes)

(10)
What should adduser do regarding subuids? Since I was ignorant about
that concept until a few hours ago, all accounts created by adduser do
have subuids, regardless of being system account or not, while useradd
does not give system accounts subuids.
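
Regarding question (3), the remark that characters and bytes are only the same without UTF-8 can be illustrated with a small Python sketch. The helper names are made up for this example, and "characters" here means Unicode code points as counted by Python's `len`, not user-perceived glyphs.

```python
def length_in_chars(name: str) -> int:
    """Length of a name in Unicode code points."""
    return len(name)


def length_in_bytes(name: str, encoding: str = "utf-8") -> int:
    """Length of the byte string that would end up in /etc/passwd."""
    return len(name.encode(encoding))


# An ASCII name, a Latin name with one accented letter, and 32 Cyrillic letters.
for name in ("emollier", "\u00e9mollier", "\u0436" * 32):
    print(repr(name), length_in_chars(name), length_in_bytes(name))
```

A limit of 32 therefore means different things depending on whether it is applied to the character string or to its UTF-8 encoding.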

Greetings
Marc

P.S.: The teams and individuals working on src:shadow, base-passwd and
adduser would appreciate your help in coding and packaging. You can get
in touch with all involved parties via
pkg-shadow-***@lists.alioth.debian.org
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Richard Lewis
2024-11-21 22:10:01 UTC
Post by Marc Haber
For adduser's next release, I would like to discuss the following
(1)
Should Debian allow UTF-8 user names in the first place or should we
restrict names for regular users to some us-ascii near set as well? (I
think yes, we should)
would allowing utf-8 enable some of the abuse described at
https://lwn.net/Articles/874951/ ?

as usernames appear in logs and other output (and are passed to all
sorts of commands), it seems a bad idea to be too permissive or to
change from historic practice by default, even though from a user pov it
would be nice to have the option
Post by Marc Haber
P.S.: The teams and individuals working on src:shadow, base-passwd and
adduser would appreciate your help in coding and packaging.
Is there a list of "things that need doing"?
Marc Haber
2024-11-22 09:40:01 UTC
Post by Richard Lewis
Post by Marc Haber
For adduser's next release, I would like to discuss the following
(1)
Should Debian allow UTF-8 user names in the first place or should we
restrict names for regular users to some us-ascii near set as well? (I
think yes, we should)
would allowing utf-8 enable some of the abuse described at
https://lwn.net/Articles/874951/ ?
as usernames appear in logs and other output (and are passed to all
sorts of commands), it seems a bad idea to be too permissive or to
change from historic practice by default, even though from a user pov it
would be nice to have the option
I am not sure about that. Would typosquatting on a user name make sense?
It might be possible to make logs ambiguous. Being passed to other
commands SHOULD not be dangerous since we can expect other commands to
gracefully handle a byte stream, can't we?

I might be naive here, but I don't have much experience with non-ascii
names since I have the privilege of being fluent in two languages that
use the latin alphabet.

On the other hand, wouldn't it be a courtesy to allow people having a
name that needs transcription to be written in latin to use their name
in the real alphabet that it is usually written in as a login name as
well? To make things worse, transcriptions are often ambiguous.

I would like to hear the opinion of people who would be affected by this
change.

Local Administrators are able today to use UTF-8 user names in useradd
or configure adduser to allow their locally important subset of UTF-8,
but at the moment with things being more restrictive, our software is
untested in this regard. I think that Debian would get more robust if
we'd allow things here.

Vulnerabilities that could be exploited by having non-ascii user names
are already here and present today, just not uncovered yet.
Post by Richard Lewis
Post by Marc Haber
P.S.: The teams and individuals working on src:shadow, base-passwd and
adduser would appreciate your help in coding and packaging.
Is there a list of "things that need doing"?
The collaboration between src:shadow, base-passwd and adduser is a
relatively fresh thing that came from the fact that src:shadow recently
introduced changes that made adduser's test suite break. So we haven't
found good paths yet. I suggested moving together as a way to
improve communication and also to reduce, at least a bit, the bus
factor of those quite important packages. That was also the reason why
I suggested base-passwd to join and I am happy that Colin agreed.

In adduser, nearly everything that needs doing has issues in the BTS,
with the severity set to the urgency of the matter in my opinion. You'll
see that adduser has quite a lot of bugs that were filed by myself. I
consider it a feature to have a public to-do list. For the other two
packages, I'd let their respective maintainers comment.

Greetings
Marc
Étienne Mollier
2024-11-22 19:50:01 UTC
Hi Marc,
Post by Marc Haber
I might be naive here, but I don't have much experience with non-ascii
names since I have the privilege of being fluent in two languages that
use the latin alphabet.
I am not sure whether I am the intended audience here, because
my name is almost Ascii based. That being said, I happen to
have one weird enough latin based character as the first letter
in my first name, that it gives interesting results when thrown
toward random databases. Thus I do happen to have some thoughts
about this topic.
Post by Marc Haber
On the other hand, wouldn't it be a courtesy to allow people having a
name that needs transcription to be written in latin to use their name
in the real alphabet that it is usually written in as a login name as
well? To make things worse, transcriptions are often ambiguous.
I would like to hear the opinion of people who would be affected by this
change.
I tried to consider what it would take to have an émollier or an
Émollier login, and there is one little blocker : I may have to
login from environments or keyboards lacking the necessary i18n
and l10n capabilities to transcribe the 'e' acute, let alone the
uppercase 'e' acute. For example, I hit this particular issue
when populating the Gecos field from the Debian installer
environment: if I choose a Qwerty US configuration but miss the
step to choose which Qwerty US internationalized variant I want
to use, then I don't get to type uppercase 'e' acute, but there
are many other situations unrelated to d-i or even Debian where
I run into that. For this practical reason, I tend to feel
better about keeping a full Ascii login name. I wouldn't feel
strongly if unicode support for login never happens. I believe
however that the Gecos is the right place to store the properly
typed-in person name, because it is a "presented" name that
hasn't the technical coupling that the login name has, and I
would probably have stronger feelings if it were to not have
unicode support.

You probably want to have some more thoughts, especially from
people with entirely non latin character names. Having a latin
name, I accommodate perhaps too well to a full Ascii login.

Have a nice day, :)
--
.''`. Étienne Mollier <***@debian.org>
: :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from /dev/pts/1, please excuse my verbosity
`-
Gioele Barabucci
2024-11-22 21:10:01 UTC
I tried to consider what it would take to have an émollier or an
Émollier login, and there is one little blocker : I may have to
login from environments or keyboards lacking the necessary i18n
and l10n capabilities to transcribe the 'e' acute, let alone the
uppercase 'e' acute.
Dear Étienne,

your case highlights another problem not mentioned in the original list
posted by Marc: comparison (and normalization).

Some characters can be encoded in more than one way. For instance, "é"
in "émollier" could be stored as "e with acute" U+00E9 (and encoded in
UTF-8 as 0xc3 0xa9) or as "e, combined with an acute accent" U+0065 plus
U+0301 (UTF-8: 0x65 0xcc 0x81). If a keyboard input system provides the
former sequence of bytes, but the username is stored in the login
infrastructure using the latter sequence of bytes, then a naive
comparison will not find the user "émollier" in the system. Unicode
defines in Annex 15 a few normalization forms as a way to work around
this problem. But a correct use of these normalization forms still
requires coordination and standardization among all programs accessing
the data.
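
The two encodings described above can be reproduced with a short Python sketch. It uses only the standard `unicodedata` module; the helper name `same_login` is made up for illustration, and NFC is picked only as an example, not because any standard is known to mandate it for user names.

```python
import unicodedata

composed = "\u00e9mollier"      # U+00E9, e with acute
decomposed = "e\u0301mollier"   # U+0065 followed by U+0301, combining acute

# Both strings render the same way but are different byte strings.
print(composed.encode("utf-8")[:2].hex(" "))    # c3 a9
print(decomposed.encode("utf-8")[:3].hex(" "))  # 65 cc 81
print(composed == decomposed)                   # False: naive comparison fails


def same_login(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two names after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_login(composed, decomposed))         # True
```

The comparison only works if every program involved normalizes to the same form, which is the coordination problem mentioned above.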

Does POSIX (or other de-facto standards) prescribe a normalization form
for Unicode-/UTF-8-encoded usernames?

Regards,
--
Gioele Barabucci
Peter Pentchev
2024-11-22 23:40:01 UTC
Post by Étienne Mollier
I tried to consider what it would take to have an émollier or an
Émollier login, and there is one little blocker : I may have to
login from environments or keyboards lacking the necessary i18n
and l10n capabilities to transcribe the 'e' acute, let alone the
uppercase 'e' acute.
Dear Étienne,
your case highlights another problem not mentioned in the original list
posted by Marc: comparison (and normalization).
Some characters can be encoded in more than one way. For instance, "é" in
"émollier" could be stored as "e with acute" U+00E9 (and encoded in UTF-8 as
0xc3 0xa9) or as "e, combined with an acute accent" U+0065 plus U+0301
(UTF-8: 0x65 0xcc 0x81). If a keyboard input system provides the former
sequence of bytes, but the username is stored in the login infrastructure
using the latter sequence of bytes, then a naive comparison will not find
the user "émollier" in the system. Unicode defines in Annex 15 a few
normalization forms as a way to work around this problem. But a correct use
of these normalization forms still requires coordination and standardization
among all programs accessing the data.
Does POSIX (or other de-facto standards) prescribe a normalization form for
Unicode-/UTF-8-encoded usernames?
POSIX says "if you want your applications to be portable, do not use any
funny characters in usernames":

https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_409

3.409 User Name

A string that is used to identify a user; see also 3.407 User Database.
To be portable across systems conforming to POSIX.1-2024, the value is
composed of characters from the portable filename character set.
The <hyphen-minus> character should not be used as the first character
of a portable user name.

For people unfamiliar with POSIX terms, the portable filename character
set is defined as:

https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap03.html#tag_03_265

The set of characters from which portable filenames are constructed.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -

The last three characters are the <period>, <underscore>, and
<hyphen-minus> characters, respectively.
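
As a minimal sketch, the portability rule quoted above could be checked like this in Python. The function name is hypothetical, and the sketch treats the "should not" about a leading hyphen-minus as a hard rule, which is stricter than the POSIX wording.

```python
import string

# The portable filename character set as quoted from POSIX above.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable_user_name(name: str) -> bool:
    """Hypothetical check: non-empty, portable characters only, no leading '-'."""
    if not name or name.startswith("-"):
        return False
    return all(ch in PORTABLE_CHARS for ch in name)


for candidate in ("emollier", "\u00e9mollier", "-backup", "www-data"):
    print(candidate, is_portable_user_name(candidate))
```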

G'luck,
Peter
--
Peter Pentchev ***@ringlet.net ***@debian.org ***@morpheusly.com
PGP key: https://www.ringlet.net/roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13
Marc Haber
2024-11-27 16:50:01 UTC
Post by Peter Pentchev
POSIX says "if you want your applications to be portable, do not use any
But we are not writing applications, we are a distribution. Anything
that works with the software we distribute is fine.
Post by Peter Pentchev
A string that is used to identify a user; see also 3.407 User Database.
To be portable across systems conforming to POSIX.1-2024, the value is
composed of characters from the portable filename character set.
If a local admin wants their local user database (hence, /etc/passwd or
an LDAP directory) to work with non-Debian OSes, they need to take care
about which accounts they create. I don't think that we should restrict
local admins who don't need that kind of portability.

Greetings
Marc
Marc Haber
2024-11-27 16:40:02 UTC
Post by Gioele Barabucci
your case highlights another problem not mentioned in the original list
posted by Marc: comparison (and normalization).
Some characters can be encoded in more than one way. For instance, "é" in
"émollier" could be stored as "e with acute" U+00E9 (and encoded in UTF-8 as
0xc3 0xa9) or as "e, combined with an acute accent" U+0065 plus U+0301
(UTF-8: 0x65 0xcc 0x81).
That would be two distinct user names. Unless we have a widely available
unicode library that can do this kind of normalization it is unlikely
that our system utilities can take care of that. I'd like to put that
responsibility on to the person who / the system that actually creates
those user names.
Post by Gioele Barabucci
If a keyboard input system provides the former
sequence of bytes, but the username is stored in the login infrastructure
using the latter sequence of bytes, then a naive comparison will not find
the user "émollier" in the system.
Currently adduser just takes the characters that come from the command
line and encodes it into the byte stream that goes to useradd and
library calls.

Greetings
Marc
Marc Haber
2024-11-27 16:40:02 UTC
Post by Étienne Mollier
Post by Marc Haber
I might be naive here, but I don't have much experience with non-ascii
names since I have the privilege of being fluent in two languages that
use the latin alphabet.
I am not sure whether I am the intended audience here, because
my name is almost Ascii based. That being said, I happen to
have one weird enough latin based character as the first letter
in my first name, that it gives interesting results when thrown
toward random databases. Thus I do happen to have some thoughts
about this topic.
All opinions are important.
Post by Étienne Mollier
Post by Marc Haber
On the other hand, wouldn't it be a courtesy to allow people having a
name that needs transcription to be written in latin to use their name
in the real alphabet that it is usually written in as a login name as
well? To make things worse, transcriptions are often ambiguous.
I would like to hear the opinion of people who would be affected by this
change.
I tried to consider what it would take to have an émollier or an
Émollier login, and there is one little blocker : I may have to
login from environments or keyboards lacking the necessary i18n
and l10n capabilities to transcribe the 'e' acute, let alone the
uppercase 'e' acute.
Yes. Configuring all keyboards and input subsystems in the realm of this
instance of the user database in a way that all users are able to log in
is the responsibility of the local admin.
Post by Étienne Mollier
For example, I hit this particular issue
when populating the Gecos field from the Debian installer
environment: if I choose a Qwerty US configuration but miss the
step to choose which Qwerty US internationalized variant I want
to use, then I don't get to type uppercase 'e' acute, but there
are many other situations unrelated to d-i or even Debian where
I run into that.
That issue would only affect users created from the Installer, and even
if you insist on having étienne as UID 1000, you could change to that
after installation. I tend to classify the inability to type the
intended user name on account creation as a user error ;-)

I always create "zgadmin" in the installer, which is my user to ssh into
before sudoing to root if my regular account (which has a higher UID
for historical reasons) is unavailable. I wonder whether we should give
this advice in the documentation we are bound to write once we have
decided to officially allow UTF-8 login names.
Post by Étienne Mollier
For this practical reason, I tend to feel
better about keeping a full Ascii login name. I wouldn't feel
strongly if unicode support for login never happens.
It is already allowed. Only its support status is unclear.
Post by Étienne Mollier
I believe
however that the Gecos is the right place to store the properly
typed-in person name, because it is a "presented" name that
hasn't the technical coupling that the login name has, and I
would probably have stronger feelings if it were to not have
unicode support.
Console tools tend to ignore the gecos/comment name.

Greetings
Marc
Timo Röhling
2024-11-22 14:30:01 UTC
Hi,
Post by Richard Lewis
would allowing utf-8 enable some of the abuse described at
https://lwn.net/Articles/874951/ ?
as usernames appear in logs and other output (and are passed to all
sorts of commands), it seems a bad idea to be too permissive or to
change from historic practice by default, even though from a user pov it
would be nice to have the option
I have no experience with bidirectional attacks, but browsers
mitigate homograph attacks in IDNs by disallowing mixed alphabets
such as cyrillic and latin letters in the same name. That seems to
be a reasonable restriction for user names as well.
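
A rough sketch of such a mixed-alphabet check in Python follows. It is only a heuristic: the standard library does not expose the Unicode Script property, so this example takes the first word of each letter's Unicode character name (for example LATIN or CYRILLIC) as its script. The function names are made up, and the rules browsers apply to IDNs are more elaborate than this.

```python
import unicodedata


def scripts_used(name: str) -> set[str]:
    """Heuristic: first word of each letter's Unicode name, e.g. 'LATIN'."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixes_scripts(name: str) -> bool:
    """True if the letters of the name come from more than one script."""
    return len(scripts_used(name)) > 1


print(mixes_scripts("marc"))        # all Latin letters
print(mixes_scripts("m\u0430rc"))   # second letter is CYRILLIC SMALL LETTER A
```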


Cheers
Timo
--
⢀⣎⠟⠻⢶⣊⠀ ╭────────────────────────────────────────────────────╮
⣟⠁⢠⠒⠀⣿⡁ │ Timo Röhling │
⢿⡄⠘⠷⠚⠋⠀ │ 9B03 EBB9 8300 DF97 C2B1 23BF CC8C 6BDD 1403 F4CA │
⠈⠳⣄⠀⠀⠀⠀ ╰────────────────────────────────────────────────────╯
Marc Haber
2024-11-22 16:40:01 UTC
I have no experience with bidirectional attacks, but browsers mitigate
homograph attacks in IDNs by disallowing mixed alphabets such as cyrillic
and latin letters in the same name. That seems to be a reasonable
restriction for user names as well.
I am not willing to implement that myself in adduser. I will accept code
and test cases written by others, but this is a thing that goes beyond
my resources. Additionally, it won't help since an attacker can directly
write to /etc/passwd.

Homograph attacks would be best mitigated in software reading
/etc/passwd, alerting in their output or logs that the user name they
just printed was composed of strange alphabets.

Greetings
Marc
Alejandro Colomar
2024-12-05 13:50:01 UTC
Hi Marc,
Post by Marc Haber
Homograph attacks would be best mitigated in software reading
/etc/passwd, alerting in their output or logs that the user name they
just printed was composed of strange alphabets.
Software that reads /etc/passwd or /etc/shadow is quite sensitive, and
should therefore be as simple as possible. More code, more bugs.

The best mitigation for those attacks is to ban the names altogether.
IMO, setuid programs should not accept Unicode.

Have a lovely day!
Alex
--
<https://www.alejandro-colomar.es/>
Stephan Seitz
2024-12-05 15:40:02 UTC
Post by Alejandro Colomar
The best mitigation for those attacks is to ban the names altogether.
IMO, setuid programs should not accept Unicode.
Today, not many people want to live in the past and accept plain ASCII
if their name needs a bigger character set.

Stephan
--
| If your life was a horse, you'd have to shoot it. |
Marc Haber
2024-12-05 16:10:01 UTC
Post by Alejandro Colomar
The best mitigation for those attacks is to ban the names altogether.
IMO, setuid programs should not accept Unicode.
Oh, Bugs by Code. Dangerous. We should stop producing code completely.
No code, no bugs.

Neither adduser nor useradd is setuid.
--
----------------------------------------------------------------------------
Marc Haber | " Questions are the | Mailadresse im Header
Rhein-Neckar, DE | Beginning of Wisdom " |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402
Stephan Seitz
2024-12-05 16:20:01 UTC
Post by Marc Haber
Neither adduser nor useradd is setuid.
To be fair, passwd is setuid. And I’m sure you are using it to set the
password. So it has to survive a Unicode user name.

Stephan
--
| If your life was a horse, you'd have to shoot it. |
Iustin Pop
2024-11-21 22:30:01 UTC
Post by Marc Haber
[writing this with my adduser hat on. I am also in touch with the
maintainers of src:shadow and base-passwd]
Hi,
recently, I have "taken over" the wiki page about UserAccounts and have
put in some history and general thoughts about what Debian thinks about
user names and name restrictions.
https://wiki.debian.org/UserAccounts
I fear that I have opened an especially nasty can of worms by beginning
to do sanity checks in adduser and being pointed towards user name
encoding in that process. Can you help me to bring some sense into this
mess?
I would like to hear your comments. Feel free to directly apply
corrections to the wiki page. I am especially interested in having clear
terminology regarding Unicode code points, UTF-8, character strings and
byte strings. It is vitally important to be consistent here to avoid
making the mess even worse.
For adduser's next release, I would like to discuss the following
(1)
Should Debian allow UTF-8 user names in the first place or should we
restrict names for regular users to some us-ascii near set as well? (I
think yes, we should)
You weren't clear to which part you agreed. If by "we should" you meant
the closest option, i.e. restrict, then I agree as well.

As Richard also replied, full UTF-8 is tricky, and I think it's somewhat
misplaced to focus on the username, as opposed to gecos. Aren't most
other OSes using the "full name" as the "display name", and the username
is mostly one part of the user/password combination, but not a display
property most of the time?

So I would suggest that maybe the better option is to standardise the
gecos format/gecos parsing, so migrate UI tools to use that more often.

On the other hand, as long as this is admin-controlled, it doesn't
matter much. I could see that viewpoint, but I wonder how much latent
breakage would be introduced that will take years to fix in all tooling
and all packages.

regards,
iustin
Marc Haber
2024-11-22 09:50:01 UTC
[Reducing the list to debian-devel. I have omitted to set Reply-To and
apologize for that]
Post by Iustin Pop
Post by Marc Haber
Should Debian allow UTF-8 user names in the first place or should we
restrict names for regular users to some us-ascii near set as well? (I
think yes, we should)
You weren't clear to which part you agreed. If by "we should" you meant
the closest option, i.e. restrict, then I agree as well.
I am sorry. My personal opinions were among the last things I added to
the article and I was not clear here. I think we should allow UTF-8 user
names as a courtesy to those people who need non-ascii user names to
write their name, since user names are frequently chosen from the real
name of the person. In addition, this will enhance software quality
since we now get the chance of finding bugs that are already present in
a lot of software.

This comes kind of late in the Trixie cycle, but as it is currently
already possible to create user names with UTF-8 characters, I do not
like the idea of tightening our restrictions in Trixie over what we have
in Bookworm just to maybe revisit our decision in Trixie+1.
Post by Iustin Pop
As Richard also replied, full UTF-8 is tricky,
My current code uses \p{Graph} as a least common denominator. I am not
sure whether this is wise.
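
To get a feeling for what a Graph-style class lets through, here is a Python approximation. It assumes that Perl's \p{Graph} follows the definition in Unicode Technical Standard #18 (everything except whitespace, controls, surrogates and unassigned code points), and it uses Python's `str.isspace` as a stand-in for the White_Space property, which is close but not identical. It is not adduser's code, and the function names are hypothetical.

```python
import unicodedata

# General categories excluded from the class: control, surrogate, unassigned.
EXCLUDED_CATEGORIES = {"Cc", "Cs", "Cn"}


def is_graph(ch: str) -> bool:
    """Approximate membership of one code point in the Graph class."""
    return not ch.isspace() and unicodedata.category(ch) not in EXCLUDED_CATEGORIES


def all_graph(name: str) -> bool:
    """True if the name is non-empty and every code point is in the class."""
    return bool(name) and all(is_graph(ch) for ch in name)


print(all_graph("\u00e9mollier"))    # accented Latin letters pass
print(all_graph("a b"))              # a space is rejected
print(all_graph("abc\u202edef"))     # U+202E RIGHT-TO-LEFT OVERRIDE is a format character
```

Under this approximation, format characters such as the bidirectional override U+202E are accepted, which is relevant to the concerns raised elsewhere in this thread.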
Post by Iustin Pop
and I think it's somewhat
misplaced to focus on the username, as opposed to gecos. Aren't most
other OSes using the "full name" as the "display name", and the username
is mostly one part of the user/password combination, but not a display
property most of the time?
I think that we should allow full UTF-8 in the gecos¹ field, yes. People
should be allowed to have their fully correct name in there. I also
think that users of non-latin languages should have the possibility to
have a login name that resembles their name.

¹ In 2024 no one remembers what gecos means any more. Adduser and
src:shadow are using "comment" for that field nowadays.
Post by Iustin Pop
So I would suggest that maybe the better option is to standardise the
gecos format/gecos parsing, so migrate UI tools to use that more often.
That doesn't solve the issue I am having with adduser right now: That
we're allowing things that we are not sure we should allow.
Post by Iustin Pop
On the other hand, as long as this is admin-controlled, it doesn't
matter much. I could see that viewpoint, but I wonder how much latent
breakage would be introduced that will take years to fix in all tooling
and all packages.
Yes. Fixing breakage makes software better, and by disallowing non-latin
characters in user names we are hiding those issues away.

Greetings
Marc
Bjørn Mork
2024-11-24 11:00:01 UTC
Post by Marc Haber
Post by Iustin Pop
On the other hand, as long as this is admin-controlled, it doesn't
matter much. I could see that viewpoint, but I wonder how much latent
breakage would be introduced that will take years to fix in all tooling
and all packages.
Yes. Fixing breakage makes software better, and by disallowing non-latin
characters in user names we are hiding those issues away.
This is arrogant. Assuming that a username can be displayed, sorted,
compared and typed using strict us-ascii is not a bug today. It's not
"hiding" any issue.

The question is whether it makes sense to introduce a new class of bugs
by changing the rules. And we can pretty much guarantee that some of
those bugs are security critical, since this is all about authentication
and authorization.

Knowingly introducing security bugs does not sound like a good idea.

For what purpose?


Bjørn
Marc Haber
2024-11-24 20:20:02 UTC
Post by Bjørn Mork
Post by Marc Haber
Post by Iustin Pop
On the other hand, as long as this is admin-controlled, it doesn't
matter much. I could see that viewpoint, but I wonder how much latent
breakage would be introduced that will take years to fix in all tooling
and all packages.
Yes. Fixing breakage makes software better, and by disallowing non-latin
characters in user names we are hiding those issues away.
This is arrogant.
That was not my intention. I apologize for that.
Post by Bjørn Mork
Assuming that a username can be displayed, sorted,
compared and typed using strict us-ascii is not a bug today. It's not
"hiding" any issue.
I have to disagree. Our tools allow creating UTF-8 usernames today, and
even if they did not, it would be possible to just edit /etc/passwd.
Post by Bjørn Mork
The question is whether it makes sense to introduce a new class of bugs
by changing the rules. And we can pretty much guarantee that some of
those bugs are security critical, since this is all about authentication
and authorization.
So we're having these bugs right now. If you can use adduser or useradd
to create such accounts, then you have the privilege of putting them
directly into /etc/passwd as well. /etc/passwd is a well-defined and
documented interface.
Post by Bjørn Mork
For what purpose?
Being friendly to people who can't properly write their names in latin.

Greetings
Marc
Simon McVittie
2024-11-24 14:30:01 UTC
Post by Iustin Pop
As Richard also replied, full UTF-8 is tricky, and I think it's somewhat
misplaced to focus on the username, as opposed to gecos. Aren't most
other OSes using the "full name" as the "display name", and the username
is mostly one part of the user/password combination, but not a display
property most of the time?
So I would suggest that maybe the better option is to standardise the
gecos format/gecos parsing, so migrate UI tools to use that more often.
As a data point, in our default GNOME desktop, System Settings
(gnome-control-center) prompts for a "Full Name" first (behind the
scenes that's the full name part of the pw_gecos field), and a "Username"
second (this is the pw_name); and the default display mode for the
gdm3 login prompt is to show a list of full names from pw_gecos.

My understanding is that the full name already allows arbitrary UTF-8,
except for the characters that can't be represented in passwd(5) syntax
(colon, comma, newline) and the ampersand.

Outside the Linux/GNU/freedesktop worlds, this is fairly similar to how
macOS presents the distinction between the display name and the Unix
username (pw_name). macOS is interesting here because it's an operating
system with a lot of Unix ancestry, but has also had a lot of effort put
into making it friendly for non-technical users.

In the macOS world, it seems to be conventional and encouraged to set the
username to a lower-case ASCII string with no punctuation, similar to the
conventions in POSIX and <https://systemd.io/USER_NAMES/>.
Unfortunately I haven't been able to find a reference for what characters
macOS allows in pw_name. Perhaps a DD who has a macOS system (or a family
member with a macOS system) could help here?

I think one good idea that we should certainly adopt from
<https://systemd.io/USER_NAMES/> is its separation between "strict mode"
(the naming convention that it encourages for all uses, and enforces
when a user is created via systemd tools) and "relaxed mode" (the much
less strict naming convention that systemd requires for names created by
non-systemd tools). Because of the differences between those two modes,
systemd is quite conservative in what its own tools will emit but a
lot more liberal in what it will accept, and that seems like a good
principle here, even if the specific rules that Debian chooses end up
differing from those that systemd has chosen.

smcv
Chris Hofstaedtler
2024-11-24 14:30:01 UTC
Permalink
Post by Simon McVittie
As a data point, in our default GNOME desktop, System Settings
(gnome-control-center) prompts for a "Full Name" first (behind the
scenes that's the full name part of the pw_gecos field), and a "Username"
second (this is the pw_name); and the default display mode for the
gdm3 login prompt is to show a list of full names from pw_gecos.
In the macOS world, it seems to be conventional and encouraged to set the
username to a lower-case ASCII string with no punctuation, similar to the
conventions in POSIX and <https://systemd.io/USER_NAMES/>.
Unfortunately I haven't been able to find a reference for what characters
macOS allows in pw_name. Perhaps a DD who has a macOS system (or a family
member with a macOS system) could help here?
macOS generates a suggested pw_name from the fullname, approximating
it by a reduction to an (apparently) ASCII character set.

macOS is at least nice enough to try very hard to hide pw_name in
all graphical interfaces and show the full name instead. (IIRC you
can type the pw_name in if necessary.)

Chris
Marc Haber
2024-11-27 16:30:02 UTC
Permalink
Post by Simon McVittie
I think one good idea that we should certainly adopt from
<https://systemd.io/USER_NAMES/> is its separation between "strict mode"
(the naming convention that it encourages for all uses, and enforces
when a user is created via systemd tools) and "relaxed mode" (the much
less strict naming convention that systemd requires for names created by
non-systemd tools). Because of the differences between those two modes,
systemd is quite conservative in what its own tools will emit but a
lot more liberal in what it will accept, and that seems like a good
principle here, even if the specific rules that Debian chooses end up
differing from those that systemd has chosen.
Yes. Especially we need to note that systemd strict mode is even
stricter than what we currently allow for system accounts. I also don't
like that this is not configurable, especially regarding systemd-homed
which affects the account names of regular users.

Greetings
Marc
nick black
2024-11-23 07:50:02 UTC
Permalink
Post by Marc Haber
(1)
Should Debian allow UTF-8 user names in the first place or should we
restrict names for regular users to some us-ascii near set as well? (I
think yes, we should)
I feel strongly yes, despite POSIX admonitions (quoted elsewhere
in this thread) and sure breakage in any number of places. I think
a test plan would be very desirable (off the top of my head,
we'd want to check login, the DMs, PAM, OpenSSH, passwd, w,
framebuffer console input, etc.). It would probably also be a good
idea to loop in other distributions.

I recommend Chapter 7 of my free book, "Hacking the Planet with
Notcurses: A Guide to TUIs and Character Semigraphics" for the
full story (as I understand it) regarding Unicode presentation:
https://nick-black.com/htp-notcurses.pdf (starts on page 41).

Some serious concerns:

* any upstream tool could say "bad idea" and refuse patches,
requiring their long term management,
* the Linux framebuffer console is pretty limited in what
glyphs it has available, and the number of glyphs it can
support,
* you want installer support if you intend to do this right,
* ubiquitous input for UTF-8 is a pretty complicated story, and
* broken localization (or failure to call setlocale()) could be
a bigger problem, especially for root/system accounts.

Other concerns:

You'll likely now be linking libunistring into some
binaries where it wasn't previously used.

Regarding the subset of Unicode characters you'd want to allow,
this would be best decided using the General Category trait.
Each codepoint is assigned one of a finite set of General
Categories. We would probably want to allow Letters, Marks, and
Numbers, and perhaps a whitelist from Punctuation and Symbols
(Punctuation, connector and Punctuation, dash are probably all
we'd want) extended from currently supported ispunct(3)
characters. This data is available from libunistring (and
probably other places). This eliminates a great swath of known
security issues.

Names containing invalid UTF-8 sequences ought be rejected.

Characters 0-127 would presumably be allowed iff they are now;
UTF-8 preserves US-ASCII.
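
The General Category filter and the rejection of invalid UTF-8 described
above can be sketched in a few lines of Python. This is only an
illustration, not adduser code: the allowed set (Letters, Marks, Numbers,
connector and dash punctuation) is taken from the suggestion above, the
function names are made up for this example, and ASCII characters are
deliberately left to whatever rules apply today.

```python
import unicodedata
from typing import List, Optional

# Assumption: this allowlist mirrors the proposal in the message above.
ALLOWED_MAJOR = {"L", "M", "N"}   # Letters, Marks, Numbers
ALLOWED_EXACT = {"Pc", "Pd"}      # Punctuation, connector / Punctuation, dash


def decode_strict(raw: bytes) -> Optional[str]:
    """Return the decoded name, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def disallowed_codepoints(name: str) -> List[str]:
    """List non-ASCII code points whose General Category is not allowed."""
    bad = []
    for ch in name:
        if ord(ch) < 128:
            continue  # ASCII is judged by the existing rules, not here
        cat = unicodedata.category(ch)
        if cat[0] not in ALLOWED_MAJOR and cat not in ALLOWED_EXACT:
            bad.append(f"U+{ord(ch):04X} ({cat})")
    return bad
```

A name would pass this sketch if decode_strict() returns a string and
disallowed_codepoints() returns an empty list. Format characters such as
ZERO WIDTH SPACE (category Cf) are rejected, combining marks are accepted.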

We ought support combining characters up through the Extended
Grapheme Cluster (a single user-perceived character, roughly a
glyph, made up of one or more encoded characters). Generally a
single backspace ought map to an entire EGC.

Regarding canonicalization/normalization, this is a complex
question without a necessarily correct technical answer. I think
you'd want to follow the Principle of Least Astonishment; as to what
would astonish the least, I'd like to hear wider input. But
Unicode definitely defines multiple normal forms and equivalency
classes.

You now have glyphs which occupy more than one column. Are your
columnar/tabular programs prepared for that? ﷜𒁭𒐫
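
To make the column problem concrete, here is a rough Python sketch of
width-aware padding. It is an approximation built on the East Asian Width
property and on General Categories; real terminals typically rely on
wcwidth(3) and on the font, so the numbers are a best guess. Both function
names are invented for this example.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate how many terminal columns text occupies (approximation)."""
    width = 0
    for ch in text:
        # Combining marks and format characters take no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide and fullwidth characters usually take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_to(text: str, columns: int) -> str:
    """Right-pad text with spaces based on estimated width, not len()."""
    return text + " " * max(0, columns - display_width(text))
```

A table that pads with len() instead of an estimate like this will
misalign as soon as a name contains wide or combining characters.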
Post by Marc Haber
(2)
If the answer to (1) is "allow UTF-8", should we also do that for system
users? (I think no, we should not)
I think you should, simply because otherwise you have two paths
in more places.
Post by Marc Haber
(2a)
Which UTF-8 subset / code point classes should we allow and which should
we reject? (I don't have an opinion about that)
Answered above.
Post by Marc Haber
(3)
I think that 32 characters/bytes (it's the same if we don't allow UTF-8)
is a good limitation for a system user name. But, should we increase
that for regular user names? (I think yes)
I hesitate to comment here because who really cares, but does 32
save us something over 128? 128 seems the default "enough for
everybody" these days, looking at IPv6 and ZFS.
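
Whatever the number ends up being, a limit on a UTF-8 name has to say
whether it counts code points or bytes, because the two diverge as soon as
the name leaves ASCII. A tiny Python illustration (the helper name is
hypothetical):

```python
def name_lengths(name: str) -> tuple:
    """Return (code points, UTF-8 bytes) for a user name."""
    return len(name), len(name.encode("utf-8"))


for candidate in ("marc", "bj\u00f8rn", "\u043c\u0438\u0445\u0430\u0438\u043b"):
    chars, octets = name_lengths(candidate)
    print(f"{candidate}: {chars} code points, {octets} bytes")
```

A 32-byte limit therefore allows 32 ASCII characters but only 16
Cyrillic letters.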

My printer is administered by i̞̒n̎͛e̵̎l̎͝u̷̟c̎̉t̵́å̵b̷͋l̷͐e̎̋m̞̆o̷̚d̎̐ä̞́l̶͝i̷̋t̷͗ẏ̷ȏ̵f̞̃t̶͘h̷͗e̎̿v̶͘i̷̛s̞̈́ì̵b̷̃l̶̎e̷͊.
Post by Marc Haber
(5)
Is it right to say "the user name in /etc/passwd is UTF-8 encoded" or
should I better say "the user name in /etc/passwd can be UTF-8 encoded"?
"It is UTF-8 encoded."
Post by Marc Haber
(6)
Does it still make sense to give non-UTF-8-locales special handling
(which one?), or can adduser safely assume that any non-ascii locale is
UTF-8? Or must I check for locale and reject UTF-8 user names on
non-UTF-8 locales? (I hope that we can safely assume UTF-8)
It cannot. "C" is not UTF-8. Assumption of UTF-8 requires a
properly set LANG and programs calling setlocale(). This, as
alluded to above, has the potential for a big mess.
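
For illustration, a tool could check the codeset of the active locale
before accepting non-ASCII input, roughly as in the following Python
sketch. The function names are invented here. Note two assumptions:
locale.nl_langinfo() is only available on Unix-like systems, and CPython
may itself coerce the "C" locale to C.UTF-8 (PEP 538), so a C program run
in the same environment can see a different answer.

```python
import codecs
import locale


def is_utf8_codeset(codeset: str) -> bool:
    """True if the given codeset name is an alias for UTF-8."""
    try:
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        return False


def current_locale_is_utf8() -> bool:
    """Adopt the environment's locale and report whether it uses UTF-8."""
    # The equivalent of setlocale(LC_CTYPE, "") in C; may raise
    # locale.Error if the environment names an unavailable locale.
    locale.setlocale(locale.LC_CTYPE, "")
    return is_utf8_codeset(locale.nl_langinfo(locale.CODESET))
```

In the plain "C" locale glibc reports the codeset as ANSI_X3.4-1968,
which this check treats as not UTF-8.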
--
nick black -=- https://nick-black.com
to make an apple pie from scratch,
you need first invent a universe.
Johannes Schauer Marin Rodrigues
2024-11-23 08:40:01 UTC
Permalink
Quoting nick black (2024-11-23 08:48:10)
Post by nick black
You now have glyphs which occupy more than one column. Are your
columnar/tabular programs prepared for that? ﷜𒁭𒐫i
xfce-terminal renders this like this: https://mister-muffin.de/p/4o2v.png

No idea if this is correct and I'll leave the details to those who know more
about this topic than I. And maybe my email client completely messes this up in
this response of mine.

But my 2 cents on the topic are: Let's please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...

If having this work in Debian can put a bit of pressure on those software
projects that do not support this, then please let that happen so that missing
unicode support becomes more annoying for those pieces of software that are
missing it. For example, if my email client messed this up, then let's fix it.
We cannot find these kinds of bugs if we accept translating everybody's given
name to the American alphabet.

Thanks!

cheers, josch
nick black
2024-11-23 11:20:01 UTC
Permalink
Post by Johannes Schauer Marin Rodrigues
Quoting nick black (2024-11-23 08:48:10)
Post by nick black
You now have glyphs which occupy more than one column. Are your
columnar/tabular programs prepared for that? ﷜𒁭𒐫i
xfce-terminal renders this like this: https://mister-muffin.de/p/4o2v.png
more correctly, it renders it like that when using your font,
with its current font rendering engine. very little can be
assumed here. thankfully, presentation shouldn't be that big of
a deal outside of tabular UIs.
Gioele Barabucci
2024-11-23 12:00:01 UTC
Permalink
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Let's please allow more than ascii in
usernames.
Yes please, but opt-in and behind a big red warning that says that it is
not interoperable (outside POSIX), potentially insecure (homographs) and
at high-risk of breaking existing applications (lack of standardized
normalization form).

Regards,
--
Gioele Barabucci
nick black
2024-11-24 10:10:01 UTC
Permalink
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Let's please allow more than ascii in
usernames.
potentially insecure (homographs) and at
high-risk of breaking existing applications (lack of standardized
normalization form).
i'm not sure why this is being repeated.

https://unicode.org/reports/tr15/
Gioele Barabucci
2024-11-24 11:30:01 UTC
Permalink
Post by nick black
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Let's please allow more than ascii in
usernames.
potentially insecure (homographs) and at
high-risk of breaking existing applications (lack of standardized
normalization form).
i'm not sure why this is being repeated.
https://unicode.org/reports/tr15/
Dear Nick,

You may have misunderstood that phrase. I was not referring to the fact
that there are no standardized normalization forms for Unicode (I
explicitly mention Annex 15 in [1]), but to the fact that there is no
standard that specifies which of the possible normalization forms should
be used for account names (and other fields in passwd).

POSIX explicitly limits itself to a subset of ASCII, so it is not going
to mandate any normalization form. Are there other standards (or
initiatives) in this area that you know of?

Regards,

[1] https://lists.debian.org/debian-devel/2024/11/msg00305.html
--
Gioele Barabucci
Michal Politowski
2024-11-28 11:40:01 UTC
Permalink
Post by Gioele Barabucci
Post by nick black
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Let's please allow more than ascii in
usernames.
potentially insecure (homographs) and at
high-risk of breaking existing applications (lack of standardized
normalization form).
i'm not sure why this is being repeated.
https://unicode.org/reports/tr15/
Dear Nick,
You may have misunderstood that phrase. I was not referring to the fact that
there are no standardized normalization forms for Unicode (I explicitly
mention Annex 15 in [1]), but to the fact that there is no standard that
specifies which of the possible normalization forms should be used for
account names (and other fields in passwd).
POSIX explicitly limits itself to a subset of ASCII, so it is not going to
mandate any normalization form. Are there other standards (or initiatives)
in this area that you know of?
What about RFC 8265?
"Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords"
https://datatracker.ietf.org/doc/html/rfc8265
Post by Gioele Barabucci
Regards,
[1] https://lists.debian.org/debian-devel/2024/11/msg00305.html
--
Michał Politowski
Gioele Barabucci
2024-12-01 22:30:01 UTC
Permalink
Post by Michal Politowski
Post by Gioele Barabucci
POSIX explicitly limits itself to a subset of ASCII, so it is not going to
mandate any normalization form. Are there other standards (or initiatives)
in this area that you know of?
What about RFC 8265?
"Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords"
https://datatracker.ietf.org/doc/html/rfc8265
Thank you Michal for the pointer.

RFC 8265 (and the associated RFC 8264 "PRECIS Framework: Preparation,
Enforcement, and Comparison of Internationalized Strings in Application
Protocols") looks like exactly what all login-related programs should
implement in order to avoid the kind of errors described in
<https://lists.debian.org/debian-devel/2024/11/msg00491.html>.

But a cursory search shows that none of the current upstreams support
(or mention) PRECIS. (It also shows that src:precis is a Java library
squatting a bit on that package name... :))

Regards,
--
Gioele Barabucci
Michal Politowski
2024-12-02 10:40:01 UTC
Permalink
On Sun, 1 Dec 2024 23:27:09 +0100, Gioele Barabucci wrote:
[...]
But a cursory search shows that none of the current upstreams support (or
mention) PRECIS. (It also shows that src:precis is a Java library squatting
a bit on that package name... :))
But at least it is an implementation of this PRECIS :)
There is also python3-precis-i18n in the archive.
--
Michał Politowski
nick black
2024-12-02 00:10:01 UTC
Permalink
Post by Gioele Barabucci
You may have misunderstood that phrase. I was not referring to the fact that
there are no standardized normalization forms for Unicode (I explicitly
mention Annex 15 in [1]), but to the fact that there is no standard that
specifies which of the possible normalization forms should be used for
account names (and other fields in passwd).
POSIX explicitly limits itself to a subset of ASCII, so it is not going to
mandate any normalization form. Are there other standards (or initiatives)
in this area that you know of?
I'm glad we're both on the same page about Annex 15, and indeed, POSIX does
seem to explicitly exclude any work in this area. Assuming we're
willing to go beyond POSIX (and again, this seems something
where we'd want to loop in other distributions, and probably
kernel developers), I'm honestly not sure which of the Annex 15
canonicalizations we'd want to use -- I'd like to hear from
experts (or at least people with extensive experience outside of
US-ASCII) as to which method is best. I have no dog in that
hunt, save that everyone agrees on a method.

It's for this reason that I think any work in this area needs be
encapsulated in a common library.
G. Branden Robinson
2024-12-02 03:20:01 UTC
Permalink
Hi nick (and Marc),
Post by Gioele Barabucci
You may have misunderstood that phrase. I was not referring to the
fact that there are no standardized normalization forms for Unicode
(I explicitly mention Annex 15 in [1]), but to the fact that there
is no standard that specifies which of the possible normalization
forms should be used for account names (and other fields in passwd).
POSIX explicitly limits itself to a subset of ASCII, so it is not
going to mandate any normalization form. Are there other standards
(or initiatives) in this area that you know of?
I'm glad we're both on the same page about Annex 15, and indeed, POSIX does seem
to explicitly exclude any work in this area. Assuming we're willing to
go beyond POSIX (and again, this seems something where we'd want to
loop in other distributions, and probably kernel developers), I'm
honestly not sure which of the Annex 15 canonicalizations we'd want to
use -- I'd like to hear from experts (or at least people with
extensive experience outside of US-ASCII) as to which method is best.
I have no dog in that hunt, save that everyone agrees on a method.
It's for this reason that I think any work in this area needs be
encapsulated in a common library.
It sounds like you want something isomorphic, if not identical, to
Punycode.

https://en.wikipedia.org/wiki/Punycode

...for which libraries exist, as I understand it.

These things are ugly, which is why I suppose they haven't caught on
despite being around for decades, but I would guess that this problem
space is such that there are no non-ugly solutions apart from "just
stick to ASCII", which some people find ugly in a different way.

Apologies if I missed someone bringing up and rejecting Punycode in the
previous ~41 messages in this thread. I rescanned, using my fallible
human eyeballs. It would be helpful to me if lists.debian.org supported
a search feature.

Regards,
Branden
nick black
2024-12-02 06:20:01 UTC
Permalink
Post by G. Branden Robinson
It sounds like you want something isomorphic, if not identical, to
Punycode.
https://en.wikipedia.org/wiki/Punycode
it's my understanding that Punycode's objective is to be "clean"
with regards to things that match against the hostname character
set, hence its pickup for IDN (where it's expected that DNS
will be traversing all kinds of network middleware). a similar
(obsolete) proposal was 7 bit-clean UTF-7.

i don't see this as a necessity for this effort, as we have
access to the entirety of the stack that'll be touching this
material. it's just a matter of having that stack match, store,
and display things properly.
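
For readers who have not seen the difference in practice, the following
Python sketch contrasts the two approaches using the codecs shipped with
the standard library. The helper names are made up for this example;
nothing here is a proposal for adduser.

```python
def to_punycode(name: str) -> str:
    """Encode a name into the ASCII-only Punycode form (RFC 3492)."""
    return name.encode("punycode").decode("ascii")


def from_punycode(encoded: str) -> str:
    """Decode a Punycode string back into the original name."""
    return encoded.encode("ascii").decode("punycode")


name = "b\u00fccher"
print(to_punycode(name))       # ASCII-only, but no longer readable as typed
print(name.encode("utf-8"))    # the bytes that would be stored as UTF-8
```

Punycode keeps every byte inside the ASCII range at the cost of a
transformed, hard-to-read string; UTF-8 stores the name as typed and
relies on every consumer handling high-bit bytes.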
nick black
2024-12-02 06:40:01 UTC
Permalink
Post by nick black
it's my understanding that Punycode's objective is to be "clean"
with regards to things that match against the hostname character
set, hence its pickup for IDN (where it's expected that DNS
will be traversing all kinds of network middleware). a similar
(obsolete) proposal was 7 bit-clean UTF-7.
it occurs to me that the properties of UTF-8 might not be in the
forefront of everyone's minds. there are several good references
to its properties and advantages [0] [1] [2]; i'll quote myself
[3]:

Unicode Technical Report #17 defines seven official Unicode
character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE,
UTF-32, UTF-32BE, and UTF-32LE. What a wealth of encodings! How
is one to choose? The -16BE and -16LE forms are simply UTF-16
with a known byte order; a UTF-16 stream can (optionally!) be
prefixed with a Byte-Order Mark, at which point the stream
reduces to -16LE or -16BE (in the absence of a BOM, the best
advice is to follow your heart). UTF-32 breaks down the same
way. This question of endianness arises from the fact that
UTF-16 and UTF-32 are coded in terms of 16- and 32-bit units.
UTF-8, being coded in terms of individual bytes, has no need to
define byte order.

“Well, that BOM sounds kinda annoying,” I hear you asking. “What
other advantages are offered by UTF-8?” Remember how ANSI
X3.4-1986 maps precisely to the first 128 characters of UCS?
UTF-8 (and only UTF-8, of the official encodings) encodes these
128 characters the same as US-ASCII! Boom! Every ASCII document
you have—including most source code, configuration files, system
files, etc.—is a perfectly valid UTF-8 document. Furthermore,
UTF-8 never encodes non-ASCII characters to the ASCII bytes. So
an arbitrary UTF-8 document may have plenty of high-bit bytes
that your ASCII-aware, POSIX-locale program doesn’t understand,
but it never sees a valid ASCII character where one wasn’t
intended. UTF-8 encodes ASCII’s 0–0x7f to 0–0x7f, and otherwise
never produces a byte in that range. This includes the
all-important null character 0—Boom! Every nul-terminated C
string is a valid UTF-8 string. Every UTF-8 string can be passed
through standard C APIs cleanly, and they’ll more or less work.
It’s furthermore self-synchronizing. If you pick up a UTF-8
stream in the middle, you know after reading a single byte
whether you’re in the middle of a multibyte character.

“Sweet! What’s the catch? Does it waste space?” RFC 3629
limits UTF-8’s range to the 17 ∗ 2^16-ary code space of UCS, in
which case the maximum length of a single UTF-8-encoded UCS code
point is four bytes [4]. It’s thus always at least as efficient
as UTF-32. When the ASCII characters are used, UTF-8 is more
efficient than either UTF-16 or UTF-32. Only for streams utterly
dominated by BMP codepoints requiring three bytes from
UTF-8 can UTF-16 encode more efficiently.

“Sweet! What’s the catch? Is it super slow?” UTF-32, it is true,
allows you to index into a string by character in O(1) (UTF-16
does not, unless you’re only dealing with BMP strings). UTF-32
also allows you to compute the bytes necessary for encoding in
O(1), given the number of Unicode codepoints, but that’s only
because it’s wasteful; if you’re willing to be similarly
wasteful, you can do the same calculation with UTF-8 (and then
trim any wastage at the end, if you wish). Any advantage UTF-32
might hold in lexing simplicity is likely a wash when UTF-8’s
usual space efficiency is taken into account, owing to more
effective use of cache and memory bandwidth. Nope, it’s not
slow. *Always interoperate in UTF-8 by default.*

UTF-16 is some truly stupid shit, fit only for jabronies. It
only ever passed muster because people thought UCS was going to
be a sixteen-bit character set. The moment a second Plane was
added, UTF-16 ought have been shown the door. There’s an
argument to be made for ripping it from the pages of books in
your local library. If you must work on a UTF-16 system, use
UTF-16 at the boundary, and then keep it around as UTF-32 or
UTF-8. Always interoperate—including writing files—in UTF-8 by
default.

There are a dozen-odd similarly-named encodings which are useful
for nothing but trivia. UCS-2 was UTF-16, but for only the BMP.
UCS-4 is just UTF-32. UTF-7 is a seven-bit-clean UTF-8 [5]. UTF-1
is UTF-8’s older, misshapen sister, locked away from sight in
the attic. UTF-5 and UTF-6 were proposed encodings for IDN, but
Punycode was selected instead. WTF-8 extends UTF-8 to handle
invalid UTF-16 input. BOCU-1 and SCSU are compressing encodings
that don’t compress as well as gzipped UTF-8. UTF-9 and UTF-18
were jokes. Is UTF-EBCDIC a thing? Of course UTF-EBCDIC is a
thing.

The one place where you won’t interoperate with UTF-8 is for
domain name lookup, when converting IDNA into the LDH subset of
ASCII. If you’re interested, consult RFC 3492, and Godspeed.
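
The byte-level properties claimed in the excerpt above (ASCII passes
through unchanged, non-ASCII characters never produce bytes below 0x80,
the encoding is self-synchronizing, and no code point needs more than
four bytes) can be checked with a short Python sketch. The helper names
are hypothetical.

```python
def is_continuation(byte: int) -> bool:
    """Continuation bytes of a multibyte sequence look like 10xxxxxx."""
    return byte & 0xC0 == 0x80


def resync(buf: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character it falls in."""
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos


sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")   # 1 + 2 + 3 + 4 bytes
print([hex(b) for b in sample])
print(resync(sample, 5))   # index of the lead byte of the 3-byte character
```

Because a single byte reveals whether it is a lead byte or a continuation
byte, a reader dropped into the middle of a stream never has to scan back
more than three bytes.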

--rigorously, nick

[0] https://research.swtch.com/utf8
[1] https://utf8everywhere.org/
[2] https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
[3] https://nick-black.com/htp-notcurses.pdf p52-53
[4] You might hear six bytes, and indeed ISO/IEC 10646 specifies
six bytes to handle up through U+7FFFFFFF but only defines UCS
to cover 17 planes. Verify your wctomb(3) rejects inputs in
excess of 0x10ffff before exploiting RFC 3629’s tighter bound.
[5] The primary seven-bit-clean media of the modern era is
probably email sent without a MIME transfer encoding.
Marc Haber
2024-12-02 08:30:01 UTC
Permalink
Post by nick black
WTF-8 extends UTF-8 to handle
invalid UTF-16 input.
WTF-8 is a seriously defined encoding? I have only experienced that name
as a mocking name for a UTF-8 string that erroneously went through UTF-8
encoding a second time (double-UTF-8).

Greetings
Marc
Marc Haber
2024-12-02 08:00:01 UTC
Permalink
Post by G. Branden Robinson
These things are ugly, which is why I suppose they haven't caught on
despite being around for decades, but I would guess that this problem
space is such that there are no non-ugly solutions apart from "just
stick to ASCII", which some people find ugly in a different way.
The issue is that we didn't stick to ASCII. You CAN use UTF-8 in user
names and it works.
Post by G. Branden Robinson
Apologies if I missed someone bringing up and rejecting Punycode in the
previous ~41 messages in this thread.
No one did. It doesn't make sense anyway (and I would not implement this
in adduser), because we HAVE UTF-8 and it works. So the alternatives
are really

(1) Stick with the current way, having UTF-8 work but keeping it
undocumented, hurling any breakage on the user
(2) Document UTF-8 as working and consider breakage a bug
(3) Forbid UTF-8

Greetings
Marc
Marc Haber
2024-11-27 16:00:01 UTC
Permalink
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Let's please allow more than ascii in
usernames.
Yes please, but opt-in and behind a big red warning that says that it is not
interoperable (outside POSIX),
adduser requires an option to allow such user names. I think that some
people might find it offensive to need an option in order to be allowed
their native names. You're arguing to not relax the requirement for plain
adduser <username>, right?
potentially insecure (homographs) and at
high-risk of breaking existing applications (lack of standardized
normalization form).
Can you outline an attack/failure scenario?

Greetings
Marc
Andy Smith
2024-11-27 20:10:02 UTC
Permalink
Hi,
Post by Marc Haber
Can you outline an attack/failure scenario?
On the failure side, I did a few tests and noticed that on Debian 12 if
I create a user with for example é in their username then I can log in
by SSH as long as that é is encoded the same way: as utf-8 0xC3 0xA9.
But if that é is made of the combining characters 0x65 0xCC 0x81 (as
that one just was) then that's not the same user even if it looks the
same.
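
The two spellings of é can be reproduced in Python, together with the
effect of normalizing to one of the Annex 15 forms. This is a sketch of
the failure mode only; which normalization form (if any) the login stack
should apply is exactly the open question in this thread, and NFC is used
here purely as an example.

```python
import unicodedata

precomposed = "h\u00e9llo"    # é as a single code point, U+00E9
decomposed = "he\u0301llo"    # e followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed.encode("utf-8"))   # contains 0xC3 0xA9
print(decomposed.encode("utf-8"))    # contains 0x65 0xCC 0x81
print(precomposed == decomposed)     # byte-wise different user names


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (example choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Without a normalization step somewhere, getpwnam(3)-style lookups compare
bytes, so the two names refer to different accounts even though they
render identically.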

Upon login, the logs from sshd contain the escaped bytes but the logs from PAM
and systemd-logind are in utf-8:

2024-11-23T00:35:37.743827+00:00 arran sshd[1903006]: Accepted password for h\303\251llo from 200:d0e9:8d97:72fe:69af:eb63:7e9e:1f07 port 37396 ssh2
2024-11-23T00:35:37.744825+00:00 arran sshd[1903006]: pam_unix(sshd:session): session opened for user héllo(uid=1001) by (uid=0)

So, anything which parses usernames out of logs will need to be aware of
that.
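
As an illustration of what such a log parser would have to do, here is a
small Python sketch that turns the backslash-octal form seen in the sshd
line back into text. It assumes the escaping is exactly three octal digits
per byte, as in the excerpt above; how sshd escapes a literal backslash is
not covered, and the function name is invented for this example.

```python
import re

# Three octal digits, limited to values that fit in one byte (000-377).
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def unescape_octal(field: str) -> str:
    """Decode \\ooo escapes into bytes, then interpret the result as UTF-8."""
    raw = _OCTAL_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 8)]),
        field.encode("utf-8"),
    )
    # Invalid sequences become U+FFFD instead of raising an exception.
    return raw.decode("utf-8", errors="replace")


print(unescape_octal(r"h\303\251llo"))
```

After this step the name from the sshd line can be compared with the
unescaped name that pam_unix logs.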

Thanks,
Andy
Bjørn Mork
2024-11-24 10:50:01 UTC
Permalink
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Let's please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...
You are confusing usernames and names. Different concepts with
different rules. Let's just hope you never get two students with the
same name.


Bjørn
Iustin Pop
2024-11-24 12:30:01 UTC
Permalink
Post by Bjørn Mork
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Let's please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...
You are confusing usernames and names. Different concepts with
different rules. Let's just hope you never get two students with the
same name.
I wanted to reply to Johannes, but I didn't know exactly how to phrase it -
you did it perfectly.

I still don't understand the need for username to be very
representative of one's name. OTOH, my name can be fully written using
ASCII, so maybe I'm missing something. But I've also had to use accounts like
abc745, which didn't bother me much over the duration of a semester or
year.

regards,
iustin
Giuseppe Sacco
2024-11-24 14:40:02 UTC
Permalink
Hi all,
Post by Iustin Pop
[...]
I still don't understand the need for username to be very
representative of one's name. OTOH, my name can be fully written using
ASCII, so maybe I'm missing something. But I've also had to use accounts like
abc745, which didn't bother me much over the duration of a semester or
year.
It is true that user account name and user (display) name are
different, of course. But still, when you log in, you use the user
account name to access the system; this is the text shown in file
ownership listings and almost everywhere in the system.
I think that user (display) names, which may be put in the gecos field, are
not widely used. Moreover, the adduser man page on Debian stable states
that gecos fields will be removed after bookworm.

So, having a good account user name is an important thing. And we have
to choose if it should be "good" for the computer (like in: unique,
lowercase, short, US-ASCII, etc.) or if it should be "good" for the
real user. In the latter case, I would accept a broader class of
strings for the very simple reason that it should be left to user
preference.

I checked what other systems do:

Windows[0] accepts any characters, except " / \ [ ] : ; | = , + * ? < >,
and allows for 64 characters (or bytes, I am unsure on this).

SunOS has these restrictions[1] "a string of no more than thirty-two
bytes consisting of characters from the set of alphabetic characters,
numeric characters, period (.), underscore (_), and hyphen (-). The
first character should be alphabetic and the field should contain at
least one lowercase alphabetic character"
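
As an example of how compactly such a rule can be expressed, here is a
Python sketch of the SunOS wording quoted above. Two assumptions are made
that the quoted text does not state: "alphabetic" and "numeric" are read
as ASCII only, and the "should" clauses are treated as hard requirements.
The function name is made up.

```python
import re

# Assumption: ASCII letters and digits only, plus period, underscore, hyphen.
_SUNOS_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def sunos_style_ok(name: str) -> bool:
    """Check a name against the SunOS passwd(5) wording quoted above."""
    if len(name.encode("utf-8")) > 32:       # limit is stated in bytes
        return False
    if not _SUNOS_NAME.fullmatch(name):      # charset and first character
        return False
    return any(c.islower() for c in name)    # at least one lowercase letter
```

The other systems in this survey could be captured the same way, which
makes it easy to see how far apart their accepted sets are.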

In LDAP[2] the uid field is a "Directory String"[3], so any non zero
length UTF8 text. There is a note: Servers and clients MUST be prepared
to receive arbitrary UCS code points, including code points outside the
range of printable ASCII and code points not presently assigned to any
character.

FreeBSD[4] suggests to "use user names that consist of eight or fewer,
all lower case characters in order to maintain backwards compatibility
with applications." But the real syntax[5] is: login name must not
begin with a hyphen (`-'), and cannot contain 8-bit characters, tabs or
spaces, or any of these symbols: `,:+&#%^()!@~*?<>=|\/";'. The dollar
symbol (`$') is allowed only as the last character for use with Samba.
No field may contain a colon (`:') as this has been used historically
to separate the fields in the user database.

IBM AIX has these rules[6]: must not begin with a hyphen (-), plus sign
(+), at sign (@), or tilde (~). Additionally, do not use any of the
following characters within a user-name string: :"#,=\/?'`
Finally, the login parameter cannot contain any space, tab, or newline
characters.

On HP-UX user names are restricted[8] to eight characters and group
names to 16 characters, but you may change the limits up to 254
characters. Anyway, a name must start with a letter.

Kerberos syntax for principal[9] is GeneralString constrained to
contain only characters in IA5String (so, basically US-ASCII 7 bits),
with this note: US-ASCII control characters should not be used.

So, I think any sequence of Unicode "printable" letters should be
allowed. It may be encoded in UTF-8 or another encoding, but I think
UTF-8 is the best encoding since it includes the US-ASCII 7-bit chars.
About the meaning of "printable": probably this means a few Unicode
categories[7] should be included: lowercase letter, uppercase letter,
decimal number, plus a few symbols (hyphen, period, plus, at sign, and
underscore at minimum).
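
As a rough illustration of the category-based rule suggested above, here
is a minimal Python sketch. The mapping of "lowercase letter, uppercase
letter, decimal number" to the General Category codes Ll, Lu and Nd is
an assumption, and the function name is hypothetical; this is not what
adduser does.

```python
import unicodedata

# Assumption: the categories named above, written as Unicode General
# Category codes: Ll (lowercase letter), Lu (uppercase letter),
# Nd (decimal number).
ALLOWED_CATEGORIES = {"Ll", "Lu", "Nd"}

# The "few symbols" listed above: hyphen, period, plus, at sign, underscore.
ALLOWED_SYMBOLS = set("-.+@_")


def is_acceptable_name(name: str) -> bool:
    """Return True if every code point is an allowed symbol or category."""
    if not name:
        return False
    return all(
        ch in ALLOWED_SYMBOLS or unicodedata.category(ch) in ALLOWED_CATEGORIES
        for ch in name
    )
```

Note that with exactly these categories a decomposed "e" followed by a
combining acute accent is rejected, because combining marks are in
category Mn; whether marks should be allowed is a separate decision.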

Bye,
Giuseppe

[0]https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/bb726984(v=technet.10)?redirectedfrom=MSDN
[1]https://docs.oracle.com/cd/E88353_01/html/E37852/passwd-5.html#REFMAN5passwd-5
[2]https://www.rfc-editor.org/rfc/rfc4519#section-2.39
[3]https://docs.ldap.com/specs/rfc4517.txt
[4]https://docs.freebsd.org/en/books/handbook/basics/#users-synopsis
[5]https://man.freebsd.org/cgi/man.cgi?query=passwd&sektion=5&format=html
[6]https://www.ibm.com/docs/en/aix/7.2?topic=u-useradd-command
[7]https://www.compart.com/en/unicode/category
[8]https://support.hpe.com/hpesc/public/docDisplay?docId=c01922594&docLocale=en_US
[9]https://www.rfc-editor.org/rfc/rfc4120#section-5.2.1
Simon McVittie
2024-11-24 16:40:02 UTC
Permalink
Post by Giuseppe Sacco
Moreover, adduser man page on Debian stable, states
that gecos fields will be removed after bookworm.
No, it says the --gecos *option* will be removed after bookworm,
replaced by --comment, which seems to be another name for the same thing:
passwd(5) "user name or comment field" = struct passwd's pw_gecos,
as can be edited by chfn(1).

The field containing the user's full name, and a way to edit it, are
definitely something that should stay.

smcv
Marc Haber
2024-11-27 16:30:02 UTC
Permalink
Hi,
Post by Giuseppe Sacco
It is true that user account name and user (display) name are
different, of course. But still, when you log in, you use the user
account name to the access system; this is the text shown in file
ownership listing and almost everywhere in the system.
I think that user (display) name, that may be put in gecos field, are
not widely used.
I think this differs between GUIs and DEs (which are more likely to use
the display name) and the console (where the user name is used).
Post by Giuseppe Sacco
Moreover, adduser man page on Debian stable, states
that gecos fields will be removed after bookworm.
That's a misunderstanding. We're just in the process of renaming the
--gecos option to --comment as per passwd(5) documentation. Sadly,
passwd(5) uses "login name" instead of "user name"
Post by Giuseppe Sacco
So, having a good account user name is an important thing. And we have
to choose if it should be "good" for the computer (like in: unique,
lowercase, short, US-ASCII, etc.) or if it should be "good" for the
real user. In the latter case, I would accept a broader class of
strings for the very simple reason that it should be left to user
preference.
I think that we should have reached a state where a properly UTF-8
encoded string should be a good compromise between "good for the
computer" and "good for the person". In Debian, we have a rather tightly
controlled ecosystem and can take care that things don't break too
badly.
Thank you for this tedious work. I have incorporated that into
https://wiki.debian.org/UserAccountsPhilosophy to preserve the
information.

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Chris Hofstaedtler
2024-11-24 13:40:01 UTC
Permalink
Post by Bjørn Mork
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Lets please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...
You are confusing usernames and names. Different concepts with
different rules. Let's just hope you never get two students with the
same name.
I find your reply massively insulting, and I'm not even the original
author.

Usernames (not the "comment" field) are identifiers, and humans care
about the identifiers used for them.

Yes, some humans don't care if you assign them a random 32byte
string as their username. Enough humans however, do have
preferences. In some countries humans even have a right to choose
how they are being addressed.

Chris
Iustin Pop
2024-11-24 13:50:02 UTC
Permalink
Post by Chris Hofstaedtler
Post by Bjørn Mork
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Lets please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...
You are confusing usernames and names. Different concepts with
different rules. Let's just hope you never get two students with the
same name.
I find your reply massively insulting, and I'm not even the original
author.
Massively?
Post by Chris Hofstaedtler
Usernames (not the "comment" field) are identifiers, and humans care
about the identifiers used for them.
Yes, some humans don't care if you assign them a random 32byte
string as their username. Enough humans however, do have
preferences. In some countries humans even have a right to choose
how they are being addressed.
And what relation does the username used for logging in have to "being
addressed"? Isn't it akin to a passport/ID card number?

regards,
iustin
Chris Hofstaedtler
2024-11-24 14:30:01 UTC
Permalink
Post by Iustin Pop
Post by Chris Hofstaedtler
Post by Bjørn Mork
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Lets please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...
You are confusing usernames and names. Different concepts with
different rules. Let's just hope you never get two students with the
same name.
I find your reply massively insulting, and I'm not even the original
author.
Massively?
Yes.
Post by Iustin Pop
Post by Chris Hofstaedtler
Usernames (not the "comment" field) are identifiers, and humans care
about the identifiers used for them.
Yes, some humans don't care if you assign them a random 32byte
string as their username. Enough humans however, do have
preferences. In some countries humans even have a right to choose
how they are being addressed.
And what relation does the username used for logging in have to "being
addressed"? Isn't it akin to a passport/ID card number?
No. I see and type my username hundreds of times a day; people use it
to address me in written and spoken conversations, etc.

If it were my uid, which I see maybe once a week and don't have to
remember, I wouldn't care.

Chris
Bjørn Mork
2024-11-24 15:40:02 UTC
Permalink
Post by Chris Hofstaedtler
No. I see and type my username hundreds of times a day; people use it
to address me in written and spoken conversations, etc.
This is confusing the subject even more.

Are you sure you are talking about usernames? Or is this email local
parts, chat nicknames and spoken nicks? If so, then there is no reason
you can't use utf8. Today. Without changing any username.

It's also possible to modify $PS1 if seeing \u without utf8 is annoying.


Bjørn
Philipp Kern
2024-11-24 17:30:01 UTC
Permalink
Post by Bjørn Mork
Post by Chris Hofstaedtler
No. I see and type my username hundreds of times a day; people use it
to address me in written and spoken conversations, etc.
This is confusing the subject even more.
Are you sure you are talking about usernames? Or is this email local
parts, chat nicknames and spoken nicks? If so, then there is no reason
you can't use utf8. Today. Without changing any username.
In many organizations the email local part matches the username[1] and it
is also used in spoken conversations. To the point where I needed to
make clear on internal yellow pages that I would prefer not to be called
"pkern" in spoken conversation, thank you very much.

So yes, usernames are pretty much used in spoken conversation. Many do
not actually understand what a username is and think that it reflects
how someone wants to be called - as their default assumption.

Kind regards
Philipp Kern

PS: My personal, ignorant, Latin-world opinion is that it is probably
too hard for most people to type each other's usernames if UTF-8 were to
be allowed. And I would never ever use UTF-8 in a local part. And I
suffered a bit too much recently looking at differences between byte
count and character count.

[1] Referred to as "LDAP" in mine, which is both funny and sad.
Marc Haber
2024-11-27 16:00:01 UTC
Permalink
Post by Philipp Kern
PS: My personal, ignorant, Latin-world opinion is that it is probably
too hard for most people to type each other's usernames if UTF-8 were to
be allowed.
Why would anybody need to type somebody else's user name except in
"su"? I see it as the exception that local parts of mail addresses
map 1:1 to a UNIX user name.

Greetings
Marc
Bálint Réczey
2024-11-24 15:00:01 UTC
Permalink
Hi Johannes,
Post by Johannes Schauer Marin Rodrigues
Quoting nick black (2024-11-23 08:48:10)
Post by nick black
You now have glyphs which occupy more than one column. Are your
columnar/tabular programs prepared for that? ﷽𒁭𒐫i
xfce-terminal renders this like this: https://mister-muffin.de/p/4o2v.png
No idea if this is correct and I'll leave the details to those who know more
about this topic than I. And maybe my email client completely messes this up in
this response of mine.
But my 2 cents on the topic are: Lets please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...
I had students as well with many of them having accents in their name,
like myself and never had this kind of discomfort before.

If it ever occurs to me, I'll remind myself that deeply personal
birthdays are also shown in Arabic numerals instead of Roman ones,
which would look way cooler, and use base 10 encoding instead of
base 60, which was widely used by the Sumerians.
Post by Johannes Schauer Marin Rodrigues
If having this work in Debian can put a bit on pressure on those software
projects that do not support this, then please let that happen so that missing
unicode support becomes more annoying for those pieces of software that are
missing it. For example, if my email client messed this up, then lets fix it.
We cannot find these kind of bugs if we accept translating everybody's given
name to the American alphabet.
Please don't open this can of worms and impose pointless work on
upstreams. Keep what has worked reasonably well for decades.

Cheers,
Balint

PS: The mandatory relevant Monty Python sketch:

Post by Johannes Schauer Marin Rodrigues
Thanks!
cheers, josch
Marc Haber
2024-11-27 16:00:01 UTC
Permalink
Post by Johannes Schauer Marin Rodrigues
But my 2 cents on the topic are: Lets please allow more than ascii in
usernames. I find it very uncomfortable every time I have to tell my students
that sorry, you somehow have to manage writing your name using American letters
because that's all we have after half a century of Computers being a thing...
In Debian stable, they can already try.

Greetings
Marc
Marc Haber
2024-11-27 15:50:01 UTC
Permalink
Hi nick,
Post by nick black
Post by Marc Haber
(1)
Should Debian allow UTF-8 user names in the first place or should we
restrict names for regular users to some us-ascii near set as well? (I
think yes, we should)
I feel strongly yes, despite POSIX admonitions (quoted elsewhere
in this thread) and sure breakage any number of places.
Thank you, noticed.
Post by nick black
I think
a test plan would be very desirable (off the top of my head,
we'd want to check login, the DMs, PAM, OpenSSH, passwd, w,
framebuffer console input, etc.). It would probably also be a good
idea to loop in other distributions.
Coordinating this test is way beyond what I have available in resources,
most notably time. Our tools have been allowing UTF-8 user names at
least since bookworm (I don't have any bullseye systems left, buster's
adduser does not allow UTF-8). So we are already testing this in a
stable release (albeit unplanned).

Please note that allowing UTF-8 user names by default does not break
compatibility in any place where only 7bit user names are being used.
Debian is not using such user names in anything that we ship. We only
allow them.

Actually _doing_ this is still the local admin's decision. And should
they decide to not want this, adduser can be configured to disallow.

This thread is mainly about whether we should disallow things in next
stable that are possible in current stable. I think we need good reasons
for that, and I ain't seeing any right now.
Post by nick black
I recommend Chapter 7 of my free book, "Hacking the Planet with
Notcurses: A Guide to TUIs and Character Semigraphics" for the
https://nick-black.com/htp-notcurses.pdf (starts on page 41).
Noted for reading.
Post by nick black
* any upstream tool could say "bad idea" and refuse patches,
requiring their long term management,
Depending on how important this tool is, we could get away without
patching and probably not even documenting this failure.
Post by nick black
* the Linux framebuffer console is pretty limited in what
glyphs it has available, and the number of glyphs it can
support,
Probably, yes. But people working on the Linux framebuffer console are
unlikely to actually use UTF-8 user names, so the only really bad
situation would be a rescue situation. We could get away with
documenting "please use 7bit only user names for accounts that are
likely to be used in system rescue situations".
Post by nick black
* you want installer support if you intend to do this right,
The installer currently allows me to type UTF-8 user names in the entry
fields (and even displays them correctly when one goes through the
dialogs a second time), but rejects them with a sanity-check error
message ("The username you entered is invalid. Note that usernames must
start with a lower-case letter, which can be followed by any combination
of numbers and more lower-case letters, and must be no more than 32
characters long."). That message is incorrect; it should say "lower-case
us-ascii letters". From a German point of view, "jürgen" conforms to the
rules given in the error message.
Post by nick black
* ubiquitous input for UTF-8 is a pretty complicated story, and
Sites using such letters in user names should know which of them can be
typed.
Post by nick black
* broken localization (or failure to call setlocale()) could be
a bigger problem, especially for root/system accounts.
I don't think we should allow UTF-8 characters in the string "root" or in
system account names. And if a local admin decides to do so, Debian
packages should still restrict themselves to using US-ASCII in their
system accounts.
Post by nick black
You'll likely now be linking libunistring into some
binaries where it wasn't previously used.
Probably, yes. I hope to get away in adduser without that, since I'd
like to keep adduser's dependencies minimal (it's being used in the
installer).
Post by nick black
Regarding the subset of Unicode characters you'd want to allow,
this would be best decided using the General Category trait.
Each codepoint is assigned one of a finite set of General
Categories. We would probably want to allow Letters, Marks, and
Numbers, and perhaps a whitelist from Punctuation and Symbols
(Punctuation, connector and Punctuation, dash are probably all
we'd want) extended from currently supported ispunct(3)
characters. This data is available from libunistring (and
probably other places). This eliminates a great swatch of known
security issues.
Do you have a suggestion for a perl regexp that allows this? My current
Post by nick black
Names containing invalid UTF-8 sequences ought be rejected.
Agreed. How do I check for this in perl?
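
For illustration only, and in Python rather than Perl: one common way to
test whether a byte string is well-formed UTF-8 is to attempt a strict
decode and treat any decoding error as a rejection. The function name is
hypothetical, and this is a sketch, not adduser's code.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8, False otherwise."""
    try:
        # Strict decoding rejects truncated sequences, overlong
        # encodings, stray continuation bytes and encoded surrogates.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

This only answers "is it valid UTF-8"; it says nothing about which code
points are desirable in a user name.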
Post by nick black
Characters 0-127 would presumably be allowed iff they are now;
UTF-8 preserves US-ASCII.
I'd rather allow 32-127 only.
Post by nick black
We ought support combining characters up through the Extended
Grapheme Cluster (a single user-perceived character, roughly a
glyph, made up of one or more encoded characters). Generally a
single backspace ought map to an entire EGC.
This is beyond my knowledge of Unicode.
Post by nick black
Regarding canonicalization/normalization, this is a complex
question without a necessarily correct technical answer. I think
you'd want to follow the Principle of Least Astonishment; as to what
would astonish the least, I'd like to hear wider input. But
Unicode definitely defines multiple normal forms and equivalency
classes.
I am not sure whether we need this. A local admin is likely to be
consistent with herself in creating user names.
Post by nick black
You now have glyphs which occupy more than one column. Are your
columnar/tabular programs prepared for that? ﷽𒁭𒐫
Probably not. If that's important for a local admin, they can disallow
such characters and maybe even file a patch against adduser.
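
To make the column-width concern concrete, here is a rough Python
heuristic that estimates how many terminal columns a string occupies. It
is an approximation under stated assumptions (wide and fullwidth
characters count as two columns, combining marks as zero, everything else
as one); it is not a replacement for wcwidth(3) and will misjudge some
characters. The function name is hypothetical.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimate the number of terminal columns text occupies."""
    cols = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks attach to the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # East Asian Wide and Fullwidth characters take two columns.
            cols += 2
        else:
            cols += 1
    return cols
```

A tabular program that pads by len() instead of a width estimate like
this will misalign any row containing such characters.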

Quoting the character just out of curiosity.
Post by nick black
Post by Marc Haber
(2)
If the answer to (1) is "allow UTF-8", should we also do that for system
users? (I think no, we should not)
I think you should, simply because otherwise you have two paths
in more places.
Adduser already has different code paths for normal and system accounts.
Post by nick black
Post by Marc Haber
(3)
I think that 32 characters/bytes (it's the same if we don't allow UTF-8)
is a good limitation for a system user name. But, should we increase
that for regular user names? (I think yes)
I hesitate to comment here because who really cares, but does 32
save us something over 128? 128 seems the default "enough for
everybody" these days, looking at IPv6 and ZFS.
systemd argues that names longer than 32 characters are rarely supported
in "older and unmaintained" utilities.
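
Once non-ASCII names are allowed, "32 characters" and "32 bytes" stop
being the same limit. A small Python sketch of the difference follows;
treating the limit as a byte count is an assumption made here for
illustration, and the names are hypothetical.

```python
# Assumption: the 32-unit limit discussed in the thread, counted in bytes.
MAX_NAME_BYTES = 32


def name_lengths(name: str) -> tuple:
    """Return (code point count, UTF-8 byte count) for name."""
    return len(name), len(name.encode("utf-8"))


def fits_limit(name: str, limit: int = MAX_NAME_BYTES) -> bool:
    """Return True if the UTF-8 encoding of name fits in limit bytes."""
    return len(name.encode("utf-8")) <= limit
```

For example, "jürgen" is 6 code points but 7 bytes in UTF-8, so a name
that looks short enough can still exceed a byte-based limit.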
Post by nick black
My printer is administered by i̸̒n̴͛e̵̎l̴͝u̷̾c̴̉t̵́å̵b̷͋l̷͐e̴̋m̸̆o̷̚d̴̐ä̸́l̶͝i̷̋t̷͗ẏ̷ȏ̵f̸̃t̶͘h̷͗e̴̿v̶͘i̷̛s̸̈́ì̵b̷̃l̶̎e̷͊.
That really renders strangely here.
Post by nick black
Post by Marc Haber
(6)
Does it still make sense to give non-UTF-8-locales special handling
(which one?), or can adduser safely assume that any non-ascii locale is
UTF-8? Or must I check for locale and reject UTF-8 user names on
non-UTF-8 locales? (I hope that we can safely assume UTF-8)
It cannot. "C" is not UTF-8. Assumption of UTF-8 requires a
properly set LANG and programs calling setlocale(). This, as
alluded to above, has the potential for a big mess.
Our default is C.UTF-8 and has been like that for a while.
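
A rough sketch, in Python, of how a tool might decide from the
environment whether the effective character-type locale names a UTF-8
codeset. It only inspects the variable values, following the POSIX
precedence LC_ALL, then LC_CTYPE, then LANG; a real program would more
likely call setlocale() and query the codeset. Both function names are
hypothetical.

```python
def effective_ctype_locale(env: dict) -> str:
    """Return the locale name that governs character handling."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # Empty values are treated as unset.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def looks_like_utf8_locale(env: dict) -> bool:
    """Return True if the effective locale name carries a UTF-8 codeset."""
    name = effective_ctype_locale(env)
    # Locale names look like language[_territory][.codeset][@modifier].
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"
```

With this check, "C.UTF-8" counts as UTF-8 while plain "C" does not,
which is the distinction made above.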

Greetings
Marc
nick black
2024-12-02 00:00:01 UTC
Permalink
Post by Marc Haber
Post by nick black
* any upstream tool could say "bad idea" and refuse patches,
requiring their long term management,
Depending on how important this tool is, we could get away without
patching and probably not even documenting this failure.
This kind of attitude seems self-defeating. Despite being
*strongly* in favor of this effort, I would oppose it if it were
strictly a Debian thing. We can inspire the move, but going it
alone seems a recipe for present and future pain (think SSHing
from/to Debian and a non-Debian machine).
Post by Marc Haber
Post by nick black
* the Linux framebuffer console is pretty limited in what
glyphs it has available, and the number of glyphs it can
support,
Probably, yes. But people working on the Linux framebuffer console are
unlikely to actually use UTF-8 user names, so the only really bad
With all due respect, this seems totally unsupported by anything
other than vibes =].
Post by Marc Haber
Post by nick black
* broken localization (or failure to call setlocale()) could be
a bigger problem, especially for root/system accounts.
I don't think we should allow UTF-8 characters in the string "root" or in
system account names. And if a local admin decides to do so, Debian
packages should still restrict themselves to using US-ASCII in their
system accounts.
Why? This would require multiple code paths for what seems to me a
very questionable objective. You point out later in your
response that there already exist diverging codepaths, but isn't
unifying such things always a goal?
Post by Marc Haber
Do you have a suggestion for a perl regexp that allows this? My current
I do not. This is not a regex problem in my mind and experience;
you need full access to complicated libraries. Any such effort
should go through Annex 15 canonicalization before being
inspected at all. At that point, you're well past regular
languages so far as I can tell. I do not see this goal as
possible with small surgeries on the adduser code base, but
rather something that requires work across the chain.
Post by Marc Haber
Post by nick black
Names containing invalid UTF-8 sequences ought be rejected.
Agreed. How do I check for this in perl?
I have no idea. It's not very simple. Here's code from my
Notcurses library that extracts a single EGC from a UTF8 string:

https://github.com/dankamongmen/notcurses/blob/a5c7d2262a333353bd5c3428c9397de4864c79ff/src/lib/egcpool.h#L87
Post by Marc Haber
Post by nick black
My printer is administered by i̸̒n̴͛e̵̎l̴͝u̷̾c̴̉t̵́å̵b̷͋l̷͐e̴̋m̸̆o̷̚d̴̐ä̸́l̶͝i̷̋t̷͗ẏ̷ȏ̵f̸̃t̶͘h̷͗e̴̿v̶͘i̷̛s̸̈́ì̵b̷̃l̶̎e̷͊.
That really renders strangely here.
That was intended, to demonstrate the complexity of potential
strings we might have to deal with.
Post by Marc Haber
Post by nick black
It cannot. "C" is not UTF-8. Assumption of UTF-8 requires a
properly set LANG and programs calling setlocale(). This, as
alluded to above, has the potential for a big mess.
Our default is C.UTF-8 and has been like that for a while.
Yes, but that can be changed.

With all due respect, I admire your gung ho candoit spirit, but
adduser alone is not IMHO the place. This is a major change
requiring support from libraries, applications, and UI to do
right, and thus wide buyin. I love the idea, but it's not going
to happen with a few Perl regexes. Please don't read this as
commentary on you or your code.
--
nick black -=- https://nick-black.com
to make an apple pie from scratch,
you need first invent a universe.
Marc Haber
2024-12-02 08:50:01 UTC
Permalink
Post by nick black
Post by Marc Haber
Post by nick black
* any upstream tool could say "bad idea" and refuse patches,
requiring their long term management,
Depending on how important this tool is, we could get away without
patching and probably not even documenting this failure.
This kind of attitude seems self-defeating. Despite being
*strongly* in favor of this effort, I would oppose it if it were
strictly a Debian thing. We can inspire the move, but going it
alone seems a recipe for present and future pain (think SSHing
from/to Debian and a non-Debian machine).
I bet that other distributions will also allow me to useradd a UTF-8
name today. I don't think that we have patched useradd to allow this.
Post by nick black
Post by Marc Haber
Post by nick black
* the Linux framebuffer console is pretty limited in what
glyphs it has available, and the number of glyphs it can
support,
Probably, yes. But people working on the Linux framebuffer console are
unlikely to actually use UTF-8 user names, so the only really bad
With all due respect, this seems totally unsupported by anything
other than vibes =].
So you think that we should be stricter than we are today?
Post by nick black
Post by Marc Haber
Post by nick black
* broken localization (or failure to call setlocale()) could be
a bigger problem, especially for root/system accounts.
I don't think we should allow UTF-8 characters in the string "root" or in
system account names. And if a local admin decides to do so, Debian
packages should still restrict themselves to using US-ASCII in their
system accounts.
Why? This would require multiple code paths for what seems to me a
very questionable objective. You point out later in your
response that there already exist diverging codepaths, but isn't
unifying such things always a goal?
I think that the distinction between system users and regular users is a
good thing and that we should continue treating them differently.
Strictly, it's only adduser (and useradd, UID only) having different
code paths, the treatment in other software is identical.

Even if we unify things (either by allowing strange characters in system
user names, or by restricting regular user names to the western
character set), adduser will need to keep the distinction because we
assign UIDs from different ranges.
Post by nick black
Post by Marc Haber
Do you have a suggestion for a perl regexp that allows this? My current
I do not. This is not a regex problem in my mind and experience;
you need full access to complicated libraries.
Adduser will have to stick to regexes for dependency reasons.
Post by nick black
Any such effort
should go through Annex 15 canonicalization before being
inspected at all.
I have always assumed that canonicalization would be used for sorting
and equality, while in the databases it is important to keep the
difference between the unit Angstrom and the capital letter A with
circle. If we canonicalize everything, why do we have different
codepoints for different semantics?

Yes, I need to read your book.
Post by nick black
At that point, you're well past regular
languages so far as I can tell. I do not see this goal as
possible with small surgeries on the adduser code base, but
rather something that requires work across the chain.
So, "not for Trixie". And what would we do in Trixie? I think we need
something that a single person can implement in spare time before
Christmas. This is a rather limited amount of time that we have.
Post by nick black
Post by Marc Haber
Post by nick black
It cannot. "C" is not UTF-8. Assumption of UTF-8 requires a
properly set LANG and programs calling setlocale(). This, as
alluded to above, has the potential for a big mess.
Our default is C.UTF-8 and has been like that for a while.
Yes, but that can be changed.
By the local admin? Yes. That's why we (Linux distributions) should
stick to us-ascii user names for the accounts that are created by our
packages. If a local admin creates UTF-8 user names but gives them a
non-UTF-8 locale, then it's their fault, and if a user with a UTF-8 user
name selects a non-UTF-8 locale, it's deliberate sabotage. I don't think
we should care about that, and it's already possible today.
Post by nick black
With all due respect, I admire your gung ho candoit spirit, but
adduser alone is not IMHO the place. This is a major change
requiring support from libraries, applications, and UI to do
right, and thus wide buyin. I love the idea, but it's not going
to happen with a few Perl regexes. Please don't read this as
commentary on you or your code.
So your recommendation is to disallow things that we have allowed until
recently, and maybe remove configurability to REALLY disallow it?

Greetings
Marc
Chris Hofstaedtler
2024-12-02 15:30:02 UTC
Permalink
Post by Marc Haber
Post by nick black
Post by Marc Haber
Post by nick black
* any upstream tool could say "bad idea" and refuse patches,
requiring their long term management,
Depending on how important this tool is, we could get away without
patching and probably not even documenting this failure.
This kind of attitude seems self-defeating. Despite being
*strongly* in favor of this effort, I would oppose it if it were
strictly a Debian thing. We can inspire the move, but going it
alone seems a recipe for present and future pain (think SSHing
from/to Debian and a non-Debian machine).
I bet that other distributions will also allow me to useradd a UTF-8
name today. I don't think that we have patched useradd to allow this.
We did. Debian carries (since "forever") a patch in useradd to turn
off most name checking. (Trying to) remove this patch is what
started this all.

Observe:

[***@cc65635fbf00 /]# cat /etc/os-release
NAME="Fedora Linux"
VERSION="40 (Container Image)"
...
[***@cc65635fbf00 /]# useradd för
useradd: invalid user name 'för': use --badname to ignore


Not sure if mjt brought it up yet, but the sendmail interface will
also need some solution for utf8 usernames (=email address local
parts). However, it seems some sendmail implementations already
cannot cope with utf8 gecos fields.

Chris
Marc Haber
2024-12-05 17:10:01 UTC
Permalink
Post by nick black
I recommend Chapter 7 of my free book, "Hacking the Planet with
Notcurses: A Guide to TUIs and Character Semigraphics" for the
https://nick-black.com/htp-notcurses.pdf (starts on page 41).
Thank you very much for providing this. The chapter has educated me.
"The vast minimum of things you should know about Unicode."

The time to read it was well spent.

Greetings
Marc

P.S.: Sadly, this has gotten less than positive coverage on LWN. I
apologize for the harm this discussion has done.
Chris Hofstaedtler
2024-12-09 17:20:01 UTC
Permalink
Post by Marc Haber
P.S.: Sadly, this has gotten less than positive coverage on LWN. I
apologize for the harm this discussion has done.
Marc, thank you for collecting the info on the wiki, and starting
this discussion. I'm sorry I was not able to participate more.

However, I reject the idea that it is on you to apologize for LWN
covering this discussion and the harm that might have come out of
it. This is something we need to address on a wider floor. Otherwise
we lose our ability to discuss anything (and then to change anything
ever).

Best,
Chris
Marc Haber
2024-12-03 16:30:01 UTC
Permalink
Hi,

thank you all for your contributions to this discussion. I have now
finally understood¹ that it is not enough to try creating a UTF-8
encoded user name and see that it correctly shows up in /etc/passwd to
declare UTF-8 support. Please forgive me for not replying to all of you
in this thread individually. I have read everything, and if I didn't cater
for your arguments in this message please feel free to remind me.

https://lists.debian.org/debian-devel/2024/11/msg00491.html correctly
outlines that homograph characters (such as é (UTF-8 0xC3 0xA9) and the
lookalike é (0x65 0xCC 0x81)) are not only a nuisance. At the least,
adduser should reject creating étienne if étienne already exists - those
are different user names but look the same, and unless you
cut-and-paste user names instead of typing them you're bound to hit the
wrong user depending on HOW you type and what input medium you use. Not
good.

https://wiki.debian.org/UserAccounts and
https://wiki.debian.org/UserAccountsPhilosophy are updated accordingly.

After understanding this, I must admit that what's currently left active
on the adduser team (me) doesn't have the capacity to implement this
properly and in time for trixie. To make things worse, the
Unicode::Precis module, which should be in Debian as
libunicode-precis-perl (but isn't), hasn't seen an upstream release in
more than five years.

Additionally, I don't see myself in the situation of writing a proper
checker for the RFC 8264 IdentifierClass (Chapter 4.2) at the moment
since I don't have the time to check out which \p{Foo} character classes
match the classes given in the RFC.

I would appreciate volunteers to help here, but first I need to bring
some sense into adduser's current state of affairs to make an unstable
upload that can eventually migrate to testing.

What I intend to do in adduser for the next unstable upload is:

- adduser --system's user name validation will not change
- I'll make sure that adduser <normal user account> doesn't accept
UTF-8 user names, bringing it closer to systemd's notion of a valid
user name
- adduser --allow-bad-names will still allow UTF-8 usernames, not doing
normalization. I will document this and make it clear that the local
admin needs to make sure that they don't allow things they don't want
to have
- adduser --allow-all-names will just verbatim pass all user names to
useradd.
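
The tiers listed above could be sketched roughly as follows. This is only
an illustration under stated assumptions: the two patterns and the
function name required_option() are hypothetical and are not taken from
adduser's actual configuration or code.

```python
import re

# Assumption: an ASCII-only pattern loosely modelled on "strict" user
# names (lowercase letter or underscore first, at most 31 characters).
STRICT_NAME = re.compile(r"[a-z_][a-z0-9_-]{0,30}")

# Assumption: a minimal "relaxed" rule that only rejects characters which
# would break /etc/passwd parsing or typical command lines.
FORBIDDEN_RELAXED = set(":,\n\r\t ")


def required_option(name: str) -> str:
    """Return the (hypothetical) option a name would need, "" if none."""
    if STRICT_NAME.fullmatch(name):
        return ""  # plain ASCII name, accepted without any override
    if name and not name.startswith("-") and not (set(name) & FORBIDDEN_RELAXED):
        # Non-ASCII (e.g. UTF-8) names end up here, without normalization.
        return "--allow-bad-names"
    # Everything else would be passed verbatim to useradd.
    return "--allow-all-names"
```

With these assumed rules, "marc" needs no option, "étienne" needs
--allow-bad-names, and "a:b" would only be accepted with
--allow-all-names.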

All this will be documented in the man page, in README.Debian and/or the
Wiki after the code passes the test suite again.

I'll probably deprecate --allow-bad-names in favor of something that
doesn't use the word "bad" (suggestions appreciated). Otoh, adduser in
the Red Hat World uses --badname to allow such names as well.

I would love to hear your opinion. Silence is agreement ;-)

Greetings
Marc


¹ RFC 8264, RFC 8265, and Unicode TR 15 linked in this thread were
educational for me
Gioele Barabucci
2024-12-03 16:50:01 UTC
Permalink
Post by Marc Haber
- adduser --system's user name validation will not change
- I'll make sure that adduser <normal user account> doesn't accept
UTF-8 user names, bringing it closer to systemd's notion of a valid
user name
- adduser --allow-bad-names will still allow UTF-8 usernames, not doing
normalization. I will document this and make it clear that the local
admin needs to make sure that they don't allow things they don't want
to have
Dear Marc,

in preparation for a PRECIS future, couldn't adduser pass the usernames
through NFC instead of doing no normalization?

RFC 8264 5.2.4 Normalization Rule states:

In accordance with [RFC5198], Normalization Form C (NFC) is
RECOMMENDED.

[1] https://www.rfc-editor.org/rfc/rfc8264.html#section-5.2.4

Regards,
--
Gioele Barabucci
Marc Haber
2024-12-03 17:00:02 UTC
Permalink
Post by Gioele Barabucci
Post by Marc Haber
- adduser --system's user name validation will not change
- I'll make sure that adduser <normal user account> doesn't accept
UTF-8 user names, bringing it closer to systemd's notion of a valid
user name
- adduser --allow-bad-names will still allow UTF-8 usernames, not doing
normalization. I will document this and make it clear that the local
admin needs to make sure that they don't allow things they don't want
to have
Dear Marc,
in preparation for a PRECIS future, couldn't adduser pass the usernames
through NFC instead of doing no normalization?
In accordance with [RFC5198], Normalization Form C (NFC) is
RECOMMENDED.
that would solve the étienne and étienne issue (where the two characters
are just different renderings of the same character), but not the
Ohm-against-Omega issue, right?

While this seems the right thing to do, I think this should be done in
useradd (pkg:shadow), in the respective upstream project, so that all
Linux distributions get the same behavior.

I have filed https://github.com/shadow-maint/shadow/issues/1138 in the
general regard. Feel free to add what I forgot to mention there.

I'd rather not have this can of worms in adduser, but I'd consider a
patch.

Greetings
Marc
Gioele Barabucci
2024-12-03 20:40:01 UTC
Permalink
Post by Marc Haber
Post by Gioele Barabucci
in preparation for a PRECIS future, couldn't adduser pass the usernames
through NFC instead of doing no normalization?
In accordance with [RFC5198], Normalization Form C (NFC) is
RECOMMENDED.
that would solve the étienne and étienne issue (where the two characters
are just different renderings of the same character), but not the
Ohm-against-Omega issue, right?
NFC would solve both of these "problems":

* Both U+00E9 (é) and U+0065, U+0301 are NFC-normalized to U+00E9,
* Both U+2126 (Ohm sign) and U+03A9 (omega) are NFC-normalized to U+03A9
(omega).

What NFC alone will not solve are homograph collisions: a (U+0061 Latin
small letter a) and а (U+0430 Cyrillic small letter a) are
NFC-normalized to different codepoints.

But these are two different scenarios: the former problem may (and does)
arise without any wrongdoing from the user's side (a different OS, or a
different string manipulation library, or a screen keyboard may produce
a different é), the latter is an attack. The former is an
interoperability issue, the latter is a security issue.
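
The three cases can be checked with Python's standard unicodedata module.
This is a minimal sketch; the helper name nfc() is only for illustration,
and the Greek capital omega is written as U+03A9.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC (canonical composition) form of a string."""
    return unicodedata.normalize("NFC", s)


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
ohm = "\u2126"           # OHM SIGN
omega = "\u03a9"         # GREEK CAPITAL LETTER OMEGA
latin_a = "\u0061"       # LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # CYRILLIC SMALL LETTER A

# Canonically equivalent sequences become identical after NFC.
print(composed == decomposed, nfc(composed) == nfc(decomposed))
print(nfc(ohm) == omega)
# Homographs from different scripts stay distinct.
print(nfc(latin_a) == nfc(cyrillic_a))
```

The first two comparisons become true only after normalization, while the
Latin/Cyrillic pair remains different, which matches the distinction
between the interoperability issue and the security issue.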
Post by Marc Haber
While this seems the right thing to do, I think this should be done in
useradd (pkg:shadow), in the respective upstream project, so that all
Linux distributions get the same behavior.
That's probably the best approach.

Thanks for taking the time to delve into this issue,
--
Gioele Barabucci
Marc Haber
2024-12-03 21:10:02 UTC
Permalink
Post by Gioele Barabucci
Post by Marc Haber
Post by Gioele Barabucci
in preparation for a PRECIS future, couldn't adduser pass the usernames
through NFC instead of doing no normalization?
In accordance with [RFC5198], Normalization Form C (NFC) is
RECOMMENDED.
that would solve the étienne and étienne issue (where the two characters
are just different renderings of the same character), but not the
Ohm-against-Omega issue, right?
* Both U+00E9 (é) and U+0065, U+0301 are NFC-normalized to U+00E9,
* Both U+2126 (Ohm sign) and U+03A9 (omega) are NFC-normalized to U+03A9
(omega).
Converting Ohm into an Omega is losing intended information, isn't it?
Post by Gioele Barabucci
Thanks for taking the time to delve into this issue,
I have learned many things.

Greetings
Marc
Gioele Barabucci
2024-12-03 21:20:01 UTC
Permalink
Post by Marc Haber
Post by Gioele Barabucci
Post by Marc Haber
Post by Gioele Barabucci
in preparation for a PRECIS future, couldn't adduser pass the usernames
through NFC instead of doing no normalization?
In accordance with [RFC5198], Normalization Form C (NFC) is
RECOMMENDED.
that would solve the étienne and étienne issue (where the two characters
are just different renderings of the same character), but not the
Ohm-against-Omega issue, right?
* Both U+00E9 (é) and U+0065, U+0301 are NFC-normalized to U+00E9,
* Both U+2126 (Ohm sign) and U+03A9 (omega) are NFC-normalized to U+03A9
(omega).
Converting Ohm into an Omega is losing intended information, isn't it?
Normalization is always lossy, at least in principle.

Applications that employ normalization accept that tradeoff in order to
gain something valuable: in this case the ability to have an Ohm sign
codepoint as part of your username is traded for the ability to compare
usernames across different OSes and applications.

Regards,
--
Gioele Barabucci
Marc Haber
2024-12-03 21:50:02 UTC
Permalink
Post by Gioele Barabucci
Normalization is always lossy, at least in principle.
Applications that employ normalization accept that tradeoff in order to gain
something valuable: in this case the ability to have an Ohm sign codepoint as
part of your username is traded for the ability to compare usernames across
different OSes and applications.
I don't know what's exactly in the standard, but my gut feeling says
that I would probably store _exactly_ what was received, but normalize
both sides before duplicate checking, sorting, comparing.
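
A minimal sketch of that idea, assuming NFC as the comparison form: the
name is stored exactly as received, and only the comparison goes through
normalization. The function names comparison_key() and find_collision()
are hypothetical.

```python
import unicodedata
from typing import Iterable, Optional


def comparison_key(name: str) -> str:
    """Key used only for comparing names; the stored name is untouched."""
    return unicodedata.normalize("NFC", name)


def find_collision(new_name: str, existing: Iterable[str]) -> Optional[str]:
    """Return the existing name that is canonically equivalent, if any."""
    key = comparison_key(new_name)
    for name in existing:
        if comparison_key(name) == key:
            return name
    return None
```

For example, a request for "e\u0301tienne" would be reported as colliding
with an existing "\u00e9tienne", while a Cyrillic lookalike would not be
caught by this check.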

If we'd normalize things away in storage, why do we have homographs in
the first place? Why would I replace a Cyrillic a with a Latin a,
destroying the idea of a "script"?

Greetings
Marc
Theodore Ts'o
2024-12-10 12:50:01 UTC
Permalink
Post by Gioele Barabucci
* Both U+00E9 (é) and U+0065, U+0301 are NFC-normalized to U+00E9,
* Both U+2126 (Ohm sign) and U+03A9 (omega) are NFC-normalized to U+03A9
(omega).
What NFC alone will not solve are homograph collisions: a (U+0061 Latin
small letter a) and а (U+0430 Cyrillic small letter a) are NFC-normalized to
different codepoints.
NFC also doesn't solve various invisible characters (e.g., zero-width
spaces, bidirectional control characters). For more information about
all of the various security land mines, see[1]. I also suggest that
people do a google search on "CVE" and "Unicode". There has been at
least one interaction where we needed to make a kernel(!) change to
address a security vulnerability, although we decided it wasn't
super-critical because "no sane distribution actually enables the
casefold feature on users' file systems by default".

[1] https://www.unicode.org/reports/tr39/tr39-22.html
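
To illustrate the point about invisible characters, here is a rough
Python sketch that flags code points by Unicode general category. The
choice of categories and the name suspicious_code_points() are
assumptions for illustration; this is far cruder than UTS #39 or the
PRECIS IdentifierClass.

```python
import unicodedata

# Assumption: treat every "C*" category (control, format, surrogate,
# private use, unassigned) and every "Z*" category (spaces, line and
# paragraph separators) as suspicious. Zero-width and bidirectional
# control characters fall under Cf (format).
SUSPICIOUS_MAJOR_CLASSES = {"C", "Z"}


def suspicious_code_points(name: str) -> list[str]:
    """List the code points in name that belong to a suspicious class."""
    return [
        f"U+{ord(ch):04X}"
        for ch in name
        if unicodedata.category(ch)[0] in SUSPICIOUS_MAJOR_CLASSES
    ]


# NFC leaves a zero-width space in place, so normalization alone
# does not help here.
hidden = "ad\u200bmin"
print(unicodedata.normalize("NFC", hidden) == hidden)
print(suspicious_code_points(hidden))
```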

The other security consideration to consider is the vast amount of
code that you need to link into security critical / setuid programs if
you are going to use libunicode. (And yes, we do include libunicode
into the kernel in order to support casefold. If you are thinking
about potentially enabling casefold by default on User file systems
because Windows and MacOS do it, and we need to appeal to Gen Z'ers
in order for Debian to stay relevant(tm) --- please don't. :-)

So if we really do want to support unicode in usernames, may I suggest
having someone implement the smallest possible Unicode
canonicalization library, which also handles getting rid of all of the
*other* Unicode security traps like invisible characters,
bidirectional control characters, etc., and then having it
subjected to rigorous security audits before we propose linking it
into setuid programs? That would be a Really Good Idea.

This would also reduce bloat in the minimal Debian install required
for installer images, docker containers, etc., since we wouldn't need
to support things like Unicode sorting rules, Unicode case folding,
conversion between the many different Unicode encoding forms, etc.

Cheers,

- Ted
Gioele Barabucci
2024-12-10 14:00:01 UTC
Permalink
Post by Theodore Ts'o
Post by Gioele Barabucci
* Both U+00E9 (é) and U+0065, U+0301 are NFC-normalized to U+00E9,
* Both U+2126 (Ohm sign) and U+03A9 (omega) are NFC-normalized to U+03A9
(omega).
What NFC alone will not solve are homograph collisions: a (U+0061 Latin
small letter a) and а (U+0430 Cyrillic small letter a) are NFC-normalized to
different codepoints.
NFC also doesn't solve various invisible characters (e.g., zero-width
spaces, bidirectional control characters). For more information about
all of the various security land mines, see[1].
NFC has been mentioned in a broader discussion on PRECIS/RFC8264/RFC8265.

The IdentifierClass of RFC 8264 explicitly disallows all these "security
land mines": https://www.rfc-editor.org/rfc/rfc8264.html#section-4.2.3

The "Security considerations" section is quite extensive (5 pages long):
https://www.rfc-editor.org/rfc/rfc8264.html#section-12

In general, the PRECIS RFCs are more prescriptive than Unicode UTS #39,
so, should Unicode usernames ever happen, the PRECIS RFCs are the
reference all programs should follow.

Regards,
--
Gioele Barabucci
Theodore Ts'o
2024-12-10 15:00:01 UTC
Permalink
Post by Gioele Barabucci
NFC has been mentioned in a broader discussion on PRECIS/RFC8264/RFC8265.
The IdentifierClass of RFC 8264 explicitly disallows all these "security
land mines": https://www.rfc-editor.org/rfc/rfc8264.html#section-4.2.3
https://www.rfc-editor.org/rfc/rfc8264.html#section-12
Oh, good. I was just getting worried when discussion on the list
seemed to be treating NFC as a silver bullet, and people were
suggesting that the canonicalization should be done both by readers
and writers of /etc/passwd --- which would imply linking libunicode
into setuid programs like sudo and login, with the (to my view)
inevitable results of hilarity ensuing.

As I look at RFC 8264, I note that it does not take a position about
which version of Unicode should be considered canonical, and in fact
talks about one of the features (tm) of RFC 8264 being that it is
agile with respect to newer versions of Unicode.

However, it should be noted that RFC 8264 also states that code points
which are not defined in whatever version of Unicode is supported by
"the application" shall be disallowed. From Debian's perspective,
though, if we are going to take a position about what version of
Unicode should be supported by "the application(s)" that read and
write /etc/passwd, we *will* need to take a position on what version
of Unicode should be supported, and therefore, what set of characters
will be disallowed.
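
The version dependence is easy to see in practice. As a sketch, Python's
unicodedata module exposes the Unicode version of its own database, and
code points that this version does not define are reported with general
category Cn. The function name undefined_code_points() is hypothetical,
and Cn also covers noncharacters, so this is only an approximation of
the RFC's "unassigned" rule.

```python
import unicodedata

# The Unicode version this interpreter's database implements; a different
# Python (or a different library) may answer differently for new characters.
UNICODE_VERSION = unicodedata.unidata_version


def undefined_code_points(name: str) -> list[str]:
    """List code points that the local Unicode database does not define."""
    return [
        f"U+{ord(ch):04X}"
        for ch in name
        if unicodedata.category(ch) == "Cn"
    ]


print("Unicode database version:", UNICODE_VERSION)
print(undefined_code_points("a\u0378"))
```

Two hosts with different database versions could therefore disagree on
whether the same user name is acceptable.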

It also means that we need to be careful about what happens when we
want to upgrade to newer versions of Unicode in future versions of
Debian. If the system administrator wants to support more than one
version of Debian, then it would be advisable if the Unicode version
is something which is configurable, especially if the passwd entries
are being supplied via some kind of network protocol such as LDAP or
Hesiod (for those people who remember MIT Project Athena :-P).

There is also the question (admittedly, only an edge case) of what to do
if a newer version of Unicode disallows or removes characters. This rarely
happens, but it has in the past (in particular in the case of various
security disasters, or in the case of characters getting deprecated in
favor of newer characters, many of which are mentioned in RFC 8264).
So we can probably just ignore this case and hope that the Unicode
consortium will be more careful in the future, but I thought I'd
just mention it.

The bottom line is that while I am sympathetic to the desire to
support Unicode --- heck, I was one of the primary drivers of
libunicode into the kernel so we could support case folding for more
than just the ASCII character set --- the meme of "One does not simply
walk into Mordor" also applies for "adopting Unicode".

And I am reminded of one of my IETF mentors, who was an
Internationalization expert, telling me two decades ago that, late at
night, in the bar after a standards meeting, one of the things that
I18N folks would say, just amongst themselves, was, "It would be
easier just to teach everyone English" --- and this was with I18N
experts who understood everything that was involved in doing full I18N
support. No doubt this was only half-joking, but I think the point is
valid.

So if we're going to do this, let's do it right. :-)

- Ted
Simon Josefsson
2024-12-10 17:10:02 UTC
Permalink
Post by Theodore Ts'o
However, it should be noted that RFC 8264 also states that code points
which are not defined in whatever version of Unicode is supported by
"the application" shall be disallowed. From Debian's perspective,
though, if we are going to take a position about what version of
Unicode should be supported by "the application(s)" that read and
write /etc/passwd, we *will* need to take a position on what version
of Unicode should be supported, and therefore, what set of characters
will be disallowed.
A possible position may be to treat code points that are the subject of
version mismatching to be undefined. This is how IDNA resolved the same
problem, and PRECIS inherited this. While I protested about that
approach many years ago as libidn maintainer when IDNA2003 was
hard-coded to use Unicode 3.2, I think today that the approach is
reasonable since Unicode has maintained good stability. We've done a
couple of Unicode version bumps in libidn2 and interop with other IDN
implementations -- that typically always use some other Unicode version
-- is good enough to not cause serious breakage. I would expect the
same to be true for PRECIS usernames too. Hostnames are hashed and are
subject to string comparisons, just like usernames, so we have some
experience to build on here.

I would involve cross-distribution discussion about this though.
Perhaps the /etc/passwd APIs affect some POSIX specifications, and a
non-ASCII extension could be proposed.

/Simon
Theodore Ts'o
2024-12-10 18:20:01 UTC
Permalink
Post by Simon Josefsson
I would involve cross-distribution discussion about this though.
Perhaps the /etc/passwd APIs affect some POSIX specifications, and a
non-ASCII extension could be proposed.
Yeah, good point. If the scope is going to include passwd entries
that are distributed via network protocols like LDAP, then we need to
worry about sites that support other Linux distributions beyond just
Debian --- or for that matter, sites that need to support Linux as
well as legacy Unix systems like AIX or Solaris.

Of course, we could just exclude them from the scope and say that if
you are using LDAP, then you MUST only use ASCII characters in the
username, given that POSIX has decided to run away from the I18N
problems wrt usernames. That might be the simpler approach, unless
we want to drive something that could eventually be adopted by POSIX.

- Ted
Marc Haber
2024-12-10 20:30:01 UTC
Permalink
Post by Theodore Ts'o
Yeah, good point. If the scope is going to include passwd entries
that are distributed via network protocols like LDAP, then we need to
worry about sites that support other Linux distributions beyond just
Debian --- or for that matter, sites that need to support Linux as
well as legacy Unix systems like AIX or Solaris.
Even if we had full Unicode support for anything using /etc/passwd, a
site is always free to restrict itself to us-ascii usernames. Same with
POSIX, in my understanding we would still be POSIX compliant if we had
full Unicode support for usernames, because POSIX defines the minimum
of things a system MUST support, but it is always free to support
more. Or, at least I hope so.

But things are moving by shadow upstream taking a user-hostile stance,
willing to take away freedom. I must be fine with that because I
cannot change it. But I don't need to like it.

Greetings
Marc
--
----------------------------------------------------------------------------
Marc Haber | " Questions are the | Mailadresse im Header
Rhein-Neckar, DE | Beginning of Wisdom " |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402
Theodore Ts'o
2024-12-12 16:10:01 UTC
Permalink
Post by Marc Haber
But things are moving by shadow upstream taking a user-hostile stance,
willing to take away freedom. I must be fine with that because I
cannot change it. But I don't need to like it.
As a suggestion, we might make more forward progress if we assume good
faith and accept that other people might have different priorities
than we do. I could easily see that shadow, being a security-related
package, would consider encouraging something that could lead to
security bugs or just other random breakage as "user-hostile".

I am reminded of Professor Jerome Saltzer, who was responsible for the
overall technical architecture for MIT's Project Athena, insisting
that he be assigned the username Saltzer. He theorized that this
*would* cause breakage (for a long time, usernames were assumed
to be always lowercase ASCII, e-mail localparts were case
insensitive, and usernames were case sensitive), but that since he was
(a) a Professor, and (b) responsible for the technical architecture
for Project Athena, programmers would be incentivized to fix the
problems when they inevitably showed up. As I recall, we didn't
let students choose mixed-case usernames for a while, since there was
presumed to be breakage; Professor Saltzer's username was a special
case.

If there are brave people who want to use Unicode characters (for
bonus points, they could try using "unofficial" characters such as the
Klingon script), they could be the first to find bugs, and report
them. And if they suffer from security breaches, they would know what
they were getting into. (And we salute them for their courage. :-)

Perhaps at some future stable Debian release (not Trixie), we could
enable it by default. But I really do think we need to do some
technical work, including not requiring adding libunicode as a required
package, but having a minimal security unicode library that can be
used by privileged programs first.

Cheers,

- Ted
Marc Haber
2024-12-13 11:30:01 UTC
Permalink
Post by Theodore Ts'o
Post by Marc Haber
But things are moving by shadow upstream taking a user-hostile stance,
willing to take away freedom. I must be fine with that because I
cannot change it. But I don't need to like it.
As a suggestion, we might make more forward progress if we assume good
faith and accept that other people might have different priorities
than we do. I could easily see that shadow, being a security-related
package, would consider encouraging something that could lead to
security bugs or just other random breakage as "user-hostile".
They are planning to remove the --badname option from useradd, making
it impossible to even try UTF-8 user names, without patching useradd.
And if I was in Chris' shoes, I would probably refrain from doing so
as well.

And shadow would be the canonical place to do the PRECIS normalization
at least for comparing usernames. That's something they wouldn't do.
Post by Theodore Ts'o
Perhaps at some future stable Debian release (not Trixie), we could
enable it by default.
There won't be such an option for us to enable.

I need to be fine with that because I cannot change it. But I don't
need to like it.

Greetings
Marc
Michael Stone
2024-12-13 15:10:01 UTC
Permalink
Post by Marc Haber
They are planning to remove the --badname option from useradd, making
it impossible to even try UTF-8 user names, without patching useradd.
Or edit the passwd file (vipw), or use any non-passwd-file
authentication mechanism, or use a different user management tool, etc.
I think you're overemphasizing the importance of the useradd command
here--it just acts as a convenience and sets some baseline policies;
it's not actually essential for adding a user. If you don't like the
policy that useradd sets...just don't use it.
Peter Pentchev
2024-12-13 17:10:01 UTC
Permalink
Post by Marc Haber
They are planning to remove the --badname option from useradd, making
it impossible to even try UTF-8 user names, without patching useradd.
Or edit the passwd file (vipw), or use any non-passwd-file authentication
mechanism, or use a different user management tool, etc.
I think you're overemphasizing the importance of the useradd command
here--it just acts as a convenience and sets some baseline policies;
it's not actually essential for adding a user. If you don't like the policy
that useradd sets...just don't use it.
In the context of the whole thread, are you suggesting that adduser(1)
should be changed to use something other than useradd(8) under the hood?
Sigh, that's adduser(8) too, of course.

G'luck,
Peter
--
Peter Pentchev ***@ringlet.net ***@debian.org ***@morpheusly.com
PGP key: https://www.ringlet.net/roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13
Peter Pentchev
2024-12-13 17:10:01 UTC
Permalink
Post by Marc Haber
They are planning to remove the --badname option from useradd, making
it impossible to even try UTF-8 user names, without patching useradd.
Or edit the passwd file (vipw), or use any non-passwd-file authentication
mechanism, or use a different user management tool, etc.
I think you're overemphasizing the importance of the useradd command
here--it just acts as a convenience and sets some baseline policies;
it's not actually essential for adding a user. If you don't like the policy
that useradd sets...just don't use it.
In the context of the whole thread, are you suggesting that adduser(1)
should be changed to use something other than useradd(8) under the hood?

G'luck,
Peter
Marc Haber
2024-12-13 20:40:01 UTC
Permalink
In the context of the whole thread, are you suggesting that adduser(1)
should be changed to use something other than useradd(8) under the hood?
adduser will not do that. Doing so is nonsense.

Greetings
Marc
Michael Stone
2024-12-14 04:10:01 UTC
Permalink
Post by Marc Haber
They are planning to remove the --badname option from useradd, making
it impossible to even try UTF-8 user names, without patching useradd.
Or edit the passwd file (vipw), or use any non-passwd-file authentication
mechanism, or use a different user management tool, etc.
I think you're overemphasizing the importance of the useradd command
here--it just acts as a convenience and sets some baseline policies;
it's not actually essential for adding a user. If you don't like the policy
that useradd sets...just don't use it.
In the context of the whole thread, are you suggesting that adduser(1)
should be changed to use something other than useradd(8) under the hood?
No, I'm suggesting that rhetoric asserting that any adduser/useradd
policy could constrain people is overblown because users can be added to
the system without using either of those tools. The tools' policies
should reflect what is safest and most sensible for the majority of
users, but if someone wants to do something different there is nothing
stopping them from doing so.

The claim at the top of this subthread is that some useradd change would
prevent people from experimenting with UTF-8 usernames. As an exercise I
just created UTF-8 users and groups entirely without useradd/adduser:
getent passwd 1144
💩:*:1144:1144::/nowhere:/bin/false
getent group 1144
ls -l /tmp/samplefile
-rw-r--r-- 1 💩 💩 0 Dec 13 22:42 /tmp/samplefile

On an individual basis there aren't so many steps that creating a user
manually is a big deal, or that a script dedicated to creating users
according to the policies of a particular environment would be overly
complicated. For a large organization I question the idea that user
accounts would be managed by adduser/useradd at all.

Charles Plessy
2024-12-11 01:10:01 UTC
Permalink
Hello everybody,

sorry if it is too naive, but is there an easy way to determine for a
given Unicode string if it can be typed from a single keyboard layout or
produced by a text-to-speech system? People who want a username because
of SSH, email and su will want to be able to input it. At the other
end of the range of use cases, they can use a computer for years without
seeing their username.

If we take one step back and look at the future: will usernames
still be a thing in 10 years? If not, then a simple heuristic that
satisfies more than half of the users may be enough...

Have a nice day,

Charles
--
Charles Plessy Nagahama, Yomitan, Okinawa, Japan
Debian Med packaging team http://www.debian.org/devel/debian-med
Tooting from home https://framapiaf.org/@charles_plessy
- You do not have my permission to use this email to train an AI -
Jeremy Stanley
2024-12-11 01:50:01 UTC
Permalink
On 2024-12-11 10:04:44 +0900 (+0900), Charles Plessy wrote:
[...]
is there an easy way to determine for a given Unicode string if it
can be typed from a single keyboard layout
[...]

Do keyboards with a "compose" key count? There's plenty of glyphs I
can type which aren't depicted directly on my keyboard's keycaps,
after all.
--
Jeremy Stanley
Marc Haber
2024-12-11 08:20:01 UTC
Permalink
Post by Charles Plessy
sorry if it is too naive, but is there an easy way to determine for a
given Unicode string if it can be typed from a single keyboard layout or
produced by a text-to-speech system? People who want a username because
of SSH, email and su will want to be able to input it.
That's easy, just choose a user name for YOU that YOU can type on YOUR
keyboard. Why would anybody choose a username that is impossible to use
in their own locale?

Greetings
Marc
--
----------------------------------------------------------------------------
Marc Haber | " Questions are the | Mailadresse im Header
Rhein-Neckar, DE | Beginning of Wisdom " |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402
Henrik Ahlgren
2024-12-12 18:40:01 UTC
Permalink
Post by Marc Haber
That's easy, just choose a user name for YOU that YOU can type on YOUR
keyboard. Why would anybody choose a username that is impossible to use
in their own locale?
I don't see many problems with single-user machines, especially
security-related ones. But think of multi-user environments. Imagine, as a non-Chinese
speaking Westerner, needing to chown a file to a colleague called 陈成. Even
if you have Pinyin configured, you might not even know how to type it. (Of
course, you have the same problem with filenames that have essentially no
limitations. I know from experience how hard it is to type names in Arabic
which I can't read.)
Marc Haber
2024-12-13 11:30:01 UTC
Permalink
Post by Henrik Ahlgren
I don't see many problems with single-user machines, especially
security-related ones. But think of multi-user environments. Imagine, as a non-Chinese
speaking Westerner, needing to chown a file to a colleague called 陈成.
I would type "chown 陈成 <filename>", pasting the user name from the
written request or probably from /etc/passwd. Or I would ask the system
administrator for a solution.

I see your argument, but I'd also see that as an issue that the system
administrator choosing the user names needs to solve. It's nothing that
we as a distribution should solve.

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Stephan Seitz
2024-12-13 12:10:01 UTC
Permalink
Post by Henrik Ahlgren
I don't see many problems with single-user machines, especially
security-related ones. But think of multi-user environments. Imagine, as a non-Chinese
speaking Westerner, needing to chown a file to a colleague called 陈成. Even
You are joking, aren’t you? You could use „getent passwd” and copy
& paste the username. Or use the user id.
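
As a rough illustration of the "use the user id" route: the sketch below maps login names to numeric UIDs from passwd(5)-formatted text, so a name that is hard to type can be replaced by its UID in a chown call. The sample entries and the helper name parse_passwd are made up for this example; on a real system, getent passwd or Python's pwd.getpwnam() would do the lookup.

```python
def parse_passwd(text):
    """Map login name -> numeric UID from passwd(5)-formatted text."""
    uids = {}
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 7:
            continue  # blank or malformed entry, skip it
        name, uid = fields[0], fields[2]
        if uid.isdigit():
            uids[name] = int(uid)
    return uids


# Hypothetical sample data, not taken from any real system.
SAMPLE = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "陈成:x:1001:1001:Chen Cheng:/home/chen:/bin/bash\n"
)

uids = parse_passwd(SAMPLE)
print(uids["陈成"])  # the UID can then be passed to chown instead of the name
```

With the UID in hand, "chown 1001 <filename>" needs no input method at all.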

With this argument passwd should refuse to set the password to „12345”.

And no one in this thread has said that you *have* to use non-ASCII
usernames. But some people don’t want to give you a chance to do it.

I don’t need non-ASCII for my name but I would never use a system that
would force me to rewrite my name in ASCII because it is so utterly
broken in 2024. I bet there is no problem on Windows systems.

Stephan
--
| Stephan Seitz E-Mail: ***@rootsland.net |
| If your life was a horse, you'd have to shoot it. |
IOhannes m zmölnig
2024-12-13 13:00:02 UTC
Permalink
I don’t need non-ASCII for my name but I would never use a system that would force me to rewrite my name in ASCII because it is so utterly broken in 2024. I bet there is no problem on Windows systems.
Stephan
Incidentally, my kid's school rolled out their school laptops this week, which of course come with Windows 11 preinstalled (as a side note, I am now looking forward to four years of "digital competence training" consisting entirely of Windows (basics), PowerPoint, Word and Excel; but that's another story), and *of course* all usernames have been normalized to lowercase ASCII.

So my take is: no, in Redmond you would use ASCII for the username.

Oh, and my name does have non-ASCII characters, and I have been using Unicode in my display name for the last 20 years.
I do remember problems in the 90s.
But those are long past.


mfh.her.fsr
IOhannes
Stephan Seitz
2024-12-13 13:40:01 UTC
Permalink
Post by IOhannes m zmölnig
Incidentally, my kid's school rolled out their school laptops this week,
which of course come with Windows 11 preinstalled (as a side note, I am now
looking forward to four years of "digital competence training"
consisting entirely of Windows(basics), PowerPoint, Word and Excel; but
that's another story), and *of course* all usernames have been
normalized to lowercase ASCII.
I’m quite sure I have never seen an Asian Windows where you had to use
ASCII for your username.

Stephan
--
| Stephan Seitz E-Mail: ***@rootsland.net |
| If your life was a horse, you'd have to shoot it. |
s***@free.fr
2024-12-13 14:30:01 UTC
Permalink
Hi,
Post by IOhannes m zmölnig
and *of course* all usernames have been normalized to lowercase ASCII.
I just took a look at some reasonably recent government-issued IDs and
it turns out the French ones normalized my name to uppercase
whatever-some-clerk-had-on-their-typewriter-keyboard-late-last-millennium,
dropping the accent from the second word of my name. My father's birth
certificate is handwritten and has the accent. My Canadian IDs are
better as they retained the name as I wrote it in the application
form. I don't remember if the French online application forms for IDs
allowed accents in names but I would not be too surprised if they
didn't. I might start a procedure to try to get that officially fixed in
2025, as there is another issue with the way my name is registered with
some administrations that occasionally complicates my life. I'm pretty
confident the other issue will get fixed, much less so the accent one,
though the law should be on my side, which here means that I could well
sue the government, win the lawsuit and the subsequent ones up to the
ECJ and back and still not get that fixed within my lifetime.

I was going to write that on payment cards you can't have accents in
your name. Wrong. I managed to get one that reproduced it. I don't use
that one much online so I don't know if entering my name with the accent
actually works somewhere when paying with that card.

I would not try too hard to get non-ascii characters in that convenient
computer identifier often named "login name" rather than "user name".
You can't get them in the local part of an e-mail address and not many
people complain. You can't get them in IRC nicknames. You can't get them
in the machine readable part of your ICAO-compliant government-issued
IDs. It's still better than just numbers. I'm fine with that as long as
my name is properly written in the places that actually matter.

If you need a name for that option, --allow-non-ascii should be neutral
enough.
--
Julien Plissonneau Duquène
Étienne Mollier
2024-12-03 19:50:01 UTC
Permalink
Hi Marc,
Post by Marc Haber
thank you all for your contributions to this discussion. I have now
finally understood¹ that it is not enough to try creating a UTF-8
encoded user name and see that it correctly shows up in /etc/passwd to
declare UTF-8 support. Please forgive me for not replying to all of you
in this thread individually, I have read everything and if I didn't cater
for your arguments in this message please feel free to remind me.
Thank you for having taken the time to investigate this issue,
as a person concerned, I much appreciated it. Let's see whether
I can contribute one last useful item.
Post by Marc Haber
I'll probably deprecate --allow-bad-names in favor of something that
doesn't use the word "bad" (suggestions appreciated). Otoh, adduser in
the Red Hat World uses --badname to allow such names as well.
The problem is not the name, but the character set, so perhaps
--allow-bad-characters will be better perceived. If you want to
also avoid "bad", maybe try --allow-ambiguous-characters, or
--allow-extended-character-set? The last one is perhaps a bit
long winded, but also sounds more accurate than the rest. What
do you think of these approaches?

Have a nice day, :)
--
.''`. Étienne Mollier <***@debian.org>
: :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from /dev/pts/5, please excuse my verbosity
`- on air: DGM - Solitude
Marc Haber
2024-12-03 21:10:02 UTC
Permalink
Post by Étienne Mollier
Post by Marc Haber
I'll probably deprecate --allow-bad-names in favor of something that
doesn't use the word "bad" (suggestions appreciated). Otoh, adduser in
the Red Hat World uses --badname to allow such names as well.
The problem is not the name, but the character set, so perhaps
--allow-bad-characters will be better perceived. If you want to
also avoid "bad", maybe try --allow-ambiguous-characters, or
--allow-extended-character-set? The last one is perhaps a bit
long winded, but also sounds more accurate than the rest. What
do you think of these approaches?
Extended sounds good, maybe even "unicode"? or "international"?

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Étienne Mollier
2024-12-03 21:30:01 UTC
Permalink
Post by Marc Haber
Post by Étienne Mollier
The problem is not the name, but the character set, so perhaps
--allow-bad-characters will be better perceived. If you want to
also avoid "bad", maybe try --allow-ambiguous-characters, or
--allow-extended-character-set? The last one is perhaps a bit
long winded, but also sounds more accurate than the rest. What
do you think of these approaches?
Extended sounds good, maybe even "unicode"? or "international"?
I avoided unicode as it would include ascii and the safe subset
documented by posix, and I also considered the unlikely case
where something were to replace unicode. "international" would
make the name technology agnostic, but there is still the case
about also covering the posix-safe subset. Borrowing the idea
from the other branch of the thread, --allow-unsafe-characters
sounds fine and would carry the idea that certain characters
could cause issues, if used in a login name.

Have a nice day, :)
--
.''`. Étienne Mollier <***@debian.org>
: :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from /dev/pts/1, please excuse my verbosity
`- on air: Atlas - Hemifran
Alejandro Colomar
2024-12-05 15:10:01 UTC
Permalink
Post by Marc Haber
Post by Étienne Mollier
Post by Marc Haber
I'll probably deprecate --allow-bad-names in favor of something that
doesn't use the word "bad" (suggestions appreciated). Otoh, adduser in
the Red Hat World uses --badname to allow such names as well.
The problem is not the name, but the character set, so perhaps
--allow-bad-characters will be better perceived. If you want to
also avoid "bad", maybe try --allow-ambiguous-characters, or
--allow-extended-character-set? The last one is perhaps a bit
long winded, but also sounds more accurate than the rest. What
do you think of these approaches?
Extended sounds good, maybe even "unicode"? or "international"?
I prefer "bad". It gives the implicit message that it's bad to use that
flag. If you find it offensive, then how about --allow-unsafe-names?

I oppose "unicode", "extended", or "international", as all of them
remove the connotation that you should not use that flag.

Anyway, I vote for removing the possibility of using unsafe names, and
not even exposing a flag.

Have a lovely day!
Alex
--
<https://www.alejandro-colomar.es/>
Chris Hofstaedtler
2024-12-09 17:10:01 UTC
Permalink
Post by Marc Haber
Post by Étienne Mollier
Post by Marc Haber
I'll probably deprecate --allow-bad-names in favor of something that
doesn't use the word "bad" (suggestions appreciated). Otoh, adduser in
the Red Hat World uses --badname to allow such names as well.
The problem is not the name, but the character set, so perhaps
--allow-bad-characters will be better perceived. If you want to
also avoid "bad", maybe try --allow-ambiguous-characters, or
--allow-extended-character-set? The last one is perhaps a bit
long winded, but also sounds more accurate than the rest. What
do you think of these approaches?
Extended sounds good, maybe even "unicode"? or "international"?
I echo Alejandro's concerns. We should stop having the flag
completely, not encourage using it.

If the default restrictions are too tight, then we need to work on
that. What we should not do is introduce a badly tested, because
mostly unused, codepath that will introduce bugs in all sorts of
places.
IOW: if we move towards better character support, we need to do that
by allowing it always. Same for longer names.

Chris
Marc Haber
2024-12-09 20:30:01 UTC
Permalink
On Mon, 9 Dec 2024 18:08:33 +0100, Chris Hofstaedtler
Post by Chris Hofstaedtler
I echo Alejandro's concerns. We should stop having the flag
completely, not encourage using it.
I violently disagree. But I have to accept this.
Post by Chris Hofstaedtler
IOW: if we move towards better character support, we need to do that
by allowing it always. Same for longer names.
I think that our distinction between system users and "normal" users
is fine. No one needs a package generating "weird" user names.

Greetings
Marc
--
----------------------------------------------------------------------------
Marc Haber | " Questions are the | Mailadresse im Header
Rhein-Neckar, DE | Beginning of Wisdom " |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402
Chris Hofstaedtler
2024-12-10 11:20:01 UTC
Permalink
Post by Marc Haber
On Mon, 9 Dec 2024 18:08:33 +0100, Chris Hofstaedtler
Post by Chris Hofstaedtler
I echo Alejandro's concerns. We should stop having the flag
completely, not encourage using it.
I violently disagree. But I have to accept this.
Post by Chris Hofstaedtler
IOW: if we move towards better character support, we need to do that
by allowing it always. Same for longer names.
I think that our distinction between system users and "normal" users
is fine. No one needs a package generating "weird" user names.
I think we're speaking past each other here.

Packages can already create absolutely broken usernames today, if
they want.

To me, the question is more, why do we have a flag that, if used,
allows you to break /etc/{passwd,shadow,group,gshadow} completely?

Chris
Marc Haber
2024-12-10 14:30:01 UTC
Permalink
On Tue, 10 Dec 2024 12:10:14 +0100, Chris Hofstaedtler
Post by Chris Hofstaedtler
To me, the question is more, why do we have a flag that, if used,
allows you to break /etc/{passwd,shadow,group,gshadow} completely?
The user-oriented solution would be to identify the things that break
/etc/passwd and to forbid these. Just forbidding everything is heading
the wrong direction.
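
To make "identify the things that break /etc/passwd" a bit more concrete, here is a minimal sketch that only flags names which would corrupt the file format itself: passwd(5) stores one entry per line with colon-separated fields. The function name passwd_breaking_reasons is invented for this example, and the list of checks is an assumption, not what adduser or useradd actually implement.

```python
def passwd_breaking_reasons(name):
    """Return the reasons why `name` would corrupt a passwd(5)-style file."""
    reasons = []
    if name == "":
        reasons.append("empty name")
    if ":" in name:
        reasons.append("contains ':' (the field separator)")
    if "\n" in name:
        reasons.append("contains a newline (the record separator)")
    if "\x00" in name:
        reasons.append("contains a NUL byte")
    return reasons


for candidate in ["marc", "étienne", "evil:0:0", "two\nlines"]:
    print(repr(candidate), passwd_breaking_reasons(candidate))
```

Everything else (shell metacharacters, homographs, path components) is a policy question on top of this and would need separate checks.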

Greetings
Marc
--
----------------------------------------------------------------------------
Marc Haber | " Questions are the | Mailadresse im Header
Rhein-Neckar, DE | Beginning of Wisdom " |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402
Marc Haber
2024-12-03 21:10:01 UTC
Permalink
Post by Marc Haber
I'll probably deprecate --allow-bad-names in favor of something that
doesn't use the word "bad" (suggestions appreciated). Otoh, adduser in
the Red Hat World uses --badname to allow such names as well.
--allow-unsafe-names might be a more helpful name.
Indeed. Would shadow Upstream go with that?

Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Soren Stoutner
2024-12-03 22:20:01 UTC
Permalink
I appreciate your being careful and deliberate about this instead of rushing
into a solution that brings unintended consequences. But I also appreciate
your taking the time to engage with the issue instead of just ignoring it.
Post by Marc Haber
Hi,
thank you all for your contributions to this discussion. I have now
finally understood¹ that it is not enough to try creating a UTF-8
encoded user name and see that it correctly shows up in /etc/passwd to
declare UTF-8 support. Please forgive me for not replying to all of you
in this thread individually, I have read everything and if I didn't cater
for your arguments in this message please feel free to remind me.
https://lists.debian.org/debian-devel/2024/11/msg00491.html correctly
outlines that homograph characters (such as é (UTF-8 0xC3 0xA9) and the
lookalike é (0x65 0xCC 0x81)) are not only a nuisance. At the least,
adduser should reject creating étienne if étienne already exists - those
are different user names but look the same, and if you don't
cut-and-paste user names instead of typing them you're bound to hit the
wrong user depending on HOW you type and what input medium you use. Not
good.
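
The é example can be reproduced in a few lines. This is only a sketch of the "reject if a lookalike already exists" idea, using NFC normalization from Python's standard unicodedata module; the helper names nfc and collides are invented here, and this is not code from adduser.

```python
import unicodedata


def nfc(name):
    """Normalize a name to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", name)


def collides(new_name, existing_names):
    """True if new_name equals an existing name after NFC normalization."""
    target = nfc(new_name)
    return any(nfc(existing) == target for existing in existing_names)


precomposed = "\u00e9tienne"  # é as one code point, UTF-8 0xC3 0xA9
decomposed = "e\u0301tienne"  # e + combining acute, UTF-8 0x65 0xCC 0x81

print(precomposed == decomposed)            # False: different code points
print(collides(decomposed, [precomposed]))  # True: same name after NFC
```

Note that this only catches canonically equivalent sequences. Lookalikes from different scripts (say, Cyrillic "а" versus Latin "a") stay distinct under NFC and would need a separate confusables check.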
https://wiki.debian.org/UserAccounts and
https://wiki.debian.org/UserAccountsPhilosophy are updated accordingly.
After understanding this, I must admit that what's currently left active
on the adduser team (me) doesn't have the capacity to implement this
properly and in time for trixie. To make things worse, the
Unicode::Precis module, which should be in Debian as
libunicode-precis-perl (but isn't) hasn't seen an upstream release in
more than five years.
Additionally, I don't see myself in the situation of writing a proper
checker for the RFC 8264 IdentifierClass (Chapter 4.2) at the moment
since I don't have the time to check out which \p{Foo} character classes
match the classes given in the RFC.
I would appreciate volunteers to help here, but first I need to bring
some sense in adduser's current state of affairs to make an unstable
upload that can eventually migrate to testing.
- adduser --system's user name validation will not change
- I'll make sure that adduser <normal user account> doesn't accept
  UTF-8 user names, bringing it closer to systemd's notion of a valid
  user name
- adduser --allow-bad-names will still allow UTF-8 usernames, not doing
  normalization. I will document this and make it clear that the local
  admin needs to make sure that they don't allow things they don't want
  to have
- adduser --allow-all-names will just verbatim pass all user names to
  useradd.
All this will be documented in the man page, in README.Debian and/or the
Wiki after the code passes the test suite again.
I'll probably deprecate --allow-bad-names in favor of something that
doesn't use the word "bad" (suggestions appreciated). Otoh, adduser in
the Red Hat World uses --badname to allow such names as well.
I would love to hear your opinion. Silence is agreement ;-)
Greetings
Marc
¹ RFC 8264, RFC 8265, and Unicode TR 15 linked in this thread were
educational for me
--
Soren Stoutner
***@debian.org
Ben Kallus
2024-12-08 20:40:01 UTC
Permalink
Hi everyone!

I second calling it "allow-unsafe-names" for the following reasons:

1. Many programs assume that usernames are so inert that they can be
used in shell strings without proper escaping. For example, a user
named $(touch /tmp/pwn) will create /tmp/pwn upon the first launch of
an interactive bash, because the default bash PS1 interpolates the
username before doing command substitution. adduser doesn't allow
whitespace or forward slashes in usernames, even with
--allow-all-names, but you can still get the same behavior with the
username $(>`printf$IFS"\x2ftmp\x2fpwn"`). How this works is left as
an exercise for the reader. Once you figure it out, see if you can
out-golf us :)

2. There's a path traversal bug in useradd (but not adduser) that can
be triggered by usernames beginning with "../". For example, for the
username "../bin/brangal", useradd will create a home directory at
/home/../bin/brangal (i.e. /bin/brangal). This can be used to place a
directory owned by the new user nearly anywhere on the system.
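
Both points can be illustrated with a short sketch. It assumes home directories are built as <base>/<username> with /home as the base, which is an assumption of this example rather than a statement about useradd's internals; the function names are made up. The first helper flags names whose home directory would resolve outside the base, the second quotes a name before it is interpolated into a shell command string.

```python
import posixpath
import shlex


def home_escapes_base(username, base="/home"):
    """True if base/username does not resolve to a path strictly below base."""
    candidate = posixpath.normpath(posixpath.join(base, username))
    if candidate == base:
        return True  # e.g. empty name or "."
    return posixpath.commonpath([base, candidate]) != base


def shell_quoted(username):
    """Quote a username for safe interpolation into a POSIX shell command."""
    return shlex.quote(username)


print(home_escapes_base("../bin/brangal"))  # True: resolves to /bin/brangal
print(home_escapes_base("brangal"))         # False: stays below /home
print(shell_quoted("$(touch /tmp/pwn)"))    # single-quoted, no substitution
```

Quoting helps programs that build shell strings themselves. It does nothing for the bash prompt case described in point 1, where the expansion happens inside the shell.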

-Ben Kallus && Jonah Weinbaum
Chris Hofstaedtler
2024-12-09 17:10:01 UTC
Permalink
Post by Ben Kallus
I second calling it "allow-unsafe-names"
This was never on the table, and shadow upstream might even drop the
entire "support" for having bad names.
[..]
Post by Ben Kallus
2. There's a path traversal bug in useradd (but not adduser) that can
be triggered by usernames beginning with "../".
It's not a bug if you disable the guard rails.

Chris
Marc Haber
2024-12-09 20:30:01 UTC
Permalink
On Mon, 9 Dec 2024 18:04:52 +0100, Chris Hofstaedtler
Post by Chris Hofstaedtler
This was never on the table, and shadow upstream might even drop the
entire "support" for having bad names.
Just for the record, I consider this a kneejerk reaction that moves
the world backwards. It's sad.
--
----------------------------------------------------------------------------
Marc Haber | " Questions are the | Mailadresse im Header
Rhein-Neckar, DE | Beginning of Wisdom " |
Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402