Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for internationalized email addresses #10522

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

aphillips
Copy link
Contributor

@aphillips aphillips commented Jul 25, 2024

Replaces #5799.

Changes the ABNF and surrounding text to include support for non-ASCII characters on both the left and right side of the email address. Discussion in #4562 includes the genesis of these changes.

  • At least two implementers are interested (and none opposed):
  • Tests are written and can be reviewed and commented upon at:
  • Implementation bugs are filed:
    • Chrome: …
    • Firefox: …
    • Safari: …

(See WHATWG Working Mode: Changes for more details.)


/input.html ( diff )
/references.html ( diff )

@annevk
Copy link
Member

annevk commented Aug 15, 2024

I would like to second the comment @hsivonen made in #5799 (comment). In particular I think that we need to replace the ABNF with two algorithms:

  1. Is this input valid? This takes the email address, splits it on @ and attempts to validate both sides. This cannot rely on "valid ___domain string" as that is not a primitive browsers have. It will need to use the host parser.
  2. Convert this input for submission. This needs to do the normalization @hsivonen discusses on the host side.

I suspect what we want is that 1 is a parsing job resulting in failure or a data structure and 2 is a serialization job of that data structure. That seems ideal.

WebKit can be considered supportive of fixing this and I'd also be willing to help nail something down along these lines, but the discussion in #4562 has made me rather unmotivated. 🙁

@annevk annevk added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Aug 15, 2024
@hsivonen
Copy link
Member

hsivonen commented Aug 15, 2024

@annevk , We don't necessarily need two algorithms but one algorithm that converts input to submittable form but can fail. Also, we need clarity on what the value exposes to JS (when the conversion succeeds and when it fails) and what form pattern matches on.

@aphillips , What's the purpose of hosting "." out of utext when localpart is defined as in the PR? AFAICT, the ABNF would mean the same thing if "." was in utext. Am I reading the ABNF wrong or is it intentional that the ABNF doesn't end up ruling out leading, trailing, or consecutive dots? (In this comment, I'm not taking a position on whether it should.)

@hsivonen
Copy link
Member

The algorithm could look like this:

Find the last instance of character U+0040 @ in input. If there isn’t one, return failure.

Let the part of input before the last U+0040 @ character be local and let the part after the last U+0040 @ character be ___domain.

If local is the empty string, return failure.

If local contains any of
Unpaired surrogates
U+0020 or below
U+007F
U+0022 ”
U+0028 (
U+0029 )
U+002C ,
U+003A :
U+003B ;
U+003C <
U+003E >
U+0040 @
U+005B [
U+005C
U+005D ]
return failure.

Let normalizedDomain be the result of https://url.spec.whatwg.org/#concept-___domain-to-ascii with ___domain and true.

If normalizedDomain is failure, return failure.

Let normalizedLocal be local normalized to Unicode Normalization Form C.

Return the concatenation of normalizedLocal, U+0040 @, and normalizedDomain.

@hsivonen
Copy link
Member

hsivonen commented Aug 15, 2024

I think pattern should match on the submittable string.

@hsivonen
Copy link
Member

Per discussion with @zcorpan , the algorithm should probably return failure if the label in the TLD position consists entirely of ASCII digits.

@annevk
Copy link
Member

annevk commented Aug 16, 2024

Oh, I missed that you don't run the host parser on the final argument. Why is that?

@hsivonen
Copy link
Member

Oh, I missed that you don't run the host parser on the final argument. Why is that?

Two reasons:

  1. Host parsing runs percent decoding, which seems inappropriate for the email address validation use case.
  2. Host parsing and RFC 5321 disagree on how IPv4 addresses are represented, and it would be weird to support IPv6 but not IPv4 after the @ sign. Since there aren't strong use cases for either flavor of IP addresses after the @ sign in the HTML form context, it seems simpler to support neither.

Rejecting domains whose TLD consists entirely of ASCII digits ends up rejecting things that look like URL-style IPv4 addresses that are wrong in email addresses.

@hsivonen
Copy link
Member

Demo of the above-described algorithm for experimentation

Code

The "Presentable" display isn't (at least at this time) a suggestion for browsers to change the user-entered visible contents of the field to the "Presentable" form, but it's what a server could derive from the submittable value by running ToUnicode (or mere labelwise Punycode decode) on the ___domain.

The point is to offer an opportunity to experiment with different input to see if the rejections seem appropriate.

Key differences compared to the current state of the PR:

  1. Algorithmic rather than BNF.
  2. The local part is normalized to NFC.
  3. The submittable representation of the ___domain part is ASCII.
  4. If the TLD is all ASCII digits, reject.

Notable non-differences:

  1. What's valid in the local part is intended to match the current state of the PR.
  2. What's valid in the ___domain part is intended to match the current state of the PR with the exception of the digit TLD check.

@hsivonen
Copy link
Member

Oh, and let's surface the only comment I put in the code:

Do we want to support TLDs with MX records or require at least one more level in the ___domain name?

@annevk
Copy link
Member

annevk commented Aug 20, 2024

I looked a bit more at not using the host parser:

  • We end up passing beStrict is true (aside: we'd have to make this a named argument if we're going to invoke this from outside of URL in this way). This ends up preventing domains such as test@test_test.example.com from sending email. This does not seem great to me. It would also be the first instance where we have a normative dependency on beStrct for user agents, I think.
  • We end up missing the forbidden code point check. That's okay as it's redundant with beStrict, but see above.
  • We end up not percent-decoding. Seems okay.
  • I think the TLD check you mention should be https://url.spec.whatwg.org/#ends-in-a-number-checker for parity.

I also see the host parser has an assert on non-empty input. Not sure if ToASCII relies on that, but we probably want to double check input such as test@ or @ is handled properly.

@hsivonen
Copy link
Member

I looked a bit more at not using the host parser:

  • We end up passing beStrict is true (aside: we'd have to make this a named argument if we're going to invoke this from outside of URL in this way). This ends up preventing domains such as test@test_test.example.com from sending email. This does not seem great to me.

Not catching various non-DNS characters as typos when the ___domain is supposed to be for DNS naming isn't great, either. In the Gecko case, it would be easy to apply an ASCII deny list that's the STD3 ASCII deny list with the exception of allowing the underscore if underscore in domains is a thing that's relevant to input type=email. (From DNS usage that I've seen so far, the underscore seems to occur in various DNS-based protocols that look up stuff stored in TXT records, not MX records.)

It would also be the first instance where we have a normative dependency on beStrct for user agents, I think.

Part of what I'm trying to demo is showing how little code this demo needs when using the libraries that Gecko uses. That is, in terms of Gecko-developer-facing code, beStrict=true is not onerous. (Not sure what the binary size impact is if new combinations of code get inlined, but the code is supposed to be designed to be defensive against excessive repetition caused by inlining.)

I also see the host parser has an assert on non-empty input.

This is implied by beStrict=true by the way of VerifyDnsLength.

Not sure if ToASCII relies on that, but we probably want to double check input such as test@ or @ is handled properly.

As can be seen from my demo, these get rejected.

@hsivonen
Copy link
Member

AFAICT, an underscore in the ___domain part currently results in .validity reporting typeMismatch: true in Firefox, Safari, and Chrome. As far as I'm aware, this isn't a source of complaints, so it seems OK to keep using the STD3 ASCII deny list.

@aphillips
Copy link
Contributor Author

@hsivonen asked:

@aphillips , What's the purpose of hosting "." out of utext when localpart is defined as in the PR? AFAICT, the ABNF would mean the same thing if "." was in utext. Am I reading the ABNF wrong or is it intentional that the ABNF doesn't end up ruling out leading, trailing, or consecutive dots? (In this comment, I'm not taking a position on whether it should.)

I think it's a mistake, since backtracking through various specs leads to localpart using something like 1*utext *("." *utext). I'm liking what you propose elsewhere on the thread.

@hsivonen
Copy link
Member

@hsivonen asked:

@aphillips , What's the purpose of hosting "." out of utext when localpart is defined as in the PR? AFAICT, the ABNF would mean the same thing if "." was in utext. Am I reading the ABNF wrong or is it intentional that the ABNF doesn't end up ruling out leading, trailing, or consecutive dots? (In this comment, I'm not taking a position on whether it should.)

I think it's a mistake, since backtracking through various specs leads to localpart using something like 1*utext *("." *utext).

I now see that the BNF oddity comes from the pre-existing BNF in the spec. It seems prudent not to reject ASCII local parts that the ASCII-only formulation currently in the spec accepts, and the BNF in this PR and the algorithmic formulation in my previous comment here accomplish that.

I'm liking what you propose elsewhere on the thread.

Great!

@annevk said:

I'm OK with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.
Development

Successfully merging this pull request may close these issues.

3 participants