ICANN : IDN Committee : Briefing Paper on IDN Permissible Code Point Problems

Internationalized Domain Names (IDN) Committee

Briefing Paper on IDN Permissible Code Point Problems
27 February 2002

OVERVIEW

This briefing paper addresses a set of Internationalized Domain Name (IDN) policy issues referred to as "permissible code point" problems. The paper begins with a definition of the issue, and then summarizes some specific policy problems that appear likely to arise upon implementation of the IDNA standard that has recently been advanced by the IDN Working Group of the Internet Engineering Task Force (IETF).

Accompanying this paper is the committee's input to the IETF on ways to minimize these policy problems.

DEFINITIONS

(This paper starts from the understanding that the IETF is proceeding with consideration of the IDNA protocols, which are based upon a method for ASCII-compatible encoding of the character repertoire defined by the Unicode Standard. For purposes of this paper and the companion input to the IETF, the committee takes IDNA as a given.)

By "permissible code point" issues, we refer to policy problems that might arise from the use of certain non-ASCII characters included in the Unicode Standard as elements of an IDN domain name label. The draft IDN standards that have recently been advanced by the IETF's IDN working group would, as currently drafted, exclude the use of some, but not all, of these problematic characters. See, e.g., [IDNA], [NAMEPREP]. In the latest drafts of these standards, only a minimal number of characters would be excluded.

At present, the DNS host name specifications limit permissible code points in domain name labels to a restricted subset of 7-bit ASCII: the letters a-z and A-Z (interpreted in case-independent fashion), the digits 0-9, and the hyphen-minus ("-"). See [RFC1034], [RFC1123]. "LDH" is an abbreviation for "letters, digits, hyphen," and the term "LDH code points" refers to this set of characters.
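The LDH restriction can be stated as a simple membership test. The following is a minimal sketch, not a full hostname validator: the function names are hypothetical, and the sketch deliberately ignores other hostname rules (such as label length or the position of the hyphen) that this paper does not discuss.

```python
import string

# The LDH repertoire described above: letters, digits, hyphen-minus.
LDH_CODE_POINTS = frozenset(string.ascii_letters + string.digits + "-")


def is_ldh_label(label: str) -> bool:
    """Return True if the label is non-empty and uses only LDH code points."""
    return bool(label) and all(ch in LDH_CODE_POINTS for ch in label)


def same_ldh_label(a: str, b: str) -> bool:
    """Compare two LDH labels in case-independent fashion."""
    return a.lower() == b.lower()


print(is_ldh_label("example"))   # only letters
print(is_ldh_label("b\u00fccher"))  # contains U+00FC, outside the LDH set
print(same_ldh_label("ICANN", "icann"))
```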

The IETF's IDN Working Group is in the final stages of work on a standard that will allow the representation of "internationalized" characters in the DNS. See, e.g., [IDNA], [NAMEPREP], [PUNYCODE]. By "internationalized domain name," we mean a sequence of characters that can be used (through some process) as a DNS hostname, but that contains one or more characters outside the LDH code points.
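As an illustration of what "ASCII-compatible encoding" means in practice, the sketch below uses the "punycode" codec that ships with Python's standard library. That codec implements Punycode as it was later published, so its output may differ in detail from the draft cited here; the example label is an arbitrary choice, and the sketch omits the nameprep step and any ACE prefix that the full IDNA process would add.

```python
label = "b\u00fccher"  # "bücher", an example label chosen for illustration

ace = label.encode("punycode")        # ASCII-only form of the label
round_trip = ace.decode("punycode")   # back to the original characters

print(ace)
print(round_trip == label)
```

The encoded form consists solely of LDH code points, which is what allows it to be carried by existing DNS infrastructure.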

For purposes of accuracy, some observers prefer to distinguish between a "domain name" (which, under the current standard, can include any octet in the 0x00 through 0xFF range) and a "hostname" (which, under the current standard, is limited to the LDH code points). In other words, the "hostname" definition represents implementation choices which may be narrower than all of the possibilities allowed by the definition of a "domain name." Similarly, the term "Internationalized Domain Name" can be defined to include all of the possible code points (i.e., in the case of IDNA, the entire Unicode repertoire), while the term "Internationalized Hostname" can refer to the specific sets of code points that are allowed to be added to the LDH code points.

The Unicode Standard (code points identical to ISO/IEC 10646) is a coded set of characters that contains tens of thousands of characters from all major scripts. See [UNICODE, ISO 10646].

INTRODUCTION TO IDN 'PERMISSIBLE CODE POINT' PROBLEMS

As part of the process of finalizing the IDNA standard, the IETF must decide a number of permissible character issues, including which sets or collections of characters within the overall Unicode repertoire will be permitted and which will be prohibited. For example, the existing specification for DNS hostnames and services prohibits all punctuation characters except the hyphen-minus ("-"), along with the label-separating period. See [RFC952], [RFC1034], [RFC1035].

In addition to the characters of every language script that could be identified and standardized by the Unicode Consortium, the Unicode Standard contains several sets of "characters" that do not, in fact, appear in any conventional human language. At a minimum, these characters include:

line and symbol-drawing characters,
symbols and icons that are neither alphabetic nor ideographic language characters, such as typographical and pictographic dingbats,
punctuation characters, and
spacing characters.

The Unicode Standard provides clear and precise definitions for only the first of these four categories. In the case of the picture-drawing characters, the Unicode Standard includes them "solely to facilitate the support of legacy implementations…." The Unicode Standard does not state rules for the construction of such pictures, which can include block elements, fractional fills, and geometric shapes. [UNICODE (ISO/IEC 10646), Chapter 12].

In the case of symbols and icons, the Unicode Standard includes currency symbols, letter-like symbols (e.g., the "degree Celsius" character), number forms (e.g., the fraction slash, Roman numerals), trademark and copyright signs, mathematical operators, arrows, control pictures, musical symbols (including Byzantine musical symbols), pictographic dingbats, and miscellaneous symbols (weather and astronomical symbols, pointing hands, religious and ideological symbols, the I Ching trigrams, planet and zodiacal symbols, chess pieces, card suits, and musical dingbats). While some of these characters generally appear as normal text, others "are typically used for text decorations…." [UNICODE, Chapter 13].

In the case of punctuation characters, the Unicode Standard states that they may be identified by their "common function": "They separate units of text, such as sentences and phrases, thus clarifying the meaning of the text." The Unicode Standard goes on to note that punctuation codes appear in several widely separated locations in the Unicode character blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, and Chinese/Japanese/Korean Symbols and Punctuation, as well as "occasional characters in character blocks for specific scripts." [UNICODE, Chapter 6.1]. Among the punctuation characters in the Unicode Standard are the no-break space (U+00A0), the zero-width space (U+200B), and the zero-width no-break space (U+FEFF).
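One rough way to see how characters fall into groupings of this kind is to inspect the "general category" property that the Unicode Character Database assigns to each code point. The sketch below is an assumption-laden illustration: the coarse classes and the function name are invented here, they do not map one-to-one onto the four groupings discussed above (for example, the database files the no-break space under separators rather than punctuation), and Python's unicodedata module reflects a later version of Unicode than the one cited in this paper.

```python
import unicodedata


def broad_class(ch: str) -> str:
    """Map a character's Unicode general category to a coarse class."""
    major = unicodedata.category(ch)[0]  # first letter of codes such as "Ll", "Po", "Zs"
    return {
        "L": "letter",
        "N": "number",
        "P": "punctuation",
        "S": "symbol",
        "Z": "separator",
    }.get(major, "other")  # marks, controls, format characters, unassigned


for ch in ["a", "-", "!", "\u00a0", "\u2500", "\u2603", "\u00a9"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), broad_class(ch))
```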

Use of some of these classes of characters will increase the risks of user confusion, will make it harder to index Whois (and equivalent) databases, and will create vast opportunities for spoofed names which would not otherwise exist. There may well be market demand for DNS labels that contain such characters or symbols; there will certainly be commercial interests (potentially including some registrars) who will take advantage of the potential for confusion and cybersquatting as a business opportunity to sell multiple registrations on a "protect your name, or else" basis.

Indeed, the problem of confusingly similar characters exists within the current LDH set. Some commentators have described the exploitation of similarities (such as the o and the 0) to generate intentional confusion as a serious security problem. Known as the "homograph attack," the use of character resemblance to divert traffic currently occurs through the registration and use of domain names such as <G00GLE.com>, <MICR0S0FT.com> or <YAH00.com>. A user who is not careful may click through to an identical-seeming site and proceed to disclose passwords, credit card numbers, or other sensitive data. See [ATTACK].
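The look-alike problem within the LDH set can be sketched as a comparison of names after folding visually similar characters onto one representative. The table below is a hypothetical, hand-picked pair of substitutions for illustration only; a realistic table would be far larger and would depend on fonts and scripts.

```python
# Hypothetical look-alike pairs within the LDH set (digit -> similar letter).
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})


def skeleton(name: str) -> str:
    """Fold case, then collapse the look-alike pairs above onto one form."""
    return name.lower().translate(LOOKALIKES)


def confusable(a: str, b: str) -> bool:
    """True if two distinct names collapse to the same skeleton."""
    return a.lower() != b.lower() and skeleton(a) == skeleton(b)


print(confusable("G00GLE.com", "google.com"))
print(confusable("icann.org", "ICANN.org"))  # same name, so not a look-alike pair
```

Note that DNS matching itself performs no such folding: the two names in the first comparison are simply different names as far as the DNS is concerned.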

Beyond security considerations, the various unconventional code points noted above may pose potentially significant policy problems, which, because they are policy rather than technical issues, the IETF is not likely to address. It appears that the IETF's IDN Working Group is inclined to permit most such characters: there is no technical reason not to do so, and it is technically rational to take the position that their implementation and use is the responsibility of the implementers and users (or, perhaps, the registries or registrars). The current draft of [NAMEPREP, Section 5] specifies a prohibition table that includes some space characters, control characters, private use and replacement characters, non-characters, surrogate codes, characters that change display properties, tagging characters, and other inappropriate codes. However, many of the codes falling within the four problematic groupings identified above do not appear on the prohibition table.
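The gap between the prohibition table and the four groupings can be illustrated with the stringprep module in Python's standard library, which exposes the prohibition tables of the stringprep specification in its later published form. The selection of tables below follows the nameprep profile as eventually published; this is an assumption relative to the draft cited above, which may differ in detail, and the function name is hypothetical.

```python
import stringprep

# Prohibition tables of the published nameprep profile (assumed here;
# the draft cited in this paper may list them differently).
PROHIBITED_TABLES = (
    stringprep.in_table_c12,  # non-ASCII space characters
    stringprep.in_table_c22,  # non-ASCII control characters
    stringprep.in_table_c3,   # private use
    stringprep.in_table_c4,   # non-character code points
    stringprep.in_table_c5,   # surrogate codes
    stringprep.in_table_c6,   # inappropriate for plain text
    stringprep.in_table_c7,   # inappropriate for canonical representation
    stringprep.in_table_c8,   # change display properties or are deprecated
    stringprep.in_table_c9,   # tagging characters
)


def is_prohibited(ch: str) -> bool:
    """True if the character appears in any of the prohibition tables above."""
    return any(table(ch) for table in PROHIBITED_TABLES)


print(is_prohibited("\u00a0"))  # no-break space
print(is_prohibited("\u2603"))  # a pictographic symbol (snowman)
print(is_prohibited("\u2500"))  # a box-drawing character
```

Under these tables the space character is rejected, while the symbol and the line-drawing character pass through, which is the situation the paragraph above describes.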

The policy implications of adding these types of characters to standardized DNS are potentially quite profound. They would permit creation of symbols for which there are no names, symbols that might match other, existing domain names (for example, by inserting a non-spacing break between two visible characters), but for which the DNS matching rules are not capable of detecting the apparent similarity. Such characters would enable easy spoofing of existing domain names, with all the confusion, hijacking of traffic, and fraud that would inevitably follow. Another possible outcome would be renewed conflicts and pressure on the DNS over the registration of legally protected symbols (such as trademarked names). Indeed, some DNS registrars have reportedly begun suggesting to customers that these symbols would permit them to generate and utilize their corporate logos directly in the DNS, using the logo as part of a URL.
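The spoofing scenario mentioned above (inserting an invisible character between two visible ones) can be sketched as follows. The helper name is hypothetical, and treating the "format" and "separator" general categories as invisible is a simplifying assumption made for this illustration.

```python
import unicodedata


def strip_invisible(label: str) -> str:
    """Drop format (Cf) and separator (Z*) characters from a label."""
    return "".join(
        ch
        for ch in label
        if unicodedata.category(ch) != "Cf"
        and not unicodedata.category(ch).startswith("Z")
    )


genuine = "example"
spoof = "exam\u200bple"  # zero-width space between two visible characters

print(genuine == spoof)                   # code-point comparison
print(genuine == strip_invisible(spoof))  # comparison after removing invisible characters
```

The two strings differ when compared code point by code point, yet would typically be displayed identically; only a comparison that knows which characters are invisible detects the resemblance.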

To summarize: Some characters within the Unicode repertoire might, if allowed as permissible IDN code points, cause significant problems from the standpoint of policy considerations. The next question, therefore, is: What (if anything) should be done to enable the resolution of these potential policy concerns?

The Unicode character set is extremely large; consequently, it would take a large number of people a significant amount of time to make decisions on a character-by-character basis, particularly since many of those decisions would require every character to be compared against every other character, using a variety of criteria.
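To give a sense of scale for the pairwise comparison, the sketch below counts the unordered pairs of distinct characters in a repertoire. The repertoire size is an assumed round figure chosen for illustration, not an exact character count for any version of the Unicode Standard.

```python
from math import comb

repertoire_size = 50_000  # assumed round figure, for illustration only

pairs = comb(repertoire_size, 2)  # unordered pairs of distinct characters
print(f"{pairs:,}")
```

Even under this assumption the number of pairs exceeds one billion, before any of the "variety of criteria" is applied to each pair.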

Consequently, a more practical model must be devised. The IETF could either:

Start with the current restricted set of LDH ASCII characters (a-z, A-Z, 0-9, -) and then extend it to include relevant, non-problematic "international" characters. Another way to state this model is: "Everything that is not explicitly permitted is prohibited." This is often referred to as the "inclusion-based" model, because you begin with a baseline, and make decisions to include.

Or:

Start with the entire Unicode set, and eliminate only characters that can be explicitly demonstrated as being harmful. This is often referred to as the "exclusion-based" model, because you begin with all possible elements, and make decisions to exclude.

Neither model/principle automatically makes the right per-character, or per-script, or per-block, decisions; it just provides a framework for the difficult cases. (A reminder: these points refer to the definition of "Internationalized Host Names", which might be considered a separate and distinct IETF work item, to be undertaken in parallel with the standardization of "Internationalized Domain Names").
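The practical difference between the two models lies in what happens to a character that nobody has yet reviewed. The sketch below makes that difference concrete; both policy tables are hypothetical and invented purely for illustration, as are the function names.

```python
import string

LDH = set(string.ascii_lowercase + string.digits + "-")

# Hypothetical policy tables, invented for illustration only.
EXPLICITLY_PERMITTED = LDH | set("\u00e4\u00f6\u00fc\u00df")  # baseline plus reviewed additions
EXPLICITLY_PROHIBITED = {"\u00a0", "\u200b", "\ufeff"}          # only code points shown to be harmful


def allowed_by_inclusion(label: str) -> bool:
    """Everything that is not explicitly permitted is prohibited."""
    return all(ch in EXPLICITLY_PERMITTED for ch in label.lower())


def allowed_by_exclusion(label: str) -> bool:
    """Everything that is not explicitly prohibited is permitted."""
    return not any(ch in EXPLICITLY_PROHIBITED for ch in label.lower())


unreviewed = "i\u2665ny"  # contains U+2665, which appears in neither table
print(allowed_by_inclusion(unreviewed))
print(allowed_by_exclusion(unreviewed))
```

For the unreviewed symbol, the inclusion-based check rejects the label and the exclusion-based check accepts it; for characters that appear in one of the tables, the two models can be made to agree.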

The fundamental advantage of the "inclusion-based" model is that it is far easier to restrict something initially and then later relax the restriction, than it is to permit something and then later attempt to remove it from use.

A related question is whether the exclusion (temporary or permanent) of the problematic IDN code points in the hostname definition should be performed through the filtering (string preparation or "nameprep") that is part of the advancing IDN standard, or by advising registries and registrars not to permit their registration. See [NAMEPREP]. In principle, either is possible. Registry/registrar-based exclusion could be performed at any time; however, many registries may simply choose to ignore the advice, and any registrant could do so below the second or third level of the gTLD registries. (It is conceivable that a registry could implement a policy providing for revocation of IDN domain names whose registrants use prohibited characters in lower-level labels).

Notably, the exclusion of certain code points is already a part of the [NAMEPREP] documentation, and was anticipated both in the original DNS documents [RFC1034] and in the most recent draft of "Requirements for Internationalized Domain Names" (Z. Wenzel and J. Seng, eds.), draft-ietf-idn-requirements, work in progress, November 2001. In Section 2.1[3], that document stated: "A service defined on top of the DNS, for instance the IDN-to-address function, MAY limit the code points that can be used."

REFERENCES

[RFC952] K. Harrenstien, M.K. Stahl, E.J. Feinler, "DoD Internet Host Table Specification," RFC 952, October 1985.

[RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities," RFC 1034, November 1987.

[RFC1035] P. Mockapetris, "Domain Names - Implementation and Specification," RFC 1035, November 1987.

[RFC1123] R. Braden, "Requirements for Internet Hosts - Application and Support," RFC 1123, October 1989.

[UNICODE] The Unicode Standard, Version 3.1.1, defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 <http://www.unicode.org/reports/tr27/> and the Unicode 3.1.1 Update Notice <http://www.unicode.org/versions/Unicode3.1.1.html>: The Unicode Consortium.

[IDNA] P. Faltstrom, "Internationalizing Domain Names in Applications (IDNA)," draft-ietf-idn-idna. Most recent version: <http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt>.

[NAMEPREP] Paul Hoffman and Marc Blanchet, "Stringprep Profile for Internationalized Host Names," draft-ietf-idn-nameprep. Most recent version: <http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-07.txt>.

[PUNYCODE] Adam Costello, "Punycode," draft-ietf-idn-punycode. Most recent version: <http://www.ietf.org/internet-drafts/draft-ietf-idn-punycode-00.txt>.

[ATTACK] Evgeniy Gabrilovich and Alex Gontmakher, "The Homograph Attack," Inside Risks 140, CACM 45, 2, February 2002, <http://www.csl.sri.com/users/neumann/insiderisks.html#140>.
