INTERNET-DRAFT John C. Klensin December 13, 2000 Expires June 2001 Internationalizing the DNS -- A New Class draft-klensin-i18n-newclass-00.txt Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 except that the right to produce derivative works is not granted. The above restriction will be removed by the author in a subsequent version of this draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. 0. Abstract Several mechanisms have been proposed for placing multilingual names (more properly, names normally written in non-ASCII character sets) into the DNS or addressing the need for multilingual access to the Internet in other ways. Most of them involve, to one extent or another, workarounds to the current system. This document proposes a "go back and fix it" approach, replacing the "IN" Class in the DNS with one that is not limited to ASCII from its initial definitions. Some of the deployment issues, politics, and other drawbacks are also briefly discussed. A mailing list has been initiated for discussion of this draft, its successors, and closely-related issues at ietf-i18n-dns-newclass@imc.org. To subscribe to the mailing list, send a message to ietf-i18n-dns-newclass-request@imc.org with the single word "subscribe" (without the quotes) in the body of the message. To unsubscribe from the list, use that same address with the single word "unsubscribe" in the body of the message. Issues related to the relationship of the model proposed here to general issues of multilingual access to the DNS should be raised in the IETF IDN WG working group, see http://www.ietf.org/html.charters/idn-charter.html. 1. Introduction and Context There have been a large number of proposals, both inside and outside the IETF, for getting multilingual (or "internationalized") access to the DNS. With the exception of a few proposals that focus on doing the work in a separate directory system (see [Klen2000b] and several suggested or deployed commercial products), all involve inserting the names into the existing DNS structure. This would be done using a character set repertorie that includes more than ASCII [ASCII] and some encoding to place characters from that repertorie into the system. To a considerable degree, while the objective is multilingual access to names (i.e., access from multiple languages), very few of the proposals address languages at all. For a number of good reasons which are adequately discussed elsewhere (<>) discussion has focused on access to the system with characters other than those that appear in ASCII and, in particular, characters drawn from ISO 10646 [IS10646]. This document, too, addresses characters, not languages, and "non-ASCII", "international", "multinational", and, following ISO's example, "universal character set" ("UCS") terminology is used interchangably below. When "multilingual" is used, it refers to the languages in which words or names appear, not what is placed into the DNS. When the DNS was designed, it was anticipated that there might be future extension through the use of "Classes". All common current uses on the public Internet use the "IN" class. Two additional classes are known to have been defined and used: one for the old Chaosnet protocols and one for Hesiod-based protocols. Potential extensions via the Class mechanisms may have been one of the reasons that DNS labels and other fields were defined as binary, rather than ASCII, forms. Applications using the IN Class have historically assumed labels limited to seven-bit ASCII characters and, more specifically, a "protocol element" character set derived from ARPANET "hostname" rules (see [Klen2000a]). This document explores the question of what the DNS, and "DNS names" would have looked like had we been designing it today, with multilingual usage as a priority but with current technology and existing standards available, and proposes a way to get there. The proposal is radical in the sense that it implies a major restructuring of DNS usage and, indeed, of the Internet, to make the DNS seamlessly capable of working with multinational character sets. Such a restructuring is, and should be, quite frightening. It is worth considering only if the long-term risks and problems of other proposals are severe enough to justify a radical approach. It is the working hypothesis of this document that they are. At a relatively technical level, this would require changing every DNS resolver and server, and application that accesses either, on the Internet that wished to use non-ASCII names. Legacy (unconverted) systems would be at a significant disadvantage in referencing new names (some of which might use only a subset of ASCII characters but might still not be registered in the older Class), just as legacy systems were during the transition between Hostnames and the DNS. There are also a number of problems, such as the weaknesses of the DNS as a directory system, which it does not solve (see section 3.6, below). 2. Overview of the Proposal Suppose we introduce a new Class (let's call it UC for "universal characters" just as a placeholder, but I hope that isn't what we would choose), which is just like IN (i.e., inherits its RR definitions, but see below), except: * Labels and all fields containing text are defined as IS 10646 characters, coded in UTF-8 (or, in principle, some other _single_ system, i.e., we do not permit multiple character sets in the structure). * A new RR is introduced that maps a new-type label (in the new class) into a restricted-ASCII (traditional) label. The intent is that the resolver then looks up the restricted-ASCII label in the IN class and proceeds as usual. Probably it would need to have "nothing else there" restrictions like CNAME, but (to put it mildly) I haven't thought that through. An alternative would be to adopt a search rule strategy, looking first with Qclass=UC and then Qclass=IN. The new RR might not be needed if one adopted a search rule strategy, or strong administrative rules requiring that all Class IN records be transposed into Class UC (see below). * A second "NS" RR is introduced that indicates that a delegated subdomain of a zone in Class=UC is in Class=IN, and not in Class=UC. (Not that a references from Class=IN to Class=UC makes no sense and mechanisms for it should not be provided.) It is not clear at this time whether both the second and third "crossreference" RRs would be necessary, but almost certainly at least one would be. This brief outline obviously leaves out many critical details which would need to be worked out, only some of which are explored in this draft of the document. 3. Technical alternatives and the deployment and transition nightmare A "new class" proposal would obviously not be easy to deploy, but, realistically, neither are any of the other ideas if the definition of deployment involves users having access to Internet names drawn from a broad range of languages. It would cleanly separate "international character set" name spaces from the "ASCII" one -- i.e., "old" clients and systems would never see the non-ASCII types. In the international name space, English, and the character code points used to represent it, becomes just one of many such languages and their corresponding character code points. It might even let us fix a few other things along the line, as long as they were sufficiently straightforward to not create significant delays. E.g., there are several RR types in the current Class that are either obsolete or have never been widely used, and we might be able to eliminate them by not carrying them forward. While other transition models are possible, the cleanest one would be to conclude that the new Class was intended, over time, to simply obsolete and replace Class=IN. If registrations in Class=IN were transferred into (or explicitly referenced from) the new class (or a "search rule" system was employed), then the transition model would be very similar to that of the hosttable-> DNS transition. In particular, "old" clients and systems would see a smaller and smaller fraction of the Internet until they converted and we would expect some user-level tools to arise to work around slow conversions. 3.1 Preparation and comparison of names Subsets of ASCII, or character codes whose character repertoires are themselves subsets of ASCII, have long been the character repertoire of choice for the protocol elements of protocols that use characters in such elements. While some of the reasons for this --arguably including the decision to use characters from a Roman- (Latin-) based alphabet-- are simply historical, ASCII has the advantages of containing a very small (by world averages) set of characters, of permitting an extremely easy case-mapping algorithm (and case-mapping is important in Roman-based languages), of requiring no "composed" characters, and of raising no significant issues with canonicalization or identity-matching. To varying degrees, as soon as the character repertoire moves beyond the requirements of ASCII, comparison issues intrude: it is necessary for a DNS server to determine whether the name specified in a query matches the name that appears in its tables for a domain. And that, in turn, requires either that strict rules be applied to how names are stored and how queries are presented or that the server be able to interpret a somewhat-ambiguous (or "fuzzy") query. The latter option is infeasible given the design of DNS servers (although non-DNS systems might permit it -- see section 3.6 and [klen2000b]). The former has been the focus on the "nameprep" efforts within the IDN Working Group. In general, the mechanisms and rules being developed as part of the "nameprep" effort would need to be applied to a "new class" system, just as they would need to be applied to "edns/UTF-8" or "ACE" systems. 3.2 Registrations in both places A "multilingual" name registrant could choose whether to register the multilingual name exclusively or whether to register ASCII-based names as well (giving most of the useful properties of the two-sided business card analogies). Such ASCII-based names could be registered in the new class; they could presumably also be registered in the old class for compatibility with legacy systems. We would not expect reverse mappings to work from IN-space to UC-space; PTR lookups in Class IN would yield ASCII names; PTR lookups in Class UC would yield IS 10646- based names. And we would expect all other fields that contain text to contain IS 10646-based as well. There is probably no practical way to automate dual registrations in the general case (keeping in mind that naming and identification issues that exist near the root also exist deep in the DNS tree), so decisions about legacy registrations would need to be administrative and policy based. See section 5. Search rules (see next section) might be an acceptable substitute for dual registrations, but have their own disadvantages. 3.3 Search rules and search failures Unless all relevant records in Class=IN are copied into Class=UC and the two are kept synchronized (very difficult if not impossible to maintain), there will be a requirement for searching from the newer class to the older one. That requirement could, in principle, be kept in the servers and off the network by providing for new servers to automaticaly search in Class=IN if nothing is found in Class=UC without intervening interactions with the resolver. This could introduce significant complexity and a number of special cases into the server and might or might not be wise. Since a new Class causes multiplicative effects on the number of probes potentially required to complete a search, minimizing the number of similar RR types in the new Class becomes technically advantageous as well as aesthetically so (see section 4). The potential need to make queries in both Class=UC and Class=IN for a given user-supplied name provides an immediate, and strong, reason why the fundamental domain hierarchy structure of both Classes should be identical, even if the servers are not. If identity of servers is not practical (it probably would not be significantly below the top level, if there), the portion of the Class=UC tree that shared names with the Class=IN tree would need to be identically structured. Almost by definition, as non-ASCII non-terminal nodes are introduced into the Class=UC tree, that tree would diverge from, or become a superset of, the Class=IN one. The alternatives would, at best, be hopelessly confusing to users. But, if searching mechanisms from Class=UC to Class=IN will be in regular use, it is tempting to rely on those mechanisms rather than doing any forward copying of data. This would increase overhead in comparison to having all information copied into the UC class as early as possible, but the range of alternatives needs to be studied carefully, especially with regard to domain trees that contain non-ASCII names and UC-capable servers at some nodes and only ASCII names and legacy servers at others. The special, cross-Class NS RR suggested above would help with such trees, but some searching strategies might make strict bottom-to-top conversion of subtrees (rather than level-skipping) very valuable if not necessary. Having two classes also raises issues for which the answers seem obvious, but decisions must be made and made explicit. For example, it seems clear that one should search ((QClass=UC, Qtype=MX, ...) (QClass=UC, Qtype=A6, ...) ... (QClass=IN, Qtype=MX, ...) ... But, at least in theory, a case could be made for looking for MX RRs in "IN" before looking for address records in "UC". 3.4 A "new class" solution versus an "edns/utf-8" one. Some of the proposals before the IDN working group (and elsewhere) depend on the use of "extensible DNS" ("edns") facilities to permit extended labels and the use of UTF-8 encoding in them. Proponents point out that edns is extremely useful for IPv6 and DNNSEC, so will probably deploy quickly anyway; its use for non-ASCII DNS labels would both benefit from and reinforce those deployment pressures. One of the barriers to the deployment and heavy use of extensible DNS [RFC2671] is that its use in the current, Class=IN, environment depends somewhat on updating of intermediate servers. In other words, an updated client and updated primary server may not be able to properly interoperate because caches or secondary servers may still be running older code. In principle, and probably in practice, this is not an issue with a new Class: absent serious errors of configuration, name servers delegations and caches for the new class would be only to servers supporting that class. 3.5 A "new class" approach versus an "ACE" one. Several of the proposals in the IDN working group (and elsewhere) depend on encoding ISO 10646 characters into an ASCII-compatible format (an ASCII-compatible encoding, hence "ACE") so that the names, however ugly, would survive passage into applications that have intrinsic seven-bit limitations. That group of applications is somewhat more diverse than what is usually thought of as "internet applications". For example, X.509 certificates are used in SSL and assume seven-bit characters. The ACE codings would work with those applications, although they would look nothing like the graphic characters of the original character set and language. While this document assumes using the UTF-8 encoding of IS 10646 directly in the names and labels of the UC class, UTF-8 is just another encoding. There may be applications-based arguments for using an ASCII, or ASCII-compatible, encoding to represent character codes in the new Class as well. However, since all codes in the new Class would be using the same system, one could devise a system that did not require a switching or labeling mechanism to identify the use of the coding system versus the appearance of codes in the ASCII range intended to be interpreted as ASCII. I.e., prefixes or suffixes might become unnecessary and it might be possible to use higher-density encodings, such as MIME's base64, rather than those more commonly suggested as ACE mechanisms. 3.6 A "new class" approach versus a "directory" one. Many of the issues raised in [Klen2000b] are not addressed by this proposal. Neither it, nor any other DNS-based solution, would turn the DNS into a searchable directory. Nor can they address imprecise matching, keyword matching, nearest applicable server location, searching on the content of data fields, and so on. This proposal does provide a plausible solution to reverse-mapping problems, deployment, and has known scaling properties: all areas where the notions outlined in [Klen2000b] are weaker. It would probably be somewhat faster than a directory approach layered on the DNS, since there would be no requirement for a two-stage lookup process. But, ultimately, the two proposals are complementary: There is a strong applications case for introducing a directory layer. While the directory layer could be used to support multilingual names -- treating the ASCII-based names in the DNS as protocol elements rather than names that ought to be user visible -- it could also be used with a DNS that actually and cleanly supported multilingual names as suggested here. 3.7 Another look at legacy applications As suggested above, there are some applications, many with origins outside the IETF, that cannot be easily upgraded to use of non-ASCII (or, generally, non-seven-bit), character codes. It is difficult to know what to do about those applications. If we are really serious about converting the Internet to support applications in all languages (which is ultimately the assumption underlying this document), then the answer may be more clear: the overhead of dealing with the UCS to ASCII interface ought to fall on those applications, as an intermediate step until the protocols themselves can be upgraded. In other words, we would anticipate a four-stage conversion process for those applications: (i) Completely legacy (non-updated) code would continue to reference Class=IN (no other option is possible). (ii) Applications code would be upgraded to make QClass=UC inquiries and to represent the UCS codes for their databases and presentations in some ASCII-compatible form compatible with their protocol definitions. (iii) The protocols would be upgraded to international norms and usage. (iv) The applications code would be changed to conform to the new protocols, eliminating the workaround of stage (ii). If these conversions and downgrades are incorporated into the DNS, we are stuck with their overhead and appearance forever. And different applications, with different constraints, may have to convert them to application-specific formats anyway. 4. Bringing RR types forward In principle, one could populate the UC Class with all of the types of the IN one, possibly eliminating those that are clearly obsolete, as mentioned above. A narrow reading of many of the existing definitional documents might even require this, although we can be assured that no queries or registrations in class=UC exist today. But it might be interesting to evaluate the implications of taking a harder line, partially to shorten search paths and minimize the size of zone files. Several RR types have been added "experimentally"; it would probably be wise to leave those for which there isn't considerable deployment and justification behind. We might also consider leaving AAAA RRs behind as obsolete or redundant, since A6 is the more general of the two. And, more radically, one might consider eliminating type A RRs, writing IPv4 addresses in IPv6 form, e.g., kakameymi.example. UC A6 0 ::FFFF:10.0.0.44 and letting APIs to resolvers translate them back into IPv4 format if needed. By doing this, the potential need to query for A6, AAAA, and A RRs in sequence would disappear, resulting in some performance improvement, especially for the "not found" cases. This might raise entry barriers too much to be worthwhile, but, if feasible (and we really believe in IPv6 deployment), it would yield a much cleaner environment for both forward and reverse mappings. On the other hand, eliminating other types in favor of A6 might be advancing the technology in too many ways at once. For example, while the A6 RR appears to be fully general, there is so far little few real experience on using it (or other IPv6 RRs) and none of that experience is at large scale. It may be wise to go somewhat cautiously into directions that tie the new class to such less-than-complete-tested approaches. 5 Pushing into layers eight through ten (For those who don't know, these layers have become an internal joke in the IETF community whose exact origins are unknown to this author. The layers are characterized as "financial", "political", and "religious", with some debate about the order.) A new DNS class really would be new. Current Internet administrative procedures and lines of authority with regard to the DNS have assumed that Class=IN is the only class at issue. It is to be hoped that IETF could identify technical criteria but leave the non-technical issues to ICANN and related bodies. 5.1 The root server question The design of the DNS is such that there is no inherent reason why root servers for a new class would need to be the same as those for IN. However, the same considerations for root server selection that apply to the Class=IN root [rfc2870] would presumably apply to the Class=UC root as well. There would be several other administrative and operational advantages for keeping the root servers the same --or at least co-locating them-- as long as loads and similar factors permitted. 5.2 Other administrative challenges Just as the root servers would not need to be the same, the introduction of a new Class would, in theory, permit revisiting the entire top-level structure and administration of the the Class=IN DNS. Doing so would probably be unwise if we wanted to see this deployed in our lifetimes, but the possibility must be identified and, at least briefly, considered. 5.3 Thinking about deployment It is clear that, like any other multilingual approach, software supporting a new DNS class would deploy much more quickly in areas which clearly need it than in areas that perceive they do not. The requirement for communication with, and access to sites in, non- English-speaking areas would tend to drive deployment in other areas with this and many other approaches. The so-called "ACE" approaches within Class=IN (and perhaps some others) using the IN Class would permit non-updated sites to see multilingual names in their ugly encoded forms; it is possible that would actually act as a disincentive to updating and conversion since the names would still be somewhat visible; this proposal would not make multilingual names available in any form to legacy systems. Whatever thinking is done about deployment tradeoffs should consider Internet growth rates, especially in non-English-speaking areas. Whatever solution is adopted, we will need to live with it for a long time. If no old systems are ever converted, but new ones installed after a particular date have updated software installed, and "doubling every year" behavior continues, then the legacy base represents half of the Internet a year later, a quarter a year after that, and so on. So, a sufficiently important change, incorporated into relevant shipping software, has a very large percentage impact even if there is no actual updating of systems. Something to keep in mind, especially if the alternatives are overhead-laden kludges that we will need to support forever. 6. Summary <> 7. References [ASCII] American National Standards Institute (formerly United States of America Standards Institute), X3.4, 1968, "USA Code for Information Interchange". ANSI X3.4-1968 has been replaced by newer versions with slight modifications, but the 1968 version remains definitive for the Internet. [IS10646] [RFC2671] Extension Mechanisms for DNS (EDNS0). P. Vixie. August 1999. [RFC2870] Root Name Server Operational Requirements. R. Bush, D. Karrenberg, M. Kosters, R. Plzak. June 2000 [klen2000a] Klensin, J., "Reflections on the DNS, RFC 1591, and Categories of Domains", work in progress (http://www.ietf.org/internet-drafts/draft-klensin-1591-reflections-01.txt) [klen2000b] Klensin, J., "Role of the Domain Name System", work in progress (http://search.ietf.org/internet-drafts/draft-klensin-dns-role-00.txt) 8. Acknowledgements Rob Austein and Randy Bush made very significant contributions to the thinking and some of the text that went into early versions of this draft through a series of email discussions. Others, including Marc Blanchet, Vint Cerf, Kilnam Chon (with apologies for writing his name in ASCII characters rather than Korean ones), Patrik Faltstrom (with apologies for the ASCII transposition), Paul Hoffman, and Zita Wenzel have made suggestions or challenged some of the ideas in their embryonic form, leading to clarifications and clearer thinking. The author, of course, bears ultimate responsibility for the ideas as presented. 9. Author's address John C Klensin AT&T Labs 99 Bedford Street Boston, MA 02111 klensin@research.att.com Expires June 2001