Workshop on IDNs ICANN Meeting - Paris Thursday, 26 June 2008 >>TINA DAM: Good morning. So I think we're about ready to start. We started a little bit late to see if we'd get some more people into the room. But it looks like it's just too early or too late in the week, I guess. So welcome to the workshop on the IDNA protocol revision. We have a couple of small changes to the agenda that was posted. One of the reasons for it is that one of our panel members unfortunately wasn't able to join us. That is John Klensin. His flight was cancelled, and he had some troubles getting out of U.S. So he's not with us today. So we reshuffled the agenda a little bit, and you can see the way that it's going to be for the next couple of hours up on the screen. You're going to get a general introduction and review of the rationale from me. Then some revision details from Cary. Then we're going to move into the two revision topics, which is the character list generation rules by Patrik, the right to left or bidirectional issues and solutions from Harald. And Harald is not here yet. He's actually at the board workshop. But he's going to come over and join us. Then we're going to move into implementation topics, and you're going to see registration responsibilities by Cary, resolution by Patrik, and then we have invited the Arabic Script Working Group that are coordinating the Arabic script or the different languages that are using the Arabic script, to come and give us a status on their work as an example of implementation and work that's being done on top of the protocol layer. And then it's time for the conclusion. So that is how the agenda's going to be for today. Cary and Patrik, I might just want to ask you guys to introduce yourself for those who don't know you yet. >>PATRIK FÄLTSTRÖM: Patrik Fältström, employed by Cisco, and I'm one of the coauthors of the original IDNA proposal, and I'm also the editor of the table document that I will describe later today. 
>>CARY KARP: Cary Karp. I'm employed by the Swedish Museum of Natural History and have been involved in numbers of issues relating to internationalization, most recently, the IDN stuff. >>TINA DAM: Okay. Great. One last thing before we get started. We really would like this to be as interactive as much as possible. So if you have any questions, you know, even during presentations, just go ahead and come up to the microphone and ask them. I know we're sitting up here, like, a little bit far from you on the stage. But this was the best room for us to be in today. So just don't hold back questions. And there is going to be -- I am going to do, like, open microphone after each topic as well, and also at the end of the workshop. So please don't hold back on that. All right. So I'm going to go ahead with some general information leading up to the protocol. The DNS can actually handle all U.S. ASCII characters. And that is something that's usually misunderstood in the nontechnical world of our community. So you see some of the examples of U.S. characters up here. There's ABC through Z and 0 to 9 and a dash. But there's also other characters that the DNS can handle. If not all, then at least most of the TLD registries have implemented a rule called the host name rule. You also know it as LDH. And LDH stands for letter, digit, and hyphen. And what that means is that domain names can only contain these characters. So ABC through Z, 0 to 9, and a dash. But that was before internationalization. With internationalization, internationalized domain names are names that have characters other than those in the standard ASCII set that I mentioned. And what that means, really, is that we're moving from these 37 characters today -- and there's about 100,000 in Unicode -- in the Unicode database today. And although not all of them can be used for IDNs, it is a big shift in the number of characters that can be used. IDNs are about localized solutions. 
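[The host name (LDH) rule Tina describes can be sketched in a few lines of code. This is an illustrative check only, not the normative definition; the 63-character limit and the no-leading/trailing-hyphen convention are standard practice for DNS labels.]

```python
import re

# Illustrative sketch of the classic LDH ("letter, digit, hyphen") host name
# rule: a label may contain only a-z, A-Z, 0-9, and "-", must not begin or
# end with a hyphen, and is limited to 63 characters.
LDH_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def is_ldh_label(label: str) -> bool:
    return LDH_LABEL.fullmatch(label) is not None

print(is_ldh_label("icann"))          # True
print(is_ldh_label("xn--bcher-kva"))  # True: encoded IDN labels are themselves LDH
print(is_ldh_label("bücher"))         # False: outside the LDH repertoire
```

Note that the encoded form of an IDN passes the LDH check, which is exactly why the DNS can carry it unchanged.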
And if you followed the previous IDN sessions, you've heard me say this many times. It really is important to understand that we're not doing this so that everybody can use and understand all characters used in all languages in the world. We're doing it for localized solutions. But it has to work as -- it has to be used globally, so it has to be internationalized and work for everything. Now IDNs have existed at the second level since 2003 under the protocol standard. And the protocol standard is called IDNA. It also existed back in 2001 under a test bed of that protocol. But the protocol standard came out in June 2003. The e-mail protocol is still underway, and it has been released from the IETF in experimental status, which means that they would like for application developers to implement it so they can get some experience with how it's working before it is moved into being finalized in the standards track. In addition to the protocol, we have the IDN guidelines. And the guidelines were issued in 2003 as well to support the protocol and to make it easier for registry operators and application developers to implement the protocol. Those guidelines have been revised since and are anticipated to be revised again after the protocol revision is done. So this is quite outside the protocol, but I thought I would put it up anyways. This is what we have today. We have the ASCII domain names that should be fairly familiar to you. And then we have IDNs at the second level, as I mentioned, under various existing top-level domains. And it really has been up to the different TLD registry operators to make a decision on whether they wanted to implement the IDNs or not. And some have, and -- whoops -- some have, and others have not. Some are waiting for IDNs at the top level. So you can see where the string is fully localized. Okay. So moving on to some IDN definitions. 
And these definitions, I should mention, are new in the revised protocol and may or may not be so familiar to you. There's the A-label. And the A-label is what is being transmitted in the protocol. This is the xn-- strings that you sometimes have seen. It is also what is being stored in the DNS, because this is the ASCII string that the DNS can use for lookup and resolution. And then there is the U-label. And the U-label is what the user usually believes he or she is making the registration in. So these are the local characters. And you see the example up there in Hindi. And, finally, we have the LDH-label. The LDH-label refers to a label or a name that is in all ASCII, meaning that it obeys completely the host name rule. So it's only in the characters ABC through Z, 0 to 9, and a dash. And one example of that is ICANN up there. Now, the IDNA protocol is what is providing the transition back and forth between the local characters and the xn-- string. So it's the transition back and forth between the A-labels and the U-labels. And you're going to hear these terminologies throughout this session. So this means that we have two different forms suddenly of domain name. Historically, the name you registered in the ASCII characters was also the name that was used and stored in DNS. But now we have this A-label, the xn-- that is stored in the DNS, and we have the Unicode labels, which is what the user sees. Usually the stored form does not give any meaning, so you see xn-- and then a sequence of ASCII characters that doesn't mean anything to you. But sometimes it does. And that is not intentional. The intention was actually that the user would never see the xn-- version of the string. But because of the way that things have been implemented in some browsers, you will, as a user, sometimes see this xn-- string as well. 
It is just a prefix to indicate to application software that there is a need for the label to be decoded back into Unicode so that it can be properly displayed to the user. Okay. So moving a little bit closer in to the protocol, and the rationale for the revision. It started a couple of years ago, and the formal start of it was with RFC4690, which is one of the technical standards from -- or documents from the IETF that is requesting a revision and providing some suggestions to some of the issues that existed with the old protocol. And this slide is giving you sort of like a quick overview of some of the reasons and differences between the two versions. So the current version that we have today was fixed on Unicode version 3.2, and revised version is going to be Unicode version-independent. That was sort of like the main reason for the initiation of the revision. We couldn't have a situation where only characters that are in Unicode version 3.2 would be available for domain names, because, as you know, characters continue being added into Unicode, and without a revised version that is Unicode version-independent, not all characters would -- that goes into Unicode would be able to be used. So in the revised version, all characters that are in Unicode or that are unassigned in Unicode, will have a status. It is still not possible to represent all words in all languages in domain names. And that's the same with both versions of the protocol. So that hasn't changed. But the current version is what we call exclusion-based. And that means that it's on a table. So it's basically -- it takes Unicode 3.2 and says, "These characters are valid and these characters are not valid." In the revised version, it's an inclusion-based model. And it's based on properties and procedures. And Patrik is going to get more into details with that. 
But it's going to give you a result based on the properties that characters have in Unicode that are either going to be protocol-valid, disallowed, or sometimes characters are unassigned. Finally, another reason why the revision was conducted was that application developers had some difficulty in completely understanding the description of the standard. And that has been improved significantly in the draft revision or the revised version as well. It has separated the registration and the resolution steps and provided some more detailed explanations of it. And Cary and Patrik are going to go into details of both registration and resolution a little bit later. Then it just happened to be that as the revision was taking place, we found other issues. And one example of that is the bidirectional problems that Harald is going to come and talk about. So I'm not going to go into much detail about that. But that was sort of, like, just pretty much by coincidence, which was a little bit scary, but it was nice to have found the problem, because it was something that could be fixed in the new version as well. Finally here on this slide, there's an overview of documents if you follow that link. Patrik is maintaining that site, and he's updating it with the latest versions of the draft revisions. There's a number of different documents describing the overall rationale and explanations, the protocol revision for registration and resolution, the tables and procedures for reaching these tables, and then the bidirectional issues. But all documents are on this site, and also older versions of the documents are on this site. So, with that, I'm going to -- I see there is nobody at the microphone yet. But please come up if you have any questions about what I said. And if not, then I'm going to hand over the microphone to Cary for some more revision details. >> (inaudible). >>TINA DAM: Can you come to the microphone? >>CARY KARP: You have to come to the microphone. >> Yeah.
My name is (saying name). I have a question for you, Tina. Your slides, will they be available at the ICANN site? >>TINA DAM: Yes. So my slides should be up there now, and if they're not, they will be -- okay. So I apologize about that. All slides will be available. So not just mine, but everybody who's presenting, all the slides are going to be on the -- where the agenda is on the Paris meeting site. >> Okay. Thanks. Because the pointer to the Web site you have, I didn't get the time to note it. >>TINA DAM: Of course. >> Thanks. >>TINA DAM: Sure. >>CARY KARP: Okay. I'm going to be working with some slides that John Klensin forwarded. This is largely -- well, this is what I would have been saying whether he were here or not, and what he would have been saying were he here, if I correctly assume -- if my assumptions about that are correct. But it's my presentation, his slides. Okay. I suspect before proceeding that it might be worth my asking. Be absolutely honest now. Nobody else can see you, but we can up here. How many of you are comfortable with the technical level on which this presentation is proceeding? We can either raise it or lower it, depending on what is suitable. Is there anyone here who does not really feel comfortable with what has been said so far? Doesn't quite understand what we're talking about? Nobody. >>TINA DAM: Actually, I would ask differently. On a raise of hands, who's comfortable? >>CARY KARP: Fine. Okay. Wonderful. >>PATRIK FÄLTSTRÖM: Well done, Tina. >>CARY KARP: Okay. This entire IDN issue is largely regarded as very complex because there's a fundamental aspect to it that is not really understood. And with that one insight, everything else becomes reasonably easy, especially given the level of technical erudition represented in the room today. Woody Allen after 17 years published a little book of fictional essays, and this is a quote from one of them. This is just hot off the presses.
"Sygmnd was a poor Austrian who'd lost all the vowels in his name in a boating accident." And in one sense, IDN is the boating accident that we have all suffered, and we can be redeemed from it, saved from it, with varying degrees of ease or difficulty. And just looking at the name Sygmnd, there are no vowels there. And there are people in this room who think using vowels is absolutely unnecessary when you're writing anything, very silly idea. So he probably came to his senses in that boating accident. There are people in this room that would say "Y" is, in fact, a vowel. And one of the things that needs to be recognized is that the domain name system does know what these VeriSign little squiggles are that we use in writing our languages, but it has no idea whatsoever what those languages themselves are. So we're looking at a large number of characters that we can use in strings, four, five, six, seven, eight, nine, ten of them at a time, which may or may not mean something in a language. The DNS was not intended for these things to be words. It was intended for them to be easily rememberable sequences of letters and numbers. And somehow, during the discussion of internationalization, the notion has been generated that any word that can be found in any dictionary in any language in current use has to be useful as a domain name label. And that is simply not a reasonable expectation. The system was not designed to make that possible. If something is useful -- if a word is useful in this context, that's great. But there is no expectation that a domain name label even should be a word. I mean, XYZ-123 is a perfectly reasonable name for some computer somewhere. But it is not a word in any language. And, again, if we make it possible to have the equivalent of that XYZ-123 with other letters and other number systems, that's fine, too. But they're not words in any other language for that. The primary issues are based in Unicode. 
There is one grand numbered listing of about 100,000 of these little squiggles. Each one of them has a number. Some of them have several numbers. And sorting this out means avoiding confusion: a character that looks like an "A" in Latin, the English alphabet "A" or the Swedish alphabet "A," appears at several points in the Unicode code chart. The Greek alpha looks the same, appears elsewhere a couple of times. Same with Cyrillic and numbers of others. And we need to be absolutely certain which of the various alternatives that we might have in mind when looking at the Unicode code chart is the one that's intended when addressing the IDN issues. A distinction is now being made between the -- or a distinction was not made in the first version of the IDNA protocol between the process of registering a name and the process of looking up a name. You type something into the address line of a Web browser (or, though we're not quite ready to do it yet, put it in an e-mail address), and your software goes and looks up what it believes to be the name that you are looking for in the domain name system and it returns something to you. And there's been some confusion there. There are numbers of things that are changed subsequent to your typing them into your computer and the thing that you believe you're looking for being returned to you. And this is being addressed in the protocol revision by separating the process of registration from the process of looking something up. There's an awful lot of explanatory material being generated in the revision documentation, which is absolutely necessary, but, in fact, the substantive changes that are being made are rather few.
And, again, this LDH concept, letter, digit, hyphen, which is, actually, the only thing that we use in domain names, whether we realize it or not, even after the IDN process is fully under way, but, again, what we're talking about is ways of using a far larger number of characters than the 26 letters of the basic Latin alphabet, the European digits 0 to 9 and the hyphen. How do we internationalize that character repertoire, as it's called? Again, we're not talking about making it possible for any word that you might wish to use in any language to be a suitable domain name label. And also the process of encoding these things. Tina showed it to you briefly. And I'm not going to illustrate it here. If you write a sequence of letters, say a sequence of Cyrillic characters, those are actually being converted. They're being encoded as a sequence of just plain old A to Z and 0 to 9 characters. And you're not supposed to see that. The encoded version -- you see a Unicode string. And that is encoded into what is called a Punycode string. And these things go back and forth. And that process is currently asymmetrical. You will start with a Unicode string, go to a Punycode string, and then go back to a different Unicode one. And that's not really what people want. And that's being addressed. But there's simply no way that it's possible to take this very, very basic Latin alphabet-based system and internationalize it, make it equally elegant when it's a thousandfold larger, without making significant compromises. There's some confusion that attaches to this. I've already mentioned that. But it's very, very important to note that the standard itself is not about the FQDN, the fully qualified domain name. The standard does not address a domain name. It addresses a label in that name, the thing that's separated by the dots. So ABC.XYZ.EU. That's three labels. And the protocol will look at each of these individually. It does not look at the entire name.
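[The asymmetry Cary describes, and the label-by-label operation of the protocol, can both be sketched with Python's built-in codec for the current (2003) version of IDNA; the example strings are illustrative.]

```python
# The mapping step in IDNA 2003 (nameprep) case-folds, so a string does not
# always survive the round trip unchanged -- you can start with one Unicode
# string and come back to a different one.
original = "Bücher"
round_trip = original.encode("idna").decode("idna")
print(round_trip == original)   # False: we get back 'bücher', not 'Bücher'

# And the protocol operates label by label, never on the whole name:
name = "bücher.example"
encoded = ".".join(label.encode("idna").decode("ascii") for label in name.split("."))
print(encoded)                  # xn--bcher-kva.example
```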
And this is extraordinarily important when the discussion gets under way. Another one of the standard misconceptions about this is that there are going to be domains that contain nothing but names in a given language, that there will be a Russian domain, there will be an Arabic domain, there will be a Chinese domain. Although it is true that any given label may have recognizable language characteristics, the protocol itself can do absolutely nothing to ensure that an entire name displays nothing other than the characteristics of that language. This is all stuff that needs to be dealt with by policy in registries, and I'll be revisiting that issue a little bit later. Problems become very, very keen -- and Harald, when he turns up, will be talking about these specifically -- Latin characters -- the Latin alphabet is written from left to right. There are alphabets that are written right to left, most notably, Arabic and Hebrew. And there are problems that arise when the directionality changes between left to right and right to left, and there are even more problems that arise when a name contains labels that have differing directional properties. So the bidirectional issue, the bidi issue, causes some significantly nightmarish things that, again, are not going to disappear with the protocol revision. Same thing happens with label separators. We use a dot to separate labels. That dot may not be appropriate in all scripts. I'm going to use the term scripts from here on out and not languages. A script is a set of squiggles that is used for writing a language. But again, the DNS cannot begin to identify language. You know what the language is, but Unicode says that this squiggle belongs to the following script and that squiggle belongs to the following other script, and that's all we've got. The number of languages that share a script can often be large.
Latin script is used for a large number of languages, Arabic script is used for a large number of languages, and many other scripts. It's intricate. So making a distinction between script and language is another thing that's not done enough in this discussion. We talked about U-labels, A-labels, LDH labels. I am not going to get further into this than noting that the sequence of Unicode characters, what you expect to see, what you believe is your name, that's the Unicode representation, the U-label. The A-label is the ASCII encoded version of that, the thing that starts with xn-- and then a whole bunch of nothing other than A to Z, dashes, and numbers, European Arabic numbers. And there's the LDH label, which is the classic thing. That's what we have been using all along. And there are many, many other issues that are involved in this, but they don't need to be enumerated separately. So what we are trying to do is make IDNs significantly more useful than they currently are. Make them predictable, understandable, no surprises when you type in one thing and see something else. But these are mnemonics. These are aids to memory. This is not, jumping to the last bullet point, turning the DNS into a medium of literary expression. It's, in fact, entirely possible to write small poems with dots in between them, a 14-syllable thing, a haiku or however you want to do it, the six-word memoirs that are so popular these days where you have a word and a dot and a word and a dot and a word and a dot. And it's very poetic, but that's a coincidental application of all of this. It is not an intended primary purpose. We want to improve the understandability of these things, the usability of these things. And in the present context, the marketability of these things. And we want to do this in a way that ensures that there's not going to be another major change. And the scenarios that might result in that will also be discussed in greater detail later. Okay. That's it for this part of it.
And we'll get into the registration stuff later. >>TINA DAM: Okay. So let's see if there's any questions for the introductions and the reasons behind why the revision was initiated and where we are going towards. Just go to the microphone, please, if you have.... >>ROBERT MacHALE: Robert MacHale, California, XML user group. I don't know if this is the right time for this question. Can a registry that in the future theoretically implements a dot Hebrew top-level domain enforce a policy that says all the second level names must also comply within the same script as that top level? Is that discussed? Is that possible or what's the path for that? >>CARY KARP: You are asking the wrong organization. You would need to ask the registry who is operating this domain what they have got in mind. The Domain Name System has no intrinsic mechanism that makes this possible. You can't propagate anything across a dot boundary. The IDN protocol looks at a label and then it looks at the next label and it looks at the next label, and although it might be possible for a software implementation to look at these three things and at least expect them to be the same script, because you can't get any closer than that, sure, if a software application wants to do this. But otherwise, ICANN will certainly -- ICANN delegates the top-level domain authority. And you said that you are going to be populating the second level in Hebrew script and nothing else. And we expect you to do that. But then what happens on the third level is beyond that horizon, that contractual horizon. And it's up, then, to the top-level domain operator to require of every second-level registrant that they don't put anything on the third level other than Hebrew script. And somehow, the third level registrants have to be bound to propagate under the fourth and the fifth and the sixth level. So it's essentially, it's something that cannot be known and controlled from the root of all of this. 
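[The kind of per-label check Cary alludes to, where software can at best expect labels to share a script, can be crudely sketched. Python's standard library does not expose the Unicode Script property, so this sketch infers a script from the first word of each character's Unicode name; real implementations use the actual Script property, and the function name here is illustrative.]

```python
import unicodedata

# A crude per-label script inspection: derive a pseudo-script tag from the
# first word of each alphabetic character's Unicode name ("LATIN",
# "CYRILLIC", "GREEK", ...). This only illustrates that such checks happen
# per label, by policy, outside the protocol itself.
def scripts_in(label: str) -> set[str]:
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

print(scripts_in("paris"))         # {'LATIN'}
print(scripts_in("p\u0430ris"))    # {'CYRILLIC', 'LATIN'} -- the '\u0430' looks like 'a'
```

A registry policy might reject labels for which this kind of check returns more than one script, but as Cary notes, nothing in the DNS itself can enforce that below the level being delegated.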
>>TINA DAM: Let me try to explain differently, because I think there's probably something behind your question that's probably important for everybody to understand. The protocol takes all of the characters in Unicode and it says whether or not a character can be used in an IDN. Now, that is across a lot of different scripts being used in a lot of different languages. But it's really up to the TLD registries to decide: do they want to support all of those characters that are valid across all the different scripts and the languages, or do they only want to support a subset of it. So the protocol gives you, like, the very basics of yeah, this character can be used in an IDN, but that does not necessarily mean that it's made available for registration under a certain TLD. >>ROBERT MacHALE: So it sounds like my question is not an IDN question. It's a gTLD question. What does the registry choose to do? >>CARY KARP: It is a registry question on all levels. There's this notion of a domain that represents a language. >>ROBERT MacHALE: Right. >>CARY KARP: And every label that might appear in any -- on any level in that top-level domain is that language and that language only. But there are registry operators who are very distant from the operator of the TLD and maybe different from the operator of the second level domain. >>ROBERT MacHALE: Due to the fact that I probably missed the right session to ask the question, can I just bounce one similar question? >>TINA DAM: Sure. >>ROBERT MacHALE: The difference between a gTLD and a sponsored gTLD I am somewhat vague on. A TLD where the public can purchase a dot address versus a registry where it's confined to a limited set of users, a dot pro. >>TINA DAM: So the gTLDs have previously been split up into sponsored and unsponsored TLDs. The sponsored ones usually have registration restrictions associated. So they are dedicated to a specific market. The unsponsored are open for anybody.
Yeah, it has nothing to do with the protocol at all. So maybe we can -- >>ROBERT MacHALE: A theoretical -- a theoretical registry that was a dot Hebrew that wanted to implement a requirement on the second level must, then, be a sponsored registry that has that control. Otherwise, the public could do whatever they wanted to. >>TINA DAM: No. In the future we don't look to the difference between sponsored and unsponsored. >>ROBERT MacHALE: Okay. >>TINA DAM: And it could be a ccTLD as well. >>ROBERT MacHALE: Okay. >>TINA DAM: So no. And it has nothing to do with the protocol. I am happy to talk to you afterwards. >>ROBERT MacHALE: I want to say thanks to you all. Tina, you have done a great job. I am enjoying this. Thank you. >>ERIC BRUNNER-WILLIAMS: Hi, I am Eric Brunner-Williams, and in 1999 in Working Group C I wrote the definition of what a sponsored registry is. So if you have a question, come and see me. >>OLOF NORDLING: Olof Nordling, ICANN staff. Could you tell us a little more about the horrors you discovered in bidirectional situations? For all of us who like scary stories. >>TINA DAM: Sure. The bidirectional situations are going to be presented by Harald a little bit later in the session. >>BERTRAND DE LA CHAPELLE: Good morning, this is Bertrand De La Chapelle. It's a segue to the first one. When we are moving into a multiscript space, we sometimes do not understand how many assumptions have been implicit in the situation we have with the Roman scripts. The fact that the script was limited to Roman ASCII characters had an obvious consequence, which is that all strings were coherent. When I mean -- sorry, not strings. All addresses were coherent from the beginning to the end. They were all in one single script. It was not a policy decision. It was just a given because there was one script.
But I have never heard the discussion, so far -- "never" is maybe excessive, but I don't know what is the status of the discussion on whether we want to establish a policy of having coherence all through the addresses with all the problem of the successive responsibilities -- >>TINA DAM: Right. >>BERTRAND DE LA CHAPELLE: Or not. >>TINA DAM: Right. I'm sorry, but this workshop is on the protocol revision, and the topic you are raising is relevant but it has nothing to do with the panel that's up here today. >>BERTRAND DE LA CHAPELLE: Okay. But in a nutshell, has this been discussed, and where is the right place? >>TINA DAM: It has not been discussed as a global policy restriction, which means that it might be different for different registries. And some of it might go into implementation. Patrik, did you want to -- >>PATRIK FÄLTSTRÖM: Yeah, the only place where I remember it has been discussed is, of course, in the various registries and top-level domains where, of course, the top-level domain today is in ASCII, but they have introduced IDN registration on the second-level domain. That's one example where you actually have a mixed script or mixed language environment. So -- >>BERTRAND DE LA CHAPELLE: Okay. >>PATRIK FÄLTSTRÖM: But it's also the case that when it has been discussed, it has all the time been concluded exactly as Cary said; that from a technology standpoint, it is impossible or extremely hard to police that case, if it is the case that that rule is actually agreed upon. Because at each level the registry can decide on the policy and do whatever they want. Like in Cisco.com, we at Cisco decide what's delegated there. And many domain holders feel that they should have the power of screwing up their own domain name. >>SULIMAN MOHAMED: Thank you. My name is Suliman Mohamed. I would like to point out, you mentioned the reversing of Arabic language from right to left to cope with other languages. In my opinion, that is a bit difficult; okay?
Because the Arabic language is based on right to left. If you change it from left to right, it will change a lot of its meanings, you know. Furthermore, the Arabic language has supported other Asian languages such as Urdu and Bishtu (phonetic) and others; okay? So in my opinion, there is a group from the Arab League that is working hard to come up with some ideas which can help ICANN and other peoples to have IDNs in the Arabic language. So I hope that any kind of integration between those efforts and your efforts can be -- can be meaningful. Also, the point is that the Arabic language is fully linked to the Holy Koran, which is already built in Arabic, and it is too difficult to come out with revising it. So I think that if your efforts come out to have a solution for the Arabic language -- because, as I said, almost 90% of the characters of other languages such as Urdu, Bishtu, Zezar (phonetic) are the same characters as Arabic. So there is a good chance that support for the Arabic language can solve this, and too many other languages can come out from using the Arabic language in IDNs. >>CARY KARP: Two brief comments. There will be two presentations that address exactly this. And as Patrik pointed out, at the moment every single IDN that is not Latin based is going to be mixed script. And every single Arabic second-level label is going to have this bidirectional conflict. And it is the intention of this exercise to make it possible for right-to-left scripts to exist across the full name. And as you also pointed out -- and there will be a separate presentation about this -- there is a working group addressing the variant uses of the Arabic script in the languages that are written with it, and you will hear about that in detail.
So the issues that need to be addressed to ensure that the entire -- the whole range of language communities that use the Arabic script will have their needs reflected in this -- and again, the impending availability of top-level domains using non-Latin scripts will make it possible for an Arabic domain name, all labels, to share the same unidirectional, right-to-left property. That doesn't mean that problems end, but we will be going into that in greater detail in a little while. So you are welcome back to the microphone when the presentation -- when Ram Mohan has spoken about the Arabic script initiative and Harald has spoken specifically about the bidirectional problems. >>SULIMAN MOHAMED: Thank you. >>YAO JIANKANG: My name is Yao Jiankang from CNNIC. My question is: because under IDNA 2003, there are already a lot of IDNs registered under top-level domain names. Since we will all soon move to IDNA 2008 or 2009, some already registered IDNs may not be suitable for IDNA 2008. So my question is, what is ICANN's policy for dealing with already registered IDNs that are not suitable under IDNA 2008? >>PATRIK FÄLTSTRÖM: Let me start, then. First of all, as I will describe more in my document, we are aiming towards making sure that IDNA 2008 is backward compatible with IDNA 2003. There are, of course, some changes made; otherwise, we would not have needed a revision. But when we have investigated the domain names registered today, we have not found any that cannot be registered under both 2003 and 2008. Okay? So the first thing I want to say is we have not found any, and the risk that we will end up with a problem appears small. And Tina, do you want to say something? >>TINA DAM: Sure, I can add something to it. The view from ICANN is that if -- so what Patrik is saying is it's backward compatible.
But if there should -- since the revision isn't final yet, if there should come up any issues later on, the view is that it's better to make that revision now than it is to make it, you know, in a year or two years from now when we have IDNs at the top level and the problem is going to be much larger. So there's some history there that could be unfortunate, but there's not going to be, as far as I know, right now, an ICANN policy against it. >>YAO JIANKANG: Okay. Thank you. >>OLOF NORDLING: Olof Nordling, ICANN staff. I just wanted to briefly respond, in addition, to Bertrand De La Chapelle's question regarding restrictions to a single script across labels within a domain name. And the -- that idea was brought up, actually, in the GNSO IDN working group, but didn't get much traction for the simple reason that Patrik brought up. Well, it's hard to justify that restriction, and it's impossible to police. >> This is (saying name) from Taiwan. Patrik, I would like to know, as we do the IDN e-mail development, what about Microsoft and Google? Are they involved as part of the working group on e-mail? >>PATRIK FÄLTSTRÖM: This is actually a question that is better for Harald to respond to. He is, unfortunately, not here yet, because he is the one running that working group in the IETF. So please ask him. What I do know is that all major e-mail vendors, including Google and Microsoft and the FireFox people, the (saying name) people, participated very well both in the IDNA work in the IETF and also in the e-mail -- in the e-mail effort. >> But we don't know what is the development plan for Microsoft and Google; right? >>PATRIK FÄLTSTRÖM: At least I don't -- I am not aware, actually. So I can't say that they have said something, but I can also not say that they have not said anything. >> Thank you. >>TINA DAM: Okay. I think that ends the questions for the first part of the presentation.
So Patrik, do you want to continue with the first revision topic on character list generation rules? >>PATRIK FÄLTSTRÖM: Yes. Yes, thank you very much. So what I will go through here is one of the core documents regarding the technical side of the IDNA 2008 standard. This document, which I am the editor of, describes an algorithm that is used to calculate what codepoints of the Unicode standard can be used for domain names -- for Internationalized Domain Names. As Tina said, this doesn't imply that a registry, according to the policy it is using, allows registration of all of these codepoints, but codepoints that are not allowed according to this table, this algorithm, they are just out. First of all, if it is the case that you would like to follow the development of these documents, I just wanted to show a screen dump of what you can see on this Web site that Tina pointed at. On this Web site of mine you can automatically fetch each one of the documents that are part of the IDNA effort, but you can also very easily look at what has changed. So you can look at what has changed on the Web page, and you can see with colors what has actually changed between the different versions of the proposed standard. And the closer we come to completion, this is something that is pretty important to have a look at. And I will show you a couple of things that have actually changed in the tables document. So this feature is actually pretty interesting to use, specifically regarding the actual codepoints, because you can see what has changed in them. So I would like to point out, first of all, a couple of important things from the beginning of the document. From the abstract, it says that this document specifies rules for deciding whether a codepoint, considered in isolation, is a candidate for inclusion in an Internationalized Domain Name.
As you will see later in Harald's presentation, there are some bidirectional rules that deal with the context in which a codepoint exists. So this document only talks about a codepoint in isolation. So the big difference regarding the codepoints that are allowed and not allowed between IDNA 2003 and IDNA 2008 is that IDNA 2003 defined a series of codepoints that were allowed for use in an Internationalized Domain Name. So the table was normative in IDNA 2003. In IDNA 2008, we define an algorithm, and the table that exists in this document is non-normative. I have developed this table with software that I have written myself. I have compared this table with the result of using software that other people have written, but there might still be errors. The reason why we have moved to defining an algorithm is that we would like to have the standard independent of the Unicode version. The current non-normative table in the document is calculated using Unicode 5.1, but when Unicode comes out with new versions, the intention is that the same algorithm will be used with the new version of Unicode and produce a different table. And one of the hardest points we have been working with in the working group and the IETF, together with the Unicode Consortium, is to ensure that the algorithm is defined in such a way that it matches the backward compatibility requirements in Unicode when they come up with new versions. So I have to emphasize really, really, really hard that the list of codepoints that can be found in the appendix is non-normative. Please have a look at it, but it's the algorithm that is normative. If we now move to the algorithm: when doing the calculations, the end result of applying the algorithm to a codepoint is that the codepoint ends up in one of four categories: protocol valid, contextual rule required, disallowed, or unassigned.
The exact definition of each one of those and what they can be used for is specified in some of the other documents. And a side note here is that the definition of unassigned has changed between version 00 and 01 of the tables document that I am now going through. It might be the case that you think it's weird that the definition of "unassigned" has changed, because unassigned is unassigned, but this is one example to show you that this is not an easy task; okay? To be able to come up with a good algorithm, first of all we need to define various classes of codepoints. And the first class is all the codepoints which are letters and digits. One can say that these are the good codepoints that we would like to be able to allow. And the actual formula that is used uses metadata that exists in the Unicode database. And the attribute that we are using for this class is called the general category. A codepoint belongs to this class if the codepoint itself has a general category of Ll, Lu, Lo, Nd, Lm, Mn, or Mc. This is basically lowercase letter, uppercase letter, other letter, et cetera, et cetera. So here you see a list -- on slide 8, you see a list of the various categories in the Unicode standard. So these are the ones that we are accepting, in general. And you can see on this list that there are certain categories which are removed from 2003. In IDNA 2003, we also allowed some graphics characters and musical notation and other kinds of things. But what we have seen is that these things can actually create quite a lot of harm if it is the case that those characters are used in domain names. So we are restricting the set of characters from IDNA 2003. The next category has to do with codepoints which are not stable when normalizing or case folding. What we are trying to do here comes from the fact that in the original DNS, the matching algorithm that is in use says that if it is the case that the character that is stored in the DNS is a U.S.
ASCII uppercase or lowercase character, there should be a case-insensitive match. So you can actually look up characters in either upper or lowercase and you will get a match in the DNS. This is something that is extremely hard to implement in the DNS for other character sets or other scripts than U.S. ASCII. So already in IDNA 2003 we made the decision that we can only allow the registration of lowercase letters. So what we are verifying in this formula here is that if it is the case that we do normalization according to NFKC, and we do case folding, which means moving towards lowercase, and then repeat normalization and case folding repeated times, the codepoint will stay the same. So what fails this formula are, for example, the uppercase letters, or things that are not stable under normalization. The next category catches codepoints which have properties that we would like to ignore. We want to ignore the default ignorable codepoints. We want to ignore the white space. And we want to ignore the noncharacters. This is actually something that is noncontroversial. It's also the case that we'd like to ignore some blocks of characters. Currently, the working group has a consensus to ignore the combining diacritical marks for symbols, ignoring the musical symbols, ignoring the ancient Greek musical notation, and also ignoring the private use area. So the whole idea here is that even though a codepoint is a lowercase letter, if it is the case that it is part of the private use area, then it should not be allowed for use in IDNA. Of course, there are no codepoints like that, but it was just an example. The next set of codepoints are the ones that we are already using in U.S. ASCII. And we would like to be able to ensure that those are still possible to be used.
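The stability formula Patrik describes can be sketched with Python's standard `unicodedata` module. This is an illustrative check, not the normative rule from the draft, which defines the casefold and NFKC operations in terms of the Unicode standard itself:

```python
import unicodedata

def is_stable(cp: str) -> bool:
    """Sketch of the stability test: a codepoint is stable if
    NFKC-normalizing, case folding, and normalizing again
    leaves it unchanged."""
    folded = unicodedata.normalize("NFKC", cp).casefold()
    return unicodedata.normalize("NFKC", folded) == cp

# Lowercase "a" survives the round trip; uppercase "A" folds to "a",
# so "A" is not stable and ends up disallowed.
```

Running `is_stable("a")` gives `True` while `is_stable("A")` gives `False`, which is exactly why only lowercase letters are registrable.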
The next set of codepoints are codepoints where we have found, in the work in the working group, that if it is the case that we apply the rules that we have listed so far and that I have been explaining to you, there are some codepoints that would not be allowed but we want to allow them, or codepoints that would be allowed but we don't want to allow them. So here we have some exceptions. And this table is actually, at the moment, still a moving target in the working group. So this exceptions table is changing now and then. At the moment, the exceptions are the following. There are some codepoints which are valid that would not have been valid otherwise. There are some codepoints that would have been valid, or not possible to use, but we only want to use them in a specific context. And all the ones that are marked as being context "O", each one of them will, in other documents, have a regular expression that explains in what context the codepoint can be used. And there are some additional context characters which are going to be added in the next version of this document. For example, there are some issues with the Sinhala script that we are at the moment working on resolving. The next category is backward compatibility. The reason why we have this category is that if it is the case that the Unicode Consortium comes out with a new version of the standard, and they have to make a change that is not backward compatible with an earlier version of the Unicode standard, the IETF and the Unicode Consortium together can make a decision: either the incompatibility does not lead to any harm, and in that case, we leave it as it is; or we add the codepoint to this currently empty list of codepoints, just like the exception table, and say that, no, this codepoint should actually have this specific value. But, of course, to add codepoints to this category, a revision of the RFC is needed. The next category consists of the join controls.
And the join control characters are special. And we'll hear more about those later, I guess, a little bit about the join controls, maybe. It has to do with the nonspacing marks and other kinds of things. The next category is the unassigned characters. And as I said, this definition has changed from version 00, because what we found was that just saying that the codepoint was unassigned, which means that it did not exist in the Unicode table, was not as exact as we needed. So this is now a proper definition according to the Unicode standard. Given all of those specifications of the various kinds of classes, each one of the codepoints exists in one or more of these categories. And what we then do is use an algorithm: we do various tests, one at a time, and if we have a match, then we stop. The various things we are looking at are, in order: first we look at the exceptions. Then we look at the backward compatibility. Then we look at what codepoints are unassigned. Then we look at whether the codepoint is an ordinary letter, digit, or hyphen. In that case, it's valid. Then we look at the join controls and the contextual rules for those. Then we look for the unstable codepoints, the ones that are not stable under case folding and normalization. Those are disallowed. Then we look at everything that has the ignorable properties, and disallow those, et cetera, et cetera. And at the end, what we end up with are the ones which are valid and the rest, which are not allowed. If you apply this calculation, it leads to the non-normative table that you see in the appendix of this document. So that's it. Any questions? >>TINA DAM: Thanks, Patrik. Maybe you want to hand the cable over as we see if there are any questions. You can also think about questions.
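The ordered evaluation Patrik walks through can be sketched in Python. The sets below are illustrative stand-ins, not the real tables from the draft (the actual exception table has on the order of ten entries and the backward-compatibility table is currently empty), and several of the rules are simplified:

```python
import unicodedata

# Illustrative stand-ins for the normative tables in the draft.
EXCEPTIONS: dict[str, str] = {}
BACKWARD_COMPAT: dict[str, str] = {}
LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")
JOIN_CONTROLS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def stable(cp: str) -> bool:
    # Stable under repeated NFKC normalization and case folding.
    folded = unicodedata.normalize("NFKC", cp).casefold()
    return unicodedata.normalize("NFKC", folded) == cp

def derived_property(cp: str) -> str:
    """Apply the tests in the order described and stop at the first match."""
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    if cp in BACKWARD_COMPAT:
        return BACKWARD_COMPAT[cp]
    if unicodedata.category(cp) == "Cn":
        return "UNASSIGNED"
    if cp in LDH:
        return "PVALID"          # ordinary letter, digit, hyphen
    if cp in JOIN_CONTROLS:
        return "CONTEXTJ"        # contextual rule required
    if not stable(cp):
        return "DISALLOWED"      # not stable under casefold/NFKC
    if unicodedata.category(cp) in ("Zs", "Cf", "Co"):
        return "DISALLOWED"      # whitespace, ignorables, private use (simplified)
    if unicodedata.category(cp) in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"          # symbols, punctuation, everything else
```

Note how the ordering does real work: uppercase letters are in the letter categories, but the instability test runs first, so they come out disallowed; the join controls come out as context-required even though their general category would otherwise make them ignorable.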
This is one of the major revision topics and one of the main things that differ between the two protocol versions. So I don't see anybody at the microphone, so I guess I'll just hand the word to Harald. And, Harald, thank you for joining us. I know you have a busy day as a board member. Maybe I can ask you to introduce yourself as well as you get started on the topic of right-to-left and bidirectional issues. >>HARALD ALVESTRAND: Hello, I'm Harald Alvestrand. In this aspect, I'm speaking as a technical contributor to the IDNA specification. And one disclaimer: I am, unfortunately, language-challenged in that I cannot read or write any language that goes right to left. So when I have been looking at this, I have been forced not to rely on what makes sense, but to rely on figuring out what the rules are and what the consequences of the rules are, and then checking with the language community. That's you. So the impetus for actually realizing that the bidi algorithm in IDNA needed change came from two languages that are written right to left: Hebrew and Dhivehi. My friend Cary is showing -- I think that's a piece of a Dhivehi newspaper. So what you will notice on this page is that almost every single character has something on top of it. It's called a combining mark. And in IDNA 2003, we had this very simple rule for when a label is allowed when it's right to left, which is that the first and the last character have to be right to left. Well, we thought it was simple. It turned out not to be. And so, okay, what's the last character of that string? It turns out to be a combining mark. Now, which direction does a combining mark go? It doesn't. So computers, doing what computers do, say, "This is not a right to left character. We'll just reject the label." So having a whole language that didn't work, I mean, having a whole language for which no words could be used in IDN labels seemed a bit unfair.
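Harald's combining-mark problem can be reproduced with Python's `unicodedata`. The check below is the simplified IDNA 2003-style rule he describes, not the revised rule in the draft, and the sample label (a Hebrew letter followed by a vowel point) is chosen only for illustration:

```python
import unicodedata

def idna2003_rtl_ok(label: str) -> bool:
    """Simplified IDNA 2003-style check for a right-to-left label:
    the first and last characters must have bidi class R or AL."""
    return (unicodedata.bidirectional(label[0]) in ("R", "AL")
            and unicodedata.bidirectional(label[-1]) in ("R", "AL"))

# Hebrew shin followed by qamats, a combining vowel point whose
# bidi class is "NSM" -- it has no direction of its own, so the
# label fails the check even though the word is ordinary Hebrew.
label = "\u05e9\u05b8"
```

Here `idna2003_rtl_ok(label)` returns `False` because the last character is the combining mark, which is exactly the failure mode that made whole languages unusable.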
So I started to dig into what the algorithm actually was and why, and tried to write down a reason why it should be the way it should be. Well, I suppose it made -- I made the assumption that it makes sense that if you have mumble, mumble, mumble, dot, mumble, mumble, mumble, dot, something, and display that, you would like to have the stuff between the dots actually stay between the dots and not jump all over the -- all over your line. So I made up a test that tested some strings against this criterion. And, oh, I saw characters jumping. The Unicode bidirectional algorithm is several pages long and not what I would call the most easily comprehensible algorithm in the world. And so I started working out the rules in more detail, saying, okay, if I want the things to stick together, what are the requirements? And can I get by with having those accents allowed so that Dhivehi or Yiddish could be -- words in those languages could be used as IDN labels. And it turned out that this was achievable at some cost. In particular, the Unicode bidi algorithm does some very peculiar things with numbers. I cannot read the mind of the author of the algorithm, but he has actually tried to explain it to me. If you have character, character, character, number, dot, number, character, character, character, the Unicode bidi algorithm thinks -- it's written with the assumption that the number dot number is actually a single number, and so it will keep that piece together when it rearranges things to fit the bidi structure. The result is that if you have a label that ends with "3," followed by a label that begins with "4," what you will see on the display in some circumstances is: second label, 3.4, first label, which -- how can I say? -- is going to be unexpected. So after a few rounds of fiddling around with this, the details are in the draft. I very much recommend reading the draft. Always useful.
We came up with a set of requirements such that if you follow these requirements, you are likely not to get into trouble. This, unlike the character rules, actually talks about which characters are next to each other. So you have to look at the entire label. And what's even more worrying is the fact that with this number restriction, you have to look at the next label. So you can't just look at the label alone and say, "It's allowed" or "It's disallowed." Or, if you want to just look at one label alone, you have to say anything with a number at one end is disallowed. So I made a decision, on the principle that it's better to make a decision than to worry forever, and said that, okay, if you want to do something in connection with bidi, don't start a label with a number. I'm sorry. If you're 3COM, you just have to take certain precautions so that your domain name doesn't end up next to a right-to-left label. So that's the basic thing that has been done. For technical details, read the draft. And if you have a case where this algorithm for deciding what bidi characters are reasonable to have in a domain name really, really gives you heartburn, and you say you cannot live with that, please tell us, and please tell us why. I'll be around for a few more minutes for questions and comments, and then I'm going to join back to my other meeting with one of my other hats. So if you have questions or comments, please come now. >> Hello. I just want to repeat the previous question, because you were not here. Are Google or Microsoft working on this IDN e-mail together? Or are they participating in the workshops, too? >>HARALD ALVESTRAND: I work for Google. And Google does not comment on unreleased products at all. But as you can see, I'm present. >> Okay. So in that case, if the IDN e-mail protocol I see is coming out from the IETF, it sounds like we need to wait for a while to get mail servers really in operation.
>>HARALD ALVESTRAND: As I said, I cannot comment on unreleased products. From knowledge that I have in the IETF, without my Google hat on, I know that there are people working on interoperation tests. They have actually implemented the protocol both in user interfaces and in servers. And they are testing that it actually interoperates. I have no doubt that when the market pressures are brought to bear, things will happen. But you are the market pressure. >> But that means that for a certain period of time, actually -- for example, if I were in Taiwan and sent an e-mail with an IDN to somebody who does not have a Chinese font, the address would very possibly come out as junk code, right, or some kind of funny code, because he doesn't have a Chinese font. >>HARALD ALVESTRAND: Yes. The trouble of sending e-mail to recipients who support the extensions -- now I'm speaking with both of my other hats, which is chair of the EAI Working Group in the IETF -- the problem of sending mail to someone who supports the extension but does not have the fonts for all of Unicode is an unsolved and probably unsolvable one. If you're sending to someone who is not supporting the extensions, the address he sees will be in ASCII. Luckily, the Unicode Consortium has in fact made available a complete Unicode font, the so-called font of last resort. So I'm hoping that this situation will -- well, it will not go away. But I'm hoping that it will become less common as time goes by. >> So in that case, you actually don't know who sent you the e-mail? Because the code is not readable for a period of time. >>HARALD ALVESTRAND: Well, that occurs even when I don't have that. I actually have an issue with Google Calendar. I invited about ten people to a meeting, using their e-mail addresses. And Google Calendar presented me a list of people who had accepted the invitation. One of those people was shown -- he was actually using Google Calendar, so it was his display name that was shown.
And this was in Japanese characters. So I knew that he had accepted the invitation, but I had no idea who had accepted the invitation. >> Okay. >>HARALD ALVESTRAND: I was confused. >> Okay. >>HARALD ALVESTRAND: So this is a special case of a general problem, and I don't think there's a general solution for it. >> Thank you. >>TINA DAM: Okay. Thank you, Harald. So let's see if there are any other questions around these two topics that Patrik and Harald presented. The character list generation and the properties, or the bidirectional issues. >>BOB HUTCHINSON: Bob Hutchinson. I had a question about -- >>HARALD ALVESTRAND: Closer to the mike, please. >>BOB HUTCHINSON: I had a question about the codepoint isolation. As far as I can see in your algorithm, there's no codepoint filter for glyph folding based on shape. And is that going to be something that will be done at the gTLD levels? Or how will that occur in this mechanism? >>PATRIK FÄLTSTRÖM: Yeah, very good question. Thank you very much for asking that. That's something which I obviously should add to the presentation. The -- the answer is that, no, you're absolutely right, there is no rule for shape or similarities between glyphs. This is partly because, from an IETF perspective, there is no such table that says which glyphs actually look similar. But the reason why such tables don't exist is because this is largely a font problem and a rendering problem, and the same character might look very different depending on what font you are using and also what you are rendering on, what the resolution is, and many, many other things. So you're absolutely right that in this context, similarities between glyphs are not handled -- and that is one of the issues that a registry can look at when it is coming up with a policy for what subset of this large list of codepoints the registry allows.
It's also the case that the ICANN guidelines suggest being extremely careful with the situation where a registry might allow registrations of codepoints from multiple scripts inside the same label, because if you are limiting one label to one script, then, of course, the number of confusables decreases quite a lot. We should, though, remember that we have confusables already in ASCII, for example, between the digit 1 and the letter "L," et cetera, et cetera. But this is, like you are saying, left to policy. Where this is handled in policy is something that is called the language tables. And there is a recommendation for -- and coordination between the IETF and ICANN on -- the creation of language tables that are registered with IANA. And a registry is suggested to come up with a language table for specific languages, or the use of specific languages in specific top-level domains. And that language table can include things like, for example: if a domain name is registered with this codepoint, then this other label with this other codepoint is blocked for registration. Or if you register something with this codepoint, you automatically will get this other domain name registered as well. And this language table is, for example, something which the IDN ccTLD fast-track document refers to: the language table should be part of the application that is sent in. So an applicant that would like to have an IDN ccTLD in the fast track should already have created this language table. And when creating the language table, the applicant is implicitly requested to have a look at confusables. >>SIAVASH SHAHSHAHANI: Shahshahani from dot IR. This is to Patrik. Could you explain some of the controversy regarding the disallowed things. For example, I understand from my Greek colleagues that they have trouble with the case folding of the letter sigma. Could you explain some of that.
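The blocking and bundling rules Patrik describes for language tables can be sketched as a simple variant substitution. The table contents below are invented for illustration (a single simplified-to-traditional Chinese pair), not any registry's actual policy:

```python
# Hypothetical variant table: simplified Chinese -> traditional form.
VARIANTS = {"\u56fd": "\u570b"}  # 国 -> 國

def blocked_variants(label: str) -> set[str]:
    """Return the variant labels a registry might block (or bundle)
    when `label` is registered: each occurrence of a codepoint with
    a variant produces a label with that variant substituted."""
    out = set()
    for i, ch in enumerate(label):
        if ch in VARIANTS:
            out.add(label[:i] + VARIANTS[ch] + label[i + 1:])
    return out

# Under this toy table, registering 中国 would also reserve 中國.
```

Real language tables registered with IANA are richer than this (whole-label variants, preferred variants, multi-codepoint mappings), but the block-or-bundle decision shown here is the core idea.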
Let me also ask my other question, then I'll sit down and wait for your answer. My other question is about diacritical marks. As you know, in some languages, like languages that use the Arabic script, these diacritical marks are important. There is a lot of development taking place regarding the possible use of these. So I want to make a recommendation that you do not extend the list of disallowed things, you know, before giving a chance to people to maybe come up with -- you know, so you can disallow something if it's really critically important to disallow it for the stability of the Internet. But otherwise, you know, I would suggest that you, you know, proceed very cautiously about this. Thank you. >>PATRIK FÄLTSTRÖM: Yeah. I'll give an overview response first, and then I'll ask Cary to talk explicitly about the final sigma, which is what we're talking about regarding Greek. So, first of all, we have decided in the IETF that we are not experts on codepoints. So we are not the ones that can look at each codepoint and decide whether it can be used for this or that, whether a codepoint is a letter or a digit or what it is. We are using the data that the Unicode Consortium came up with. Many of the issues that have been brought up are based on the fact that a character is classified -- sorry, that a codepoint is classified as a certain kind of codepoint in the Unicode tables. For example, not a letter. And then we, in the IETF, say we only want to use letters. Then, by definition, by using an algorithm like this, that codepoint will not be allowed, just because the Unicode Consortium has said that it's not a letter. The question for us in the IETF in the working group is, okay, is it the case that this codepoint is so important that even though it's not a letter, according to Unicode, and we want to use it in -- as a domain name, that we should add it to the exception list?
And so far, we have, as you saw, the order of -- the order of magnitude is ten exceptions. And we have said in the working group that as long as we stay with ten, 15, that order of exceptions, that's fine. But we don't want to have 10,000. Because if we get too many exceptions, then we are overriding all of the work that has been done in the Unicode Consortium. So that's, like, the overall answer. So let's see if I remember the -- maybe you want to say something about the final sigma thing. >>CARY KARP: Approaching this from your own recognition of the fact that there's a caution boundary someplace in this and that someone is going to be injured, someone is going legitimately to wish to be able to do something that just has too dire consequences to make it possible. And, in fact, there are two characters that we're talking about that are near that boundary. But since you asked specifically about the final form sigma, because it is almost a unique case -- the country name Cyprus cannot be written correctly in its own language the way things are. It's a Unicode anom- -- it's not an anomaly. It's deliberate. But it's not possible to preserve the sigma in the way it appears. There are two forms of lowercase sigma. The one is used in initial or medial position in a word, and the other is used in final position. So the word "qupiros" (phonetic) requires this. This will not get through the IDNA process intact. And the question is not whether it needs to be available. It needs to be available. The question is whether the way to make it available is by exception in the IDNA protocol or expecting software implementations, for example, to recognize the fact that a user has just entered a string with a final form sigma, do all of the resolution, do all of the looking up, and then display the answer with the final form sigma. 
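Cary's final-sigma problem is easy to observe directly: Python's `str.casefold` implements the Unicode case folding that maps the final form to the medial form, so the final form cannot survive a casefold-based mapping step.

```python
# U+03C2 (Greek final sigma) case folds to U+03C3 (small sigma).
assert "\u03c2".casefold() == "\u03c3"

# Applied to the country name: "Κύπρος" folds to "κύπροσ",
# losing the final form required to write Cyprus correctly in Greek.
assert "Κύπρος".casefold() == "κύπροσ"
```

This is why the question is where to restore the final form: by exception in the protocol, or in software that remembers what the user typed and redisplays it after lookup.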
So does software recognize user expectation and ensure that it is provided, because it has to do that on some level anyway, or do we address all such detail on the protocol level? And this is a question that isn't resolved yet. Again, the need for the final form sigma is absolute. It has to be provided. But how and where? There are still question marks there. Is that enough? Yeah, okay. >>PATRIK FÄLTSTRÖM: Okay. And then the last issue, regarding being cautious. You said that we have to be cautious about adding things to the forbidden list. It's actually the case that the conclusion very early in the working group was that being cautious cuts the other way as well. So you can see this in two ways. The reason why we don't want to move things away from the -- sorry. The reason why we don't want to move things to the forbidden list is that it might be the case that if something is allowed, then someone registers a domain name with it. And then if we move it to the forbidden list, and vendors implement that, then suddenly that domain name will not be possible to look up. On the other hand, if we have things on the forbidden list early and then add them to the allowed list, then you have software all over the place that will not be able to handle that domain name unless the software is upgraded. So moving things to or from the allowed or the forbidden list is extremely dangerous. Originally in the documents, before the working group existed, there was a suggestion to have a "maybe" category. Maybe yes and maybe no. And the codepoints that we were uncertain of should be in this "maybe" category. But after long, long, long, long discussions, the agreement in the working group, by consensus, was that the "maybe" category should not exist.
Unfortunately, that will lead us to the risk of having something allowed that we move from there, or the other way around. And that's why we are looking very carefully at the various codepoints, each one of them. >>BOB HUTCHINSON: For those of us who were not part of this process and community, can someone give us a picture of what they believe the update cycles for the IDNA protocol will look like? Is that an annual thing? I didn't hear that. Or is it going to be a regular thing? Or is it going to be under the IETF? Or who's going to regulate that? >>TINA DAM: Right. So protocols are developed and revised within the IETF. So even though I'm ICANN staff, it really has nothing to do with ICANN. It's just important and -- well, it's important always. But in the case of IDNs, it becomes more important, because it has to be implemented by registries for registration purposes, and by application developers. And it's an important topic right now, and it has a different effect. The protocol is not intended to be continuously revised. This is supposed to be the revision. And that's why what Patrik just talked about is really important and what Siavash was asking about is really important. Because once you have a character that's disallowed, it's disallowed. And it's not going to become suddenly allowed in a next revision or anything like that. It's disallowed, and it's disallowed for always. And it's understood that that can be unfortunate for some languages. However, as I said in the beginning of my presentation also, not all words in all languages are going to be represented. So there are some characters that are just not going to be valid. And the reason why Patrik wants people to look at the tables, even though they're nonnormative, is that it may be easier to look at a table and see if a character is valid or not valid and have an opinion on that, rather than looking at the algorithm.
But things have to be changed in the algorithm if the result is going to be changed in the table. >>BOB HUTCHINSON: Okay. That addresses -- >>CARY KARP: The primary reason for this revision is to remove the version-dependence on Unicode. As new characters are needed, as communities step forward and wish to participate, to manifest themselves in the digital realm, and realize that their script is not available in the Unicode code chart, they will approach the Unicode Consortium and have their script appear in another version. And that's the version cycle that you may be referring to. >>BOB HUTCHINSON: Yes. >>CARY KARP: And this will automatically be available in IDNA. That's the whole idea behind the revision. >>BOB HUTCHINSON: Because of the filtering he is talking about -- >>CARY KARP: No. Once something is disallowed, that disallowal is on the basis of Unicode character properties. >>PATRIK FÄLTSTRÖM: Yeah. So disallowed is based on the Unicode properties plus the algorithm that is defined. >>BOB HUTCHINSON: Right. >>PATRIK FÄLTSTRÖM: Not the Unicode version. >>BOB HUTCHINSON: Okay. >>PATRIK FÄLTSTRÖM: So when I talked about moving things in and out of disallowed, that was if it is the case that IDNA 2008 is replaced by IDNA 2010 or whatever, and we add the codepoint to the exception list. That is the only way that things can change. Or if the actual properties in the Unicode standard are changing. >>BOB HUTCHINSON: And for those of us who are not in the Unicode community anymore, how frequently is that occurring today? >>CARY KARP: All the time. It is a continual process. If you go to the Unicode Consortium's own site, you will see their roadmap. And that will let you know what is in the queue, and it will let you know how long it is before these codepoints become available.
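The property-based derivation being described here can be sketched in a few lines. This is an illustrative simplification, not the actual IDNA 2008 algorithm: it classifies a codepoint from the Unicode general category that the runtime reports, so the answer automatically tracks whatever Unicode version the system ships, rather than a list frozen at Unicode 3.2.

```python
import unicodedata

def derived_status(ch: str) -> str:
    """Toy sketch of a property-based rule (NOT the real IDNA 2008
    derivation): decide from the character's Unicode general category
    instead of from a version-locked table of codepoints."""
    if unicodedata.category(ch) in {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}:
        return "PVALID"       # lowercase/other letters, marks, digits
    return "DISALLOWED"       # uppercase, punctuation, symbols, ...

print(unicodedata.unidata_version)  # Unicode version the runtime ships
print(derived_status("a"))          # PVALID
print(derived_status("\u00e5"))     # PVALID (å)
print(derived_status("A"))          # DISALLOWED (uppercase is mapped out)
print(derived_status("!"))          # DISALLOWED
```

Running the same rule against a newer runtime (hence newer Unicode tables) classifies newly encoded letters without any change to the rule itself, which is the version independence being discussed.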
But a rough answer to your question is, certainly once a year a major new crop of characters will become available in all contexts that use Unicode, including the impending Unicode-version-independent IDNA. >>PATRIK FÄLTSTRÖM: And this is why we have found it so important to be Unicode version independent, because the current version of IDNA is locked to Unicode 3.2. >>CARY KARP: They are now at 5.1. >>PATRIK FÄLTSTRÖM: And they are now at version 5.1. And that is a big problem, specifically in the implementations of Unicode that exist in operating systems. It is, in reality, impossible for an application to know what version of Unicode the operating system has. >>BOB HUTCHINSON: So not to belabor this, but from a codepoint viewpoint, revision of the protocol, of Unicode, will result in a superset of the previous version in almost all cases? Is that a correct statement? >>PATRIK FÄLTSTRÖM: That is absolutely correct. And this is why the Unicode Consortium and the IETF have been working extremely hard to make sure that the rules in the algorithm are selected such that they match the backward compatibility requirements the Unicode Consortium is using themselves. So when the Unicode Consortium is coming out with Unicode version 5.2, it should not change the value, whether it's allowed or disallowed, for any of the codepoints that were in 5.1. So now the question is where the rubber hits the road: how good is that backward compatibility? From version 4 of Unicode, which is more than one major version back, when I had a look, we are talking about changes to maybe four codepoints. So it's extremely few. And for all of those four codepoints, even though I don't understand those scripts, by looking at the codepoints I can see that those four changes are actually bugs that are corrected in the Unicode standard. So I am not worried. >>MING-CHENG LIANG: Yeah, this is Ming-Cheng Liang from TWNIC.
I have a question, maybe not completely related to the Unicode itself. I think you were talking about Arabic, and different languages may be written from right to left; right? And I am just wondering, because I heard a lot of disturbing news about that. Of course the writing is from right to left, that's okay. But will the order of the fields be changed? Because at the present time, reading from right to left, we go from the top level to the bottom; right? And will that be reversed in these languages, or will it stay the same? >>HARALD ALVESTRAND: I have bad news for you. This is going to be confusing. If you have a domain name that consists of only labels in right-to-left script, the top-level domain will be to the left. If you have, on the other hand, a domain name where the top-level domain is left to right, whether it appears to the left or the right of the right-to-left part will depend on the direction of the paragraph in which it is embedded. This is described further in the draft. It is going to behave consistently, but it won't be pretty. >>MING-CHENG LIANG: My second question will be, if that's the case, then suppose we send an e-mail to people using right to left, or when they send it to us and we want to return it, what will happen? It will cause a problem; right? Because the top level is reversed. Will that cause a big problem in these applications? >>HARALD ALVESTRAND: As soon as you have the thing in the computer, it will be in the normal order of things, where the lowest-level domain comes first, in whatever way it is stored inside the computer. It is only on the display that it will look very confusing. So we hope that people who are used to handling mixtures of right-to-left and left-to-right text will do the right thing. But that's a user interface issue. The computers will handle it. I'm not sure the people will.
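Harald's distinction between stored (logical) order and display order can be illustrated with Python's built-in codec, which implements IDNA 2003. The Arabic label below, مثال ("example"), is the one used in the IANA IDN test domains; in memory it is simply a sequence of codepoints with the lowest-level label first, and the per-label encoding is unaffected by how a bidi-aware renderer later displays the string.

```python
# A domain stored in logical order: the Arabic label comes first,
# then ".com".  Display order is purely a rendering matter; the
# encoding below works on the logical string, label by label.
domain = "\u0645\u062b\u0627\u0644.com"   # مثال.com

ace = domain.encode("idna")
print(ace)   # b'xn--mgbh0fb.com'
```

However a browser chooses to lay out the mixed right-to-left and left-to-right parts on screen, the bytes sent to the DNS are the same.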
>>TINA DAM: I just want to remind people in the room that if you think what Harald was just trying to explain is confusing with the right to left, well, then, maybe now you understand how people who are used to using right-to-left languages are feeling when things are in ASCII. So yeah, it's going to be more difficult for us who are very comfortable with the current system, but, you know, a lot of people are not very comfortable with the current system. So things are confusing for them today. Now, are there any other questions on these two topics we just went through? So the character property rules with Patrik or the bidirectional issues with Harald? We spent about an hour on these two topics, and these are the major topics in the revision. So if you have any questions about it, now is a really good time. And if not -- >>HARALD ALVESTRAND: Then I will say thank you for listening. And I'll have to run, but I will be available on e-mail, always. >>TINA DAM: Thank you, Harald. I'm sorry you are double booked and couldn't stay with us. But we are going to move on to the next topic, and that's implementation. And I think this is going to be a little less technical than the previous section. You are going to have three presentations. There is going to be one on registration, there is going to be one on resolution, and then we are going to hear from the Arabic Script Working Group on their example of how to implement it. So Cary, I think you are up first; yeah? >>CARY KARP: Nonetheless, I am going to ask, just for the sake of my own curiosity, the question about how many of you have understood this on the level that it is being presented. Still a whole bunch of hands? Did you guys get through Patrik's presentation? Ah, fewer hands. Okay. All right. Now I'm going to start talking about what is probably of most immediate concern to the present audience, and that is how the changes are going to influence the process of registration.
Currently, under IDNA 2003, there is a lot of character remapping. There are any number of ways that a sequence of Unicode characters -- a user has an expectation, "this is what I believe the label to be," and in my computing environment, I am now keyboarding that sequence of characters. And, in fact, there may be a number of normalization conversions. Using the Swedish alphabet, for example, which is a 29-letter, Roman-based alphabet, there is an A with a ring over it, an A with two dots over it, and an O with two dots over it, and these are not regarded as diacritically marked O's and A's. These are three integral letters. They are separate. And everybody in Sweden realizes that the outside world might think that a Swedish two-dotted A is the same thing as a German umlauted A, but they are wrong. And all of the rules about collation and the things that Germans do with the umlauted A are different from what the Swedes do with theirs. It's the 28th letter in a 29-letter alphabet. But it is entirely possible for that to be represented in different ways. We can take an A and then add an umlaut to it, or we can take the Unicode character which is the umlauted A, and IDNA 2003 is very, very good about simply tending to these details; it's actually using Unicode functionality. But there are other examples that aren't as clear-cut. If we take an Arabic-based character and put a combining mark on that, there is no normalization. These are two separate codepoints. And all of these things have been dealt with entirely transparently thus far. But these remappings result in a situation where someone types in a sequence of characters, and the changes are made, and they get back a result that isn't what they were expecting. For example, the final form sigma will be replaced by a medial sigma.
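Cary's Swedish example can be made concrete. Under IDNA 2003 the precomposed letter and the base-letter-plus-combining-mark spelling are unified by Unicode normalization, which is part of nameprep; the sketch below (using Python's built-in IDNA 2003 codec) shows the two spellings of ä collapsing to one form.

```python
import unicodedata

precomposed = "\u00e4"    # ä as a single Unicode codepoint
combining = "a\u0308"     # "a" followed by COMBINING DIAERESIS

# As raw codepoint sequences they differ...
print(precomposed == combining)    # False

# ...but NFC normalization folds them into the same sequence,
# which is why IDNA 2003 treats them as the same label.
print(unicodedata.normalize("NFC", combining) == precomposed)   # True

# Both spellings therefore yield the same ASCII-compatible label.
print(precomposed.encode("idna") == combining.encode("idna"))   # True
```

For a pair like this the remapping is harmless; the problem cases Cary goes on to describe are those where the transparent conversion returns something the user did not expect.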
The level of literacy of the user has been reduced to that of a small schoolchild first learning to write and not quite understanding these distinctions. In IDNA 2008, all that remapping is going to disappear. What is registered is exactly what a user will be seeing and can input. And the registry may wish to somehow reintroduce the functionality that currently exists, but it's not going to work reliably. There is a distinction being made now between how one prepares a string for its registration and how one deals with addressing a search to the underlying database. And application software -- that "A" in IDNA -- means it's not the DNS that's doing this stuff. It's the application doing this stuff. And there's a far greater degree of clarification here, which means that registry operators -- and I am not going to talk only about top-level registry operators. It's absolutely crucial for you to understand that all that we are saying applies on every single level of the Domain Name System. The protocol is going to be doing less, but it is going to be doing it with far less ambiguity, as well. And the registry operator needs to understand the consequences of the things that they are accepting for registration. In order to keep this safe and reliable, there are going to have to be clearly articulated policies on the registry level. Patrik's principle that people regard it as entirely within their right to mess up their own registries is actually true, and there is a general principle which is fundamental to ICANN, and that is that the further something gets away from the root of the DNS -- the lower the level we are talking about -- probably the lower the risk is of the entire Internet being brought to its knees by an improper label. So again, the parent of any zone is capable of dictating policy requirements, but there's no mechanism inherent to the DNS that propagates them to the lower levels.
And registries are simply going to have to start thinking about these things. There's a shared responsibility between the registries and their registrars for enforcing this. And I hold the pen on the ICANN IDN guidelines, and I'm pretty sure that once the protocol revision is done and the discussion about the introduction of IDN -- localized -- top-level labels into the root is no longer tentative, those guidelines are going to need to be expanded significantly. They are going to need to be intrinsically compelling, so that any registry that looks at this stuff, again on any level, is going to say, wow, now we understand what we need to do, and as good network citizens, we are going to do it. In the familiar environment, you can actually register anything you want. You can register strings that are really going to cause damage, and we are trying with the IDNA revision to reduce the potential for doing that damage, but there's no way that it can possibly be eliminated on the protocol level. That's right, isn't it? >>PATRIK FÄLTSTRÖM: (Nods head.) >>CARY KARP: See? Now, this is one of John's slides. I have it here simply to show to you. I'm not sure about the significance of the arrows, in fact. But again, there are no specifics in the protocol. What a registry puts into its database is either permissible on the protocol level or not. And any ramifications that that might have for user expectations being thwarted are the registry's concern, not the protocol's concern. So there's this pre-protocol thing where somebody will type something using an arcane Unicode codepoint, and at some point that has to be normalized. We realize you can write it this way, but we can't register it the way you are asking us to; we are going to register something that's absolutely equivalent, and we are going to need to explain to you what the difference is.
There is a concept of bundling where there are genuine orthographic alternatives -- where some people in the writing community will do it this way and some will do it that way, and both have to be accommodated. The registry can deal with that: as long as both ways are permitted by the protocol, we will bundle. You will register this one and you will automatically get that one, which means that a user looking for whichever will still find you. That's probably something that's going to be developed a lot more than is currently understood, perhaps also on the top level. I'm not sure I am allowed to say that. But the notion at some point will be revealed that someone has a very legitimate case to make for a top-level label that cannot be represented uniquely. The community is too large and orthographic practice simply varies. So the notion of discomfort and accommodation is bidirectional here. Not in the bidi sense, but the protocol engineering component of this needs to accommodate perspectives that it has not even been aware of before. There's something implicit in this. The language communities themselves have to be aware of all of this and articulate their needs clearly. And we're assuming that registries and registrars will be mediating in that process as well. So in one very, very warming sense, this heightens the sense of commonality, communality, in our single Internet. And if we are not capable of meeting all of the challenges, well, then it will become less our single global Internet. Registries have to talk about the scripts that they will handle. And as Patrik pointed out before, the permissibility of a given character in one context does not imply that it's permissible in all contexts. And the contextual restrictions have to be very, very carefully articulated, again by the registries on all levels.
If a top-level domain wishes to propagate a given policy throughout its domain, there will need to be some contractual intricacy that is yet again entirely on the other side of the protocol horizon. There are no global rules here. One registry using one -- a language that requires one script and another registry using another language that requires the same script may have entirely different perspectives on the notion of -- notions of bundling, what is permissible, the contextual rules may vary. The crucial thing here is that this is all validated. And it's on the public record. There is this IANA repository of IDN registry practices? Something like that. But there is, nonetheless, a single repository which is not structured to enable automated lookup and validation but does make it possible for a generic TLD registry wishing to implement support, for example, for Arabic to see which of the Arabic CC domains that do conduct their business in Arabic and support the Arabic speech community, what are they doing. And the other way around. A gTLD registry may need to devise policies that are globally applicable, and a ccTLD registry supporting one instantiation of that may receive guidance from what the G's have figured out. And a lot of it is credibility, but this is a consumer environment. If you wish to purchase an IDN label in a registry that can't possibly understand the intricacies of your language, and you are doing so deliberately because you know that the policies are correspondingly lax, well, caveat emptor and caveat whatever the Latin word is for "user." I don't really think much more needs to be said about that. The validation and all, we've talked about. 
And again, in summary, you end up looking at what the user wants, modifying it as you regard absolutely necessary, accommodating user expectation, and then it goes into the zone, and everything that happens from there on out is up to what the applications developers decide should happen. We have already talked about this, that a given script can be used for a large number of languages, and the requirements and perspectives and traditions and literary expectations can vary significantly, and it will be necessary to make tradeoffs within script communities, so that -- let us assume that there is a dominant language using a script -- its manifestation in the identifier space doesn't cause real anguish for the smaller communities. A large part of this is enabling the smaller voices to be heard in all of this. And the pioneering effort here was the Chinese, Japanese, Korean -- JET -- unification -- well, it was Unicode that decided to unify all of this, but the notion of variant: that in varying language contexts, the following characters are all to be regarded as identical. So this concept of bundling was born then. And the most recent effort is underway in the Arabic script community, and Ram will be speaking about that next. No, Patrik will be speaking next and then Ram will be speaking. Okay. So cooperation is the name of the new game, simply. It's very difficult to imagine that a registry would deliberately wish to cause user confusion and anguish, because ultimately, the user community is going to note this and is going to react. And one of the things that's currently happening, which is really unfortunate, and we don't know how the protocol revision is going to affect it, is that software developers -- most particularly, Web browser developers -- are really skittish about this. They don't want anybody to be injured by being lured to some nefarious Web site that is exploiting the similarities between different scripts.
This has happened and it continues to happen. So one browser developer will reduce this risk, will protect its users, in one way. Another browser developer will protect in a second way, and a third browser developer in a third way. Which means that what will happen to user expectations depends on what browser you are using. And that is as counter to the notion of uniform procedures and supporting user expectations as it possibly can be. And the ultimate fall-back is the ASCII-compatible encoded representations of the expected labels, xn--. Nobody wants to see these things, and the current browser practice is, the moment there's any suspicion that anything is going on, we'll show people xn--, which will thoroughly confuse them. And that's the whole idea. It will call their attention to the fact that things are not what you expect them to be. Some of that does need to be addressed. That's it for this. So if you want to open the mic. >>TINA DAM: As I mentioned earlier on, and not all of you were here in the beginning, but we have open microphone throughout. So if you have any questions, you can always come up. You can also interrupt speakers if you want. But we're into the implementation part. And what I had mentioned earlier on was that one of the new things with the protocol is that the registration step was separated from the resolution step. So Cary just talked about the registration, and Patrik is going to take over and talk about the resolution area. So if there are no questions, then, Patrik, you should go ahead. >>PATRIK FÄLTSTRÖM: Okay. So resolution. The resolution consists of basically four steps, if you look at things from a very high level. You have to find the domain name; the application needs to know what the domain name is. We need to convert it into a normalized Unicode string. Apply Punycode, which implies turning it into ASCII. And then we do a DNS lookup.
If we look in more detail at what this actually might mean: we have a URI, and the URI itself is entered over the keyboard in one way or another, and it might include funny characters, non-ASCII. From this, the application extracts the host name. And the host name in this case is the Swedish name ledåsa.se, which is the name of our farm. So this is actually a domain name that exists. So you can all go there, test your browser, see whether it works, and see what I think is a pretty okay picture of my house. There's also a map so you can see where I live, and road directions if you want to come by, have a cup of coffee and talk more with me about this issue. So after finding the domain name, we need to convert it into Unicode. And let's say that when entering the characters, we are not using Unicode or UTF-8. We are using ISO 8859-1. In that case, the A with the ring is represented by 0xE5. This is hexadecimal notation for how it is stored in the computer. We turn this into Unicode, and the Unicode codepoint is actually also E5, because for ISO Latin 1, the codepoints are the same. So the only thing that happens is that we get l-e-d, then the Unicode character for A with a ring, then s and a. We are still talking about characters here. The next thing that happens is that we take this string, and if it is stored as a UTF-8 encoded string, then the A with a ring will be stored as C3 A5. So if this is UTF-8 encoded Unicode, we have it stored as what you see there in red on the third line. We turn this into Punycode, and you will get back xn--ledsa-ora.se. We look it up in DNS and get back an IP address. If this was a little bit confusing, let me summarize and take this one more time. In the left column you see what we are starting with, and in the right column you see what the result is. First row, take the URI, extract the domain name.
Next row, take the characters in the local character set used and turn them into the characters in Unicode. Third row, turn the representation in Unicode into the Punycode representation of the string. And the last row is the DNS lookup. So let us take these one at a time. First you enter the domain name, for example, via the keyboard. You might receive the domain name via an e-mail, or have a link in a Web page you want to click on. You may have some mapping of characters that you would like to do: maybe map the ideographic full stop into a period, or you might want to map sharp S (ß) into double S, for example. All of these things may be very much locale dependent, according to the localization of your operating system, your application, or other kinds of local policies -- local policies that make the computer easier to use. When you have done all those kinds of things, you extract the domain from the international URI, from the e-mail, et cetera. The next thing we have to do is convert from the local character set to Unicode -- for example, from ISO 8859-1 to Unicode. We also have to turn the nonnormalized string into a normalized string. Here's one example of how that can be done. It might be the case that A with a ring is represented as one character in Unicode, which is the character with the number E5, but it can also be represented as two characters, where the first one is "A" and the other one is a combining diacritical mark, which is the ring. So normalizing a string which happened to have "A" followed by the combining diacritical mark means that we are replacing the "A" plus combining ring with the single character A with a ring. The next thing we have to do, of course, is to ensure that this domain name is valid.
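The conversion steps just described can be sketched end to end for ledåsa.se. Assumptions: the name arrives as ISO 8859-1 bytes, and Python's built-in codec (which implements IDNA 2003, not the revised protocol) stands in for the conversion to an A-label; the last step would be an ordinary DNS lookup of the ASCII form.

```python
import unicodedata

# "ledåsa.se" as ISO 8859-1 bytes: å is the single byte 0xE5.
raw = b"led\xe5sa.se"

# Local character set -> Unicode, then normalize to NFC.
name = unicodedata.normalize("NFC", raw.decode("iso-8859-1"))
print(name)                  # ledåsa.se (å is codepoint U+00E5)

# In UTF-8, the same å occupies the two bytes C3 A5.
print(name.encode("utf-8"))  # b'led\xc3\xa5sa.se'

# U-label -> A-label via Punycode.
ace = name.encode("idna")
print(ace)                   # b'xn--ledsa-ora.se'

# The final step is a plain DNS lookup of the ASCII form, e.g.:
#   socket.getaddrinfo(ace.decode("ascii"), 80)
```

Note how the codepoint number (E5) happens to match the Latin-1 byte, while the UTF-8 and Punycode forms are different representations of the same name.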
And that implies looking at the tables, looking at the algorithm that I presented in my previous presentation, looking at the bidi rules that Harald explained, and all other kinds of things. When this step is done, then we know that we have something that can be a domain name. The next thing we have to do is to turn what we call a U-label into what we call an A-label. We do that by applying Punycode. And this is exactly the same definition as we had in IDNA 2003. It is defined in RFC 3492, and it turns ledåsa.se into xn--ledsa-ora.se. When we have this domain name, we do a DNS lookup. And in this example, I just use dig on a Unix host and get the address back. Done. So, you see, it's very easy. >>TINA DAM: I don't think everybody here actually feels that things are as easy as they did in the beginning. But, again, this is pretty important. There are the registration steps and there are the resolution steps. So are there any questions on it? And you can also just come up and say, "I didn't understand this at all. Can you do something to help me on it?" But.... >>CARY KARP: Or if anybody doesn't feel adequately confused. >>PATRIK FÄLTSTRÖM: You have to wait until it's red. >>CARY KARP: Or if anybody is not adequately confused yet, we can also accommodate you. >>TINA DAM: Right. Well, if there are no questions -- or while you think about questions -- I'm going to ask Ram Mohan to come join us. And, Ram, did you have some colleagues from the Arabic Script Working Group you wanted to -- maybe I can ask you to introduce them. But as you're setting up, maybe I can just mention -- I think it's been mentioned a couple times in this session already -- there are some scripts that are used across languages. And that can make things more complicated than if they were just used within one language. And because of that, it's really important that there is coordination among those language groups.
Arabic is one of those scripts. And Ram, along with a lot of people, has been working on an Arabic script coordination group, and they have been making a lot of great progress. So he's going to give an example of how things are being implemented on top of the protocol. Cyrillic is another one of these scripts. And it was just mentioned earlier this week that different Cyrillic language groups are getting together as well. And they're doing work, as we're going to hear from Ram on the Arabic. So, Ram, I think you're plugged in. Jason, can you switch to Ram over on the podium. >>RAM MOHAN: Thank you. Before I start, I'd like to invite my colleagues from the Arabic Script Working Group up on stage, if you would take a minute. We have Ayman El-Sherbiny from UN-ESCWA; we have Alexa Raad from PIR; Manal Ismail from the NTRA in Egypt; we have Siavash Shahshahani from IRNIC. I invite these folks on stage primarily because I don't read or write Arabic, either the language or the script. I recognize certain characters in it. And that's because there are characters in it that are used in Hindi, which I happen to know as an Indic script. But this is representative. There are many more. So I'm going to take just a few minutes, and I'd like to start with a little game, if you will -- a question for you. Can you spot the difference between this, what you see on your left, and this? How many of you can spot the difference? Can you? Okay, there's a bunch of folks who can spot the difference. And you think the difference is that there is a circle versus a diamond; right? Okay. Well, it's actually a bit more complicated than that. How about this? What's on the left in red is a particular character. It can be written two ways. They look the same to us as human beings; right? What's in red versus what's first in black, they look exactly the same.
However, in the way they're composed, to a computer they actually look like two completely different sets of characters. Therefore, if I registered a domain name with what's on the left, U+0623, that would, to a computer, look completely different and unique as compared to U+0654 plus U+0627. How about this? You know, on the two sides, you see a dot and then a character; right? Do they look the same? Do they look the same to you? I mean, does anybody see anything different in the first character? They look the same to us; right? But, actually, to a computer, they're completely different. The first one is U+06F1. And the other one is U+0661. So when you look at the way technology and things have been digitized, things that look the same to us have actually been coded quite differently on a computer. And the Arabic language itself, which many people equate with the script, uses only a part of the larger Arabic script table. And it is for these kinds of interesting problems that the Arabic Script IDN Working Group got started. We're a self-organizing group. Interested folks got together, because we think that to actually get Arabic script implemented in internationalized domain names, you've got to go beyond looking at things like gTLDs and ccTLDs. You've got to go beyond looking at how you implement a language. And you have to start thinking about what you do when you have a one-to-many problem: one script, many languages. Our goals are to establish a framework for the implementation of IDNs in Arabic script, which, as you can see from the examples at the start, takes some work. You have to say, "What looks confusing?" And how do you make sure that what looks unique to a computer but the same to a human eye can be harmonized and brought together.
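Ram's two examples behave differently under Unicode normalization, which is worth seeing concretely. Alef with hamza above (U+0623) canonically decomposes to alef (U+0627) plus the combining hamza (U+0654), so NFC normalization unifies the two spellings; the two look-alike digits (U+06F1 extended Arabic-Indic versus U+0661 Arabic-Indic) have no such relationship, and telling them apart is exactly the kind of policy problem the working group is addressing.

```python
import unicodedata

composed = "\u0623"           # أ: alef with hamza above, one codepoint
decomposed = "\u0627\u0654"   # alef + COMBINING HAMZA ABOVE

# The hamza case: distinct codepoint sequences, unified by NFC.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# The two "1" digits look identical but stay distinct under any
# normalization form: confusability here is a registry-policy issue,
# not something the protocol or Unicode normalization resolves.
print("\u06f1" == "\u0661")                                  # False
print(unicodedata.normalize("NFKC", "\u06f1") == "\u0661")   # still False
```

The first pair is a solved encoding problem; the second is a genuine confusable that only table-level policy, such as variant bundling, can handle.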
And to do that, you consider and bring together technical, linguistic, policy perspectives, and, you know, find out what the problems are, come up with recommendations and guidelines, where it's feasible. And potentially what we're doing may actually serve as a best practice for other scripts to follow. We currently have some level of representation. There's a list of countries out there. I won't read through the list. There are a few companies. The United Nations, UN-ESCWA, is a sponsor. APTLD, Afilias, PIR, ICANN, ISOC in Africa. We've invited experts from Unicode, from the IETF. And several languages are currently represented in this working group. Our principles are pretty clear: Go with standardized solutions, standardized methodologies, try to make it as extensible as possible. Try and keep it simple and transparent. And you'd be shocked at how easy it is to write it versus actually practice it. And try and do fast and easy, which, given that much of the DNS was written or was thought of without Arabic script in mind, fast and easy is actually a pretty significant challenge. And as I had mentioned earlier, we really don't -- aren't looking at gTLDs versus ccTLDs. It's an open working group. We don't vote. It's consensus-based. We have one requirement. If you want to participate, if you want to be in the group, you better come and do something and contribute real knowledge, talent, and ability. There are some people who just lurk on the group. That's fine. But if you have -- if you want to actually participate, we ask for DNS and standards experts. We ask for linguistic community experts and registries that are implementing the Arabic script or plan to implement the Arabic script, we want them to come work with us. Because what we're doing is not sitting and talking about things for a long time, but actually moving from speech to action as quickly as we can. We've met twice so far. They're pretty good meetings, don't you think? We begin at 8:00 in the morning. 
We go to 11:00 at night. And then we begin the next day the same way. And -- but what we have done as a result, we have actually reviewed the Arabic code block. We've agreed on a set of recommendations. And we've sent them to the IETF, to the IDNA protocol group. And what that has done is to say what characters are allowed, what characters have been disallowed. That part is done. We've also come up with definitions of the variants at script level versus registry level. Siavash, can you jump in and say why we have started to look at the variants at script versus registry level. You were talking about case -- you know, the case example, or the lack of cases in Arabic. >>SIAVASH SHAHSHAHANI: If I understand your question, you are talking about why we don't have just capital and small letters? >>RAM MOHAN: Right. >>SIAVASH SHAHSHAHANI: Okay. I guess in the Arabic script, you don't have capital and small letters as such. One letter could be represented in up to four forms, depending on where in a word it occurs, and depending on what letter that is. So is that what you wanted from me? >>RAM MOHAN: Yes. >>SIAVASH SHAHSHAHANI: So, I guess, instead of two cases, it could have up to four cases of a letter, if I understood your question correctly. >>RAM MOHAN: Yes. Thank you. So it's -- even the same Unicode character, even though it looks the -- it doesn't look the same, but it actually changes form based on where it is inside of a given word. >>SIAVASH SHAHSHAHANI: No chance for case folding here. >>RAM MOHAN: Right. So this is one of the outcomes. What you see up on the screen is across Arabic, Persian, Urdu, and Pashto. Let me skip back to the screen with the Arabic language. What's highlighted in red is the Arabic language, but what we've actually come to in the space of a few meetings and a lot of intense discussion is, actually, this. Arabic, Persian, Urdu, Pashto.
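[Editor's illustration, not part of the workshop materials: Siavash's point that one letter keeps one codepoint across its four positional shapes can be seen in the Unicode data itself. In this Python sketch, the legacy "presentation form" codepoints for the different shapes of the letter beh all fold back to the single base codepoint U+0628 under compatibility normalization (NFKC):]

```python
import unicodedata

beh = "\u0628"            # ARABIC LETTER BEH: one codepoint for every shape
initial_form = "\uFE91"   # ARABIC LETTER BEH INITIAL FORM (legacy codepoint)
final_form = "\uFE90"     # ARABIC LETTER BEH FINAL FORM (legacy codepoint)

# A shaping engine picks the visual form; the stored text stays U+0628.
# The legacy presentation forms normalize back to the base letter:
print(unicodedata.normalize("NFKC", initial_form) == beh)   # True
print(unicodedata.normalize("NFKC", final_form) == beh)     # True
```

[This is why, as Siavash says, there is no case folding to do: the form variation lives in rendering, not in the codepoints a registry stores.]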
We have made significant progress in actually identifying the characters that should belong in a unified script table. And we've also come up with some recommendations. There are several on the screen. Again, I won't go through all of them. But I'm actually wondering, Ayman, would you take a minute and just could you speak to the numerals, what we -- what the problem is on the numerals and what we actually have tried to do. In the meanwhile, I'll bring up the slide with the numerals on them. >>AYMAN EL-SHERBINIY: Okay. In fact, if you go back to the Unicode table, you would see that we have, like, these two rows that you see horizontally, you will find on the big table. >>RAM MOHAN: You want the table? >>AYMAN EL-SHERBINIY: Yeah. >>RAM MOHAN: There. >>AYMAN EL-SHERBINIY: To the -- we see to the left side, we have two borders with red, this one that he is pointing to is what we usually know as the Arabic numerals. In fact, they are the -- what we like to say better, used in the Arabic world. They are, in fact, called Indic. But, anyways, these are 0, 1, 2, 3, until 9. If you go to the very right-most side, here you find 0, 1, 2, 3, then some changes in the 5, 6, and then the 7, 8, 9. The problem is we have identical sets which are 0, 1, 2, 3, and then 7, 8, 9, very identical, exactly the same looks, same glyph, but with different codepoints, which is -- this is the worst-case scenario, identical, and not only letters, but digits, which cause all the problems. So we had to discuss a lot about what should we do about that, eliminate, for example, the right-most side and keep only the left? But this is not useful for some languages, like the Pashto or the Urdu and other languages. So this is still one of the very controversial issues. And we decided to at least not allow mixing between the digits within the same label.
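[Editor's illustration, not part of the workshop materials: the no-mixing rule just described can be sketched as a simple label check. The function name is our own invention; it treats the Arabic-Indic digits (U+0660 to U+0669) and the Extended Arabic-Indic digits (U+06F0 to U+06F9) as mutually exclusive within one label:]

```python
# The two Arabic digit ranges discussed above (ASCII 0-9 could be added
# as a third mutually exclusive set under the same approach).
ARABIC_INDIC = {chr(c) for c in range(0x0660, 0x066A)}
EXTENDED_ARABIC_INDIC = {chr(c) for c in range(0x06F0, 0x06FA)}

def digits_consistent(label: str) -> bool:
    """Return False if a label mixes the two Arabic digit sets."""
    uses_arabic_indic = any(ch in ARABIC_INDIC for ch in label)
    uses_extended = any(ch in EXTENDED_ARABIC_INDIC for ch in label)
    return not (uses_arabic_indic and uses_extended)

print(digits_consistent("\u0661\u0662"))   # True: one digit set only
print(digits_consistent("\u0661\u06F2"))   # False: mixed digit sets
```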
So for the moment, either use completely the 0, 1, 2 until 9 from the middle side of the table or the right side of the table, but not mix them together. But I want to go back again to the table of outcomes. >>RAM MOHAN: Oh, table of outcomes. I'll get there in a second. There we go. >>AYMAN EL-SHERBINIY: Yeah, the one before you. >>RAM MOHAN: This one? >>AYMAN EL-SHERBINIY: No. The table itself. >>RAM MOHAN: Ah, the table itself. There. >>AYMAN EL-SHERBINIY: This one. In fact, this is not an easy outcome that we reached. And the title, it says accepted characters for Arabic, Persian, Urdu, and Pashto. This is actually a simplified form. It could look like we just identified the codepoints. But this is not the only underlying fact of the activity. The activity was, in fact, to define the language tables per se, which is something that is of essence and essential, as one of our colleagues, probably Patrik, has been describing, the need for a language table. The need for a language table is very important for registration and for IANA and for ICANN purposes. And this is what we call the bottom-up approach. We defined language tables, and we, like, created a union, or the meta table, of the languages. So these are not only the languages defined on the table, but this is the script that is a result of defining the tables and identifying the variants as well. So I wanted to point out that as languages are defined, the language tables are defined, this table would be fine-tuned. And this is what I wanted to say. Thank you. >>RAM MOHAN: Thank you. I'm just wondering also, Manal, would you be able to spend just a minute speaking about what we did where you have the blocked off, you know, 065, 051, et cetera, our approach, because these are in some cases honorifics and other characters that we have actually agreed that should not be there.
Can you speak for a minute about what we have done when it comes to -- from the code table, we have taken an approach about honorifics, about other characters and mixing of characters. >>MANAL ISMAIL: Actually, we started with deciding on what codepoints are needed by one of those languages and are not currently PVALID. And we identified, actually, two for, I think, the Sindhi language. And we communicated this with the IETF people. The other thing was things that we were 100% definite are not going to be used in domain names. And those were recommended to be disallowed. And we kept this set really to a minimum, because this is something that is irreversible, basically. Then we tried to come up with the script table, which basically will contain all the Arabic script characters across all languages. We tried to remove from this script table anything that we do not feel appropriate right now for registrations. The approach we took was trying to solve problems as much as possible at lower levels, because this is more enforceable. When it's left to higher levels, then there is no means to ensure they are going to be followed. One such example is the diacritics in the Arabic language. We decided that, currently, we cannot support diacritics in Arabic domain names. But we could not take such an aggressive decision of making the diacritics disallowed. We kept them PVALID, but we are not including them in the script table right now, until all security issues and further investigation -- until we come up with a more comfortable solution. Thank you. >>RAM MOHAN: We have other ongoing work which is, again, up on the screen. Now, one of the things that I was wondering, Alexa, if you would take a moment and speak a little bit about some of the outreach efforts, you know, bringing in ISOC from different communities, as well as funding and sustainability types of issues. If you could just take a minute to speak to that. >>ALEXA RAAD: Sure.
One of the reasons, I think, this group is very successful in terms of the work that we've done so far has been the representation in the group from various communities, from gTLDs and ccTLDs, because the problem actually goes across both. And also having participation from linguists and DNS experts. To continue the work of the group, it is imperative that we continue at the same pace. So everything, as Ram just put up on the page, is on our main page, which is ASIWG.org. All of our documentation, all of our conversations are archived, recorded. Presentations are already up. So if anyone's interested, they can go and look and pull down the resources. We have a mailing list. Again, all of the conversations on the mailing list are archived. But going back to how can we sustain this momentum, there's been some conversations already about including some of the folks that are not necessarily represented, some language communities. We have done our best to really try to recruit some of those through -- folks within the group themselves, some of the folks have already invited folks that they knew. However, this being at ICANN and having the opportunity to present this gives us a greater opportunity to get participation. And as Ram said, real meaningful contributions. And we're also hoping for sponsorship. And we thank our colleague at .AE, who is, unfortunately, not here, who provided sponsorship of the last meeting, along with UN-ESCWA and PIR and Afilias. We need more organizations to step up and perhaps sponsor folks who would otherwise not be able to be at the table. >>RAM MOHAN: Thank you. Questions. >>YOAV KEREN: Hi, Yoav Keren from Domain Israel. One slide, if you can get back, the one talking about variants -- variants should be registered, reserved to the same registrant. When you're talking about variants, are those variants across languages? Or is that the variants of the letters themselves? When you're saying that?
So if you have, like, the -- the same letter that can come in different forms, so would you reserve that if it's in the end of the -- end of the word? So what are you -- what is the meaning of this? >>RAM MOHAN: So far as -- >>TINA DAM: Jason, microphone. >>RAM MOHAN: Let's give it to Manal. >>MANAL ISMAIL: Actually, each of those languages already came up with their own language table. What we are currently working out with the script table is, apart from having the whole script table, also identifying variants across languages. >>YOAV KEREN: Not in the same script? >>MANAL ISMAIL: No. Be- -- you mean not in the same language? >>YOAV KEREN: Like you have the -- a letter that can come in some form at the beginning and some form at the end. Are you suggesting to reserve the variants of that letter when you register, reserve the variants of that letter in all forms? Or are you just saying we need to reserve it when we have different scripts, different languages? >>MANAL ISMAIL: It's across languages. I don't really get your first point. Across languages or what else? >>YOAV KEREN: Or you have, like -- as you said -- >>RAM MOHAN: I would think within a language, we all have the same Unicode codepoint anyway. So there is really not a variant there. It looks different. >>MANAL ISMAIL: You mean for the loop? >>YOAV KEREN: Yeah, like if you write it one way at the beginning and a different way at the end. Is it the same in Unicode? I'm not sure. It doesn't have the same codepoint. They would know. >>MANAL ISMAIL: It has the same -- it has the same codepoint in its different forms. >>YOAV KEREN: Okay. >>MANAL ISMAIL: But, again, why should it look different in the same string? I mean, you're going to have one form of each letter within one string. So we don't have, really, variants for the different forms. >>RAM MOHAN: Ayman, did you want to say something? Siavash. >>SIAVASH SHAHSHAHANI: It's across languages.
Like, you have one letter representing "K," for example, which has two cases. They have different Unicode codepoints, because in one case, they look entirely different. They look different in Arabic and in Persian, say, or Persian/Urdu has one version, Arabic has the other version. In the small case, in the other case, they look exactly the same. So there's a possibility for phishing. For that reason, you have this -- this is what we do. And as to what we decide to do about this, that depends -- that's a registry decision, really. It's not at the level we're considering now. >>RAM MOHAN: Thank you. >>OLOF NORDLING: Very simple question. I saw you noted a recommendation that restricts the -- well, the three sets of Arabic numerals. And you showed two of them. Well, the third one, is that what we usually have in ASCII? Is that what you -- >>RAM MOHAN: Yeah, I think even though we call it ASCII, it's actually Arabic. >>OLOF NORDLING: Might have hidden one somewhere. >>RAM MOHAN: Yes. Back to you, Tina. >>TINA DAM: All right. Well, I just wanted to see if we can push people to ask more questions on these topics. So if you think of anything, just -- yeah. >>BOB HUTCHINSON: Is there any effort at ICANN at this point to begin to collect functions that will be able to tell registries whether this string is Urdu or Farsi or is it, you know -- which language it is? In other words, you should be able to write functions that give you a pretty good indication from the characters that are in the string, and that would be very helpful, I think, for standardizing or making uniform the ability to support languages across registries. >>TINA DAM: Yes. If I understand your question right, we won't have, like, new functions like that, because, first of all, the DNS does not understand the differences between languages. For the DNS, it's just a character. So -- >>BOB HUTCHINSON: I didn't say this has anything to do with the DNS and the DNS mechanism. >>TINA DAM: Right.
>>BOB HUTCHINSON: It has to do with the registration and the policies of the registrars, okay. >>TINA DAM: So, yeah, I was warming up to that. >>BOB HUTCHINSON: Okay. >>TINA DAM: Just to start at one level and then we're moving up. The variant tables we've been talking about can be based on languages and on scripts. So what Ram and everybody here were just presenting is one example of that. So there's going to be variant tables in the Arabic script, and, in some cases, it will be language-specific; right? Because not all of the languages are going to use all of the characters in the Arabic script. So that is being developed locally. That doesn't mean that ICANN doesn't care about it. It just means that we don't have the linguistic resources or the expertise or the mandate for that. So it is something that's going on locally, which is why it's really important to hear from, for example, the Arabic Script Working Group or the future Cyrillic Working Group or other working groups that are gathering in the local communities. Now, when it comes to identifying what script a character is representing, we are actually following the Unicode script identifications. So you will at all times be able to look up a character in Unicode, and it will identify what script it belongs to. That is not the same as saying what language it belongs to, because, you know, one character can be used by multiple languages. Things are split up like that, and it's -- it can't -- you know, I can't make it more simple, because that's the reality of how language communities use the different scripts. >>RAM MOHAN: Tina, I saw Ayman, Alexa, and Manal all wanting to respond to this question as well. >>AYMAN EL-SHERBINIY: Okay. Thank you for the question, because this is really the essence of the story, which is the relationship between the languages and scripts.
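[Editor's illustration, not part of the workshop materials: the Unicode script identification Tina describes is defined in Scripts.txt of the Unicode Character Database. Python's standard library does not expose the Script property directly, so this sketch approximates it from the first word of the character name; it is only a rough stand-in for the real property:]

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Approximate the Unicode Script property from the character name.
    The authoritative mapping is Scripts.txt in the Unicode database;
    this first-word heuristic is only a rough illustration."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:          # unnamed codepoint
        return "UNKNOWN"

print(rough_script("\u0628"))   # ARABIC
print(rough_script("\u0410"))   # CYRILLIC
print(rough_script("a"))        # LATIN
```

[As Tina says, this identifies a script, never a language: U+0628 is simply "Arabic script" whether the label is in Arabic, Persian, Urdu, or Pashto.]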
And, in fact, we would like also again to stress the fact for most of the audience here that we are not speaking here just about the Arabic language, Pashto, Persian, per se. But we are trying to help set an example for -- or an exercise for these kinds of issues, maybe for other scripts as well. The idea here is between the multilingual and monolingual registries. So, in fact, we are solving now the problem on, like, the language level and the script level. When you take, for example, the case of IDN ccTLD in a country that speaks one language, that shouldn't be a problem. They have their table, they have created their variants. They do whatever they want. The same for any gTLD that's going to be monolingual, though we don't think that this is going to be the case. The problem only lies in the ccTLDs and gTLDs that are multilingual. And this has to, at the end of the day, be dealt with at the registry level. And this is the layered model that is on the screen now, that we tried to come up with -- trying to, like, identify some of the specific issues and say, "This is to be handled either on the protocol level or the script level or the registry level." So, for example, take a case where a country speaks Arabic and Urdu. Then they have their own implementation on the registry level to identify certain cases where language/language might collide. Another country ccTLD, they have Arabic, Farsi, and -- so on and so forth. So these are solved on the registry level. And I think this is a dichotomy of the layered approach here and the reality of the fact that it's going to be registry specific at the end of the day. Although in the ideal situation, we might ask the protocol guys to find and identify -- to identify the language from a string or whatever. But for the moment, I think the layered approach might solve it. The last point I want to make here is the importance of standardization, the effort, even if it's a work in progress.
And this is something that I think all of us would really benefit from, like trying to put what we are doing in a sort of a guideline for request for comments or whatever for registries, and see if this is going to develop into a best practice document, and then work on it. At the same time, we are working on introducing new gTLDs and IDN ccTLDs. So the importance here of standardization. And this is actually the work we started to do this week: we are going to put our effort into an Internet draft and propose it for comments before the end of the year, hopefully. >>ALEXA RAAD: You talked about the guidelines that we're working on. The reason that -- Ram started off the presentation with those two words that look extremely similar. And if any of you guys were thinking phishing when you saw that, you are on the right track. A lot of the discussion that we had at our working group, particularly about looking at confusing characters, characters that can be combined, and the use of Harakat or honorifics, was to limit phishing. That cannot be done on an individual ccTLD or gTLD perspective if there are no broad-based guidelines that are followed across the board. And I think that the layered model that we have, which basically solves the fundamental problems at the protocol and the script level, allows us to solve the majority of that and then leave the decision to the registries in terms of just, you know, how do you price the variants? Do you just allow the variants to resolve or not resolve? So that would be at the registry level. But I'm glad you brought it up, because that's -- the adoption of those guidelines is a major move forward for kind of making sure that the security and stability is maintained. >>RAM MOHAN: Manal. >>MANAL ISMAIL: Yes, I think a good part of my answer was already covered. But let me also say, like Ayman said, this has to do with multilingual registries.
But again, our recommendation is reserving all the variants to the same registrant and activating them upon request. This is to reduce phishing or things like that, because if we don't block the other variants, someone else -- I mean, for trademarks, for -- someone else could register a variant for one of the labels. And this will cause a problem. So our recommendation was that automatically, all variants are blocked, and activated as the registrant requests. >>TINA DAM: So you can see things are done differently at different registries, but the most important thing is that variants are blocked and taken care of in some way so that we avoid these phishing problems. Are there any other questions on registration rules, resolution rules, or on the Arabic example? If not, then thank you, Ram, Manal, Ayman, Alexa. You guys are doing a lot of great work, so I really appreciate it. Yeah, Manal. >>MANAL ISMAIL: Actually, I would like to answer a question that was asked early in the session. I'm not sure if we have the gentleman who asked with us in the room, but he was asking about the Arabic work group within the League of Arab States and how this fits within what we are doing right now. And, actually, we have members from the Arab -- Arabic domain names work group within the Arabic script group, and we are continuing on the same results we have reached within the Arabic group. We are basing things on the same language table we all agreed on. We tried to communicate this language table. We actually submitted it to the IANA via one of the Arab registries, the dot SA, Saudi Arabia. So we're building on what was already achieved within the Arab group. So I hope you will read it somewhere. Thank you. >>AYMAN EL-SHERBINIY: Can I add to that issue? And what Manal said is already published in an Internet draft. We have worked on standardization on the Arabic or the language level, and now we have to continue the standardization rationale also on the script level.
So what she said is already published as an Internet draft, and has been for some time, and we welcome all your comments on the Internet draft. Thank you. >>RAM MOHAN: Tina, thank you for inviting us to ICANN. And for those of you who are interested, this is an open working group. Please join up; the presentation will be up on the ICANN site. But please do join the working group and bring your knowledge, your expertise, and your questions. We would not be successful without your input. Thank you. >>TINA DAM: Okay. So -- And, yeah, you are welcome to stay seated or whatever you prefer. But we are almost at the end of the workshop. We only have the concluding remarks left, and one is on timing of the protocol revision. And the other one is just summaries. So Cary is going to talk a little bit about timing and I don't think you have any slides; right? But what you are going to learn is that the IETF does not work in the same way as the Supporting Organizations at ICANN, so it's probably relevant to hear a little bit about how that's expected to proceed. >>CARY KARP: The work with the protocol revision was initiated in an informal context that generated a reasonable amount of documentation. Everything that would have been necessary to make the modifications as initially envisioned was prepared prior to this being put forward in the IETF process. And the upshot is now that there is a chartered working group, IDNABIS, charged with carrying that forward to conclusion. And it will be meeting for the first time physically at the IETF meeting a month from now in Dublin. It is not going to be possible prior to that meeting to make any real assessment of how much time remains until things have concluded. But there are three basic scenarios, and they can be named here.
One is that the working group, however much discussion may be generated about it, basically accepts the preliminary documentation with some minor modifications, and this is all put to bed so that we, indeed, do have an IDNA 2008 replacing IDNA 2003. The next scenario is that it is decided that some significant additional consideration is necessary before anything conclusive can be said, at which point it becomes anybody's guess what will happen. It will likely not be an IDNA 2008. It will be an IDNA 20-something. And the third scenario is that the basic approach, this ASCII-compatible encoding approach, is simply not the appropriate solution, and that we're going to need to do something completely different, in which case IDNA 2003 will be IDNA, that will be it forever and on. The current set of applications will be adequate for their purpose until something new is introduced which will exist in parallel. There will never be any question of backwards compatibility between the next version of IDNA and the previous one because there will be only one version of it and the world can wait with bated breath and keen interest to see what solution is introduced. So it will either be a long-term discussion, which means that IDNA remains where it is right now, or it will be the intended near-term revision -- removing the Unicode dependency being the absolutely crucial issue. The bidi issues that Harald spoke about might actually be segregated out of this and simply dealt with separately. Or we can tell you better after we have had the meeting in Dublin, in case everybody is comfortable with the basic Punycode encoding mechanism, but wishes to address additional detail that the informal documentation didn't. It would probably be very useful if Patrik were able to say the same thing from his perspective. >>PATRIK FÄLTSTRÖM: No, I don't have much more to add. At the moment, in the working group, things look like it actually can be done pretty soon.
On the other hand, in the IETF process, just like in many other processes which are similar, when people are asked, "Please have a look at the documents and comment now or keep your mouth shut forever," that is when they are coming with input. So although it's very quiet now and people seem to agree, you never know what can happen in the last minute. >>TINA DAM: Okay. Thanks. Cary, you did mention the next meeting in Dublin is in July. And the IETF is meeting three times a year, I think, approximately; right? So there will be one more meeting before the end of this calendar year. >>ROBERT HUTCHINSON: Are these documents in IETF RFC draft form at this point or.... -- yeah. >>TINA DAM: Yeah. >>ROBERT HUTCHINSON: And do we have the RFC numbers? >>CARY KARP: The resource that Tina initially pointed to. There's a dynamic list that Patrik maintains, which simply gives you the latest version -- it lists all of the relevant documents and their latest version. >>ROBERT HUTCHINSON: Okay. >>CARY KARP: Can we put that slide back up on the screen? Here it comes. >>ROBERT HUTCHINSON: I guess the other question I have is for Tina. This could go one way or the other with the 2003 standard being left alone for a long period of time. Do you regard that as any problem at all for the IDN progress at ICANN, if we just stay on the 2003 standard, or do you really think that we need to be moving to this? Or how do they dovetail? If we do move to this, is -- what happens in both directions, from your standpoint? >>TINA DAM: So if it's being delayed and we are staying on the 2003 standard, and the 2008 or whatever you want to call the proposed revision is not being finalized within this calendar year, to me that's problematic. And it's problematic because that means that we do not have a new and, what I consider, a much more stable and much more adequate protocol to base TLD applications upon.
And that's one of the topics that's been discussed this week in a lot of different forums: what is ICANN going to do if the protocol revision isn't done? Are you going to allow IDN TLDs to go ahead? And my answer has been that we prefer to have the protocol revision done. However, we also understand that if it's being delayed, we can't let it delay the introduction of IDN TLDs, in which case we would probably come up -- well, we would ask these guys if they would help us come up with some additional rules and restrictions around what we are going to allow at the top level until we have the new protocol in place. So it's not going to be either 2003 or 2008. If it's being delayed, it's going to be 2003 with something on top of it. Or it's going to be draft 2008 with something on top of it. >>PATRIK FÄLTSTRÖM: Can I please have the video turned on? Thank you. So here is this Web page that Tina and I displayed, which is dynamically generated. So you always see the latest versions of the documents here. Let me see if I can do something intelligent here. Let me see. Look at this. So this is where you can see the various documents, draft IETF IDNA bidi, the protocol, the rationale, and look and see the latest versions, also watch the differences between the different versions. So here is one example. Let's see. So here is one example where we can, on this page, directly see what kind of changes have been made on these documents. And these are Internet drafts, and RFC numbers are not assigned until the drafts are actually finalized. >>TINA DAM: Great. I think that takes us to summary, and I am going to make that really short because it's almost noon and I guess everybody wants to go to lunch. But I also want to see if there are more questions, so you are welcome at the microphone, also here at the end of the session. We started protocol revision because there are a number of issues with the current version.
And one of the main things was that we wanted to have a forward-looking solution so that we didn't have to continue having revisions done. Right? So Unicode version independence is the main piece of this revision. And it's being done in the way that Patrik explained with the properties and the characters. So no matter what character is being added into Unicode, it will automatically work in the protocol. Of course that is to just enable as many characters from as many scripts and languages as possible into Internationalized Domain Names. But we also fixed other problems, such as the right-to-left that Harald was talking about, so that if a string or a domain name ends with a character that doesn't have a direction, the protocol won't fail anymore. It will actually accept that for characters that are needed in right-to-left scripts but that in Unicode do not have the right-to-left directional property. So that is being fixed. And then also Harald had some information about new rules around how you use digits, because in right-to-left, that is creating some problems. On the implementation side, we had the registration and resolution steps split up into two separate descriptions to ease the way it is understood and implemented by registries and application developers. And then we had one example of how you then apply things on top of the protocol, because the protocol is just going to give you the general description of whether a character can be used or not. Then comes all of the guidelines and the registration policies along with the variant tables on top of it that most -- in most cases is a registry decision on how to implement things for their users. Of course there are some global guidelines, and then there's the local guidelines that the Arabic script working group and other working groups are applying, and they are applying them to make it safe for the users. So it's going to make some restrictions, but it's for user safety and so forth.
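[Editor's illustration, not part of the workshop materials: the registration and resolution steps Tina summarizes rest on the ASCII-compatible (ACE, "xn--") form that both IDNA versions share on the wire. Python's built-in idna codec implements the IDNA 2003 rules being revised here, so it can show the round trip, using a Latin-script label the 2003 codec accepts:]

```python
# Applications convert a Unicode label to its ASCII-compatible (ACE) form
# before it ever reaches the DNS; resolution reverses the mapping.
label = "bücher"
ace = label.encode("idna")        # ToASCII step (IDNA 2003 rules)
print(ace)                        # b'xn--bcher-kva'
print(ace.decode("idna"))         # bücher  (ToUnicode step)
```

[Because the proposed revision changes which Unicode labels are permitted, not the xn-- encoding itself, names valid under both rule sets keep the same ACE form.]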
And, yeah, I think that's pretty much it. Timing-wise, we hope the protocol revision will be done, but as you heard, that is not a guarantee. So let's see if there are any other questions. And it looks like we exhausted the use of the microphone for today, so, Cary and Patrik, do you have any final remarks or are you done as well? Yeah, so thank you, and thank you for joining us and sitting in on this session. We may want to try to do additional educational sessions on the protocol revision, but this was the first major one that we had, so thank you for taking part in this. [ Applause ]