What and How of IDNs - The Complexity of Rule Generation

Attached please find a very useful infographic with some quantitative measures about IDNs. People choose IDNs to reflect a functionality, which could simply be branding. In 2014, India had just 8% of global Internet users and 22 official languages. However, a total of 121 languages and 270 mother tongues are each in use by 10,000 or more people in India. Studying the complexity of rule generation is quite challenging in this space.

My approach to IDNs begins with Natural Language Processing.

"Natural Language Processing" is a field at the intersection of computer science, artificial intelligence, and linguistics. The goal is for computers to process or “understand” natural language in order to perform tasks like language translation and question answering. With the rise of voice interfaces and chatbots, IDNs have more dimensions to work on than being merely passive strings that serve as "labels, like tags, for URLs".

Human language is a system specifically constructed to convey the speaker's / writer's meaning. It is NOT just an environmental signal but a deliberate communication. Besides, it uses an encoding that little kids can learn quickly, and it changes over time.
Human language is mostly a discrete / symbolic / categorical signalling system, presumably because this gives greater signalling reliability.
The categorical symbols of a language can be encoded as a signal for communication in several ways: sound, gesture, writing, images, etc. Human language can use any of these.
Human languages are ambiguous (unlike programming and other formal languages); thus there is a high level of complexity in representing, learning, and using the linguistic / situational / contextual / word / visual knowledge that human language draws on.

Root Zone Label Generation Rules (RZ-LGR) provide a conservative mechanism to determine valid IDN TLDs and their variant labels, for stable and secure operation of the DNS Root Zone. A community-based panel, known as a Generation Panel, provides the Label Generation Rules for each script. Each Generation Panel starts with a broad set of code points for the relevant script(s), known as the Maximal Starting Repertoire, and proposes the relevant Label Generation Rules.

There is an ICANN-specified procedure that provides a mechanism for creating and maintaining the rules with respect to IDN labels for the root. This mechanism can be used to determine which Unicode code points are permitted for use in U-labels in the root zone, which variants (if any) can be allocated in the root zone, and which variants (if any) are automatically blocked.
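To make the three questions above concrete (which code points are permitted, which variants are allocatable, which are blocked), here is a minimal Python sketch. The repertoire, the variant table, and the function names `is_permitted` and `variant_labels` are all hypothetical toy data invented for illustration; they do not reflect any published RZ-LGR. The label-level disposition is simplified: a variant label is treated as blocked if any substituted code point is a blocked variant, and as allocatable otherwise.

```python
from itertools import product

# Hypothetical toy repertoire and variant table (NOT a real LGR).
REPERTOIRE = {"a", "b", "c", "o", "ö", "ø"}
VARIANTS = {
    "o": {"ö": "allocatable", "ø": "blocked"},
    "ö": {"o": "allocatable"},
    "ø": {"o": "blocked"},
}


def is_permitted(label):
    """A label is permitted only if it is non-empty and every code point is in the repertoire."""
    return bool(label) and all(ch in REPERTOIRE for ch in label)


def variant_labels(label):
    """Map each variant label of `label` to 'allocatable' or 'blocked'."""
    if not is_permitted(label):
        raise ValueError("label contains code points outside the repertoire")
    choices = []
    for ch in label:
        # Each position keeps the original code point or takes one of its variants.
        options = [(ch, None)] + list(VARIANTS.get(ch, {}).items())
        choices.append(options)
    result = {}
    for combo in product(*choices):
        candidate = "".join(ch for ch, _ in combo)
        if candidate == label:
            continue  # the original label is not its own variant
        dispositions = {d for _, d in combo if d is not None}
        # Simplified assumption: any blocked substitution blocks the whole label.
        result[candidate] = "blocked" if "blocked" in dispositions else "allocatable"
    return result


if __name__ == "__main__":
    print(is_permitted("cob"))
    print(variant_labels("cob"))
```

Keeping the repertoire and variant table as plain data, separate from the two functions, mirrors the idea that the rules are maintained by a panel while the evaluation logic stays generic.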

The EPDP on IDNs, as I understand it, is not about exploring the technical space for alternatives, expansions or solutions.

Observation #1: Only a subset of the character set that makes up a natural language is used to form URL strings. These strings may also be words in the natural language. There are a number of possible ways to select, from the code points of a given script, the subset that is used in connection with a particular language.

Observation #2: Principles for inclusion, exclusion or deferral of code points may be specified for a given script of a specific language.
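As an illustration of the two observations, the following Python sketch partitions the code points of one Unicode block into included, excluded and deferred sets. The principle used here (keep letters and combining marks, exclude everything else such as digits and punctuation, and defer one explicitly listed code point) is an assumption made up for this example, as are the names `DEFERRED`, `classify` and `partition`; an actual Generation Panel applies its own, much richer principles.

```python
import unicodedata

# Devanagari block, U+0900..U+097F, used here only as a sample script.
DEVANAGARI = range(0x0900, 0x0980)

# Hypothetical: code points set aside for a later decision (here U+0950 OM).
DEFERRED = {0x0950}

# Hypothetical inclusion principle: letters (Lo) and combining marks (Mn, Mc).
INCLUDED_CATEGORIES = {"Lo", "Mn", "Mc"}


def classify(cp):
    """Return 'include', 'exclude' or 'defer' for one code point."""
    if cp in DEFERRED:
        return "defer"
    if unicodedata.category(chr(cp)) in INCLUDED_CATEGORIES:
        return "include"
    return "exclude"


def partition(code_points):
    """Group code points by the decision that the principles assign to them."""
    buckets = {"include": [], "exclude": [], "defer": []}
    for cp in code_points:
        buckets[classify(cp)].append(cp)
    return buckets


if __name__ == "__main__":
    for decision, cps in partition(DEVANAGARI).items():
        print(decision, len(cps))
```

Under these assumed principles, the letter KA (U+0915) is included, the digit ZERO (U+0966) and the DANDA punctuation mark (U+0964) are excluded, and OM (U+0950) is deferred.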

The inherent complexity of machine (automatic) translation / transliteration of IDNs is an obvious challenge. In my humble opinion, it should be available by default and instantaneously in all the languages chosen by the applicant for a domain name.

Even though it is an indulgence in the technology space, to my mind it is very important to convert the business-based "Root Zone - Label Generation Rules", written in any natural language, into generative formal rules for a computer system, where they can be controlled via a "generic" rules engine.
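One possible reading of a "generic" rules engine is sketched below in Python: the rules are plain data, and a small engine dispatches each rule to a checker by its type. The rule types, rule identifiers, the length limit and the tiny Devanagari repertoire are all hypothetical and chosen only for illustration; this is not how ICANN or any Generation Panel encodes its rules.

```python
# Each checker returns True when the label satisfies the rule.
def check_max_length(label, rule):
    return len(label) <= rule["value"]


def check_repertoire(label, rule):
    return all(ch in rule["chars"] for ch in label)


def check_forbid_start(label, rule):
    return not label or label[0] not in rule["chars"]


# The engine knows rule types only; the rule content lives in data.
CHECKERS = {
    "max_length": check_max_length,
    "repertoire": check_repertoire,
    "forbid_start": check_forbid_start,
}

# Hypothetical rule set: KA, MA, LA, vowel sign AA and VIRAMA are permitted,
# and a combining sign must not begin a label.
RULES = [
    {"id": "R1", "type": "max_length", "value": 20},
    {"id": "R2", "type": "repertoire", "chars": "\u0915\u092e\u0932\u093e\u094d"},
    {"id": "R3", "type": "forbid_start", "chars": "\u093e\u094d"},
]


def evaluate(label, rules):
    """Return the ids of all rules that the label violates, in rule order."""
    violations = []
    for rule in rules:
        checker = CHECKERS[rule["type"]]
        if not checker(label, rule):
            violations.append(rule["id"])
    return violations


if __name__ == "__main__":
    print(evaluate("\u0915\u092e\u0932", RULES))
    print(evaluate("\u094d\u0915", RULES))
```

Because the engine never mentions a script or language, supporting another script in this sketch means supplying a different `RULES` list, not changing code.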

Gopal T V
