Punycode

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, the German München (English: Munich) is encoded as Mnchen-3ya.

While the Domain Name System (DNS) technically supports arbitrary sequences of octets in domain name labels, the DNS standards recommend the LDH subset of ASCII conventionally used for host names, and require that string comparisons between DNS domain names be case-insensitive. The Punycode syntax is a method of encoding strings containing Unicode characters, such as internationalized domain names (IDNs), into the LDH subset of ASCII favored by DNS. It is specified in IETF Request for Comments 3492.[1]

Description

As stated in RFC 3492, "Punycode is an instance of a more general algorithm called Bootstring, which allows strings composed from a small set of 'basic' code points to uniquely represent any string of code points drawn from a larger set." Punycode defines parameters for the general Bootstring algorithm to match the characteristics of Unicode text. This section demonstrates the procedure for Punycode encoding, using as an example the German string "bücher" (English: books), which is encoded as the label "bcher-kva".

To keep the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; these should instead be checked for and detected during decoding.

Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters and in addition characters from only one other script system, but will cope with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been normalized using nameprep and (for top-level domains) filtered against an officially registered language table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.

Separation of ASCII characters

First, all ASCII characters in the string are copied from input to output, skipping over any other characters. For example, "bücher" is copied to "bcher". If any characters were copied, i.e. if there was at least one ASCII character in the input, an ASCII hyphen is appended to the output (e.g., "bücher" → "bcher-", but "ü" → "").

Note that hyphens are themselves ASCII characters. Thus, they can be present in the input and, if so, they will be copied to the output. This causes no ambiguity: if the output contains hyphens, the one that got added is always the last one. It marks the end of the ASCII characters.
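The separation step can be sketched in a few lines of Python. This is a minimal illustration rather than a full encoder, and the function name split_basic is an assumption made for this example.

```python
def split_basic(label: str) -> str:
    """Copy the ASCII code points of `label`, then append the delimiter.

    The hyphen is appended only if at least one ASCII character was copied.
    """
    basic = "".join(ch for ch in label if ord(ch) < 128)
    return basic + "-" if basic else basic


print(split_basic("bücher"))       # bcher-
print(repr(split_basic("ü")))      # ''
print(split_basic("München-Ost"))  # Mnchen-Ost-
```

In the last case the hyphen from the input is copied like any other ASCII character, and the appended delimiter is the final hyphen.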

Encoding the non-ASCII characters

The non-ASCII characters are sorted by Unicode value, lowest first (if a character occurs more than once they are sorted by position). Each is then encoded as a single number. This single number defines both the location to insert the character at and which character to insert.

  • i is an index into the result to insert the code at, starting at 0 (for insertion at the start).
  • n is the number of possible insertion points (current length of the result plus one).
  • j is the Unicode code point to insert minus 128, the first non-ASCII code point.

The encoded number is n × j + i. By dividing by n and also getting the remainder, a decoder can determine j and i.

There are six possible places to insert a character in the string "bcher" (including before the first character and after the last one). There are 124 code points between the last ASCII code point (127 = 0x7F, the end of ASCII) and ü (code point 252 = 0xFC, see Unicode's Latin-1 Supplement). The ü is inserted at position 1, after the b. Thus the encoder will add the number (6 × 124) + 1 = 745, and the decoder can retrieve these by ⌊745 ÷ 6⌋ = 124 and 745 mod 6 = 1.
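The arithmetic for the first inserted character can be checked with a short Python sketch. The function names are illustrative assumptions, and j is counted from 128 (the first non-ASCII code point), so that ü gives 252 − 128 = 124.

```python
def insertion_number(code_point: int, position: int, current_length: int) -> int:
    """Combine the character and its insertion position into one number, n × j + i."""
    n = current_length + 1   # number of possible insertion points
    j = code_point - 128     # offset from the first non-ASCII code point
    return n * j + position


def split_insertion_number(number: int, current_length: int) -> tuple[int, int]:
    """Recover (code point, position) from the number by division with remainder."""
    n = current_length + 1
    j, i = divmod(number, n)
    return j + 128, i


number = insertion_number(ord("ü"), position=1, current_length=len("bcher"))
print(number)                                         # 745
print(split_insertion_number(number, len("bcher")))   # (252, 1)
```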

These numbers are strictly increasing. For the second and subsequent inserted characters, the difference between the number and the previous one is written.

The number is encoded using the letters a through z and the digits 0 through 9. It is not base-36 but a more complex scheme described below, which allows the numbers to be concatenated, with nothing separating them.

Variable-length number encoding

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most-significant digit, hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly the weights of the digits vary.

In this case a number system with 36 symbols is used, with the case-insensitive 'a' through 'z' equal to the decimal numbers 0 through 25, and '0' through '9' equal to the decimal numbers 26 through 35. Thus "kva" corresponds to the decimal number string "10 21 0".

To decode this string of symbols, a sequence of thresholds is needed; in this case it is (1, 1, 26, 26, ...).[2] The weight (or place value) of the least-significant digit is always 1: 'k' (=10) with a weight of 1 equals 10. The weight of each later digit depends on the preceding thresholds: for any n, the weight of the (n+1)-th digit is w × (36 − t), where w is the weight and t the threshold of the n-th digit. Here the second symbol has a place value of 1 × (36 − 1) = 35, so the first two symbols 'k' (=10) and 'v' (=21) sum to 10 × 1 + 21 × 35. Since the second symbol is not less than its threshold of 1, there is more to come. The third symbol, 'a' (=0), is less than its threshold of 26 and therefore ends the number; because its value is 0, its weight need not be calculated. Therefore, "kva" represents the decimal number (10 × 1) + (21 × 35) = 745.

Conversely, the number 745 is encoded as 10 + 21 × 35 + 0: the second digit has a weight of 35, and the most-significant digit 0 is needed as a terminator. With 10 → 'k', 21 → 'v', and 0 → 'a', "bücher" becomes "bcher-kva".
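The digit-by-digit procedure can be sketched in Python as follows. The sketch assumes the fixed threshold sequence (1, 1, 26, 26, ...) that applies to the first encoded number; the function names are illustrative, and the adaptation of thresholds between numbers is left out.

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # 'a'..'z' = 0..25, '0'..'9' = 26..35


def decode_number(symbols: str, thresholds=(1, 1, 26)) -> int:
    """Decode one little-endian variable-length integer."""
    value, weight = 0, 1
    for pos, ch in enumerate(symbols.lower()):
        digit = DIGITS.index(ch)
        t = thresholds[min(pos, len(thresholds) - 1)]  # last threshold repeats
        value += digit * weight
        if digit < t:            # a digit below its threshold ends the number
            return value
        weight *= 36 - t
    raise ValueError("number is not terminated")


def encode_number(value: int, thresholds=(1, 1, 26)) -> str:
    """Encode one integer as a little-endian variable-length digit string."""
    out, pos = [], 0
    while True:
        t = thresholds[min(pos, len(thresholds) - 1)]
        if value < t:            # the remainder fits below the threshold: terminator
            out.append(DIGITS[value])
            return "".join(out)
        out.append(DIGITS[t + (value - t) % (36 - t)])
        value = (value - t) // (36 - t)
        pos += 1


print(decode_number("kva"))   # 745
print(encode_number(745))     # kva
```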

The thresholds themselves are determined for each successive encoded character by an algorithm keeping them between 1 and 26 inclusive.[3] The case can then be used to provide information about the original case of the string.[4]
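How the thresholds follow from an adapting bias can be seen in a compact encoder sketch. It follows the encoding procedure and parameter values of RFC 3492 (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128), but the function and variable names are illustrative, and overflow checks and mixed-case annotation are omitted.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta: int, num_points: int, first_time: bool) -> int:
    """Compute the bias used for the next number from the one just written."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def threshold(k: int, bias: int) -> int:
    """Threshold for the digit at position k, kept between TMIN and TMAX."""
    return min(max(k - bias, TMIN), TMAX)


def punycode_encode(text: str) -> str:
    basic = [ch for ch in text if ord(ch) < INITIAL_N]
    output = basic + ["-"] if basic else []
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    handled = num_basic = len(basic)
    while handled < len(text):
        # Next code point to insert: the smallest one not yet handled.
        m = min(ord(ch) for ch in text if ord(ch) >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for ch in text:
            if ord(ch) < n:
                delta += 1
            elif ord(ch) == n:
                q, k = delta, BASE
                while True:  # write delta as a variable-length integer
                    t = threshold(k, bias)
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, handled + 1, handled == num_basic)
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)


print(punycode_encode("bücher"))   # bcher-kva
print(punycode_encode("München"))  # Mnchen-3ya
```

With the initial bias of 72, the thresholds for digit positions 36, 72, 108, ... come out as 1, 1, 26, ..., which is the sequence used for the first encoded number.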

Because the encoding algorithm sorts the non-ASCII characters by code point, the codes for inserting a second such character into "bücher" run in order: the first possibility is "büücher" with code "bcher-kvaa", the second is "bücüher" with code "bcher-kvab", and so on. After "bücherü" with code "bcher-kvae" come the codes representing insertion of ý, the Unicode character following ü, starting with "ýbücher" with code "bcher-kvaf" (different from "übücher", which is coded "bcher-jvab").
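These codes can be reproduced with the built-in "punycode" codec of Python, the same codec used for the table of examples. The helper name puny is an assumption made for this sketch.

```python
def puny(text: str) -> str:
    """Encode with Python's built-in punycode codec and return a str."""
    return text.encode("punycode").decode("ascii")


# Second insertions into "bücher", in the order the codes are assigned.
for text in ["bücher", "büücher", "bücüher", "bücherü", "ýbücher", "übücher"]:
    print(text, "->", puny(text))
```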

ACE prefix for internationalized domain names

To prevent hyphens in non-international domain names from triggering a Punycode decoding, the string xn-- is prepended to Punycode sequences in internationalized domain names. This prefix is called the ACE (ASCII Compatible Encoding) prefix.[5]

Thus the domain name "bücher.tld" would be represented in a URL as "xn--bcher-kva.tld".
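A short Python sketch shows both routes to the ACE form: prepending the prefix to the output of the "punycode" codec by hand, and using the standard library's "idna" codec, which processes a whole domain name label by label. The function name to_ace_label is an assumption for this example, and the manual route skips the nameprep normalization that real IDNA processing performs.

```python
ACE_PREFIX = "xn--"


def to_ace_label(label: str) -> str:
    """Return the ACE form of one label; pure-ASCII labels are left unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


print(to_ace_label("bücher"))                # xn--bcher-kva
print(to_ace_label("tld"))                   # tld
print("bücher.tld".encode("idna"))           # b'xn--bcher-kva.tld'
print(b"xn--bcher-kva.tld".decode("idna"))   # bücher.tld
```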

Examples

The following table shows examples of Punycode encodings for different types of input.[6]

Input | Punycode | Description
(empty) | (empty) | The empty string.
a | a- | Only ASCII characters, one, lowercase.
A | A- | Only ASCII characters, one, uppercase.
3 | 3- | Only ASCII characters, one, a digit.
- | -- | Only ASCII characters, one, a hyphen.
-- | --- | Only ASCII characters, two hyphens.
London | London- | Only ASCII characters, more than one, no hyphens.
Lloyd-Atkinson | Lloyd-Atkinson- | Only ASCII characters, one hyphen.
This has spaces | This has spaces- | Only ASCII characters, with spaces.
-> $1.00 <- | -> $1.00 <-- | Only ASCII characters, mixed symbols.
Б | d0a | No ASCII characters, one Cyrillic character.
ü | tda | No ASCII characters, one Latin-1 Supplement character.
α | mxa | No ASCII characters, one Greek character.
例 | fsq | No ASCII characters, one CJK character.
😉 | n28h | No ASCII characters, one emoji character.
αβγ | mxacd | No ASCII characters, more than one character.
München | Mnchen-3ya | Mixed string, with one character that is not an ASCII character.
Mnchen-3ya | Mnchen-3ya- | Double-encoded Punycode of "München".
München-Ost | Mnchen-Ost-9db | Mixed string, with one character that is not ASCII, and a hyphen.
Bahnhof München-Ost | Bahnhof Mnchen-Ost-u6b | Mixed string, with one space, one hyphen, and one character that is not ASCII.
abæcdöef | abcdef-qua4k | Mixed string, two non-ASCII characters.
правда | 80aafi6cg | Russian, without ASCII.
ยจฆฟคฏข | 22cdfh1b8fsa | Thai, without ASCII.
도메인 | hq1bm8jm9l | Korean, without ASCII.
ドメイン名例 | eckwd4c7cu47r2wf | Japanese, without ASCII.
MajiでKoiする5秒前 | MajiKoi5-783gue6qz075azm5e | Japanese with ASCII.
「bücher」 | bcher-kva8445foa | Mixed non-ASCII scripts (Latin-1 Supplement and CJK).
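A few rows of the table can be re-checked, and decoded again, with the same Python codec. The selection of rows and the helper names are arbitrary choices for this sketch.

```python
def puny_encode(text: str) -> str:
    return text.encode("punycode").decode("ascii")


def puny_decode(code: str) -> str:
    return code.encode("ascii").decode("punycode")


rows = {
    "München": "Mnchen-3ya",
    "Mnchen-3ya": "Mnchen-3ya-",   # encoding an already encoded string
    "правда": "80aafi6cg",
    "ü": "tda",
}

for text, expected in rows.items():
    code = puny_encode(text)
    # Each row should match the table and decode back to the input.
    print(text, "->", code, code == expected, puny_decode(code) == text)
```

Encoding an already encoded string only appends a hyphen, because the input is pure ASCII; this is the double-encoding case listed in the table.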

References

  1. ^ RFC 3492, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), A. Costello, The Internet Society (March 2003)
  2. ^ This is true for the first encoded character (or, in terms of RFC 3492, the first "delta"): see RFC 3492, Sec. 6.
  3. ^ RFC 3492, Secs. 3.4, 5.
  4. ^ RFC 3492, App. A.
  5. ^ Internet Assigned Numbers Authority (2003-02-14). "Completion of IANA Selection of IDNA Prefix". www.atm.tut.fi. Archived from the original on 2010-04-27. Retrieved 2017-09-22.
  6. ^ The Punycode in this table was created using the builtin codec "punycode" of the Python programming language version 3.8 (s.encode("punycode")).
