Unicode equivalence

Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard character sets, which often included similar or identical characters.

Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E n LATIN SMALL LETTER N followed by U+0303 ◌̃ COMBINING TILDE is defined by Unicode to be canonically equivalent to the single code point U+00F1 ñ LATIN SMALL LETTER N WITH TILDE of the Spanish alphabet. Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
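The canonical equivalence described above can be observed with a short Python sketch. It assumes the standard library module unicodedata, whose results depend on the Unicode database version bundled with the interpreter; the variable names are illustrative only.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

# The raw code point sequences differ...
raw_equal = decomposed == precomposed

# ...but they agree once both are brought to the same normal form.
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))
nfd_equal = (unicodedata.normalize("NFD", decomposed)
             == unicodedata.normalize("NFD", precomposed))

# A precomposed Hangul syllable (U+D55C) decomposes into conjoining jamo.
hangul_jamo = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\ud55c")]

print(raw_equal, nfc_equal, nfd_equal, hangul_jamo)
```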

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
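As a minimal illustration of the difference between the two notions, the following Python sketch (assuming the standard unicodedata module) applies a canonical and a compatibility normalization to the "ff" ligature.

```python
import unicodedata

ligature = "\ufb00"  # U+FB00 LATIN SMALL LIGATURE FF

# Canonical normalization leaves the ligature alone, because it is not
# canonically equivalent to two separate letters.
nfc = unicodedata.normalize("NFC", ligature)

# Compatibility normalization replaces it with U+0066 U+0066.
nfkc = unicodedata.normalize("NFKC", ligature)

print(nfc == ligature, nfkc == "ff")
```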

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones).

Sources of equivalence

Character duplication

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a ring diacritic above" is encoded as U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE (a letter of the alphabet in Swedish and several other languages) or as U+212B ANGSTROM SIGN. Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (such as ⟨V⟩ for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.

Combining and precomposed characters

For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "IJ").

For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten ("◌゛", U+3099).

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; character decomposition is the opposite process.

In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.

Example

Amélie with its two canonically equivalent Unicode forms (NFC and NFD)
NFC character A m é l i e
NFC code point 0041 006d 00e9 006c 0069 0065
NFD code point 0041 006d 0065 0301 006c 0069 0065
NFD character A m e ◌́ l i e
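The code point rows of the example can be reproduced with a small Python sketch. The helper name code_points is hypothetical, and the output assumes the Unicode data shipped with the interpreter's unicodedata module.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """Return the code points of text as four-digit lowercase hex strings."""
    return [f"{ord(ch):04x}" for ch in text]

word = "Am\u00e9lie"  # written with the precomposed U+00E9

nfc = code_points(unicodedata.normalize("NFC", word))
nfd = code_points(unicodedata.normalize("NFD", word))

print(" ".join(nfc))
print(" ".join(nfd))
```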

Typographical non-interaction

Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
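A sketch of this situation in Python, assuming the standard unicodedata module: a dot below (U+0323) and a dot above (U+0307) attach to different sides of the base letter, and the letter "q" is used here only because it has no precomposed form with either mark.

```python
import unicodedata

dot_below_first = "q\u0323\u0307"
dot_above_first = "q\u0307\u0323"

# The two marks have different combining classes, which signals that they
# are not considered to interact.
classes = (unicodedata.combining("\u0323"), unicodedata.combining("\u0307"))

# Both orders normalize to the same sequence, so they are canonically equivalent.
canonical = unicodedata.normalize("NFD", dot_above_first)
equivalent = canonical == unicodedata.normalize("NFD", dot_below_first)

print(classes, equivalent)
```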

Typographic conventions

Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.

Encoding errors

UTF-8 and UTF-16 (and also some other Unicode encodings) do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.
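The lossy behaviour can be sketched in Python with the built-in bytes.decode method and its "replace" error handler, which is only one of several possible policies. The byte strings are made-up examples.

```python
# 0xFF and 0xFE never occur in well-formed UTF-8.
bad_a = b"caf\xff"
bad_b = b"caf\xfe"

# The "replace" handler substitutes U+FFFD REPLACEMENT CHARACTER for each
# invalid byte, so two different inputs collapse into the same string.
text_a = bad_a.decode("utf-8", errors="replace")
text_b = bad_b.decode("utf-8", errors="replace")

print(text_a == text_b, [f"U+{ord(c):04X}" for c in text_a])
```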

Normalization

Text-processing software that implements Unicode string search and comparison must take into account the presence of equivalent code point sequences. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable text that has a different, but canonically equivalent, code point representation.
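A minimal sketch of an equivalence-aware substring search in Python follows. The function name contains_equivalent is hypothetical, and the approach (normalizing both strings to the same form before comparing) is a simplification that ignores match boundaries inside combining sequences.

```python
import unicodedata

def contains_equivalent(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Substring test after bringing both strings to the same normal form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))

text = "Jalape\u00f1o"   # stored with the precomposed U+00F1
query = "pen\u0303o"     # typed as n followed by U+0303 COMBINING TILDE

naive_hit = query in text
normalized_hit = contains_equivalent(text, query)

print(naive_hit, normalized_hit)
```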

Algorithms

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple normal forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some typographic ligatures like U+FB03 (ﬃ), Roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into its constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. The same holds when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript ⁵ (U+2075) is transformed to 5 (U+0035) by compatibility mapping.
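The three cases in this paragraph can be checked with a short Python sketch, assuming the standard unicodedata module.

```python
import unicodedata

ffi_ligature = "\ufb03"   # U+FB03 LATIN SMALL LIGATURE FFI
roman_nine = "\u2168"     # U+2168 ROMAN NUMERAL NINE
superscript_5 = "\u2075"  # U+2075 SUPERSCRIPT FIVE

# Searching for "f" succeeds only after compatibility normalization.
f_in_nfc = "f" in unicodedata.normalize("NFC", ffi_ligature)
f_in_nfkc = "f" in unicodedata.normalize("NFKC", ffi_ligature)

# The Roman numeral and the superscript behave the same way.
i_in_nfc = "I" in unicodedata.normalize("NFC", roman_nine)
i_in_nfkc = "I" in unicodedata.normalize("NFKC", roman_nine)
superscript_nfkc = unicodedata.normalize("NFKC", superscript_5)

print(f_in_nfc, f_in_nfkc, i_in_nfc, i_in_nfkc, superscript_nfkc)
```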

Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation.[1] In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.[2]
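The formatting tags can be inspected from Python through unicodedata.decomposition, which returns the raw decomposition field of the character database as a string. This is a sketch of one way to read the tags; the exact field contents depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Compatibility decompositions carry a tag in angle brackets.
ligature_field = unicodedata.decomposition("\ufb03")     # ffi ligature
superscript_field = unicodedata.decomposition("\u2075")  # superscript five

# Canonical decompositions have no tag at all.
canonical_field = unicodedata.decomposition("\u00f1")    # n with tilde

print(ligature_field)
print(superscript_field)
print(canonical_field)
```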

Normal forms

The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed below.

NFD (Normalization Form Canonical Decomposition): Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC (Normalization Form Canonical Composition): Characters are decomposed and then recomposed by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition): Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC (Normalization Form Compatibility Composition): Characters are decomposed by compatibility, then recomposed by canonical equivalence.

All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
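Idempotence can be spot-checked in Python on a handful of sample strings. This sketch assumes the standard unicodedata module, and the sample list is arbitrary, so it illustrates the property rather than proving it.

```python
import unicodedata

forms = ["NFC", "NFD", "NFKC", "NFKD"]
samples = ["Am\u00e9lie", "n\u0303", "\ufb03", "\u212b", "\ud55c"]

# Normalizing a second time with the same form must change nothing.
idempotent = all(
    unicodedata.normalize(form, unicodedata.normalize(form, s))
    == unicodedata.normalize(form, s)
    for form in forms
    for s in samples
)

print(idempotent)
```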

The normal forms are not closed under string concatenation.[3] For example, when a string begins with a Hangul vowel or trailing conjoining jamo (a defective sequence), concatenating it to another normalized string can yield a result that is no longer in composed form.
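The following Python sketch, assuming the standard unicodedata module, shows two pairs of strings that are each in NFC on their own while their concatenation is not. The helper name is_nfc is hypothetical.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

# A base letter followed by a string that starts with a combining mark.
latin_left, latin_right = "e", "\u0301"
# A leading jamo followed by a string that starts with a vowel jamo.
hangul_left, hangul_right = "\u1100", "\u1161"

parts_ok = all(is_nfc(s) for s in (latin_left, latin_right, hangul_left, hangul_right))
latin_joined_ok = is_nfc(latin_left + latin_right)
hangul_joined_ok = is_nfc(hangul_left + hangul_right)

print(parts_ok, latin_joined_ok, hangul_joined_ok)
```

A practical consequence is that text assembled from normalized pieces generally needs to be normalized again after joining.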

However, the normalization algorithms are not injective (they map different original code point sequences to the same normalized sequence) and thus also not bijective (the original text cannot be recovered from the normalized one). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "◌̊") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
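The angstrom example can be reproduced in Python, assuming the standard unicodedata module.

```python
import unicodedata

angstrom_sign = "\u212b"  # U+212B ANGSTROM SIGN
a_with_ring = "\u00c5"    # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

# Two distinct inputs...
distinct_inputs = angstrom_sign != a_with_ring

# ...share one decomposed form and one composed form, so the original
# choice of code point cannot be recovered afterwards.
nfd_forms = {unicodedata.normalize("NFD", s) for s in (angstrom_sign, a_with_ring)}
nfc_forms = {unicodedata.normalize("NFC", s) for s in (angstrom_sign, a_with_ring)}

print(distinct_inputs, nfd_forms == {"A\u030a"}, nfc_forms == {"\u00c5"})
```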

A single character (other than a Hangul syllable block) that will get replaced by another under normalization can be identified in the Unicode tables for having a non-empty compatibility field but lacking a compatibility tag.

Canonical ordering

The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.
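The reordering step can be sketched in Python as follows. The function canonical_reorder is a hypothetical helper that covers only the ordering step on text that is already decomposed; it relies on unicodedata.combining for the class values and on the documented stability of Python's sorted.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Stable-sort each run of non-zero combining class characters by class."""
    out: list[str] = []
    run: list[str] = []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Acute (class 230) typed before cedilla (class 202): the marks are swapped.
reordered = canonical_reorder("x\u0301\u0327")
# Acute and circumflex share class 230: the stable sort keeps their order.
unchanged = canonical_reorder("e\u0301\u0302")

print(reordered == "x\u0327\u0301", unchanged == "e\u0301\u0302")
```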

For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.

Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
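The Vietnamese example and its reversed-accent counterpart can be verified in Python, assuming the standard unicodedata module.

```python
import unicodedata

precomposed = "\u1ebf"               # U+1EBF, e with circumflex and acute
reversed_accents = "e\u0301\u0302"   # acute first, then circumflex

# Canonical decomposition puts the circumflex before the acute.
nfd = unicodedata.normalize("NFD", precomposed)

# Both accents have the same combining class, so they are never reordered.
classes = (unicodedata.combining("\u0302"), unicodedata.combining("\u0301"))

# The reversed sequence composes only partially, to U+00E9 U+0302.
nfc_reversed = unicodedata.normalize("NFC", reversed_accents)

print(nfd == "e\u0302\u0301", classes, nfc_reversed == "\u00e9\u0302")
```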

Errors due to normalization differences

When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Netatalk and Samba file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the original, leading to data loss.[4][5] Resolving such an issue is non-trivial, as normalization is not losslessly invertible.

Notes

  1. ^ "UAX #44: Unicode Character Database". Unicode.org. Retrieved 20 November 2014.
  2. ^ "Unicode in XML and other Markup Languages". Unicode.org. Retrieved 20 November 2014.
  3. ^ Per What should be done about concatenation
  4. ^ "netatalk / Bugs / #349 volcharset:UTF8 doesn't work from Mac". SourceForge. Retrieved 20 November 2014.
  5. ^ "rsync, samba, UTF8, international characters, oh my!". 2009. Archived from the original on January 9, 2010.
