Variable-width encoding

A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation, usually in a computer.[1][a] The most common variable-width encodings are multibyte encodings, which use varying numbers of bytes (octets) to encode different characters. (Some authors, notably in Microsoft documentation, use the term multibyte character set (MBCS), which is a misnomer, because representation size is an attribute of the encoding, not of the character set.)

Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in adventure games for early microcomputers. However, disks (which, unlike tapes, allowed random access, so that text could be loaded on demand), increases in computer memory, and general-purpose compression algorithms have rendered such tricks largely obsolete.

Multibyte encodings usually result from a need to increase the number of characters that can be encoded without breaking backward compatibility with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit: two bytes (16 bits) would allow 65,536 possible characters. Such a change, however, would break compatibility with existing systems and therefore might not be feasible at all.[b]

General structure

Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: singletons, which consist of a single unit; lead units, which come first in a multiunit sequence; and trail units, which come afterwards in a multiunit sequence. Input and display software needs to know about the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one character.

For example, the four-character string "I♥NY" is encoded in UTF-8 like this (shown as hexadecimal byte values): 49 E2 99 A5 4E 59. Of the six units in that sequence, 49, 4E, and 59 are singletons (for I, N, and Y), E2 is a lead unit, and 99 and A5 are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
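
The byte sequence above can be reproduced with a few lines of Python. This is only an illustrative sketch using the language's built-in UTF-8 codec; the variable names are arbitrary.

```python
# "I♥NY": U+2665 (BLACK HEART SUIT) is written as an escape so the
# source file itself stays pure ASCII.
text = "I\u2665NY"

encoded = text.encode("utf-8")
hex_units = " ".join(f"{unit:02X}" for unit in encoded)

print(len(text))     # number of characters
print(len(encoded))  # number of UTF-8 units (bytes)
print(hex_units)     # 49 E2 99 A5 4E 59
```

Four characters become six units because the heart alone occupies three of them.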

UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well designed, since their ranges may overlap. A text processing application that deals with such an encoding must then scan the text from the beginning of every definitive sequence in order to identify the various units and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of two consecutive two-unit sequences. There is also the danger that a single corrupted or lost unit may render the whole interpretation of a large run of multiunit sequences incorrect. In a variable-width encoding where the ranges of all three types of units are disjoint, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.
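
Both points can be sketched in Python. The first part classifies units by value range alone, using the modern UTF-8 ranges (00–7F, 80–BF, C2–F4). The second part reproduces the DE DF E0 E1 false positive for a hypothetical legacy encoding in which every character in the sample happens to be two units long. The function name and the sample data are illustrative assumptions, not part of any real library.

```python
def classify_utf8_unit(unit: int) -> str:
    """Classify one byte using only its value, as UTF-8 allows."""
    if unit <= 0x7F:
        return "singleton"
    if 0x80 <= unit <= 0xBF:
        return "trail"
    if 0xC2 <= unit <= 0xF4:
        return "lead"
    return "invalid"  # C0, C1 and F5-FF do not occur in well-formed UTF-8


utf8_kinds = [classify_utf8_unit(b) for b in "I\u2665NY".encode("utf-8")]
print(utf8_kinds)

# Hypothetical overlapping encoding: DE, DF, E0 and E1 can each be a lead
# or a trail unit. The text is two consecutive two-unit characters.
data = bytes([0xDE, 0xDF, 0xE0, 0xE1])
needle = bytes([0xDF, 0xE0])

# A byte-level search ignores character boundaries.
naive_offset = data.find(needle)

# Scanning from the start recovers the real character boundaries
# (every character in this sample is assumed to be two units long).
characters = [data[i:i + 2] for i in range(0, len(data), 2)]
aligned_match = needle in characters

print(naive_offset)   # 1: the match straddles two characters
print(aligned_match)  # False: no character equals DF E0
```

With disjoint ranges, as in UTF-8, a byte-level search for a well-formed sequence cannot start on a trail unit, so the boundary-aware scan is unnecessary.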

CJK multibyte encodings

The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21–7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters with switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process.
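
The stateful, 7-bit nature of these encodings can be observed with Python's built-in `iso2022_jp` codec. This is a small sketch; the sample string is an arbitrary choice, and the exact escape sequences emitted depend on the codec.

```python
ESC = b"\x1b"

# "A日本B": two ASCII letters around two kanji, written with escapes.
text = "A\u65e5\u672cB"
encoded = text.encode("iso2022_jp")

print(encoded)  # escape sequences bracket the two-byte characters

# Every unit fits in 7 bits, so lead and trail units share 21-7E with ASCII.
seven_bit = all(unit < 0x80 for unit in encoded)

# At least one escape to enter multibyte mode and one to return to ASCII.
escape_count = encoded.count(ESC)

# One 94x94 set, as described above.
capacity = 94 * 94

print(seven_bit, escape_count, capacity)
```

Because the meaning of a byte depends on the most recent escape sequence, a decoder cannot interpret a byte in the middle of the stream without knowing the state established earlier.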

On Unix platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80–FF (hexadecimal), while the singletons were in the range 00–7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with, provided that all delimiters were ASCII characters and strings were not truncated to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption.
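
The relationship between the two families can be checked with Python's built-in `euc_jp` and `iso2022_jp` codecs. This is a sketch under the assumption that both codecs draw these two kanji from the same 94×94 set; the sample strings are arbitrary.

```python
# "日本", written with escapes.
text = "\u65e5\u672c"

euc = text.encode("euc_jp")
iso = text.encode("iso2022_jp")

print(" ".join(f"{unit:02X}" for unit in euc))

# All EUC units for these characters lie in A1-FE.
in_high_range = all(0xA1 <= unit <= 0xFE for unit in euc)

# Clearing the high bit gives the 7-bit units found between the
# escape sequences of the ISO 2022 form.
euc_low = bytes(unit & 0x7F for unit in euc)
same_code_points = euc_low in iso

# Splitting on an ASCII delimiter is safe: no multibyte unit is below 80.
record = "\u65e5\u672c,\u6771\u4eac".encode("euc_jp")  # "日本,東京"
fields = [field.decode("euc_jp") for field in record.split(b",")]

print(in_high_range, same_code_points, fields)
```

Truncating `record` at an arbitrary byte offset, by contrast, may cut a character in half, which is the corruption risk mentioned above.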

On the PC (DOS and Microsoft Windows platforms), two encodings became established for Japanese and Traditional Chinese in which the ranges of singletons, lead units and trail units all overlapped: Shift-JIS and Big5 respectively. In Shift-JIS, lead units had the range 81–9F and E0–FC, trail units had the range 40–7E and 80–FC, and singletons had the range 21–7E and A1–DF. In Big5, lead units had the range A1–FE, trail units had the range 40–7E and A1–FE, and singletons had the range 21–7E (all values in hexadecimal). This overlap again made processing tricky, although at least most of the symbols had unique byte values (the backslash, notably, did not).
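
The backslash problem is easy to reproduce with Python's built-in `shift_jis` codec. The katakana letter ソ (U+30BD) is used here as a well-known example; the variable names are illustrative.

```python
# "ソ" (KATAKANA LETTER SO), written with an escape.
text = "\u30bd"
encoded = text.encode("shift_jis")

print(" ".join(f"{unit:02X}" for unit in encoded))  # 83 5C

# 5C lies in the trail range 40-7E, but it is also the ASCII backslash,
# so a byte-level search reports a backslash that is not in the text.
false_hit = b"\\" in encoded
real_hit = "\\" in text

print(false_hit, real_hit)  # True False
```

Software that treats 5C as a path separator or escape character without tracking lead units can therefore misparse such text.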

Unicode variable-width encodings

The Unicode standard has two variable-width encodings: UTF-8 and UTF-16 (it also has a fixed-width encoding, UTF-32). Originally, both the Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit.[citation needed] ISO 10646 provided a variable-width encoding called UTF-1, in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because of this bad design, similar to Shift JIS and Big5 in its overlap of values, the inventors of the Plan 9 operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00–7F, lead units have the range C0–FD (now actually C2–F4, to avoid overlong sequences and to maintain synchronism with the encoding capacity of UTF-16; see the UTF-8 article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4.[c]
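
The rule that the lead unit announces the number of trail units can be written directly from the ranges above. This is a minimal sketch for modern UTF-8 only (lead units C2–F4); the function name is an illustrative choice.

```python
def trail_units_after(lead: int) -> int:
    """Return how many trail units follow a modern UTF-8 lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"{lead:#04x} is not a UTF-8 lead unit")


# Check the rule against Python's own encoder for 2-, 3- and 4-unit characters.
samples = ["\u00e9", "\u2665", "\U0001F600"]
counts = [trail_units_after(ch.encode("utf-8")[0]) for ch in samples]
lengths = [len(ch.encode("utf-8")) for ch in samples]

print(counts)   # [1, 2, 3]
print(lengths)  # [2, 3, 4]
```

A decoder that lands in the middle of a stream can skip trail units (80–BF) until it finds a singleton or a lead unit, and then knows exactly how many units to read.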

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000–D7FF (55,296 code points) and E000–FFFF (8192 code points, 63,488 in total), lead units the range D800–DBFF (1024 code points) and trail units the range DC00–DFFF (1024 code points, 2048 in total). The lead and trail units, called high surrogates and low surrogates, respectively, in Unicode terminology, map 1024×1024 or 1,048,576 supplementary characters, making 1,112,064 (63,488 BMP code points + 1,048,576 code points represented by high and low surrogate pairs) encodable code points, or scalar values in Unicode parlance (surrogates are not encodable).
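
The counts above, and the mapping from a supplementary code point to a surrogate pair, can be verified with a little arithmetic. This sketch uses U+1F600 as an arbitrary sample; the function name is illustrative, and the result is compared against Python's built-in `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # lead unit: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # trail unit: bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Compare with the built-in encoder (big-endian, no byte order mark).
builtin = "\U0001F600".encode("utf-16-be")

# Counts quoted in the paragraph above.
singletons = 0xD800 + (0x10000 - 0xE000)     # 55,296 + 8,192
supplementary = 1024 * 1024
scalar_values = singletons + supplementary

print(singletons, supplementary, scalar_values)
```

Because lead units (D800–DBFF), trail units (DC00–DFFF) and singletons occupy disjoint ranges, UTF-16 shares the self-synchronising property described for UTF-8.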

Notes

  a. The concept long precedes the advent of the electronic computer, however, as seen with Morse code.
  b. As a real-life example of this, UTF-16, which represents the most common characters in exactly the manner just described (and uses pairs of 16-bit code units for less common characters), never gained traction as an encoding for text intended for interchange, owing to its incompatibility with the ubiquitous 7-/8-bit ASCII encoding. Its intended role was instead taken by UTF-8, which does preserve ASCII compatibility.
  c. In the original version of UTF-8, from its 1992 publication until its code space was restricted to that of UTF-16 in 2003, the range of lead units followed by three trail units was larger (F0–F7); additionally, the lead units F8–FB were followed by four trail units, and FC–FD by five. FE–FF were never valid lead or trail units in any version of UTF-8.

References

  1. Crispin, M. (1 April 2005). UTF-9 and UTF-18 Efficient Transformation Formats of Unicode. RFC 4042. doi:10.17487/rfc4042.
