utf(7) - Plan 9 from User Space

From 78e51a8c6678b6e3dff3d619aa786669f531f4bc Mon Sep 17 00:00:00 2001
From: rsc
Date: Fri, 14 Jan 2005 03:45:44 +0000
Subject: checkpoint

---
 man/man7/utf.html | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 man/man7/utf.html

diff --git a/man/man7/utf.html b/man/man7/utf.html
new file mode 100644
index 00000000..a1e767ec
--- /dev/null
+++ b/man/man7/utf.html
@@ -0,0 +1,96 @@

UTF(7)


NAME
UTF, Unicode, ASCII, rune – character set and format

DESCRIPTION
The Plan 9 character set and representation are based on the Unicode Standard and on the ISO multibyte UTF-8 encoding (Universal Character Set Transformation Format, 8 bits wide). The Unicode Standard represents its characters in 16 bits; UTF-8 represents such values in an 8-bit byte stream. Throughout this manual, UTF-8 is shortened to UTF.

In Plan 9, a rune is a 16-bit quantity representing a Unicode character. Internally, programs may store characters as runes. However, any external manifestation of textual information, in files or at the interface between programs, uses a machine-independent, byte-stream encoding called UTF.

UTF is designed so that the characters of the 7-bit ASCII set (values hexadecimal 00 to 7F) appear only as themselves in the encoding. Runes with values above 7F appear as sequences of two or more bytes with values only from 80 to FF.

The UTF encoding of the Unicode Standard is backward compatible with ASCII: programs presented only with ASCII work on Plan 9 even if not written to deal with UTF, as do programs that deal with uninterpreted byte streams. However, programs that perform semantic processing on ASCII graphic characters must convert from UTF to runes in order to work properly with non-ASCII input. See rune(3).
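The ASCII-compatibility property can be checked with a short sketch. It uses Python's built-in UTF-8 codec as a stand-in for the Plan 9 encoder, which is an assumption of this example; the function names `encodes_transparently` and `split_path` are hypothetical and do not come from any Plan 9 library.

```python
def encodes_transparently(text: str) -> bool:
    """Return True if every character of text obeys the two rules above."""
    for ch in text:
        enc = ch.encode("utf-8")
        if ord(ch) <= 0x7F:
            # ASCII characters must appear as themselves, in one byte.
            if enc != bytes([ord(ch)]):
                return False
        elif len(enc) < 2 or min(enc) < 0x80:
            # Non-ASCII characters must use two or more bytes, all in 80..FF.
            return False
    return True


def split_path(data: bytes) -> list:
    """Split on the ASCII slash without interpreting the bytes as UTF."""
    return data.split(b"/")


if __name__ == "__main__":
    sample = "caf\u00e9/na\u00efve/\u20ac"
    print(encodes_transparently(sample))
    # A byte-oriented split never cuts a multibyte sequence in half,
    # because the byte 2F can only ever stand for the slash itself.
    print([part.decode("utf-8") for part in split_path(sample.encode("utf-8"))])
```

Because no byte below 80 ever occurs inside a multibyte sequence, a program that only looks for ASCII delimiters can treat UTF text as an uninterpreted byte stream.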
Letting numbers be binary, a rune x is converted to a multibyte UTF sequence as follows:

01. x in [00000000.0bbbbbbb] → 0bbbbbbb
10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb

Conversion 01 provides a one-byte sequence that spans the ASCII character set in a compatible way. Conversions 10 and 11 represent higher-valued characters as sequences of two or three bytes with the high bit set. Plan 9 does not support the 4-, 5-, and 6-byte sequences proposed by X-Open. When there are multiple ways to encode a value, for example rune 0, the shortest encoding is used.
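The three conversions can be sketched as a small encoder. This is an illustration under stated assumptions, not the code in rune(3): the name `rune_to_utf` is hypothetical, and a rune is modelled as a Python integer restricted to 16 bits. Picking the first conversion whose range contains x yields the shortest encoding.

```python
def rune_to_utf(x: int) -> bytes:
    """Encode one 16-bit rune using the three conversions listed above."""
    if not 0 <= x <= 0xFFFF:
        raise ValueError("rune must fit in 16 bits")
    if x <= 0x7F:
        # Conversion 01: 0bbbbbbb
        return bytes([x])
    if x <= 0x7FF:
        # Conversion 10: 110bbbbb, 10bbbbbb
        return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])
    # Conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb
    return bytes([
        0xE0 | (x >> 12),
        0x80 | ((x >> 6) & 0x3F),
        0x80 | (x & 0x3F),
    ])


if __name__ == "__main__":
    # One rune from each conversion: 'A', e-acute, and the euro sign.
    for rune in (0x0041, 0x00E9, 0x20AC):
        print(f"{rune:04X} -> {rune_to_utf(rune).hex()}")
```

For the three sample runes this gives the byte sequences 41, C3 A9, and E2 82 AC respectively.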
In the inverse mapping, any sequence except those described above is incorrect and is converted to rune hexadecimal 0080.
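A minimal sketch of the inverse mapping follows. The names `utf_to_rune` and `RUNE_ERROR` are hypothetical, and consuming exactly one byte on an incorrect sequence is an assumption of this sketch; the text above only fixes the resulting rune value. Overlong forms are rejected because only the shortest encoding is a sequence "described above".

```python
RUNE_ERROR = 0x0080  # rune produced for any incorrect sequence


def is_continuation(b: int) -> bool:
    """True for bytes of the form 10bbbbbb."""
    return (b & 0xC0) == 0x80


def utf_to_rune(buf: bytes) -> tuple:
    """Decode the first rune of buf; return (rune, bytes consumed)."""
    if not buf:
        raise ValueError("empty input")
    c0 = buf[0]
    if c0 < 0x80:
        # Conversion 01: a single ASCII byte.
        return c0, 1
    if 0xC0 <= c0 <= 0xDF and len(buf) >= 2 and is_continuation(buf[1]):
        # Conversion 10: 110bbbbb, 10bbbbbb
        x = ((c0 & 0x1F) << 6) | (buf[1] & 0x3F)
        if x > 0x7F:  # otherwise a one-byte form existed: overlong
            return x, 2
    elif (0xE0 <= c0 <= 0xEF and len(buf) >= 3
          and is_continuation(buf[1]) and is_continuation(buf[2])):
        # Conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb
        x = ((c0 & 0x0F) << 12) | ((buf[1] & 0x3F) << 6) | (buf[2] & 0x3F)
        if x > 0x7FF:  # otherwise a shorter form existed: overlong
            return x, 3
    # Assumption: skip one byte so the caller can resynchronize.
    return RUNE_ERROR, 1


if __name__ == "__main__":
    for seq in (b"A", b"\xc3\xa9", b"\xe2\x82\xac", b"\xc0\x80", b"\xf0\x9f\x98\x80"):
        rune, used = utf_to_rune(seq)
        print(f"{seq.hex()} -> {rune:04X} ({used} byte(s) consumed)")
```

Note that the correct two-byte encoding C2 80 also decodes to rune 0080, so a caller that needs to tell a real 0080 from an error would have to compare the number of bytes consumed.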

SEE ALSO
ascii(1), tcs(1), rune(3), The Unicode Standard.


-- cgit v1.2.3