aboutsummaryrefslogtreecommitdiff
path: root/man/man7/utf.7
diff options
context:
space:
mode:
authorrsc <devnull@localhost>2003-09-30 17:47:42 +0000
committerrsc <devnull@localhost>2003-09-30 17:47:42 +0000
commit76193d7cb0457807b2f0b95f909ab5de19480cd7 (patch)
tree97e538c7e38181431e90289a0fe8b6b7ce1f8f3c /man/man7/utf.7
parented7c8e8d02c02bdbff1e88a6d8d1419f39af48ad (diff)
downloadplan9port-76193d7cb0457807b2f0b95f909ab5de19480cd7.tar.gz
plan9port-76193d7cb0457807b2f0b95f909ab5de19480cd7.tar.bz2
plan9port-76193d7cb0457807b2f0b95f909ab5de19480cd7.zip
Initial revision
Diffstat (limited to 'man/man7/utf.7')
-rw-r--r--man/man7/utf.791
1 files changed, 91 insertions, 0 deletions
diff --git a/man/man7/utf.7 b/man/man7/utf.7
new file mode 100644
index 00000000..97b7b1e7
--- /dev/null
+++ b/man/man7/utf.7
@@ -0,0 +1,91 @@
+.TH UTF 7
+.SH NAME
+UTF, Unicode, ASCII, rune \- character set and format
+.SH DESCRIPTION
+The Plan 9 character set and representation are
+based on the Unicode Standard and on the ISO multibyte
+.SM UTF-8
+encoding (Universal Character
+Set Transformation Format, 8 bits wide).
+The Unicode Standard represents its characters in 16
+bits;
+.SM UTF-8
+represents such
+values in an 8-bit byte stream.
+Throughout this manual,
+.SM UTF-8
+is shortened to
+.SM UTF.
+.PP
+In Plan 9, a
+.I rune
+is a 16-bit quantity representing a Unicode character.
+Internally, programs may store characters as runes.
+However, any external manifestation of textual information,
+in files or at the interface between programs, uses a
+machine-independent, byte-stream encoding called
+.SM UTF.
+.PP
+.SM UTF
+is designed so the 7-bit
+.SM ASCII
+set (values hexadecimal 00 to 7F),
+appear only as themselves
+in the encoding.
+Runes with values above 7F appear as sequences of two or more
+bytes with values only from 80 to FF.
+.PP
+The
+.SM UTF
+encoding of the Unicode Standard is backward compatible with
+.SM ASCII\c
+:
+programs presented only with
+.SM ASCII
+work on Plan 9
+even if not written to deal with
+.SM UTF,
+as do
+programs that deal with uninterpreted byte streams.
+However, programs that perform semantic processing on
+.SM ASCII
+graphic
+characters must convert from
+.SM UTF
+to runes
+in order to work properly with non-\c
+.SM ASCII
+input.
+See
+.IR rune (2).
+.PP
+Letting numbers be binary,
+a rune x is converted to a multibyte
+.SM UTF
+sequence
+as follows:
+.PP
+01. x in [00000000.0bbbbbbb] → 0bbbbbbb
+.br
+10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+.br
+11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+.PP
+Conversion 01 provides a one-byte sequence that spans the
+.SM ASCII
+character set in a compatible way.
+Conversions 10 and 11 represent higher-valued characters
+as sequences of two or three bytes with the high bit set.
+Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
+When there are multiple ways to encode a value, for example rune 0,
+the shortest encoding is used.
+.PP
+In the inverse mapping,
+any sequence except those described above
+is incorrect and is converted to rune hexadecimal 0080.
+.SH "SEE ALSO"
+.IR ascii (1),
+.IR tcs (1),
+.IR rune (3),
+.IR "The Unicode Standard" .