- <!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <!--
- Generated from manual.tex by tex2page, v 20050501
- (running on MzScheme 299.400, unix),
- (c) Dorai Sitaram,
- http://www.ccs.neu.edu/~dorai/tex2page/tex2page-doc.html
- -->
- <head>
- <title>
- The Incomplete Scheme 48 Reference Manual for release 1.6
- </title>
- <link rel="stylesheet" type="text/css" href="manual-Z-S.css" title=default>
- <meta name=robots content="noindex,follow">
- </head>
- <body>
- <div id=content>
- <div align=right class=navigation><i>[Go to <span><a href="manual.html">first</a>, <a href="manual-Z-H-7.html">previous</a></span><span>, <a href="manual-Z-H-9.html">next</a></span> page<span>; </span><span><a href="manual-Z-H-2.html#node_toc_start">contents</a></span><span><span>; </span><a href="manual-Z-H-13.html#node_index_start">index</a></span>]</i></div>
- <p></p>
- <a name="node_chap_6"></a>
- <h1 class=chapter>
- <div class=chapterheading><a href="manual-Z-H-2.html#node_toc_node_chap_6">Chapter 6</a></div><br>
- <a href="manual-Z-H-2.html#node_toc_node_chap_6">Unicode</a></h1>
- <p>Scheme 48 fully supports ISO 10646 (Unicode): Scheme characters
- represent Unicode scalar values, and Scheme strings are arrays of
- scalar values. More information on Unicode can be found at
- <a href="http://www.unicode.org/">the Unicode web
- site</a>.</p>
- <p>
- </p>
- <a name="node_sec_6.1"></a>
- <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.1">6.1 Characters and their codes</a></h2>
- <p>Scheme 48 internally represents characters as Unicode scalar values.
- The <tt>unicode</tt> structure contains procedures for converting
- between characters and scalar values:
- </p>
- <ul>
- <li><p><tt>(char->scalar-value<i> char</i>) -> <i>integer</i></tt><a name="node_idx_406"></a>
- </p>
- <li><p><tt>(scalar-value->char<i> integer</i>) -> <i>char</i></tt><a name="node_idx_408"></a>
- </p>
- <li><p><tt>(scalar-value?<i> integer</i>) -> <i>boolean</i></tt><a name="node_idx_410"></a>
- </p>
- </ul><p>
- <tt>Char->scalar-value</tt> returns the scalar value of a character, and
- <tt>scalar-value->char</tt> converts in the other direction.
- <tt>Scalar-value->char</tt> signals an error if passed an integer that is
- not a scalar value.</p>
- <p>
- Note that the Unicode scalar value range is
- </p>
- <div align=center><img src="unicode-Z-G-1.gif" border="0" alt="[0, #xD7FF] &cup; [#xE000, #x10FFFF]"></div><p>
- In particular, this excludes the surrogates, which UTF-16 uses to
- encode scalar values with two 16-bit words. Note that this
- representation differs from that of Java, which uses UTF-16 code units
- as the character representation -- Scheme 48 effectively uses UTF-32,
- and is thus in line with other Scheme implementations and the current
- Unicode proposal for R<sup>6</sup>RS, as set forth in SRFI 75.</p>
- <p>
- The R<sup>5</sup>RS procedures <tt>char->integer</tt> and <tt>integer->char</tt>
- are synonyms for <tt>char->scalar-value</tt> and
- <tt>scalar-value->char</tt>, respectively.</p>
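- <p>As a rough illustration of the procedures above, the following REPL sketch converts in both directions and probes the edges of the scalar value range. It assumes the <tt>unicode</tt> structure has been opened with the <tt>,open</tt> command; the results in the comments follow from the range given above.</p>

```scheme
; Sketch for the Scheme 48 REPL; first open the structure:
;   ,open unicode

(char->scalar-value #\A)     ; => 65
(scalar-value->char 955)     ; => the character U+03BB (Greek small letter lambda)
(scalar-value? #xD7FF)       ; => #t, last scalar value below the surrogates
(scalar-value? #xD800)       ; => #f, a surrogate
(scalar-value? #x10FFFF)     ; => #t, the largest scalar value
(scalar-value? #x110000)     ; => #f, beyond the Unicode range

; The R5RS names are synonyms:
(= (char->integer #\A) (char->scalar-value #\A))   ; => #t
```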
- <p>
- </p>
- <a name="node_sec_6.2"></a>
- <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.2">6.2 Character and string literals</a></h2>
- <p>The syntax specified here is in line with the current Unicode proposal
- for R<sup>6</sup>RS, as set forth in SRFI 75, except for case-sensitivity.
- (Scheme 48 is case-insensitive.)</p>
- <p>
- </p>
- <a name="node_sec_6.2.1"></a>
- <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.2.1">6.2.1 Character literals</a></h3>
- <p>The following character names are available in addition to what
- R<sup>5</sup>RS provides:
- </p>
- <ul>
- <li><p><code class=verbatim>#\nul</code> (ASCII 0)
- </p>
- <li><p><code class=verbatim>#\alarm</code> (ASCII 7)
- </p>
- <li><p><code class=verbatim>#\backspace</code> (ASCII 8)
- </p>
- <li><p><code class=verbatim>#\tab</code> (ASCII 9)
- </p>
- <li><p><code class=verbatim>#\vtab</code> (ASCII 11)
- </p>
- <li><p><code class=verbatim>#\page</code> (ASCII 12)
- </p>
- <li><p><code class=verbatim>#\return</code> (ASCII 13)
- </p>
- <li><p><code class=verbatim>#\esc</code> (ASCII 27)
- </p>
- <li><p><code class=verbatim>#\rubout</code> (ASCII 127)
- </p>
- <li><p><code class=verbatim>#\x</code>&lt;x&gt;&lt;x&gt;<tt>...</tt> hex, explicitly or implicitly
- delimited, where &lt;x&gt;&lt;x&gt;<tt>...</tt> denotes the scalar value
- of the character
- </p>
- </ul><p></p>
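- <p>A few of the names listed above, checked against their codes. This is only a sketch; it uses <tt>char->integer</tt> so that no additional structure needs to be opened.</p>

```scheme
(char->integer #\nul)      ; => 0
(char->integer #\esc)      ; => 27
(char->integer #\rubout)   ; => 127
(char->integer #\x3bb)     ; => 955, i.e. U+03BB
(char=? #\x41 #\A)         ; => #t
```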
- <p>
- </p>
- <a name="node_sec_6.2.2"></a>
- <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.2.2">6.2.2 String literals</a></h3>
- <p>The following escape characters in string literals are available in addition to what
- R<sup>5</sup>RS provides:</p>
- <p>
- </p>
- <ul>
- <li><p><code class=verbatim>\a</code>: alarm (ASCII 7)
- </p>
- <li><p><code class=verbatim>\b</code>: backspace (ASCII 8)
- </p>
- <li><p><code class=verbatim>\t</code>: tab (ASCII 9)
- </p>
- <li><p><code class=verbatim>\n</code>: linefeed (ASCII 10)
- </p>
- <li><p><code class=verbatim>\v</code>: vertical tab (ASCII 11)
- </p>
- <li><p><code class=verbatim>\f</code>: formfeed (ASCII 12)
- </p>
- <li><p><code class=verbatim>\r</code>: return (ASCII 13)
- </p>
- <li><p><code class=verbatim>\e</code>: escape (ASCII 27)
- </p>
- <li><p><code class=verbatim>\'</code>: quote (ASCII 39, same as unquoted)
- </p>
- <li><p><code class=verbatim>\</code>&lt;newline&gt;&lt;intraline whitespace&gt;: elided (allows a single-line string to
- span source lines)
- </p>
- <li><p><code class=verbatim>\x</code>&lt;x&gt;&lt;x&gt;<tt>...</tt><code class=verbatim>;</code> hex, where &lt;x&gt;&lt;x&gt;<tt>...</tt>
- denotes the scalar value of the character
- </p>
- </ul><p></p>
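- <p>The escapes above might be exercised as follows. This is a sketch; the name <tt>greeting</tt> is made up for the example.</p>

```scheme
(string-length "a\tb")                  ; => 3: a, tab, b
(string-length "\x3bb;")                ; => 1: the single character U+03BB
(char->integer (string-ref "\e" 0))     ; => 27

; A backslash at the end of a line elides the newline and the leading
; whitespace of the next line, so under the rule above this literal
; denotes a string without a line break:
(define greeting "Hello, \
                  world")
```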
- <p>
- </p>
- <a name="node_sec_6.2.3"></a>
- <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.2.3">6.2.3 Identifiers and symbol literals</a></h3>
- <p>Where R<sup>5</sup>RS allows a &lt;letter&gt;, Scheme 48 allows in addition any
- character whose scalar value is greater than 127 and whose Unicode
- general category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc,
- Po, Sc, Sm, Sk, So, or Co.</p>
- <p>
- Moreover, when a backslash appears in a symbol, it must start a
- <code class=verbatim>\x</code>&lt;x&gt;&lt;x&gt;<tt>...</tt><code class=verbatim>;</code> escape, which identifies an
- arbitrary character to include in the symbol. Note that a backslash
- itself can be specified as <code class=verbatim>\x5C;</code>.</p>
- <p>
- </p>
- <a name="node_sec_6.3"></a>
- <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.3">6.3 Character classification and case mappings</a></h2>
- <p>The R<sup>5</sup>RS character predicates -- <tt>char-whitespace?</tt>,
- <tt>char-lower-case?</tt>, <tt>char-upper-case?</tt>,
- <tt>char-numeric?</tt>, and <tt>char-alphabetic?</tt> -- all treat the full
- Unicode range.</p>
- <p>
- <tt>Char-upcase</tt> and <tt>char-downcase</tt> as well as
- <tt>char-ci=?</tt>, <tt>char-ci&lt;?</tt>, <tt>char-ci&lt;=?</tt>,
- <tt>char-ci>?</tt>, <tt>char-ci>=?</tt>, <tt>string-ci=?</tt>,
- <tt>string-ci&lt;?</tt>, <tt>string-ci>?</tt>, <tt>string-ci&lt;=?</tt>,
- <tt>string-ci>=?</tt> all use the standard simple locale-insensitive
- Unicode case folding.</p>
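- <p>For instance, the standard procedures can be applied to Greek letters. The results in the comments follow from the Unicode simple case mappings and are meant as a sketch, not as a transcript of a session.</p>

```scheme
(char-alphabetic? #\x3bb)               ; => #t, U+03BB is a lowercase letter
(char->integer (char-upcase #\x3bb))    ; => 923, i.e. U+039B
(char-ci=? #\x3a3 #\x3c3)               ; => #t, capital and small sigma
(string-ci=? "\x3a3;" "\x3c3;")         ; => #t
```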
- <p>
- In addition, Scheme 48 provides the <tt>unicode-char-maps</tt> structure
- for more complete access to the Unicode character classification with
- the following procedures and macros:
- </p>
- <ul>
- <li><p><tt>(general-category <i>general-category-name</i>) -> <i>general-category</i></tt> (syntax)
- </p>
- <li><p><tt>(general-category?<i> x</i>) -> <i>boolean</i></tt><a name="node_idx_412"></a>
- </p>
- <li><p><tt>(general-category-id<i> general-category</i>) -> <i>string</i></tt><a name="node_idx_414"></a>
- </p>
- <li><p><tt>(char-general-category<i> char</i>) -> <i>general-category</i></tt><a name="node_idx_416"></a>
- </p>
- </ul><p>
- The syntax <tt>general-category</tt> returns a Unicode general category
- object associated with <i>general-category-name</i>. (See
- Figure <a href="#node_fig_Temp_19">2</a> below.) <tt>General-category?</tt>
- is the predicate for general-category objects.
- <tt>General-category-id</tt> returns the Unicode category id as a string
- (also listed in Figure <a href="#node_fig_Temp_19">2</a>).
- <tt>Char-general-category</tt> returns the general category of a character.</p>
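- <p>A short sketch of these procedures at the REPL, assuming the <tt>unicode-char-maps</tt> structure has been opened. The variable <tt>cat</tt> is made up for the example; the category ids are those listed in Figure 2.</p>

```scheme
; ,open unicode-char-maps
(define cat (char-general-category #\A))

(general-category? cat)                                     ; => #t
(general-category-id cat)                                   ; => "Lu"
(general-category-id (general-category dash-punctuation))   ; => "Pd"
(general-category-id (char-general-category #\7))           ; => "Nd"
(general-category-id (char-general-category #\space))       ; => "Zs"
```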
- <p>
- </p>
- <p></p>
- <hr>
- <p></p>
- <a name="node_fig_Temp_19"></a>
- <div class=figure align=center><table width=100%><tr><td align=center>
- <table border=1><tr><td valign=top ><i>general-category-name</i> </td><td valign=top ><i>primary-category-name</i> </td><td valign=top >Unicode category id
- </td></tr>
- <tr><td valign=top ><tt>uppercase-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>"Lu"</code> </td></tr>
- <tr><td valign=top ><tt>lowercase-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>"Ll"</code> </td></tr>
- <tr><td valign=top ><tt>titlecase-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>"Lt"</code> </td></tr>
- <tr><td valign=top ><tt>modified-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>"Lm"</code> </td></tr>
- <tr><td valign=top ><tt>other-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>"Lo"</code> </td></tr>
- <tr><td valign=top ><p>
- <tt>non-spacing-mark</tt> </p>
- </td><td valign=top ><tt>mark</tt> </td><td valign=top ><code class=verbatim>"Mn"</code> </td></tr>
- <tr><td valign=top ><tt>combining-spacing-mark</tt> </td><td valign=top ><tt>mark</tt> </td><td valign=top ><code class=verbatim>"Mc"</code> </td></tr>
- <tr><td valign=top ><tt>enclosing-mark</tt> </td><td valign=top ><tt>mark</tt> </td><td valign=top ><code class=verbatim>"Me"</code> </td></tr>
- <tr><td valign=top ><p>
- <tt>decimal-digit-number</tt> </p>
- </td><td valign=top ><tt>number</tt> </td><td valign=top ><code class=verbatim>"Nd"</code> </td></tr>
- <tr><td valign=top ><tt>letter-number</tt> </td><td valign=top ><tt>number</tt> </td><td valign=top ><code class=verbatim>"Nl"</code> </td></tr>
- <tr><td valign=top ><tt>other-number</tt> </td><td valign=top ><tt>number</tt> </td><td valign=top ><code class=verbatim>"No"</code> </td></tr>
- <tr><td valign=top ><p>
- <tt>opening-punctuation</tt> </p>
- </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>"Ps"</code> </td></tr>
- <tr><td valign=top ><tt>closing-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>"Pe"</code> </td></tr>
- <tr><td valign=top ><tt>initial-quote-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>"Pi"</code> </td></tr>
- <tr><td valign=top ><tt>final-quote-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>"Pf"</code> </td></tr>
- <tr><td valign=top ><tt>dash-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>"Pd"</code> </td></tr>
- <tr><td valign=top ><tt>connector-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>"Pc"</code> </td></tr>
- <tr><td valign=top ><tt>other-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>"Po"</code> </td></tr>
- <tr><td valign=top ><p>
- <tt>currency-symbol</tt> </p>
- </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>"Sc"</code> </td></tr>
- <tr><td valign=top ><tt>mathematical-symbol</tt> </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>"Sm"</code> </td></tr>
- <tr><td valign=top ><tt>modifier-symbol</tt> </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>"Sk"</code> </td></tr>
- <tr><td valign=top ><tt>other-symbol</tt> </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>"So"</code> </td></tr>
- <tr><td valign=top ><p>
- <tt>space-separator</tt> </p>
- </td><td valign=top ><tt>separator</tt> </td><td valign=top ><code class=verbatim>"Zs"</code> </td></tr>
- <tr><td valign=top ><tt>paragraph-separator</tt> </td><td valign=top ><tt>separator</tt> </td><td valign=top ><code class=verbatim>"Zp"</code> </td></tr>
- <tr><td valign=top ><tt>line-separator</tt> </td><td valign=top ><tt>separator</tt> </td><td valign=top ><code class=verbatim>"Zl"</code> </td></tr>
- <tr><td valign=top ><p>
- <tt>control-character</tt> </p>
- </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>"Cc"</code> </td></tr>
- <tr><td valign=top ><tt>formatting-character</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>"Cf"</code> </td></tr>
- <tr><td valign=top ><tt>surrogate</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>"Cs"</code> </td></tr>
- <tr><td valign=top ><tt>private-use-character</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>"Co"</code> </td></tr>
- <tr><td valign=top ><tt>unassigned</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>"Cn"</code>
- </td></tr></table><p>
- </p>
- </td></tr>
- <tr><td align=center><b>Figure 2:</b> Unicode general categories and primary categories</td></tr>
- <tr><td>
- </td></tr></table></div><p></p>
- <hr>
- <p></p>
- <p></p>
- <p>
- </p>
- <ul>
- <li><p><tt>(general-category-primary-category<i> general-category</i>) -> <i>primary-category</i></tt><a name="node_idx_418"></a>
- </p>
- <li><p><tt>(primary-category <i>primary-category-name</i>) -> <i>primary-category</i></tt> (syntax)
- </p>
- <li><p><tt>(primary-category?<i> x</i>) -> <i>boolean</i></tt><a name="node_idx_420"></a>
- </p>
- </ul><p>
- <tt>General-category-primary-category</tt> maps the general category to
- its associated primary category -- also listed in
- Figure <a href="#node_fig_Temp_19">2</a>. The <tt>primary-category</tt>
- syntax returns the primary-category object associated with
- <i>primary-category-name</i>. <tt>Primary-category?</tt> is the
- predicate for primary-category objects.</p>
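- <p>Building on this, a character can be mapped to its primary category. The helpers <tt>char-primary-category</tt> and <tt>letter-char?</tt> below are hypothetical, and the second one assumes that primary-category objects can be compared with <tt>eq?</tt>, which the text above does not state.</p>

```scheme
; ,open unicode-char-maps
(define (char-primary-category char)      ; hypothetical helper
  (general-category-primary-category (char-general-category char)))

(primary-category? (char-primary-category #\A))   ; => #t

; Assumption: primary-category objects are unique, so eq? is meaningful.
(define (letter-char? char)               ; hypothetical helper
  (eq? (char-primary-category char) (primary-category letter)))
```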
- <p>
- The <tt>unicode-char-maps</tt> structure also provides the following
- additional case-mapping procedures for characters:
- </p>
- <ul>
- <li><p><tt>(char-titlecase?<i> char</i>) -> <i>boolean</i></tt><a name="node_idx_422"></a>
- </p>
- <li><p><tt>(char-titlecase<i> char</i>) -> <i>char</i></tt><a name="node_idx_424"></a>
- </p>
- <li><p><tt>(char-foldcase<i> char</i>) -> <i>char</i></tt><a name="node_idx_426"></a>
- </p>
- </ul><p>
- <tt>Char-titlecase?</tt> tests if a character is in titlecase.
- <tt>Char-titlecase</tt> returns the titlecase counterpart of a
- character. <tt>Char-foldcase</tt> folds the case of a character, i.e.
- maps it to uppercase first, then to lowercase.
- The following case-mapping procedures on strings are available:
- </p>
- <ul>
- <li><p><tt>(string-upcase<i> string</i>) -> <i>string</i></tt><a name="node_idx_428"></a>
- </p>
- <li><p><tt>(string-downcase<i> string</i>) -> <i>string</i></tt><a name="node_idx_430"></a>
- </p>
- <li><p><tt>(string-titlecase<i> string</i>) -> <i>string</i></tt><a name="node_idx_432"></a>
- </p>
- <li><p><tt>(string-foldcase<i> string</i>) -> <i>string</i></tt><a name="node_idx_434"></a>
- </p>
- </ul><p>
- These implement the simple case mappings defined by the Unicode
- standard -- note that the length of the output string may be different
- from that of the input string.</p>
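- <p>The case-mapping procedures might be used as follows. This sketch assumes that the string procedures are exported by <tt>unicode-char-maps</tt> alongside the character procedures; the inputs are chosen so that the results do not depend on any length-changing mapping.</p>

```scheme
; ,open unicode-char-maps
(char-titlecase? #\x1c5)                   ; => #t, U+01C5 has general category Lt
(char->integer (char-titlecase #\x1c6))    ; => 453, i.e. U+01C5
(char-foldcase #\A)                        ; => #\a

(string-upcase "hello")                    ; => "HELLO"
(string-downcase "HeLLo")                  ; => "hello"
(string-foldcase "HeLLo")                  ; => "hello"
(string-titlecase "hello world")           ; => "Hello World"
```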
- <p>
- </p>
- <a name="node_sec_6.4"></a>
- <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.4">6.4 SRFI 14</a></h2>
- <p>The SRFI 14 ("Character Sets") implementation in the <tt>srfi-14</tt>
- structure is fully Unicode-compliant.</p>
- <p>
- </p>
- <a name="node_sec_6.5"></a>
- <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.5">6.5 R6RS</a></h2>
- <p>The <tt>unicode-r6rs</tt> structure exports the procedures from the
- <tt>(r6rs unicode)</tt> library of the 5.91 draft of R<sup>6</sup>RS that are not
- already in the <tt>scheme</tt> structure:</p>
- <p>
- </p>
- <div align=left><table><tr><td>
- <tt>string-normalize-nfd</tt><br>
- <tt>string-normalize-nfkd</tt><br>
- <tt>string-normalize-nfc</tt><br>
- <tt>string-normalize-nfkc</tt><br>
- <tt>char-titlecase</tt><br>
- <tt>char-title-case?</tt><br>
- <tt>char-foldcase</tt><br>
- <tt>string-upcase</tt><br>
- <tt>string-downcase</tt><br>
- <tt>string-foldcase</tt><br>
- <tt>string-titlecase</tt>
- </td></tr></table></div>
- The <tt>r6rs-unicode</tt> structure also exports a
- <tt>char-general-category</tt> procedure compatible with the
- <tt>(r6rs unicode)</tt> library. Note that, as Scheme 48 treats
- source code case-insensitively, the symbols it returns are
- all-lowercase.<p>
- </p>
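- <p>As an illustration of the normalization procedures listed in section 6.5, the following sketch composes and decomposes an accented letter. It assumes the structure named there has been opened; the results follow from the Unicode normalization forms.</p>

```scheme
; Open the structure named in section 6.5 first.
(string-length (string-normalize-nfd "\xe9;"))          ; => 2: e followed by U+0301
(string-length (string-normalize-nfc "e\x301;"))        ; => 1: U+00E9
(string=? (string-normalize-nfc "e\x301;") "\xe9;")     ; => #t
(string-length (string-normalize-nfkc "\xfb01;"))       ; => 2: "fi" from the ligature U+FB01
```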
- <a name="node_sec_6.6"></a>
- <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.6">6.6 I/O</a></h2>
- <p>Ports must encode any text a program writes to an output port to a
- byte sequence, and conversely decode byte sequences when a program
- reads text from an input port. Therefore, each port has an associated
- <i>text codec</i><a name="node_idx_436"></a> that describes how to encode and decode text.</p>
- <p>
- Note that the interface to the text codec functionality is
- experimental and very likely to change in the future.</p>
- <p>
- </p>
- <a name="node_sec_6.6.1"></a>
- <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.6.1">6.6.1 Text codecs</a></h3>
- <p></p>
- <p>
- The <tt>i/o</tt> structure defines the following procedures:
- </p>
- <ul>
- <li><p><tt>(port-text-codec<i> port</i>) -> <i>text-codec</i></tt><a name="node_idx_438"></a>
- </p>
- <li><p><tt>(set-port-text-codec!<i> port text-codec</i>)</tt><a name="node_idx_440"></a>
- </p>
- </ul><p>
- These two procedures retrieve and set the text codec associated with a
- port, respectively. A program can set the text codec of a port at any
- time, even if it has already performed I/O on the port.</p>
- <p>
- The <tt>text-codecs</tt> structure defines the following procedures and macros:</p>
- <p>
- </p>
- <ul>
- <li><p><tt>(text-codec?<i> x</i>) -> <i>boolean</i></tt><a name="node_idx_442"></a>
- </p>
- <li><p><tt>null-text-codec</tt> (text-codec)<a name="node_idx_444"></a>
- </p>
- <li><p><tt>us-ascii-codec</tt> (text-codec)<a name="node_idx_446"></a>
- </p>
- <li><p><tt>latin-1-codec</tt> (text-codec)<a name="node_idx_448"></a>
- </p>
- <li><p><tt>utf-8-codec</tt> (text-codec)<a name="node_idx_450"></a>
- </p>
- <li><p><tt>utf-16le-codec</tt> (text-codec)<a name="node_idx_452"></a>
- </p>
- <li><p><tt>utf-16be-codec</tt> (text-codec)<a name="node_idx_454"></a>
- </p>
- <li><p><tt>utf-32le-codec</tt> (text-codec)<a name="node_idx_456"></a>
- </p>
- <li><p><tt>utf-32be-codec</tt> (text-codec)<a name="node_idx_458"></a>
- </p>
- <li><p><tt>(find-text-codec<i> string</i>) -> <i>text-codec or <tt>#f</tt></i></tt><a name="node_idx_460"></a>
- </p>
- </ul><p>
- <tt>Text-codec?</tt> is the predicate for text codecs.
- <tt>Null-text-codec</tt> is primarily meant for null ports that never
- yield input and swallow all output. The remaining text codecs listed
- above implement the US-ASCII, Latin-1, UTF-8, UTF-16 (little-endian),
- UTF-16 (big-endian), UTF-32 (little-endian), and UTF-32 (big-endian)
- encodings, respectively.</p>
- <p>
- <tt>Find-text-codec</tt> finds the codec associated with an encoding
- name. The names of the above encodings are <code class=verbatim>"null"</code>,
- <code class=verbatim>"US-ASCII"</code>, <code class=verbatim>"ISO8859-1"</code>, <code class=verbatim>"UTF-8"</code>,
- <code class=verbatim>"UTF-16LE"</code>, <code class=verbatim>"UTF-16BE"</code>, <code class=verbatim>"UTF-32LE"</code>, and
- <code class=verbatim>"UTF-32BE"</code>, respectively.</p>
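- <p>Put together, switching the codec of a port might look like the following sketch. The file name <tt>legacy.txt</tt> and the helper <tt>codec-by-name</tt> are made up for the example; the structures are assumed to be opened as shown in the first comment.</p>

```scheme
; ,open i/o text-codecs
(define in (open-input-file "legacy.txt"))    ; hypothetical file name
(set-port-text-codec! in latin-1-codec)       ; decode subsequent reads as Latin-1
(text-codec? (port-text-codec in))            ; => #t

; Look a codec up by name, falling back to UTF-8 for unknown names.
(define (codec-by-name name)                  ; hypothetical helper
  (or (find-text-codec name) utf-8-codec))

(set-port-text-codec! in (codec-by-name "UTF-16LE"))
(close-input-port in)
```

- <p>Because <tt>find-text-codec</tt> returns <tt>#f</tt> for a name it does not know, the fallback in <tt>codec-by-name</tt> is a design choice of this sketch, not something the library does on its own.</p>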
- <p>
- </p>
- <a name="node_sec_6.6.2"></a>
- <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.6.2">6.6.2 Text-codec utilities</a></h3>
- <p>The <tt>text-codec-utils</tt> structure exports a few utilities for
- dealing with text codecs:</p>
- <p>
- </p>
- <ul>
- <li><p><tt>(guess-port-text-codec-according-to-bom<i> port</i>) -> <i>text-codec or <tt>#f</tt></i></tt><a name="node_idx_462"></a>
- </p>
- <li><p><tt>(set-port-text-codec-according-to-bom!<i> port</i>) -> <i>boolean</i></tt><a name="node_idx_464"></a>
- </p>
- </ul><p>
- These procedures look at the byte-order mark (also called the
- "BOM", <tt>U+FEFF</tt>) at the
- beginning of a port and guess the appropriate text codec. This works
- only for UTF-16 (little-endian and big-endian) and UTF-8.
- <tt>Guess-port-text-codec-according-to-bom</tt> returns the text codec,
- or <tt>#f</tt> if it found no UTF-16 or UTF-8 BOM. Note that this
- actually reads from the port. If the guess does not succeed, it is
- probably a good idea to re-open the port.
- <tt>Set-port-text-codec-according-to-bom!</tt> calls
- <tt>guess-port-text-codec-according-to-bom</tt>, sets the port text
- codec to the result if successful and returns <tt>#t</tt>. If it is
- not successful, it returns <tt>#f</tt>. As with
- <tt>guess-port-text-codec-according-to-bom</tt>, this reads from the
- port, whether successful or not.</p>
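- <p>Following the advice above about re-opening, a file could be opened with BOM detection roughly as follows. The procedure <tt>open-input-file/bom</tt> is a hypothetical helper, and the sketch assumes <tt>text-codec-utils</tt> has been opened.</p>

```scheme
; ,open text-codec-utils
(define (open-input-file/bom filename)         ; hypothetical helper
  (let ((port (open-input-file filename)))
    (if (set-port-text-codec-according-to-bom! port)
        port                                   ; codec was set from the BOM
        (begin
          ;; No BOM was found, but bytes have been read: start over
          ;; with a fresh port that uses the default codec.
          (close-input-port port)
          (open-input-file filename)))))
```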
- <p>
- </p>
- <a name="node_sec_6.6.3"></a>
- <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.6.3">6.6.3 Creating text codecs</a></h3>
- <p></p>
- <ul>
- <li><p><tt>(make-text-codec<i> strings encode-proc decode-proc</i>) -> <i>text-codec</i></tt><a name="node_idx_466"></a>
- </p>
- <li><p><tt>(text-codec-names<i> text-codec</i>) -> <i>list of strings</i></tt><a name="node_idx_468"></a>
- </p>
- <li><p><tt>(text-codec-encode-char-proc<i> text-codec</i>) -> <i> encode-proc</i></tt><a name="node_idx_470"></a>
- </p>
- <li><p><tt>(text-codec-decode-char-proc<i> text-codec</i>) -> <i> decode-proc</i></tt><a name="node_idx_472"></a>
- </p>
- <li><p><tt>(define-text-codec <i>id</i> <i>name</i> <i>encode-proc</i> <i>decode-proc</i>)</tt> (syntax)<a name="node_idx_474"></a>
- </p>
- <li><p><tt>(define-text-codec <i>id</i> (<i>name</i> <tt>...</tt>) <i>encode-proc</i> <i>decode-proc</i>)</tt> (syntax)<a name="node_idx_476"></a>
- </p>
- </ul><p>
- <tt>Make-text-codec</tt> constructs a text codec from a list of names,
- and an encode and a decode procedure. (See below on how to construct
- encode and decode procedures.) <tt>Text-codec-names</tt>,
- <tt>text-codec-encode-char-proc</tt>, and
- <tt>text-codec-decode-char-proc</tt> are the accessors for text codecs.
- The <tt>define-text-codec</tt> syntax is a shorthand for binding a global
- identifier to a text codec. Its first form is for codecs with only
- one name, the second for codecs with several names.</p>
- <p>
- Encoding and decoding procedures work as follows:
- </p>
- <ul>
- <li><p><tt>(<i>encode-proc</i><i> char buffer start count</i>) -> <i>boolean maybe-count</i></tt>
- </p>
- <li><p><tt>(<i>decode-proc</i><i> buffer start count</i>) -> <i>maybe-char count</i></tt>
- </p>
- </ul><p>
- An <i>encode-proc</i> consumes a character <i>char</i> to encode, a
- byte vector <i>buffer</i> to receive the encoding, an index <i>start</i>
- into the buffer, and a block size <i>count</i>. It is supposed to
- encode the character into the block at [<i>start</i>, <i>start +
- count</i>). If the encoding is successful, the procedure must
- return <tt>#t</tt> and the number of bytes needed by the encoding.
- If the character cannot be encoded at all, the procedure must return
- <tt>#f</tt> and <tt>#f</tt>. If the encoding is possible but the
- space is not sufficient, the procedure must return <tt>#f</tt> and the
- total number of bytes needed for the encoding.</p>
- <p>
- A <i>decode-proc</i> consumes a byte vector <i>buffer</i>, an index
- <i>start</i> into the buffer, and a block size <i>count</i>. It is
- supposed to decode the bytes at indices [<i>start</i>, <i>start
- + count</i>). If the decoding is successful, it must return
- the decoded character at the beginning of the block, and the number of
- bytes consumed. If the block neither begins with a valid encoding nor
- is a prefix of one, the procedure must return <tt>#f</tt> and
- <tt>#f</tt>. If the block contains a true prefix of a valid
- encoding, the procedure must return <tt>#f</tt> and the total count of
- bytes (including those already in the block) needed to complete the encoding. Note
- that this byte count is only a guess: the system will provide that
- many bytes, but the decoding procedures might still signal an
- incomplete encoding, causing the system to try to obtain more. </p>
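- <p>The protocol above can be sketched with a one-byte codec that maps scalar values 0 to 255 directly to bytes. This is an illustration only, not the implementation of <tt>latin-1-codec</tt>. All names defined below are hypothetical, and the sketch assumes that <tt>byte-vector-ref</tt> and <tt>byte-vector-set!</tt> are available from a <tt>byte-vectors</tt> structure.</p>

```scheme
; ,open text-codecs byte-vectors
; Assumption: byte-vector-ref and byte-vector-set! come from the
; byte-vectors structure. All names defined here are hypothetical.

(define (encode-char/one-byte char buffer start count)
  (let ((sv (char->integer char)))
    (cond ((> sv 255) (values #f #f))       ; cannot be encoded at all
          ((< count 1) (values #f 1))       ; encodable, but the block is too small
          (else
           (byte-vector-set! buffer start sv)
           (values #t 1)))))                ; success: one byte written

(define (decode-char/one-byte buffer start count)
  (if (< count 1)
      (values #f 1)                         ; one byte is needed in total
      (values (integer->char (byte-vector-ref buffer start))
              1)))                          ; decoded character, one byte consumed

; First form of define-text-codec: a codec with a single name.
(define-text-codec one-byte-codec "x-one-byte-sketch"
  encode-char/one-byte
  decode-char/one-byte)
```

- <p>Each procedure returns two values, matching the result pairs described above. A multi-byte encoding would use the second result of a failed call to report how many bytes it needs in total.</p>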
- <p>
- </p>
- <a name="node_sec_6.7"></a>
- <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.7">6.7 Default encodings</a></h2>
- <p>The default encoding for new ports is UTF-8. For the default
- <tt>current-input-port</tt>, <tt>current-output-port</tt>, and
- <tt>current-error-port</tt>, Scheme 48 consults the OS for encoding
- information.</p>
- <p>
- For Unix, it consults <tt>nl_langinfo(3)</tt>, which in turn consults
- the <tt>LC_</tt> environment variables. If the encoding is not defined
- that way, Scheme 48 reverts to US-ASCII.</p>
- <p>
- Under Windows, Scheme 48 uses Unicode I/O (using UTF-16) for the
- default ports connected to the console, and Latin-1 for default ports
- that are not.</p>
- <p>
- </p>
- <p></p>
- </div>
- </body>
- </html>