  1. <!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
  2. <html>
  3. <!--
  4. Generated from manual.tex by tex2page, v 20050501
  5. (running on MzScheme 299.400, unix),
  6. (c) Dorai Sitaram,
  7. http://www.ccs.neu.edu/~dorai/tex2page/tex2page-doc.html
  8. -->
  9. <head>
  10. <title>
  11. The Incomplete Scheme 48 Reference Manual for release 1.6
  12. </title>
  13. <link rel="stylesheet" type="text/css" href="manual-Z-S.css" title=default>
  14. <meta name=robots content="noindex,follow">
  15. </head>
  16. <body>
  17. <div id=content>
  18. <div align=right class=navigation><i>[Go to <span><a href="manual.html">first</a>, <a href="manual-Z-H-7.html">previous</a></span><span>, <a href="manual-Z-H-9.html">next</a></span> page<span>; &nbsp;&nbsp;</span><span><a href="manual-Z-H-2.html#node_toc_start">contents</a></span><span><span>; &nbsp;&nbsp;</span><a href="manual-Z-H-13.html#node_index_start">index</a></span>]</i></div>
  19. <p></p>
  20. <a name="node_chap_6"></a>
  21. <h1 class=chapter>
  22. <div class=chapterheading><a href="manual-Z-H-2.html#node_toc_node_chap_6">Chapter 6</a></div><br>
  23. <a href="manual-Z-H-2.html#node_toc_node_chap_6">Unicode</a></h1>
  24. <p>Scheme&nbsp;48 fully supports ISO 10646 (Unicode): Scheme characters
  25. represent Unicode scalar values, and Scheme strings are arrays of
  26. scalar values. More information on Unicode can be found at
  27. <a href="http://www.unicode.org/">the Unicode web
  28. site</a>.</p>
  29. <p>
  30. </p>
  31. <a name="node_sec_6.1"></a>
  32. <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.1">6.1&nbsp;&nbsp;Characters and their codes</a></h2>
  33. <p>Scheme&nbsp;48 internally represents characters as Unicode scalar values.
  34. The <tt>unicode</tt> structure contains procedures for converting
  35. between characters and scalar values:
  36. </p>
  37. <ul>
  38. <li><p><tt>(char-&gt;scalar-value<i> char</i>)&nbsp;-&gt;&nbsp;<i>integer</i></tt><a name="node_idx_406"></a>
  39. </p>
  40. <li><p><tt>(scalar-value-&gt;char<i> integer</i>)&nbsp;-&gt;&nbsp;<i>char</i></tt><a name="node_idx_408"></a>
  41. </p>
  42. <li><p><tt>(scalar-value?<i> integer</i>)&nbsp;-&gt;&nbsp;<i>boolean</i></tt><a name="node_idx_410"></a>
  43. </p>
  44. </ul><p>
  45. <tt>Char-&gt;scalar-value</tt> returns the scalar value of a character, and
  46. <tt>scalar-value-&gt;char</tt> converts in the other direction.
  47. <tt>Scalar-value-&gt;char</tt> signals an error if passed an integer that is
  48. not a scalar value.</p>
  49. <p>
  50. Note that the Unicode scalar value range is
  51. </p>
  52. <div align=center><img src="unicode-Z-G-1.gif" border="0" alt="[0, #xD7FF] and [#xE000, #x10FFFF]"></div><p>
  53. In particular, this excludes the surrogates, which UTF-16 uses to
  54. encode scalar values with two 16-bit words. Note that this
  55. representation differs from that of Java, which uses UTF-16 code units
  56. as the character representation -- Scheme&nbsp;48 effectively uses UTF-32,
  57. and is thus in line with other Scheme implementations and the current
  58. Unicode proposal for R<sup>6</sup>RS, as set forth in SRFI&nbsp;75.</p>
  59. <p>
  60. The R<sup>5</sup>RS procedures <tt>char-&gt;integer</tt> and <tt>integer-&gt;char</tt>
  61. are synonyms for <tt>char-&gt;scalar-value</tt> and
  62. <tt>scalar-value-&gt;char</tt>, respectively.</p>
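<p>
As a small illustration of these procedures, the following sketch assumes a Scheme&nbsp;48 session in which the <tt>unicode</tt> structure has been opened (for example with <code class=verbatim>,open unicode</code>). The values in the comments follow from the definition of Unicode scalar values; they are not taken from a recorded session.
</p>
<pre class=verbatim>
; Assumes the unicode structure is open:  ,open unicode
(char-&gt;scalar-value #\A)                            ; 65
(char-&gt;scalar-value (scalar-value-&gt;char #x3BB))     ; 955, i.e. U+03BB
(scalar-value? #xD7FF)                                 ; #t
(scalar-value? #xD800)                                 ; #f, the first surrogate
(scalar-value? #x10FFFF)                               ; #t, the largest scalar value
(scalar-value? #x110000)                               ; #f, beyond the Unicode range
</pre>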
  63. <p>
  64. </p>
  65. <a name="node_sec_6.2"></a>
  66. <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.2">6.2&nbsp;&nbsp;Character and string literals</a></h2>
  67. <p>The syntax specified here is in line with the current Unicode proposal
  68. for R<sup>6</sup>RS, as set forth in SRFI&nbsp;75, except for case-sensitivity.
  69. (Scheme&nbsp;48 is case-insensitive.)</p>
  70. <p>
  71. </p>
  72. <a name="node_sec_6.2.1"></a>
  73. <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.2.1">6.2.1&nbsp;&nbsp;Character literals</a></h3>
  74. <p>The following character names are available in addition to what
  75. R<sup>5</sup>RS provides:
  76. </p>
  77. <ul>
  78. <li><p><code class=verbatim>#\nul</code> (ASCII 0)
  79. </p>
  80. <li><p><code class=verbatim>#\alarm</code> (ASCII 7)
  81. </p>
  82. <li><p><code class=verbatim>#\backspace</code> (ASCII 8)
  83. </p>
  84. <li><p><code class=verbatim>#\tab</code> (ASCII 9)
  85. </p>
  86. <li><p><code class=verbatim>#\vtab</code> (ASCII 11)
  87. </p>
  88. <li><p><code class=verbatim>#\page</code> (ASCII 12)
  89. </p>
  90. <li><p><code class=verbatim>#\return</code> (ASCII 13)
  91. </p>
  92. <li><p><code class=verbatim>#\esc</code> (ASCII 27)
  93. </p>
  94. <li><p><code class=verbatim>#\rubout</code> (ASCII 127)
  95. </p>
  96. <li><p><code class=verbatim>#\x</code>&lt;x&gt;&lt;x&gt;<tt>...</tt> hex, explicitly or implicitly
  97. delimited, where &lt;x&gt;&lt;x&gt;<tt>...</tt> denotes the scalar value
  98. of the character
  99. </p>
  100. </ul><p></p>
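<p>
For instance, assuming the names and the hex syntax listed above, each of the following comparisons should evaluate to <tt>#t</tt>. The sketch uses only R<sup>5</sup>RS procedures, and the closing parenthesis acts as the delimiter of each hex literal.
</p>
<pre class=verbatim>
(char=? #\tab (integer-&gt;char 9))
(char=? #\esc (integer-&gt;char 27))
(char=? #\rubout (integer-&gt;char 127))
(char=? #\x41 #\A)                    ; hex 41 is decimal 65
(= (char-&gt;integer #\x3bb) 955)        ; U+03BB
</pre>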
  101. <p>
  102. </p>
  103. <a name="node_sec_6.2.2"></a>
  104. <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.2.2">6.2.2&nbsp;&nbsp;String literals</a></h3>
  105. <p>The following escape characters in string literals are available in addition to what
  106. R<sup>5</sup>RS provides:</p>
  107. <p>
  108. </p>
  109. <ul>
  110. <li><p><code class=verbatim>\a</code>: alarm (ASCII 7)
  111. </p>
  112. <li><p><code class=verbatim>\b</code>: backspace (ASCII 8)
  113. </p>
  114. <li><p><code class=verbatim>\t</code>: tab (ASCII 9)
  115. </p>
  116. <li><p><code class=verbatim>\n</code>: linefeed (ASCII 10)
  117. </p>
  118. <li><p><code class=verbatim>\v</code>: vertical tab (ASCII 11)
  119. </p>
  120. <li><p><code class=verbatim>\f</code>: formfeed (ASCII 12)
  121. </p>
  122. <li><p><code class=verbatim>\r</code>: return (ASCII 13)
  123. </p>
  124. <li><p><code class=verbatim>\e</code>: escape (ASCII 27)
  125. </p>
  126. <li><p><code class=verbatim>\'</code>: quote (ASCII 39, same as unquoted)
  127. </p>
  128. <li><p><code class=verbatim>\</code>&lt;newline&gt;&lt;intraline whitespace&gt;: elided (allows a single-line string to
  129. span source lines)
  130. </p>
  131. <li><p><code class=verbatim>\x</code>&lt;x&gt;&lt;x&gt;<tt>...</tt><code class=verbatim>;</code> hex, where &lt;x&gt;&lt;x&gt;<tt>...</tt>
  132. denotes the scalar value of the character
  133. </p>
  134. </ul><p></p>
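<p>
A short sketch of how these escapes behave, assuming the syntax described in the list above. The expected values in the comments are derived from that description, not from a recorded session.
</p>
<pre class=verbatim>
(string-length &quot;a\tb&quot;)                        ; 3: a, tab, b
(char-&gt;integer (string-ref &quot;\x3bb;&quot; 0))       ; 955, i.e. U+03BB
(string=? &quot;\x41;BC&quot; &quot;ABC&quot;)                    ; #t
(string=? &quot;one \
           two&quot; &quot;one two&quot;)                    ; #t: the line break and the
                                              ; indentation after it are elided
</pre>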
  135. <p>
  136. </p>
  137. <a name="node_sec_6.2.3"></a>
  138. <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.2.3">6.2.3&nbsp;&nbsp;Identifiers and symbol literals</a></h3>
  139. <p>Where R<sup>5</sup>RS allows a &lt;letter&gt;, Scheme&nbsp;48 allows in addition any
  140. character whose scalar value is greater than 127 and whose Unicode
  141. general category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc,
  142. Po, Sc, Sm, Sk, So, or Co.</p>
  143. <p>
  144. Moreover, when a backslash appears in a symbol, it must start a
  145. <code class=verbatim>\x</code>&lt;x&gt;&lt;x&gt;<tt>...</tt><code class=verbatim>;</code> escape, which identifies an
  146. arbitrary character to include in the symbol. Note that a backslash
  147. itself can be specified as <code class=verbatim>\x5C;</code>.</p>
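<p>
The following sketch shows how such escapes might be used to write symbols that contain a space or a backslash. The expected results are inferred from the description above and have not been checked against a running system.
</p>
<pre class=verbatim>
(symbol-&gt;string 'two\x20;words)                ; expected: &quot;two words&quot;
(string-length (symbol-&gt;string 'a\x5C;b))      ; expected: 3 (a, backslash, b)
</pre>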
  148. <p>
  149. </p>
  150. <a name="node_sec_6.3"></a>
  151. <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.3">6.3&nbsp;&nbsp;Character classification and case mappings</a></h2>
  152. <p>The R<sup>5</sup>RS character predicates -- <tt>char-whitespace?</tt>,
  153. <tt>char-lower-case?</tt>, <tt>char-upper-case?</tt>,
  154. <tt>char-numeric?</tt>, and <tt>char-alphabetic?</tt> -- all handle the full
  155. Unicode range.</p>
  156. <p>
  157. <tt>Char-upcase</tt> and <tt>char-downcase</tt> as well as
  158. <tt>char-ci=?</tt>, <tt>char-ci&lt;?</tt>, <tt>char-ci&lt;=?</tt>,
  159. <tt>char-ci&gt;?</tt>, <tt>char-ci&gt;=?</tt>, <tt>string-ci=?</tt>,
  160. <tt>string-ci&lt;?</tt>, <tt>string-ci&gt;?</tt>, <tt>string-ci&lt;=?</tt>,
  161. <tt>string-ci&gt;=?</tt> all use the standard simple locale-insensitive
  162. Unicode case folding.</p>
  163. <p>
  164. In addition, Scheme&nbsp;48 provides the <tt>unicode-char-maps</tt> structure
  165. for more complete access to the Unicode character classification with
  166. the following procedures and macros:
  167. </p>
  168. <ul>
  169. <li><p><tt>(general-category <i>general-category-name</i>)&nbsp;-&gt;&nbsp;<i>general-category</i></tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(syntax)
  170. </p>
  171. <li><p><tt>(general-category?<i> x</i>)&nbsp;-&gt;&nbsp;<i>boolean</i></tt><a name="node_idx_412"></a>
  172. </p>
  173. <li><p><tt>(general-category-id<i> general-category</i>)&nbsp;-&gt;&nbsp;<i>string</i></tt><a name="node_idx_414"></a>
  174. </p>
  175. <li><p><tt>(char-general-category<i> char</i>)&nbsp;-&gt;&nbsp;<i>general-category</i></tt><a name="node_idx_416"></a>
  176. </p>
  177. </ul><p>
  178. The syntax <tt>general-category</tt> returns a Unicode general category
  179. object associated with <i>general-category-name</i>. (See
  180. Figure&nbsp;<a href="#node_fig_Temp_19">2</a> below.) <tt>General-category?</tt>
  181. is the predicate for general-category objects.
  182. <tt>General-category-id</tt> returns the Unicode category id as a string
  183. (also listed in Figure&nbsp;<a href="#node_fig_Temp_19">2</a>).
  184. <tt>Char-general-category</tt> returns the general category of a character.</p>
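<p>
As an illustration, the following sketch assumes the <tt>unicode-char-maps</tt> structure has been opened. The category ids in the comments are those Unicode assigns to the characters shown.
</p>
<pre class=verbatim>
; Assumes:  ,open unicode-char-maps
(general-category-id (general-category uppercase-letter))   ; &quot;Lu&quot;
(general-category-id (char-general-category #\a))           ; &quot;Ll&quot;
(general-category-id (char-general-category #\7))           ; &quot;Nd&quot;
(general-category? (char-general-category #\space))         ; #t
</pre>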
  185. <p>
  186. </p>
  187. <p></p>
  188. <hr>
  189. <p></p>
  190. <a name="node_fig_Temp_19"></a>
  191. <div class=figure align=center><table width=100%><tr><td align=center>
  192. <table border=1><tr><td valign=top ><i>general-category-name</i> </td><td valign=top ><i>primary-category-name</i> </td><td valign=top >Unicode category id
  193. </td></tr>
  194. <tr><td valign=top ><tt>uppercase-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>&quot;Lu&quot;</code> </td></tr>
  195. <tr><td valign=top ><tt>lowercase-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>&quot;Ll&quot;</code> </td></tr>
  196. <tr><td valign=top ><tt>titlecase-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>&quot;Lt&quot;</code> </td></tr>
  197. <tr><td valign=top ><tt>modified-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>&quot;Lm&quot;</code> </td></tr>
  198. <tr><td valign=top ><tt>other-letter</tt> </td><td valign=top ><tt>letter</tt> </td><td valign=top ><code class=verbatim>&quot;Lo&quot;</code> </td></tr>
  199. <tr><td valign=top ><p>
  200. <tt>non-spacing-mark</tt> </p>
  201. </td><td valign=top ><tt>mark</tt> </td><td valign=top ><code class=verbatim>&quot;Mn&quot;</code> </td></tr>
  202. <tr><td valign=top ><tt>combining-spacing-mark</tt> </td><td valign=top ><tt>mark</tt> </td><td valign=top ><code class=verbatim>&quot;Mc&quot;</code> </td></tr>
  203. <tr><td valign=top ><tt>enclosing-mark</tt> </td><td valign=top ><tt>mark</tt> </td><td valign=top ><code class=verbatim>&quot;Me&quot;</code> </td></tr>
  204. <tr><td valign=top ><p>
  205. <tt>decimal-digit-number</tt> </p>
  206. </td><td valign=top ><tt>number</tt> </td><td valign=top ><code class=verbatim>&quot;Nd&quot;</code> </td></tr>
  207. <tr><td valign=top ><tt>letter-number</tt> </td><td valign=top ><tt>number</tt> </td><td valign=top ><code class=verbatim>&quot;Nl&quot;</code> </td></tr>
  208. <tr><td valign=top ><tt>other-number</tt> </td><td valign=top ><tt>number</tt> </td><td valign=top ><code class=verbatim>&quot;No&quot;</code> </td></tr>
  209. <tr><td valign=top ><p>
  210. <tt>opening-punctuation</tt> </p>
  211. </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>&quot;Ps&quot;</code> </td></tr>
  212. <tr><td valign=top ><tt>closing-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>&quot;Pe&quot;</code> </td></tr>
  213. <tr><td valign=top ><tt>initial-quote-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>&quot;Pi&quot;</code> </td></tr>
  214. <tr><td valign=top ><tt>final-quote-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>&quot;Pf&quot;</code> </td></tr>
  215. <tr><td valign=top ><tt>dash-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>&quot;Pd&quot;</code> </td></tr>
  216. <tr><td valign=top ><tt>connector-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>&quot;Pc&quot;</code> </td></tr>
  217. <tr><td valign=top ><tt>other-punctuation</tt> </td><td valign=top ><tt>punctuation</tt> </td><td valign=top ><code class=verbatim>&quot;Po&quot;</code> </td></tr>
  218. <tr><td valign=top ><p>
  219. <tt>currency-symbol</tt> </p>
  220. </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>&quot;Sc&quot;</code> </td></tr>
  221. <tr><td valign=top ><tt>mathematical-symbol</tt> </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>&quot;Sm&quot;</code> </td></tr>
  222. <tr><td valign=top ><tt>modifier-symbol</tt> </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>&quot;Sk&quot;</code> </td></tr>
  223. <tr><td valign=top ><tt>other-symbol</tt> </td><td valign=top ><tt>symbol</tt> </td><td valign=top ><code class=verbatim>&quot;So&quot;</code> </td></tr>
  224. <tr><td valign=top ><p>
  225. <tt>space-separator</tt> </p>
  226. </td><td valign=top ><tt>separator</tt> </td><td valign=top ><code class=verbatim>&quot;Zs&quot;</code> </td></tr>
  227. <tr><td valign=top ><tt>paragraph-separator</tt> </td><td valign=top ><tt>separator</tt> </td><td valign=top ><code class=verbatim>&quot;Zp&quot;</code> </td></tr>
  228. <tr><td valign=top ><tt>line-separator</tt> </td><td valign=top ><tt>separator</tt> </td><td valign=top ><code class=verbatim>&quot;Zl&quot;</code> </td></tr>
  229. <tr><td valign=top ><p>
  230. <tt>control-character</tt> </p>
  231. </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>&quot;Cc&quot;</code> </td></tr>
  232. <tr><td valign=top ><tt>formatting-character</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>&quot;Cf&quot;</code> </td></tr>
  233. <tr><td valign=top ><tt>surrogate</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>&quot;Cs&quot;</code> </td></tr>
  234. <tr><td valign=top ><tt>private-use-character</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>&quot;Co&quot;</code> </td></tr>
  235. <tr><td valign=top ><tt>unassigned</tt> </td><td valign=top ><tt>miscellaneous</tt> </td><td valign=top ><code class=verbatim>&quot;Cn&quot;</code>
  236. </td></tr></table><p>
  237. </p>
  238. </td></tr>
  239. <tr><td align=center><b>Figure 2:</b>&nbsp;&nbsp;Unicode general categories and primary categories</td></tr>
  240. <tr><td>
  241. </td></tr></table></div><p></p>
  242. <hr>
  243. <p></p>
  244. <p></p>
  245. <p>
  246. </p>
  247. <ul>
  248. <li><p><tt>(general-category-primary-category<i> general-category</i>)&nbsp;-&gt;&nbsp;<i>primary-category</i></tt><a name="node_idx_418"></a>
  249. </p>
  250. <li><p><tt>(primary-category <i>primary-category-name</i>)&nbsp;-&gt;&nbsp;<i>primary-category</i></tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(syntax)
  251. </p>
  252. <li><p><tt>(primary-category?<i> x</i>)&nbsp;-&gt;&nbsp;<i>boolean</i></tt><a name="node_idx_420"></a>
  253. </p>
  254. </ul><p>
  255. <tt>General-category-primary-category</tt> maps the general category to
  256. its associated primary category -- also listed in
  257. Figure&nbsp;<a href="#node_fig_Temp_19">2</a>. The <tt>primary-category</tt>
  258. syntax returns the primary-category object associated with
  259. <i>primary-category-name</i>. <tt>Primary-category?</tt> is the
  260. predicate for primary-category objects.</p>
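<p>
A possible use of these procedures is a predicate that tests whether a character belongs to a given primary category. The helper name <tt>char-in-primary-category?</tt> is hypothetical, and the sketch assumes that primary-category objects can be compared with <tt>eq?</tt>, which this manual does not state explicitly.
</p>
<pre class=verbatim>
; Assumes:  ,open unicode-char-maps
; Hypothetical helper; assumes primary-category objects are eq?-comparable.
(define (char-in-primary-category? char primary)
  (eq? (general-category-primary-category (char-general-category char))
       primary))

(char-in-primary-category? #\A (primary-category letter))   ; expected: #t
(char-in-primary-category? #\7 (primary-category letter))   ; expected: #f
(char-in-primary-category? #\7 (primary-category number))   ; expected: #t
</pre>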
  261. <p>
  262. The <tt>unicode-char-maps</tt> structure also provides the following
  263. additional case-mapping procedures for characters:
  264. </p>
  265. <ul>
  266. <li><p><tt>(char-titlecase?<i> char</i>)&nbsp;-&gt;&nbsp;<i>boolean</i></tt><a name="node_idx_422"></a>
  267. </p>
  268. <li><p><tt>(char-titlecase<i> char</i>)&nbsp;-&gt;&nbsp;<i>char</i></tt><a name="node_idx_424"></a>
  269. </p>
  270. <li><p><tt>(char-foldcase<i> char</i>)&nbsp;-&gt;&nbsp;<i>char</i></tt><a name="node_idx_426"></a>
  271. </p>
  272. </ul><p>
  273. <tt>Char-titlecase?</tt> tests if a character is in titlecase.
  274. <tt>Char-titlecase</tt> returns the titlecase counterpart of a
  275. character. <tt>Char-foldcase</tt> folds the case of a character, i.e.
  276. maps it to uppercase first, then to lowercase.
  277. The following case-mapping procedures on strings are available:
  278. </p>
  279. <ul>
  280. <li><p><tt>(string-upcase<i> string</i>)&nbsp;-&gt;&nbsp;<i>string</i></tt><a name="node_idx_428"></a>
  281. </p>
  282. <li><p><tt>(string-downcase<i> string</i>)&nbsp;-&gt;&nbsp;<i>string</i></tt><a name="node_idx_430"></a>
  283. </p>
  284. <li><p><tt>(string-titlecase<i> string</i>)&nbsp;-&gt;&nbsp;<i>string</i></tt><a name="node_idx_432"></a>
  285. </p>
  286. <li><p><tt>(string-foldcase<i> string</i>)&nbsp;-&gt;&nbsp;<i>string</i></tt><a name="node_idx_434"></a>
  287. </p>
  288. </ul><p>
  289. These implement the simple case mappings defined by the Unicode
  290. standard -- note that the length of the output string may be different
  291. from that of the input string.</p>
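<p>
A brief sketch of the case-mapping procedures, assuming the <tt>unicode-char-maps</tt> structure has been opened. The examples stick to ASCII text and to the U+01C4 to U+01C6 letters, whose mappings are fixed by the Unicode character database.
</p>
<pre class=verbatim>
; Assumes:  ,open unicode-char-maps
(string-upcase &quot;Hello, World&quot;)                            ; &quot;HELLO, WORLD&quot;
(string-downcase &quot;Hello, World&quot;)                          ; &quot;hello, world&quot;
(string-foldcase &quot;Hello, World&quot;)                          ; &quot;hello, world&quot;
(char-&gt;integer (char-titlecase (integer-&gt;char #x1C6)))    ; 453, i.e. U+01C5
(char-titlecase? (integer-&gt;char #x1C5))                   ; #t
(char-titlecase? #\A)                                     ; #f, uppercase rather than titlecase
</pre>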
  292. <p>
  293. </p>
  294. <a name="node_sec_6.4"></a>
  295. <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.4">6.4&nbsp;&nbsp;SRFI 14</a></h2>
  296. <p>The SRFI&nbsp;14 (&quot;Character Sets&quot;) implementation in the <tt>srfi-14</tt>
  297. structure is fully Unicode-compliant.</p>
  298. <p>
  299. </p>
  300. <a name="node_sec_6.5"></a>
  301. <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.5">6.5&nbsp;&nbsp;R6RS</a></h2>
  302. <p>The <tt>unicode-r6rs</tt> structure exports the procedures from the
  303. <tt>(r6rs unicode)</tt> library of the 5.91 draft of R<sup>6</sup>RS that are not
  304. already in the <tt>scheme</tt> structure:</p>
  305. <p>
  306. </p>
  307. <div align=left><table><tr><td>
  308. <tt>string-normalize-nfd</tt><br>
  309. <tt>string-normalize-nfkd</tt><br>
  310. <tt>string-normalize-nfc</tt><br>
  311. <tt>string-normalize-nfkc</tt><br>
  312. <tt>char-titlecase</tt><br>
  313. <tt>char-title-case?</tt><br>
  314. <tt>char-foldcase</tt><br>
  315. <tt>string-upcase</tt><br>
  316. <tt>string-downcase</tt><br>
  317. <tt>string-foldcase</tt><br>
  318. <tt>string-titlecase</tt>
  319. </td></tr></table></div>
  320. The <tt>r6rs-unicode</tt> structure also exports a
  321. <tt>char-general-category</tt> procedure compatible with the
  322. <tt>(r6rs unicode)</tt> library. Note that, as Scheme&nbsp;48 treats
  323. source code case-insensitively, the symbols it returns are
  324. all-lowercase.<p>
  325. </p>
  326. <a name="node_sec_6.6"></a>
  327. <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.6">6.6&nbsp;&nbsp;I/O</a></h2>
  328. <p>Ports must encode any text a program writes to an output port to a
  329. byte sequence, and conversely decode byte sequences when a program
  330. reads text from an input port. Therefore, each port has an associated
  331. <i>text codec</i><a name="node_idx_436"></a> that describes how to encode and decode text.</p>
  332. <p>
  333. Note that the interface to the text codec functionality is
  334. experimental and very likely to change in the future.</p>
  335. <p>
  336. </p>
  337. <a name="node_sec_6.6.1"></a>
  338. <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.6.1">6.6.1&nbsp;&nbsp;Text codecs</a></h3>
  339. <p></p>
  340. <p>
  341. The <tt>i/o</tt> structure defines the following procedures:
  342. </p>
  343. <ul>
  344. <li><p><tt>(port-text-codec<i> port</i>)&nbsp;-&gt;&nbsp;<i>text-codec</i></tt><a name="node_idx_438"></a>
  345. </p>
  346. <li><p><tt>(set-port-text-codec!<i> port text-codec</i>)</tt><a name="node_idx_440"></a>
  347. </p>
  348. </ul><p>
  349. These two procedures retrieve and set the text codec associated with a
  350. port, respectively. A program can set the text codec of a port at any
  351. time, even if it has already performed I/O on the port.</p>
  352. <p>
  353. The <tt>text-codecs</tt> structure defines the following procedures and macros:</p>
  354. <p>
  355. </p>
  356. <ul>
  357. <li><p><tt>(text-codec?<i> x</i>)&nbsp;-&gt;&nbsp;<i>boolean</i></tt><a name="node_idx_442"></a>
  358. </p>
  359. <li><p><tt>null-text-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_444"></a>
  360. </p>
  361. <li><p><tt>us-ascii-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_446"></a>
  362. </p>
  363. <li><p><tt>latin-1-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_448"></a>
  364. </p>
  365. <li><p><tt>utf-8-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_450"></a>
  366. </p>
  367. <li><p><tt>utf-16le-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_452"></a>
  368. </p>
  369. <li><p><tt>utf-16be-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_454"></a>
  370. </p>
  371. <li><p><tt>utf-32le-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_456"></a>
  372. </p>
  373. <li><p><tt>utf-32be-codec</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;( text-codec)<a name="node_idx_458"></a>
  374. </p>
  375. <li><p><tt>(find-text-codec<i> string</i>)&nbsp;-&gt;&nbsp;<i>text-codec or <tt>#f</tt></i></tt><a name="node_idx_460"></a>
  376. </p>
  377. </ul><p>
  378. <tt>Text-codec?</tt> is the predicate for text codecs.
  379. <tt>Null-text-codec</tt> is primarily meant for null ports that never
  380. yield input and swallow all output. The remaining codecs listed above
  381. implement the US-ASCII, Latin-1, Unicode UTF-8, Unicode UTF-16
  382. (little-endian), Unicode UTF-16 (big-endian), Unicode UTF-32
  383. (little-endian), Unicode UTF-32 (big-endian) encodings, respectively.</p>
  384. <p>
  385. <tt>Find-text-codec</tt> finds the codec associated with an encoding
  386. name. The names of the above encodings are <code class=verbatim>&quot;null&quot;</code>,
  387. <code class=verbatim>&quot;US-ASCII&quot;</code>, <code class=verbatim>&quot;ISO8859-1&quot;</code>, <code class=verbatim>&quot;UTF-8&quot;</code>,
  388. <code class=verbatim>&quot;UTF-16LE&quot;</code>, <code class=verbatim>&quot;UTF-16BE&quot;</code>, <code class=verbatim>&quot;UTF-32LE&quot;</code>, and
  389. <code class=verbatim>&quot;UTF-32BE&quot;</code>, respectively.</p>
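<p>
The procedures above can be combined to select a port's encoding by name. In the following sketch, <tt>set-port-encoding!</tt> is a hypothetical helper and <code class=verbatim>&quot;greeting.txt&quot;</code> a hypothetical file name; neither is part of Scheme&nbsp;48. The sketch assumes the <tt>i/o</tt> and <tt>text-codecs</tt> structures have been opened.
</p>
<pre class=verbatim>
; Assumes:  ,open i/o text-codecs
; Hypothetical helper: look up a codec by name and install it on PORT.
; Returns #t on success and #f if no codec has that name.
(define (set-port-encoding! port encoding-name)
  (let ((codec (find-text-codec encoding-name)))
    (if codec
        (begin
          (set-port-text-codec! port codec)
          #t)
        #f)))

(call-with-output-file &quot;greeting.txt&quot;          ; hypothetical file name
  (lambda (port)
    (if (set-port-encoding! port &quot;UTF-16LE&quot;)
        (display &quot;hello&quot; port))))
</pre>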
  390. <p>
  391. </p>
  392. <a name="node_sec_6.6.2"></a>
  393. <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.6.2">6.6.2&nbsp;&nbsp;Text-codec utilities</a></h3>
  394. <p>The <tt>text-codec-utils</tt> structure exports a few utilities for
  395. dealing with text codecs:</p>
  396. <p>
  397. </p>
  398. <ul>
  399. <li><p><tt>(guess-port-text-codec-according-to-bom<i> port</i>)&nbsp;-&gt;&nbsp;<i>text-codec or <tt>#f</tt></i></tt><a name="node_idx_462"></a>
  400. </p>
  401. <li><p><tt>(set-port-text-codec-according-to-bom!<i> port</i>)&nbsp;-&gt;&nbsp;<i>boolean</i></tt><a name="node_idx_464"></a>
  402. </p>
  403. </ul><p>
  404. These procedures look at the byte-order-mark (also called the
  405. &quot;BOM&quot;, <tt>U+FEFF</tt>) at the
  406. beginning of a port and guess the appropriate text codec. This works
  407. only for UTF-16 (little-endian and big-endian) and UTF-8.
  408. <tt>Guess-port-text-codec-according-to-bom</tt> returns the text codec,
  409. or <tt>#f</tt> if it found no UTF-16 or UTF-8 BOM. Note that this
  410. actually reads from the port. If the guess does not succeed, it is
  411. probably a good idea to re-open the port.
  412. <tt>Set-port-text-codec-according-to-bom!</tt> calls
  413. <tt>guess-port-text-codec-according-to-bom</tt>, sets the port text
  414. codec to the result if successful and returns <tt>#t</tt>. If it is
  415. not successful, it returns <tt>#f</tt>. As with
  416. <tt>guess-port-text-codec-according-to-bom</tt>, this reads from the
  417. port, whether successful or not.</p>
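<p>
Because the BOM probe consumes input even when it fails, one way to follow the advice above is to re-open the file when no BOM is found. The helper name <tt>open-input-file-with-bom</tt> below is hypothetical, and the sketch assumes the <tt>text-codec-utils</tt> structure has been opened.
</p>
<pre class=verbatim>
; Assumes:  ,open text-codec-utils
; Hypothetical helper: open FILE-NAME and pick the codec from its BOM, if any.
(define (open-input-file-with-bom file-name)
  (let ((port (open-input-file file-name)))
    (if (set-port-text-codec-according-to-bom! port)
        port
        ; No UTF-16 or UTF-8 BOM: the probe has already read from the port,
        ; so discard it and start over with a fresh port and its default codec.
        (begin
          (close-input-port port)
          (open-input-file file-name)))))
</pre>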
  418. <p>
  419. </p>
  420. <a name="node_sec_6.6.3"></a>
  421. <h3><a href="manual-Z-H-2.html#node_toc_node_sec_6.6.3">6.6.3&nbsp;&nbsp;Creating text codecs</a></h3>
  422. <p></p>
  423. <ul>
  424. <li><p><tt>(make-text-codec<i> strings encode-proc decode-proc</i>)&nbsp;-&gt;&nbsp;<i>text-codec</i></tt><a name="node_idx_466"></a>
  425. </p>
  426. <li><p><tt>(text-codec-names<i> text-codec</i>)&nbsp;-&gt;&nbsp;<i>list of strings</i></tt><a name="node_idx_468"></a>
  427. </p>
  428. <li><p><tt>(text-codec-encode-char-proc<i> text-codec</i>)&nbsp;-&gt;&nbsp;<i> encode-proc</i></tt><a name="node_idx_470"></a>
  429. </p>
  430. <li><p><tt>(text-codec-decode-char-proc<i> text-codec</i>)&nbsp;-&gt;&nbsp;<i> decode-proc</i></tt><a name="node_idx_472"></a>
  431. </p>
  432. <li><p><tt>(define-text-codec <i>id</i> <i>name</i> <i>encode-proc</i> <i>decode-proc</i>)</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(syntax)<a name="node_idx_474"></a>
  433. </p>
  434. <li><p><tt>(define-text-codec <i>id</i> (<i>name</i> <tt>...</tt>) <i>encode-proc</i> <i>decode-proc</i>)</tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(syntax)<a name="node_idx_476"></a>
  435. </p>
  436. </ul><p>
  437. <tt>Make-text-codec</tt> constructs a text codec from a list of names,
  438. and an encode and a decode procedure. (See below on how to construct
  439. encode and decode procedures.) <tt>Text-codec-names</tt>,
  440. <tt>text-codec-encode-char-proc</tt>, and
  441. <tt>text-codec-decode-char-proc</tt> are the accessors for text codecs.
  442. <tt>Define-text-codec</tt> is shorthand for binding a global
  443. identifier to a text codec. Its first form is for codecs with only
  444. one name, the second for codecs with several names.</p>
  445. <p>
  446. Encoding and decoding procedures work as follows:
  447. </p>
  448. <ul>
  449. <li><p><tt>(<i>encode-proc</i><i> char buffer start count</i>)&nbsp;-&gt;&nbsp;<i>boolean maybe-count</i></tt>
  450. </p>
  451. <li><p><tt>(<i>decode-proc</i><i> buffer start count</i>)&nbsp;-&gt;&nbsp;<i>maybe-char count</i></tt>
  452. </p>
  453. </ul><p>
  454. An <i>encode-proc</i> consumes a character <i>char</i> to encode, a
  455. byte vector <i>buffer</i> to receive the encoding, an index <i>start</i>
  456. into the buffer, and a block size <i>count</i>. It is supposed to
  457. encode the character into the block at [<i>start</i>, <i>start +
  458. count</i>). If the encoding is successful, the procedure must
  459. return <tt>#t</tt> and the number of bytes needed by the encoding.
  460. If the character cannot be encoded at all, the procedure must return
  461. <tt>#f</tt> and <tt>#f</tt>. If the encoding is possible but the
  462. space is not sufficient, the procedure must return <tt>#f</tt> and the
  463. total number of bytes needed for the encoding.</p>
  464. <p>
  465. A <i>decode-proc</i> consumes a byte vector <i>buffer</i>, an index
  466. <i>start</i> into the buffer, and a block size <i>count</i>. It is
  467. supposed to decode the bytes at indices [<i>start</i>, <i>start
  468. + count</i>). If the decoding is successful, it must return
  469. the decoded character at the beginning of the block, and the number of
  470. bytes consumed. If the block cannot begin with or be a prefix of a
  471. valid encoding, the procedure must return <tt>#f</tt> and
  472. <tt>#f</tt>. If the block contains a true prefix of a valid
  473. encoding, the procedure must return <tt>#f</tt> and a total count of
  474. bytes (including the buffer) needed to complete the encoding. Note
  475. that this byte count is only a guess: the system will provide that
  476. many bytes, but the decoding procedures might still signal an
  477. incomplete encoding, causing the system to try to obtain more. </p>
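<p>
To make the protocol concrete, here is a minimal sketch of a single-byte codec that maps scalar values 0 through 255 to one byte each. It is an illustration only, not the definition of any codec shipped with Scheme&nbsp;48. The names <tt>encode-char/single-byte</tt>, <tt>decode-char/single-byte</tt>, <tt>single-byte-codec</tt>, and the codec name <code class=verbatim>&quot;x-single-byte&quot;</code> are hypothetical. The sketch assumes that the <tt>text-codecs</tt> structure is open, that <tt>byte-vector-ref</tt> and <tt>byte-vector-set!</tt> are available from the <tt>byte-vectors</tt> structure, and that the decoder is only called with a <i>count</i> of at least 1.
</p>
<pre class=verbatim>
; Assumes:  ,open text-codecs byte-vectors

; Returns two values: a success flag and a byte count (or #f).
(define (encode-char/single-byte char buffer start count)
  (let ((sv (char-&gt;integer char)))
    (cond ((&gt;= sv 256) (values #f #f))   ; cannot be encoded at all
          ((&lt; count 1) (values #f 1))    ; encodable, but 1 byte of space is needed
          (else
           (byte-vector-set! buffer start sv)
           (values #t 1)))))              ; success, 1 byte written

; Returns two values: the decoded character and the number of bytes consumed.
; Every byte value is a valid encoding here, so decoding never fails.
(define (decode-char/single-byte buffer start count)
  (values (integer-&gt;char (byte-vector-ref buffer start))
          1))

(define single-byte-codec
  (make-text-codec '(&quot;x-single-byte&quot;)
                   encode-char/single-byte
                   decode-char/single-byte))
</pre>
<p>
A multi-byte codec would additionally use the <tt>#f</tt>-plus-count return convention in its decoder to ask the system for more bytes when the block ends in the middle of an encoding.
</p>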
  478. <p>
  479. </p>
  480. <a name="node_sec_6.7"></a>
  481. <h2><a href="manual-Z-H-2.html#node_toc_node_sec_6.7">6.7&nbsp;&nbsp;Default encodings</a></h2>
  482. <p>The default encoding for new ports is UTF-8. For the default
  483. <tt>current-input-port</tt>, <tt>current-output-port</tt>, and
  484. <tt>current-error-port</tt>, Scheme&nbsp;48 consults the OS for encoding
  485. information.</p>
  486. <p>
  487. For Unix, it consults <tt>nl_langinfo(3)</tt>, which in turn consults
  488. the <tt>LC_</tt> environment variables. If the encoding is not defined
  489. that way, Scheme&nbsp;48 reverts to US-ASCII.</p>
  490. <p>
  491. Under Windows, Scheme&nbsp;48 uses Unicode I/O (using UTF-16) for the
  492. default ports connected to the console, and Latin-1 for default ports
  493. that are not.</p>
  494. <p>
  495. </p>
  497. <p></p>
  498. </div>
  499. </body>
  500. </html>