Format for StarDict dictionary files
------------------------------------

StarDict homepage: http://stardict.sourceforge.net

{0}. Number and Byte-order Conventions

When you record the numbers that identify sizes, offsets, etc., you
should use 32-bit numbers, such as you might represent with a guint32
(a glong is wider than 32 bits on some 64-bit platforms).
In order to make StarDict work on different platforms, these numbers
must be in network byte order. You can ensure the correct byte order
by using the g_htonl() function when creating dictionary files.
Conversely, you should use g_ntohl() when reading dictionary files.

Strings should be encoded in UTF-8.
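
The byte-order rule can be illustrated with a minimal sketch. It assumes a
POSIX system and uses htonl()/ntohl() from <arpa/inet.h> in place of the glib
g_htonl()/g_ntohl() wrappers; the helper names put_u32() and get_u32() are
made up for this example and are not part of StarDict.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl(), ntohl(): POSIX counterparts of g_htonl()/g_ntohl() */

/* Store a 32-bit value at buf in network (big-endian) byte order. */
void put_u32(unsigned char *buf, uint32_t value)
{
    uint32_t be = htonl(value);
    memcpy(buf, &be, sizeof be);
}

/* Read a 32-bit value stored in network byte order at buf. */
uint32_t get_u32(const unsigned char *buf)
{
    uint32_t be;
    memcpy(&be, buf, sizeof be);
    return ntohl(be);
}
```

Going through memcpy() avoids unaligned reads, which matters because the
numbers in the .idx file follow strings of arbitrary length.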

{1}. Files

Every dictionary consists of three files:
(1). somedict.ifo
(2). somedict.idx or somedict.idx.gz
(3). somedict.dict or somedict.dict.dz

You can use gzip -9 to compress the .idx file. If the .idx file is not
compressed, loading is fast and memory use stays low. If it is compressed,
StarDict loads the whole .idx file into memory, which makes querying fast.

You can use dictzip to compress the .dict file.
"dictzip" uses the same compression algorithm and file format as gzip,
but provides a table that can be used to randomly access compressed blocks
in the file. The use of 50-64kB blocks for compression typically degrades
compression by less than 10%, while maintaining acceptable random access
capabilities for all data in the file. As an added benefit, files
compressed with dictzip can be decompressed with gunzip.
For more information about dictzip, see the DICT project:
http://www.dict.org

StarDict will search for the .ifo file, then open the .idx or
.idx.gz file and the .dict.dz or .dict file that is in the same directory
and has the same base name.
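
As a rough sketch of the "same base name" rule, the function below derives a
companion file name from the path of an .ifo file. The name sibling_path() is
hypothetical, and the sketch does not decide which of the compressed or
uncompressed variants to try first, since this document does not state that.

```c
#include <stddef.h>
#include <string.h>

/* Replace the trailing ".ifo" of ifo_path with ext (for example ".idx.gz").
 * Returns 0 on success, -1 if ifo_path does not end in ".ifo" or if
 * out is too small to hold the result. */
int sibling_path(const char *ifo_path, const char *ext,
                 char *out, size_t out_size)
{
    size_t len = strlen(ifo_path);
    size_t base_len;

    if (len < 4 || strcmp(ifo_path + len - 4, ".ifo") != 0)
        return -1;
    base_len = len - 4;
    if (base_len + strlen(ext) + 1 > out_size)
        return -1;
    memcpy(out, ifo_path, base_len);
    strcpy(out + base_len, ext);
    return 0;
}
```

A loader could call this once per candidate extension (".idx", ".idx.gz",
".dict", ".dict.dz") and open whichever files exist.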

{2}. The ".ifo" file's format.

The .ifo file has the following format:

StarDict's dict ifo file
version=2.4.2
[options]

Note that the current "version" string must be "2.4.2". If it's not,
then StarDict will refuse to read the file.

[options]
---------

In the example above, [options] expands to any of the following lines
specifying information about the dictionary. Each option is a keyword
followed by an equal sign, then the value of that option, then a
newline. The options may appear in any order.

Note that the dictionary must have at least a bookname, a wordcount and an
idxfilesize, or the load will fail. All other information is optional. All
strings should be encoded in UTF-8.

Available options:

bookname=          // required
wordcount=         // required
idxfilesize=       // required
author=
email=
website=
description=
date=
sametypesequence=  // very important.

wordcount is the number of word entries in the .idx file. It must be right.

idxfilesize is the size (in bytes) of the .idx file. Even if the .idx file is
compressed to a .idx.gz file, this entry must record the original .idx file's
size, and it must be right too. The .gz file doesn't contain its original size
information, but knowing the original size can speed up the extraction to
memory, as you don't need to call realloc() many times.

The "sametypesequence" option is described in further detail below.
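
The key=value layout described above can be read with very little code. The
following is only a sketch, not StarDict's own loader: the function names
ifo_header_ok() and ifo_get() are invented here, the whole .ifo file is
assumed to be in memory as one '\0'-terminated string, and lines are assumed
to end in a plain '\n'.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if text starts with the magic line followed by version=2.4.2. */
int ifo_header_ok(const char *text)
{
    static const char magic[] = "StarDict's dict ifo file\nversion=2.4.2\n";
    return strncmp(text, magic, sizeof magic - 1) == 0;
}

/* Copy the value of the line "key=value" into out.
 * Returns 0 on success, -1 if the key is missing or out is too small. */
int ifo_get(const char *text, const char *key, char *out, size_t out_size)
{
    size_t key_len = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        const char *end = strchr(line, '\n');
        size_t line_len = end ? (size_t)(end - line) : strlen(line);

        /* Options may appear in any order, so every line is checked. */
        if (line_len > key_len && strncmp(line, key, key_len) == 0 &&
            line[key_len] == '=') {
            size_t val_len = line_len - key_len - 1;
            if (val_len + 1 > out_size)
                return -1;
            memcpy(out, line + key_len + 1, val_len);
            out[val_len] = '\0';
            return 0;
        }
        line = end ? end + 1 : NULL;
    }
    return -1;
}
```

A loader built on this sketch would reject the dictionary when
ifo_header_ok() returns 0 or when ifo_get() fails for bookname, wordcount
or idxfilesize.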

***
sametypesequence

You should first familiarize yourself with the .dict file format
described in the next section so that you can understand what effect
this option has on the .dict file.

If the sametypesequence option is set, it tells StarDict that each
word's data in the .dict file will have the same sequence of datatypes.
In this case, we expect a .dict file that's been optimized in two
ways: the type identifiers should be omitted, and the size marker for
the last data entry of each word should be omitted.

Let's consider some concrete examples of the sametypesequence option.

Suppose that a dictionary records many .wav files, and so sets:
sametypesequence=W
In this case, each word's entry in the .dict file consists solely of a
wav file. In the .dict file, you would leave out the 'W' character
before each entry, and you would also omit the 32-bit integer at the
front of each .wav entry that would normally give the entry's length.
You can do this since the length is known from the information in the
.idx file.

As another example, suppose a dictionary contains phonetic information
and a meaning for each word. The sametypesequence option for this
dictionary would be:
sametypesequence=tm
Once again, you can omit the 't' and 'm' characters before each data
entry in the .dict file. In addition, you should omit the terminating
'\0' for the 'm' entry for each word in the .dict file, as the length
of the meaning string can be inferred from the length of the phonetic
string (still indicated by a terminating '\0') and the length of the
entire word entry (listed in the .idx file).

So for cases where the last data entry for each word normally requires
a terminating '\0' character, you should omit this character in the
.dict file. And for cases where the last data entry for each word
normally requires an initial 32-bit number giving the length of the
field (such as WAV and PNG entries), you must omit this number in the
dictionary.

Every dictionary should try to use the sametypesequence feature to
save disk space.
***
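
To make the sametypesequence=tm case concrete, here is a small sketch that
splits one word's data into its phonetic string and its meaning. The function
name split_tm() is hypothetical; data and size stand for the bytes found at
word_data_offset with length word_data_size.

```c
#include <stddef.h>
#include <string.h>

/* Split one word's data for sametypesequence=tm.
 * On success, *t points at the '\0'-terminated phonetic string, and
 * *m / *m_len describe the meaning, which is NOT '\0'-terminated:
 * its length is whatever remains of the entry.
 * Returns 0 on success, -1 if no '\0' is found inside the entry. */
int split_tm(const char *data, size_t size,
             const char **t, const char **m, size_t *m_len)
{
    const char *nul = memchr(data, '\0', size);

    if (nul == NULL)
        return -1;
    *t = data;
    *m = nul + 1;
    *m_len = size - (size_t)(nul + 1 - data);
    return 0;
}
```

Because the meaning has no terminator, a caller has to copy m_len bytes and
append its own '\0' before treating the result as a C string.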

{3}. The ".idx" file's format.

The .idx file is just a word list.

The word list is a sorted list of word entries.

Each entry in the word list contains three fields, one after the other:
word_str;          // a utf-8 string terminated by '\0'.
word_data_offset;  // word data's offset in .dict file
word_data_size;    // word data's total size in .dict file

word_str gives the string representing this word. It's the string
that is "looked up" by StarDict.

word_data_offset and word_data_size should both be 32-bit numbers in
network byte order.

No two entries should have the same "word_str". In other words,
(strcmp(s1, s2) != 0).

The length of "word_str" should be less than 256. In other words,
(strlen(word) < 256).
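
The entry layout above can be decoded as in the following sketch. It assumes
the uncompressed .idx contents are already in memory; struct idx_entry,
get_u32() and idx_parse() are names chosen for this example only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct idx_entry {
    const char *word;  /* word_str: points into the buffer, '\0'-terminated */
    uint32_t offset;   /* word_data_offset */
    uint32_t size;     /* word_data_size */
};

/* Decode a 32-bit number stored in network (big-endian) byte order. */
uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Parse the entry that starts at buf[pos].
 * Returns the position of the next entry, or 0 if the entry is
 * truncated or word_str is 256 bytes or longer. */
size_t idx_parse(const unsigned char *buf, size_t len, size_t pos,
                 struct idx_entry *out)
{
    const unsigned char *nul;
    size_t word_len;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    if (nul == NULL)
        return 0;
    word_len = (size_t)(nul - (buf + pos));
    /* Two 32-bit numbers (8 bytes) must follow the terminating '\0'. */
    if (word_len >= 256 || len - pos - word_len - 1 < 8)
        return 0;
    out->word = (const char *)(buf + pos);
    out->offset = get_u32(nul + 1);
    out->size = get_u32(nul + 5);
    return pos + word_len + 1 + 8;
}
```

Calling idx_parse() repeatedly, starting at position 0 and feeding each
return value back in, should yield exactly wordcount entries for a
well-formed file.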

The word list must be sorted by calling stardict_strcmp() on the "word_str"
fields. If the word list order is wrong, StarDict will fail to function
correctly!

============
#include <glib.h>
#include <string.h>

/* Compare case-insensitively (ASCII letters only) first; break ties
 * with strcmp() so that two different strings never compare equal. */
gint stardict_strcmp(const gchar *s1, const gchar *s2)
{
	gint a;
	a = g_ascii_strcasecmp(s1, s2);
	if (a == 0)
		return strcmp(s1, s2);
	else
		return a;
}
============

g_ascii_strcasecmp() is a glib function:
Unlike the BSD strcasecmp() function, this only recognizes standard
ASCII letters and ignores the locale, treating all non-ASCII characters
as if they are not letters.
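
For tools that cannot link against glib, the same ordering can be
approximated in plain C. This is a sketch under the assumption that folding
only the bytes 'A'..'Z' matches what g_ascii_strcasecmp() does; the names
ascii_strcasecmp() and stardict_strcmp_sketch() are made up here.

```c
#include <string.h>

/* Lower-case ASCII letters only; every other byte is returned unchanged. */
static int ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Case-insensitive comparison that ignores the locale. */
int ascii_strcasecmp(const char *s1, const char *s2)
{
    while (*s1 && *s2) {
        int c1 = ascii_lower((unsigned char)*s1);
        int c2 = ascii_lower((unsigned char)*s2);
        if (c1 != c2)
            return c1 - c2;
        s1++;
        s2++;
    }
    return (int)(unsigned char)*s1 - (int)(unsigned char)*s2;
}

/* Same two-step comparison as stardict_strcmp(), without glib. */
int stardict_strcmp_sketch(const char *s1, const char *s2)
{
    int a = ascii_strcasecmp(s1, s2);
    return a != 0 ? a : strcmp(s1, s2);
}
```

With this ordering "Apple" sorts before "apple" (the tie is broken by
strcmp()), and both sort before "Banana".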

stardict_strcmp() works fine with English characters, but the sorting of
characters from other locales is not so good. There should be a _strcmp
function which handles utf-8 string sorting better. If you know
one, email me :)

g_utf8_collate()? This is a locale-dependent function. So if you look
up Chinese characters while in the Chinese locale, it works fine. But
if you are in some other locale then the lookup will fail, as the
order is not the same as in the Chinese locale (which was used when
creating the dictionary).

g_utf8_to_ucs4() and then compare? This sounds like a good solution, but..

The complete solution can be found in "Unicode Technical Standard #10: Unicode
Collation Algorithm", http://www.unicode.org/reports/tr10/

I hope glib will provide a locale-independent g_utf8_collate() soon.
http://bugzilla.gnome.org/show_bug.cgi?id=112798

{4}. The ".dict" file's format.

The .dict file is a pure data sequence, as the offset and size of each
word is recorded in the corresponding .idx file.

If the "sametypesequence" option is not used in the .ifo file, then
the .dict file has fields in the following order:
==============
word_1_data_1_type; // a single char identifying the data type
word_1_data_1_data; // the data
word_1_data_2_type;
word_1_data_2_data;
...... // the number of data entries for each word is determined by
       // word_data_size in .idx file
word_2_data_1_type;
word_2_data_1_data;
......
==============
It's important to note that each field in each word indicates its
own length, as described below. The number of possible fields per
word is also not fixed, and is determined by simply reading data until
you've read word_data_size bytes for that word.

Suppose the "sametypesequence" option is used in the .ifo file, and
the option is set like this:
sametypesequence=tm
Then the .dict file will look like this:
==============
word_1_data_1_data
word_1_data_2_data
word_2_data_1_data
word_2_data_2_data
......
==============
The first data entry for each word will have a terminating '\0', but
the second entry will not have a terminating '\0'. The omissions of
the type chars and of the last field's size information are the
optimizations required by the "sametypesequence" option described
above.

Type identifiers
----------------

Here are the single-character type identifiers that may be used with
the "sametypesequence" option in the .ifo file, or may appear in the
.dict file itself if the "sametypesequence" option is not used.

Lower-case characters signify that a field's size is determined by a
terminating '\0', while upper-case characters indicate that the data
begins with a 32-bit integer that gives the length of the data field.

'm'
Word's pure text meaning.
The data should be a utf-8 string ending with '\0'.

'l'
Word's pure text meaning.
The data is NOT a utf-8 string, but is instead a string in locale
encoding, ending with '\0'. Sometimes using this type will save disk
space, but its use is discouraged.

'g'
A utf-8 string which is marked up with the Pango text markup language.
For more information about this markup language, see the "Pango
Reference Manual."
You might have it installed locally at:
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html

't'
English phonetic string.
The data should be a utf-8 string ending with '\0'.
Here are some utf-8 phonetic characters:
θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ
æɑɒʌәєŋvθðʃʒːɡˏˊˋ

'y'
Chinese YinBiao.
The data should be a utf-8 string ending with '\0'.

'W'
wav file.
The data begins with a network byte-ordered 32-bit number (guint32) that
gives the wav file's size, immediately followed by the file's content.

'P'
png file.
The data begins with a network byte-ordered 32-bit number (guint32) that
gives the png file's size, immediately followed by the file's content.

'X'
This type identifier is reserved for experimental extensions.
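
When "sametypesequence" is not used, a reader has to step through the typed
fields of each word until word_data_size bytes are consumed. The sketch below
shows one way to do that; next_field() and get_u32() are hypothetical helper
names, and the only rule applied is the lower-case / upper-case convention
stated above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit number stored in network (big-endian) byte order. */
uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read the field that starts at data[*pos] and advance *pos past it.
 * Returns 1 if a field was read, 0 at the end of the word's data,
 * and -1 if the data is malformed. */
int next_field(const unsigned char *data, size_t size, size_t *pos,
               char *type, const unsigned char **field, size_t *len)
{
    size_t p = *pos;

    if (p >= size)
        return 0;
    *type = (char)data[p++];
    if (*type >= 'A' && *type <= 'Z') {
        /* Upper case: a 32-bit length comes first, then the payload. */
        if (size - p < 4)
            return -1;
        *len = get_u32(data + p);
        p += 4;
        if (size - p < *len)
            return -1;
        *field = data + p;
        p += *len;
    } else {
        /* Lower case: the payload runs up to a terminating '\0'. */
        const unsigned char *nul = memchr(data + p, '\0', size - p);
        if (nul == NULL)
            return -1;
        *field = data + p;
        *len = (size_t)(nul - (data + p));
        p += *len + 1;
    }
    *pos = p;
    return 1;
}
```

A caller starts with pos = 0 and loops while next_field() returns 1,
dispatching on the returned type character.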

{5}. Tree Dictionary

The tree dictionary support is used for information viewing, etc.

A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
and sometreedict.dict.dz.

It is better to compress the .tdx file, as it is always loaded into memory.

The .ifo file has the following format:

StarDict's treedict ifo file
version=2.4.2
[options]

Available options:

bookname=      // required
tdxfilesize=   // required
wordcount=
author=
email=
website=
description=
date=
sametypesequence=

wordcount is only used for the info view in the dict manage dialog, so it is
not important in a tree dictionary.

The .tdx file is just the word list.
-----------
The word list is a tree list of word entries.

Each entry in the word list contains four fields, one after the other:
word_str;            // a utf-8 string terminated by '\0'.
word_data_offset;    // word data's offset in .dict file
word_data_size;      // word data's total size in .dict file. It can be 0.
word_subentry_count; // how many subentries this entry has; 0 means none.

An entry's subentries immediately follow it. The resulting order is the
same as that of a tree list with all its nodes expanded, read from top to
bottom.

The .dict file's format is the same as the normal dictionary.
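
Because each entry is followed directly by its subentries, the .tdx list can
be walked recursively. The sketch below counts the entries in one subtree.
It assumes that word_subentry_count is, like the offset and size, a 32-bit
number in network byte order (this document does not say so explicitly), and
tdx_skip_subtree() and get_u32() are names invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit number stored in network (big-endian) byte order. */
uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Walk the entry at buf[pos] together with all of its descendants,
 * adding the number of entries seen to *visited.
 * Returns the position just past the subtree, or 0 on malformed input. */
size_t tdx_skip_subtree(const unsigned char *buf, size_t len, size_t pos,
                        size_t *visited)
{
    const unsigned char *nul;
    uint32_t children, i;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    /* offset, size and subentry count: 12 bytes after the '\0'. */
    if (nul == NULL || (size_t)(buf + len - nul) - 1 < 12)
        return 0;
    children = get_u32(nul + 9);
    pos = (size_t)(nul - buf) + 1 + 12;
    (*visited)++;
    for (i = 0; i < children; i++) {
        pos = tdx_skip_subtree(buf, len, pos, visited);
        if (pos == 0)
            return 0;
    }
    return pos;
}
```

The recursion depth equals the depth of the tree, so a loader that handles
untrusted files would probably want to cap it.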

{6}. More information.

You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
"src/tools/*.cpp" for more information.

If you have any questions, email me. :)

Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
English.

Hu Zheng <huzheng_001@163.com>
http://forlinux.yeah.net
2003.11.11