- Format for StarDict dictionary files
- ------------------------------------
- StarDict homepage: http://stardict.sourceforge.net
- {0}. Number and Byte-order Conventions
- When you record the numbers that identify sizes, offsets, etc., you
- should use 32-bit numbers, such as you might represent with a guint32.
- In order to make StarDict work on different platforms, these numbers
- must be in network byte order. You can ensure the correct byte order
- by using the g_htonl() function when creating dictionary files.
- Conversely, you should use g_ntohl() when reading dictionary files.
- Strings should be encoded in UTF-8.
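- The byte-order rule above can be sketched without glib. This is a minimal
- illustration, not StarDict's own code: put_be32() and get_be32() are
- hypothetical helper names, and the shifts give the same result as
- g_htonl()/g_ntohl() followed by a 4-byte copy.

```c
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order.
 * put_be32 is a hypothetical helper name, not part of StarDict. */
void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read a 32-bit value stored in network (big-endian) byte order. */
uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

- Working on individual bytes keeps the result independent of the host's
- byte order and alignment rules.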
- {1}. Files
- Every dictionary consists of three files:
- (1). somedict.ifo
- (2). somedict.idx or somedict.idx.gz
- (3). somedict.dict or somedict.dict.dz
- You can use gzip -9 to compress the .idx file. If the .idx file is not
- compressed, loading is fast and uses less memory. If it is compressed, the
- whole .idx file is loaded into memory, which makes querying fast.
- You can use dictzip to compress the .dict file.
- "dictzip" uses the same compression algorithm and file format as does gzip,
- but provides a table that can be used to randomly access compressed blocks
- in the file. The use of 50-64kB blocks for compression typically degrades
- compression by less than 10%, while maintaining acceptable random access
- capabilities for all data in the file. As an added benefit, files
- compressed with dictzip can be decompressed with gunzip.
- For more information about dictzip, see the DICT project:
- http://www.dict.org
- StarDict will search for the .ifo file, then open the .idx or
- .idx.gz file and the .dict.dz or .dict file that are in the same directory
- and have the same base name.
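- The "same directory, same base name" rule can be sketched as follows.
- sibling_path() is a hypothetical helper, not a StarDict function. It only
- builds a candidate file name from the .ifo path; checking which candidate
- exists, and in which order to try them, is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Replace the ".ifo" suffix of ifo_path with ext (for example ".idx.gz").
 * Returns 0 on success, -1 if ifo_path does not end in ".ifo" or if the
 * result does not fit into out. Hypothetical helper for illustration. */
int sibling_path(const char *ifo_path, const char *ext,
                 char *out, size_t outsz)
{
    size_t n = strlen(ifo_path);
    int written;

    if (n < 4 || strcmp(ifo_path + n - 4, ".ifo") != 0)
        return -1;
    written = snprintf(out, outsz, "%.*s%s", (int)(n - 4), ifo_path, ext);
    if (written < 0 || (size_t)written >= outsz)
        return -1;
    return 0;
}
```

- A caller would typically build the names for ".idx", ".idx.gz", ".dict.dz"
- and ".dict" and open whichever files are present.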
- {2}. The ".ifo" file's format.
- The .ifo file has the following format:
- StarDict's dict ifo file
- version=2.4.2
- [options]
- Note that the current "version" string must be "2.4.2". If it's not,
- then StarDict will refuse to read the file.
- [options]
- ---------
- In the example above, [options] expands to any of the following lines
- specifying information about the dictionary. Each option is a keyword
- followed by an equal sign, then the value of that option, then a
- newline. The options may appear in any order.
- Note that the dictionary must have at least a bookname, a wordcount and an
- idxfilesize, or the load will fail. All other information is optional. All
- strings should be encoded in UTF-8.
- Available options:
- bookname= // required
- wordcount= // required
- idxfilesize= // required
- author=
- email=
- website=
- description=
- date=
- sametypesequence= // very important.
- wordcount is the number of word entries in the .idx file; it must be correct.
- idxfilesize is the size (in bytes) of the .idx file. Even if the .idx file is
- compressed to a .idx.gz file, this entry must record the original .idx file's
- size, and it must be correct too. The .gz file does not provide its original
- size up front (gzip records it only in the trailer at the end of the file),
- but knowing the original size speeds up the extraction to memory, as you
- don't need to call realloc() many times.
- The "sametypesequence" option is described in further detail below.
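- As an illustration of the keyword=value layout, here is a minimal sketch of
- a reader for .ifo text that is already in memory. ifo_get() and
- ifo_is_loadable() are hypothetical names. The sketch assumes "\n" line
- endings and assumes the first line must match the header shown above
- exactly; StarDict's real loader may be stricter or more lenient.

```c
#include <stddef.h>
#include <string.h>

/* Copy the value of the line "key=value" from text into out.
 * Returns 1 if the key was found, 0 otherwise. Values that do not
 * fit are truncated. Hypothetical helper for illustration. */
int ifo_get(const char *text, const char *key, char *out, size_t outsz)
{
    size_t klen = strlen(key);
    const char *p = text;

    if (outsz == 0)
        return 0;
    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t linelen = eol ? (size_t)(eol - p) : strlen(p);

        if (linelen > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = linelen - klen - 1;
            if (vlen >= outsz)
                vlen = outsz - 1;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p += linelen;
        if (*p == '\n')
            p++;
    }
    return 0;
}

/* Returns 1 if the text has the header line, version 2.4.2 and the
 * three required options; 0 otherwise. */
int ifo_is_loadable(const char *text)
{
    static const char magic[] = "StarDict's dict ifo file\n";
    char v[64];

    if (strncmp(text, magic, sizeof magic - 1) != 0)
        return 0;
    if (!ifo_get(text, "version", v, sizeof v) || strcmp(v, "2.4.2") != 0)
        return 0;
    return ifo_get(text, "bookname", v, sizeof v) &&
           ifo_get(text, "wordcount", v, sizeof v) &&
           ifo_get(text, "idxfilesize", v, sizeof v);
}
```

- Because options may appear in any order, the lookup scans every line
- instead of relying on position.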
- ***
- sametypesequence
- You should first familiarize yourself with the .dict file format
- described in the next section so that you can understand what effect
- this option has on the .dict file.
- If the sametypesequence option is set, it tells StarDict that each
- word's data in the .dict file will have the same sequence of datatypes.
- In this case, we expect a .dict file that's been optimized in two
- ways: the type identifiers should be omitted, and the size marker for
- the last data entry of each word should be omitted.
- Let's consider some concrete examples of the sametypesequence option.
- Suppose that a dictionary records many .wav files, and so sets:
- sametypesequence=W
- In this case, each word's entry in the .dict file consists solely of a
- wav file. In the .dict file, you would leave out the 'W' character
- before each entry, and you would also omit the 32-bit integer at the
- front of each .wav entry that would normally give the entry's length.
- You can do this since the length is known from the information in the
- .idx file.
- As another example, suppose a dictionary contains phonetic information
- and a meaning for each word. The sametypesequence option for this
- dictionary would be:
- sametypesequence=tm
- Once again, you can omit the 't' and 'm' characters before each data
- entry in the .dict file. In addition, you should omit the terminating
- '\0' for the 'm' entry for each word in the .dict file, as the length
- of the meaning string can be inferred from the length of the phonetic
- string (still indicated by a terminating '\0') and the length of the
- entire word entry (listed in the .idx file).
- So for cases where the last data entry for each word normally requires
- a terminating '\0' character, you should omit this character in the
- .dict file. And for cases where the last data entry for each word
- normally requires an initial 32-bit number giving the length of the
- field (such as WAV and PNG entries), you must omit this number in the
- .dict file.
- Every dictionary should try to use the sametypesequence feature to
- save disk space.
- ***
- {3}. The ".idx" file's format.
- The .idx file is just a word list.
- The word list is a sorted list of word entries.
- Each entry in the word list contains three fields, one after the other:
- word_str; // a utf-8 string terminated by '\0'.
- word_data_offset; // word data's offset in .dict file
- word_data_size; // word data's total size in .dict file
- word_str gives the string representing this word. It's the string
- that is "looked up" by StarDict.
- word_data_offset and word_data_size should both be 32-bit numbers in
- network byte order.
- No two entries should have the same "word_str". In other words,
- (strcmp(s1, s2) != 0).
- The length of "word_str" should be less than 256. In other words,
- (strlen(word) < 256).
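- The entry layout above can be read with a few lines of C. This is a sketch
- that assumes the whole (uncompressed) .idx file is already in memory;
- struct idx_entry, idx_next() and get_be32() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct idx_entry {
    const char *word;   /* points into the buffer, '\0'-terminated */
    uint32_t offset;    /* word_data_offset */
    uint32_t size;      /* word_data_size */
};

/* Read a 32-bit number stored in network byte order. */
uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Parse the entry starting at buf[pos]. Returns the position of the
 * next entry, or 0 if there is no complete, valid entry at pos. */
size_t idx_next(const unsigned char *buf, size_t len, size_t pos,
                struct idx_entry *e)
{
    const unsigned char *nul;
    size_t wlen;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    if (nul == NULL)
        return 0;
    wlen = (size_t)(nul - (buf + pos));
    /* word_str must be shorter than 256 bytes and be followed by
     * two 32-bit numbers (8 bytes). */
    if (wlen >= 256 || len - pos - wlen - 1 < 8)
        return 0;
    e->word = (const char *)(buf + pos);
    e->offset = get_be32(nul + 1);
    e->size = get_be32(nul + 5);
    return pos + wlen + 1 + 8;
}
```

- Calling idx_next() in a loop, starting at position 0 and stopping when it
- returns 0, visits every entry; the number of entries visited should equal
- wordcount from the .ifo file.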
- The word list must be sorted by calling stardict_strcmp() on the "word_str"
- fields. If the word list order is wrong, StarDict will fail to function
- correctly!
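- ============
- #include <string.h>
- #include <glib.h>
- gint stardict_strcmp(const gchar *s1, const gchar *s2)
- {
-     gint a;
-     /* Primary key: ASCII case-insensitive order. */
-     a = g_ascii_strcasecmp(s1, s2);
-     if (a == 0)
-         /* Tie-break on the exact bytes so that the order is total. */
-         return strcmp(s1, s2);
-     else
-         return a;
- }
- ============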
- g_ascii_strcasecmp() is a glib function:
- Unlike the BSD strcasecmp() function, this only recognizes standard
- ASCII letters and ignores the locale, treating all non-ASCII characters
- as if they are not letters.
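- For tools that do not link against glib, the same ordering can be
- approximated as below. This is a sketch written from the description
- above: ascii_strcasecmp() is a hypothetical stand-in for
- g_ascii_strcasecmp() that lowercases only A-Z and compares the remaining
- bytes as unsigned values. The names are illustrative.

```c
#include <string.h>

/* Lowercase ASCII letters only; every other byte is left unchanged. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Hypothetical stand-in for g_ascii_strcasecmp(). */
int ascii_strcasecmp(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s2 != '\0') {
        int c1 = ascii_lower((unsigned char)*s1);
        int c2 = ascii_lower((unsigned char)*s2);
        if (c1 != c2)
            return c1 - c2;
        s1++;
        s2++;
    }
    return (int)(unsigned char)*s1 - (int)(unsigned char)*s2;
}

/* Same two-step comparison as stardict_strcmp(), without glib. */
int stardict_strcmp_sketch(const char *s1, const char *s2)
{
    int a = ascii_strcasecmp(s1, s2);
    return a != 0 ? a : strcmp(s1, s2);
}
```

- With this order "apple" sorts before "Banana" (case is ignored first),
- and "Apple" sorts before "apple" (the exact bytes break the tie).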
- stardict_strcmp() works fine with English characters, but it does not
- sort characters from other languages well. There should be a _strcmp
- function which handles utf-8 string sorting better. If you know
- one, email me :)
- g_utf8_collate()? This is a locale-dependent function. So if you look
- up Chinese characters while in the Chinese locale, it works fine. But
- if you are in some other locale then the lookup will fail, as the
- order is not the same as in the Chinese locale (which was used when
- creating the dictionary).
- Converting with g_utf8_to_ucs4() and then comparing? This sounds like a
- good solution, but...
- The complete solution can be found in "Unicode Technical Standard #10: Unicode
- Collation Algorithm", http://www.unicode.org/reports/tr10/
- I hope glib will provide a locale-independent g_utf8_collate() soon.
- http://bugzilla.gnome.org/show_bug.cgi?id=112798
- {4}. The ".dict" file's format.
- The .dict file is a pure data sequence, as the offset and size of each
- word is recorded in the corresponding .idx file.
- If the "sametypesequence" option is not used in the .ifo file, then
- the .dict file has fields in the following order:
- ==============
- word_1_data_1_type; // a single char identifying the data type
- word_1_data_1_data; // the data
- word_1_data_2_type;
- word_1_data_2_data;
- ...... // the number of data entries for each word is determined by
- // word_data_size in .idx file
- word_2_data_1_type;
- word_2_data_1_data;
- ......
- ==============
- It's important to note that each field in each word indicates its
- own length, as described below. The number of possible fields per
- word is also not fixed, and is determined by simply reading data until
- you've read word_data_size bytes for that word.
- Suppose the "sametypesequence" option is used in the .ifo file, and
- the option is set like this:
- sametypesequence=tm
- Then the .dict file will look like this:
- ==============
- word_1_data_1_data
- word_1_data_2_data
- word_2_data_1_data
- word_2_data_2_data
- ......
- ==============
- The first data entry for each word will have a terminating '\0', but
- the second entry will not have a terminating '\0'. The omissions of
- the type chars and of the last field's size information are the
- optimizations required by the "sametypesequence" option described
- above.
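- For the sametypesequence=tm case, splitting one word's data could look
- like this. It is a sketch only: split_tm() is a hypothetical name, and
- data/size are assumed to be the bytes selected by word_data_offset and
- word_data_size from the .idx file.

```c
#include <stddef.h>
#include <string.h>

/* Split one word's data for sametypesequence=tm.
 * The 't' field keeps its terminating '\0'; the 'm' field has none,
 * so its length is whatever remains of the word's data.
 * Returns 1 on success, 0 if the 't' terminator is missing. */
int split_tm(const unsigned char *data, size_t size,
             const char **t, const char **m, size_t *m_len)
{
    const unsigned char *nul = memchr(data, '\0', size);

    if (nul == NULL)
        return 0;
    *t = (const char *)data;
    *m = (const char *)(nul + 1);   /* NOT '\0'-terminated */
    *m_len = size - (size_t)(nul - data) - 1;
    return 1;
}
```

- The caller has to carry m_len along with m, because the meaning string
- runs directly into the next word's data.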
- Type identifiers
- ----------------
- Here are the single-character type identifiers that may be used with
- the "sametypesequence" option in the .ifo file, or may appear in the
- .dict file itself if the "sametypesequence" option is not used.
- Lower-case characters signify that a field's size is determined by a
- terminating '\0', while upper-case characters indicate that the data
- begins with a 32-bit integer that gives the length of the data field.
- 'm'
- Word's pure text meaning.
- The data should be a utf-8 string ending with '\0'.
- 'l'
- Word's pure text meaning.
- The data is NOT a utf-8 string, but is instead a string in locale
- encoding, ending with '\0'. Sometimes using this type will save disk
- space, but its use is discouraged.
- 'g'
- A utf-8 string which is marked up with the Pango text markup language.
- For more information about this markup language, see the "Pango
- Reference Manual."
- You might have it installed locally at:
- file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html
- 't'
- English phonetic string.
- The data should be a utf-8 string ending with '\0'.
- Here are some utf-8 phonetic characters:
- θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ
- æɑɒʌәєŋvθðʃʒːɡˏˊˋ
- 'y'
- Chinese YinBiao.
- The data should be a utf-8 string ending with '\0'.
- 'W'
- wav file.
- The data begins with a network byte-ordered 32-bit integer to identify the
- wav file's size, immediately followed by the file's content.
- 'P'
- png file.
- The data begins with a network byte-ordered 32-bit integer to identify the
- png file's size, immediately followed by the file's content.
- 'X'
- This type identifier is reserved for experimental extensions.
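- When "sametypesequence" is not used, the lower-case/upper-case rule is
- enough to walk all fields of one word without knowing every type. The
- sketch below assumes exactly that rule; walk_fields(), field_cb and
- get_be32() are hypothetical names, and data/size are the bytes selected
- by the word's .idx entry.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Callback invoked once per field; len excludes any '\0' terminator. */
typedef void (*field_cb)(char type, const unsigned char *data,
                         size_t len, void *user);

/* Read a 32-bit number stored in network byte order. */
uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Walk the typed fields of one word's data.
 * Returns the number of fields, or -1 if the data is malformed. */
int walk_fields(const unsigned char *p, size_t size, field_cb cb, void *user)
{
    size_t pos = 0;
    int count = 0;

    while (pos < size) {
        char type = (char)p[pos++];
        size_t len;

        if (type >= 'a' && type <= 'z') {
            /* Lower case: the field ends at a terminating '\0'. */
            const unsigned char *nul = memchr(p + pos, '\0', size - pos);
            if (nul == NULL)
                return -1;
            len = (size_t)(nul - (p + pos));
            if (cb != NULL)
                cb(type, p + pos, len, user);
            pos += len + 1;
        } else if (type >= 'A' && type <= 'Z') {
            /* Upper case: a 32-bit length precedes the data. */
            if (size - pos < 4)
                return -1;
            len = get_be32(p + pos);
            pos += 4;
            if (len > size - pos)
                return -1;
            if (cb != NULL)
                cb(type, p + pos, len, user);
            pos += len;
        } else {
            return -1;
        }
        count++;
    }
    return count;
}
```

- The walk stops once word_data_size bytes have been consumed, which is how
- the number of fields per word is determined.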
- {5}. Tree Dictionary
- The tree dictionary support is used for information viewing, etc.
- A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
- and sometreedict.dict.dz.
- It is better to compress the .tdx file, as it is always loaded into memory.
- The .ifo file has the following format:
- StarDict's treedict ifo file
- version=2.4.2
- [options]
- Available options:
- bookname= // required
- tdxfilesize= // required
- wordcount=
- author=
- email=
- website=
- description=
- date=
- sametypesequence=
- wordcount is only shown as information in the dictionary management dialog,
- so it is not important in a tree dictionary.
- The .tdx file is just the word list.
- -----------
- The word list is a tree list of word entries.
- Each entry in the word list contains four fields, one after the other:
- word_str; // a utf-8 string terminated by '\0'.
- word_data_offset; // word data's offset in .dict file
- word_data_size; // word data's total size in .dict file. it can be 0.
- word_subentry_count; // how many sub-entries this entry has; 0 means none.
- Sub-entries immediately follow their parent entry. The resulting order is
- the same as that of a tree list with all its nodes expanded, read from top
- to bottom.
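- Because sub-entries directly follow their parent, a .tdx buffer can be
- walked recursively. The sketch below makes two assumptions that are not
- stated above: that word_subentry_count is a 32-bit number in network byte
- order like the other two numbers, and that the (uncompressed) .tdx data is
- already in memory. tdx_count() and get_be32() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit number stored in network byte order. */
uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Visit the entry at *pos and all of its sub-entries.
 * On success returns the number of entries visited, advances *pos past
 * the subtree and raises *max_depth if needed. Returns -1 on malformed
 * data. Recursion depth follows the depth of the tree in the file. */
int tdx_count(const unsigned char *buf, size_t len, size_t *pos,
              int depth, int *max_depth)
{
    const unsigned char *nul;
    uint32_t children, i;
    int total = 1;

    if (*pos >= len)
        return -1;
    nul = memchr(buf + *pos, '\0', len - *pos);
    /* After the '\0' come offset, size and sub-entry count (12 bytes). */
    if (nul == NULL || (size_t)(buf + len - nul) < 1 + 12)
        return -1;
    children = get_be32(nul + 9);
    *pos = (size_t)(nul - buf) + 1 + 12;
    if (depth > *max_depth)
        *max_depth = depth;
    for (i = 0; i < children; i++) {
        int n = tdx_count(buf, len, pos, depth + 1, max_depth);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```

- The function handles one entry and its subtree; if the file holds several
- top-level entries, the caller repeats the call until *pos reaches the end
- of the buffer.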
- The .dict file's format is the same as the normal dictionary.
- {6}. More information.
- You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
- "src/tools/*.cpp" for more information.
- If you have any questions, email me. :)
- Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
- English.
- Hu Zheng <huzheng_001@163.com>
- http://forlinux.yeah.net
- 2003.11.11