Please retain the author information when reposting:
Author: 88250
Blog: http://blog.csdn.net/DL88250
MSN & Gmail & QQ: DL88250@gmail.com

The following is StarDict's dictionary file format specification:

Format for StarDict dictionary files
------------------------------------
StarDict homepage: http://stardict.sourceforge.net
StarDict on-line dictionary: http://www.stardict.org

{0}. Number and Byte-order Conventions
When you record the numbers that identify sizes, offsets, etc., you
should use 32-bit numbers, such as you might represent with a guint32.
In order to make StarDict work on different platforms, these numbers
must be in network byte order. You can ensure the correct byte order
by using the g_htonl() function when creating dictionary files.
Conversely, you should use g_ntohl() when reading dictionary files.

Strings should be encoded in UTF-8.
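
The byte-order rule can be illustrated without GLib. The sketch below is only
an illustration: put_be32() and get_be32() are hypothetical helper names, not
StarDict functions, and they do by hand what g_htonl() and g_ntohl() achieve
for a 32-bit value stored in a file.

```c
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order.
 * Hypothetical helper, not part of StarDict or GLib. */
static void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read a 32-bit value stored in network (big-endian) byte order. */
static uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Working on individual bytes gives the same result on little-endian and
big-endian hosts, so no host byte-order check is needed.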

{1}. Files
Every dictionary consists of these files:
(1). somedict.ifo
(2). somedict.idx or somedict.idx.gz
(3). somedict.dict or somedict.dict.dz
(4). somedict.syn (optional)

You can use gzip -9 to compress the .idx file. If the .idx file is not
compressed, loading is fast and saves memory in use. If it is compressed,
the whole .idx file is loaded into memory, which makes querying faster.

You can use dictzip to compress the .dict file.
"dictzip" uses the same compression algorithm and file format as does gzip,
but provides a table that can be used to randomly access compressed blocks
in the file. The use of 50-64kB blocks for compression typically degrades
compression by less than 10%, while maintaining acceptable random access
capabilities for all data in the file. As an added benefit, files
compressed with dictzip can be decompressed with gunzip.
For more information about dictzip, refer to the DICT project:
http://www.dict.org

When you create a dictionary, you should normally use .idx and .dict.dz.

StarDict will search for the .ifo file, then open the .idx or
.idx.gz file and the .dict.dz or .dict file that is in the same directory and
has the same base name.

{2}. The ".ifo" file's format.
The .ifo file has the following format:

StarDict's dict ifo file
version=2.4.2
[options]

Note that the current "version" string must be "2.4.2" or "3.0.0". If it's not,
then StarDict will refuse to read the file.
If version is "3.0.0", StarDict will parse the "idxoffsetbits" option.

[options]
---------
In the example above, [options] expands to any of the following lines
specifying information about the dictionary. Each option is a keyword
followed by an equal sign, then the value of that option, then a
newline. The options may appear in any order.

Note that the dictionary must have at least a bookname, a wordcount and an
idxfilesize, or the load will fail. All other information is optional. All
strings should be encoded in UTF-8.

Available options:

bookname=      // required
wordcount=     // required
synwordcount=  // required if ".syn" file exists.
idxfilesize=   // required
idxoffsetbits= // New in 3.0.0
author=
email=
website=
description=   // You can use <br> for new line.
date=
sametypesequence= // very important.
dicttype=

wordcount is the count of word entries in the .idx file. It must be right.

idxfilesize is the size (in bytes) of the .idx file. Even if the .idx file is
compressed to a .idx.gz file, this entry must record the original .idx file's
size, and it must be right too. The .gz file doesn't contain its original size
information, but knowing the original size can speed up the extraction to
memory, as you don't need to call realloc() many times.

idxoffsetbits can be 64 or 32. If "idxoffsetbits=64", the offset field of the
.idx file will be 64 bits.

dicttype is used by some special dictionary plugins, such as wordnet. Its value
can be "wordnet" presently.

The "sametypesequence" option is described in further detail below.
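
As a rough illustration of the "keyword, equal sign, value, newline" layout
described above, the following sketch splits one option line in place. The
function name ifo_split_option() is hypothetical, and the handling of a
trailing carriage return is an assumption, not something the format requires.

```c
#include <string.h>

/* Split one "key=value" line from a .ifo file in place.
 * Hypothetical helper: returns 0 on success, -1 if the line has no '='
 * (for example the "StarDict's dict ifo file" header line). */
static int ifo_split_option(char *line, char **key, char **value)
{
    char *eq = strchr(line, '=');
    if (eq == NULL)
        return -1;
    *eq = '\0';          /* terminate the keyword */
    *key = line;
    *value = eq + 1;
    /* Drop the trailing newline (and a carriage return, if present). */
    (*value)[strcspn(*value, "\r\n")] = '\0';
    return 0;
}
```

A loader would call this for each line after the header, then check that
bookname, wordcount and idxfilesize were all seen before accepting the file.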

***
sametypesequence

You should first familiarize yourself with the .dict file format
described in section 7 below so that you can understand what effect
this option has on the .dict file.

If the sametypesequence option is set, it tells StarDict that each
word's data in the .dict file will have the same sequence of datatypes.
In this case, we expect a .dict file that's been optimized in two
ways: the type identifiers should be omitted, and the size marker for
the last data entry of each word should be omitted.

Let's consider some concrete examples of the sametypesequence option.

Suppose that a dictionary records many .wav files, and so sets:
        sametypesequence=W
In this case, each word's entry in the .dict file consists solely of a
wav file. In the .dict file, you would leave out the 'W' character
before each entry, and you would also omit the 32-bit integer at the
front of each .wav entry that would normally give the entry's length.
You can do this since the length is known from the information in the
.idx file.

As another example, suppose a dictionary contains phonetic information
and a meaning for each word. The sametypesequence option for this
dictionary would be:
        sametypesequence=tm
Once again, you can omit the 't' and 'm' characters before each data
entry in the .dict file. In addition, you should omit the terminating
'\0' for the 'm' entry for each word in the .dict file, as the length
of the meaning string can be inferred from the length of the phonetic
string (still indicated by a terminating '\0') and the length of the
entire word entry (listed in the .idx file).

So for cases where the last data entry for each word normally requires
a terminating '\0' character, you should omit this character in the
.dict file. And for cases where the last data entry for each word
normally requires an initial 32-bit number giving the length of the
field (such as WAV and PNG entries), you must omit this number in the
dictionary.

Every dictionary should try to use the sametypesequence feature to
save disk space.
***

{3}. The ".idx" file's format.
The .idx file is just a word list.

The word list is a sorted list of word entries.

Each entry in the word list contains three fields, one after the other:
     word_str;          // a utf-8 string terminated by '\0'.
     word_data_offset;  // word data's offset in .dict file
     word_data_size;    // word data's total size in .dict file

word_str gives the string representing this word. It's the string
that is "looked up" by StarDict.

Two or more entries may have the same "word_str" with different
word_data_offset and word_data_size. This may be useful for some
dictionaries. But this feature is only well supported by
StarDict-2.4.8 and newer.

The length of "word_str" should be less than 256. In other words,
(strlen(word) < 256).

If the version is "3.0.0" and "idxoffsetbits=64", word_data_offset will
be a 64-bit unsigned number in network byte order. Otherwise it will be
32 bits.
word_data_size should be a 32-bit unsigned number in network byte order.

It is possible for different word_str values to have the same
word_data_offset and word_data_size, so that multiple index entries point
to the same definition. But this is not recommended. When multiple words
share the same definition, you may create a ".syn" file for them, see
section 4 below.

The word list must be sorted by calling stardict_strcmp() on the "word_str"
fields. If the word list order is wrong, StarDict will fail to function
correctly!

============
gint stardict_strcmp(const gchar *s1, const gchar *s2)
{
	gint a;
	a = g_ascii_strcasecmp(s1, s2);
	if (a == 0)
		return strcmp(s1, s2);
	else
		return a;
}
============
g_ascii_strcasecmp() is a glib function:
Unlike the BSD strcasecmp() function, this only recognizes standard
ASCII letters and ignores the locale, treating all non-ASCII characters
as if they are not letters.

stardict_strcmp() works fine with English characters, but the sorting of
other locales' characters is not so good. In this case, you can enable
the collation feature, see section 6.
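
To make the entry layout of the .idx file concrete, here is a minimal sketch
that decodes one entry from a buffer already loaded into memory. The names
struct idx_entry, be32() and idx_parse_entry() are hypothetical and are not
taken from the StarDict sources. The sketch assumes the caller passes 32 or
64 for offset_bits according to the idxoffsetbits option.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One decoded .idx entry (hypothetical type for this sketch). */
struct idx_entry {
    const char *word;   /* word_str, points into the buffer */
    uint64_t offset;    /* word_data_offset */
    uint32_t size;      /* word_data_size */
};

/* Read a 32-bit number stored in network byte order. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Parse the entry starting at buf[pos]; offset_bits is 32 or 64.
 * Returns the position of the next entry, or 0 on malformed input. */
static size_t idx_parse_entry(const unsigned char *buf, size_t len, size_t pos,
                              int offset_bits, struct idx_entry *out)
{
    const unsigned char *nul;
    size_t off_bytes = (offset_bits == 64) ? 8 : 4;
    size_t p;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    if (nul == NULL)
        return 0;                       /* word_str is not terminated */
    p = (size_t)(nul - buf) + 1;        /* first byte after the '\0' */
    if (len - p < off_bytes + 4)
        return 0;                       /* truncated numeric fields */
    out->word = (const char *)(buf + pos);
    if (offset_bits == 64)
        out->offset = ((uint64_t)be32(buf + p) << 32) | be32(buf + p + 4);
    else
        out->offset = be32(buf + p);
    out->size = be32(buf + p + off_bytes);
    return p + off_bytes + 4;
}
```

Calling idx_parse_entry() repeatedly, feeding each returned position back in
as pos, walks the whole word list. The number of successful calls should then
equal wordcount from the .ifo file.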

{4}. The ".syn" file's format.
This file is optional. Note that tree dictionaries do not need this file.
Only StarDict-2.4.8 and newer support this file.

The .syn file contains information about synonyms. This means that when you
input a synonym, StarDict will look up the word that it relates to.

The format is simple. Each item contains one string and a number.
     synonym_word;         // a utf-8 string terminated by '\0'.
     original_word_index;  // original word's index in .idx file.
Other items then follow without any separator.
When you input synonym_word, StarDict will search for original_word.

The length of "synonym_word" should be less than 256. In other
words, (strlen(word) < 256).
original_word_index is a 32-bit unsigned number in network byte order.
Two or more items may have the same "synonym_word" with different
original_word_index.
The items must be sorted by stardict_strcmp() on synonym_word.

{5}. The offset cache file's format.
StarDict-2.4.8 started to support cache files. This feature can speed up
loading and save memory, as StarDict can mmap() the cache file. The cache
file names are .idx.oft and .syn.oft. The format is:
First a utf-8 string terminated by '\0', then many 32-bit numbers as
the word offset index. This index is sparse, with "ENTR_PER_PAGE=32".
The numbers are not stored in network byte order.

The string must begin with:
=====
StarDict's oft file
version=2.4.8
=====
Then a line like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
This line should have an ending '\n'.

StarDict will first try to create the .oft file in the same directory as
the .ifo file. If that fails, it will try to create it in
~/.cache/stardict/, where ~/.cache is obtained from g_get_user_cache_dir().
If two or more dictionaries have the same file name, StarDict will
create somedict.idx.oft, somedict(2).idx.oft, somedict(3).idx.oft,
etc. for them respectively, each with a different "url=" in the
beginning string.

{6}. The collation file's format.
StarDict-2.4.8 started to support collation, which sorts the word
list with a collate function. It will create collation files
named .idx.clt and .syn.clt. The format is a little like the offset
cache file:
First a utf-8 string terminated by '\0', then many 32-bit numbers as
the index sorted by the collate function. The numbers are not stored
in network byte order.

The string must begin with:
=====
StarDict's clt file
version=2.4.8
=====
Then two lines like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
func=0
The second line should have an ending '\n' too.

StarDict currently supports these collate functions:
typedef enum {
	UTF8_GENERAL_CI = 0,
	UTF8_UNICODE_CI,
	UTF8_BIN,
	UTF8_CZECH_CI,
	UTF8_DANISH_CI,
	UTF8_ESPERANTO_CI,
	UTF8_ESTONIAN_CI,
	UTF8_HUNGARIAN_CI,
	UTF8_ICELANDIC_CI,
	UTF8_LATVIAN_CI,
	UTF8_LITHUANIAN_CI,
	UTF8_PERSIAN_CI,
	UTF8_POLISH_CI,
	UTF8_ROMAN_CI,
	UTF8_ROMANIAN_CI,
	UTF8_SLOVAK_CI,
	UTF8_SLOVENIAN_CI,
	UTF8_SPANISH_CI,
	UTF8_SPANISH2_CI,
	UTF8_SWEDISH_CI,
	UTF8_TURKISH_CI,
	COLLATE_FUNC_NUMS
} CollateFunctions;
These UTF8_*_CI functions in fact come from MySQL.

The file is located in the same way as the .oft file.

Note that for a "somedict.idx.gz" file, the corresponding collation
file is somedict.idx.clt, not somedict.idx.gz.clt, and the
"url=" is somedict.idx, not somedict.idx.gz. So after you gzip
the .idx file, StarDict does not need to create the .clt file again.

{7}. The ".dict" file's format.
The .dict file is a pure data sequence, as the offset and size of each
word is recorded in the corresponding .idx file.

If the "sametypesequence" option is not used in the .ifo file, then
the .dict file has fields in the following order:
==============
word_1_data_1_type; // a single char identifying the data type
word_1_data_1_data; // the data
word_1_data_2_type;
word_1_data_2_data;
...... // the number of data entries for each word is determined by
       // word_data_size in .idx file
word_2_data_1_type;
word_2_data_1_data;
......
==============
It's important to note that each field in each word indicates its
own length, as described below. The number of possible fields per
word is also not fixed, and is determined by simply reading data until
you've read word_data_size bytes for that word.

Suppose the "sametypesequence" option is used in the .ifo file, and
the option is set like this:
sametypesequence=tm
Then the .dict file will look like this:
==============
word_1_data_1_data
word_1_data_2_data
word_2_data_1_data
word_2_data_2_data
......
==============
The first data entry for each word will have a terminating '\0', but
the second entry will not have a terminating '\0'. The omissions of
the type chars and of the last field's size information are the
optimizations required by the "sametypesequence" option described
above.
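
The "sametypesequence=tm" layout can be decoded as in the sketch below. It
assumes the caller has already located one word's data using
word_data_offset and word_data_size from the .idx file. The function name
split_tm() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Split one .dict entry written with sametypesequence=tm.
 * data and size describe the word's bytes as given by the .idx file.
 * Returns 0 on success, -1 if the phonetic string has no '\0'. */
static int split_tm(const char *data, size_t size,
                    const char **phonetic, size_t *phonetic_len,
                    const char **meaning, size_t *meaning_len)
{
    const char *nul = memchr(data, '\0', size);
    if (nul == NULL)
        return -1;
    *phonetic = data;                        /* 't': ends at the '\0' */
    *phonetic_len = (size_t)(nul - data);
    *meaning = nul + 1;                      /* 'm': no terminator stored */
    *meaning_len = size - *phonetic_len - 1; /* inferred from the total size */
    return 0;
}
```

Because the meaning is not '\0'-terminated inside the file, callers should
use meaning_len rather than strlen() on the returned pointer.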

If "idxoffsetbits=64", the .dict file may be bigger than 4 GB. StarDict
often needs to mmap this large file, and a process on a 32-bit computer
has a 4 GB virtual memory space limit, so the mapping would fail there.
An "idxoffsetbits=64" dictionary therefore can't be loaded on a 32-bit
machine. StarDict will simply print a warning in this case when loading.
64-bit computers do not have this limit.

Type identifiers
----------------
Here are the single-character type identifiers that may be used with
the "sametypesequence" option in the .ifo file, or may appear in the
.dict file itself if the "sametypesequence" option is not used.

Lower-case characters signify that a field's size is determined by a
terminating '\0', while upper-case characters indicate that the data
begins with a network byte-ordered guint32 that gives the length of
the following data (NOT the whole size, which is 4 bytes bigger).

'm'
Word's pure text meaning.
The data should be a utf-8 string ending with '\0'.

'l'
Word's pure text meaning.
The data is NOT a utf-8 string, but is instead a string in locale
encoding, ending with '\0'. Sometimes using this type will save disk
space, but its use is discouraged.

'g'
A utf-8 string which is marked up with the Pango text markup language.
For more information about this markup language, see the "Pango
Reference Manual."
You might have it installed locally at:
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html

't'
English phonetic string.
The data should be a utf-8 string ending with '\0'.

Here are some utf-8 phonetic characters:
θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑṃṇḷ
æɑɒʌәєŋvθðʃʒɚːɡˏˊˋ

'x'
A utf-8 string which is marked up with the xdxf language.
See http://xdxf.sourceforge.net
StarDict has these extensions:
<rref> can have a "type" attribute, which can be "image", "sound", "video"
and "attach".
<kref> can have a "k" attribute.

'y'
Chinese YinBiao or Japanese KANA.
The data should be a utf-8 string ending with '\0'.

'k'
KingSoft PowerWord's data. The data is a utf-8 string ending with '\0'.
It is in XML format.

'w'
MediaWiki markup language.
See http://meta.wikimedia.org/wiki/Help:Editing#The_wiki_markup

'h'
Html codes.

'n'
WordNet data.

'r'
Resource file list.
The content can be:
img:pic/example.jpg	// Image file
snd:apple.wav		// Sound file
vdo:film.avi		// Video file
att:file.bin		// Attachment file
More than one line is supported as a list of available files.
StarDict will find the files in the Resource Storage.
The image will be shown, and the sound file will have a play button.
You can "save as" the attachment file, and so on.

'W'
wav file.
The data begins with a network byte-ordered guint32 to identify the wav
file's size, immediately followed by the file's content.

'P'
Picture file.
The data begins with a network byte-ordered guint32 to identify the picture
file's size, immediately followed by the file's content.

'X'
This type identifier is reserved for experimental extensions.
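
The lower-case and upper-case rule can be sketched as a small walker over one
word's data for a dictionary that does not use sametypesequence. The function
name count_fields() is hypothetical. The sketch only applies the case rule
and does not check that each type character is one of those listed above.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the typed fields in one word's data (no sametypesequence).
 * Returns the number of fields, or -1 if the data is malformed. */
static int count_fields(const unsigned char *data, size_t size)
{
    size_t pos = 0;
    int n = 0;

    while (pos < size) {
        unsigned char type = data[pos++];
        if (isupper(type)) {
            /* Upper case: 32-bit big-endian length, then that many bytes. */
            uint32_t flen;
            if (size - pos < 4)
                return -1;
            flen = ((uint32_t)data[pos] << 24) |
                   ((uint32_t)data[pos + 1] << 16) |
                   ((uint32_t)data[pos + 2] << 8) |
                   (uint32_t)data[pos + 3];
            pos += 4;
            if (size - pos < flen)
                return -1;
            pos += flen;
        } else {
            /* Lower case: the field runs up to and including a '\0'. */
            const unsigned char *nul = memchr(data + pos, '\0', size - pos);
            if (nul == NULL)
                return -1;
            pos = (size_t)(nul - data) + 1;
        }
        n++;
    }
    return n;
}
```

The loop stops once exactly word_data_size bytes have been consumed, which is
how the number of fields per word is determined.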

{8}. Resource Storage
Resource Storage stores the external files named in the 'r' resource file
list, the images used in html code, and the images, media and other files
used in wiki tags.
It has two forms:
1. Direct directory and files in the "res" sub-directory.
2. The res.rifo, res.ridx and res.rdic database.
Direct files may have file name encoding problems, as Linux uses UTF-8 and
Windows uses the local encoding, so you had better use only ASCII file
names, or use the database to store UTF-8 file names.
The database may need to extract a file (such as a .wav file) to a temporary
file, so it is not as efficient as direct files. But the database has the
advantage of compression.
You can convert between the res directory and the res database with
the dir2resdatabase and resdatabase2dir tools.

StarDict will try to load the storage database first, then try the direct
files form.

The format of the res.rifo file:
StarDict's storage ifo file
version=3.0.0
filecount=     // required.
idxoffsetbits= // optional.

The format of the res.ridx file:
filename;  // A string ending with '\0'.
offset;    // 32 or 64 bits unsigned number in network byte order.
size;      // 32 bits unsigned number in network byte order.
filename can include a path too, such as "pic/example.png". filename is
case sensitive, and no two entries should have the same file name.
If "idxoffsetbits=64", then offset is 64 bits.
These three items are repeated for each entry.
The entries are sorted by the strcmp() function on the filename field.
It is possible for different filenames to have the same offset and size.

The format of the res.rdic file:
It is just the concatenation of all the resource files.
You can dictzip this file as res.rdic.dz

{9}. Tree Dictionary
The tree dictionary support is used for information viewing, etc.

A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
and sometreedict.dict.dz.

It is better to compress the .tdx file, as it is always loaded into memory.

The .ifo file has the following format:

StarDict's treedict ifo file
version=2.4.2
[options]

Available options:

bookname=      // required
tdxfilesize=   // required
wordcount=
author=
email=
website=
description=
date=
sametypesequence=

wordcount is only used for the info view in the dict manage dialog, so it is
not important in a tree dictionary.

The .tdx file is just the word list.
-----------
The word list is a tree list of word entries.

Each entry in the word list contains four fields, one after the other:
     word_str;            // a utf-8 string terminated by '\0'.
     word_data_offset;    // word data's offset in .dict file
     word_data_size;      // word data's total size in .dict file. it can be 0.
     word_subentry_count; // how many sub words this entry has, 0 means none.

Subentries immediately follow their parent entry. This makes the order the
same as that of a tree list with all its nodes expanded, read from top to
bottom.

word_data_offset, word_data_size and word_subentry_count should be 32-bit
unsigned numbers in network byte order.

The .dict file's format is the same as that of a normal dictionary.
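
Since subentries directly follow their parent, a .tdx buffer can be walked
with a simple recursive descent. The sketch below only counts entries. The
names be32() and tdx_walk() are hypothetical, and a real loader would
probably want to cap the recursion depth when reading untrusted files.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit number stored in network byte order. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Walk the entry at buf[pos] and all of its subentries, adding the number
 * of entries visited to *count. Returns the position just past the
 * subtree, or 0 on malformed input. */
static size_t tdx_walk(const unsigned char *buf, size_t len, size_t pos,
                       size_t *count)
{
    const unsigned char *nul;
    uint32_t children, i;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    if (nul == NULL)
        return 0;                        /* word_str is not terminated */
    pos = (size_t)(nul - buf) + 1;
    if (len - pos < 12)
        return 0;                        /* offset, size, subentry count */
    children = be32(buf + pos + 8);      /* word_subentry_count */
    pos += 12;
    (*count)++;
    for (i = 0; i < children; i++) {
        pos = tdx_walk(buf, len, pos, count);
        if (pos == 0)
            return 0;
    }
    return pos;
}
```

Starting with pos set to 0 and repeating the call until the returned position
equals the buffer length visits every top-level entry and its descendants.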

{10}. More information.
You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
"src/tools/*.cpp" for more information.

After you have built a dictionary, you can use "stardict_verify" to verify the
dictionary files. You can find it at "src/tools/".

If you have any questions, email me. :)

Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
English.

Hu Zheng <huzheng_001@163.com>
http://forlinux.yeah.net
2007.4.24
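
Based on this specification and with reference to the StarDict source code, I
implemented a program that reads the dictionary files.
The project was created with NetBeans 6. Please see the attachment :-)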