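Reading StarDict Dictionaries in Java [original by 88250]

Please keep the author information when reposting:

Author: 88250

Blog: http://blog.csdn.net/DL88250

MSN & Gmail & QQ: DL88250@gmail.com



The file below is the format specification for StarDict dictionary files: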

Format for StarDict dictionary files
------------------------------------

StarDict homepage: http://stardict.sourceforge.net
StarDict on-line dictionary: http://www.stardict.org

{0}. Number and Byte-order Conventions
When you record the numbers that identify sizes, offsets, etc., you
should use 32-bit numbers, such as you might represent with a glong.

In order to make StarDict work on different platforms, these numbers
must be in network byte order. You can ensure the correct byte order
by using the g_htonl() function when creating dictionary files.
Conversely, you should use g_ntohl() when reading dictionary files.

Strings should be encoded in UTF-8.
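
In Java these conventions map onto the standard library fairly directly: a ByteBuffer is big-endian (network byte order) by default, so no explicit counterpart to g_ntohl() is needed. The following is a minimal sketch, not code from the attached project; the class name NetworkOrder and its helper methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class NetworkOrder {
    /** Reads an unsigned 32-bit big-endian value at pos and widens it to a long. */
    static long readUInt32(byte[] data, int pos) {
        int raw = ByteBuffer.wrap(data, pos, 4).order(ByteOrder.BIG_ENDIAN).getInt();
        return Integer.toUnsignedLong(raw);
    }

    /** Decodes length bytes starting at pos as a UTF-8 string. */
    static String readUtf8(byte[] data, int pos, int length) {
        return new String(data, pos, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x00, 0x01, 0x02};
        System.out.println(readUInt32(sample, 0)); // prints 258
    }
}
```

Widening to long matters because Java's int is signed, while the sizes and offsets in the files are unsigned.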


{1}. Files
Every dictionary consists of these files:
(1). somedict.ifo
(2). somedict.idx or somedict.idx.gz
(3). somedict.dict or somedict.dict.dz
(4). somedict.syn (optional)

You can use gzip -9 to compress the .idx file. If the .idx file is not
compressed, loading is fast and memory is saved at run time; compressing
it makes StarDict load the whole .idx file into memory, which makes
querying faster.

You can use dictzip to compress the .dict file.
"dictzip" uses the same compression algorithm and file format as does gzip,
but provides a table that can be used to randomly access compressed blocks
in the file. The use of 50-64kB blocks for compression typically degrades
compression by less than 10%, while maintaining acceptable random access
capabilities for all data in the file. As an added benefit, files
compressed with dictzip can be decompressed with gunzip.
For more information about dictzip, refer to the DICT project:
http://www.dict.org

When you create a dictionary, you should normally use .idx and .dict.dz.

StarDict will search for the .ifo file, then open the .idx or
.idx.gz file and the .dict.dz or .dict file that is in the same directory
and has the same base name.
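
The lookup by base name can be sketched in Java as follows. This is only an illustration: the class DictFiles and its methods are hypothetical, and the order in which the candidates are tried (.idx before .idx.gz, .dict.dz before .dict) simply follows the wording above rather than any documented precedence.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class DictFiles {
    /** Replaces the ".ifo" suffix of the given path with another extension. */
    static Path sibling(Path ifo, String extension) {
        String name = ifo.getFileName().toString();
        if (!name.endsWith(".ifo")) {
            throw new IllegalArgumentException("not an .ifo file: " + name);
        }
        String base = name.substring(0, name.length() - ".ifo".length());
        return ifo.resolveSibling(base + extension);
    }

    /** Returns the first sibling that exists on disk, trying the extensions in order. */
    static Optional<Path> firstExisting(Path ifo, String... extensions) {
        for (String ext : extensions) {
            Path candidate = sibling(ifo, ext);
            if (Files.exists(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Path ifo = Paths.get(args.length > 0 ? args[0] : "somedict.ifo");
        System.out.println("index: " + firstExisting(ifo, ".idx", ".idx.gz"));
        System.out.println("data:  " + firstExisting(ifo, ".dict.dz", ".dict"));
        System.out.println("syn:   " + firstExisting(ifo, ".syn"));
    }
}
```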



{2}. The ".ifo" file's format.
The .ifo file has the following format:

StarDict's dict ifo file
version=2.4.2
[options]

Note that the current "version" string must be "2.4.2" or "3.0.0". If it's not,
then StarDict will refuse to read the file.
If version is "3.0.0", StarDict will parse the "idxoffsetbits" option.

[options]
---------
In the example above, [options] expands to any of the following lines
specifying information about the dictionary. Each option is a keyword
followed by an equal sign, then the value of that option, then a
newline. The options may appear in any order.

Note that the dictionary must have at least a bookname, a wordcount and an
idxfilesize, or the load will fail. All other information is optional. All
strings should be encoded in UTF-8.

Available options:

bookname=      // required
wordcount=     // required
synwordcount=  // required if ".syn" file exists.
idxfilesize=   // required
idxoffsetbits= // New in 3.0.0
author=
email=
website=
description=   // You can use <br> for new line.
date=
sametypesequence= // very important.
dicttype=
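
A hedged Java sketch of reading such a file, applying only the rules stated in this section (header line, accepted versions, required options). The class IfoParser and its methods are hypothetical names; passing the file content as a String is an assumption made to keep the example free of I/O.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IfoParser {
    /** Parses the text of an .ifo file into key/value options after basic validation. */
    static Map<String, String> parse(String text) {
        String[] lines = text.split("\r?\n");
        if (!lines[0].trim().equals("StarDict's dict ifo file")) {
            throw new IllegalArgumentException("missing .ifo header line");
        }
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            int eq = lines[i].indexOf('=');
            if (eq > 0) { // skip blank or malformed lines
                options.put(lines[i].substring(0, eq).trim(), lines[i].substring(eq + 1).trim());
            }
        }
        String version = options.get("version");
        if (!"2.4.2".equals(version) && !"3.0.0".equals(version)) {
            throw new IllegalArgumentException("unsupported version: " + version);
        }
        for (String required : new String[] {"bookname", "wordcount", "idxfilesize"}) {
            if (!options.containsKey(required)) {
                throw new IllegalArgumentException("missing required option: " + required);
            }
        }
        return options;
    }

    /** Width of word_data_offset in bits: 64 only for version 3.0.0 with idxoffsetbits=64. */
    static int offsetBits(Map<String, String> options) {
        boolean v3 = "3.0.0".equals(options.get("version"));
        return v3 && "64".equals(options.get("idxoffsetbits")) ? 64 : 32;
    }

    public static void main(String[] args) {
        String sample = "StarDict's dict ifo file\nversion=2.4.2\nbookname=Sample\n"
                + "wordcount=2\nidxfilesize=30\nsametypesequence=tm\n";
        Map<String, String> options = parse(sample);
        System.out.println(options.get("bookname") + " uses " + offsetBits(options) + "-bit offsets");
    }
}
```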


wordcount is the count of word entries in the .idx file; it must be right.

idxfilesize is the size (in bytes) of the .idx file. Even if the .idx is
compressed to a .idx.gz file, this entry must record the original .idx
file's size, and it must be right too. The .gz file doesn't contain its
original size information, but knowing the original size can speed up the
extraction to memory, as you don't need to call realloc() many times.

idxoffsetbits can be 64 or 32. If "idxoffsetbits=64", the offset field of the
.idx file will be 64 bits.

dicttype is used by some special dictionary plugins, such as wordnet. Its value
can be "wordnet" presently.

The "sametypesequence" option is described in further detail below.

***
sametypesequence

You should first familiarize yourself with the .dict file format
described in section 7 below so that you can understand what effect
this option has on the .dict file.

If the sametypesequence option is set, it tells StarDict that each
word's data in the .dict file will have the same sequence of datatypes.
In this case, we expect a .dict file that's been optimized in two
ways: the type identifiers should be omitted, and the size marker for
the last data entry of each word should be omitted.

Let's consider some concrete examples of the sametypesequence option.

Suppose that a dictionary records many .wav files, and so sets:
        sametypesequence=W
In this case, each word's entry in the .dict file consists solely of a
wav file. In the .dict file, you would leave out the 'W' character
before each entry, and you would also omit the 32-bit integer at the
front of each .wav entry that would normally give the entry's length.
You can do this since the length is known from the information in the
idx file.

As another example, suppose a dictionary contains phonetic information
and a meaning for each word. The sametypesequence option for this
dictionary would be:
        sametypesequence=tm
Once again, you can omit the 't' and 'm' characters before each data
entry in the .dict file. In addition, you should omit the terminating
'\0' for the 'm' entry for each word in the .dict file, as the length
of the meaning string can be inferred from the length of the phonetic
string (still indicated by a terminating '\0') and the length of the
entire word entry (listed in the .idx file).

So for cases where the last data entry for each word normally requires
a terminating '\0' character, you should omit this character in the
dict file. And for cases where the last data entry for each word
normally requires an initial 32-bit number giving the length of the
field (such as WAV and PNG entries), you must omit this number in the
dictionary.

Every dictionary should try to use the sametypesequence feature to
save disk space.
***
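
The "tm" case above can be illustrated with a short Java sketch. It assumes the caller has already cut out exactly word_data_size bytes for one word; the class TmEntry is a hypothetical name used only for this example.

```java
import java.nio.charset.StandardCharsets;

final class TmEntry {
    final String phonetic;
    final String meaning;

    TmEntry(String phonetic, String meaning) {
        this.phonetic = phonetic;
        this.meaning = meaning;
    }

    /** Splits one word's data written with sametypesequence=tm. */
    static TmEntry parse(byte[] wordData) {
        int nul = 0;
        while (nul < wordData.length && wordData[nul] != 0) {
            nul++; // the 't' field still ends with a NUL byte
        }
        if (nul == wordData.length) {
            throw new IllegalArgumentException("phonetic field has no terminating NUL");
        }
        String t = new String(wordData, 0, nul, StandardCharsets.UTF_8);
        // The 'm' field has no terminator: it runs to the end of the word's data.
        String m = new String(wordData, nul + 1, wordData.length - nul - 1, StandardCharsets.UTF_8);
        return new TmEntry(t, m);
    }

    public static void main(String[] args) {
        byte[] data = "h\u0259l\u0259u\0a greeting".getBytes(StandardCharsets.UTF_8);
        TmEntry e = parse(data);
        System.out.println("[" + e.phonetic + "] " + e.meaning);
    }
}
```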


{3}. The ".idx" file's format.
The .idx file is just a word list.

The word list is a sorted list of word entries.

Each entry in the word list contains three fields, one after the other:
     word_str;  // a utf-8 string terminated by '\0'.
     word_data_offset;  // word data's offset in .dict file
     word_data_size;  // word data's total size in .dict file

word_str gives the string representing this word. It's the string
that is "looked up" by StarDict.

Two or more entries may have the same "word_str" with different
word_data_offset and word_data_size. This may be useful for some
dictionaries. But this feature is only well supported by
StarDict-2.4.8 and newer.

The length of "word_str" should be less than 256. In other words,
(strlen(word) < 256).

If the version is "3.0.0" and "idxoffsetbits=64", word_data_offset will
be a 64-bit unsigned number in network byte order. Otherwise it will be
32 bits.
word_data_size should be a 32-bit unsigned number in network byte order.
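
The entry layout above can be read in Java roughly as follows. This is a sketch under stated assumptions: the .idx content is already uncompressed and held in a byte array, and the names IdxReader and Entry are hypothetical. A 64-bit offset is stored in a signed Java long, so values above 2^63 - 1 would appear negative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IdxReader {
    static final class Entry {
        final String word;
        final long offset; // word_data_offset
        final long size;   // word_data_size

        Entry(String word, long offset, long size) {
            this.word = word;
            this.offset = offset;
            this.size = size;
        }
    }

    /** Parses the uncompressed content of an .idx file; offsetBits is 32 or 64. */
    static List<Entry> parse(byte[] idx, int offsetBits) {
        ByteBuffer buf = ByteBuffer.wrap(idx); // big-endian (network order) by default
        List<Entry> entries = new ArrayList<>();
        while (buf.hasRemaining()) {
            int start = buf.position();
            while (buf.get() != 0) {
                // advance past word_str and its terminating NUL
            }
            String word = new String(idx, start, buf.position() - 1 - start, StandardCharsets.UTF_8);
            long offset = offsetBits == 64 ? buf.getLong() : Integer.toUnsignedLong(buf.getInt());
            long size = Integer.toUnsignedLong(buf.getInt());
            entries.add(new Entry(word, offset, size));
        }
        return entries;
    }

    public static void main(String[] args) {
        ByteBuffer sample = ByteBuffer.allocate(32);
        sample.put("cat".getBytes(StandardCharsets.UTF_8)).put((byte) 0).putInt(0).putInt(12);
        sample.put("dog".getBytes(StandardCharsets.UTF_8)).put((byte) 0).putInt(12).putInt(20);
        byte[] idx = Arrays.copyOf(sample.array(), sample.position());
        for (Entry e : parse(idx, 32)) {
            System.out.println(e.word + " @" + e.offset + " +" + e.size);
        }
    }
}
```

Truncated input makes the ByteBuffer calls throw BufferUnderflowException, which is a reasonable signal that wordcount or idxfilesize in the .ifo file does not match the data.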

It is possible for different word_str values to have the same
word_data_offset and word_data_size, so that multiple index entries point
to the same definition. But this is not recommended: when multiple words
share the same definition, you may create a ".syn" file for them, see
section 4 below.

The word list must be sorted by calling stardict_strcmp() on the "word_str"
fields. If the word list order is wrong, StarDict will fail to function
correctly!

============
gint stardict_strcmp(const gchar *s1, const gchar *s2)
{
    gint a;
    a = g_ascii_strcasecmp(s1, s2);
    if (a == 0)
        return strcmp(s1, s2);
    else
        return a;
}
============
g_ascii_strcasecmp() is a glib function:
Unlike the BSD strcasecmp() function, this only recognizes standard
ASCII letters and ignores the locale, treating all non-ASCII characters
as if they are not letters.

stardict_strcmp() works fine with English characters, but it does not
sort characters of other locales well. In that case, you can enable
the collation feature, see section 6.
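
Java has no built-in counterpart to stardict_strcmp(), and String.compareToIgnoreCase() is not equivalent because it folds non-ASCII letters and compares UTF-16 units. The sketch below is an approximation written for this article, not a port taken from the StarDict sources: it compares the UTF-8 bytes as unsigned values and folds only ASCII A-Z. The class name StarDictOrder is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StarDictOrder {
    private static int lower(int b) {
        return (b >= 'A' && b <= 'Z') ? b + ('a' - 'A') : b;
    }

    /** Unsigned byte-wise comparison; folds ASCII letters when foldCase is true. */
    private static int compareBytes(byte[] a, byte[] b, boolean foldCase) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            int d = foldCase ? lower(x) - lower(y) : x - y;
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length; // the shorter string sorts first
    }

    /** Case-insensitive ASCII comparison first, exact byte comparison as the tie-breaker. */
    static int compare(String s1, String s2) {
        byte[] a = s1.getBytes(StandardCharsets.UTF_8);
        byte[] b = s2.getBytes(StandardCharsets.UTF_8);
        int r = compareBytes(a, b, true);
        return r != 0 ? r : compareBytes(a, b, false);
    }

    public static void main(String[] args) {
        String[] words = {"banana", "apple", "Apple"};
        Arrays.sort(words, StarDictOrder::compare);
        System.out.println(Arrays.toString(words)); // prints [Apple, apple, banana]
    }
}
```

Using the same comparator for a binary search over the parsed word list keeps lookups consistent with the order in which the file was written.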


{4}. The ".syn" file's format.
This file is optional, and note that a tree dictionary does not need it.
Only StarDict-2.4.8 and newer support this file.

The .syn file contains information for synonyms: when you input a
synonym, StarDict will search for the word that it is related to.

The format is simple. Each item contains one string and one number.
synonym_word;  // a utf-8 string terminated by '\0'.
original_word_index; // original word's index in .idx file.
Then the other items follow without separation.
When you input synonym_word, StarDict will search for original_word.

The length of "synonym_word" should be less than 256. In other
words, (strlen(word) < 256).
original_word_index is a 32-bit unsigned number in network byte order.
Two or more items may have the same "synonym_word" with different
original_word_index.
The items must be sorted by stardict_strcmp() with synonym_word.
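
A minimal Java sketch of reading these items, assuming the whole .syn file is in a byte array. SynReader and Item are hypothetical names. The resulting originalIndex is meant to be used as a position in the list of entries parsed from the .idx file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SynReader {
    static final class Item {
        final String synonym;
        final long originalIndex; // index of the original word in the .idx word list

        Item(String synonym, long originalIndex) {
            this.synonym = synonym;
            this.originalIndex = originalIndex;
        }
    }

    /** Parses the content of a .syn file into its items. */
    static List<Item> parse(byte[] syn) {
        ByteBuffer buf = ByteBuffer.wrap(syn); // big-endian (network order) by default
        List<Item> items = new ArrayList<>();
        while (buf.hasRemaining()) {
            int start = buf.position();
            while (buf.get() != 0) {
                // advance past synonym_word and its terminating NUL
            }
            String word = new String(syn, start, buf.position() - 1 - start, StandardCharsets.UTF_8);
            items.add(new Item(word, Integer.toUnsignedLong(buf.getInt())));
        }
        return items;
    }

    public static void main(String[] args) {
        byte[] syn = {'h', 'i', 0, 0, 0, 0, 3};
        Item item = parse(syn).get(0);
        System.out.println(item.synonym + " -> entry " + item.originalIndex); // prints hi -> entry 3
    }
}
```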


{5}. The offset cache file's format.
StarDict-2.4.8 started to support cache files. This feature can speed up
loading and save memory, because the cache file is mapped with mmap().
The cache file names are .idx.oft and .syn.oft, and the format is:
First a utf-8 string terminated by '\0', then many 32-bit numbers as
the word offset index. This index is sparse, with "ENTR_PER_PAGE=32",
and the numbers are not stored in network byte order.
The string must begin with:
=====
StarDict's oft file
version=2.4.8
=====
Then a line like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
This line should have an ending '\n'.

StarDict will first try to create the .oft file in the same directory as
the .ifo file; if that fails, it will try to create it in
~/.cache/stardict/, where ~/.cache is obtained from g_get_user_cache_dir().
If two or more dictionaries have the same file name, StarDict will
create somedict.idx.oft, somedict(2).idx.oft, somedict(3).idx.oft,
etc. for them respectively, each with a different "url=" in the
beginning string.


{6}. The collation file's format.
StarDict-2.4.8 started to support collation, which sorts the word
list by a collate function. It will create collation files
named .idx.clt and .syn.clt. The format is a little like the offset
cache file:
First a utf-8 string terminated by '\0', then many 32-bit numbers as
the index sorted by the collate function. They are not stored
in network byte order.
The string must begin with:
=====
StarDict's clt file
version=2.4.8
=====
Then two lines like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
func=0
The second line should have an ending '\n' too.

StarDict currently supports these collate functions:
typedef enum {
    UTF8_GENERAL_CI = 0,
    UTF8_UNICODE_CI,
    UTF8_BIN,
    UTF8_CZECH_CI,
    UTF8_DANISH_CI,
    UTF8_ESPERANTO_CI,
    UTF8_ESTONIAN_CI,
    UTF8_HUNGARIAN_CI,
    UTF8_ICELANDIC_CI,
    UTF8_LATVIAN_CI,
    UTF8_LITHUANIAN_CI,
    UTF8_PERSIAN_CI,
    UTF8_POLISH_CI,
    UTF8_ROMAN_CI,
    UTF8_ROMANIAN_CI,
    UTF8_SLOVAK_CI,
    UTF8_SLOVENIAN_CI,
    UTF8_SPANISH_CI,
    UTF8_SPANISH2_CI,
    UTF8_SWEDISH_CI,
    UTF8_TURKISH_CI,
    COLLATE_FUNC_NUMS
} CollateFunctions;
These UTF8_*_CI functions in fact come from MySQL.

The file is located in the same way as the .oft file.

Note that for a "somedict.idx.gz" file, the corresponding collation
file is somedict.idx.clt, not somedict.idx.gz.clt, and the
"url=" is somedict.idx, not somedict.idx.gz. So after you gzip
the .idx file, StarDict does not need to create the .clt file again.


{7}. The ".dict" file's format.
The .dict file is a pure data sequence, as the offset and size of each
word is recorded in the corresponding .idx file.
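
Fetching one word's data is therefore a plain slice operation. The Java sketch below assumes the uncompressed .dict content is available as a ByteBuffer (for instance wrapped around a byte array, or memory-mapped by the caller); the class DictSlice is a hypothetical name, and a .dict.dz file would have to be decompressed first, which is not shown. A single ByteBuffer cannot exceed 2 GB, so this approach does not cover very large dictionaries.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class DictSlice {
    /** Copies size bytes starting at offset without moving the position of dict. */
    static byte[] wordData(ByteBuffer dict, long offset, long size) {
        if (offset < 0 || size < 0 || offset + size > dict.limit()) {
            throw new IllegalArgumentException("entry lies outside the .dict data");
        }
        ByteBuffer view = dict.duplicate(); // independent position, shared content
        view.position((int) offset);
        byte[] out = new byte[(int) size];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer dict = ByteBuffer.wrap("alphabeta".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(wordData(dict, 5, 4), StandardCharsets.US_ASCII)); // prints beta
    }
}
```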

If the "sametypesequence" option is not used in the .ifo file, then
the .dict file has fields in the following order:
==============
word_1_data_1_type; // a single char identifying the data type
word_1_data_1_data; // the data
word_1_data_2_type;
word_1_data_2_data;
...... // the number of data entries for each word is determined by
       // word_data_size in .idx file
word_2_data_1_type;
word_2_data_1_data;
......
==============
It's important to note that each field in each word indicates its
own length, as described below. The number of possible fields per
word is also not fixed, and is determined by simply reading data until
you've read word_data_size bytes for that word.


Suppose the "sametypesequence" option is used in the .ifo file, and
the option is set like this:
sametypesequence=tm
Then the .dict file will look like this:
==============
word_1_data_1_data
word_1_data_2_data
word_2_data_1_data
word_2_data_2_data
......
==============
The first data entry for each word will have a terminating '\0', but
the second entry will not have a terminating '\0'. The omissions of
the type chars and of the last field's size information are the
optimizations required by the "sametypesequence" option described
above.

If "idxoffsetbits=64", the file size of the .dict file will be bigger
than 4G. We often need to mmap this large file, and a process on a
32-bit computer has a 4G maximum virtual memory space, so mapping it
would fail. A dictionary with "idxoffsetbits=64" therefore cannot
actually be loaded on a 32-bit machine; StarDict will simply print a
warning in this case when loading. 64-bit computers should not have
this limit.

Type identifiers
----------------
Here are the single-character type identifiers that may be used with
the "sametypesequence" option in the .ifo file, or may appear in the
dict file itself if the "sametypesequence" option is not used.

Lower-case characters signify that a field's size is determined by a
terminating '\0', while upper-case characters indicate that the data
begins with a network byte-ordered guint32 that gives the length of
the following data (NOT the whole size, which is 4 bytes bigger).

'm'
Word's pure text meaning.
The data should be a utf-8 string ending with '\0'.

'l'
Word's pure text meaning.
The data is NOT a utf-8 string, but is instead a string in locale
encoding, ending with '\0'. Sometimes using this type will save disk
space, but its use is discouraged.

'g'
A utf-8 string which is marked up with the Pango text markup language.
For more information about this markup language, see the "Pango
Reference Manual."
You might have it installed locally at:
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html

't'
English phonetic string.
The data should be a utf-8 string ending with '\0'.

Here are some utf-8 phonetic characters:
θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑṃṇḷ
æɑɒʌәєŋvθðʃʒɚːɡˏˊˋ

'x'
A utf-8 string which is marked up with the xdxf language.
See http://xdxf.sourceforge.net
StarDict has these extensions:
<rref> can have a "type" attribute, which can be "image", "sound", "video"
or "attach".
<kref> can have a "k" attribute.

'y'
Chinese YinBiao or Japanese KANA.
The data should be a utf-8 string ending with '\0'.

'k'
KingSoft PowerWord's data. The data is a utf-8 string ending with '\0'.
It is in XML format.

'w'
MediaWiki markup language.
See http://meta.wikimedia.org/wiki/Help:Editing#The_wiki_markup

'h'
Html codes.

'n'
WordNet data.

'r'
Resource file list.
The content can be:
img:pic/example.jpg     // Image file
snd:apple.wav           // Sound file
vdo:film.avi            // Video file
att:file.bin            // Attachment file
More than one line is supported as a list of available files.
StarDict will find the files in the Resource Storage.
The image will be shown, the sound file will have a play button.
You can "save as" the attachment file and so on.

'W'
wav file.
The data begins with a network byte-ordered guint32 to identify the wav
file's size, immediately followed by the file's content.

'P'
Picture file.
The data begins with a network byte-ordered guint32 to identify the picture
file's size, immediately followed by the file's content.

'X'
this type identifier is reserved for experimental extensions.
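
The lower-case/upper-case rule and the two sametypesequence optimizations can be combined into one small field splitter. The Java sketch below is an illustration only, with hypothetical names (WordDataParser, Field); it expects exactly the word_data_size bytes of one word, and takes the sametypesequence value from the .ifo file, or null when that option is absent.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class WordDataParser {
    static final class Field {
        final char type;
        final byte[] data; // payload without type char, NUL terminator or length prefix

        Field(char type, byte[] data) {
            this.type = type;
            this.data = data;
        }
    }

    /** Splits one word's data into typed fields. */
    static List<Field> parse(byte[] wordData, String sameTypeSequence) {
        ByteBuffer buf = ByteBuffer.wrap(wordData); // big-endian (network order) by default
        List<Field> fields = new ArrayList<>();
        if (sameTypeSequence != null) {
            for (int i = 0; i < sameTypeSequence.length(); i++) {
                boolean last = (i == sameTypeSequence.length() - 1);
                fields.add(readField(buf, sameTypeSequence.charAt(i), last));
            }
        } else {
            while (buf.hasRemaining()) {
                char type = (char) (buf.get() & 0xFF); // each field starts with its type char
                fields.add(readField(buf, type, false));
            }
        }
        return fields;
    }

    private static Field readField(ByteBuffer buf, char type, boolean sizeOmitted) {
        int length;
        boolean nulTerminated = false;
        if (sizeOmitted) {
            length = buf.remaining();          // last field: runs to the end of the word's data
        } else if (Character.isUpperCase(type)) {
            length = buf.getInt();             // guint32 length prefix
        } else {
            nulTerminated = true;
            length = 0;                        // scan ahead for the terminating NUL
            while (buf.get(buf.position() + length) != 0) {
                length++;
            }
        }
        byte[] data = new byte[length];
        buf.get(data);
        if (nulTerminated) {
            buf.get();                         // consume the NUL itself
        }
        return new Field(type, data);
    }

    public static void main(String[] args) {
        byte[] raw = {'m', 'h', 'i', 0, 'W', 0, 0, 0, 2, 7, 8};
        for (Field f : parse(raw, null)) {
            System.out.println(f.type + ": " + f.data.length + " bytes");
        }
    }
}
```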



{8}. Resource Storage
Resource Storage stores the external files referenced in an 'r' resource
file list, the images in html code, and the images, media and other files
in wiki tags.
It has two forms:
1. Direct directory and files in the "res" sub-directory.
2. The res.rifo, res.ridx and res.rdic database.
Direct files may have file name encoding problems, as Linux uses UTF-8 and
Windows uses the local encoding, so you had better use ASCII file names only,
or use the database to store UTF-8 file names.
The database may need to extract a file (such as a .wav file) to a temporary
file, so it is less efficient than direct files. But the database has the
advantage of compression.
You can convert between the res directory and the res database with
the dir2resdatabase and resdatabase2dir tools.
StarDict will try to load the storage database first, then try the direct
files form.

The format of the res.rifo file:
StarDict's storage ifo file
version=3.0.0
filecount=      // required.
idxoffsetbits=  // optional.


The format of the res.ridx file:
filename;       // A string ending with '\0'.
offset;         // 32 or 64 bits unsigned number in network byte order.
size;           // 32 bits unsigned number in network byte order.
filename can include a path too, such as "pic/example.png". filename is
case sensitive, and no two entries should have the same file name.
If "idxoffsetbits=64", then offset is 64 bits.
These three items are repeated for each entry.
The entries are sorted by the strcmp() function on the filename field.
It is possible that different filenames have the same offset and size.

The format of the res.rdic file:
It is just the concatenation of all the resource files.
You can dictzip this file as res.rdic.dz


{9}. Tree Dictionary
The tree dictionary support is used for information viewing, etc.

A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
and sometreedict.dict.dz.

It is better to compress the .tdx file, as it is always loaded into memory.

The .ifo file has the following format:

StarDict's treedict ifo file
version=2.4.2
[options]

Available options:

bookname=      // required
tdxfilesize=   // required
wordcount=
author=
email=
website=
description=
date=
sametypesequence=

wordcount is only used for the info view in the dict manage dialog, so it is
not important in a tree dictionary.

The .tdx file is just the word list.
-----------
The word list is a tree list of word entries.

Each entry in the word list contains four fields, one after the other:
     word_str;  // a utf-8 string terminated by '\0'.
     word_data_offset;  // word data's offset in .dict file
     word_data_size;  // word data's total size in .dict file. it can be 0.
     word_subentry_count; // how many sub words this entry has, 0 means none.

Sub-entries immediately follow their parent entry. The resulting order is
the same as that of a tree list with all its nodes expanded, read from top
to bottom.

word_data_offset, word_data_size and word_subentry_count should be 32-bit
unsigned numbers in network byte order.
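
Because sub-entries directly follow their parent, the tree can be rebuilt with a simple recursive reader. The Java sketch below uses hypothetical names (TdxReader, Node) and assumes the .tdx content is already uncompressed in a byte array. The text above does not say whether the file holds a single root entry or several top-level entries, so the sketch reads top-level entries until the data is exhausted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TdxReader {
    static final class Node {
        final String word;
        final long offset; // word_data_offset
        final long size;   // word_data_size, may be 0
        final List<Node> children = new ArrayList<>();

        Node(String word, long offset, long size) {
            this.word = word;
            this.offset = offset;
            this.size = size;
        }
    }

    /** Parses the uncompressed content of a .tdx file into its top-level nodes. */
    static List<Node> parse(byte[] tdx) {
        ByteBuffer buf = ByteBuffer.wrap(tdx); // big-endian (network order) by default
        List<Node> roots = new ArrayList<>();
        while (buf.hasRemaining()) {
            roots.add(readNode(tdx, buf));
        }
        return roots;
    }

    private static Node readNode(byte[] tdx, ByteBuffer buf) {
        int start = buf.position();
        while (buf.get() != 0) {
            // advance past word_str and its terminating NUL
        }
        String word = new String(tdx, start, buf.position() - 1 - start, StandardCharsets.UTF_8);
        long offset = Integer.toUnsignedLong(buf.getInt());
        long size = Integer.toUnsignedLong(buf.getInt());
        long subentries = Integer.toUnsignedLong(buf.getInt());
        Node node = new Node(word, offset, size);
        for (long i = 0; i < subentries; i++) {
            node.children.add(readNode(tdx, buf)); // sub-entries follow their parent directly
        }
        return node;
    }

    public static void main(String[] args) {
        byte[] tdx = {'r', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
                      'c', 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0};
        Node root = parse(tdx).get(0);
        System.out.println(root.word + " has " + root.children.size() + " sub-entry");
    }
}
```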

The .dict file's format is the same as the normal dictionary.



{10}. More information.
You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
"src/tools/*.cpp" for more information.

After you have built a dictionary, you can use "stardict_verify" to verify the
dictionary files. You can find it at "src/tools/".

If you have any questions, email me. :)

Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
English.

Hu Zheng <huzheng_001@163.com>
http://forlinux.yeah.net
2007.4.24

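Based on this specification and with reference to the StarDict source code, I implemented a program that reads these dictionaries.
The project was created with NetBeans 6; please see the attachment :-)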