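Problem types
TagSpace: word and tag embeddings
Use case: learn a mapping from short texts to relevant hashtags, for example as described in this article. This is a typical classification application.
Model: the mapping from sets of words to sets of tags is learned by learning embeddings of both. For example, the input "restaurant has great food <\tab> #restaurant <\tab> #yum" is translated into the graph below. (The nodes in the graph are the entities whose embeddings are learned, and the edges are the relations between those entities.)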
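As a rough illustration of the input described above, the sketch below splits the example line into its word set and its tag set. It assumes that "<\tab>" stands for a literal tab character and that tags are the fields starting with "#". The function names `parse_tagspace_line` and `word_tag_pairs` are hypothetical and are not part of StarSpace. Pairing every word with every tag is only one possible way to picture the edges, not necessarily how the figure draws them.

```python
def parse_tagspace_line(line, sep="\t", tag_prefix="#"):
    """Split one TagSpace-style line into (words, tags).

    Assumption: fields are tab-separated and tags start with '#'.
    """
    words, tags = [], []
    for field in line.split(sep):
        field = field.strip()
        if not field:
            continue
        if field.startswith(tag_prefix):
            tags.append(field)
        else:
            # A non-tag field is a short text; split it into words.
            words.extend(field.split())
    return words, tags


def word_tag_pairs(words, tags):
    """Pair each word with each tag (an illustrative stand-in for edges)."""
    return [(w, t) for w in words for t in tags]


words, tags = parse_tagspace_line("restaurant has great food\t#restaurant\t#yum")
pairs = word_tag_pairs(words, tags)
print(words)
print(tags)
print(len(pairs))
```

Here both the words and the tags become entities that would each receive an embedding during training.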
Training data
The AG’s news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
News data: 4 categories, 120,000 articles.
World
Sports
Business
Sci/Tech
Data samples
The file classes.txt contains a list of classes corresponding to each label.
__label__2 , garca winds up best in tough going , given what sergio garca has achieved in his career already it is difficult to believe he is only 24 years old . he had a 67 yesterday , four under , to share the volvo masters lead with his fellow spaniard
__label__3 , us shares take a tumble on oil prices , new york , nov 23 ( afp ) - wall street shares slid on tuesday as oil prices surged higher and investors sensed weaknesses in the technology sector .
__label__4 , product review blackberry 7100t smartphone ( newsfactor ) , newsfactor - research in motion ' s ( nasdaq rimm ) quad-band \blackberry 7100t with \pda capabilities is a gsm/gprs ( 850/900/1800/1900 mhz ) cellular handset that can make and receive phone calls in more than 100 countries around the world .
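The samples above follow the fastText-style format, in which tokens starting with `__label__` are labels and everything else is input text. The sketch below separates the two for one line. It assumes that labels 1 to 4 map to World, Sports, Business and Sci/Tech in that order, which matches the samples shown but should be checked against classes.txt. The helper names `parse_example` and `class_name` are hypothetical.

```python
LABEL_PREFIX = "__label__"

# Assumption: this order mirrors classes.txt (label 1 = World, ... 4 = Sci/Tech).
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def parse_example(line):
    """Split a whitespace-tokenized line into (labels, words)."""
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(LABEL_PREFIX)]
    # Punctuation tokens such as ',' are kept, since the text is not normalized here.
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, words


def class_name(label):
    """Map '__label__N' to its class name using the assumed order."""
    index = int(label[len(LABEL_PREFIX):])
    return CLASS_NAMES[index - 1]


labels, words = parse_example("__label__3 , us shares take a tumble on oil prices")
print(labels, class_name(labels[0]))
print(words)
```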
Training
./classification_ag_news.sh
Downloading dataset ag_news
Compiling StarSpace
make: *** No targets specified and no makefile found. Stop.
Start to train on ag_news data:
Arguments:
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 20
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 0
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : /tmp/starspace/data/ag_news.train
Read 5M words
Number of words in dictionary: 95811
Number of la