Chinese Word Vectors 中文词向量

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

Reference

Please cite the paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du. Analogical Reasoning on Chinese Morphological and Semantic Relations. ACL 2018.

@InProceedings{P18-2023,
  author    = "Li, Shen and Zhao, Zhe and Hu, Renfen and Li, Wensi and Liu, Tao and Du, Xiaoyong",
  title     = "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year      = "2018",
  publisher = "Association for Computational Linguistics",
  pages     = "138--143",
  location  = "Melbourne, Australia",
  url       = "http://aclweb.org/anthology/P18-2023"
}

A detailed analysis of the relation between intrinsic and extrinsic evaluations of Chinese word embeddings is presented in the following paper:

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. In: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018, pp. 209-221. (CCL & NLP-NABD 2018 Best Paper)

@incollection{qiu2018revisiting,
  title     = {Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
  author    = {Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
  booktitle = {Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages     = {209--221},
  year      = {2018},
  publisher = {Springer}
}

Format

The pre-trained vector files are in text format. Each line contains a word and its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the dimension size.

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They are in the same format as liblinear, where the number before ":" denotes the dimension index and the number after it denotes the value.
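
As a quick illustration, here is a minimal loader for both formats. It is a sketch, not part of the released toolkit; the helper names and the use of numpy are our own.

```python
import numpy as np

def load_dense(path):
    """Load dense vectors in the text format described above."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())  # meta line
        for line in f:
            parts = line.rstrip().split()
            word, values = parts[0], parts[1:1 + dim]
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors, dim

def parse_sparse_line(line):
    """Parse one PPMI line in liblinear format: word index:value index:value ..."""
    word, *pairs = line.split()
    return word, {int(i): float(v) for i, v in (p.split(":") for p in pairs)}
```

Since the dense files follow the standard word2vec text format, gensim's `KeyedVectors.load_word2vec_format` should be able to read them as well.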

Pre-trained Chinese Word Vectors

Basic Settings

| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration | Negative Sampling* |
|---|---|---|---|---|---|
| 5 | Yes | 1e-5 | 10 | 5 | 5 |

*Only for SGNS.
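
For readers who want a comparable baseline, these settings map roughly onto gensim's Word2Vec as in the sketch below. This is not the authors' training code (they use ngram2vec); the 300-dimension value is taken from the released vectors, and the toy corpus is purely illustrative.

```python
from gensim.models import Word2Vec

# Toy corpus; in practice, an iterable of token lists from a segmented corpus.
sentences = [["中文", "词", "向量"], ["预训练", "词", "向量"]] * 100

model = Word2Vec(
    sentences,
    vector_size=300,  # dimension of the released vectors
    window=5,         # Window Size
    sample=1e-5,      # Sub-sampling
    min_count=10,     # Low-Frequency Word threshold
    epochs=5,         # Iteration
    negative=5,       # Negative Sampling (SGNS only)
    sg=1,             # skip-gram with negative sampling (SGNS)
)
# gensim shrinks the effective window at random by default,
# which corresponds to Dynamic Window = Yes.
```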

Various Domains

Chinese Word Vectors trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | | | | |
| Wikipedia_zh 中文维基百科 | | | | |
| People's Daily News 人民日报 | | | | |
| Financial News 金融新闻 | | | | |
| Complete Library in Four Sections 四库全书* | | | NAN | NAN |
| Mixed-large 综合 | Baidu Netdisk / Google Drive | | | |

Positive Pointwise Mutual Information (PPMI)

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| People's Daily News 人民日报 | | | | |
| Complete Library in Four Sections 四库全书* | | | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

*Character embeddings are provided, since most Hanzi are themselves words in archaic Chinese.

Various Co-occurrence Information

We release word vectors trained on different kinds of co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of linguistic units beyond words. For example, character vectors can be found among the context vectors of the word-character setting.

All vectors here are trained with SGNS on Baidu Encyclopedia.

| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
|---|---|---|---|
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| Ngram | Word → Ngram (1-3) | 300d | 300d |
| Ngram | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| Character | Word → Character (1-2) | 300d | 300d |
| Character | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| Position | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| Syntactic Feature | Word → Dependency | 300d | 300d |

Representations

Existing word representation methods generally fall into one of two classes: dense and sparse representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. The SGNS model trains low-dimensional real-valued (dense) vectors with a shallow neural network; it is also known as a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) scheme.
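
For reference, the PPMI weight of a word $w$ and a context $c$ is the positive part of their pointwise mutual information, with probabilities estimated from corpus counts:

$$\mathrm{PPMI}(w, c) = \max\left(\log\frac{P(w, c)}{P(w)\,P(c)},\; 0\right)$$

A context dimension is therefore non-zero only when $w$ and $c$ co-occur more often than independence would predict, which is what makes the resulting vectors sparse.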

Context Features

Three context features are commonly used in the word embedding literature: word, ngram, and character. Most word representation methods essentially exploit word-word co-occurrence statistics, i.e. they use words as context features (word feature). Inspired by the language modeling problem, we also introduce ngram features into the context, so that both word-word and word-ngram co-occurrence statistics are used for training (ngram feature). For Chinese, characters (Hanzi) often convey strong semantics, so we further use word-word and word-character co-occurrence statistics for learning word vectors; the length of character-level ngrams ranges from 1 to 4 (character feature).
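
To make the character feature concrete, the sketch below enumerates the character-level ngrams (lengths 1 to 4) that serve as context features; the actual extraction happens inside ngram2vec, so this is illustrative only.

```python
def char_ngrams(word, min_n=1, max_n=4):
    """Enumerate character-level ngrams of a word, lengths min_n to max_n."""
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]

print(char_ngrams("大学"))  # ['大', '学', '大学']
```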

Besides word, ngram, and character, other features also have a substantial influence on the properties of word vectors. For example, using the entire text as a context feature introduces more topic information into word vectors, while using dependency parses as context features adds syntactic constraints. In total, 17 co-occurrence types are considered in this project.

Corpus

We made great efforts to collect corpora across various domains. All text data are preprocessed by removing HTML and XML tags, and only the plain text is kept. HanLP (v1.5.3) is used for word segmentation. In addition, traditional Chinese characters are converted into simplified characters with Open Chinese Convert (OpenCC).
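
This pipeline can be approximated in Python as below. It is a sketch: the project uses the Java HanLP v1.5.3 directly, so the pyhanlp and opencc packages here are stand-ins whose APIs we assume.

```python
# pip install pyhanlp opencc-python-reimplemented  (assumed packages)
from opencc import OpenCC
from pyhanlp import HanLP

cc = OpenCC("t2s")  # traditional -> simplified

def preprocess(line):
    """Convert a line to simplified characters, then segment it into words."""
    simplified = cc.convert(line)
    return [term.word for term in HanLP.segment(simplified)]

print(preprocess("自然語言處理"))  # e.g. ['自然语言', '处理'] (segmentation may vary)
```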

The detailed corpus information is listed as follows:

| Corpus | Size | Tokens | Vocabulary Size | Description |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017), http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs, http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | |
| Weibo 微博 | 0.73G | 136M | 850K | |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literary works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | Large corpus built by merging the above corpora |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China |

Vocabulary sizes are counted over all words, including low-frequency words.

Toolkits

All word vectors are trained with the ngram2vec toolkit, a superset of the word2vec and fasttext toolkits that supports arbitrary context features and models.

Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated by analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, in which most analogy questions are directly translated from an English benchmark. Although CA-translated has been widely used in many Chinese word embedding papers, it contains questions for only three semantic relations and covers just 134 Chinese words. In contrast, CA8 is designed specifically for Chinese. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the testsets folder.
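
Analogy questions of the form "a is to a* as b is to ?" are commonly answered with the 3CosAdd rule over normalized vectors. The sketch below shows the idea on top of the load_dense helper from the Format section; it is illustrative, not the toolkit's actual implementation.

```python
import numpy as np

def answer_analogy(vectors, a, a_star, b):
    """3CosAdd: return the word closest to b - a + a* (question words excluded)."""
    unit = lambda v: v / np.linalg.norm(v)
    target = unit(vectors[b]) - unit(vectors[a]) + unit(vectors[a_star])
    words = [w for w in vectors if w not in (a, a_star, b)]
    matrix = np.stack([unit(vectors[w]) for w in words])
    return words[int(np.argmax(matrix @ target))]
```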

Evaluation Toolkit

We present an evaluation toolkit in the evaluation folder.

Run the following commands to evaluate dense vectors (replace <vector.txt> with the path to your vector file):

$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt

Run the following commands to evaluate sparse vectors:

$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt
