[NLP] Public Chinese Named Entity Corpora

Latest recommended article published on 2023-04-24 20:52:30

zkq_1986

Article link: https://blog.csdn.net/zkq_1986/article/details/109364012

https://github.com/CLUEbenchmark/CLUENER2020

As Table 3 shows, the MSRANER [7] and PeopleDailyNER datasets have only the three classic categories (person name, location, and organization), while WeiboNER [8, 9] adds a geo-political category. BOSONNER [10] adds three more categories (time, product name, and company name), but it has only 2k samples. Resume NER [11] has 8 categories, of which Educational Institution and Ethnicity Background are unique to it, but its distribution is particularly unbalanced: the category with the most data is 134 times larger than the category with the least. In CLUENER2020, by contrast, we control the amount of data in each category so that all categories are on the same order of magnitude (see Figure 2 for details). Beyond the three classic categories, CLUENER2020 has 7 categories that MSRANER and PeopleDailyNER lack, and it has more samples than BOSONNER. Besides being more diverse, our dataset is also more challenging than the others: state-of-the-art models currently reach F1 scores of around 95 or higher on other Chinese NER tasks, while the best model on CLUENER2020 reaches only about 80.
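The imbalance figure quoted above (largest category divided by smallest) is easy to check on any corpus. Below is a minimal sketch, assuming the data is stored one JSON object per line with a `label` field that maps each category to its mentions and their character spans. That layout, the helper names `count_entities` and `imbalance_ratio`, and the two sample records are illustrative assumptions, not taken from the paper.

```python
import json
from collections import Counter


def count_entities(lines):
    """Count entity mentions per category in JSON-lines records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout (hypothetical):
        # {"text": ..., "label": {category: {mention: [[start, end], ...]}}}
        for category, mentions in record.get("label", {}).items():
            for spans in mentions.values():
                counts[category] += len(spans)
    return counts


def imbalance_ratio(counts):
    """Ratio of the largest category count to the smallest one."""
    return max(counts.values()) / min(counts.values())


# Made-up sample records, for illustration only.
sample = [
    '{"text": "张三在北京大学工作", "label": {"name": {"张三": [[0, 1]]}, '
    '"organization": {"北京大学": [[3, 6]]}}}',
    '{"text": "李四和王五", "label": {"name": {"李四": [[0, 1]], "王五": [[3, 4]]}}}',
]

counts = count_entities(sample)
print(dict(counts))
print(imbalance_ratio(counts))
```

To run this on a real file, pass an open file handle instead of `sample`, for example `count_entities(open(path, encoding="utf-8"))`. A ratio close to 1 means the categories are balanced, while a ratio such as the 134 reported for Resume NER indicates a heavily skewed distribution.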

References:

CLUENER2020: Fine-Grained Named Entity Recognition Dataset and Benchmark for Chinese