Implementing a Character-Level CNN Classification Model

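This post describes how to implement a character-level CNN classification model based on the paper 1509.01626, covering the model structure, dataset processing, configuration, training, and testing. The author shares the project's GitHub link, presents a network with 6 convolutional layers and 3 fully connected layers, and reports the training and test results, with an accuracy of 0.8789.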

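I recently shared a post about character-level classification models.

Over the past couple of days at home, I implemented the character-level CNN classification paper in code: 1509.01626 Character-level Convolutional Networks for Text Classification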

Project code: https://github.com/howie6879/char_cnn_text_classification

Project environment:

  • Python 3.6

  • Managed with Anaconda + Pipenv

Usage

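# Download the code
git clone https://github.com/howie6879/char_cnn_text_classification.git
# Create a Python 3.6 environment with Anaconda
conda create -n python36 python=3.6
# Enter the project
cd char_cnn_text_classification
# The path after --python is the interpreter of the conda environment created above
pipenv install --python ~/anaconda3/envs/python36/bin/python3.6
# Run this only if the previous step fails; otherwise skip it
pipenv run pip install pip==18.0
# Install dependencies (see the Pipfile for the full list)
pipenv install
# Enter the code directory
cd char_cnn_text_classification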

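Model

The model structure is the same as described in the paper: 6 convolutional layers followed by 3 fully connected layers.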
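To make the 6-convolution, 3-fully-connected layout concrete, here is a minimal sketch that traces how the sequence length shrinks through the network. The hyperparameters (input length 1014, 256 filters, kernel sizes 7/7/3/3/3/3, pooling of size 3 after layers 1, 2 and 6, dense layers of 1024 units) are assumed from the "small" configuration in the paper, not read from this project's config, and the names `CONV_LAYERS`, `FC_UNITS`, `trace_shapes` and `dense_shapes` are made up for illustration.

```python
# Frame-length walkthrough for a 6-conv + 3-FC character-level CNN.
# All hyperparameters below are assumptions taken from the paper's
# "small" configuration; the project's own config may differ.

INPUT_LENGTH = 1014  # characters per sample (assumed)
NUM_FILTERS = 256    # feature maps per conv layer (assumed)

# (kernel_size, pool_size or None) for each of the six conv layers
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]
# Units of the three fully connected layers; the last is the class count
FC_UNITS = [1024, 1024, 4]


def conv_output_length(length, kernel_size):
    """Length after a stride-1 convolution without padding."""
    return length - kernel_size + 1


def pool_output_length(length, pool_size):
    """Length after non-overlapping max pooling."""
    return length // pool_size


def trace_shapes(input_length=INPUT_LENGTH):
    """Return the length after each conv layer and the flattened size."""
    lengths = []
    length = input_length
    for kernel_size, pool_size in CONV_LAYERS:
        length = conv_output_length(length, kernel_size)
        if pool_size is not None:
            length = pool_output_length(length, pool_size)
        lengths.append(length)
    return lengths, length * NUM_FILTERS


def dense_shapes(flat_size):
    """Return (inputs, outputs) for each fully connected layer."""
    sizes = [flat_size] + FC_UNITS
    return list(zip(sizes[:-1], sizes[1:]))


if __name__ == "__main__":
    lengths, flat = trace_shapes()
    print("length after each conv layer:", lengths)
    print("flattened features:", flat)
    print("dense layer shapes:", dense_shapes(flat))
```

Under these assumptions the last convolutional layer outputs 34 frames of 256 features, so the first fully connected layer receives 8704 inputs.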

AG's News Topic Classification Dataset

Version 3, Updated 09/09/2015

ORIGIN

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

DESCRIPTION

The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

The file classes.txt contains a list of classes corresponding to each label.

The files train.csv and test.csv contain all the training samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
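The CSV layout described above can be read with Python's standard `csv` module, which already handles the doubled double quotes. The sketch below is only an illustration of that format and of a simple character-to-index encoding; it is not the project's actual loader. The alphabet, the maximum length of 1014, the choice to join title and description with a space, and the names `read_samples` and `encode` are all assumptions.

```python
import csv
import io

# Alphabet modeled on the paper's description (lowercase letters, digits,
# punctuation, newline). This is an assumption; the project may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}\n"
# Index 0 is reserved for padding and for characters outside the alphabet
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LENGTH = 1014  # assumed input length


def read_samples(csv_file):
    """Yield (label, text) pairs from an AG News style CSV stream."""
    for class_index, title, description in csv.reader(csv_file):
        # Turn the escaped two-character sequence \n back into a newline
        text = (title + " " + description).replace("\\n", "\n")
        # Shift class indices 1..4 to zero-based labels 0..3
        yield int(class_index) - 1, text


def encode(text, max_length=MAX_LENGTH):
    """Map text to a fixed-length list of alphabet indices."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in text.lower()[:max_length]]
    return indices + [0] * (max_length - len(indices))


if __name__ == "__main__":
    # A made-up row in the same format as train.csv
    sample = '3,"Title ""quoted""","Line one\\nline two"\n'
    for label, text in read_samples(io.StringIO(sample)):
        print(label, repr(text), encode(text)[:10])
```

The resulting index lists can be expanded to one-hot vectors or fed to an embedding layer, depending on how the model's input layer is defined.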