
[Original] Could not install packages due to an EnvironmentError: fixing the environment error hit during pip install

The exact error: Found existing installation: setuptools 18.5; Uninstalling setuptools-18.5: Could not install packages due to an EnvironmentError: [('/System/Library/Frameworks/Python.framework/Versions/2....

2019-06-19 19:05:53 · 2964 views
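
The truncated excerpt stops before the fix. A common general workaround for this error, where pip tries to write into the protected /System/Library path of the macOS system Python, is to install into the user site-packages instead (a standard pip option, not quoted from the post):

    pip install --user <package>

Working inside a virtualenv likewise avoids touching the protected path.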

[Original] Cannot uninstall 'numpy'. It is a distutils installed project and thus we cannot accurately determine…

Running pip2 install tensorflow to install the Python 2.7 version of TensorFlow fails with the following error: Installing collected packages: wrapt, tensorflow-estimator, numpy, six, keras-preprocessing, absl-py, astor, funcsigs, mock, backports...

2019-06-19 18:52:39 · 5201 views
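
Here too the excerpt ends before the fix. A widely used workaround, since pip cannot safely uninstall a distutils-installed numpy, is to skip the uninstall step and install over it with pip's --ignore-installed flag (a general pip option, offered as an assumption rather than quoted from the post):

    pip2 install tensorflow --ignore-installed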

[Original] error: could not create 'xxxxxx': Permission denied

The fix first: grant permissions on the target folder with sudo chmod -R 777 xxxxxx, then enter your password. I ran into this problem while reinstalling Python 2.7; the pip install error was: creating build/temp.macosx-10.14-intel-2.7 creating build/temp.macosx-10.1...

2019-06-19 18:29:12 · 8462 views

中科院高级人工智能符号主义知识点总结.pdf (UCAS Advanced Artificial Intelligence: symbolism knowledge-point summary)

Notes for the symbolism part of the Advanced Artificial Intelligence course at the University of Chinese Academy of Sciences (Prof. Luo Ping's sections): a summary, with proofs, of every exam point, each covered and worked through in full, e.g. the completeness of the resolution principle.

2020-01-04

ag_news_csv.tgz

496,835 news articles from more than 2,000 news sources, covering the 4 largest classes of the AG news corpus; only the title and description fields are used. Each class has 30,000 training samples and 1,900 test samples.

README: AG's News Topic Classification Dataset, Version 3, Updated 09/09/2015.

ORIGIN: AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html . The AG's news topic classification dataset is constructed by Xiang Zhang ([email protected]) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

DESCRIPTION: The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600. The file classes.txt contains a list of classes corresponding to each label. The files train.csv and test.csv contain all the training samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".

2019-05-28
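
A minimal sketch, assuming the archive is extracted to an ag_news_csv/ folder, of loading train.csv in the 3-column format the README describes:

    import csv

    # Each row is: class index ("1"-"4"), title, description. The csv module
    # undoes the doubled-quote escaping; literal "\n" marks escaped newlines.
    examples = []
    with open("ag_news_csv/train.csv", newline="", encoding="utf-8") as f:
        for class_idx, title, description in csv.reader(f):
            text = title + " " + description.replace("\\n", " ")
            examples.append((int(class_idx) - 1, text))  # 0-based label
    print(len(examples))  # per the README: 120,000 training samples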

ag_news数据集

(Same dataset; the description is identical to the ag_news_csv.tgz entry above.)

2019-05-28

fine_tuning_data.zip: Chinese emotion data ready for direct fine-tuning with BERT

Detailed usage is covered in my blog post: https://blog.csdn.net/weixin_40015791/article/details/90410083 . A brief walkthrough: in run_classifier.py of the open-source BERT code, find the processors dict and register a new processor:

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "intentdetection": IntentDetectionProcessor,
        "emotion": EmotionProcessor,  # add this line
    }

Then add the following class to the same file. The .tsv file names must match the files in the extracted data folder, and get_labels returns "0" through "6" for 7-way classification:

    class EmotionProcessor(DataProcessor):
      """Processor for the emotion data set."""

      def get_train_examples(self, data_dir):
        """See base class."""
        # The file name must match the training set in the data folder.
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "fine_tuning_train_data.tsv")),
            "train")

      def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "fine_tuning_val_data.tsv")),
            "dev")

      def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "fine_tuning_test_data.tsv")),
            "test")

      def get_labels(self):
        """See base class."""
        return ["0", "1", "2", "3", "4", "5", "6"]  # labels 0-6: 7-way classification

      def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
          if i == 0:
            continue
          guid = "%s-%s" % (set_type, i)
          if set_type == "test":
            label = "0"
            text_a = tokenization.convert_to_unicode(line[0])
          else:
            label = tokenization.convert_to_unicode(line[0])
            text_a = tokenization.convert_to_unicode(line[1])
          examples.append(
              InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

Finally run fine-tuning. data_dir is the folder the zip was extracted into (here named data), the chinese_L-12_H-768_A-12 files are the original Chinese BERT model being fine-tuned, and output_dir is where the results are written:

    python run_classifier.py \
      --task_name=emotion \
      --do_train=true \
      --do_eval=true \
      --data_dir=data \
      --vocab_file=chinese_L-12_H-768_A-12/vocab.txt \
      --bert_config_file=chinese_L-12_H-768_A-12/bert_config.json \
      --init_checkpoint=chinese_L-12_H-768_A-12/bert_model.ckpt \
      --max_seq_length=128 \
      --train_batch_size=32 \
      --learning_rate=2e-5 \
      --num_train_epochs=3.0 \
      --output_dir=output

Training takes about 9 hours. The output folder will then contain three checkpoint files whose suffixes are index, meta, and 00000-of-00001; rename them to bert_model.ckpt.index, bert_model.ckpt.meta, and bert_model.ckpt.data-00000-of-00001, then copy vocab.txt and bert_config.json from chinese_L-12_H-768_A-12 into the same folder, so it holds 5 files in total. Use that folder exactly as you would chinese_L-12_H-768_A-12, for example:

    bert-serving-start -model_dir output -num_worker=3

serves the fine-tuned model as a general-purpose language model.

2019-05-21

中文情感分析7类情感.zip: Chinese sentiment analysis with 7 emotion classes, from NLPCC2013, parsed into txt files (imbalanced classes)

From NLPCC2013, parsed so that each line is emotion\tsentence. There are seven emotion classes with an imbalanced distribution. After splitting into train/test/val sets, the line counts are:

    1488 anger_data.txt      186 anger_test.txt      186 anger_val.txt      (8:1:1)
    2459 disgust_data.txt    307 disgust_test.txt    307 disgust_val.txt    (8:1:1)
     201 fear_data.txt        50 fear_test.txt        50 fear_val.txt       (4:1:1)
    2298 happiness_data.txt  287 happiness_test.txt  287 happiness_val.txt  (8:1:1)
    3286 like_data.txt       410 like_test.txt       410 like_val.txt       (8:1:1)
    1917 sadness_data.txt    239 sadness_test.txt    239 sadness_val.txt    (8:1:1)
     626 surprise_data.txt    78 surprise_test.txt    78 surprise_val.txt   (8:1:1)

2019-05-17
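
A minimal sketch, with file names taken from the listing above, of loading one of these tab-separated splits:

    # Each line is "emotion<TAB>sentence".
    def load_split(path):
        pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                label, sentence = line.rstrip("\n").split("\t", 1)
                pairs.append((label, sentence))
        return pairs

    train = load_split("anger_data.txt")
    print(len(train))  # 1488 anger training lines, per the counts above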

semantic_data.zip: Multi-Domain Sentiment Dataset, English sentiment data with positive/negative labels

The Multi-Domain Sentiment Dataset parsed into txt files, keeping only the text and its label. Binary positive/negative classification across four domains (dvd, kitchen, books, electronics), each with 1,000 positive and 1,000 negative examples. Each line is label\tsentence. See https://www.cs.jhu.edu/~mdredze/publications/sentiment_acl07.pdf for details.

2019-05-17
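
A quick per-domain balance check in the same spirit (the file names here are hypothetical, since the archive layout is not spelled out above):

    from collections import Counter

    # Hypothetical layout: one "label<TAB>sentence" file per domain.
    for domain in ["dvd", "kitchen", "books", "electronics"]:
        with open(domain + ".txt", encoding="utf-8") as f:
            labels = Counter(line.split("\t", 1)[0] for line in f)
        print(domain, labels)  # expect 1000 positive and 1000 negative each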

CapsuleNetwork: understanding capsule networks through TensorFlow reproduction code (Dynamic Routing Between Capsules)

Understanding capsule networks (Dynamic Routing Between Capsules) by working through TensorFlow reproduction code. Paper: https://arxiv.org/abs/1710.09829 TensorFlow reimplementation: https://github.com/naturomics/CapsNet-Tensorflow

2018-10-17
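
For reference, a minimal NumPy sketch of the paper's routing-by-agreement loop (shapes and the 3-iteration count follow the paper; this is an illustration, not the linked repository's code):

    import numpy as np

    def squash(s, eps=1e-8):
        # Squash non-linearity: preserves direction, maps the norm into [0, 1).
        sq = np.sum(s ** 2, axis=-1, keepdims=True)
        return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

    def dynamic_routing(u_hat, num_iters=3):
        # u_hat: prediction vectors, shape [num_in, num_out, dim_out].
        num_in, num_out, _ = u_hat.shape
        b = np.zeros((num_in, num_out))  # routing logits
        for _ in range(num_iters):
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
            s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum of predictions
            v = squash(s)                                         # output capsules [num_out, dim_out]
            b = b + (u_hat * v[None]).sum(axis=-1)                # agreement update
        return v

    v = dynamic_routing(np.random.randn(1152, 10, 16))
    print(v.shape)  # (10, 16)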
