Summarization datasets
CNN/Daily Mail
Gigaword
The Gigaword corpus [Graff and Cieri, 2003] is preprocessed identically to [Rush et al., 2015], which yields around 3.8M training samples, 190K validation samples, and 1,951 test samples for evaluation. Each input-summary pair consists of the first sentence of the source article (the input) and its headline (the summary).
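As a rough illustration of the pair structure described above, here is a minimal sketch in Python. The names `SummaryPair`, `first_sentence`, and `make_pair` are hypothetical, and the regex-based sentence splitter is a simplifying assumption; it does not reproduce the preprocessing of [Rush et al., 2015].

```python
import re
from typing import NamedTuple


class SummaryPair(NamedTuple):
    """One training example (hypothetical container, not from the paper)."""
    source: str  # model input: first sentence of the article
    target: str  # reference summary: the headline


def first_sentence(text: str) -> str:
    # Naive splitter (assumption): cut at the first '.', '!' or '?'
    # that is followed by whitespace. Real pipelines use a proper
    # sentence segmenter.
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]


def make_pair(headline: str, article: str) -> SummaryPair:
    # Pair the article's first sentence with its headline.
    return SummaryPair(source=first_sentence(article), target=headline.strip())


if __name__ == "__main__":
    pair = make_pair(
        "Stocks fall",
        "Stocks fell sharply on Monday. Analysts blamed rates.",
    )
    print(pair.source)
    print(pair.target)
```

Text without sentence-ending punctuation is returned whole by `first_sentence`, so very short articles still produce a pair.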
Chinese summarization datasets
LCSTS, a large-scale Chinese short text summarization dataset [Hu et al., 2015], collected and constructed from the Chinese microblogging website Sina Weibo.
Essay generation datasets
Dataset and code links
Paper: Topic-to-Essay Generation with Neural Networks
Dataset description:
In order to guarantee the quality of the crawled text, we only crawl the compositions which contain some reviews and scores. The process of the data collection is summarized as follows: a) We crawl 228,110 a