Chinese Literature Publishing in the Era of UGC


Introduction

In April 2019, a news report was widely forwarded on Moments, a virtual community embedded in WeChat, one of the most popular social media platforms in China. A professor at the Economics and Management School of Wuhan University found that his daughter had been arrested for selling her original boys' love novels through the online shopping platform Taobao; she had been reported to the authorities by another online writer after a dispute between them. The arrested woman, who wrote under the pen name Shenhai, was charged with "illegal business operations".
Although we will not go deeper into this case, it reveals some facts, new trends, and even dilemmas in Chinese literature writing and publishing in the era of UGC. First, any author or publisher who wants to publish a book must submit an application to the General Administration of Press and Publication; a work cannot appear on the market before it has been approved and assigned an ISBN. Second, in the era of UGC, online literature platforms and writers have sprung up, pushing readers to read online. Third, as China is still not a sexually liberal country and the censorship system does not allow such publications to be widely exposed to the public, some subcultural works, like boys' love novels, can only circulate through online platforms. Young people today seem to read online ever more frequently.
In light of this trend, we want to understand the status quo of Chinese literature publishing and how it is affected by online literature platforms.
We used data from Douban Books to answer the following questions:

  1. How has the number of published books changed in recent years?
  2. Who published the most in these years? What are the average ratings of these top authors, and are they famous authors?
  3. Which publishers published the most, and do they provide readers with highly rated books?
  4. Has reading paper books become more expensive?

Data source

Douban.com, launched on March 6, 2005, is an influential Chinese social networking website. It allows users to record their information and create content related to film, books, music, recent events, and activities in Chinese cities. Since its launch, it has become one of China's most popular commercial platforms for user-generated content; Douban acts like a hybridized Amazon-IMDb-Facebook Web 2.0 site. Around 270 million visitors reportedly access this online social network each month, creating and sharing information, recommendations, and ratings with both dedicated followers and casual users (Yecies 2016).
Among the many reader-generated tags on Douban, we chose "Chinese literature" (中国文学) as the single tag for our research. This tag keeps the focus on our research target and can represent the situation of Chinese publishing. However, Douban only allows access to the first 50 pages of a tag, so the publications covered range from January 2013 to November 2019.

Data Acquisition

We scraped 6 variables: title, author, publisher, date of publication, price, and rating. Note that 4 of them (author, publisher, date of publication, and price) are stored in the same class, which means we actually only picked 3 lists from the HTML.

# containers for the scraped fields
titles = []
authors = []
ratings_comments = []

The titles are stored in the h2 elements of < div class="info">. We first used BeautifulSoup to find all elements of this class, then used for loops to pick each h2, whose text is the title. The strip() and replace() calls delete unnecessary whitespace.
The author and the other 3 variables are stored in < div class="pub">. We again used find_all() to scrape these data, which are cleaned later.
For the rating, we at first repeated the same approach, but the resulting list collected one item fewer than the other two. The reason lies in the website itself: one item has no rating information, which causes the mismatch. We therefore decided to collect < div class="star clearfix"> instead, which contains the star, the rating, and the number of comments; this ensures that all the data match the titles. The rating itself can be extracted at the data-cleaning stage.
To turn pages, we first set headers with a User-Agent string, which makes requests look like they come from a browser and helps evade anti-crawler measures. Press F12, click any item in the Name column, and you can find the User-Agent under Request Headers.
We scraped the webpages sorted by date of publication. The base URL is made up as 'https://book.douban.com/tag/%E4%B8%AD%E5%9B%BD%E6%96%87%E5%AD%A6?start=' + '0' + '&type=R', where the middle part is the page code; each page turn adds 20 to this number. Note that only 50 pages are viewable: entering page 51 shows no information.
We used the following code to set the range of pages and requests.get() to scrape all 50 pages.

import requests
import bs4 as bs
from time import sleep
from random import randint

pages = [str(i) for i in range(0, 1000, 20)]  # 50 pages, 20 books per page
hds = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
for pagecode in pages:
    source = requests.get('https://book.douban.com/tag/%E4%B8%AD%E5%9B%BD%E6%96%87%E5%AD%A6?start=' + pagecode + '&type=R', headers=hds)
    sleep(randint(1, 3))  # random pause between requests to avoid being blocked
    print(source.status_code)


    # still inside the loop over pagecode
    soup = bs.BeautifulSoup(source.content, 'html.parser')
    for t in soup.find_all('div',class_='info'):
        for title in t.find_all('h2',class_=""):
            print(title.text.strip())
            titles.append(title.text.strip().replace('\n',''))
    for author in soup.find_all("div",class_="pub") :
        print(author.text.strip())
        authors.append(author.text.strip()) 
    for rating_comment in soup.find_all('div',class_='star clearfix'):
        print(rating_comment.text.strip().replace('人评价)','').replace('\n',''))
        ratings_comments.append(rating_comment.text.strip().replace('人评价)','').replace('\n',''))


Data Cleaning

To extract the rating from the rating_comments column, we split it on the '(' symbol. The messy suffix 'XX人评价)' ('rated by XX people') had already been deleted at the collection stage with replace(). The resulting rating column was then joined back into the original dataframe.
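A minimal pandas sketch of this split; the sample strings are illustrative of the post-scraping format, and real scraped values may carry extra whitespace.

```python
import pandas as pd

# Illustrative strings in the post-scraping "rating(comment_count" format
# (the trailing '人评价)' was already stripped with replace() during collection).
ratings_comments = ["8.4(1234", "9.1(567"]

df = pd.DataFrame({"rating_comment": ratings_comments})
# Split on '(' into a rating column and a comment-count column.
df[["rating", "comments"]] = df["rating_comment"].str.split("(", expand=True)
df["rating"] = df["rating"].astype(float)
df["comments"] = df["comments"].astype(int)
```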
After acquiring the data, we did further cleaning. We found that the second field, "author", actually stores the author, the press, the publication date, and sometimes the translator's name. After many attempts to separate them, we decided to combine Python with manual work. First, we split most items with split("/", expand=True), but some special items carry extra information, such as translators and illustrators, that we could not handle in code. Second, since there were only a few such items, we cleaned them manually in Excel. We also found that some presses were not entered under a uniform name, so we identified and merged them one by one with Excel's filter function. Third, the "price" field included other currencies, such as US dollars and New Taiwan dollars, plus one item not for sale; we deleted these items since our research target is Chinese publications.
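The code-based part of this split can be sketched as follows; the sample rows are made up for illustration, assuming the common "author / press / date / price" layout.

```python
import pandas as pd

# Hypothetical rows in the raw "pub" format: author / press / date / price
pub = pd.DataFrame({"pub": [
    "汪曾祺 / 人民文学出版社 / 2014-1 / 35.00元",
    "曹雪芹 / 人民文学出版社 / 1996-12 / 59.70元",
]})
parts = pub["pub"].str.split("/", expand=True)
parts.columns = ["author", "press", "date", "price"]
# Strip the whitespace left around each separator.
parts = parts.apply(lambda col: col.str.strip())
```

Rows with extra fields (translators, illustrators) produce more columns here, which is exactly the case that was handed off to Excel.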

We then re-imported the CSV into Jupyter Notebook and dealt with the year: we moved the date column to the end, split it on the '-' symbol, and dropped the last two parts, which contain the month and day.
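A minimal sketch of that year extraction; the dates are made up but follow Douban's "year-month" or "year-month-day" format.

```python
import pandas as pd

# Hypothetical publication dates in Douban's "year-month(-day)" format.
df = pd.DataFrame({"date": ["2014-1", "2017-5-20", "2019-11"]})
# Split on '-' and keep only the first piece, the year.
df["year"] = df["date"].str.split("-").str[0].astype(int)
```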

Visualization and Results

In response to the questions above, and based on the data from 2013 to 2019, we computed some statistics and made visualizations to present our findings.

![Line chart: number of Chinese literature publications per year, 2013–2019](https://img-blog.csdnimg.cn/20191216225321433.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2Rlc2s2OTY=,size_16,color_FFFFFF,t_70)

Overall, the number of publications dropped slightly, despite a temporary rise. The line chart shows that the number of Chinese literature publications peaked in 2017 and then receded a little in 2018. Generally speaking, the number of publications increased somewhat after 2016.

In the paper-book market, republished works by traditional Chinese writers are the mainstream, and their average ratings are mixed, ranging between 8.0 and 9.4.
We chose the Top 10 most-published authors and their average ratings to illustrate the status quo of Chinese literature authors.
From these two bar plots, we find that Wang Zengqi's (汪曾祺) works are published the most, while classical Chinese literature, such as Dream of the Red Mansion and the poems of Cao Zhi, gets the highest ratings.
Generally speaking, in the paper-book market, works from the Republic of China era resonate most with contemporary readers, whether measured by the number of publications or by the ratings.
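The counts and average ratings behind such bar plots can be derived with a single groupby. This is a minimal sketch on made-up rows; the column names are assumptions about the cleaned dataframe, not its actual schema.

```python
import pandas as pd

# Made-up mini-dataset; the real cleaned dataframe has one row per book.
books = pd.DataFrame({
    "author": ["汪曾祺", "汪曾祺", "曹雪芹", "曹植"],
    "rating": [8.6, 9.0, 9.6, 9.2],
})
# Count books and average the ratings per author, then rank by count.
top_authors = (
    books.groupby("author")
         .agg(count=("rating", "size"), avg_rating=("rating", "mean"))
         .sort_values("count", ascending=False)
         .head(10)
)
```

The same pattern, grouped on a press column instead, yields the publisher rankings discussed below.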
These publishing houses are all established publishers in the field of traditional literature, with average ratings around 8.1.
From the publishers' histograms, we can easily see that the People's Literature Publishing House (人民文学出版社) published the most works, and that the average ratings of the Top 10 most-published publishers are very close, between 8.0 and 8.2. Readers are therefore generally satisfied with these books.
We picked the Top 10 most popular books according to the number of comments and the fame of the press. Almost all of them are classical Chinese literature.

Limitations and Reflections

We found that the pages we scraped are not well structured: some information is stored in the same class as plain text, which caused trouble during data cleaning. Moreover, even though the fields are divided by '/', about 50 items' author information is not structured like the others. We found it hard to clean and restructure this messy information in code alone, so we exported it to CSV and cleaned it by hand in Excel. We cannot expect a perfectly structured website to scrape; all we can do is see the problem and solve the problem.

The data-cleaning stage was too complex: we kept exporting the dataset, processing it in Excel, and importing the CSV back into Jupyter Notebook, which is also unfavorable for data management. Moreover, because of the hands-on cleaning and updating of information, our results cannot be exactly reproduced.
We should have studied harder last semester and practiced more.

Yecies, B., Yang, J., Shim, A. G., Soh, K. R., & Berryman, M. J. (2016). The Douban online social media barometer and the Chinese reception of Korean popular culture flows.
