[Clojure] Data Collection and Data Analysis on the music of www.xiami.com - Part 4

Source code

https://github.com/bluesilence/Lisp/tree/master/clojure/projects/xiami-crawler


Data Recording

The parsed data is recorded in .log files:

https://github.com/bluesilence/Lisp/blob/master/clojure/projects/xiami-crawler/records/


A last_album_id.log file records the last album that was crawled, so that the crawler can resume from where it previously stopped:

https://github.com/bluesilence/Lisp/blob/master/clojure/projects/xiami-crawler/logs/last_album_id.log
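
The resume step could look roughly like the sketch below. It is not taken from the repository: it assumes the file holds a single integer id, and the function names read-last-album-id and save-last-album-id! are hypothetical.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn read-last-album-id
  "Returns the id stored in `path`, or `default` when the file is
  missing or blank. Assumes the file contains one integer."
  [path default]
  (let [f (io/file path)]
    (if (.exists f)
      (let [s (str/trim (slurp f))]
        (if (str/blank? s)
          default
          (Long/parseLong s)))
      default)))

(defn save-last-album-id!
  "Overwrites `path` with `id`, creating parent directories if needed."
  [path id]
  (io/make-parents path)
  (spit path (str id)))
```

On startup the crawler would call read-last-album-id and continue from the next id, calling save-last-album-id! after each album is recorded.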


Here are some examples of crawled data:

Songs

2014-06-26;238241;花;4627;19883
2014-06-26;238242;Punk;1903;19883
2014-06-26;238240;Stupid;1364;19883
2014-06-26;238239;The End Of The World;1289;19883
2014-06-26;238243;ナイフ;1025;19883
2014-06-26;251086;東京心中;4896;20927
2014-06-26;251071;千鹤;2685;20924
2014-06-26;251046;그녀가 울고 있네요;2026;20921
2014-06-26;251077;Cockroach;1808;20926
2014-06-26;251089;7月8日;1803;20927
2014-06-26;251073;Regret;1685;20925
2014-06-26;251085;花言葉;1617;20927
2014-06-26;251095;Beautiful 5;1598;20929
2014-06-26;251090;さらば;1351;20927
2014-06-26;251083;ザクロ型の憂鬱;1302;20927
2014-06-26;251070;Hyena;1226;20924
2014-06-26;251080;Sugar Pain;1132;20926
2014-06-26;251093;別れ道;1095;20928
2014-06-26;251057;머리칼;1082;20921
2014-06-26;251047;반대편에 서서;1033;20921

 

Albums

2014-06-26;19883;ELLEGARDEN;EP、单曲;朋克 Punk Rock;4263;9.5;4;62
2014-06-26;20836;カラダとカラダ;录音室专辑;N/A;4432;-1;0;3
2014-06-26;20837;風味堂;录音室专辑;N/A;4432;-1;0;5
2014-06-26;20921;The Very Surprise;录音室专辑;韩国抒情曲 Korean Ballad;4462;-1;1;13
2014-06-26;20925;Regret-Auditory Impression-;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.8;2;46
2014-06-26;20929;Cockayne Soup;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.3;3;49
2014-06-26;20924;Hyena-Optical Impression-;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.4;5;50
2014-06-26;20928;スペルマルガリィタ;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.6;4;56
2014-06-26;20926;蛾蟇;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.9;4;77
2014-06-26;20927;DISORDER ;录音室专辑;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.9;10;93

Artists

2014-06-26;4263;ELLEGARDEN
2014-06-26;4432;風味堂
2014-06-26;4432;風味堂
2014-06-26;4462;Gavy NJ
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE

There are duplicate artists because the data is crawled from each album's page, and each artist can have multiple albums.
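
Before analysis, the duplicates can be dropped by artist id. The following is a minimal sketch, not code from the project. The field names :date, :id and :name are guesses based on the sample lines above, and parse-artist and distinct-artists are hypothetical helpers.

```clojure
(require '[clojure.string :as str])

(defn parse-artist
  "Splits one artist record into a map. The limit of 3 keeps any ';'
  inside the artist name intact. Field names are assumptions."
  [line]
  (let [[date id artist-name] (str/split line #";" 3)]
    {:date date
     :id   (Long/parseLong id)
     :name artist-name}))

(defn distinct-artists
  "Returns the parsed records in their original order, keeping only
  the first record seen for each artist id."
  [lines]
  (let [step (fn [[seen out] artist]
               (if (seen (:id artist))
                 [seen out]
                 [(conj seen (:id artist)) (conj out artist)]))]
    (second (reduce step [#{} []] (map parse-artist lines)))))
```

Applied to the ten sample lines above, this would leave four artists: ELLEGARDEN, 風味堂, Gavy NJ and the GazettE.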


Performance Issue

I ran the crawler for 8 days with 10 threads. The result was 500,000 albums, together with their songs and artists.

However, xiami has far more albums.

The problem is that many ids point to an empty album (e.g. http://www.xiami.com/artist/300000), which dramatically lowers the share of requests that return useful data.

One possible solution is to jump to new albums from the links on the current album's page, which is exactly what a "crawler" is supposed to do.
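
As a rough illustration of that idea, here is a breadth-first sketch that follows links instead of enumerating ids. It is not part of the current crawler. related-ids-fn is a hypothetical function that would fetch an album page and return the album ids linked from it, and max-albums is an assumed cap on how many albums to visit.

```clojure
(defn crawl-from
  "Breadth-first walk starting at `seed-id`. Returns the album ids in
  the order they were visited, stopping after `max-albums` ids or when
  no unvisited links remain."
  [seed-id related-ids-fn max-albums]
  (loop [queue   (conj clojure.lang.PersistentQueue/EMPTY seed-id)
         visited #{seed-id}
         order   []]
    (if (or (empty? queue) (>= (count order) max-albums))
      order
      (let [id      (peek queue)
            ;; Only enqueue ids that have not been seen before.
            new-ids (remove visited (distinct (related-ids-fn id)))]
        (recur (into (pop queue) new-ids)
               (into visited new-ids)
               (conj order id))))))
```

Because every id in the queue comes from a real link, the crawler would not spend requests on ids that point to empty pages, although albums that nothing links to would be missed.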


For now, I can live with the limited data and do some data analysis.
