Source code
https://github.com/bluesilence/Lisp/tree/master/clojure/projects/xiami-crawler
Data Recording
The parsed data is recorded in .log files:
https://github.com/bluesilence/Lisp/blob/master/clojure/projects/xiami-crawler/records/
There is a last_album_id.log that records the last album crawled, so the crawler can resume from where it previously stopped:
https://github.com/bluesilence/Lisp/blob/master/clojure/projects/xiami-crawler/logs/last_album_id.log
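The resume mechanism can be sketched in a few lines of Clojure. This is a minimal illustration assuming the checkpoint file holds a single album id; the file path matches the log above, but the function names are illustrative, not the project's actual API.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Illustrative names; the real crawler's API may differ.
(def checkpoint-file "logs/last_album_id.log")

(defn save-checkpoint!
  "Persist the id of the last album crawled."
  [album-id]
  (io/make-parents checkpoint-file)
  (spit checkpoint-file (str album-id)))

(defn load-checkpoint
  "Return the album id to resume from, or default-id when no checkpoint exists."
  [default-id]
  (if (.exists (io/file checkpoint-file))
    (Long/parseLong (str/trim (slurp checkpoint-file)))
    default-id))
```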
Here are some examples of crawled data:
Songs
2014-06-26;238241;花;4627;19883
2014-06-26;238242;Punk;1903;19883
2014-06-26;238240;Stupid;1364;19883
2014-06-26;238239;The End Of The World;1289;19883
2014-06-26;238243;ナイフ;1025;19883
2014-06-26;251086;東京心中;4896;20927
2014-06-26;251071;千鹤;2685;20924
2014-06-26;251046;그녀가 울고 있네요;2026;20921
2014-06-26;251077;Cockroach;1808;20926
2014-06-26;251089;7月8日;1803;20927
2014-06-26;251073;Regret;1685;20925
2014-06-26;251085;花言葉;1617;20927
2014-06-26;251095;Beautiful 5;1598;20929
2014-06-26;251090;さらば;1351;20927
2014-06-26;251083;ザクロ型の憂鬱;1302;20927
2014-06-26;251070;Hyena;1226;20924
2014-06-26;251080;Sugar Pain;1132;20926
2014-06-26;251093;別れ道;1095;20928
2014-06-26;251057;머리칼;1082;20921
2014-06-26;251047;반대편에 서서;1033;20921
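A song record can be parsed by splitting on the `;` separator. The column meanings below (date; song id; title; play count; album id) are my reading of the sample data, not documented by the source.

```clojure
(require '[clojure.string :as str])

;; Column names are an interpretation of the sample data above.
(defn parse-song-record [line]
  (let [[date song-id title plays album-id] (str/split line #";")]
    {:date date
     :song-id (Long/parseLong song-id)
     :title title
     :plays (Long/parseLong plays)
     :album-id (Long/parseLong album-id)}))

(parse-song-record "2014-06-26;238241;花;4627;19883")
;; => {:date "2014-06-26", :song-id 238241, :title "花", :plays 4627, :album-id 19883}
```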
Albums
2014-06-26;19883;ELLEGARDEN;EP、单曲;朋克 Punk Rock;4263;9.5;4;62
2014-06-26;20836;カラダとカラダ;录音室专辑;N/A;4432;-1;0;3
2014-06-26;20837;風味堂;录音室专辑;N/A;4432;-1;0;5
2014-06-26;20921;The Very Surprise;录音室专辑;韩国抒情曲 Korean Ballad;4462;-1;1;13
2014-06-26;20925;Regret-Auditory Impression-;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.8;2;46
2014-06-26;20929;Cockayne Soup;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.3;3;49
2014-06-26;20924;Hyena-Optical Impression-;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.4;5;50
2014-06-26;20928;スペルマルガリィタ;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.6;4;56
2014-06-26;20926;蛾蟇;EP、单曲;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.9;4;77
2014-06-26;20927;DISORDER ;录音室专辑;新金属 Nu Metal,视觉摇滚 Visual Rock;4464;9.9;10;93
Artists
2014-06-26;4263;ELLEGARDEN
2014-06-26;4432;風味堂
2014-06-26;4432;風味堂
2014-06-26;4462;Gavy NJ
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
2014-06-26;4464;the GazettE
There are duplicate artist records because the data is crawled from each album's page, and each artist can have multiple albums.
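Since the artist log can contain duplicates, one way to deduplicate is to key each record on the artist id (the second `;`-separated field). A sketch, with a hypothetical `dedupe-artists` helper:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: keep one record per artist id.
(defn dedupe-artists [lines]
  (->> lines
       (map (fn [line]
              (let [[date id name] (str/split line #";")]
                {:date date :artist-id id :name name})))
       (group-by :artist-id)
       vals
       (map first)))

(count (dedupe-artists ["2014-06-26;4464;the GazettE"
                        "2014-06-26;4464;the GazettE"
                        "2014-06-26;4263;ELLEGARDEN"]))
;; => 2
```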
Performance Issue
I crawled for 8 days with 10 threads. The result was 500,000 albums, along with their songs and artists.
However, xiami has far more albums.
The problem is that many ids point to an empty album (e.g. http://www.xiami.com/artist/300000), which dramatically lowers the proportion of effective crawls.
One possible solution would be to jump to new albums discovered from the current album's page, which is exactly what a "crawler" is defined to do.
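The link-following idea can be sketched as a frontier-based crawl: seed a queue with known album ids and expand it with ids discovered on each fetched page, instead of scanning sequential ids. `fetch-page` and `extract-album-ids` are hypothetical stand-ins; the real extraction would depend on xiami's page markup.

```clojure
;; Sketch of a frontier-based crawl. `fetch-page` and `extract-album-ids`
;; are hypothetical: fetch-page downloads an album page by id, and
;; extract-album-ids pulls linked album ids out of a fetched page.
(defn crawl [seed-ids extract-album-ids fetch-page]
  (loop [frontier (into clojure.lang.PersistentQueue/EMPTY seed-ids)
         seen #{}
         results []]
    (if-let [id (peek frontier)]
      (if (seen id)
        ;; Already crawled: just drop it from the frontier.
        (recur (pop frontier) seen results)
        (let [page (fetch-page id)
              new-ids (remove seen (extract-album-ids page))]
          (recur (into (pop frontier) new-ids)
                 (conj seen id)
                 (conj results page))))
      results)))
```

With this approach the crawler only visits albums that are actually linked from somewhere, so empty ids are never fetched unless they appear as links.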
For now, I can live with the limited data and do some data analysis.