Crawling a novel review site with Python: finding novels the big-data way

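Yousuu (优书网) is a third-party novel review site popular with veteran web-novel readers.
The first step is to crawl Yousuu's Bookstore (书库) section.
Paging through the Bookstore listing yields the information for each book.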
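The paging step can be sketched on its own, without touching the network. The snippet below is a minimal illustration, assuming the listing shows 20 books per page (the figure the code in this post uses). The function name `build_page_urls` and the example base URL are hypothetical and are not part of the original script.

```python
import math


def build_page_urls(base_url, total_books, page_size=20):
    """Return one URL per listing page, numbered from 1.

    base_url is assumed to end with the "page=" query parameter,
    so the page number can simply be appended to it.
    """
    pages = math.ceil(total_books / page_size)   # round up so a partial last page is included
    return [f"{base_url}{page}" for page in range(1, pages + 1)]


# Hypothetical base URL, used only for illustration.
urls = build_page_urls("https://example.com/bookstore/?page=", total_books=45)
print(len(urls))
```

With 45 books and 20 per page, the last page holds only 5 books, which is why the page count is rounded up rather than truncated.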

import json
import math

import requests
from lxml import etree


def get_url():
    """Return the page count and the list of Bookstore page URLs to crawl."""
    url = "http://www.yousuu.com/bookstore/?channel&classId&tag&countWord&status&update&sort&page="
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    html = requests.get(url + "1", headers=headers)   # fetch page 1 to read the total
    html.encoding = "UTF-8"
    js_info = xpathnode(html)

    js_info = js_info.get('Bookstore')
    account_info = js_info.get('total')               # total number of books
    pages = math.ceil(account_info / 20)              # 20 books per page, rounded up
    urls = [url + str(i + 1) for i in range(pages)]   # URLs waiting to be crawled
    return pages, urls

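def xpathnode(html):
    """Parse the JSON data embedded in the page's first <script> element."""
    tree = etree.HTML(html.text)
    node = tree.xpath('//script/text()')   # text content of every <script> element
    info = node[0][25:-122]                # cut off the JavaScript around the JSON (fixed offsets)
    js_info = json.loads(info)
    return js_info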
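The fixed slice `[25:-122]` in `xpathnode` stops working as soon as the surrounding JavaScript changes length. One possible alternative is sketched below, assuming the embedded payload is strict JSON and that the first `{` in the script text opens it. The helper name `extract_json_object` and the sample string are hypothetical, and the site's real script text may look different.

```python
import json


def extract_json_object(script_text):
    """Decode the first JSON object found in script_text, ignoring what follows it."""
    start = script_text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in script text")
    # raw_decode parses one JSON value starting at `start` and also returns
    # the index where it ended, so trailing JavaScript does not cause an error.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Made-up script text for illustration only.
sample = 'window.__HYPOTHETICAL_STATE__ = {"Bookstore": {"total": 45, "books": []}};(function(){})();'
data = extract_json_object(sample)
print(data["Bookstore"]["total"])
```

This trades the hard-coded offsets for a dependence on where the first `{` appears, so it is still worth checking against the actual page source before relying on it.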

def crawl():    # the core crawl loop
    pages, url_combine = get_url()
    conn = conn_sql()
    create_tab(conn)
    cursor = conn.cursor()
    flag = 0                      # number of pages requested so far
    for url in url_combine:       # page turning
        flag = flag + 1
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
        html = requests.get(url, headers=headers)
        html.encoding = "UTF-8"
        book_js_info = xpathnode(html)
        book_js_info = book_js_info.get('Bookstore')
        book_js_info = book_js_info.get('books')    # list of books on this page
        print('rate of progress:' + str(round(flag * 100 / pages, 2)) + '%')   # percentage of pages done
        for i in range(20):       