Using Selenium in Python to simulate a browser and scrape hard-to-crawl pages such as asynchronously loaded ones

abc200941410128

Category: Crawlers. Tags: python, crawler

Article link: https://blog.csdn.net/abc200941410128/article/details/75579394

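Background
The background was covered in my earlier post, "Using the RSelenium package or Rwebdriver in R to simulate a browser and scrape hard-to-crawl pages such as asynchronously loaded ones" (http://blog.csdn.net/abc200941410128/article/details/72511931).
This post supplies the Python implementation promised there, so I will not repeat the background or the introductions to the packages involved.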
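Program notes
The script scrapes book information from Qidian (起点中文网) and stores it in a local MySQL database. A few details are worth mentioning:
1. Some books have no rating data, so a try…except…pass statement handles the missing field, which avoids errors and inconsistent data formats.
2. For reasons I could not determine, Firefox always crashed after scraping more than 500 (but fewer than 1000) books. I therefore restart the browser every 300 books. This costs some time but avoids the crash. Chrome, for its part, kept failing at startup with every version I tried, so it was less usable than Firefox.
3. Writing rows to the database one at a time is too slow, and writing everything in one go is not appropriate either, so, as in item 2, rows are written in batches of 300.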
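
Items 1 to 3 boil down to two small patterns: fall back to a default when a field is missing, and every 300 books flush the collected rows and restart the browser. The following is only a minimal sketch of those patterns, not the script itself. The names `safe_text` and `scrape_in_batches`, and the `fetch`, `flush`, and `restart` callbacks, are hypothetical and introduced here for illustration; in the script they would correspond roughly to get_book_info, to_sql, and quitting and reopening Firefox.

```python
def safe_text(getter, default=None):
    """Return getter(), or `default` if the lookup raises (e.g. a missing element)."""
    try:
        return getter()
    except Exception as exc:  # assumption: in real use, narrow this to the lookup error
        print("field missing, using default: %r" % (exc,))
        return default


def scrape_in_batches(links, fetch, flush, restart, batch_size=300):
    """Call fetch(link) for every link; every `batch_size` links, write the
    collected rows with flush(rows) and call restart() (e.g. reopen the browser)."""
    rows = []
    for count, link in enumerate(links, start=1):
        try:
            rows.append(fetch(link))
        except Exception as exc:
            # A failed page is skipped but still counts toward the batch size
            print("skipping %s: %r" % (link, exc))
        if count % batch_size == 0:
            if rows:
                flush(rows)
                rows = []
            restart()
    # Write whatever is left after the last full batch
    if rows:
        flush(rows)
```

Counting from 1 means the first flush happens after exactly `batch_size` books, not after the very first page.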
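Code
The full code is posted below for reference. Once you know how to drive a simulated browser, the vast majority of web pages can be scraped; what remains is mostly a question of speed. Of course, when a page can be scraped without a browser, that is the better choice.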

# -*- coding: utf-8 -*-
"""
Created on Fri Apr 28 11:32:42 2017

@author: tiger
"""
from selenium import webdriver
from bs4 import BeautifulSoup
import datetime
import random
import requests
import MySQLdb

# Collect the links of all listed books
# Return the detail-page link of every book on one list page
def get_link(soup_page):
    items = soup_page('div', 'book-mid-info')
    # Extract the links, prepending the scheme to each href
    links = []
    for item in items:
        links.append('https:' + item.h4.a.get('href'))
    return links

# Open each link and extract the fields we need.
# Note: find_element_by_xpath / find_element_by_id are the Selenium API of 2017;
# Selenium 4.3+ removed them in favor of driver.find_element(By.XPATH, ...).
def get_book_info(link):
    driver.get(link)
    #soup = BeautifulSoup(driver.page_source)
    # ID built from the current timestamp plus a random 4-digit number
    book_id = datetime.datetime.now().strftime("%Y%m%d%H%M%S") + str(random.randint(1000, 9999))
    # Title
    title = driver.find_element_by_xpath("//div[@class='book-information cf']/div/h1/em").text
    # Author
    author = driver.find_element_by_xpath("//div[@class='book-information cf']/div/h1/span/a").text
    # Genre
    types = driver.find_element_by_xpath("//div[@class='book-information cf']/div/p[1]/a").text
    # Status
    status = driver.find_element_by_xpath("//div[@class='book-information cf']/div/p[1]/span[1]").text
    # Word count
    words = driver.find_element_by_xpath("//div[@class='book-information cf']/div/p[3]/em[1]").text
    # Clicks
    cliks = driver.find_element_by_xpath("//div[@class='book-information cf']/div/p[3]/em[2]").text
    # Recommendations
    recoms = driver.find_element_by_xpath("//div[@class='book-information cf']/div/p[3]/em[3]").text
    # Number of reviews (votes); not every book has them, so fall back to 0
    try:
        votes = driver.find_element_by_xpath("//p[@id='j_userCount']/span").text
    except Exception as e:
        votes = 0
        print(e)
    # Score
    score = driver.find_element_by_id("j_bookScore").text
    # Other information (the book introduction), with newlines removed
    info = driver.find_element_by_xpath("//div[@class='book-intro']").text.replace('\n', '')

    return (book_id, title, author, types, status, words, cliks, recoms, votes, score, info)

# Save the scraped rows to MySQL
def to_sql(data):

    conn=MySQLdb.connect("localhost","root","tiger","test",charset="utf8" )
    cursor = conn.cursor()
    sql_create_database = 'create database if not exists test'
    cursor.execute(sql_create_database)
#    try :
#        cursor.select_db('test')
#    except (ZeroDivisionError,Exception) as e:
#        print e
    #cursor.execute("set names gb2312")
    cursor.execute('''create table if not exists test.tiger_book2(book_id bigint(80),
                                                              title varchar(50),
                                                              author varchar(50),
                                                              types varchar(30),
                                                              status varchar(20),
                                                              words numeric(8,2),
                                                              cliks numeric(10,2),
                                                              recoms numeric(8,2),
                                                              votes varchar(20),
                                                              score varchar(20),
                                                              info varchar(3000),
                                                              primary key (book_id));''')
    cursor.executemany('insert ignore into test.tiger_book2 values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s);',data)
    cursor.execute('select * from test.tiger_book2 limit 5;')

    conn.commit()
    cursor.close()
    conn.close()
# Visit every list page and collect the book links
base_url = "http://a.qidian.com/?size=-1&sign=-1&tag=-1&chanId=-1&subCateId=-1&orderId=&update=-1&page="

links = []

Max_Page = 30090

rank = 0

for page in range(1, Max_Page + 1):
    print("Processing page %d. Please wait..." % page)
    CurrentUrl = base_url + str(page) + '&month=-1&style=1&action=-1&vip=-1'
    CurrentSoup = BeautifulSoup(requests.get(CurrentUrl).text, "lxml")
    links.append(get_link(CurrentSoup))
    #sleep(1)

# Spot check: the 20th link on page 10
print(links[9][19])

# Scrape the details of every book
books = []
rate = 1
driver = webdriver.Firefox()

for i in range(0, Max_Page):
    for j in range(0, 20):
        try:
            print("Getting information of book #%d." % rate)
            books.append(get_book_info(links[i][j]))
            #sleep(0.8)
        except Exception as e:
            print(e)

        rate += 1
    # Every 15 pages (15 x 20 = 300 books): write the batch and restart Firefox
    if (i + 1) % 15 == 0:
        driver.quit()
        # Write to the database
        to_sql(books)
        books = []
        driver = webdriver.Firefox()

driver.quit()
to_sql(books)
# Add a sequential id (unused)
#n=len(books)
#books=zip(*books)
#books.insert(0,range(1,n+1))
#books=zip(*books)
##print books[198]
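
The list pages in the script are fetched with requests and parsed without a browser, which is the "avoid the browser when you can" case. As a dependency-free illustration of what get_link extracts, here is a rough sketch using only the standard library. `BookLinkParser` and `extract_links` are hypothetical names, and the sketch assumes the markup has the shape the script relies on: a title link inside an `h4` inside `div.book-mid-info`, with an href that still needs the `https:` prefix.

```python
from html.parser import HTMLParser


class BookLinkParser(HTMLParser):
    """Collect the href of each <a> inside an <h4> inside <div class="book-mid-info">."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # <div> nesting depth inside the target div (0 = outside)
        self.in_h4 = False   # True while inside an <h4> of the target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "book-mid-info" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "h4" and self.depth > 0:
            self.in_h4 = True
        elif tag == "a" and self.in_h4 and attrs.get("href"):
            # Same step as get_link: prepend the scheme to the href
            self.links.append("https:" + attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False
        elif tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_links(html):
    """Return the book links found in one list page's HTML."""
    parser = BookLinkParser()
    parser.feed(html)
    return parser.links
```

BeautifulSoup remains the more convenient tool for this job; the sketch only shows that no browser is involved in this step.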

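Comparison
Selenium is easier to install for Python than for R, and there is no need to start Selenium from the command prompt. However, without any performance tuning, the R version ran faster and had fewer encoding problems.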
