[Repost] BeautifulSoup vs. Selenium: A Comparison

自由翱翔_雨

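Reference: http://blog.csdn.net/eastmount/article/details/53932775#
BeautifulSoup: fast to process, supports chained lookups, and is mainly used for static web pages.
After BeautifulSoup has processed a document, all text is converted to Unicode, so it has to be converted to whatever encoding you need. You can use encode('desired encoding'), or alternatively
BeautifulSoup(page/html, "lxml/xml").prettify('desired encoding')
soup.original_encoding can be used to detect the original encoding.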
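The encoding notes above can be tried on a small in-memory page. This is only a minimal sketch: the GBK-encoded sample markup is made up for illustration, and it uses the standard-library `html.parser` backend instead of `lxml` so that nothing beyond `bs4` is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, encoded as GBK bytes purely for illustration.
raw = ('<html><head><meta charset="gbk"></head>'
       '<body><p>豆瓣电影</p></body></html>').encode('gbk')

soup = BeautifulSoup(raw, 'html.parser')

# Encoding that BeautifulSoup detected for the input bytes.
detected = soup.original_encoding

# Inside the soup, all text is Unicode (str in Python 3).
title = soup.p.get_text()

# Option 1: encode an extracted string to the encoding you need.
title_utf8 = title.encode('utf-8')

# Option 2: serialize the whole document in the encoding you need.
page_utf8 = soup.prettify('utf-8')

print(detected, title, type(title_utf8), type(page_utf8))
```

Both options return `bytes`; which one fits depends on whether you need a single field or the whole page.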
We want to filter by class, but class is a Python keyword. What can we do? Just append an underscore:
.find_all(tag, class_='class name')
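As a small illustration of class filtering and chained lookups, here is a sketch that runs against a hand-written snippet. The markup is hypothetical (loosely shaped like one list entry, not copied from any real page), and `html.parser` is used only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '''
<div class="item">
  <em>1</em>
  <span class="title">Movie A</span>
  <div class="star">
    <span class="rating"></span>
    <span class="rating_num">9.7</span>
  </div>
</div>
<div class="other"><em>x</em></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ (with the trailing underscore) filters on the CSS class.
items = soup.find_all('div', class_='item')

# Lookups can be chained: search inside a tag that was already found.
first = items[0]
rank = first.find('em').get_text()
rating = first.find('div', class_='star').find_all('span')[1].get_text()

print(len(items), rank, rating)
```

Passing the class name as the second positional argument, as in `soup.find_all('div', 'item')`, selects the same elements.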

#!/usr/bin/env python
# encoding=utf-8
__author__ = 'chw'
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36'}
# Open the output file as UTF-8 so the Chinese titles are written correctly
file = open('250.txt', 'w', encoding='utf-8')


def craw(url):
    response = requests.get(url, headers=headers)
    text = response.text
    soup = BeautifulSoup(text, 'lxml')
    print('Douban Movie Top 250: No.\tTitle\tRating\tNumber of ratings')
    for tag in soup.find_all("div", class_="item"):
        # Rank
        num = tag.find('em').get_text()
        # Title: concatenate the text of the first three <span> tags
        name1 = tag.find_all('span')[0].get_text()
        name2 = tag.find_all('span')[1].get_text()
        name3 = tag.find_all('span')[2].get_text()
        name = name1 + name2 + name3
        print(num + '\t' + name)
        file.write(num + '\t' + name + '\n')
        print('Chinese title: ' + name1)
        file.write('Chinese title: ' + name1 + '\n')
        # Rating and number of ratings
        ping_fen = tag.find('div', class_="star").find_all('span')[1].get_text()
        ping_jia = tag.find('div', class_="star").find_all('span')[3].get_text()
        print('Rating: ' + ping_fen)
        print('Number of ratings: ' + ping_jia)
        file.write('Rating: ' + ping_fen + '\n')
        file.write('Number of ratings: ' + ping_jia + '\n')


if __name__ == '__main__':
    # 10 pages with 25 entries per page
    for j in range(0, 10):
        j = j * 25
        url = 'https://movie.douban.com/top250?start=' + str(j) + '&filter='
        craw(url)
    file.close()

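Selenium: mainly used for dynamic web pages; lookups are slow. When parsing, note that
.find_elements_by_xpath and .find_element_by_xpath are different: the former returns a list of all matching elements, the latter returns only the first match. A browser driver must also be configured before use.
PhantomJS:
drive = webdriver.PhantomJS(r'D:\Anaconda2\phantomjswindows\bin\phantomjs.exe')
Chrome: the driver has to be downloaded separately.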
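The first-match versus all-matches distinction can be seen without launching a browser. The sketch below is not Selenium: it uses the standard library's `xml.etree.ElementTree` as a stand-in, with a hypothetical well-formed snippet, where `find` plays the role of `find_element_by_xpath` and `findall` the role of `find_elements_by_xpath`. One difference worth keeping in mind: when nothing matches, Selenium's `find_element_by_xpath` raises `NoSuchElementException` and `find_elements_by_xpath` returns an empty list, whereas `ElementTree.find` returns `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a rendered page.
page = ET.fromstring(
    "<html><body>"
    "<div class='item'>first</div>"
    "<div class='item'>second</div>"
    "<div class='other'>skip</div>"
    "</body></html>"
)

# ElementTree paths are relative, hence the leading dot.
xpath = ".//div[@class='item']"

# Analogous to find_element_by_xpath: only the first match.
first = page.find(xpath)

# Analogous to find_elements_by_xpath: a list of every match.
every = page.findall(xpath)

print(first.text)
print([e.text for e in every])
```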

#!/usr/bin/env python
# encoding=utf-8
__author__ = 'chw'
from selenium import webdriver

# Open the output file as UTF-8 so the Chinese text is written correctly
file = open('200.txt', 'w', encoding='utf-8')


def craw(url):
    drive.get(url)
    print('Douban Movie Top 250: No.\tTitle\tRating\tNumber of ratings')
    # find_elements_by_xpath returns a list of every matching element
    content = drive.find_elements_by_xpath("//div[@class='item']")
    for num in content:
        print(num.text)
        file.write(num.text + '\n')


if __name__ == '__main__':
    # Raw strings keep the backslashes in Windows paths intact
    drive = webdriver.PhantomJS(r'D:\Anaconda2\phantomjswindows\phantomjs.exe')
    # drive = webdriver.Chrome(r'C:\Program Files\Google\Chrome\Application\chromedriver.exe')
    # 2 pages with 25 entries per page
    for j in range(0, 2):
        j = j * 25
        url = 'https://movie.douban.com/top250?start=' + str(j) + '&filter='
        craw(url)
    file.close()
    drive.quit()
