Python Web Scraping: First Time Using selenium

Latest recommended article published 2024-05-10 16:25:53

老问题

Category: python

Article link: https://blog.csdn.net/qq_32511479/article/details/75675027

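I've spent the past two days studying selenium. Just getting the thing installed took a lot of effort.

Still, it really is quite handy.

To get familiar with selenium, I'm once again following in the footsteps of more experienced people and practicing on their projects.

Take a look at 州的先生's Zhihu articles: https://www.zhihu.com/people/zmister/pins/posts. They are all very basic and easy to follow.

This time the goal is to use selenium to scrape the posts ("shuoshuo") that friends publish on QQ Zone (Qzone).

For the details of working with selenium, see the book 《selenium webdriver(python)第三版》 (Selenium WebDriver (Python), 3rd edition); copies can be found online. It also covers how to install selenium.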

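Approach:

Selenium is a functional test automation tool for web applications. It drives a real browser, so the page behaves just as it would for a human user.

I'm using the Chrome browser.

1. First, visit the friend's space. After opening the link, a login page appears, and selenium has to simulate a person's actions to log in.

For this step, right-click in the page and choose Inspect to jump straight to the element you want. (A very practical trick; I had hardly used it before and learned it this time.)

2. Use Inspect to locate the part of the page that holds the posts, and then the data can be scraped.

Code: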

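from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

# Written for the Selenium 4 API: find_element(By.ID, ...) and
# switch_to.frame(...) replace the older find_element_by_id(...) and
# switch_to_frame(...) helpers.
driver = webdriver.Chrome()
driver.maximize_window()


def element_exists(element_id):
    # Return True if an element with this id exists in the current frame.
    try:
        driver.find_element(By.ID, element_id)
        return True
    except NoSuchElementException:
        return False


def get_friend_shuoshuo(qq):
    url = 'http://user.qzone.qq.com/{}/311'.format(qq)
    driver.get(url)
    time.sleep(3)  # give the page time to load so the script runs reliably
    print(url)

    # A login iframe is shown when we are not logged in yet.
    if element_exists('login_frame'):
        driver.switch_to.frame('login_frame')
        driver.find_element(By.ID, 'switcher_plogin').click()  # account/password login
        driver.find_element(By.ID, 'u').clear()
        driver.find_element(By.ID, 'u').send_keys('your QQ number')
        driver.find_element(By.ID, 'p').clear()
        driver.find_element(By.ID, 'p').send_keys('your password')
        driver.find_element(By.ID, 'login_button').click()
        time.sleep(3)
        driver.switch_to.default_content()  # leave the login iframe

    driver.implicitly_wait(3)  # implicit wait: poll up to 3 s when locating elements

    # This icon is only found when we are allowed to view the space.
    has_access = element_exists('QM_OwnerInfo_Icon')
    if not has_access:
        print("Sorry, you don't have permission to view this friend's space")

    if has_access:
        # Switch into the iframe that holds the posts; find the iframe tag
        # with Inspect and pass its id here.
        driver.switch_to.frame('app_canvas_frame')
        content = driver.find_elements(By.CSS_SELECTOR, 'pre.content')
        stime = driver.find_elements(By.CSS_SELECTOR, 'a.c_tx.c_tx3')

        count = 0
        for con, sti in zip(content, stime):
            count += 1
            print('Post %d' % count)
            print('Content: ' + con.text)
            print('Time: ' + sti.text)
            print('\n\n')
    else:
        print('Something went wrong!')

    # Experiment with page_source, left disabled:
    #from bs4 import BeautifulSoup
    #pages = driver.page_source
    #soup = BeautifulSoup(pages,'html.parser')
    #print(soup)

    print("==========Done================")


qq = input('Enter the QQ number to visit: ')
get_friend_shuoshuo(qq)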

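Problems I ran into:

1. At first I didn't understand the line driver.switch_to_frame('login_frame'). It switches the driver into a nested frame (in Selenium 4 the call is written driver.switch_to.frame('login_frame')).

Only the content inside an iframe tag, as shown in Inspect, belongs to that frame. The argument is the iframe's id attribute.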
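
To make the frame switching a bit more concrete, here is a minimal sketch of a helper that enters an iframe and always returns to the top-level page afterwards. The name in_frame is hypothetical and not part of Selenium; the sketch only assumes that the driver exposes switch_to.frame(...) and switch_to.default_content(), which are the Selenium 4 spellings.

```python
from contextlib import contextmanager


@contextmanager
def in_frame(driver, frame_id):
    """Run a block inside the iframe with the given id, then leave it.

    in_frame is a hypothetical helper; driver is expected to be a
    Selenium WebDriver (or any object exposing the same switch_to API).
    """
    driver.switch_to.frame(frame_id)  # enter the iframe
    try:
        yield driver
    finally:
        # Always return to the top-level document, even on errors.
        driver.switch_to.default_content()
```

With a real driver this would be used as `with in_frame(driver, 'login_frame'):`, and element lookups inside the with block would then resolve against the iframe's document.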

2. driver.find_elements_by_css_selector() returns a list.
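
Because the results are plain lists, the script pairs post bodies with timestamps using zip(), which silently stops at the shorter list. The sketch below shows one way to notice a mismatch instead. The helper name pair_posts and the sample strings are made up for illustration; with Selenium the strings would come from the .text of each matched element.

```python
from itertools import zip_longest


def pair_posts(contents, times):
    """Pair each post body with its timestamp.

    Unlike plain zip(), a length mismatch raises an error instead of
    being silently truncated.
    """
    missing = object()  # sentinel marking a gap in the shorter list
    pairs = []
    for content, stamp in zip_longest(contents, times, fillvalue=missing):
        if content is missing or stamp is missing:
            raise ValueError(
                "got %d contents but %d timestamps" % (len(contents), len(times))
            )
        pairs.append((content, stamp))
    return pairs


# Made-up sample data, for illustration only.
print(pair_posts(["first post", "second post"], ["10:01", "11:30"]))
```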

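3. After finishing the script, I wanted to add pagination so that all of a friend's posts could be scraped.

But I found that the code shown by Inspect differs from the code shown by View Source; what Inspect shows is what actually appears on the page.

I had no way of getting at the Inspect code, so I tried driver.page_source and BeautifulSoup.

But driver.page_source sometimes returned the Inspect code and sometimes the View Source code.

This confused me. It seems to have something to do with data loaded dynamically by JavaScript; I searched online for a long time without finding a good solution.

I'll add this feature in a few days, once I've learned how to handle dynamic data.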

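Solution:

Yesterday my plan was to parse the Inspect code with BeautifulSoup and then use a regular expression to pick out the id = "pager_next_\d+" part,

but since I couldn't get hold of the Inspect code, I gave up on that and used a more conventional approach.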
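
For reference, the abandoned regular-expression idea can be sketched on a plain HTML string. The fragment below is made up, since the real Qzone markup is not shown in this post, and find_next_button_id is a hypothetical helper name. The sketch uses only the re module and leaves BeautifulSoup out.

```python
import re

# Matches an id attribute such as id="pager_next_3" and captures its value.
NEXT_ID = re.compile(r'id\s*=\s*"(pager_next_\d+)"')


def find_next_button_id(html):
    """Return the first pager_next_<n> id found in html, or None."""
    match = NEXT_ID.search(html)
    return match.group(1) if match else None


# Hypothetical fragment for illustration; the real page markup may differ.
sample = '<div id="pager"><a id="pager_next_3" href="#">next</a></div>'
print(find_next_button_id(sample))
```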

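Clicking through Inspect shows that the next-page button has an id of the form pager_next_ followed by a number.

So driver.find_element_by_id(...).click() (driver.find_element(By.ID, ...).click() in Selenium 4) can locate and click it, which gives us pagination.

The number after pager_next_ changes, though. It starts at 0 and goes up by 1 every time a page number or the next-page button is clicked.

All we need is to go through the pages one at a time, so a counter that is incremented by 1 on each click is enough.

Worth mentioning: add a wait after every page turn, otherwise the script may fail because the page hasn't finished loading.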
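
A fixed sleep works, but polling for a condition lets the script continue as soon as the page is ready. Below is a hand-rolled sketch of that idea in plain Python. The helper wait_until is hypothetical and not a Selenium API; Selenium ships its own WebDriverWait class for this purpose, which is not used here.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

With a driver, the condition could be something like `lambda: driver.find_elements(By.ID, 'pager_next_' + str(n))`, since find_elements returns an empty list while nothing matches.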

Revised code:

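from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, WebDriverException
from selenium.webdriver.common.by import By
import time

# Written for the Selenium 4 API: find_element(By.ID, ...) and
# switch_to.frame(...) replace the older find_element_by_id(...) and
# switch_to_frame(...) helpers.
driver = webdriver.Chrome()
driver.maximize_window()

count = 0  # running total of posts printed so far


def next_page():  # print every post on the current page
    global count

    content = driver.find_elements(By.CSS_SELECTOR, 'pre.content')
    stime = driver.find_elements(By.CSS_SELECTOR, 'a.c_tx.c_tx3')

    for con, sti in zip(content, stime):
        count += 1
        print('Post %d' % count)
        print('Content: ' + con.text)
        print('Time: ' + sti.text)
        print('\n\n')


def element_exists(element_id):
    # Return True if an element with this id exists in the current frame.
    try:
        driver.find_element(By.ID, element_id)
        return True
    except NoSuchElementException:
        return False


def get_friend_shuoshuo(qq):
    url = 'http://user.qzone.qq.com/{}/311'.format(qq)
    driver.get(url)
    time.sleep(3)
    print(url)

    # A login iframe is shown when we are not logged in yet.
    if element_exists('login_frame'):
        driver.switch_to.frame('login_frame')
        driver.find_element(By.ID, 'switcher_plogin').click()  # account/password login
        driver.find_element(By.ID, 'u').clear()
        driver.find_element(By.ID, 'u').send_keys('your QQ number')
        driver.find_element(By.ID, 'p').clear()
        driver.find_element(By.ID, 'p').send_keys('your password')
        driver.find_element(By.ID, 'login_button').click()
        time.sleep(3)
        driver.switch_to.default_content()  # leave the login iframe

    driver.implicitly_wait(2)

    # This icon is only found when we are allowed to view the space.
    has_access = element_exists('QM_OwnerInfo_Icon')
    if not has_access:
        print("Sorry, you don't have permission to view this friend's space")

    # Experiment with page_source, left disabled:
    #from bs4 import BeautifulSoup
    #pages = driver.page_source
    #print(pages)
    #soup = BeautifulSoup(pages,'html.parser')
    #print(soup)
    #pnext = soup.find('div',attrs = {'id':'pager'})
    #print(str(pnext) + '\n')

    # In Inspect, the button id is pager_next_ plus a number that starts at 0
    # and goes up by 1 on every click of a page number or the next-page button.
    # We only ever click "next", so count1 tracks that number.
    count1 = 0

    if has_access:
        # Switch into the iframe that holds the posts; find the iframe tag
        # with Inspect and pass its id here.
        driver.switch_to.frame('app_canvas_frame')
        while True:
            next_page()
            try:
                driver.find_element(By.ID, 'pager_next_' + str(count1)).click()
            except WebDriverException:
                break  # no clickable next-page button: stop paging
            count1 += 1
            time.sleep(5)  # let the next page load, otherwise lookups may fail

    # Join the browser cookies into one "name=value;" string.
    cookies = driver.get_cookies()
    cookie_str = ''.join('{0}={1};'.format(c['name'], c['value']) for c in cookies)
    print('Cookies:', cookie_str)
    print("==========Done================")


qq = input('Enter the QQ number to visit: ')
get_friend_shuoshuo(qq)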
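
The cookie-joining step at the end of the script can also be pulled out into a small function. driver.get_cookies() returns a list of dicts that each carry at least a name and a value key. The sample list below is made up, cookie_header is a hypothetical helper name, and the sketch uses the conventional "; " separator rather than the bare ";" printed by the script.

```python
def cookie_header(cookies):
    """Join cookie dicts into a single 'name=value; name=value' string."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


# Made-up cookies shaped like the entries of driver.get_cookies().
sample = [
    {"name": "a", "value": "1"},
    {"name": "b", "value": "2"},
]
print(cookie_header(sample))
```

A string in this form is what the Cookie header of a plain HTTP request expects, so it could be reused outside the browser.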