Scraping Weibo with selenium + a multithreaded crawler

A Basic Multithreaded Weibo Crawler with Python + Selenium

I. A Casual Overview

Hi everyone. I've been reading CSDN ever since I started university and have learned a great deal here, but I've never published a post of my own (mostly because I'm too green to have anything worth sharing). Right now my class is doing its end-of-term practical training in the lab, and my group is building a product recommendation system based on Weibo data. Whether the system is genuinely feasible or valuable isn't really the point; our main goal is to learn more technology by building it, including picking up a new language (yes, forgive me, I'm only now starting to learn Python).

Okay, enough rambling. This post is mainly a record of my progress on the project, a kind of development log to look back on. I'm a junior already but still very much a rookie; the part that crawls the Weibo content borrows code from this excellent post: https://blog.csdn.net/d1240673769/article/details/74278547. Comments and suggestions are very welcome.

This article explains how I use selenium to drive a browser that searches for a Weibo user by nickname, opens the user's profile page, and saves their posts locally, grabbing the user's avatar along the way.

II. Environment Setup

1. My environment is Python 3.6 with PyCharm as the IDE. PyCharm can install selenium and every other package this project needs directly.

To add the required packages, once the project is created go to File -> Settings -> Project: <project name> -> Project Interpreter.


Next, double-click pip on the right to open the full package list, search for the package you need, and click Install Package.



You can queue several packages at once; after selecting them, simply close the window and click OK, and the installs run in the background. (From a terminal, the rough equivalent would be pip install selenium beautifulsoup4 pyquery lxml.)


When the installation finishes, PyCharm shows a notice at the bottom of the window.

2. Download chromedriver. Go to http://npm.taobao.org/mirrors/chromedriver/ and check notes.txt to find the chromedriver release that matches your version of Chrome.


After downloading, unzip the archive and copy the driver executable straight into the project directory.
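If you put chromedriver somewhere other than the project directory, selenium can also be pointed at it explicitly. A small sketch (the path is a placeholder, and executable_path is the selenium 3 keyword argument used at the time of writing):

from selenium import webdriver

# explicit driver path instead of relying on the project directory / PATH
driver = webdriver.Chrome(executable_path='D:/tools/chromedriver.exe')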


3. Now for the code. I crawl the m-site (mobile) version of Weibo: constructing https://m.weibo.cn/u/<user's OID> gives direct access to all of a user's posts, no login required. Going through the wap site instead is unreliable, since users there can customize their profile URLs, so there is no usable pattern (at least not at my skill level).

The overall flow of the crawler:

Open https://weibo.com in a simulated browser -> search for the user's nickname via the search box -> switch to the "Find People" tab -> scrape the user's profile-page URL and visit it -> scrape the user's oid -> visit https://m.weibo.cn/u/<oid> -> match and extract the posts with regular expressions.
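To make the last two steps concrete, here is a minimal sketch of fetching the m-site data directly (the oid value is a hypothetical placeholder, and a real request may also need the User-Agent header shown later):

import json
import urllib.request

oid = '1234567890'  # hypothetical oid, purely for illustration
# the xhr endpoint the m-site itself uses to load a user's data
url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + oid
data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
print(json.loads(data).get('data'))  # JSON containing userInfo, tabsInfo, etc.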

First, let's build the OidSpider class and import the packages it needs:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import re
from pyquery import PyQuery as pq

Define the driver, open https://www.weibo.com, and use a locator to type the user's nickname into the search box. The CSS selectors used below were copied from the browser's developer tools.


The code is as follows:

self.driver = webdriver.Chrome()
self.wait = WebDriverWait(self.driver, 10)
self.driver.get("https://www.weibo.com/")
# wait for the top search box to appear and for the search button to be clickable
input = self.wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > input")))
submit = self.wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > a")))
input.send_keys(self.nickName)

Then locate the search button the same way and click it; once the results load, use another locator to click over to the "Find People" tab:


submit = self.wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > a")))
submit.click()
# on the results page, the second tab is "Find People"
submit = self.wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#pl_common_searchTop > div.search_topic > div > ul > li:nth-child(2) > a')))
submit.click()

Next, pull the user's profile-page URL out of the page source with a regular expression:

html = self.driver.page_source
doc = pq(html)
return (re.findall(r'a target="_blank"[\s\S]href="(.*)"[\s\S]title=', str(doc))[0])

Visit that profile URL and match the user's oid with another regex:

# the scraped href is protocol-relative, so prepend the scheme
self.driver.get('https:' + url)
html = self.driver.page_source
soup = BeautifulSoup(html, 'lxml')
script = soup.head.find_all('script')
self.driver.close()
return (re.findall(r"'oid']='(.*)'", str(script))[0])
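Chained together (exactly how MultiSpider wires them up later), the two OidSpider methods are used like this:

oidspider = OidSpider('孟美岐')   # any nickname, e.g. from the list in main()
url = oidspider.constructURL()   # profile-page URL found via the search flow
oid = oidspider.searchOid(url)   # oid scraped from the profile's <script> tags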

Next comes the WeiboSpider class; again we import the required packages:

from selenium import webdriver
import urllib.request
import json
from selenium.webdriver.support.ui import WebDriverWait

Build the request header (the request is also routed through an HTTP proxy):

req = urllib.request.Request(url)
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
# route the request through the proxy configured in __proxyAddr
proxy = urllib.request.ProxyHandler({'http': self.__proxyAddr})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
return data
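One step the walkthrough glosses over is where the containerid in the paging URL below comes from. The searchContainerId method in the full code pulls it out of the user-info xhr's tabsInfo block, roughly like this:

data = self.constructProxy(url)         # url is the getIndex xhr built in __init__
content = json.loads(data).get('data')
for tab in content.get('tabsInfo').get('tabs'):
    if tab.get('tab_type') == 'weibo':  # the tab that lists the user's posts
        return tab.get('containerid')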

Next, find the user's avatar via XPath (XPath usage is similar to a selector's, but more powerful) and save it straight to disk:

self.driver.get("https://m.weibo.cn/u/" + self.oid)

src = WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_xpath('//*[@id="app"]/div[1]/div[2]/div[1]/div/div[2]/span/img'))
imgurl = src.get_attribute('src')
urllib.request.urlretrieve(imgurl, 'D://微博用户头像/' + nickName + '.jpg')
self.driver.get(imgurl)
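One caveat (my addition, not part of the original code): urlretrieve raises an error if the target folder does not exist, so it is safer to create it first. A minimal sketch:

import os

save_dir = 'D://微博用户头像/'           # the same hard-coded folder as above
os.makedirs(save_dir, exist_ok=True)   # create the folder if it is missing
urllib.request.urlretrieve(imgurl, save_dir + nickName + '.jpg')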

Then we loop over the pages of posts and write them to a txt file:

# i starts at 1 and url is self.url (see the full searchWeibo method below)
while True:
    weibo_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + self.oid + '&containerid=' + self.searchContainerId(url) + '&page=' + str(i)
    try:
        data = self.constructProxy(weibo_url)
        content = json.loads(data).get('data')
        cards = content.get('cards')
        if (len(cards) > 0):
            for j in range(len(cards)):
                print("-----Crawling page " + str(i) + ", post " + str(j) + "------")
                card_type = cards[j].get('card_type')
                if (card_type == 9):  # card_type 9 carries an actual post (mblog)
                    mblog = cards[j].get('mblog')
                    attitudes_count = mblog.get('attitudes_count')
                    comments_count = mblog.get('comments_count')
                    created_at = mblog.get('created_at')
                    reposts_count = mblog.get('reposts_count')
                    scheme = cards[j].get('scheme')
                    text = mblog.get('text')
                    with open(nickName + '.txt', 'a', encoding='utf-8') as fh:
                        fh.write("----Page " + str(i) + ", post " + str(j) + "----" + "\n")
                        fh.write("Post URL: " + str(scheme) + "\n" + "Posted at: " + str(
                            created_at) + "\n" + "Content: " + text + "\n" + "Likes: " + str(
                            attitudes_count) + "\n" + "Comments: " + str(comments_count) + "\n" + "Reposts: " + str(
                            reposts_count) + "\n")
            i += 1
        else:
            break
    except Exception as e:
        print(e)
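Note that the except branch only prints the error and then retries the same page, so a persistent failure would spin forever. A small hedged improvement (my addition) is to bail out of the loop on error as well:

    except Exception as e:
        print(e)
        break  # stop instead of retrying the same page indefinitely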

And of course, don't forget to close the driver at the end. (Note: close() only closes the current window; quit() would also shut down the chromedriver process.)

self.driver.close()


Now for the multithreading, which is actually fairly simple. Python 3 differs a bit from Python 2 here; I recommend Python 3's threading module:

from oidspider import OidSpider
from weibospider import WeiboSpider
from threading import Thread

class MultiSpider:
    userList = None
    threadList = []  # note: class-level list, shared by all instances

    def __init__(self, userList):
        self.userList = userList

    def weiboSpider(self, nickName):
        # one full pipeline per user: nickname -> oid -> posts + avatar
        oidspider = OidSpider(nickName)
        url = oidspider.constructURL()
        oid = oidspider.searchOid(url)
        weibospider = WeiboSpider(oid)
        weibospider.searchWeibo(nickName)

    def mutiThreads(self):
        # one thread per nickname, each running the full pipeline
        for niName in self.userList:
            t = Thread(target=self.weiboSpider, args=(niName,))
            self.threadList.append(t)

        for threads in self.threadList:
            threads.start()
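mutiThreads only starts the threads; main() will exit without waiting for them to finish. If the program should block until every crawler is done, add a join loop (my addition, not in the original):

        for threads in self.threadList:
            threads.join()  # wait for each crawler thread to finish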



Here is the complete code:

#######################################################
#
# OidSpider.py
# Python implementation of the Class OidSpider
# Generated by Enterprise Architect
# Created on: 20-Jun-2018 10:27:14
# Original author: McQueen
#
#######################################################

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import re
from pyquery import PyQuery as pq

class OidSpider:
    """
    Crawls a user's Weibo ID.

    Uses selenium to simulate browser actions: finds the user by searching
    for their Weibo nickname, scrapes the profile-page URL needed for the
    redirect, then parses the profile page's HTML to find and extract the
    user's Weibo ID.

    nickName: Weibo nickname
    driver: browser driver
    wait: the explicit wait used while simulating browser actions
    """
    nickName = None
    driver = None
    wait = None

    def __init__(self, nickName):
        """Initialize the oid spider.

        Initializes from the nickname supplied by the caller.
        """
        self.nickName = nickName

    def constructURL(self):
        """Construct the URL.

        Simulates a browser search for the user's nickname and extracts
        the profile-page URL to jump to.

        Returns the user's profile-page URL.
        """
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        self.driver.get("https://www.weibo.com/")
        input = self.wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > input")))
        submit = self.wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#weibo_top_public > div > div > div.gn_search_v2 > a")))
        input.send_keys(self.nickName)
        submit.click()
        submit = self.wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#pl_common_searchTop > div.search_topic > div > ul > li:nth-child(2) > a')))
        submit.click()
        html = self.driver.page_source
        doc = pq(html)
        return (re.findall(r'a target="_blank"[\s\S]href="(.*)"[\s\S]title=', str(doc))[0])

    def searchOid(self, url):
        """Crawl the user's oid.

        Parses the profile page's HTML and extracts the user ID.

        url: the user's profile-page URL

        Returns the user's ID.
        """
        self.driver.get('https:' + url)
        html = self.driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        script = soup.head.find_all('script')
        self.driver.close()
        return (re.findall(r"'oid']='(.*)'", str(script))[0])

#######################################################
#
# WeiboSpider.py
# Python implementation of the Class WeiboSpider
# Generated by Enterprise Architect
# Created on: 20-Jun-2018 10:55:18
# Original author: McQueen
#
#######################################################

from selenium import webdriver
import urllib.request
import json
from selenium.webdriver.support.ui import WebDriverWait

class WeiboSpider:
    """
    Initializes the Weibo spider and, from the oid, constructs the xhr
    that loads the user's info and posts.

    oid: user ID
    url: the m-site xhr used to load the user's info
    driver: browser driver
    """
    __proxyAddr = "122.241.72.191:808"
    oid = None
    url = None
    driver = None

    def __init__(self, oid):
        self.oid = oid
        self.url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + oid
        self.driver = webdriver.Chrome()

    def constructProxy(self, url):
        """Construct the proxy.

        Builds the request and fetches the user's xhr data.

        Returns the xhr response body.
        """
        req = urllib.request.Request(url)
        req.add_header("User-Agent",
                       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
        proxy = urllib.request.ProxyHandler({'http': self.__proxyAddr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
        return data

    def searchContainerId(self, url):
        """Construct the xhr address for the user's posts.

        url: the URL to analyze

        Returns the containerid for the xhr address.
        """
        data = self.constructProxy(url)
        content = json.loads(data).get('data')
        for data in content.get('tabsInfo').get('tabs'):
            if (data.get('tab_type') == 'weibo'):
                containerid = data.get('containerid')
                return containerid

    def searchWeibo(self, nickName):
        """Crawl the posts and save them to a text file.

        Analyzes each page of the user's posts xhr, extracts the post
        content, and appends it to a txt file. Also locates the user's
        avatar with selenium's xpath and downloads it locally.

        nickName: the user's Weibo nickname
        """
        i = 1
        self.driver.get("https://m.weibo.cn/u/" + self.oid)

        src = WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_xpath('//*[@id="app"]/div[1]/div[2]/div[1]/div/div[2]/span/img'))
        imgurl = src.get_attribute('src')
        urllib.request.urlretrieve(imgurl, 'D://微博用户头像/' + nickName + '.jpg')
        self.driver.get(imgurl)

        url = self.url
        while True:
            weibo_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + self.oid + '&containerid=' + self.searchContainerId(url) + '&page=' + str(i)
            try:
                data = self.constructProxy(weibo_url)
                content = json.loads(data).get('data')
                cards = content.get('cards')
                if (len(cards) > 0):
                    for j in range(len(cards)):
                        print("-----Crawling page " + str(i) + ", post " + str(j) + "------")
                        card_type = cards[j].get('card_type')
                        if (card_type == 9):
                            mblog = cards[j].get('mblog')
                            attitudes_count = mblog.get('attitudes_count')
                            comments_count = mblog.get('comments_count')
                            created_at = mblog.get('created_at')
                            reposts_count = mblog.get('reposts_count')
                            scheme = cards[j].get('scheme')
                            text = mblog.get('text')
                            with open(nickName + '.txt', 'a', encoding='utf-8') as fh:
                                fh.write("----Page " + str(i) + ", post " + str(j) + "----" + "\n")
                                fh.write("Post URL: " + str(scheme) + "\n" + "Posted at: " + str(
                                    created_at) + "\n" + "Content: " + text + "\n" + "Likes: " + str(
                                    attitudes_count) + "\n" + "Comments: " + str(comments_count) + "\n" + "Reposts: " + str(
                                    reposts_count) + "\n")
                    i += 1
                else:
                    break
            except Exception as e:
                print(e)

        self.driver.close()

from oidspider import OidSpider
from weibospider import WeiboSpider
from threading import Thread

class MultiSpider:
    userList = None
    threadList = []

    def __init__(self, userList):
        self.userList = userList

    def weiboSpider(self, nickName):
        oidspider = OidSpider(nickName)
        url = oidspider.constructURL()
        oid = oidspider.searchOid(url)
        weibospider = WeiboSpider(oid)
        weibospider.searchWeibo(nickName)

    def mutiThreads(self):
        for niName in self.userList:
            t = Thread(target=self.weiboSpider, args=(niName,))
            self.threadList.append(t)

        for threads in self.threadList:
            threads.start()

from MultiSpider import MultiSpider

def main():
    list = ['孟美岐', '吴宣仪', '杨超越', '紫宁']
    multispider = MultiSpider(list)
    multispider.mutiThreads()

if __name__ == '__main__':
    main()

And that's it! Now we can crawl all these young ladies' posts and avatars. (Screenshots of the crawled txt files and avatars omitted.)