Scrapy Framework Crawler: Scraping Baidu Tieba Comments

Scrapy framework basics

Scrapy installation command

pip install scrapy

Alternatively, install it with conda. Personally I find conda more convenient: a pip install often pulls in other packages that need manual setup, whereas conda installs everything in one step.

conda install scrapy

The commands below are all run from a cmd window after cd-ing into your working directory; create the Scrapy project in a dedicated folder so it is easy to manage.

scrapy startproject baidutieba   # create the project
scrapy crawl baidu               # run the Scrapy spider
scrapy shell "url"               # test scraping a page in the interactive shell
scrapy view "url"                # check whether the fetched page contains what you want to scrape

Introduction to the Scrapy crawler
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving of historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has broad uses: data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall workflow is roughly as described below.

The Scrapy run flow is roughly as follows (a minimal spider sketch follows this list):
First, the engine takes a URL from the scheduler for the next crawl.
The engine wraps the URL in a Request and hands it to the downloader; the downloader fetches the resource and wraps it in a Response.
The spider then parses the Response.
If it parses out an item, the item is passed to the item pipeline for further processing.
If it parses out a link (URL), the URL is handed back to the scheduler to await crawling.
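To make that flow concrete, here is a minimal, hypothetical spider sketch (the name 'minimal' and the example URL are placeholders, not part of this project): start_urls are fed to the scheduler, the downloader returns a Response, and parse() yields items to the pipeline and new Requests back to the scheduler.

import scrapy
from scrapy import Request

class MinimalSpider(scrapy.Spider):
    # a minimal illustration of the Scrapy flow; not part of the Tieba project
    name = 'minimal'
    start_urls = ['https://example.com']   # the engine feeds these to the scheduler

    def parse(self, response):
        # the downloader has fetched the page and wrapped it in `response`
        for href in response.xpath('//a/@href').extract():
            # parsed-out links go back to the scheduler as new Requests
            yield Request(response.urljoin(href), callback=self.parse)
        # parsed-out data is yielded as an item and handed to the item pipeline
        yield {'title': response.xpath('//title/text()').extract_first()}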

Baidu Tieba crawler

  1. Project layout (see the layout sketch after the file overview below)
  2. File overview
    function.py: holds extra helper functions.
    items.py: defines the item fields used to save the scraped data.
    middlewares.py: attaches and modifies request headers and cookies for each request.
    pipelines.py: data deduplication and saving.
    settings.py: a very large number of configuration options live here; it can also hold project parameters you may want to change.
    files under spiders/: the spider itself, the main program that scrapes the data.
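For orientation, a typical layout for this project would look roughly like the sketch below: the files generated by scrapy startproject baidutieba plus the author's function.py added by hand. The spider's file name is only illustrative; the spider's name attribute is 'baidu'.

baidutieba/
    scrapy.cfg                # project configuration entry point
    baidutieba/
        __init__.py
        function.py           # extra helper functions (added manually)
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            baidu_spider.py   # the spider shown below (name = 'baidu'); file name is a guess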
  3. Notes
    Parameter settings
    File to modify: settings.py
    1. Tunable parameters and their comments (see the annotated settings section below).
    2. Database settings: change them to the database you want to read from and write to.
    3. JUDGEMENT_TYPE controls how the scraping time window is measured. When it is True, the window is measured in hours (for example, with TIME_INTERVAL = 2 only posts from the last two hours up to now are scraped); when it is False, the window is measured in days, so the same TIME_INTERVAL = 2 means posts from the last two days. In short, TIME_INTERVAL is interpreted in hours or days depending on JUDGEMENT_TYPE: True means crawl by the hour, False means crawl by the day.

TIME_INTERVAL and TIME_INTERVAL1 both set time cut-offs. The first applies to Tieba posts and the replies directly under them, and is interpreted according to JUDGEMENT_TYPE. The second applies to replies-to-replies: those links are reached relatively late, so to avoid missing information the window is made larger and is not affected by JUDGEMENT_TYPE. TIME_INTERVAL1 is therefore always measured in days. A small sketch after this paragraph shows how these settings combine into a cut-off.
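A minimal sketch of how these settings combine into a cut-off; it simply restates the logic of judge_tieba and judge_tieba1 shown later, and the values are hypothetical:

import datetime

JUDGEMENT_TYPE = False   # True: window measured in hours, False: in days
TIME_INTERVAL = 2        # window for posts and their direct replies
TIME_INTERVAL1 = 30      # window for replies-to-replies, always in days

now = datetime.datetime.now()
if JUDGEMENT_TYPE:
    cutoff = now - datetime.timedelta(hours=TIME_INTERVAL)   # e.g. last 2 hours
else:
    cutoff = now - datetime.timedelta(days=TIME_INTERVAL)    # e.g. last 2 days
cutoff1 = now - datetime.timedelta(days=TIME_INTERVAL1)      # always last 30 days

def within_window(post_time, cut):
    # a post is kept only if it was created after the cut-off
    return post_time >= cut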
Scrapy project root directory
FILE = 'D:\College_sensation\web_crawler\baidutieba'
This is the directory of the project folder. Because function.py is an external module that cannot be imported directly, the path to the Scrapy project has to be given here so it can be appended to sys.path. Request headers are attached in middlewares.py, and the fields to be saved are defined in items.py.

  • Spider code (scrapes the relevant information)
import scrapy
from scrapy import Request,Selector
import re
from baidutieba.items import TiebaItem
from scrapy.utils.project import get_project_settings
import sys
settings = get_project_settings()
sys.path.append(settings.get('FILE')+'/baidutieba/baidutieba')
from function import open_url
from function import tieba_standard_time
from function import judge_tieba
from function import time_save
from function import judge_tieba1
TiebaItem = TiebaItem()  # a single shared item instance (note: this shadows the class name)
class baidu_spider(scrapy.Spider):
    name = 'baidu'
    start_urls = ['https://tieba.baidu.com/mo/q---32FEF31A4E9D438CA4A1B75C0C58874A%3AFG%3D1--1-3-0--2--wapp_1540453380315_486/m?kw=%E6%B9%96%E5%8C%97%E7%BB%8F%E6%B5%8E%E5%AD%A6%E9%99%A2&lp=6024']

    def parse(self, response):
        #id = response.meta['news_type']
        #if(id=='A'):
        id ='A'
        selecter = Selector(text=response.body)
        url = selecter.xpath('//div[@class="i"]/a/@href').extract()  # links to the posts
        url_prefix = 'https://tieba.baidu.com/mo/q---32FEF31A4E9D438CA4A1B75C0C58874A%3AFG%3D1--1-3-0--2--wapp_1540453380315_486/'  # URL prefix
        title = selecter.xpath('//div[@class="i"]/a/text()').extract()  # post titles
        #top_tieba = ['-']*len(title)
        titlesum = selecter.xpath('//div[@class="i"]/p/text()').extract()
        url_next1 = selecter.xpath('//div/form/div[@class="bc p"]/a/@href').extract()  # link to the next page of the post list
        url_next = url_prefix + url_next1[0]  # build the full next-page URL
        time =[]
        attention_rate = []  # reply count, used as a measure of attention
        j = 0
        try:
            top_tieba = selecter.xpath('//div[@class="i"]/span[last()]/text()').extract()  # detect pinned (sticky) posts
            j = len(top_tieba)
        except:
            top_tieba =[]
        for i in range(len(title)):  # check whether each post is pinned; pinned posts are not followed
            if(i<j):
                title[i] = title[i].split("\xa0")[1]  # clean the title
                time.append(titlesum[i].split("\xa0")[2])  # post time, used as the scraping cut-off
                attention_rate.append(titlesum[i].split("\xa0")[1][1:])  # reply count as attention, with basic cleaning
                attention_rate[i] = int(attention_rate[i])
            else:
                title[i] = title[i].split("\xa0")[1]  # clean the title
                time.append(titlesum[i].split("\xa0")[2])  # post time, used as the scraping cut-off
                attention_rate.append(titlesum[i].split("\xa0")[1][1:])  # reply count as attention, with basic cleaning
                attention_rate[i] = int(attention_rate[i])
                url[i] = url_prefix + url[i]
                if(judge_tieba(tieba_standard_time(time[i])) and attention_rate[i]):
                    # carry the reply count, the post URL and the URL prefix forward
                    yield Request(url[i], meta={'sch_id': id, 'url': url[i],'reply_num':attention_rate[i],'url_prefix':url_prefix}, callback=self.parse_content1)
        if(judge_tieba(tieba_standard_time(time[-1]))):
            if(url_next):  # crawl the next page of the post list
                yield Request(url_next, meta={'sch_id': id,'url':url[i]}, callback=self.parse)
            else:
                pass
        else:
            pass
    def parse_content1(self, response):  # scrape the content of a post page
        sch_id = response.meta['sch_id']
        url = response.meta['url']
        url_prefix = response.meta['url_prefix']
        reply_num = response.meta['reply_num']
        selecter1 = Selector(text=response.body)
        tieba_name = selecter1.xpath('//div[@class="d"]/div/span/a/text()').extract()  # usernames of the repliers under the original post
        tieba_time = selecter1.xpath('//div[@class="d"]/div/span[@class="b"]/text()').extract()  # reply times
        tieba_content1 = selecter1.xpath('//div[@class="d"]/div[@class="i"]').xpath('string(.)').extract()  # reply contents
        tieba_content2 = selecter1.xpath('//div[@class="d"]/div[@class="i"]/span[@class="g"]/a/text()').extract()  # helper data used to clean the reply contents
        tieba_content_2 = selecter1.xpath('//div[@class="d"]/div[@class="i"]/a/text()').extract()  # replies-to-replies
        reply2_url1 = selecter1.xpath('//div[@class="d"]/div[@class="i"]/a/@href').extract()  # links to the replies-to-replies
        tieba_content_next1 = selecter1.xpath('//div[@class="d"]/form/div/a/@href').extract()  # link to the next page of replies
        #url_prefix = 'https://tieba.baidu.com/mo/q---32FEF31A4E9D438CA4A1B75C0C58874A%3AFG%3D1--1-3-0--2--wapp_1540453380315_486/'
        tieba_content_next = 0
        if(tieba_content_next1):
            tieba_content_next = url_prefix + tieba_content_next1[0]  # full URL of the next page of replies
        pinlunhuifushu = [0]*(len(tieba_name))  # number of replies to each reply (currently unused)
        tieba_content = []
        reply2_url = ['']*len(tieba_name)
        tieba_content1[0] = tieba_content1[0].split("\xa0")[0].split(tieba_content2[0])[0]  # first pass of content cleaning
        tieba_content1[0] = re.split(r'楼. ', tieba_content1[0])[1]  # the first entry is the original post itself, so it has no reply link of its own
        if(judge_tieba(tieba_standard_time(tieba_time[0]))):  # the first entry is the original post
            TiebaItem['sch_id'] = sch_id
            TiebaItem['send_id'] = tieba_name[0]
            TiebaItem['time'] = time_save(tieba_time[0])
            TiebaItem['url'] = url
            TiebaItem['content'] = tieba_content1[0]
            TiebaItem['reply_num'] = reply_num
            yield TiebaItem
        for i in range(1,len(tieba_name)):
            tieba_content1[i] = tieba_content1[i].split("\xa0")[0].split(tieba_content2[i])[0]  # first pass of content cleaning
            tieba_content1[i] = re.split(r'楼. ',tieba_content1[i])[1]
            if(judge_tieba(tieba_standard_time(tieba_time[i]))):
                TiebaItem['sch_id'] = sch_id
                TiebaItem['send_id'] = tieba_name[i]
                TiebaItem['time'] = time_save(tieba_time[i])
                TiebaItem['url'] = url
                TiebaItem['content'] = tieba_content1[i]
                TiebaItem['reply_num'] = reply_num
                yield TiebaItem
            else:
                pass
            if(len(tieba_content_2[i-1])>2):
                # pinlunhuifushu[i] = re.findall(r'[^()]+',tieba_content_2[i-1] )[1]  # number of replies to this reply
                # pinlunhuifushu[i] = int(pinlunhuifushu[i])
                reply2_url[i] = url_prefix + reply2_url1[i-1]
                if (judge_tieba1(tieba_standard_time(tieba_time[i]))):
                    yield Request(reply2_url[i], meta={'sch_id': sch_id, 'tieba_replyname': tieba_name[0], 'reply2_url': reply2_url[i],'reply_num':reply_num,'url_prefix':url_prefix,'url':url},callback=self.parse_content2)
            else:
                # pinlunhuifushu[i] = pinlunhuifushu[i]
                pass
        if(tieba_content_next):
            yield Request(tieba_content_next, meta={'sch_id': sch_id, 'tieba_content_next': tieba_content_next, 'tieba_replyname':tieba_name[0],'reply_num':reply_num,'url_prefix':url_prefix,'url':url}, callback=self.parse_content1)
        else:
            pass
    def parse_content2(self, response):  # scrape the replies-to-replies
        sch_id = response.meta['sch_id']
        url = response.meta['url']
        url_prefix = response.meta['url_prefix']
        reply_num = response.meta['reply_num']
        tieba_replyname = response.meta['tieba_replyname']
        selecter2 = Selector(text=response.body)
        tieba_replycontent = selecter2.xpath('//div[@class="m t"]/div[@class="i"]').xpath('string(.)').extract()  # reply contents
        #tieba_replyname = selecter2.xpath('//div[@class="m t"]/div[@class="i"]/a[1]/text()').extract()
        tieba_replyname1 = selecter2.xpath('//div[@class="m t"]/div[@class="i"]/br/following::a[1]/@href').extract()
        reply_time = selecter2.xpath('//div[@class="m t"]/div[@class="i"]/span/text()').extract()
        try:
            reply_next = selecter2.xpath('//div[@class="h"]/a/@href').extract()
        except:
            reply_next =[]
        for i in range(len(tieba_replycontent)):
            tieba_replyname1[i] = tieba_replyname1[i].split("i?un=")[1]
            if(tieba_replyname1[i] ==''):  # if the username is empty, fall back to the original poster's name
                #tieba_replyname1[i] = '贴吧用户_QWUAW5a '
                tieba_replyname1[i] = tieba_replyname
                tieba_replycontent[i] = tieba_replycontent[i].split("\xa0")[0]  # first pass of content cleaning
            else:
                tieba_replycontent[i] = tieba_replycontent[i].split(tieba_replyname1[i])[0]
            tieba_replycontent[i] = tieba_replycontent[i].replace('\xa0','')
            if (judge_tieba(tieba_standard_time(reply_time[i]))):
                TiebaItem['sch_id'] = sch_id
                TiebaItem['send_id'] = tieba_replyname1[i]
                TiebaItem['time'] = time_save(reply_time[i])
                TiebaItem['url'] = url
                TiebaItem['content'] = tieba_replycontent[i]
                TiebaItem['reply_num'] = reply_num
                yield TiebaItem
        if(len(reply_next)):
            reply_next = url_prefix + reply_next[0]  # prepend the URL prefix
            yield Request(reply_next, meta={'sch_id': sch_id, 'reply2_urlnext': reply_next, 'tieba_replyname': tieba_replyname,'reply_num':reply_num,'url_prefix':url_prefix,'url':url}, callback=self.parse_content2)

  • Helper functions written for this project (function.py)
import datetime
import re
import pymysql
import pandas as pd
import random
import numpy as np
import requests
from bs4 import BeautifulSoup
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
def open_url():  # read (news_id, news_url) pairs from the news_url table
    conn=connect_sql()
    sql_use = 'SELECT news_id,news_url FROM news_url'
    sql_stu_data = pd.read_sql(sql_use, conn)
    train_data = np.array(sql_stu_data)  # np.ndarray()
    main_list = train_data.tolist()  # list
    urs_list=[]
    for i in range(len(main_list)):
        ids=main_list[i][0]
        url=main_list[i][1]
        urs_list.append({'news_type':ids,'url':url})
    return urs_list
# time normalisation functions
def tieba_standard_time(clean_time):  # normalise a Tieba timestamp to the form 'Y-m-d-H'
    if (":" in clean_time):
        # the string contains a clock time
        if(len(clean_time)>5):
            return datetime.datetime.now().strftime('%Y') +'-' +clean_time.split(' ')[0] +'-'+clean_time.split(' ')[1].split(':')[0]
        else:
            return datetime.datetime.now().strftime('%Y-%m-%d') +'-' +clean_time.split(':')[0]
    elif (len(clean_time) <= 5):  # no clock time: a short string means it was posted this year (month-day only)
        return datetime.datetime.now().strftime('%Y') + '-' + clean_time + '-00'
    else:
        return clean_time + '-00'
def time_save(time):  # convert a Tieba timestamp to the 'Y-m-d' form used when saving
    if (":" in time):
        # the string contains a clock time
        if(len(time)>5):
            time =  datetime.datetime.now().strftime('%Y') +'-' +time.split(' ')[0]
        else:
            time =  datetime.datetime.now().strftime('%Y-%m-%d')
    elif (len(time) <= 5):  # no clock time: a short string means it was posted this year
        time =  datetime.datetime.now().strftime('%Y') + '-' + time
    else:
        pass
    return time
# crawl-window checks
def judge_tieba1(time):  # window for replies-to-replies: always measured in days (TIME_INTERVAL1)
    time_long = settings.get("TIME_INTERVAL1")
    test = False
    # always crawl by number of days, regardless of JUDGEMENT_TYPE
    cut_day = (datetime.datetime.now() - datetime.timedelta(days=time_long)).strftime('%Y-%m-%d')
    cut_day = int(cut_day.split('-')[0]) * 366 + int(cut_day.split('-')[1]) * 31 + int(cut_day.split('-')[2])
    time_day = int(time.split('-')[0]) * 366 + int(time.split('-')[1]) * 31 + int(time.split('-')[2])
    if (time_day >= cut_day):
        test = True
    return test
def judge_tieba(time):  # window for posts and direct replies: hours or days, depending on JUDGEMENT_TYPE
    jugfe_day = settings.get("JUDGEMENT_TYPE")
    time_long = settings.get("TIME_INTERVAL")
    test = False
    today = datetime.datetime.now().strftime('%Y-%m-%d')
    time_day =time[:-3]
    time_hour =time.split(time_day)[1].split('-')[1]
    if (jugfe_day):  # crawl by the hour
        hour = datetime.datetime.now().strftime('%H')
        yestday = (datetime.datetime.now() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
        if (today == time_day):  # post made today (covers the late-evening case)
            if (int(hour) - int(time_hour) <= time_long):
                test = True
        elif (yestday == time_day and int(hour) < time_long):  # early-morning crawl picking up yesterday's late-night posts
            if (24 - int(time_hour) <= time_long - int(hour)):
                test = True
            else:
                pass
        else:
            pass
    else:  # crawl by number of days
        cut_day = (datetime.datetime.now() - datetime.timedelta(days=time_long)).strftime('%Y-%m-%d')
        cut_day = int(cut_day.split('-')[0]) * 366 + int(cut_day.split('-')[1]) * 31 + int(cut_day.split('-')[2])
        time_day = int(time.split('-')[0]) * 366 + int(time.split('-')[1]) * 31 + int(time.split('-')[2])
        if (time_day >= cut_day):
            test = True
    return test
def connect_sql():  # open a MySQL connection using the credentials in settings.py
    conn = pymysql.connect(
        host=settings.get('MYSQL_HOST'),
        db=settings.get('MYSQL_DBNAME'),
        user=settings.get('MYSQL_USER'),
        passwd=settings.get('MYSQL_PASSWD'),
        charset='utf8mb4',
        use_unicode=True)
    return conn
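A quick illustration of what these helpers return, assuming (purely for the example) that today is 2018-10-25, JUDGEMENT_TYPE is False and TIME_INTERVAL is 10:

tieba_standard_time('10-24 23:41')   # -> '2018-10-24-23'  (month-day plus clock time)
tieba_standard_time('23:41')         # -> '2018-10-25-23'  (posted today)
tieba_standard_time('2017-10-24')    # -> '2017-10-24-00'  (older post with a full date)
time_save('10-24 23:41')             # -> '2018-10-24'     (format written to the database)
judge_tieba('2018-10-24-23')         # -> True, i.e. within the last 10 days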
  • Relevant parameters (settings.py)

# whether the crawl window is measured in days or hours: True = hours, False = days
JUDGEMENT_TYPE = False
# window (in hours or days, per JUDGEMENT_TYPE) for posts and their direct replies
TIME_INTERVAL = 10
# window in days for replies-to-replies (always days, regardless of JUDGEMENT_TYPE)
TIME_INTERVAL1 = 30
# MySQL username and password
# SQL_NAME = '#3##333#3#'
# SQL_PASSWORD = '123456'
# root directory of the project
FILE='C:/Users/离殇/Desktop/baidutieba'
  • Data deduplication and saving (pipelines.py)
import pymysql
from baidutieba.items import TiebaItem
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
import datetime
import pandas as pd
import numpy as np
types=settings.get('JUDGEMENT_TYPE')
cut =settings.get('TIME_INTERVAL')
conn = pymysql.connect(
                host = settings.get('MYSQL_HOST'),
                db = settings.get('MYSQL_DBNAME'),
                user = settings.get('MYSQL_USER'),
                passwd = settings.get('MYSQL_PASSWD'),
                charset='utf8mb4',
                use_unicode=True)
cur=conn.cursor()
# work out the time cut-off and load the recent records already in the database,
# so that duplicates are not inserted again
if(types):
    cut_time=(datetime.datetime.now()-datetime.timedelta(days=1)).strftime('%Y-%m-%d')
else:
    cut_time=(datetime.datetime.now()-datetime.timedelta(days=cut)).strftime('%Y-%m-%d')

sql_use='SELECT url,content,send_id FROM tieba WHERE time>=\''+cut_time+'\''
sql_stu_data = pd.read_sql(sql_use,conn)
train_data = np.array(sql_stu_data)
main_list = train_data.tolist()
de_latest_com=[]
for i in range(len(main_list)):
    de_latest_com.append(main_list[i][0]+main_list[i][1]+main_list[i][2])  # url + content + sender forms the dedup key
conn.commit()
conn.close()
class TiebaPipeline(object):
    def process_item(self, item, spider):
        cut_re = item['url'] + item['content'] + item['send_id']
        if not (cut_re in de_latest_com):   # skip records already in the database
            col = ''
            all_values = []
            for key in item:
                col = col + '`' + key + '`,'
            for values in item.values():
                all_values.append(values)
            col = col[0:(len(col) - 1)]
            placeholders = (len(item) - 1) * '%s,' + '%s'
            sql = 'INSERT INTO tieba'
            sql_lag = sql + '(' + col + ')VALUES(' + placeholders + ')'
            all_values = tuple(all_values)
            self.cur.execute(sql_lag, all_values)
        self.conn.commit()
        return item

    def open_spider(self, spider):  # called when the spider starts
        # connect to the database
        from baidutieba import settings
        self.conn = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8mb4',
            use_unicode=True)
        self.cur = self.conn.cursor()

    def close_spider(self, spider):  # called when the spider closes
        self.conn.close()
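For a fully populated TiebaItem, the parameterised statement built by process_item looks roughly like the line below (the column order follows the order of keys present in the item, so it may vary; pymysql handles quoting through the %s placeholders):

sql_lag = 'INSERT INTO tieba(`sch_id`,`send_id`,`time`,`url`,`content`,`reply_num`)VALUES(%s,%s,%s,%s,%s,%s)'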
  • Scraped fields (items.py)
import scrapy
class TiebaItem(scrapy.Item):  # item used to save a Tieba record
    # define the fields for your item here like:
    sch_id = scrapy.Field()     # school id
    send_id = scrapy.Field()    # sender: the poster or replier
    url = scrapy.Field()        # URL of the post
    time = scrapy.Field()       # posting time
    content = scrapy.Field()    # post content
    reply_num = scrapy.Field()  # number of replies

  • Request headers and cookies (middlewares.py)
import pymysql
import pandas as pd
import random
import json
import numpy as np
from scrapy.utils.project import get_project_settings
import sys
settings = get_project_settings()
#from scrapy.conf import settings
sys.path.append(settings.get('FILE')+'/baidutieba/baidutieba')
from function import connect_sql
class CookiesMiddleware():  # pick a random cookie from a local file and attach it to the request
    def process_request(self,request,spider):
        with open('C:/Users/离殇/Desktop/user_agent.txt','r',encoding='utf-8-sig') as f:
            main_list = f.readlines()
        cookies = random.choice(main_list)
        cookie = json.loads(cookies).get("SUB")
        request.cookies = {'SUB':cookie}
class RandomUserAgentMiddleware(object):  # pick a random User-Agent from the database for each request
    conn = pymysql.connect(host='127.0.0.1', user='root', password='123456', db='yuqingyuebao', charset="utf8mb4")
    cur = conn.cursor()
    user_agent = 'SELECT * FROM user_agent'
    sql_data = pd.read_sql(user_agent, conn)
    train_data = np.array(sql_data)
    main_list = train_data.tolist()
    USER_AGENT_LIST = []
    for row in range(len(main_list)):
        USER_AGENT_LIST.append(main_list[row][0])
    def process_request(self, request, spider):
        ua = random.choice(self.USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)

This spider relies on Baidu Tieba's relatively weak anti-scraping measures: it only sets request headers and applies a time cut-off to what it scrapes, and that window can be adjusted in settings.py. The spider follows multiple levels of replies, so it captures the information fairly completely. The data is then lightly preprocessed into a consistent format for database storage, and a direct connection to the database is established. There may still be things that need correcting, and I welcome any suggestions.
