Basics of the Scrapy framework
Installing Scrapy:
pip install scrapy
Alternatively, install it with conda. Personally I find conda more convenient: pip often needs to download extra packages and requires manual setup, while conda installs everything in one step:
conda install scrapy
The commands below are run in a cmd window after cd-ing to the working directory; create the Scrapy project in a dedicated folder so it is easy to manage:
scrapy startproject baidutieba  # create a project
scrapy crawl baidu              # run a Scrapy spider
scrapy shell "url"              # test-scrape a page in the shell
scrapy view "url"               # check whether the requested page is the content you want to crawl
Introduction to Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and storing historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or to build general-purpose web crawlers. Scrapy has broad uses, including data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows.
The Scrapy run flow is roughly:
First, the engine takes a URL from the scheduler for the next crawl.
The engine wraps the URL in a Request and passes it to the downloader; the downloader fetches the resource and wraps it in a Response.
The spider then parses the Response.
If it parses out an Item, the Item is handed to the item pipeline for further processing.
If it parses out a URL, the URL is handed back to the scheduler to await crawling.
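The loop above can be sketched in plain Python. This is a toy illustration of the flow only, with caller-supplied `download` and `parse` stand-ins; it is not Scrapy's actual engine:

```python
from collections import deque

def crawl(start_urls, download, parse):
    """Toy model of the Scrapy loop: scheduler -> downloader -> spider -> pipeline/scheduler."""
    scheduler = deque(start_urls)      # the scheduler holds pending URLs
    seen = set(start_urls)             # simple duplicate filter
    items = []                         # stand-in for the item pipeline
    while scheduler:
        url = scheduler.popleft()           # engine takes a URL from the scheduler
        response = download(url)            # downloader fetches and returns a "Response"
        for result in parse(response):      # spider parses the Response
            if isinstance(result, dict):    # an Item goes to the pipeline
                items.append(result)
            elif result not in seen:        # a URL goes back to the scheduler
                seen.add(result)
                scheduler.append(result)
    return items

# two fake pages: each URL maps to (links, item)
pages = {'a': (['b', 'a'], {'title': 'A'}), 'b': ([], {'title': 'B'})}

def parse_demo(response):
    links, item = response
    yield item
    for link in links:
        yield link

result = crawl(['a'], lambda u: pages[u], parse_demo)
# result -> [{'title': 'A'}, {'title': 'B'}]
```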
Baidu Tieba spider
- Directory layout
- File overview
function.py: holds extra helper functions.
items.py: defines the Item fields to save.
middlewares.py: adds and modifies request headers and cookies.
pipelines.py: data deduplication and saving.
settings.py: Scrapy offers a very large set of options here; it can also hold parameters you may need to change.
files under spiders/: the spider functions, i.e. the main crawling code. - Notes
5) Parameter settings
File to modify for these operations: settings.py
1. Tunable parameters and their comments:
2. Database credentials: change them to the database you read from and save to.
3. JUDGEMENT_TYPE controls the time window for crawling. When it is True, the window is measured in hours (e.g. with TIME_INTERVAL set to 2, only data from the last two hours is crawled); when it is False, the window is measured in days, so TIME_INTERVAL = 2 then means the last two days. The meaning of TIME_INTERVAL therefore depends on JUDGEMENT_TYPE.
It decides whether crawling is bounded by hours or by days: True means by hours, False means by days.
Both TIME_INTERVAL and TIME_INTERVAL1 set time cutoffs. The first applies to Tieba posts and their replies and is affected by JUDGEMENT_TYPE. The second applies to replies-to-replies: those pages are entered with a deliberately wider window so that no information is missed, and TIME_INTERVAL1 is always interpreted in days, unaffected by JUDGEMENT_TYPE.
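The windowing behavior described above can be sketched with a minimal hypothetical helper (`in_window` is not part of the project; the current time is passed in explicitly so the example is deterministic):

```python
from datetime import datetime, timedelta

def in_window(post_time, now, by_hour, interval):
    """Mirror of the JUDGEMENT_TYPE / TIME_INTERVAL logic:
    by_hour=True  -> keep posts from the last `interval` hours,
    by_hour=False -> keep posts from the last `interval` days."""
    delta = timedelta(hours=interval) if by_hour else timedelta(days=interval)
    return now - post_time <= delta

now = datetime(2018, 10, 25, 14, 0)
post = datetime(2018, 10, 25, 13, 0)
in_window(post, now, by_hour=True, interval=2)                        # True: 1 hour ago
in_window(post - timedelta(days=3), now, by_hour=False, interval=2)   # False: 3 days ago
```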
Scrapy project directory
FILE = 'D:\College_sensation\web_crawler\baidutieba'
This is the directory containing the project. Because function.py holds external helper functions that cannot be imported directly, the path to the Scrapy project must be provided.
Request headers are set in middlewares.py; the fields to save are defined in items.py.
- Spider code (crawling the data)
import re
import sys

import scrapy
from scrapy import Request, Selector
from scrapy.utils.project import get_project_settings

from baidutieba.items import TiebaItem

settings = get_project_settings()
sys.path.append(settings.get('FILE') + '/baidutieba/baidutieba')
from function import open_url
from function import tieba_standard_time
from function import judge_tieba
from function import time_save
from function import judge_tieba1

TiebaItem = TiebaItem()  # shared item instance (note: this shadows the class name)


class baidu_spider(scrapy.Spider):
    name = 'baidu'
    start_urls = ['https://tieba.baidu.com/mo/q---32FEF31A4E9D438CA4A1B75C0C58874A%3AFG%3D1--1-3-0--2--wapp_1540453380315_486/m?kw=%E6%B9%96%E5%8C%97%E7%BB%8F%E6%B5%8E%E5%AD%A6%E9%99%A2&lp=6024']

    def parse(self, response):
        # id = response.meta['news_type']
        # if (id == 'A'):
        id = 'A'
        selecter = Selector(text=response.body)
        url = selecter.xpath('//div[@class="i"]/a/@href').extract()  # post links to crawl
        url_prefix = 'https://tieba.baidu.com/mo/q---32FEF31A4E9D438CA4A1B75C0C58874A%3AFG%3D1--1-3-0--2--wapp_1540453380315_486/'  # URL prefix
        title = selecter.xpath('//div[@class="i"]/a/text()').extract()  # post titles
        # top_tieba = ['-'] * len(title)
        titlesum = selecter.xpath('//div[@class="i"]/p/text()').extract()
        url_next1 = selecter.xpath('//div/form/div[@class="bc p"]/a/@href').extract()  # link to the next page
        url_next = url_prefix + url_next1[0]  # assemble the next-page URL
        time = []
        attention_rate = []  # reply count, used as an attention metric
        j = 0
        try:
            top_tieba = selecter.xpath('//div[@class="i"]/span[last()]/text()').extract()  # detect pinned posts
            j = len(top_tieba)
        except Exception:
            top_tieba = []
        for i in range(len(title)):  # check whether each post is pinned; pinned posts are cleaned but not crawled further
            if i < j:
                title[i] = title[i].split("\xa0")[1]  # clean the title
                time.append(titlesum[i].split("\xa0")[2])  # post time, used as the crawl cutoff
                attention_rate.append(titlesum[i].split("\xa0")[1][1:])  # reply count as attention, with initial cleaning
                attention_rate[i] = int(attention_rate[i])
            else:
                title[i] = title[i].split("\xa0")[1]  # clean the title
                time.append(titlesum[i].split("\xa0")[2])  # post time, used as the crawl cutoff
                attention_rate.append(titlesum[i].split("\xa0")[1][1:])  # reply count as attention, with initial cleaning
                attention_rate[i] = int(attention_rate[i])
                url[i] = url_prefix + url[i]
                if judge_tieba(tieba_standard_time(time[i])) and attention_rate[i]:
                    # pass along the school id, post URL, reply count, and URL prefix
                    yield Request(url[i], meta={'sch_id': id, 'url': url[i], 'reply_num': attention_rate[i], 'url_prefix': url_prefix}, callback=self.parse_content1)
        if judge_tieba(tieba_standard_time(time[-1])):
            if url_next:  # crawl the next page
                yield Request(url_next, meta={'sch_id': id, 'url': url[i]}, callback=self.parse)

    def parse_content1(self, response):  # crawl a post and its replies
        sch_id = response.meta['sch_id']
        url = response.meta['url']
        url_prefix = response.meta['url_prefix']
        reply_num = response.meta['reply_num']
        selecter1 = Selector(text=response.body)
        tieba_name = selecter1.xpath('//div[@class="d"]/div/span/a/text()').extract()  # usernames of repliers under the post
        tieba_time = selecter1.xpath('//div[@class="d"]/div/span[@class="b"]/text()').extract()  # reply times
        tieba_content1 = selecter1.xpath('//div[@class="d"]/div[@class="i"]').xpath('string(.)').extract()  # reply content
        tieba_content2 = selecter1.xpath('//div[@class="d"]/div[@class="i"]/span[@class="g"]/a/text()').extract()  # strings used to clean the reply content
        tieba_content_2 = selecter1.xpath('//div[@class="d"]/div[@class="i"]/a/text()').extract()  # replies to replies
        reply2_url1 = selecter1.xpath('//div[@class="d"]/div[@class="i"]/a/@href').extract()  # links to replies to replies
        tieba_content_next1 = selecter1.xpath('//div[@class="d"]/form/div/a/@href').extract()  # link to the next page of replies
        tieba_content_next = 0
        if tieba_content_next1:
            tieba_content_next = url_prefix + tieba_content_next1[0]  # URL of the next page of replies
        pinlunhuifushu = [0] * len(tieba_name)  # reply counts under each reply (currently unused)
        reply2_url = [''] * len(tieba_name)
        tieba_content1[0] = tieba_content1[0].split("\xa0")[0].split(tieba_content2[0])[0]  # initial content cleaning
        tieba_content1[0] = re.split(r'楼. ', tieba_content1[0])[1]  # the first entry is the original post, so it has no reply link
        if judge_tieba(tieba_standard_time(tieba_time[0])):  # the first entry is the original post
            TiebaItem['sch_id'] = sch_id
            TiebaItem['send_id'] = tieba_name[0]
            TiebaItem['time'] = time_save(tieba_time[0])
            TiebaItem['url'] = url
            TiebaItem['content'] = tieba_content1[0]
            TiebaItem['reply_num'] = reply_num
            yield TiebaItem
        for i in range(1, len(tieba_name)):
            tieba_content1[i] = tieba_content1[i].split("\xa0")[0].split(tieba_content2[i])[0]  # initial content cleaning
            tieba_content1[i] = re.split(r'楼. ', tieba_content1[i])[1]
            if judge_tieba(tieba_standard_time(tieba_time[i])):
                TiebaItem['sch_id'] = sch_id
                TiebaItem['send_id'] = tieba_name[i]
                TiebaItem['time'] = time_save(tieba_time[i])
                TiebaItem['url'] = url
                TiebaItem['content'] = tieba_content1[i]
                TiebaItem['reply_num'] = reply_num
                yield TiebaItem
            if len(tieba_content_2[i - 1]) > 2:
                # pinlunhuifushu[i] = re.findall(r'[^()]+', tieba_content_2[i - 1])[1]  # reply count under this reply
                # pinlunhuifushu[i] = int(pinlunhuifushu[i])
                reply2_url[i] = url_prefix + reply2_url1[i - 1]
                if judge_tieba1(tieba_standard_time(tieba_time[i])):
                    yield Request(reply2_url[i], meta={'sch_id': sch_id, 'tieba_replyname': tieba_name[0], 'reply2_url': reply2_url[i], 'reply_num': reply_num, 'url_prefix': url_prefix, 'url': url}, callback=self.parse_content2)
        if tieba_content_next:
            yield Request(tieba_content_next, meta={'sch_id': sch_id, 'tieba_content_next': tieba_content_next, 'tieba_replyname': tieba_name[0], 'reply_num': reply_num, 'url_prefix': url_prefix, 'url': url}, callback=self.parse_content1)

    def parse_content2(self, response):  # crawl replies to replies
        sch_id = response.meta['sch_id']
        url = response.meta['url']
        url_prefix = response.meta['url_prefix']
        reply_num = response.meta['reply_num']
        tieba_replyname = response.meta['tieba_replyname']
        selecter2 = Selector(text=response.body)
        tieba_replycontent = selecter2.xpath('//div[@class="m t"]/div[@class="i"]').xpath('string(.)').extract()  # reply content
        # tieba_replyname = selecter2.xpath('//div[@class="m t"]/div[@class="i"]/a[1]/text()').extract()
        tieba_replyname1 = selecter2.xpath('//div[@class="m t"]/div[@class="i"]/br/following::a[1]/@href').extract()
        reply_time = selecter2.xpath('//div[@class="m t"]/div[@class="i"]/span/text()').extract()
        try:
            reply_next = selecter2.xpath('//div[@class="h"]/a/@href').extract()
        except Exception:
            reply_next = []
        for i in range(len(tieba_replycontent)):
            tieba_replyname1[i] = tieba_replyname1[i].split("i?un=")[1]
            if tieba_replyname1[i] == '':  # an empty name means the original poster
                # tieba_replyname1[i] = '贴吧用户_QWUAW5a '  # fill an empty name with the poster's name
                tieba_replyname1[i] = tieba_replyname
                tieba_replycontent[i] = tieba_replycontent[i].split("\xa0")[0]  # initial content cleaning
            else:
                tieba_replycontent[i] = tieba_replycontent[i].split(tieba_replyname1[i])[0]
                tieba_replycontent[i] = tieba_replycontent[i].replace('\xa0', '')
            if judge_tieba(tieba_standard_time(reply_time[i])):
                TiebaItem['sch_id'] = sch_id
                TiebaItem['send_id'] = tieba_replyname1[i]
                TiebaItem['time'] = time_save(reply_time[i])
                TiebaItem['url'] = url
                TiebaItem['content'] = tieba_replycontent[i]
                TiebaItem['reply_num'] = reply_num
                yield TiebaItem
        if len(reply_next):
            reply_next = url_prefix + reply_next[0]  # assemble the next-page URL
            yield Request(reply_next, meta={'sch_id': sch_id, 'reply2_urlnext': reply_next, 'tieba_replyname': tieba_replyname, 'reply_num': reply_num, 'url_prefix': url_prefix, 'url': url}, callback=self.parse_content2)
- Helper functions (self-written)
import datetime
import random
import re

import numpy as np
import pandas as pd
import pymysql
import requests
from bs4 import BeautifulSoup
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


def open_url():
    conn = connect_sql()
    sql_use = 'SELECT news_id,news_url FROM news_url'
    sql_stu_data = pd.read_sql(sql_use, conn)
    train_data = np.array(sql_stu_data)  # np.ndarray
    main_list = train_data.tolist()  # list
    urs_list = []
    for i in range(len(main_list)):
        ids = main_list[i][0]
        url = main_list[i][1]
        urs_list.append({'news_type': ids, 'url': url})
    return urs_list


# normalize timestamps
def tieba_standard_time(clean_time):  # time normalization
    if ":" in clean_time:
        # posted within the last day?
        if len(clean_time) > 5:
            return datetime.datetime.now().strftime('%Y') + '-' + clean_time.split(' ')[0] + '-' + clean_time.split(' ')[1].split(':')[0]
        else:
            return datetime.datetime.now().strftime('%Y-%m-%d') + '-' + clean_time.split(':')[0]
    elif len(clean_time) <= 5:  # month-day only: posted this year
        return datetime.datetime.now().strftime('%Y') + '-' + clean_time + '-00'
    else:
        return clean_time + '-00'


def time_save(time):
    if ":" in time:
        # posted within the last day?
        if len(time) > 5:
            time = datetime.datetime.now().strftime('%Y') + '-' + time.split(' ')[0]
        else:
            time = datetime.datetime.now().strftime('%Y-%m-%d')
    elif len(time) <= 5:  # month-day only: posted this year
        time = datetime.datetime.now().strftime('%Y') + '-' + time
    else:
        pass
    return time


# decide whether a timestamp falls inside the crawl window (always in days)
def judge_tieba1(time):
    time_long = settings.get("TIME_INTERVAL1")
    test = False
    today = datetime.datetime.now().strftime('%Y-%m-%d')
    time_day = time[:-3]
    time_hour = time.split(time_day)[1].split('-')[1]
    # windowed by duration in days; TIME_INTERVAL1 is not affected by JUDGEMENT_TYPE
    cut_day = (datetime.datetime.now() - datetime.timedelta(days=time_long)).strftime('%Y-%m-%d')
    cut_day = int(cut_day.split('-')[0]) * 366 + int(cut_day.split('-')[1]) * 31 + int(cut_day.split('-')[2])
    time_day = int(time.split('-')[0]) * 366 + int(time.split('-')[1]) * 31 + int(time.split('-')[2])
    if time_day >= cut_day:
        test = True
    return test


# decide whether a timestamp falls inside the crawl window (hours or days)
def judge_tieba(time):
    judge_day = settings.get("JUDGEMENT_TYPE")
    time_long = settings.get("TIME_INTERVAL")
    test = False
    today = datetime.datetime.now().strftime('%Y-%m-%d')
    time_day = time[:-3]
    time_hour = time.split(time_day)[1].split('-')[1]
    if judge_day:  # windowed by hours
        hour = datetime.datetime.now().strftime('%H')
        yestday = (datetime.datetime.now() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
        if today == time_day:  # handles the window around 23:00 and just after midnight
            if int(hour) - int(time_hour) <= time_long:
                test = True
        elif yestday == time_day and int(hour) < time_long:  # an early-morning crawl reaching back to the previous night
            if 24 - int(time_hour) <= time_long - int(hour):
                test = True
    else:  # windowed by duration in days
        cut_day = (datetime.datetime.now() - datetime.timedelta(days=time_long)).strftime('%Y-%m-%d')
        cut_day = int(cut_day.split('-')[0]) * 366 + int(cut_day.split('-')[1]) * 31 + int(cut_day.split('-')[2])
        time_day = int(time.split('-')[0]) * 366 + int(time.split('-')[1]) * 31 + int(time.split('-')[2])
        if time_day >= cut_day:
            test = True
    return test


def connect_sql():
    conn = pymysql.connect(
        host=settings.get('MYSQL_HOST'),
        db=settings.get('MYSQL_DBNAME'),
        user=settings.get('MYSQL_USER'),
        passwd=settings.get('MYSQL_PASSWD'),
        charset='utf8mb4',
        use_unicode=True)
    return conn
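The day comparison in `judge_tieba` maps dates to `year*366 + month*31 + day`, a rough ordinal. A sketch of the same cutoff check done with `datetime.date.toordinal` (an alternative, not what the code above uses) shows the difference; note the approximation can misorder dates across a year boundary, which exact ordinals avoid:

```python
from datetime import date, timedelta

def rough_ordinal(y, m, d):
    """The approximation used in judge_tieba: year*366 + month*31 + day."""
    return y * 366 + m * 31 + d

def within_days(post, cutoff_days, today):
    """The same cutoff check done with exact date ordinals."""
    return post.toordinal() >= (today - timedelta(days=cutoff_days)).toordinal()

today = date(2018, 10, 25)
within_days(date(2018, 10, 24), 2, today)   # True: inside the 2-day window
within_days(date(2018, 10, 1), 2, today)    # False: outside it
# The approximation misorders dates across a year boundary:
rough_ordinal(2019, 1, 1) < rough_ordinal(2018, 12, 31)   # True
```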
- Parameter settings
# whether the crawl window is measured in hours or days: True = hours, False = days
JUDGEMENT_TYPE = False
# time window for crawling posts and the replies under them
TIME_INTERVAL = 10
TIME_INTERVAL1 = 30
# MySQL database name and password
# SQL_NAME = '#3##333#3#'
# SQL_PASSWORD = '123456'  # database name
# project root directory
FILE = 'C:/Users/离殇/Desktop/baidutieba'
- Pipeline (pipelines.py: deduplication and saving)
import datetime

import numpy as np
import pandas as pd
import pymysql
from scrapy.utils.project import get_project_settings

from baidutieba.items import TiebaItem

settings = get_project_settings()
types = settings.get('JUDGEMENT_TYPE')
cut = settings.get('TIME_INTERVAL')
conn = pymysql.connect(
    host=settings.get('MYSQL_HOST'),
    db=settings.get('MYSQL_DBNAME'),
    user=settings.get('MYSQL_USER'),
    passwd=settings.get('MYSQL_PASSWD'),
    charset='utf8mb4',
    use_unicode=True)
cur = conn.cursor()
if types:
    cut_time = (datetime.datetime.now() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
else:
    cut_time = (datetime.datetime.now() - datetime.timedelta(days=cut)).strftime('%Y-%m-%d')
sql_use = 'SELECT url,content,send_id FROM tieba WHERE time>=\'' + cut_time + '\''
sql_stu_data = pd.read_sql(sql_use, conn)
train_data = np.array(sql_stu_data)  # np.ndarray
main_list = train_data.tolist()  # list
de_latest_com = []  # recent records already in the database, used for deduplication
for i in range(len(main_list)):
    de_latest_com.append(main_list[i][0] + main_list[i][1] + main_list[i][2])
conn.commit()
conn.close()


class TiebaPipeline(object):
    def process_item(self, item, spider):
        cut_re = item['url'] + item['content'] + item['send_id']
        if not (cut_re in de_latest_com):
            col = ''
            all_values = []
            for key in item:
                col = col + '`' + key + '`,'
            for values in item.values():
                all_values.append(values)
            col = col[0:(len(col) - 1)]  # drop the trailing comma
            placeholders = (len(item) - 1) * '%s,' + '%s'
            sql = 'INSERT INTO tieba'
            sql_lag = sql + '(' + col + ')VALUES(' + placeholders + ')'
            all_values = tuple(all_values)
            # all_values = '(' + all_values[0:(len(all_values) - 1)] + ')'
            self.cur.execute(sql_lag, all_values)
            self.conn.commit()
        # return item

    def open_spider(self, spider):  # called when the spider starts
        # connect to the database
        from baidutieba import settings
        self.conn = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8mb4',
            use_unicode=True)
        self.cur = self.conn.cursor()

    def close_spider(self, spider):  # called when the spider closes
        self.conn.close()
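The column/placeholder assembly in process_item can be sketched standalone with a plain dict (the table name and field values here are illustrative; this only builds the SQL string and value tuple, without touching MySQL):

```python
def build_insert(table, item):
    """Build a parameterized INSERT statement plus its value tuple from a dict-like item."""
    cols = ', '.join('`%s`' % key for key in item)      # backtick-quoted column list
    placeholders = ', '.join(['%s'] * len(item))        # one %s per column
    sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, cols, placeholders)
    return sql, tuple(item.values())

sql, values = build_insert('tieba', {'sch_id': 'A', 'send_id': 'user1', 'content': 'hi'})
# sql    -> "INSERT INTO tieba (`sch_id`, `send_id`, `content`) VALUES (%s, %s, %s)"
# values -> ('A', 'user1', 'hi')
```

Passing `values` separately from `sql` (as `self.cur.execute(sql_lag, all_values)` does above) lets the driver escape the data, rather than pasting it into the SQL string.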
- Crawled fields
import scrapy


class TiebaItem(scrapy.Item):  # item holding one Tieba record
    # define the fields for your item here like:
    sch_id = scrapy.Field()     # school id
    send_id = scrapy.Field()    # sender (the poster)
    url = scrapy.Field()        # link to the original post
    time = scrapy.Field()       # post time
    content = scrapy.Field()    # post content
    reply_num = scrapy.Field()  # reply count
- Request-header settings file (middlewares)
import json
import random
import sys

import numpy as np
import pandas as pd
import pymysql
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
sys.path.append(settings.get('FILE') + '/baidutieba/baidutieba')
from function import connect_sql


class CookiesMiddleware():  # pick a random cookie (read from a local file) and attach it to the request
    def process_request(self, request, spider):
        with open('C:/Users/离殇/Desktop/user_agent.txt', 'r', encoding='utf-8-sig') as f:
            main_list = f.readlines()
        cookies = random.choice(main_list)
        cookie = json.loads(cookies).get("SUB")
        request.cookies = {'SUB': cookie}


class RandomUserAgentMiddleware(object):
    conn = pymysql.connect(host='127.0.0.1', user='root', password='123456', db='yuqingyuebao', charset="utf8mb4")
    cur = conn.cursor()
    user_agent = 'SELECT * FROM user_agent'
    sql_data = pd.read_sql(user_agent, conn)
    train_data = np.array(sql_data)  # np.ndarray
    main_list = train_data.tolist()  # list
    USER_AGENT_LIST = []
    for row in range(len(main_list)):
        USER_AGENT_LIST.append(main_list[row][0])

    def process_request(self, request, spider):
        ua = random.choice(self.USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)
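Stripped of the database, the random User-Agent idea reduces to this sketch (the UA strings and the `pick_user_agent` helper are illustrative placeholders, not part of the project):

```python
import random

def pick_user_agent(headers, ua_list):
    """Set a random User-Agent on a headers dict unless one is already present."""
    ua = random.choice(ua_list)
    if ua:
        headers.setdefault('User-Agent', ua)
    return headers

uas = ['Mozilla/5.0 (Windows NT 10.0)', 'Mozilla/5.0 (X11; Linux x86_64)']
headers = pick_user_agent({}, uas)
headers['User-Agent'] in uas  # True
```

Using `setdefault`, as the middleware above does, means an explicitly configured User-Agent is never overwritten.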
This spider was written on the assumption that Baidu Tieba's anti-crawling measures are fairly weak, so it only throttles request timing. Crawling is bounded by a time cutoff, which can be adjusted in settings.py. The spider follows multiple levels of replies, so it captures the information fairly completely; it then lightly preprocesses the data into a consistent format for database operations and writes directly to the database. There may well be things that need correcting, and I would welcome any feedback.