Jobbole (伯乐在线)
1. Crawl all links under the subdomain and avoid duplicates.
2. The latest-articles page all-posts/page/22 lists every article.
Analyze the URL list and keep an eye on the "next page" link.
Install Scrapy
***********
Method 1: install wheel, pywin32, etc. See https://www.cnblogs.com/dalyday/p/9277212.html; the write-up is thorough, but I didn't use it. It requires downloading wheels such as pywin32-227-cp37-cp37m-win_amd64.whl and so on.
Method 2: another post, https://www.cnblogs.com/airnew/p/10152438.html, installs the dependencies in this order: lxml -> zope.interface -> pyopenssl -> twisted -> scrapy.
My method: I simply switched the PyPI index in PyCharm from the Douban mirror to the Tsinghua mirror and everything installed fine. The Douban mirror is missing many packages and the official index is too slow from here; the Tsinghua mirror works much better...
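If you install from the command line instead of PyCharm, the equivalent is to point pip at the Tsinghua mirror (the mirror URL below is the well-known public one, not something taken from these notes):
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple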
************
In File -> Settings -> Project Interpreter, click the + button on the right, search for scrapy, and install it.
In the Terminal panel at the bottom left, run scrapy startproject ArticleSpider; the project is created successfully:
(venv) D:\zz\PycharmProjects\test>scrapy startproject ArticleSpider
New Scrapy project 'ArticleSpider', using template directory 'd:\zz\pycharmprojects\test\venv\lib\site-packages\scrapy\templates\project', created in:
D:\zz\PycharmProjects\test\ArticleSpider
You can start your first spider with:
cd ArticleSpider
scrapy genspider example example.com
Generated project layout
ArticleSpider/ : the project's Python module; code is imported from here
__init__.py : required, even though the file is empty
ArticleSpider/items.py : the project's item definitions (the data to store)
ArticleSpider/pipelines.py : the project's pipeline file
ArticleSpider/settings.py : the project's settings file
ArticleSpider/middlewares.py : the project's middleware file
ArticleSpider/spiders/ : the directory that holds the spider code
scrapy.cfg : project configuration, similar to Django's
Create a spider for jobbole inside spiders:
cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass
Tip: under File -> Settings -> File and Code Templates you can set the encoding and author for new Python scripts.
scrapy crawl jobbole runs the spider and performs the crawl:
(venv) D:\zz\PycharmProjects\test\ArticleSpider>scrapy crawl jobbole
2020-02-04 10:24:29 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ArticleSpider)
2020-02-04 10:24:29 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-02-04 10:24:29 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'ArticleSpider', 'NEWSPIDER_MODULE': 'ArticleSpider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['ArticleSpider.spiders']}
2020-02-04 10:24:29 [scrapy.extensions.telnet] INFO: Telnet Password: 4f68092a27495123
2020-02-04 10:24:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-02-04 10:24:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-04 10:24:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-04 10:24:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-04 10:24:30 [scrapy.core.engine] INFO: Spider opened
2020-02-04 10:24:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-04 10:24:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-04 10:24:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/robots.txt> (referer: None)
2020-02-04 10:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/> (referer: None)
2020-02-04 10:24:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-04 10:24:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 440,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1736,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 1.154336,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 2, 4, 2, 24, 31, 512274),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 2, 4, 2, 24, 30, 357938)}
2020-02-04 10:24:31 [scrapy.core.engine] INFO: Spider closed (finished)
To debug the crawl above, create a main.py under the ArticleSpider project root:
__author__ = 'zz'
# debug entry point
from scrapy.cmdline import execute
# execute lets us run scrapy commands from a script, which makes debugging easy
import sys
# add the project directory to sys.path so execute can find the project
# sys.path.append("D:\zz\PycharmProjects\test")
# hard-coded path, breaks if the project is moved
import os
# print(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# os.path.abspath(__file__) is the absolute path of this file,
# os.path.dirname(...) is its parent directory
execute(["scrapy", "crawl", "jobbole"])
You also need to set ROBOTSTXT_OBEY = False in settings.py.
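The relevant line in settings.py (Scrapy generates it set to True by default):
# Obey robots.txt rules
ROBOTSTXT_OBEY = False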
Debug main.py (the file above); with a breakpoint on the pass line in jobbole.py's parse, execution stops there and you can inspect the contents of response.
XPath
1. XPath overview
2. XPath terminology
3. XPath syntax
1. XPath overview
1) XPath uses path expressions to navigate XML and HTML documents
2) XPath includes a standard function library
3) XPath is a W3C standard
2. XPath node relationships
1) parent node
2) child node
3) sibling node
4) ancestor node
5) descendant node
XPath syntax (a short, runnable example follows this list)
article                   selects all child nodes of the article element
/article                  selects the root element article
article/a                 selects all a elements that are children of article
//div                     selects all div elements, wherever they appear in the document
article//div              selects all div elements that are descendants of article, at any depth
//@class                  selects all attributes named class
/article/div[1]           the first div under article
/article/div[last()]      the last div
/article/div[last()-1]    the second-to-last div
//div[@lang]              selects all div elements that have a lang attribute
//div[@lang='eng']        selects all div elements whose lang attribute equals eng
/div/*                    all child nodes of div
//*                       selects all elements
//div[@*]                 selects all div elements that have at least one attribute
//div/a | //div/p         selects all a and p elements that are children of div elements
//span | //ul             selects all span and ul elements
article/div/p | //span    the p under article's div, plus all span elements in the document
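A quick way to try these expressions is scrapy's Selector class against a throwaway HTML string (the snippet below is made up for illustration; expected output is in the comments):
from scrapy import Selector

html = '<article><div lang="eng" class="entry"><a href="/a">first</a><p>hello</p></div><div><p>bye</p></div></article>'
sel = Selector(text=html)
print(sel.xpath('//div[@lang="eng"]/a/text()').extract())                    # ['first']
print(sel.xpath('/html/body/article/div[last()]/p/text()').extract_first())  # 'bye'  (Selector wraps the snippet in html/body)
print(sel.xpath('//@class').extract())                                       # ['entry']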
class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    # start_urls = ['http://blog.jobbole.com/']
    start_urls = ['http://blog.jobbole.com/caijing/gsyw/89113/']
    # start_urls = ['https://zz.58.com/shoujidiannaoshumahuishou/39393312698905x.shtml?adtype=1&adact=3&psid=135555298207200372225374949&iuType=b_1&link_abtest=&ClickID=1&PGTID=0d300024-0015-6db4-56f5-8b74c976468b&slot=1000106']

    def parse(self, response):
        print(response.body)
        xpath = '/html/body/div[3]/div[2]/div[1]/h2'
        # Firefox copies a full-path xpath, Chrome copies a shorter keyed xpath; full paths break
        # on JS-generated markup, so Chrome's version is recommended
        xpath = '//*[@id="basicinfo"]/div[1]/h1/text()'
        xpath = '//div[@class="entry-header"]/div[1]/h1/text()'
        # text() extracts the text content
        re_selector = response.xpath(xpath)
This was meant for testing, but jobbole has been hammered to death... it now has anti-crawling in place: YunSuoAutoJump(), i.e. the YunSuo web firewall. Cracking that deserves a separate post, so for now I switched to a 58.com page to keep testing. Both full-path and keyed XPaths work as long as they are written correctly, and text() extracts text without problems. By the way, refresh too often and 58 blocks you as well...
***************
In Python 3 the xrange function was removed and range now behaves like the old xrange (it returns a lazy sequence instead of a list), so just use range directly.
***************
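A quick check of that behavior (plain Python 3, nothing project-specific):
r = range(5)
print(r)        # range(0, 5) - lazy, not a list
print(list(r))  # [0, 1, 2, 3, 4]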
In the terminal:
scrapy shell http://blog.jobbole.com/caijing/gsyw/89113/
# this opens an interactive shell on that page, so you don't have to hit the site over and over...
title = response.xpath('//div[@class="entry-header"]/div[1]/h1/text()')
Typing title shows the selector it matched.
title.extract() returns a list.
title.extract()[0] returns the first element.
create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]')
create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()
create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip()
create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace(";","").strip()
praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
match_re = re.match(r".*?(\d+).*", fav_nums)   # non-greedy .*? so the whole number is captured
if match_re:
    fav_nums = match_re.group(1)
comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
match_re = re.match(r".*?(\d+).*", comment_nums)
if match_re:
    comment_nums = match_re.group(1)
content = response.xpath("//div[@class='entry']").extract()[0]
tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
tags = ",".join(tag_list)
tag_list = ['a', 'b', '评论', 'c']
tag_list = [element for element in tag_list if not element.strip().endswith("评论")]  # drops every item that ends with "评论"
tags = ",".join(tag_list)
CSS selectors (a short example follows this list)
*                    selects all nodes
#container           selects the node with id container
.container           selects all nodes whose class contains container
li a                 selects all a nodes inside li nodes
ul + p               selects the first p element immediately after a ul
div#container > ul   selects the ul elements that are direct children of the div with id container
ul ~ p               selects all p siblings that come after a ul
a[title]             selects all a elements that have a title attribute
a[href="http://jobbole.com"]   selects all a elements whose href is exactly http://jobbole.com
a[href*="jobbole"]   selects all a elements whose href contains jobbole
a[href^="http"]      selects all a elements whose href starts with http
a[href$=".jpg"]      selects all a elements whose href ends with .jpg
input[type=radio]:checked      selects the checked radio input
div:not(#container)  selects all div elements whose id is not container
li:nth-child(3)      selects every li that is the third child of its parent
tr:nth-child(2n)     selects the even-numbered tr rows
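The same trick works for CSS selectors, again with a made-up HTML snippet and the expected output in the comments:
from scrapy import Selector

html = '<div id="container"><ul><li><a href="http://blog.jobbole.com/1.jpg" title="t">pic</a></li></ul><p>after</p></div>'
sel = Selector(text=html)
print(sel.css('div#container > ul li a::attr(href)').extract_first())  # 'http://blog.jobbole.com/1.jpg'
print(sel.css('a[href$=".jpg"]::text').extract())                      # ['pic']
print(sel.css('ul ~ p::text').extract())                               # ['after']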
**************************************
def parse_detail(self, response):
    front_image_url = response.meta.get("front_image_url", "")
    # cover image of the article
    title = response.css('.entry-header h1::text').extract_first()  # ::text selects the text inside h1; without it you get the raw markup
    create_date = response.css('.entry-meta-hide-on-mobile::text').extract_first().strip().replace('·', '').strip()
    praise_num = response.css('h10::text').extract_first()
    fav_num = response.css('.bookmark-btn::text').extract_first()
    comment_num = response.css('a[href="#article-comment"] span::text').extract_first()
    math_re = re.match(".*?(\d+).*", fav_num)
    if math_re:
        fav_num = int(math_re.group(1))
    else:
        fav_num = 0
    math_re = re.match(".*?(\d+).*", comment_num)
    if math_re:
        comment_num = int(math_re.group(1))
    else:
        comment_num = 0
    content = response.css('.entry').extract_first()
    tag_list = response.css('.entry-meta-hide-on-mobile a::text').extract()
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tag_list = ",".join(tag_list)
    pass
Crawl articles starting from the list page
"""
1. Extract the article URLs from the list page and hand them to scrapy to download, then parse the individual fields in the detail callback.
2. Extract the next-page URL and hand it to scrapy to download; when it finishes, pass it back to parse.
"""
# parse all article urls on the list page
import scrapy
import re
# regular expressions
from scrapy.http import Request
# Request is used to issue new requests
from urllib import parse
# for joining urls; in py2 this was import urlparse


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            yield Request(url=parse.urljoin(response.url, post_url), meta={"front_image_url": image_url}, callback=self.parse_detail)
            # the cover image url travels in meta; urljoin turns relative urls into absolute ones
        # earlier version without the cover image, kept for reference:
        # post_urls = response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract()
        # for post_url in post_urls:
        #     yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail)
        #     # Request(url=post_url, callback=self.parse_detail)

        # extract the next page and hand it back to scrapy
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
items.py
import scrapy


class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()              # Field is the only field type scrapy provides
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()      # normalized form of the url (md5)
    front_image_url = scrapy.Field()    # cover image url
    front_image_path = scrapy.Field()   # local path of the downloaded cover image
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()
    content = scrapy.Field()
Import the item into jobbole.py
from ArticleSpider.items import JobBoleArticleItem
import datetime


def parse_detail(self, response):  # reworked version
    article_item = JobBoleArticleItem()
    article_item['title'] = title
    article_item['url'] = response.url
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    article_item['create_date'] = create_date
    article_item['front_image_url'] = [front_image_url]
    # the images pipeline configured in settings expects a list of urls, hence the []
    article_item['praise_nums'] = praise_nums
    article_item['comment_nums'] = comment_nums
    article_item['fav_nums'] = fav_nums
    article_item['tags'] = tags
    article_item['content'] = content
    yield article_item
pipelines.py (mainly for persisting data): the item arrives here, nothing needs to change yet.
class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item
But settings.py needs one change: uncomment the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
}
To download images automatically from their urls, change the above to:
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,   # lower number = higher priority, runs first
}
IMAGES_URLS_FIELD = "front_image_url"
import os   # needed at the top of settings.py for the path join below
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
If the PIL library is missing, the image download fails; install pillow, and change
article_item['front_image_url'] = front_image_url to article_item['front_image_url'] = [front_image_url]
To save the local path of each downloaded image and link it back to the item:
from scrapy.pipelines.images import ImagesPipeline


class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item


class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        if "front_image_url" in item:  # only handle items that actually have an image
            for ok, value in results:
                image_file_path = value["path"]
            item["front_image_path"] = image_file_path
        return item
        # results also contains the saved file path and the original url
To filter images during conversion (skip ones that are too small), configure the thresholds in settings.py:
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100
Change settings.py to use the newly created ArticleImagePipeline:
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,   # replaced by the custom pipeline below
    'ArticleSpider.pipelines.ArticleImagePipeline': 1,
}
MD5 hashing
__author__ = 'zz'
import hashlib


def get_md5(url):
    if isinstance(url, str):  # a str (unicode) must be encoded to bytes before hashing
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()


if __name__ == "__main__":
    print(get_md5("http://jobbole.com"))  # get_md5 handles the encoding itself
In parse_detail in jobbole.py:
article_item["url_object_id"] = get_md5(response.url)
With that done, we're ready to write to the database.
pipelines.py: write items to a json file
import codecs
# codecs handles the file encoding for us
import json


# export to a file we manage ourselves
class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open("article.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        self.file.close()
Every time you test a different pipeline, update ITEM_PIPELINES in settings.py accordingly:
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.JsonWithEncodingPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'ArticleSpider.pipelines.ArticleImagePipeline': 1,
}
from scrapy.exporters import JsonItemExporter
# use the json exporter that scrapy provides


class JsonExporterPipleLine(object):
    def __init__(self):
        self.file = open("article.json", "wb")
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Set up the database table in MySQL.
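The notes don't include the table definition, so here is a minimal sketch of jobbole_article covering the columns used in the insert statements below; the column types and the url_object_id primary key are my assumption:
CREATE TABLE jobbole_article (
    url_object_id VARCHAR(50) NOT NULL PRIMARY KEY,   -- md5 of the url
    title VARCHAR(200) NOT NULL,
    url VARCHAR(300) NOT NULL,
    create_date DATE,
    fav_nums INT(11) DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8;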
Install the MySQL driver mysqlclient:
sudo apt-get install libmysqlclient-dev   # Ubuntu
sudo yum install python-devel mysql-devel   # CentOS
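On Windows (where these notes are being run), installing the driver itself is usually just pip; if that fails you need the MySQL client headers, which is what the apt/yum lines above provide on Linux:
pip install mysqlclient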
import MySQLdb
# synchronous version: the inserts can't keep up with the parsing speed


class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('host', 'user', 'password', 'dbname', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):  # process_item is the method every pipeline must implement
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            values (%s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))
        self.conn.commit()
        return item
The fix is an asynchronous insert. Write the database connection settings into settings.py:
MYSQL_HOST=""
MYSQL_DBNAME=""
MYSQL_USER=""
MYSQL_PASSWORD=""
from twisted.enterprise import adbapi
import MySQLdb.cursors
# asynchronous version


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):  # cls is the class itself; scrapy calls this with the settings object
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )  # the keyword names are fixed: they must match the parameters of MySQLdb's connect
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # let twisted run the insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        # log any error
        query.addErrback(self.handle_error)

    def handle_error(self, failure):  # item and spider could be passed through here as well
        # handle exceptions from the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            values (%s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))
scrapy-djangoitem can simplify all of the above (sketched right below).
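A minimal sketch of scrapy-djangoitem, assuming you already have a Django project; the app and model names below are made up for illustration:
from scrapy_djangoitem import DjangoItem
from article.models import Article   # hypothetical Django model


class ArticleDjangoItem(DjangoItem):
    django_model = Article   # one scrapy Field is generated per model field
A pipeline can then call item.save() to persist the row through the Django ORM instead of hand-written SQL.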
ItemLoader overview and usage: it simplifies the extraction code and lets processors be attached to individual item fields. Modify jobbole.py:
from scrapy.loader import ItemLoader
from ArticleSpider.items import ArticleItemLoader
# inside parse_detail, the manual extraction above can be commented out and replaced with an ItemLoader
front_image_url = response.meta.get("front_image_url", "")
# item_loader = ItemLoader(item=JobBoleArticleItem(), response=response)
item_loader = ArticleItemLoader(item=JobBoleArticleItem(), response=response)
# the custom ArticleItemLoader (defined below) makes every field keep only the first value
item_loader.add_css("title", ".entry-header h1::text")
# item_loader.add_xpath() works the same way for xpath rules
item_loader.add_value("url", response.url)
item_loader.add_value("url_object_id", get_md5(response.url))
item_loader.add_value("front_image_url", [front_image_url])
...
article_item = item_loader.load_item()  # build the item from the collected rules
yield article_item
Next, items.py needs matching changes:
from scrapy.loader.processors import MapCompose, TakeFirst, Join
import datetime
import re
from scrapy.loader import ItemLoader


def add_jobbole(value):
    return value + "-jobbole"


def date_convert(value):
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    return create_date


# shared by all the *_nums fields
def get_nums(value):
    math_re = re.match(".*?(\d+).*", value)
    if math_re:
        nums = int(math_re.group(1))
    else:
        nums = 0
    return nums


def remove_comment_tags(value):
    # drop the "N 评论" entry from the tag list
    if "评论" in value:
        return ""
    else:
        return value


def return_value(value):
    return value


class ArticleItemLoader(ItemLoader):
    # custom loader: every field keeps only the first extracted value
    default_output_processor = TakeFirst()


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(add_jobbole)
        # input_processor=MapCompose(lambda x: x + "-jobbole", add_jobbole)  # processors can be chained
    )
    create_date = scrapy.Field(
        input_processor=MapCompose(date_convert)
        # output_processor=TakeFirst()  # no longer needed per field; the custom loader sets the default
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tags),
        output_processor=Join(",")
    )
    front_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)   # keep the list as-is instead of TakeFirst
    )
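To see what these processors actually do, you can call them directly (for example in a Python shell after importing get_nums and the processors from items.py); the values in the comments are what I'd expect from the standard MapCompose/TakeFirst/Join behavior:
print(MapCompose(get_nums)(["2 收藏", " 3 评论"]))  # [2, 3] - MapCompose runs the function on every value in the list
print(TakeFirst()(["", None, "first", "second"]))   # 'first' - skips empty values and keeps the first real one
print(Join(",")(["职场", "面试"]))                   # '职场,面试'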
After thinking it over, I went back to experiment on jobbole again and try to break through its protection: install PyExecJS, then
import execjs
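A minimal sketch of running JavaScript from Python with PyExecJS; the JS snippet here is just a placeholder, not the actual YunSuo challenge code:
import execjs

js_source = "function add(a, b) { return a + b; }"   # placeholder script
ctx = execjs.compile(js_source)
print(ctx.call("add", 1, 2))                   # 3
print(execjs.eval("'scrapy'.toUpperCase()"))   # 'SCRAPY'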