1. Using parse.urljoin(base, url)
from urllib import parse
from scrapy.http import Request

yield Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)
This resolves a relative url (e.g. /111954/) against the scheme and domain extracted from response.url. If url already carries its own domain, the base from response.url is ignored and url is used as-is.
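Both cases at a glance (the URLs are made up for illustration):

from urllib import parse

# Relative path: resolved against the base taken from response.url
parse.urljoin("http://blog.jobbole.com/all-posts/", "/111954/")
# -> 'http://blog.jobbole.com/111954/'

# Absolute URL: the base is ignored
parse.urljoin("http://blog.jobbole.com/all-posts/", "http://example.com/1/")
# -> 'http://example.com/1/'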
2. Attaching data to a Request with meta
from scrapy.http import Request
yield Request(url=parse.urljoin(response.url, article_url),
              meta={"image_url": image_url},
              callback=self.parse_detail)
Extracting the data in the callback:
front_image_url = response.meta.get("image_url", "")
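Putting both halves together, a minimal sketch (the spider name, start URL, and CSS selectors are illustrative assumptions):

import scrapy
from urllib import parse
from scrapy.http import Request

class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = ["http://blog.jobbole.com/all-posts/"]

    def parse(self, response):
        for post in response.css(".post"):
            article_url = post.css("a::attr(href)").extract_first("")
            image_url = post.css("img::attr(src)").extract_first("")
            # Bind the list-page image URL to the detail-page request
            yield Request(url=parse.urljoin(response.url, article_url),
                          meta={"image_url": image_url},
                          callback=self.parse_detail)

    def parse_detail(self, response):
        # The bound value comes back on response.meta
        front_image_url = response.meta.get("image_url", "")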
3. Configuring automatic image downloads
Basic configuration in settings.py:
ITEM_PIPELINES = {
    'ArtSpider.pipelines.ArtspiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Tell Scrapy which item field holds the image URLs
IMAGES_URLS_FIELD = "front_image_url"
# Set the directory where downloaded images are stored
import os
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
Special note: when assigning the image URL field (front_image_url) on the item, the value must be a list.
article_item['front_image_url'] = [front_image_url]
This is because scrapy.pipelines.images.ImagesPipeline iterates over the value:
def get_media_requests(self, item, info):
    return [Request(x) for x in item.get(self.images_urls_field, [])]
A custom pipeline that captures each image's storage path and file name, so they can be tied back to the corresponding item:
from scrapy.pipelines.images import ImagesPipeline

class ArticleImagePipeline(ImagesPipeline):
    """
    Custom pipeline: inherits scrapy.pipelines.images.ImagesPipeline
    and overrides item_completed.
    """
    def item_completed(self, results, item, info):
        for ok, value in results:
            if ok:  # only successful downloads carry a 'path'
                item['front_image_path'] = value['path']
        # Return the item so later pipelines can keep processing it
        return item
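For reference, each entry in results is a (success, info) tuple; a successful download looks roughly like this (the path, relative to IMAGES_STORE, is derived from a hash of the URL):

(True, {'url': 'http://example.com/1.jpg',
        'path': 'full/0a79c4....jpg',
        'checksum': '...'})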
Then switch settings.py over to the custom pipeline:
ITEM_PIPELINES = {
    'ArtSpider.pipelines.ArtspiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'ArtSpider.pipelines.ArticleImagePipeline': 1,
}
4. Upserting data (MySQL-specific syntax; not applicable to other databases)
insert_sql = """
insert into jobbole_article(title,url,create_date,fav_nums,url_object_id,front_image_url,
front_image_path,praise_nums,comment_nums,tags,content)
values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
on duplicate key update fav_nums=values(fav_nums),comment_nums=values(comment_nums);
"""
On insert, if the primary key (or a unique key) already exists, the row is treated as a duplicate and the listed columns are updated; otherwise a new row is inserted.
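A minimal sketch of running the upsert from a pipeline, assuming pymysql and hypothetical connection parameters; it reuses the insert_sql defined above:

import pymysql

class MysqlPipeline(object):
    def __init__(self):
        # Hypothetical connection parameters; adjust to your environment
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="root", db="article_spider",
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        params = (item["title"], item["url"], item["create_date"],
                  item["fav_nums"], item["url_object_id"],
                  ",".join(item["front_image_url"]),  # a list, see tip 3
                  item["front_image_path"], item["praise_nums"],
                  item["comment_nums"], item["tags"], item["content"])
        self.cursor.execute(insert_sql, params)
        self.conn.commit()
        return item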
5. A small crawler utility: the copyheaders module turns raw copied request headers into a dict, with no need to add quotes around each line by hand
pip install copyheaders
from copyheaders import headers_raw_to_dict
post_headers_raw = b"""
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2
Connection:keep-alive
Host:www.zhihu.com
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
Referer:http://www.zhihu.com/
"""
headers = headers_raw_to_dict(post_headers_raw)
Project page: https://github.com/jin10086/copyheaders
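The resulting dict can be passed straight to a request, e.g. inside a spider callback:

yield Request(url="https://www.zhihu.com/", headers=headers, callback=self.parse)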
6. Removing HTML tags from extracted content
# remove_tags strips the tags from extracted HTML
import scrapy
from scrapy.loader.processors import MapCompose
from w3lib.html import remove_tags

job_desc = scrapy.Field(
    input_processor=MapCompose(remove_tags)
)
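remove_tags on its own, for a quick illustration:

from w3lib.html import remove_tags

print(remove_tags("<p>Requirements: <b>3+ years</b> of Python</p>"))
# -> Requirements: 3+ years of Python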
7. Custom per-spider settings
In the spider class:
custom_settings = {
    "COOKIES_ENABLED": True,
}
Scrapy merges these at 'spider' priority via the built-in update_settings class method (from Scrapy's source):

@classmethod
def update_settings(cls, settings):
    settings.setdict(cls.custom_settings or {}, priority='spider')
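A minimal sketch of the override in context (spider name, URL, and settings values are illustrative):

import scrapy

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    start_urls = ["https://www.zhihu.com/"]

    # Overrides the project-wide settings.py values, for this spider only
    custom_settings = {
        "COOKIES_ENABLED": True,
        "DOWNLOAD_DELAY": 1.0,
    }

    def parse(self, response):
        # The merged value is visible on self.settings at runtime
        self.logger.info("cookies enabled: %s",
                         self.settings.getbool("COOKIES_ENABLED"))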