I recently got up to speed on web scraping, and it's been quite useful. Recommended starter material below:
【Video material】:
- Python Scrapy Tutorial - YouTube
- 2018年最新Python3.6网络爬虫实战 (a 2018 hands-on Python 3.6 scraping course)
- 廖xf商业爬虫 (far too much to finish; I only watched a small part, but it really is thorough)
- Of course, if you'd rather not write code at all, that's fine too: look up 八爪鱼 (Octoparse), 后羿采集器, or similar apps; they're convenient and cover simple everyday needs.
【Books】:
- I still recommend the books by 崔庆才 (Cui Qingcai)
【Takeaways】I can now use the Scrapy framework fairly fluently; ordinary static pages are basically no problem, and simple dynamic pages work too (scrapy_splash and Selenium both do the job)
- The standard counter-anti-scraping move is to rotate a few things dynamically:
- headers --> i.e. disguise the client as a browser,
- a cookies pool --> the course bought a batch of weibo accounts for this, which I'm not very familiar with; alternatively, log in with several of your own accounts to collect cookies, then have a middleware pick one at random for each request
- an IP pool --> you can buy proxies, or scrape some public sites (xichi, 站大爷, even free ip_lists on github; if you're lazy, scrapy-proxy-pool (https://github.com/hyan15/scrapy-proxy-pool) also works), etc. ---> To get fancier, add a scheduled crawl, validate the IPs, and store the live ones in a database such as redis, postgres, or MongoDB ...
- user-agent --> to save effort, don't hunt for strings yourself; the ready-made scrapy-user-agents library is great
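For the cookies-pool idea above, here is a minimal sketch of the random-selection middleware. The pool contents and cookie field names are made up; real entries would be cookies saved from your own logged-in sessions:

```python
import random

# Hypothetical cookies pool: in practice, log in once with each of your own
# accounts and save that session's cookies here.
COOKIES_POOL = [
    {"sessionid": "cookie-from-account-1"},
    {"sessionid": "cookie-from-account-2"},
    {"sessionid": "cookie-from-account-3"},
]


class RandomCookiesMiddleware:
    """Downloader middleware: attach a random cookie set to every request."""

    def process_request(self, request, spider):
        # Scrapy merges request.cookies into the outgoing Cookie header
        request.cookies = random.choice(COOKIES_POOL)
        return None  # continue down the middleware chain
```

Enable it in DOWNLOADER_MIDDLEWARES like any other downloader middleware.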
- You'll also touch on extracting page structure and selecting elements, e.g. the xpath and css methods (a CSS selector reference manual helps)
- Get familiar with the usual pieces: settings, pipelines, middlewares, items, and so on; in particular, know how to save scraped data via an Item
- Basic multi-page crawling: how to pass parameters (or a response) between callbacks, how to stay logged in, and how to send a POST request in Scrapy (watch out: a dynamic id parameter in the Form may come from the page source or from a js file), etc.
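On those dynamic Form parameters: the token usually sits in a hidden &lt;input&gt; in the page source, so you can grab it with a selector or even a plain regex. A stdlib sketch (the HTML fragment and field name here are invented):

```python
import re

# Invented login-page fragment containing a dynamic CSRF token
page = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'

# Pull out the value attribute of the hidden token input
match = re.search(r'name="csrf_token"\s+value="([^"]+)"', page)
token = match.group(1) if match else None
print(token)  # -> abc123
```

The same value would then go into the formdata of a POST request, as in the login spider below.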
【Pretend there's code】:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from ..items import DengluItem


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        # open_in_browser(response)  # handy for eyeballing the response
        # the login form's hidden CSRF token is the first input's value
        token = response.css('form input::attr(value)').extract_first()
        return FormRequest.from_response(response, formdata={
            "csrf_token": token,
            "username": 'xxxx',
            "password": 'xxxx'
        }, callback=self.parsing_2)

    def parsing_2(self, response):
        # logged in now; yield one item per quote (a fresh item each time,
        # otherwise the same mutated item gets yielded repeatedly)
        for quote in response.css('div.quote'):
            item = DengluItem()
            item['text'] = quote.css('span.text::text').extract()
            yield item
【What's next】:
All in all I'm still not that fluent; I'll dig deeper as real work comes up, so scraping is on pause here for now. It's late, and for the sake of my hair I need to rest (¦3[▓▓] Good night
[Update, 10 days later]
# -*- coding: utf-8 -*-
import json
import scrapy
from bs4 import BeautifulSoup

from ooko import create_catalog
from ..items import Temp2Item, CataItem, S_dict_Item


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    start_urls = ['http://drugs.medlive.cn/drugref/drugCate.do?treeCode=H0102']

    def start_requests(self):
        # mapping of big category -> {sub-category name: sub-category url}
        data = {
            XXXXXX
        }
        for b_name, b_value in data.items():
            if b_value:  # skip big categories with no sub-categories
                for s_name, s_url in b_value.items():
                    s_item = S_dict_Item()
                    s_item["s_name"] = s_name
                    s_item["s_url"] = s_url
                    yield scrapy.Request(url=s_url,
                                         meta={'download_timeout': 2, "s_item": s_item},
                                         callback=self.parse, dont_filter=True)

    def parse(self, response):
        s_item = response.meta["s_item"]
        cata_name = s_item["s_name"]
        # create the output directory for this category
        cata = CataItem()
        path = create_catalog(cata_name, response.url)
        cata["cata_name"] = cata_name
        cata["path"] = path + "/"
        # follow every drug-detail link on this listing page
        one_link = response.css(".medince-name a::attr(href)").extract()
        for detail_url in one_link:
            detail_url_new = "http://drugs.medlive.cn/" + detail_url
            yield scrapy.Request(url=detail_url_new, callback=self.parse_detail,
                                 meta={'cata': cata})
        # follow the "next page" link if there is one (extract_first may
        # return None on the last page, so check before concatenating)
        next_href = response.css('.other+ .grey a::attr(href)').extract_first()
        if next_href:
            next_url = "http://drugs.medlive.cn/" + next_href
            yield scrapy.Request(url=next_url, callback=self.parse,
                                 meta={"s_item": s_item}, dont_filter=True)

    def parse_detail(self, response):
        save_path = response.meta['cata']['path']
        soup = BeautifulSoup(response.text, "lxml")
        soup_1 = soup.prettify()
        title_name = response.css('.clearfix+ div p:nth-child(1)::text').extract_first()
        # strip all whitespace (tabs, newlines, spaces) from the title
        title_name = "".join(title_name.split())
        title = response.url.split('/')[-1]
        with open(save_path + title_name + "_" + title, 'w', encoding='utf-8') as f:
            # store the prettified page HTML as a JSON string
            json.dump(soup_1, f, ensure_ascii=False)
import random


def random_select():
    # pick a random proxy from the purchased list and prepend the scheme
    with open("xxx/proxy_txt", 'r') as f:
        data = f.readlines()
    return "http://" + random.choice(data).strip()


def random_free_select():
    # same idea for the free-proxy list; these lines already include the scheme
    with open("xxxx/proxy_txt", 'r') as f:
        data = f.readlines()
    return random.choice(data).strip()
import logging


class PorxyMiddelware(object):  # (sic) name kept to match settings.py
    logger = logging.getLogger(__name__)

    def process_request(self, request, spider):
        proxy_id = random_select()
        self.logger.debug("Using Proxy...{}".format(proxy_id))
        request.meta['proxy'] = proxy_id

    def process_exception(self, request, exception, spider):
        # on failure, swap in a fresh proxy and reschedule the request
        self.logger.debug("Get Exception")
        request.meta['proxy'] = random_select()
        return request
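A quick self-contained check of the pick-a-random-line idea the two helpers rely on (the proxy addresses here are invented):

```python
import random
import tempfile

# Write a throwaway proxy list, one host:port per line, then pick one at
# random the same way random_select() reads its proxy_txt file.
with tempfile.NamedTemporaryFile("w+", suffix=".txt", delete=False) as f:
    f.write("1.2.3.4:8080\n5.6.7.8:3128\n")
    path = f.name

with open(path, "r") as f:
    data = f.readlines()
proxy = "http://" + random.choice(data).strip()
print(proxy)
```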
import scrapy


class Temp2Item(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()


class CataItem(scrapy.Item):
    cata_name = scrapy.Field()
    path = scrapy.Field()


class S_dict_Item(scrapy.Item):
    s_name = scrapy.Field()
    s_url = scrapy.Field()
# lower numbers run closer to the engine; None disables a built-in middleware
DOWNLOADER_MIDDLEWARES = {
    'temp2.middlewares.PorxyMiddelware': 543,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}