This article is for learning purposes only (learning purposes only, really!), not for any other use. Pointers and suggestions from more experienced readers are welcome 🌹🌹🌹
Step 1: Create the Scrapy project
scrapy startproject xxx
cd xxx
scrapy genspider xx xx.com
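These commands generate the standard Scrapy skeleton (shown here with the placeholder project name xxx and spider name xx):

xxx/
    scrapy.cfg
    xxx/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xx.py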
Step 2: Define the items
In items.py, define the fields you want to store, according to your needs:
class StandardItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()           # document title
    index = scrapy.Field()           # index number
    classification = scrapy.Field()  # classification
    ....
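Items behave like dicts once instantiated, which is how the spider in Step 4 fills them in; a quick illustration (the value is just a placeholder):

item = StandardItem()
item['title'] = 'some title'  # set a declared field
print(dict(item))             # {'title': 'some title'}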
Step 3: Update the settings
Some basics in settings.py; change them as needed (or, if any expert has advice for a beginner like me, please share):
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'Standard.middlewares.SeleniumMiddle': None,
}
# I wrote a Selenium middleware, but later found that requesting the data API
# directly also works for the dynamic content, so None disables it here.
ITEM_PIPELINES = {
    'Standard.pipelines.StandardPipeline': 300,
}
# Be sure to uncomment the pipeline, or nothing gets stored!
AUTOTHROTTLE_ENABLED = True  # required, otherwise the settings below have no effect
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
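The middleware itself isn't shown in this post; purely for reference, here is a minimal sketch of what Standard.middlewares.SeleniumMiddle might look like. Only the class name comes from the settings entry above; the headless option and the fixed 2-second wait are my assumptions, not the original code:

from time import sleep

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddle:
    def __init__(self):
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')  # assumption: no visible browser window
        self.driver = webdriver.Chrome(options=option)

    def process_request(self, request, spider):
        # Let Chrome render the page so JavaScript-built content is in the HTML
        self.driver.get(request.url)
        sleep(2)  # crude fixed wait; an explicit WebDriverWait would be more robust
        # Returning a Response here short-circuits Scrapy's own downloader
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)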
Step 4: Write the spider
The rough idea is as follows:
import json
import re
from time import sleep

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

from Standard.items import StandardItem


class xxxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['xx.xxx.xxxx.cn']
    start_urls = ['......']

    def __init__(self):
        option = webdriver.ChromeOptions()
        # Hide the usual automation fingerprints from the site
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_argument('--disable-blink-features=AutomationControlled')
        self.driver = webdriver.Chrome(service=Service('/Users/...../chromedriver'), options=option)
        self.url_list = []

    def parse(self, response):
        # Selenium is only used here to read the total count off the rendered pager
        startUrls = ["........."]
        self.driver.get(startUrls[0])
        sleep(2)  # wait for the pager to render
        pages = self.driver.find_elements(By.XPATH, '//div[@class="table-content-wrap"]/div[@class="pagination"]/a')
        pages = pages[-3].text  # third-from-last pager link holds the total record count
        page_num = (int(pages) // 5) + 1  # 5 records per page
        for i in range(1, page_num + 1):
            url = f'http://........?page={i}'
            yield scrapy.Request(url=url, callback=self.get_data)

    def get_data(self, response):
        # The paged endpoint returns JSON; collect each article's detail URL
        datas = json.loads(response.text)
        items = datas.get('articles')
        for each in items:
            item = StandardItem()
            url = each.get('url')
            item['url'] = url
            yield scrapy.Request(url=url, callback=self.get_datas, meta={'item': item})

    def get_datas(self, response):
        # Pull the fields off the detail page with regular expressions
        item = response.meta['item']
        item['title'] = re.findall(r'<h1 class="title document-number">(.*?)</h1>', response.text, re.S)[0]
        item['index'] = re.findall(
            r'<td class="first">\s+索引号:\s+</td>\s+<td class="td-value-xl">\s+<span title=".*?">(.*?)</span>\s+</td>',
            response.text, re.S)[0]
        try:
            item['classification'] = re.findall(
                r'<td class="second">\s+分类:\s+</td>\s+<td class="td-value">\s+<span>(.*?)</span>\s+</td>',
                response.text, re.S)[0]
        except IndexError:  # the field is missing on some pages
            item['classification'] = ''
        .........

    def closed(self, reason):
        # Release the browser once the spider finishes
        self.driver.quit()
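With the spider in place, run it by the name declared in the class (here 'xx'); the -o flag and file name are just an example for also dumping items to a feed file:

scrapy crawl xx
scrapy crawl xx -o result.json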
Step 5: Save the data
Write your storage logic in pipelines.py and you're done (I ended up with 266 records, which matches the 266 shown on the site ✅).
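The pipeline itself isn't shown here either; as one possible storage method, here is a minimal sketch of Standard.pipelines.StandardPipeline that writes each item as one line of JSON. The class name comes from the settings above; the file name standard.jsonl is my assumption, so swap in your own storage (a database, CSV, etc.):

import json

class StandardPipeline:
    def open_spider(self, spider):
        self.file = open('standard.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese field values readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()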