1. Getting to know Scrapy
2. Creating the project
Create the project with Anaconda (or any Python environment that has Scrapy installed).
spiders/: folder that holds the spider scripts
items.py: declares the fields to be scraped
pipelines.py: item pipelines for processing scraped data
settings.py: project settings
Open the project in PyCharm.
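The project skeleton described above is generated by Scrapy's command-line tool; a minimal sketch, assuming Scrapy is installed, using the project and spider names that appear later in these notes:

```shell
# scaffold a new project (creates spiders/, items.py, pipelines.py, settings.py)
scrapy startproject tipdmScrapy1
cd tipdmScrapy1
# generate a spider skeleton named spider_title restricted to www.tipdm.com
scrapy genspider spider_title www.tipdm.com
```

genspider is optional — you can create the spider file by hand, as the next section notes — but it saves writing the boilerplate.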
3. Specifying fields and creating the spider
If you create the Python file yourself instead of generating it, you have to write all the boilerplate code yourself.
4. Writing the spider
import scrapy
from tipdmScrapy1.items import Tipdmscrapy1Item


class SpiderTitleSpider(scrapy.Spider):
    name = 'spider_title'
    allowed_domains = ['www.tipdm.com']  # restrict crawling to this domain
    start_urls = ['http://tipdm.com/gsxw/index.jhtml']  # initial URLs to crawl

    def parse(self, response):
        item = Tipdmscrapy1Item()  # instantiate the item (note the parentheses)
        # extract the news titles from the target page
        titles = [each.extract() for each in response.xpath('//*[@id="t251"]/div/div[3]/h1/a/text()')]
        item['titles'] = titles  # store the parsed titles under the 'titles' field
        return item  # hand the item off to the pipeline
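The list comprehension in parse() is plain XPath evaluation: each text() node matched by the expression becomes one string in the list. A standalone sketch using lxml (not Scrapy's selector) on a made-up HTML fragment shaped like the target listing page:

```python
from lxml import html

# hypothetical fragment mimicking the structure the XPath above targets
page = html.fromstring("""
<div id="t251">
  <div><div></div><div></div><div><h1><a>News A</a></h1></div></div>
  <div><div></div><div></div><div><h1><a>News B</a></h1></div></div>
</div>
""")

# same idea as response.xpath('...a/text()').extract():
# text() nodes come back as a list of strings
titles = page.xpath('//*[@id="t251"]/div/div[3]/h1/a/text()')
print(titles)  # ['News A', 'News B']
```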
5. Running the program and saving the data
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pandas as pd


class Tipdmscrapy1Pipeline:
    def process_item(self, item, spider):
        data = pd.DataFrame(dict(item))
        data.to_csv("title_pipelines.csv", encoding="utf-8-sig", index=False)
        return item  # pipelines should return the item for any later pipelines
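The pipeline's core logic can be exercised outside of Scrapy; a minimal sketch, assuming pandas is installed, using a fake item dict in place of a real scraped item:

```python
import pandas as pd

# stand-in for the scraped item: a dict with one list-valued field
fake_item = {"titles": ["News A", "News B", "News C"]}

# same steps as process_item(): dict -> DataFrame -> CSV
data = pd.DataFrame(fake_item)
data.to_csv("title_pipelines_demo.csv", encoding="utf-8-sig", index=False)

# read it back to confirm the round trip
restored = pd.read_csv("title_pipelines_demo.csv")
print(restored["titles"].tolist())  # ['News A', 'News B', 'News C']
```

The utf-8-sig encoding adds a BOM so the CSV opens correctly in Excel on Windows.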
Then uncomment the ITEM_PIPELINES setting in settings.py so the pipeline is enabled.
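The relevant block in settings.py is generated commented out; once uncommented it looks like this (the key is the dotted path to the pipeline class, the value its priority — lower numbers run first):

```python
# settings.py
ITEM_PIPELINES = {
    'tipdmScrapy1.pipelines.Tipdmscrapy1Pipeline': 300,
}
```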