A Scrapy Project Case Study

lzc_007

Category: Crawlers. Tags: python

Article link: https://blog.csdn.net/qq_23376253/article/details/104792151

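1. Background

1.1 Use case

Crawl a dynamic page, collect the images it points to, save them to local disk, and record them in a database.

1.2 Technologies used

scrapy-splash: simulates mouse events to trigger the dynamic parts of the page
pymysql: reads from and writes to the database
ImagesPipeline: Scrapy's built-in image download class

1.3 Prerequisites

Setting up a scrapy-splash environment: https://www.cnblogs.com/jclian91/p/8590617.html
Basic pymysql usage: https://www.cnblogs.com/gaidy/p/10531227.html
Using Scrapy's image and file download classes: https://www.jianshu.com/p/a412c0277f8a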
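
For orientation, the project settings that scrapy-splash expects look roughly like the sketch below. The middleware names and priorities follow the scrapy-splash README; the Splash address is an assumption (a local Splash instance on port 8050) and should be changed to match your own setup.

```python
# settings.py (sketch): wiring scrapy-splash into a Scrapy project.
# Assumption: a Splash instance is reachable locally on port 8050.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes request de-duplication aware of the arguments sent to Splash.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```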

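2. Core code walkthrough

2.1 Requesting the dynamic page with scrapy-splash

    # Module-level import this snippet relies on:
    #   from scrapy_splash import SplashRequest

    # The script can be tried out on the scrapy-splash test page before it is used here.
    # Build a Lua script that simulates a click on the element with the given id.
    def set_scripy_js(self, node_name):
        _scripy = '''
function main(splash, args)
    local url = args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    local form = splash:select('#{0}')
    form:click()
    splash:wait(0.1)
    return splash:html()
end
'''.format(node_name)
        return _scripy

    # Start the crawl by rendering the page through Splash.
    def start_requests(self):
        node = '_easyui_tree_1'
        script = self.set_scripy_js(node)
        self.logger.debug(script)
        yield SplashRequest(self.base_url, callback=self.parse, endpoint='execute',
                            args={'lua_source': script, 'wait': 5})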
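
As a variation on the approach above, the selector can be passed to Splash as a request argument instead of being formatted into the Lua source, so the script text stays constant. This is a minimal sketch, not the original project's code: `build_splash_args` and the `selector` argument name are made up for illustration, and it relies on Splash exposing request arguments to the script through `args`.

```python
# Constant Lua script: the element to click arrives as args.selector.
LUA_CLICK_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    local node = assert(splash:select(args.selector))
    node:click()
    splash:wait(0.1)
    return splash:html()
end
"""


def build_splash_args(node_id, wait=0.5):
    """Return the args dict for a SplashRequest that clicks the given node id.

    Hypothetical helper: the 'selector' key is our own argument name,
    read back in the Lua script as args.selector.
    """
    return {
        "lua_source": LUA_CLICK_SCRIPT,
        "selector": "#" + node_id,
        "wait": wait,
    }
```

In the spider this dict would be passed as `args=build_splash_args('_easyui_tree_1')` together with `endpoint='execute'`.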

2.2 Parsing the rendered page and building GET requests for the server data

    # Module-level import this snippet relies on:
    #   from scrapy import Request
    # self.get_url() is a helper of this spider that is not shown in this article.

    # The catalog has three levels. Walk them in page order and request the data for every leaf.
    def parse(self, response):
        # To inspect the rendered page, dump response.body to a local file here.
        # Label of the "surface monthly statistics" entry; it is matched against the page text.
        moon_name = '地面月统计'
        one_list = {}
        # Level 1
        for one in response.xpath("//*[@id='accordion']/div"):
            # e.g. //*[@id="accordion"]/div[1]/div[1]/div[1]
            name_1 = one.xpath('./div[1]/div[1]/text()').extract()[0]
            # Level 2
            two_list = {}
            for two in one.xpath('./div[2]/li'):
                name_2 = two.xpath('./div[1]/span/text()').extract()[0]
                two_mark = two.xpath('./div[1]/span[2]/@class').extract()[0]
                id_mark_two = two.xpath('./div[1]/@id').extract()[0]

                # A "tree-file" class marks a leaf; anything else is a folder.
                if 'tree-file' in two_mark:
                    self.logger.debug("%s is a file (id %s): request its image data", name_2, id_mark_two)
                    two_list[name_2] = id_mark_two
                    if name_2 != moon_name:
                        yield Request(url=self.get_url(id_mark_two, ''), callback=self.parse_json_conten)
                    else:
                        self.logger.debug("%s is fetched in a different way", name_2)
                else:
                    self.logger.debug("%s is a folder (id %s): descend into its children", name_2, id_mark_two)
                    # Level 3
                    three_dict = {}
                    for three in two.xpath('./ul/li'):
                        name_3 = three.xpath('./div[1]/span/text()').extract()[0]
                        three_mark = three.xpath('./div[1]/span[3]/@class').extract()[0]
                        id_mark_three = three.xpath('./div[1]/@id').extract()[0]
                        self.logger.debug("name %s  mark %s  id %s", name_3, three_mark, id_mark_three)
                        # Store the id itself (the original wrapped it in a set by mistake).
                        three_dict[name_3] = id_mark_three
                        yield Request(url=self.get_url(id_mark_three, ''), callback=self.parse_json_conten)
                    two_list[name_2] = three_dict
            one_list[name_1] = two_list
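
The two ideas in this step, telling files from folders by CSS class and collecting leaf ids from the nested catalog, can be tried without Scrapy. The sketch below is illustrative only: both helper names are made up, the sample class strings and ids are invented, and it matches `tree-file` as a whole class token, which is slightly stricter than the substring test used above.

```python
def is_file_node(css_class):
    """Return True when the class attribute contains the 'tree-file' token."""
    return "tree-file" in css_class.split()


def collect_file_ids(tree):
    """Flatten a nested {name: id-or-subtree} dict into a list of leaf ids."""
    ids = []
    for value in tree.values():
        if isinstance(value, dict):
            # A dict is a folder: descend into it.
            ids.extend(collect_file_ids(value))
        else:
            # Anything else is the id of a file node.
            ids.append(value)
    return ids


# Invented sample shaped like one_list in the spider above.
sample_tree = {
    "level-1": {
        "file-a": "_easyui_tree_2",
        "folder-b": {"file-c": "_easyui_tree_4", "file-d": "_easyui_tree_5"},
    }
}
```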

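2.3 Parsing the JSON returned by the GET request and sending items to the pipelines

    # Module-level imports this snippet relies on:
    #   import json
    #   from scrapy.utils.project import get_project_settings
    # WeatherdataItem is the project's Item class (defined in items.py, not shown here).

    def parse_json_conten(self, response):
        # response.text replaces body_as_unicode(), which newer Scrapy releases no longer provide.
        content = json.loads(response.text)
        date = content['maxdate']
        images_store = get_project_settings().get('IMAGES_STORE')
        for entry in content['data']:
            item = WeatherdataItem()
            item['time'] = entry['v_SHIJIAN']
            item['name'] = entry['c_FNAME']
            item['url'] = entry['fileURL']
            item['code'] = entry['dataCode']
            item['id'] = entry['id']
            item['date'] = date
            # Target directory for the image: <IMAGES_STORE><code>/<date>/
            item['path'] = images_store + item['code'] + '/' + item['date'] + '/'
            yield item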
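
The mapping from the JSON payload to item fields can be checked on its own with the standard library. The sketch below is an assumption-laden stand-in: `records_from_payload` is a made-up helper that returns plain dicts instead of Scrapy items, and the sample payload is invented, using only the key names that appear in the callback above.

```python
import json


def records_from_payload(body, images_store):
    """Turn the JSON text into a list of dicts with the same fields as the item."""
    content = json.loads(body)
    date = content["maxdate"]
    records = []
    for entry in content["data"]:
        records.append({
            "time": entry["v_SHIJIAN"],
            "name": entry["c_FNAME"],
            "url": entry["fileURL"],
            "code": entry["dataCode"],
            "id": entry["id"],
            "date": date,
            # Same layout as the spider: <IMAGES_STORE><code>/<date>/
            "path": images_store + entry["dataCode"] + "/" + date + "/",
        })
    return records


# Invented payload for illustration only.
sample_body = json.dumps({
    "maxdate": "20200305",
    "data": [{
        "v_SHIJIAN": "20200305000000",
        "c_FNAME": "min_temp.png",
        "fileURL": "http://example.com/min_temp.png",
        "dataCode": "SURF_TL1",
        "id": 1,
    }],
})
```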

2.4 Database operations in the pipeline

    # Module-level imports this snippet relies on:
    #   import pymysql
    #   from scrapy.exceptions import DropItem
    #   from scrapy.utils.project import get_project_settings

    def __init__(self):
        self.ids_seen = set()
        settings = get_project_settings()
        self.conn = pymysql.connect(
            host=settings.get('MYSQL_HOST'),
            user=settings.get('MYSQL_USER'),
            password=settings.get('MYSQL_PASSWD'),
            database=settings.get('MYSQL_DBNAME'),
            charset='utf8',
            port=3306,
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # For now, de-duplicate by id.
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        self.insert_sql(item)
        return item

    def insert_sql(self, item):
        # formatDate is a helper of this pipeline class that is not shown in this article.
        time_info = self.formatDate(str(item['time']), 'YYYY-MM-DD HH:NN:SS')
        # Let the driver quote the values instead of formatting them into the SQL text.
        sql_conten = (
            'INSERT INTO weather_data(file_name, file_url, file_code, file_time) '
            'VALUES (%s, %s, %s, %s)'
        )
        # Example of an earlier hand-written statement:
        # INSERT INTO weather_surface_obser(code,name,url,time,type) VALUES ('SURF_TL1','20200305000000_953466a4-6bfa-4fef-b0c1-48f2d49bbcbc_surf_TEM_Min.txt_0.png.png','http://image.data.cma.cn/vis/SURF_TL1/20200305/20200305000000_953466a4-6bfa-4fef-b0c1-48f2d49bbcbc_surf_TEM_Min.txt_0.png.png',11231231,'小时最低气温');
        self.cursor.execute(sql_conten, (item['name'], item['url'], item['code'], time_info))
        self.conn.commit()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
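
The de-duplication and insert logic can be exercised without a MySQL server. In the sketch below the standard-library `sqlite3` module stands in for pymysql, so it is not the pipeline itself: `DedupWriter` is a made-up class, the column types are assumptions, and sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3


class DedupWriter:
    """Insert each item once, keyed by item['id'], using bound parameters."""

    def __init__(self, conn):
        self.conn = conn
        self.ids_seen = set()
        # Assumed schema: same column names as weather_data above, all TEXT.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS weather_data "
            "(file_name TEXT, file_url TEXT, file_code TEXT, file_time TEXT)"
        )

    def write(self, item):
        """Return True if the row was inserted, False if the id was a duplicate."""
        if item["id"] in self.ids_seen:
            return False
        self.ids_seen.add(item["id"])
        # Bound parameters keep quotes in the data from breaking the statement.
        self.conn.execute(
            "INSERT INTO weather_data(file_name, file_url, file_code, file_time) "
            "VALUES (?, ?, ?, ?)",
            (item["name"], item["url"], item["code"], item["time"]),
        )
        self.conn.commit()
        return True
```

A file name that contains a double quote would break a statement built with string formatting; with bound parameters it is stored unchanged.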

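2.5 The image download pipeline

A few points to keep in mind when using ImagesPipeline and FilesPipeline:
(1) The names of the image and file download settings in settings.py are fixed:
Images: IMAGES_STORE = 'image download path'
Files: FILES_STORE = 'file download path'
(2) The item fields also have to follow certain conventions; see the link in section 1.3.
(3) Remember to enable the pipelines in settings.py.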
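
Points (1) and (3) amount to a few lines in settings.py. The sketch below is illustrative only: the module path `weatherdata.pipelines`, the class name `WeatherdataPipeline`, and the storage directory are assumptions, since the article does not show the project layout.

```python
# settings.py (sketch): storage location and pipeline switches.
# Assumption: the trailing slash matters when paths are built by string concatenation.
IMAGES_STORE = "D:/weather_images/"

# Lower numbers run first; here the image pipeline runs before the database pipeline.
# Both dotted paths below are hypothetical.
ITEM_PIPELINES = {
    "weatherdata.pipelines.MyImagesPipeline": 1,
    "weatherdata.pipelines.WeatherdataPipeline": 300,
}
```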

import os
import shutil

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


# Subclassing ImagesPipeline and overriding its methods covers both downloading
# and renaming the images. Other hooks are available as well.
class MyImagesPipeline(ImagesPipeline):
    img_store = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        yield scrapy.Request(item['url'])

    def item_completed(self, results, item, info):
        # The image has already been saved under IMAGES_STORE/full/;
        # move it to item['path'] and name it after the original file.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Image download failed: %s" % item['url'])
        target_dir = item['path']
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, item['name'])
        shutil.move(os.path.join(self.img_store, image_paths[0]), target)
        # Record where the file actually ended up.
        item['img_path'] = target
        return item
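
The move-and-rename step is plain file handling and can be tried outside Scrapy. The sketch below assumes the download landed at a path relative to the storage root, as `full/<name>` paths from ImagesPipeline do; `move_download` is a made-up helper name.

```python
import os
import shutil


def move_download(images_store, relative_path, target_dir, target_name):
    """Move a downloaded file into target_dir under target_name.

    relative_path is relative to images_store (for example 'full/abc.png').
    Returns the destination path.
    """
    source = os.path.join(images_store, relative_path)
    # Create the destination directory if it does not exist yet.
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, target_name)
    shutil.move(source, destination)
    return destination
```

Building paths with `os.path.join` avoids hard-coded backslashes, so the same code works on Windows and on POSIX systems.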

[Figure: image download]

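2.6 Using FilesPipeline

from scrapy import Request
from scrapy.pipelines.files import FilesPipeline


class MyFilePlipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # In this project item['file_urls'] holds a single URL.
        # The desired file name travels with the request in meta.
        yield Request(item['file_urls'], meta={'name': item['file_name']})

    # Rename the file: store it under the name carried in request.meta.
    # The keyword-only item parameter keeps the signature compatible with newer Scrapy releases.
    def file_path(self, request, response=None, info=None, *, item=None):
        return str(request.meta['name'])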
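
Because the stored file name comes straight from crawled data, it may be worth reducing it to a bare file name before returning it from `file_path`. This is an optional hardening sketch rather than part of the original project; `safe_file_name` and its fallback value are made up for illustration.

```python
import os


def safe_file_name(name, fallback="unnamed"):
    """Strip directory parts from a crawled name so it cannot escape the store."""
    # Treat backslashes as separators too, then keep only the last component.
    base = os.path.basename(name.replace("\\", "/"))
    if base in ("", ".", ".."):
        return fallback
    return base
```

Inside the pipeline this would wrap the value read from `request.meta['name']`.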

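3. Closing remarks

This was my first time working with the Scrapy framework, so some of the ideas here may be immature. Corrections and suggestions are welcome!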
