Scrapy: Handling Chinese Text in HTML, XML, and CSV

This article shows how to use the Scrapy framework to process HTML pages, XML data sources, and CSV files that contain Chinese text, walking through the basic Spider, XMLFeedSpider, and CSVFeedSpider.

Web pages

# Create a project
$ scrapy startproject mypjt
# Create a spider file named xxx from the basic template
$ scrapy genspider -t basic xxx sina.com.cn
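For reference, startproject generates a layout roughly like the following (the exact file set varies slightly across Scrapy versions); item classes go in items.py and genspider puts spider files under spiders/:

mypjt/
    scrapy.cfg        # deploy configuration
    mypjt/
        __init__.py
        items.py      # Item definitions
        pipelines.py
        settings.py
        spiders/      # spider files
            __init__.py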
HTML format

import scrapy

class CaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    urlname = scrapy.Field()
    urlkey = scrapy.Field()
    urlcr = scrapy.Field()
    urladd = scrapy.Field()

# The URL to crawl can be supplied from the command line via -a myurl=...
import scrapy
from mypjt.items import CaoItem

class AbcSpider(scrapy.Spider):
    name = 'abc'
    start_urls = [
        'http://python.jobbole.com/',
        'http://blog.csdn.net/',
    ]

    def __init__(self, myurl=None, *args, **kwargs):
        super(AbcSpider, self).__init__(*args, **kwargs)
        print("URL to crawl: %s" % myurl)
        # Override the defaults with the URL passed in from the command line
        self.start_urls = ["%s" % myurl]

    def parse(self, response):
        item = CaoItem()
        # Extract the page title; works for Chinese titles as well
        item['urlname'] = response.xpath('/html/head/title/text()').extract()
        print(item['urlname'])
        return item

$ scrapy crawl abc --nolog -a myurl="http://mp3.baidu.com"
URL to crawl: http://mp3.baidu.com
['百度音乐-听到极致']
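If the items are exported to a feed (for example with -o items.json), Scrapy's JSON exporter escapes non-ASCII characters by default, so a Chinese title like the one above comes out as \uXXXX sequences. A minimal settings.py sketch to keep the output readable (FEED_EXPORT_ENCODING is a standard Scrapy setting):

# settings.py
# Write export feeds as UTF-8 so Chinese text stays readable
FEED_EXPORT_ENCODING = 'utf-8'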
XMLFeedSpider


import scrapy

class MyxmlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    author = scrapy.Field()

from scrapy.spiders import XMLFeedSpider
from mypjt.items import MyxmlItem

class MyxmlspiderSpider(XMLFeedSpider):
    name = 'myxmlspider'
    allowed_domains = ['iqianyue.com']
    start_urls = ['define the URL yourself: an XML/RSS feed that has the fields above']

    # iterator selects the parsing backend; itertag names the node to iterate over
    iterator = 'iternodes'
    itertag = 'rss'

    def parse_node(self, response, node):
        i = MyxmlItem()
        i['title'] = node.xpath('/rss/channel/item/title/text()').extract()
        i['link'] = node.xpath('/rss/channel/item/link/text()').extract()
        i['author'] = node.xpath('/rss/channel/item/author/text()').extract()

        print("Title:")
        print(i['title'])
        print("Link:")
        print(i['link'])
        print('------------')
        return i
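A minimal sketch of the RSS shape the XPath expressions above assume (hypothetical feed content), followed by the command to run the spider:

<rss>
  <channel>
    <item>
      <title>文章标题</title>
      <link>http://example.com/post/1</link>
      <author>作者</author>
    </item>
  </channel>
</rss>

$ scrapy crawl myxmlspider --nolog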
CSVFeedSpider
import scrapy

class MycsvItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    sex = scrapy.Field()

from scrapy.spiders import CSVFeedSpider
from mypjt.items import MycsvItem

class MycsvspiderSpider(CSVFeedSpider):
    name = 'mycsvspider'
    allowed_domains = ['iqianyue.com']
    start_urls = ['define the URL yourself: a comma-separated CSV file with the headers below']
    headers = ['name', 'sex', 'add', 'email']

    # Define the field delimiter
    delimiter = ','

    def parse_row(self, response, row):
        # row is a dict of already-decoded strings, so Chinese values
        # can be used directly (no manual .encode() needed on Python 3)
        i = MycsvItem()
        i['name'] = row['name']
        i['sex'] = row['sex']

        print("Name:")
        print(i['name'])
        print("Sex:")
        print(i['sex'])
        print('------------')
        return i
$ scrapy crawl mycsvspider --nolog
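For reference, a hypothetical CSV source matching the headers defined above would look like this:

name,sex,add,email
小明,男,北京,xiaoming@example.com
小红,女,上海,xiaohong@example.com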