Getting HTML tag text with Scrapy: can Scrapy extract plain text from raw HTML data?

For example: scrapy shell http://scrapy.org/

content = hxs.select('//*[@id="content"]').extract()[0]

print content

Then I get the following raw HTML code:

Welcome to Scrapy

What is Scrapy?

Scrapy is a fast high-level screen scraping and web crawling

framework, used to crawl websites and extract structured data from their

pages. It can be used for a wide range of purposes, from data mining to

monitoring and automated testing.

Features

Simple
Scrapy was designed with simplicity in mind, by providing the features

you need without getting in your way

Productive
Just write the rules to extract the data from web pages and let Scrapy

crawl the entire web site for you

Fast
Scrapy is used in production crawlers to completely scrape more than

500 retailer sites daily, all in one server

Extensible
Scrapy was designed with extensibility in mind and so it provides

several mechanisms to plug new code without having to touch the framework

core

Portable, open-source, 100% Python
Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD
Batteries included
Scrapy comes with lots of functionality built in. Check
<a href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this section</a>
of the documentation for a list of them.
Well-documented & well-tested
Scrapy is extensively documented and has an comprehensive test suite

with very good code

coverage

Healthy community

1,500 watchers, 350 forks on Github (link)

700 followers on Twitter (link)

850 questions on StackOverflow (link)

200 messages per month on mailing list (<a href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)

40-50 users always connected to IRC channel (link)

Commercial support
A few companies provide Scrapy consulting and support

Still not sure if Scrapy is what you're looking for?. Check out
<a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>.

Companies using Scrapy

Scrapy is being used in large production environments, to crawl

thousands of sites daily. Here is a list of Companies

using Scrapy.

Where to start?

Start by reading Scrapy at a glance,
then download Scrapy and follow the
<a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.

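But I want to get plain text directly from Scrapy.

I don't want to use XPath selectors to pull out the p, h2, h3, ... tags one by one, because I'm crawling a website whose main content is recursively nested inside table and tbody elements. Working out the XPath for each of them could be a tedious task.

Can this be done with a built-in function in Scrapy, or do I need an external tool to convert the HTML? I have read all of the Scrapy documentation but found nothing.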
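
One possible way to do the conversion without writing a selector per tag is to strip the markup from the extracted fragment with Python's standard-library `html.parser`. The sketch below is only an illustration under assumptions, not a Scrapy built-in: `TextExtractor`, `html_to_text`, and `sample` are hypothetical names, and the sample markup is made up to mimic content nested inside table and tbody. In the Scrapy shell, the input would be the extracted string (`content` above).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace left over from the HTML layout.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical fragment: text buried inside table/tbody, plus a script.
sample = """
<div id="content">
  <table><tbody><tr><td>
    <h2>What is Scrapy?</h2>
    <p>Scrapy is a fast high-level
       screen scraping framework.</p>
  </td></tr></tbody></table>
  <script>var ignored = 1;</script>
</div>
"""
print(html_to_text(sample))
```

Because every text node becomes its own line, inline elements such as links split a sentence across lines; joining the chunks with a space instead of a newline may suit running prose better. Staying inside Scrapy, an XPath ending in `//text()` selects all descendant text nodes of an element regardless of how deeply they are nested, which may be enough when no whitespace cleanup is needed.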
