Getting the text of HTML tags with Scrapy: can Scrapy extract plain text from raw HTML data?

Latest recommended article published on 2021-06-09 12:59:43

UXOFFER


For example:

scrapy shell http://scrapy.org/

# Older Scrapy versions exposed this as hxs.select(...).extract()[0]

content = response.xpath('//*[@id="content"]').get()

print(content)

Then I get the following raw HTML:

Welcome to Scrapy

What is Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features

Simple

Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way

Productive

Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you

Fast

Scrapy is used in production crawlers to completely scrape more than 500 retailer sites daily, all in one server

Extensible

Scrapy was designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core

Portable, open-source, 100% Python

Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD

Batteries included

Scrapy comes with lots of functionality built in. Check <a href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this section</a> of the documentation for a list of them.

Well-documented & well-tested

Scrapy is extensively documented and has an comprehensive test suite with very good code coverage

Healthy community

1,500 watchers, 350 forks on Github (link)

700 followers on Twitter (link)

850 questions on StackOverflow (link)

200 messages per month on mailing list (<a href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)

40-50 users always connected to IRC channel (link)

Commercial support

A few companies provide Scrapy consulting and support

Still not sure if Scrapy is what you're looking for?. Check out <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>.

Companies using Scrapy

Scrapy is being used in large production environments, to crawl thousands of sites daily. Here is a list of Companies using Scrapy.

Where to start?

Start by reading Scrapy at a glance, then download Scrapy and follow the <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.

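But I want to get plain text directly from Scrapy.

I don't want to use XPath selectors to pull out the p, h2, h3... tags one by one, because I am crawling a website whose main content is nested recursively inside table and tbody elements. Working out an XPath for each of them could be tedious.

Can this be done with a built-in function in Scrapy, or do I need an external tool to convert it? I have read through all of Scrapy's documentation, but found nothing.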
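To make the "external tool" option concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` to reduce an HTML fragment to plain text, no matter how deeply the content is nested in table and tbody elements. The names `TextExtractor` and `html_to_text` are hypothetical helpers invented for this illustration, not part of Scrapy, and the sample markup is made up rather than taken from scrapy.org.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace, including source line breaks.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def get_text(self):
        # One line per text node; inline tags therefore split a sentence.
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


if __name__ == "__main__":
    # Made-up sample, shaped like the extracted `content` string above.
    sample = (
        '<div id="content"><h2>Welcome to Scrapy</h2>'
        "<table><tbody><tr><td><p>Scrapy is a fast\n   high-level "
        "framework.</p></td></tr></tbody></table></div>"
    )
    print(html_to_text(sample))
```

In the Scrapy shell, `html_to_text(content)` could be called on the string extracted earlier. Staying inside Scrapy's selectors, a comparable result can usually be obtained by selecting every descendant text node, for example `response.xpath('//*[@id="content"]//text()').getall()`, and joining the non-empty pieces; that avoids writing a separate XPath for each p, h2, or h3.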
