For example: scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content
Then I get the following raw HTML:
Welcome to Scrapy
What is Scrapy?
Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from their
pages. It can be used for a wide range of purposes, from data mining to
monitoring and automated testing.
Features
Simple - Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way
Productive - Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
Fast - Scrapy is used in production crawlers to completely scrape more than 500 retailer sites daily, all in one server
Extensible - Scrapy was designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
Portable, open-source, 100% Python - Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD
Batteries included - Scrapy comes with lots of functionality built in. Check this section (http://doc.scrapy.org/en/latest/intro/overview.html#what-else) of the documentation for a list of them.
Well-documented & well-tested - Scrapy is extensively documented and has an comprehensive test suite with very good code coverage
Healthy community -
1,500 watchers, 350 forks on Github (link)
700 followers on Twitter (link)
850 questions on StackOverflow (link)
200 messages per month on mailing list (link: https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users)
40-50 users always connected to IRC channel (link)
Commercial support - A few companies provide Scrapy consulting and support
Still not sure if Scrapy is what you're looking for?. Check out Scrapy at a glance (http://doc.scrapy.org/en/latest/intro/overview.html).
Companies using Scrapy
Scrapy is being used in large production environments, to crawl
thousands of sites daily. Here is a list of Companies
using Scrapy.
Where to start?
Start by reading Scrapy at a glance,
then download Scrapy and follow the Tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html).
But I want to get the plain text directly from Scrapy.
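I don't want to use XPath selectors to extract the p, h2, h3... tags one by one, because I'm crawling a website whose main content is recursively nested inside table and tbody elements. Working out an XPath for each piece could be a tedious task.
Can this be done with a built-in function in Scrapy, or do I need an external tool to convert it? I have read through all of Scrapy's docs, but found nothing.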
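One possible direction, offered as a sketch rather than a definitive answer: within Scrapy itself, appending //text() to the XPath (for example '//*[@id="content"]//text()') selects the text nodes under the element, however deeply they are nested. If the HTML string has already been extracted, as in the shell session above, the tags can also be stripped with Python's standard library. The helper below assumes Python 3; the names TextExtractor and html_to_text are hypothetical and are not part of Scrapy.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty text nodes.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        # One line per text node; inline tags such as <b> also start a new line.
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample mimicking content nested inside table/tbody.
sample = (
    '<div id="content"><table><tbody><tr><td>'
    "<h2>What is Scrapy?</h2>"
    "<p>Scrapy is a fast <b>high-level</b> framework.</p>"
    "</td></tr></tbody></table></div>"
)
print(html_to_text(sample))
```

Because the parser walks every text node regardless of depth, no per-tag XPath is needed. The trade-off is that layout is lost: each text node ends up on its own line, so some post-processing may be needed if paragraph boundaries matter.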