Parsing HTML pages with Python: simple HTML page parsing with urllib2, part 1

Latest recommended article published on 2024-03-18 15:46:55

weixin_39843677


1. Fetching an HTML page with urllib2

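#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 only: urllib2 was merged into urllib.request in Python 3.

import urllib2

# Download the page and print the raw HTML.
response = urllib2.urlopen('http://www.baidu.com')
html = response.read()
print html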

A few lines of code are enough to fetch the HTML page; the next step is parsing the HTML.
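As a rough illustration of that parsing step, the sketch below pulls the page title and all link targets out of an HTML string using only the standard library. It is written for Python 3, where the parser lives in `html.parser` (in Python 2, which the urllib2 code above targets, the module is named `HTMLParser`). The class name `LinkAndTitleParser`, the helper `parse_page`, and the sample markup are assumptions made up for this example, not something from the original post.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept
        if self._in_title:
            self.title += data


def parse_page(html):
    """Return (title, links) for an HTML string."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    return parser.title.strip(), parser.links


# Hypothetical sample markup standing in for a downloaded page
SAMPLE_HTML = """
<html><head><title> Example page </title></head>
<body><a href="/page/1/">first</a> <a href="/page/2/">second</a></body></html>
"""

title, links = parse_page(SAMPLE_HTML)
print(title)
print(links)
```

In real use, the string passed to `parse_page` would be the decoded body returned by the download step rather than the sample markup.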

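That was the theory; in practice there were problems. Baidu does not block crawlers, so its page can be fetched normally, but some sites do: for example, https://b.ishadow.tech/ blocks crawlers, and simply imitating a browser's request headers does not get past it either.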
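For reference, "imitating a browser's request headers" usually amounts to something like the sketch below. It uses Python 3's `urllib.request`; in Python 2 the equivalent class is `urllib2.Request`. The header values and the helper names `build_browser_request` and `fetch` are assumptions for illustration, and, as noted above, this approach was not enough for the blocked site.

```python
import urllib.request

# Hypothetical header values; any realistic browser string would do
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_browser_request(url):
    """Return a Request carrying browser-like headers; nothing is sent yet."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url, timeout=10):
    """Send the request and return the raw body bytes."""
    request = build_browser_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; fetch() would.
request = build_browser_request("http://www.baidu.com")
print(request.get_header("User-agent"))
```

Note that `urllib.request.Request` normalizes header names with `str.capitalize()`, which is why the lookup key is `User-agent` rather than `User-Agent`.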

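Next I wanted to try a crawler from GitHub to see whether it would do better, and found https://github.com/scrapy/scrapy, which looks quite powerful.

I installed it following the README, and the installation failed, so I read the documentation more carefully.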

pip install scrapy

Using a virtual environment (recommended)

TL;DR: We recommend installing Scrapy inside a virtual environment on all platforms.

Python packages can be installed either globally (a.k.a system wide), or in user-space. We do not recommend installing scrapy system wide.

Instead, we recommend that you install scrapy within a so-called “virtual environment” (virtualenv). Virtualenvs allow you to not conflict with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the likes).

So I decided to set up a Python virtual environment, with the following command:

$ sudo pip install virtualenv

Check the basic usage:

$ virtualenv -h

Usage: virtualenv [OPTIONS] DEST_DIR

virtualenv followed by the target directory is all that is needed.

So create a new virtual environment:

$ virtualenv e27

New python executable in ~/e27/bin/python

Installing setuptools, pip, wheel...done.

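Activate the environment (run from inside the e27 directory):

$ source ./bin/activate

Note that once the environment is active, the shell prompt is prefixed with the environment name, as shown here: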

➜ e27 source ./bin/activate

(e27) ➜ e27
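Besides the prompt prefix, the interpreter itself can report whether it is running inside a virtual environment. The check below is a small sketch of the usual approach; the function name `in_virtualenv` is made up for this example. It relies on `sys.real_prefix`, which older virtualenv releases set, and on `sys.base_prefix`, which Python 3's venv and newer virtualenv releases use.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix on the interpreter
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.prefix differ from sys.base_prefix
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Run with the environment activated, the prefix printed should point into the environment directory (here, e27) rather than the system Python.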

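Exit the environment:

$ deactivate

With the virtual environment active, install Scrapy:

pip install scrapy

The installation finished in about three minutes; the earlier attempt to install directly into the global environment had, surprisingly, failed. On success the shell output looks like this: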

......

Successfully built lxml PyDispatcher Twisted pycparser

Installing collected packages: lxml, PyDispatcher, zope.interface, constantly, incremental, attrs, Automat, Twisted, ipaddress, asn1crypto, enum34, idna, pycparser, cffi, cryptography, pyOpenSSL, queuelib, w3lib, cssselect, parsel, pyasn1, pyasn1-modules, service-identity, scrapy

Successfully installed Automat-0.5.0 PyDispatcher-2.0.5 Twisted-17.1.0 asn1crypto-0.22.0 attrs-16.3.0 cffi-1.10.0 constantly-15.1.0 cryptography-1.8.1 cssselect-1.0.1 enum34-1.1.6 idna-2.5 incremental-16.10.1 ipaddress-1.0.18 lxml-3.7.3 parsel-1.1.0 pyOpenSSL-16.2.0 pyasn1-0.2.3 pyasn1-modules-0.0.8 pycparser-2.17 queuelib-1.4.2 scrapy-1.3.3 service-identity-16.0.0 w3lib-1.17.0 zope.interface-4.3.3

With Scrapy installed, try a simple connection:

(e27) ➜ e27 scrapy shell 'http://quotes.toscrape.com/page/1/'

This produces a pile of output like the following:

2017-03-30 22:08:42 [scrapy.core.engine] INFO: Spider opened
2017-03-30 22:08:43 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

This shows that it works. Now try the URL https://b.ishadow.tech/:

(e27) ➜ e27 scrapy shell 'https://b.ishadow.tech/'

The result:

2017-03-30 22:10:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-03-30 22:10:21 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-30 22:10:21 [scrapy.core.engine] INFO: Spider opened
2017-03-30 22:11:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying (failed 1 times): TCP connection timed out: 60: Operation timed out.

2017-03-30 22:12:51 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying (failed 2 times): TCP connection timed out: 60: Operation timed out.

2017-03-30 22:14:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying (failed 3 times): TCP connection timed out: 60: Operation timed out.

Traceback (most recent call last):

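The crawl timed out. It looks as though the request was identified as coming from a bot and refused (the site was reachable from a browser at the same moment). Impressive! By now you may have guessed my real purpose; if not, open the link I was crawling and you will see. 😁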
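The log above shows Scrapy retrying three times before giving up. The same pattern can be sketched with the standard library alone, which also makes it easier to tell a connection-level failure (timeout, refused connection) apart from a server that answers with an HTTP error. This is a minimal Python 3 sketch and not Scrapy's retry middleware; `fetch_with_retries`, `make_flaky_opener`, and the `example.invalid` URL are hypothetical names used only for this illustration, and the fake opener exists so the example runs without touching the network.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, retries=3, timeout=10, delay=0.0,
                       opener=urllib.request.urlopen):
    """Try to download url up to `retries` times; return (body, attempts_used)."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = opener(url, timeout=timeout)
            return response.read(), attempt
        except urllib.error.HTTPError:
            # The server did answer, with an error status: not a connection
            # problem, so pass it straight to the caller.
            raise
        except (urllib.error.URLError, socket.timeout) as exc:
            # No usable response at all (timeout, refused connection, DNS failure)
            last_error = exc
            print("attempt %d failed: %s" % (attempt, exc))
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError("gave up after %d attempts: %s" % (retries, last_error))


class _FakeResponse:
    """Hypothetical stand-in for the object urlopen returns."""

    def read(self):
        return b"<html></html>"


def make_flaky_opener(failures):
    """Hypothetical stand-in for urlopen that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def opener(url, timeout=None):
        if state["left"] > 0:
            state["left"] -= 1
            raise urllib.error.URLError("TCP connection timed out")
        return _FakeResponse()

    return opener


body, attempts = fetch_with_retries("https://example.invalid/",
                                    opener=make_flaky_opener(2))
print(attempts, body)
```

Against a real site, the default `opener` would be used and `delay` set to a few seconds so that retries are not fired back to back.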

Next, I will work out, step by step, how to get around the block.