Web Scraping with Python - notes (pending migration and update)

Preface

Web scraping is well suited to collecting and processing large amounts of data. It lets you go beyond what search engines offer, for example finding the cheapest flight.

APIs provide well-formatted data, but many sites offer no API, and there is no unified API across sites. Even when an API exists, the data types and formats may not exactly match your requirements, and it may also be too slow.

Use cases include market forecasting, machine translation, medical diagnosis, and more. It can even serve art, for example http://wefeelfine.org/.

This article is based on Python 3 and assumes basic Python knowledge.

Code download: http://pythonscraping.com/code/.

Your first web scraper

Connecting
from urllib.request import urlopen
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
print(html.read())

Output:

$ python3 1-basicExample.py 
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
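
The b'...' prefix shows that read() returns raw bytes rather than a string. To get plain text, decode the bytes yourself; a minimal sketch, assuming the page is UTF-8 encoded:

from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
# read() returns bytes; decode() turns them into a str
# (UTF-8 is an assumption here; check the page's charset in practice)
print(html.read().decode('utf-8'))
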
An introduction to BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.h1)

Output:

$ python3 2-beautifulSoup.py 
<h1>An Interesting Title</h1>
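
The second argument to BeautifulSoup selects the parser. 'lxml' is fast but must be installed separately (pip install lxml); the standard-library 'html.parser' needs no extra dependency and works as a drop-in replacement on a simple page like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# 'html.parser' ships with Python, so nothing extra to install
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.h1)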

The HTML document hierarchy is as follows:

• html → <html><head>...</head><body>...</body></html>
    — head → <head><title>A Useful Page</title></head>
        — title → <title>A Useful Page</title>
    — body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
        — h1 → <h1>An Interesting Title</h1>
        — div → <div>Lorem ipsum dolor...</div>

Note that bsObj.h1 above has the same effect as any of the following:

bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
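
A quick way to confirm the equivalence is to print all four expressions; they resolve to the same tag:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'lxml')

# all four navigation paths reach the same <h1> tag
print(bsObj.h1)
print(bsObj.html.body.h1)
print(bsObj.body.h1)
print(bsObj.html.h1)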

The errors that commonly occur with urlopen are:

• The page cannot be found on the server (or there was an error retrieving it): 404 or 500
• The server itself cannot be found

The first case surfaces as an HTTPError; an unreachable server raises a URLError instead. The HTTPError can be handled like this:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # program continues. Note: if you return or break in the
    # exception catch, you do not need the "else" clause
    pass
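
A fuller sketch that also catches the unreachable-server case (URLError) and guards against a tag that is missing from the page (BeautifulSoup's find() returns None for tags it cannot find). The URL is the same exercise page as above:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    # the server was reached but returned an error status (404, 500, ...)
    print(e)
except URLError as e:
    # the server itself could not be reached
    print("The server could not be found!")
else:
    bsObj = BeautifulSoup(html.read(), 'lxml')
    # find() returns None for a missing tag instead of raising,
    # so check before using the result
    tag = bsObj.find('h1')
    if tag is None:
        print("Tag was not found")
    else:
        print(tag)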

Some temporary notes

TCP client

import socket

target_host = "automationtesting.sinaapp.com"
target_port = 80

# create a socket object
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# connect the client
client.connect((target_host, target_port))

# send some data (must be bytes in Python 3)
client.send(b"GET / HTTP/1.1\r\nHost: automationtesting.sinaapp.com\r\n\r\n")

# receive some data
response = client.recv(4096)

# decode for display; replace any undecodable (e.g. compressed) bytes
print(response.decode(errors="replace"))

Output:

$ python3 tcp_test.py
HTTP/1.1 200 Ok
Server: nginx
Date: Mon, 22 Dec 2014 08:23:52 GMT
Content-Type: text/html;charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By-60WZB: wangzhan.360.cn
via: yq26.pyruntime
Content-Encoding: gzip
VAR-Cache: HIT
cache-control: max-age=14400
age: 0
...
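
Note that recv(4096) returns at most 4096 bytes, which is why the output above is truncated; for the full response you have to keep reading until the server closes the connection. A minimal sketch of such a receive loop (a Connection: close header is added here so the loop terminates; the headers above also show Content-Encoding: gzip, so the body would still need decompressing afterwards):

import socket

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("automationtesting.sinaapp.com", 80))

# ask the server to close the connection when the response is complete,
# so the loop below ends cleanly
client.send(b"GET / HTTP/1.1\r\n"
            b"Host: automationtesting.sinaapp.com\r\n"
            b"Connection: close\r\n\r\n")

chunks = []
while True:
    data = client.recv(4096)
    if not data:  # empty bytes object => connection closed
        break
    chunks.append(data)
client.close()

response = b"".join(chunks)
print(len(response), "bytes received")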

Reposted from: https://my.oschina.net/u/1433482/blog/495841

The internet contains a wealth of data. This data is provided both through structured APIs and as content delivered directly through websites. While the data in APIs is highly structured, information found in web pages is often unstructured and requires collection, extraction, and processing to be of value.

And collecting data is just the start of the journey, as that data must also be stored, mined, and then exposed to others in a value-added form. With this book, you will learn many of the core tasks needed to collect various forms of information from websites. We will cover how to collect it, how to perform several common data operations (including storage in local and remote databases), how to perform common media-based tasks such as converting images and videos to thumbnails, how to clean unstructured data with NLTK, how to examine several data mining and visualization tools, and finally core skills in building a microservices-based scraper and API that can, and will, be run on the cloud.

Through a recipe-based approach, we will learn independent techniques to solve specific tasks involved in not only scraping but also data manipulation and management, data mining, visualization, microservices, containers, and cloud operations. These recipes build skills in a progressive and holistic manner, not only teaching the fundamentals of scraping but also taking you from the results of scraping to a service offered to others through the cloud. We will be building an actual web-scraper-as-a-service using common tools in the Python, container, and cloud ecosystems.