Web Scraping with Python - notes (pending migration and update)

Preface

Web scraping is well suited to collecting and processing large amounts of data. It lets you go beyond what search engines offer, for example finding the cheapest flight.

APIs provide well-formatted data, but many sites offer no API, and there is no unified API across sites. Even when an API exists, the data types and formats may not exactly match your requirements, and it may also be too slow.

Use cases include market forecasting, machine translation, medical diagnosis, and more. It can even serve art, for example http://wefeelfine.org/.

This article is based on Python 3 and assumes basic Python knowledge.

Code download: http://pythonscraping.com/code/.

Your first web scraper

Connecting
from urllib.request import urlopen
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
print(html.read())

Output:

$ python3 1-basicExample.py 
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
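
The b'...' prefix shows that read() returns raw bytes rather than a string. To get plain text, decode the bytes yourself; a minimal sketch, assuming the page is UTF-8 encoded:

from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
# read() returns bytes; decode() turns them into a str
# (UTF-8 is an assumption here; check the page's charset in practice)
print(html.read().decode('utf-8'))
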
An introduction to BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.h1)

Output:

$ python3 2-beautifulSoup.py 
<h1>An Interesting Title</h1>
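
The second argument to BeautifulSoup selects the parser. 'lxml' is fast but must be installed separately (pip install lxml); the standard-library 'html.parser' needs no extra dependency and works as a drop-in replacement on a simple page like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# 'html.parser' ships with Python, so nothing extra to install
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.h1)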

The HTML document hierarchy is as follows:

• html → <html><head>...</head><body>...</body></html>
    — head → <head><title>A Useful Page</title></head>
        — title → <title>A Useful Page</title>
    — body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
        — h1 → <h1>An Interesting Title</h1>
        — div → <div>Lorem ipsum dolor...</div>

Note that bsObj.h1 above has the same effect as any of the following:

bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
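
A quick way to confirm the equivalence is to print all four expressions; they resolve to the same tag:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'lxml')

# all four navigation paths reach the same <h1> tag
print(bsObj.h1)
print(bsObj.html.body.h1)
print(bsObj.body.h1)
print(bsObj.html.h1)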

The errors that commonly occur with urlopen are:

• The page cannot be found on the server (or there was an error retrieving it): 404 or 500
• The server itself cannot be found

The first case surfaces as an HTTPError; an unreachable server raises a URLError instead. The HTTPError can be handled like this:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # program continues. Note: if you return or break in the
    # exception catch, you do not need the "else" clause
    pass
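
A fuller sketch that also catches the unreachable-server case (URLError) and guards against a tag that is missing from the page (BeautifulSoup's find() returns None for tags it cannot find). The URL is the same exercise page as above:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    # the server was reached but returned an error status (404, 500, ...)
    print(e)
except URLError as e:
    # the server itself could not be reached
    print("The server could not be found!")
else:
    bsObj = BeautifulSoup(html.read(), 'lxml')
    # find() returns None for a missing tag instead of raising,
    # so check before using the result
    tag = bsObj.find('h1')
    if tag is None:
        print("Tag was not found")
    else:
        print(tag)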

Some temporary notes

TCP client

import socket

target_host = "automationtesting.sinaapp.com"
target_port = 80

# create a socket object
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# connect the client
client.connect((target_host, target_port))

# send some data (must be bytes in Python 3)
client.send(b"GET / HTTP/1.1\r\nHost: automationtesting.sinaapp.com\r\n\r\n")

# receive some data
response = client.recv(4096)

# decode for display; replace any undecodable (e.g. compressed) bytes
print(response.decode(errors="replace"))

Output:

$ python3 tcp_test.py
HTTP/1.1 200 Ok
Server: nginx
Date: Mon, 22 Dec 2014 08:23:52 GMT
Content-Type: text/html;charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By-60WZB: wangzhan.360.cn
via: yq26.pyruntime
Content-Encoding: gzip
VAR-Cache: HIT
cache-control: max-age=14400
age: 0
...
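
Note that recv(4096) returns at most 4096 bytes, which is why the output above is truncated; for the full response you have to keep reading until the server closes the connection. A minimal sketch of such a receive loop (a Connection: close header is added here so the loop terminates; the headers above also show Content-Encoding: gzip, so the body would still need decompressing afterwards):

import socket

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("automationtesting.sinaapp.com", 80))

# ask the server to close the connection when the response is complete,
# so the loop below ends cleanly
client.send(b"GET / HTTP/1.1\r\n"
            b"Host: automationtesting.sinaapp.com\r\n"
            b"Connection: close\r\n\r\n")

chunks = []
while True:
    data = client.recv(4096)
    if not data:  # empty bytes object => connection closed
        break
    chunks.append(data)
client.close()

response = b"".join(chunks)
print(len(response), "bytes received")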

Reposted from: https://my.oschina.net/u/1433482/blog/495841

The internet contains a wealth of data. This data is provided both through structured APIs and as content delivered directly through websites. While the data in APIs is highly structured, information found in web pages is often unstructured and requires collection, extraction, and processing to be of value.

And collecting data is just the start of the journey, as that data must also be stored, mined, and then exposed to others in a value-added form. With this book, you will learn many of the core tasks needed to collect various forms of information from websites. We will cover how to collect it, how to perform several common data operations (including storage in local and remote databases), how to perform common media-based tasks such as converting images and videos to thumbnails, how to clean unstructured data with NLTK, how to examine several data mining and visualization tools, and finally core skills in building a microservices-based scraper and API that can, and will, be run on the cloud.

Through a recipe-based approach, we will learn independent techniques to solve specific tasks involved in not only scraping but also data manipulation and management, data mining, visualization, microservices, containers, and cloud operations. These recipes build skills in a progressive and holistic manner, not only teaching the fundamentals of scraping but also taking you from the results of scraping to a service offered to others through the cloud. We will be building an actual web-scraper-as-a-service using common tools in the Python, container, and cloud ecosystems.