My Python Learning Path: N Ways to Fetch Web Pages in Python 3

kapuliyuehan

N ways to fetch web pages in Python 3:

1. The simplest way

import urllib.request

response = urllib.request.urlopen('http://python.org/')
html = response.read()
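
`response.read()` returns bytes, so the page usually has to be decoded before it can be handled as text. A minimal sketch, assuming a hypothetical helper name `fetch_text` and a UTF-8 fallback for servers that do not declare a charset:

```python
import urllib.request


def fetch_text(url, default_encoding="utf-8"):
    """Fetch a URL and decode the body into a str (hypothetical helper)."""
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes
        # Charset from the Content-Type header, or None if not declared.
        charset = response.headers.get_content_charset()
    return raw.decode(charset or default_encoding, errors="replace")
```

Calling `fetch_text('http://python.org/')` would return the page as a string. Invalid byte sequences are replaced rather than raising `UnicodeDecodeError`, which is a design choice made here for robustness, not something urllib does by itself.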

2. Using a Request object

import urllib.request

req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
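
A `Request` object only describes the request; nothing goes over the network until it is passed to `urlopen()`. The short sketch below inspects a few of its attributes, which can help when debugging what is about to be sent:

```python
import urllib.request

req = urllib.request.Request("http://python.org/")

# Nothing is sent until urlopen() is called, so these are safe to inspect.
print(req.full_url)      # the URL the request was built from
print(req.type)          # URL scheme
print(req.host)          # host part of the URL
print(req.get_method())  # "GET" when no data is attached
```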

3. Sending data

#!/usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}

# urlencode() returns a str, but the POST body must be bytes in Python 3
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode('utf8'))
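
Whether urllib issues a GET or a POST depends on whether the request carries data. The sketch below shows both cases without contacting a server. The helper names `encode_form` and `with_query`, the `/search` path, and the field values are assumptions made for illustration.

```python
import urllib.parse
import urllib.request


def encode_form(fields):
    """Encode a dict as a form body in bytes (hypothetical helper)."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def with_query(url, params):
    """Append params as a query string for a GET request (hypothetical helper)."""
    return url + "?" + urllib.parse.urlencode(params)


# Placeholder values for illustration only.
body = encode_form({"act": "login", "login[email]": "user@example.com"})
post_req = urllib.request.Request("http://localhost/login.php", data=body)
get_req = urllib.request.Request(
    with_query("http://localhost/search", {"q": "python 3"})
)

print(body)                   # percent-encoded bytes
print(post_req.get_method())  # POST, because data is attached
print(get_req.full_url)
print(get_req.get_method())   # GET, because there is no data
```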

4. Sending data and headers

#!/usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}
headers = {'User-Agent': user_agent}

# urlencode() returns a str, but the POST body must be bytes in Python 3
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode('utf8'))
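
Headers can be checked on the `Request` object before anything is sent. One detail worth knowing is that `Request` normalizes header names with `str.capitalize()`, so lookups have to use that spelling. A small sketch with placeholder header values:

```python
import urllib.request

# Placeholder header values for illustration only.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; example-client)",
    "Accept-Language": "en",
}
req = urllib.request.Request("http://localhost/login.php", headers=headers)
req.add_header("Referer", "http://www.python.org/")

# Request stores header names via str.capitalize(), so "User-Agent"
# is looked up as "User-agent".
print(req.get_header("User-agent"))
print(req.has_header("Referer"))
print(sorted(req.header_items()))
```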

5. HTTP errors

#!/usr/bin/env python3
import urllib.error
import urllib.request

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # e.code is the HTTP status; the error object can be read like a response
    print(e.code)
    print(e.read().decode('utf8'))
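
Because an `HTTPError` carries a status code and a body, success and failure can be folded into one return value. A minimal sketch, assuming a hypothetical helper `fetch_status`; the `open_func` parameter is an assumption added so the function can be exercised without a network:

```python
import urllib.error
import urllib.request


def fetch_status(url, open_func=urllib.request.urlopen):
    """Return (status_code, body_bytes) for both success and HTTP errors.

    open_func is a hypothetical hook so the function can be tried offline.
    """
    try:
        with open_func(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as e:
        # HTTPError doubles as a response: it carries a status and a body.
        return e.code, e.read()
```

Errors that are not HTTP responses, such as a failed DNS lookup, still propagate as `URLError`.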

6. Exception handling, approach 1

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request('http://twitter.com/')
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it has to be caught first
    print("The server couldn't fulfill the request.")
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print('good!')
    print(response.read().decode('utf8'))

7. Exception handling, approach 2

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request('http://twitter.com/')
try:
    response = urlopen(req)
except URLError as e:
    # In Python 3 an HTTPError has both .code and .reason, so test for
    # .code first; otherwise every HTTP error lands in the 'reason' branch.
    if hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    print('good!')
    print(response.read().decode('utf8'))
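
Both approaches depend on the relationship between the two exception classes: `HTTPError` subclasses `URLError` and exposes `code` as well as `reason`. A sketch that uses `isinstance` rather than attribute checks, with a hypothetical helper name `describe_failure`:

```python
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # Test for HTTPError first: it subclasses URLError and has both
    # .code and .reason, so a URLError check alone cannot tell them apart.
    if isinstance(exc, HTTPError):
        return "HTTP error {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, URLError):
        return "Connection failed: {}".format(exc.reason)
    raise TypeError("not a urllib error: {!r}".format(exc))
```

It could be called from a single `except URLError as e:` block, for example `print(describe_failure(e))`.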

8. HTTP authentication

#!/usr/bin/env python3
import urllib.request

# Create a password manager.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://cms.tetx.com/"
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# Create an "opener" (OpenerDirector instance).
opener = urllib.request.build_opener(handler)

# Use the opener to fetch a URL.
a_url = "https://cms.tetx.com/"
x = opener.open(a_url)
print(x.read())

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
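
The password manager decides which credentials apply to which URL, and Basic authentication then sends them base64-encoded in an `Authorization` header once the server asks for them. The sketch below shows both parts offline. The credentials, the `example.com` URLs, and the helper `basic_auth_header` are placeholders for illustration.

```python
import base64
import urllib.request

# Placeholder credentials and URL for illustration only.
mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, "https://example.com/", "user", "secret")

# URLs below the registered one match, whatever realm the server reports.
print(mgr.find_user_password("Any Realm", "https://example.com/admin/"))
# Other hosts do not match.
print(mgr.find_user_password(None, "https://other.example.org/"))


def basic_auth_header(username, password):
    """Build the Authorization value that Basic auth sends (hypothetical helper)."""
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


print(basic_auth_header("user", "secret"))
```

Base64 is an encoding, not encryption, so Basic authentication is only reasonably safe over HTTPS.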

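9. Using a proxy

#!/usr/bin/env python3
import urllib.request

# ProxyHandler maps URL schemes ('http', 'https') to proxy URLs.
# It only talks to HTTP proxies; a SOCKS5 proxy needs a third-party library.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

a = urllib.request.urlopen("http://g.cn").read().decode("utf8")
print(a)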
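
`install_opener()` changes the behaviour of every later `urlopen()` call in the process. When only some requests should go through the proxy, a private opener can be used instead. A sketch, assuming a hypothetical helper `make_proxy_opener` and an HTTP proxy listening on `localhost:1080`:

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that routes http and https through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical local HTTP proxy address.
opener = make_proxy_opener("http://localhost:1080")
# An empty mapping disables proxies, including ones from the environment.
direct_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
```

`opener.open("http://g.cn")` would then go through the proxy, while plain `urllib.request.urlopen()` calls elsewhere stay unaffected.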

10. Timeouts

#!/usr/bin/env python3
import socket
import urllib.request

# Timeout in seconds.
timeout = 2
socket.setdefaulttimeout(timeout)

# This call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module.
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
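
`socket.setdefaulttimeout()` affects every socket created afterwards. `urlopen()` also accepts a `timeout` argument that applies to a single call. A sketch of that variant, assuming a hypothetical helper `fetch_with_timeout`; the `open_func` parameter is an assumption added so the function can be tried without a network:

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, seconds, open_func=urllib.request.urlopen):
    """Return the body as bytes, or None if the request timed out.

    open_func is a hypothetical hook so the function can be tried offline.
    """
    try:
        with open_func(url, timeout=seconds) as response:
            return response.read()
    except socket.timeout:
        # Raised directly when reading the response takes too long.
        return None
    except urllib.error.URLError as e:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
```

Errors other than timeouts are re-raised, so they can still be handled as in sections 5 to 7.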

Reference

http://www.tetx.com/program/htm/tetx/blog/view/blog_id/1291521414/index.htm
