Python Web Scraping (Part 1)

Seana_chao

Category: python. Tags: python, web scraping

Article link: https://blog.csdn.net/luoye_xiaochao/article/details/60977393

We will use the requests library to build a simple web crawler.

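import requests

url = "http://www.baidu.com"
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # raises an exception on a 4xx/5xx status
    r.encoding = r.apparent_encoding  # guess the encoding from the body
    print(r.text)
except requests.RequestException:
    print("wrong")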

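requests offers seven main functions: get, put, head, post, patch, delete, and request. get fetches the resource at the URL. put asks the server to store a resource at the URL, overwriting the one already there. head fetches only the headers of the resource at the URL. post asks the server to append a new resource to the one at the URL. patch asks the server to change part of the resource. delete asks the server to remove the resource. request is the general function that the other six are built on. A call such as requests.get(url) leaves several optional parameters at their defaults, and we can set them ourselves, for example: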

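# Upload a file with a POST request
fs = {'file': open('data.xls', 'rb')}
r = requests.request('POST', url, files=fs)
# Route requests through proxy servers; credentials go before the "@"
px = {'http': 'http://user:pass@10.101.10.124:1234',
      'https': 'http://10.10.10.1:1234'}
r = requests.request('GET', url, proxies=px)
# Send a browser-like User-Agent header
kv = {'user-agent': 'Mozilla/5.0'}
r = requests.request('GET', url, headers=kv)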

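With proxies we can route requests through a proxy server, optionally with login credentials. With headers we can make the crawler identify itself as a browser.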
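One way to check what an optional argument such as headers actually changes is to build the request without sending it. The following is a minimal sketch, not part of the original post: the /s path and the wd query parameter are illustrative assumptions, and nothing is sent over the network.

```python
import requests

# Hypothetical target and header values, chosen only for illustration.
url = "http://www.baidu.com/s"
kv = {"user-agent": "Mozilla/5.0"}

# Request(...).prepare() builds the request as it would be sent,
# but does not open a network connection.
req = requests.Request("GET", url, params={"wd": "python"}, headers=kv)
prepared = req.prepare()

print(prepared.method)                 # the HTTP verb
print(prepared.url)                    # URL with the query string appended
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive
```

Inspecting the prepared request this way is handy when a site rejects the crawler and it is unclear which header was really sent.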
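When writing a crawler, the page we fetch may turn out to be a 404 Not Found, so we rely on exception handling. r.raise_for_status() does nothing when the status code is 200 and raises an exception when the code signals an error (4xx or 5xx). In an interactive session we can also inspect r.status_code directly: 200 means the request succeeded.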
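The behaviour of raise_for_status() can be tried without touching the network. This is a sketch under one assumption: the Response object is built by hand purely for demonstration, whereas in a real crawler it comes from requests.get(url). The helper name describe_status is hypothetical.

```python
import requests


def describe_status(status_code):
    """Return 'ok' if raise_for_status() passes, else the error text."""
    # Build a Response by hand so the sketch runs without network access;
    # in a real crawler this object comes from requests.get(url).
    r = requests.Response()
    r.status_code = status_code
    try:
        r.raise_for_status()
        return "ok"
    except requests.HTTPError as err:
        return "error: " + str(err)


print(describe_status(200))
print(describe_status(404))
```

Catching requests.HTTPError separately from other exceptions lets the crawler tell a bad status code apart from a connection failure.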
To parse the page we can use the BeautifulSoup library. First we have to cook what we fetched into a "soup":

from bs4 import BeautifulSoup

demo = r.text  # HTML text fetched earlier with requests
soup = BeautifulSoup(demo, 'html.parser')

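We use html.parser to turn r.text into a proper tree, which gives us the structural relationships between the tags on the page. There are three ways to move through the tree: downward, upward, and sideways. To go downward, reach a tag's children through tag.children (an iterator, so it has to be looped over) or tag.contents (a list). To go upward, use tag.parent for the direct parent or tag.parents to iterate over all ancestors. To go sideways, use tag.next_sibling or tag.previous_sibling; the plural forms tag.next_siblings and tag.previous_siblings are iterators.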
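The three directions can be tried on a small document. This is a minimal sketch: the HTML string is a hand-written stand-in for r.text (an assumption, not a real page), written on one line so that no whitespace nodes appear between the tags.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical) standing in for r.text.
demo = "<html><body><p>one</p><b>two</b><i>three</i></body></html>"
soup = BeautifulSoup(demo, "html.parser")

body = soup.body
middle = soup.b

# Downward: .contents is a list, .children iterates over the same nodes.
child_names = [child.name for child in body.children]

# Upward: .parent is the direct parent, .parents iterates over all ancestors.
ancestor_names = [parent.name for parent in middle.parents]

# Sideways: siblings share the same parent.
next_name = middle.next_sibling.name
previous_name = middle.previous_sibling.name

print(child_names, ancestor_names, next_name, previous_name)
```

On real pages the siblings of a tag are often plain text nodes such as newlines, so it is worth checking the type of each node before reading its name.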
In Python scraping, parsing lists is the key skill and deserves regular practice.
