Implementing a Simple Web Crawler in Python


半夏映浮光


Article link: https://blog.csdn.net/HXiao0805/article/details/81874745


I have just started learning Python and used it for some simple data scraping. The process has two main steps:

1. Fetch the HTML source of the target page.

2. Filter the fetched data with regular expressions (or similar tools) to extract the data you need.
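The two steps above can be sketched as a pair of small functions. This is only an illustration written for Python 3 (where `urllib.request` takes the place of `urllib2`). The function names `fetch_html` and `extract_posts`, the UTF-8 decoding, and the sample HTML string are assumptions made for this example and are not part of the original script.

```python
import re
import urllib.request


def fetch_html(url, timeout=10):
    """Step 1: download a page and return its HTML source as text."""
    with urllib.request.urlopen(url, timeout=timeout) as page:
        # Assumption: the page is UTF-8 encoded; undecodable bytes are replaced.
        return page.read().decode("utf-8", errors="replace")


def extract_posts(html):
    """Step 2: filter the HTML with a regular expression.

    Returns a list of (timestamp, read_count) string pairs.
    """
    # The Chinese literal 阅读 ("reads") matches the text on the page.
    pattern = r"posted @ (\d{4}-\d{2}-\d{2} \d{2}:\d{2}) Vamei 阅读\((\d+)\)"
    return re.findall(pattern, html)


# Hypothetical sample input, so the filtering step can be tried offline.
sample_html = (
    '<div class="postDesc">posted @ 2016-12-26 09:17 Vamei 阅读(5078) 评论(9)</div>\n'
    '<div class="postDesc">posted @ 2016-11-02 21:40 Vamei 阅读(1234) 评论(3)</div>\n'
)
posts = extract_posts(sample_html)
print(posts)
```

To try it against a live page, pass the result of `fetch_html(url)` to `extract_posts`. Searching the whole document with `re.findall` avoids having to split the page into lines first.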

Note: on a corporate network you may need to configure a proxy; otherwise you may not be able to reach external sites.
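As a rough sketch of what such a proxy setup might look like in Python 3, the snippet below builds an opener with `urllib.request`. The helper name `build_proxy_opener`, the address `proxy.example.com:8080`, and the credentials are placeholders assumed for illustration; substitute your own network's values.

```python
import urllib.request


def build_proxy_opener(server, user=None, password=None):
    """Return an opener that sends HTTP and HTTPS requests through a proxy.

    `server` is a "host:port" string. Credentials are optional.
    """
    proxy_url = "http://" + server
    handlers = [urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})]
    if user is not None:
        # Register the credentials so the opener can answer a 407 challenge.
        passmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        passmgr.add_password(None, server, user, password)
        handlers.append(urllib.request.ProxyBasicAuthHandler(passmgr))
    return urllib.request.build_opener(*handlers)


# Placeholder address and credentials: replace with your own proxy settings.
opener = build_proxy_opener("proxy.example.com:8080", "username", "password")
print([type(h).__name__ for h in opener.handlers])
```

Calling `opener.open(url)` sends a single request through the proxy, while `urllib.request.install_opener(opener)` makes it the default for later `urllib.request.urlopen` calls.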

The code is below, for beginners' reference only. It targets Python 2, since it relies on the urllib2 module.

# -*- coding: utf-8 -*-
"""A first simple crawler: scrape the creation time and read count of blog posts."""
import urllib2
import re


# Build an opener that routes HTTP requests through an authenticated proxy
def url_build_proxy_opener(proxy_info):
    passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passmgr.add_password(None, proxy_info['server'], proxy_info['user'], proxy_info['password'])
    auth = urllib2.ProxyBasicAuthHandler(passmgr)
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy_info['server']}), auth)
    return opener


# Proxy settings (replace the placeholders with your own values)
proxy_info = {'user': 'username', 'password': 'password', 'server': 'proxy-server-ip:port'}

# Create an opener that uses this handler (calls the proxy function above):
opener = url_build_proxy_opener(proxy_info)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

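url = "http://www.cnblogs.com/vamei"  # Note: scraping may fail for some sites
request = urllib2.Request(url)
page = urllib2.urlopen(request)  # pass the Request object built above
html = page.read()
html = html.splitlines()  # split into lines; html is now a list with one line of markup per element
# print html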

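# Use a regular expression to pick out the parts we need.
# \( and \) match literal parentheses; the unescaped (...) are capture groups.
# The Chinese literals match the page text: 阅读 = "reads", 评论 = "comments".
pattern = r"posted @ (\d{4}-[0-1]\d-[0-3]\d [0-2]\d:[0-5]\d) Vamei 阅读\((\d+)\) 评论"
for line in html:
    m = re.search(pattern, line)  # look for a line that matches the pattern
    if m is not None:
        # group(1) is the first unescaped parenthesized group, group(2) the second, and so on
        print (m.group(1), m.group(2))
        print "Time: " + m.group(1) + "  Reads: " + m.group(2)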
# Quick offline check of the pattern against a single sample line:
# pattern = r"posted @ (\d{4}-[0-1]\d-[0-3]\d [0-2]\d:[0-5]\d) Vamei 阅读\((\d+)\) 评论"
# m = re.search(pattern, '<div class="postDesc">posted @ 2016-12-26 09:17 Vamei 阅读(5078) 评论(9)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6206331" rel="nofollow">编辑</a></div>')
# print m
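
The offline check hinted at in the commented-out lines can be written as a small standalone Python 3 snippet. This is a sketch, not the author's code: the named groups, the extra `comments` group, and the conversion to `datetime` and `int` are additions assumed for illustration, and the sample line is the one quoted in the comments above.

```python
import re
from datetime import datetime

# Named groups make the extracted fields self-describing.
# 阅读 = "reads", 评论 = "comments" (literal text on the page).
POST_RE = re.compile(
    r"posted @ (?P<created>\d{4}-\d{2}-\d{2} \d{2}:\d{2}) "
    r"Vamei 阅读\((?P<reads>\d+)\) 评论\((?P<comments>\d+)\)"
)

sample = (
    '<div class="postDesc">posted @ 2016-12-26 09:17 Vamei 阅读(5078) 评论(9)  '
    '<a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6206331" '
    'rel="nofollow">编辑</a></div>'
)

match = POST_RE.search(sample)
if match is not None:
    # Convert the captured strings into typed values.
    created = datetime.strptime(match.group("created"), "%Y-%m-%d %H:%M")
    reads = int(match.group("reads"))
    comments = int(match.group("comments"))
    print(created, reads, comments)
```

Testing the pattern on a fixed string like this makes it easier to tell a broken regular expression apart from a failed network request.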

If you enjoy my articles and would like to grow along with me, search for and follow the WeChat official account TryTestwonderful.
