Python: Web Crawlers and Regular Expressions

2022冲鸭

Category: Programming

Article link: https://blog.csdn.net/y805805/article/details/87381175

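With the boom in artificial intelligence, Python has become popular as well. It is commonly used for web crawlers, web development, and automation. Below is a summary of some crawler examples I have learned.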

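1. Fetching Baidu Tieba pages

The individual steps are annotated in the code below. One point is worth spelling out: whether a page is requested from a browser or from program code, the backend server can trace the request, so the program sends a User-Agent header to disguise itself as a browser. Windows NT 6.1 is the internal version name of Windows 7.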

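# -*- coding: utf-8 -*-
# Python 2 code: urllib.urlencode and urllib2 exist only in Python 2
import urllib
import urllib2

# Base address of the Tieba listing page
url = "http://tieba.baidu.com/f?"
# URL-encode the keyword and append it as the query string
kw = {"kw": "迪丽热巴"}
kw = urllib.urlencode(kw)
url = url + kw
# print(url)
for i in range(4):
    # Create the output file for page i+1
    file = open("The%s page.html" % (i + 1), "w")
    # Append this page's offset to the URL (pn advances by 50 per page)
    page = "&pn=%s" % (i * 50)
    newurl = url + page
    # print(newurl)
    # Disguise the request as coming from a browser
    user_agent = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
    # Build an HTTP GET request that carries the User-Agent header
    req = urllib2.Request(newurl, headers=user_agent)
    # Open the request and get the response object
    rsp = urllib2.urlopen(req)
    # Read the body returned by the server
    html = rsp.read()
    # Write the body to the file for this page
    file.write(html)
    # Close the file
    file.close()
    print("Page %s over" % (i + 1))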
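
The URL-building part of the loop above can be tried on its own, without touching the network. The following is a minimal Python 3 sketch, not the author's code: the function name `build_page_urls` and the `page_size` parameter are made up for illustration, and the step of 50 posts per page is taken from the listing above.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f?"


def build_page_urls(keyword, pages, page_size=50):
    """Return the listing URLs for the first `pages` pages of a Tieba forum.

    Hypothetical helper: the name and the page_size parameter are
    illustrative, the pn step of 50 comes from the loop above.
    """
    # urlencode percent-encodes non-ASCII text as UTF-8 by default
    query = urlencode({"kw": keyword})
    return [BASE_URL + query + "&pn=%d" % (i * page_size) for i in range(pages)]


for page_url in build_page_urls("python", 3):
    print(page_url)
```

In Python 3, `urllib.parse.urlencode` takes over the role of Python 2's `urllib.urlencode`. Keeping the URL construction in a small pure function makes it easy to check before any request is sent.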

2. Fetching a search page for your own keyword

# -*- coding: utf-8 -*-
# Python 2 code
import urllib
import urllib2

# Read the search keyword from the user
name = raw_input("Enter a keyword: ")
kw = {"wd": name}
kw = urllib.urlencode(kw)
url = "http://www.baidu.com/s?" + kw

header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

req = urllib2.Request(url, headers=header)
rsp = urllib2.urlopen(req)
# HTTP status code of the response
print(rsp.code)
# print(rsp.read())
file = open("dili.txt", "w")
file.write(rsp.read())
file.close()

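# Fetching the Baidu home page (Python 3)
# This approach only fetches data by URL; it does not build an explicit HTTP/HTTPS request object (GET or POST)
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
html = response.read()
# read() returns bytes, so open the file in binary mode
file = open("baidu.html", "wb")
file.write(html)
file.close()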
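
To make the contrast with a bare `urlopen(url)` call concrete, here is a small Python 3 sketch that builds explicit request objects and inspects them without sending anything. The helper `make_request` and its `post` flag are hypothetical, and using POST against this URL is purely illustrative; the only point is that `urllib.request.Request` switches from GET to POST once a body is supplied.

```python
import urllib.parse
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"


def make_request(url, params=None, post=False):
    """Build a Request object without sending it (hypothetical helper)."""
    headers = {"User-Agent": USER_AGENT}
    encoded = urllib.parse.urlencode(params or {})
    if post:
        # Supplying a body (as bytes) makes urllib issue a POST
        return urllib.request.Request(url, data=encoded.encode("ascii"), headers=headers)
    # For GET the parameters travel in the query string
    full_url = url + "?" + encoded if encoded else url
    return urllib.request.Request(full_url, headers=headers)


get_req = make_request("http://www.baidu.com/s", {"wd": "python"})
post_req = make_request("http://www.baidu.com/s", {"wd": "python"}, post=True)
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it. Note that `Request` stores header names in capitalized form, so the header is read back with `get_req.get_header("User-agent")`.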

3. Querying a given page from a program

import urllib.parse
import urllib.request

kw = input("Enter what to search for: ")
url = "https://www.baidu.com/s"
# URL-encode the query parameters
values = {"wd": kw}
data = urllib.parse.urlencode(values)
print(data)
url = url + "?" + data

header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

req = urllib.request.Request(url, headers=header)
rsp = urllib.request.urlopen(req)
html = rsp.read()
# The body is bytes; decode it as UTF-8 before printing
print(html.decode("utf8"))
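
The listing above decodes the body as UTF-8 unconditionally, which fails on pages served in another encoding such as GBK. As a hedged sketch, the decoding step could instead read the charset from the Content-Type value and fall back to UTF-8. The function `decode_body` and its `default` parameter are made up for illustration.

```python
from email.message import Message


def decode_body(body, content_type, default="utf-8"):
    """Decode response bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    # errors="replace" keeps going when a page contains a few invalid bytes
    return body.decode(charset, errors="replace")


print(decode_body("百度".encode("gbk"), "text/html; charset=GBK"))
print(decode_body("百度".encode("utf-8"), "text/html"))
```

With a real response the same lookup is available directly, because `rsp.headers` is itself an `email.message.Message` subclass and therefore offers `rsp.headers.get_content_charset()`.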

4. Fetching the Neihan Duanzi pages (first 20 pages)

# -*- coding: utf-8 -*-
# Python 2 code
# Crawl the first 20 pages of the brain-teaser section of Neihan Duanzi
import re
import urllib
import urllib2

url = "http://www.neihan8.com/njjzw/"
file = open("njjzw.txt", "w")
num = 0
for i in range(20):
    if i == 0:
        newurl = url + "index_%s.html" % (i)
    else:
        newurl = url + "index_%s.html" % (i + 1)
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
    request = urllib2.Request(newurl, headers=headers)

    response = urllib2.urlopen(request)
    html = response.read()

    # Extract the questions first; they appear in attributes such as
    # class="title" title="一只蜜蜂叮在挂历上"
    pattern = re.compile('class="title" title="(.*?)"')
    # findall returns every substring of html that matches the pattern
    questions = pattern.findall(html)
    # print(len(questions))

    # Extract the answers
    pattern = re.compile("<div class=\"desc\">(.*)</div>")
    answers = pattern.findall(html)
    print(len(answers))
    # Drop the first match
    del answers[0]
    for j in range(len(answers)):
        num += 1
        file.write("Question %s: " % (num) + questions[j])
        file.write("\n")
        anw = answers[j]
        anw = anw.strip()
        # Drop the 6-character prefix of each answer
        anw = anw[6:]
        file.write("\t" + anw)
        file.write("\n")
file.close()
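
The two `findall` steps above can be checked offline against a small piece of markup. The Python 3 sketch below does that; the sample HTML is invented and only mimics the `class="title"` and `class="desc"` patterns used in the listing, and `extract_pairs` is a hypothetical helper name.

```python
import re

# Hypothetical markup, modelled on the two patterns used above
SAMPLE_HTML = '''
<a class="title" title="Question one">Question one</a>
<div class="desc">  Answer one </div>
<a class="title" title="Question two">Question two</a>
<div class="desc">  Answer two </div>
'''

# Non-greedy groups stop at the first closing quote or tag on a line
TITLE_RE = re.compile(r'class="title" title="(.*?)"')
DESC_RE = re.compile(r'<div class="desc">(.*?)</div>')


def extract_pairs(html):
    """Return (question, answer) tuples found in html."""
    questions = TITLE_RE.findall(html)
    answers = [a.strip() for a in DESC_RE.findall(html)]
    # zip stops at the shorter list, so a missing answer cannot raise IndexError
    return list(zip(questions, answers))


for question, answer in extract_pairs(SAMPLE_HTML):
    print(question, "->", answer)
```

Pairing with `zip` rather than indexing by position is a design choice for robustness: if a page yields fewer answers than questions, the extra items are dropped instead of crashing the crawl.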

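5. Regular expressions

I will not go through them one by one here. They work the same way as described for the shell, that is, pattern matching. If you need a regular expression in later programming, just look it up; it is only a tool.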

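# Example pattern for an 11-digit mobile number:
# [1][34578][0-9]{9}
import re
# match() tests the pattern against the start of the string
result = re.match("hehe", "hehehe")
text = result.group()
print(text)
# "." matches any single character
ret = re.match(".", "abcd")
print(ret.group())
# A character set matches any one of the listed characters
ret = re.match("[hH]", "hello Python")
print(ret.group())
ret = re.match("[01234567]Hello", "7Hello Python")
print(ret.group())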
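
The pattern noted in the comment at the top of the listing can be put to work as a validator. This is a minimal Python 3 sketch; reading `[1][34578][0-9]{9}` as a mobile-number pattern is an assumption, and `is_mobile_number` is a hypothetical helper name. The last two lines show that `re.match` only tests the start of the string, while `re.search` scans the whole string.

```python
import re

MOBILE_RE = re.compile(r"[1][34578][0-9]{9}")


def is_mobile_number(text):
    """True when the whole string is 1, then one of 3/4/5/7/8, then nine digits."""
    # fullmatch requires the pattern to cover the entire string
    return MOBILE_RE.fullmatch(text) is not None


print(is_mobile_number("13812345678"))  # True
print(is_mobile_number("12812345678"))  # False: second digit is not in [34578]

# match() anchors at position 0, search() finds the first hit anywhere
print(re.match("he", "ohhello"))           # None
print(re.search("he", "ohhello").group())  # he
```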
