Python 3 Web Scraping in Practice (1): Scraping Site News with requests/urllib.request and BeautifulSoup

Latest recommended article published on 2021-02-04 17:45:01

程序员对白

Category: Python web scraping

Article link: https://blog.csdn.net/qq_33161208/article/details/99241222

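1. Sending browser-like requests with requests:

The requests module sends HTTP requests to a web server and can present itself as a browser by setting request headers. Here we first use requests to fetch the Tsinghua University news home page, then parse the HTML document with BeautifulSoup and print the page text with the HTML tags stripped. The code is as follows: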

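# requests or urllib.request: send HTTP requests to the server while presenting as a browser
# BeautifulSoup: parses an HTML document into a tree; in plain terms, it lets us strip the HTML tags
# selenium: open-source web automation testing tool; in plain terms, it drives a real browser from a script (not used here)
import requests
from bs4 import BeautifulSoup as bs


def get_info(url, headers):
    # Fetch the page; the timeout keeps the script from hanging on a slow server
    response = requests.get(url, headers=headers, timeout=10)
    # Set the encoding before reading response.text so the body is decoded as UTF-8
    response.encoding = 'utf-8'
    soup = bs(response.text, 'html.parser')
    # soup.text is the document text with all tags removed
    print(soup.text)


if __name__ == '__main__':
    url = "http://news.tsinghua.edu.cn/publish/thunews/index.html"
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://info.tsinghua.edu.cn/',
        'Connection': 'keep-alive',
    }
    get_info(url, headers)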
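
The script above prints every piece of text on the page. As a minimal sketch of what "stripping the tags" means, and of how individual headlines could be picked out instead, the snippet below runs BeautifulSoup on a small inline HTML fragment. The fragment, its class name, and the helper names extract_headlines and extract_text are assumptions made up for illustration; the real page's markup will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded news page.
SAMPLE_HTML = """
<html><body>
  <ul class="news">
    <li><a href="/a.html">First headline</a></li>
    <li><a href="/b.html">Second headline</a></li>
  </ul>
</body></html>
"""


def extract_headlines(html):
    """Return (title, href) pairs for every link in the fragment."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]


def extract_text(html):
    """Return the visible text with tags removed, one entry per non-empty line."""
    soup = BeautifulSoup(html, "html.parser")
    return [line for line in soup.get_text("\n", strip=True).split("\n") if line]


headlines = extract_headlines(SAMPLE_HTML)
lines = extract_text(SAMPLE_HTML)
print(headlines)
print(lines)
```

Working on a fixed string like this makes it easy to check the parsing logic before pointing it at a live site.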

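2. Sending browser-like requests with urllib.request:

The urllib.request module can also send requests while presenting as a browser, and it serves the same purpose as the requests module. The difference is that urllib.request.urlopen() has no parameter for headers, and when given a plain URL it sends a default User-Agent that identifies the local Python version (of the form Python-urllib/3.x), so a server can immediately tell that the client is a crawler. We therefore first build a urllib.request.Request object that carries the headers, then pass it to urllib.request.urlopen() to fetch the page. The code is as follows: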

# Use requests or urllib.request to send HTTP requests while presenting as a browser
import urllib.request
from bs4 import BeautifulSoup as bs


def get_info(url, headers):
    # urllib.request.urlopen() has no headers parameter; its default User-Agent names the
    # local Python version, so a server can immediately tell this is a crawler.
    # To look like a real user's browser, attach a browser User-Agent through a Request object.
    request = urllib.request.Request(url, headers=headers)
    # The with block closes the connection once the body has been read
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode('utf-8')
    soup = bs(html, 'html.parser')
    print(soup.text)


if __name__ == '__main__':
    url = "http://news.tsinghua.edu.cn/publish/thunews/index.html"
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://info.tsinghua.edu.cn/',
        'Connection': 'keep-alive',
    }
    get_info(url, headers)
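
To make the User-Agent point concrete without touching the network, the following sketch inspects the header that a default opener would send and compares it with the one attached through a Request object. The URL and the User-Agent string are placeholders chosen for illustration, not values from the article.

```python
import urllib.request

# Hypothetical URL and User-Agent string, used only for illustration.
URL = "http://example.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (illustrative browser string)"}

# A default opener advertises the library itself in its header list.
default_headers = dict(urllib.request.build_opener().addheaders)
default_ua = default_headers["User-agent"]

# Wrapping the URL in a Request lets us attach our own headers before urlopen().
# Request normalizes header names with str.capitalize(), hence "User-agent".
request = urllib.request.Request(URL, headers=HEADERS)
custom_ua = request.get_header("User-agent")

print(default_ua)   # starts with "Python-urllib/"
print(custom_ua)
```

Nothing is sent here; the Request object only records what would go out once it is passed to urllib.request.urlopen().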

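Those are the differences between the requests module and the urllib.request module. The code shows that with requests we mostly work with attributes of the response object, such as response.encoding and response.text, whereas with urllib.request.urlopen() we mostly call methods, such as urllib.request.urlopen().read() and urllib.request.urlopen().read().decode().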
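
The method-based style of urllib.request can be tried offline. This is a rough sketch that assumes a small HTML file written to a temporary directory and opened through a file:// URL, which urlopen() also accepts. It shows that read() returns raw bytes and that decode() is the explicit step that turns them into text. The page content and variable names are made up for the example.

```python
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical page content, saved locally so the sketch needs no network.
PAGE = "<html><body><p>清华新闻</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    page_path = Path(tmp) / "page.html"
    page_path.write_text(PAGE, encoding="utf-8")

    # urlopen() handles file:// URLs too; the object it returns has read().
    with urllib.request.urlopen(page_path.as_uri()) as response:
        raw = response.read()        # bytes, not yet decoded
        html = raw.decode("utf-8")   # the explicit decode step produces str

print(type(raw).__name__, type(html).__name__)
```

With requests, the same decoding happens behind the response.text attribute, using whatever response.encoding is set to.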