Python Serial Notes (10): Introductory Web Crawler Practice Cases

Latest recommended article published on 2024-09-05 11:10:46

墨漓_lyl

Category column: Python Study Notes. Tags: python, study notes, web crawler

Article link: https://blog.csdn.net/qq_42025108/article/details/102757926

1. Extracting the URLs in a web page

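import urllib.request
import re

# Step 1: choose the URL to crawl
url = "http://www.baidu.com/"

# Step 2: send the request and get the response
response = urllib.request.urlopen(url)

# Step 3: read the response body and decode the bytes into text
html = response.read().decode("utf-8")

# Step 4: print the page source
print(html)

# Extract every double-quoted http:// URL from the page source
links = re.findall(r'"(http://[^"]+)"', html)
for link in links:
    print(link)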
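
The regular expression above only finds absolute http:// URLs wrapped in double quotes, so it misses relative links and single-quoted attributes. One possible alternative is to let the standard library's html.parser read the tags. The sketch below is only an illustration: the class name LinkCollector, the helper extract_links, and the sample HTML string are hypothetical, and the code runs on a hard-coded string so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns relative paths into absolute URLs
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in the given HTML text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample page, used here instead of a live request
sample_html = '<a href="http://news.baidu.com/">News</a> <a href=\'/more\'>More</a>'
sample_links = extract_links(sample_html, "http://www.baidu.com/")
for link in sample_links:
    print(link)
```

With a real page, the html string returned by response.read().decode(...) could be passed to extract_links in place of sample_html.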

2. Obtaining the User-Agent value and decoding the response

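import urllib.request

url = "http://www.baidu.com/"

# The User-Agent value can be copied from your own browser. In Chrome, for example,
# press F12, open the Network tab, click any entry in the Name column, and look for
# User-Agent under Headers.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}


# 1. Build the request object
request = urllib.request.Request(url, headers=headers)

# 2. Get the response object
response = urllib.request.urlopen(request)

# 3. Read the response body and decode it into text
html = response.read().decode("utf-8")

# Request stores header names capitalized, so the lookup key is "User-agent"
print(request.get_header("User-agent"))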
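
Two details in this listing are easy to trip over: the header is set as "User-Agent" but read back as "User-agent", and the body is always decoded as UTF-8. The following sketch illustrates both points without sending a request. The User-Agent string is a placeholder and decode_body is a hypothetical helper, not part of urllib.

```python
import urllib.request

ua = "Mozilla/5.0 (example)"  # placeholder value, not a real browser string
req = urllib.request.Request("http://www.baidu.com/", headers={"User-Agent": ua})

# Building a Request does not send anything over the network.
# urllib capitalizes header names, so "User-Agent" is stored as "User-agent".
print(req.header_items())
print(req.get_header("User-agent"))  # the stored value
print(req.get_header("User-Agent"))  # None, because the lookup is case-sensitive


def decode_body(raw, declared_charset=None):
    """Decode bytes with the declared charset, falling back to UTF-8
    with replacement characters (hypothetical helper)."""
    try:
        return raw.decode(declared_charset or "utf-8")
    except (UnicodeDecodeError, LookupError):
        return raw.decode("utf-8", errors="replace")


# With a live response, the declared charset could be read with
# response.headers.get_content_charset(); hard-coded bytes are used here.
print(decode_body("百度".encode("gbk"), "gbk"))
```

The fallback keeps the crawler running when a page declares no charset or an unknown one, at the cost of possibly showing replacement characters.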

3. Encoding search keywords in the URL

"""
    https://www.baidu.com/s?wd=图片
    https://www.baidu.com/s?wd=三峡

    Comparing the two URLs above:
        https://www.baidu.com/s?wd=    stays the same; only the value of wd changes
"""
import urllib.request
import urllib.parse
#*********************************************************************************
# Method 1: encode a single value with quote()
url = "https://www.baidu.com/s?wd="
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
# Encode the keyword and append it to the URL
key = input("Enter the search keyword: ")
# quote() percent-encodes the keyword
key = urllib.parse.quote(key)
urls = url + key
# Build the request object and fetch the page
request = urllib.request.Request(urls, headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
print(html)
#*********************************************************************************
# Method 2: encode a dict of parameters with urlencode()
url = "https://www.baidu.com/s?"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
# Encode the parameters and append them to the URL
key = input("Enter the search keyword: ")
# urlencode() turns the dict into "wd=...&pn=..."
key = {'wd': key, 'pn': 2}
key = urllib.parse.urlencode(key)
urls = url + key
print(urls)
# Build the request object and fetch the page
request = urllib.request.Request(urls, headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
print(html)
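
To see what the two encoding methods actually produce, the short sketch below runs them on a fixed keyword instead of user input, so it needs no network access. The variable names are illustrative only.

```python
from urllib.parse import quote, urlencode, parse_qs

keyword = "三峡"

# quote() percent-encodes a single value as UTF-8 bytes
encoded_value = quote(keyword)
print(encoded_value)             # %E4%B8%89%E5%B3%A1

# urlencode() builds the whole "key=value&key=value" query string
query = urlencode({"wd": keyword, "pn": 2})
print(query)                     # wd=%E4%B8%89%E5%B3%A1&pn=2

# Spaces are handled differently: quote() gives %20, urlencode() gives +
print(quote("a b"))              # a%20b
print(urlencode({"wd": "a b"}))  # wd=a+b

# parse_qs() reverses the encoding, which is handy for checking a built URL
decoded = parse_qs(query)
print(decoded)
```

urlencode() is usually the more convenient choice once a URL carries more than one parameter, because it also inserts the = and & separators.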

4. Fetching and saving Baidu Tieba pages

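"""
    Scraping Baidu Tieba pages
        1. The user enters the name of the tieba (forum)
        2. The user chooses the range of pages
        3. Each page is saved to an .html file

    Steps:
        1. Find the URL pattern (build the URL)
            Page 1: http://tieba.baidu.com/f?kw=<tieba name>&pn=0
            Page 2: http://tieba.baidu.com/f?kw=<tieba name>&pn=50
            Page 3: http://tieba.baidu.com/f?kw=<tieba name>&pn=100
            Page n: http://tieba.baidu.com/f?kw=<tieba name>&pn=50*(n-1)

        2. Fetch the response content
        3. Save it locally or to a database

"""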
import os
import urllib.request
import urllib.parse
"""
    Function-based version below
"""
#******************************************************************************
# Send the request and return the decoded response body
def zhixing(urls):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
    request = urllib.request.Request(urls, headers=headers)
    response = urllib.request.urlopen(request)
    html = response.read().decode("utf-8")
    return html

# Save page i to ./test_file/mynote(i).html
def save(i, html):
    try:
        # Create the output directory if it does not exist yet
        os.makedirs("./test_file", exist_ok=True)
        # The with statement closes the file automatically when the block ends
        with open("./test_file/mynote(%d).html" % i, 'w', encoding="utf-8") as f:
            f.write(html)
            print("Page %d saved successfully!" % i)
    except OSError as e:
        print("Failed to save the file: %s" % e)

def main():
    url = "http://tieba.baidu.com/f?"
    # Read the tieba name and the page range
    str1 = input("Enter the name of the tieba to search: ")
    # int() is safer than eval() for reading page numbers
    p1 = int(input("Enter the first page to fetch: "))
    p2 = int(input("Enter the last page to fetch: "))
    # Encode the query parameters from a dict with urlencode()
    for i in range(p1, p2 + 1):
        key = {'kw': str1, 'pn': 50 * (i - 1)}
        key = urllib.parse.urlencode(key)
        urls = url + key
        html = zhixing(urls)
        save(i, html)

if __name__ == "__main__":
    main()
#******************************************************************************
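
The URL pattern pn=50*(n-1) from the notes can be checked on its own before any request is sent. The sketch below separates the URL building from the downloading; the function name build_page_urls and the constants are hypothetical, and the tieba name "python" is only a sample value.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f?"
POSTS_PER_PAGE = 50  # taken from the pn=50*(n-1) pattern in the notes


def build_page_urls(name, first_page, last_page):
    """Return (page number, URL) pairs for pages first_page..last_page
    (hypothetical helper)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    pages = []
    for n in range(first_page, last_page + 1):
        # Page n starts at offset 50 * (n - 1)
        query = urlencode({"kw": name, "pn": POSTS_PER_PAGE * (n - 1)})
        pages.append((n, BASE_URL + query))
    return pages


for n, page_url in build_page_urls("python", 1, 3):
    print(n, page_url)
```

Keeping this step as a pure function makes it easy to test the page arithmetic and to reject an invalid page range before the crawler starts downloading.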
