Python Basics: Web Crawlers

龍龍哥

Last modified 2022-04-28 15:41:45


Category: Python Data Analysis. Tags: crawler, python

First published 2022-04-26 17:37:52

Article link: https://blog.csdn.net/baldicoot_/article/details/124428767


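Web Crawlers

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. With crawling techniques, almost anything you can see on a page in the browser can also be retrieved by a program.

Packages and functions used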

import requests
import re
import bs4

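requests.get == sends an HTTP request to a URL
re.findall == searches for the target data with a regular expression
bs4.BeautifulSoup == parses the HTML source so that the target data is easier to pick apart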
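
As a rough illustration of how re.findall and bs4.BeautifulSoup divide the work, the sketch below runs both on a small inline HTML string instead of a live page. The HTML snippet, the function names, and the branch names in it are made up for illustration; the requests.get step is left out so the example runs offline.

```python
import re

import bs4

# Hypothetical HTML, standing in for response.text from requests.get(url).
SAMPLE_HTML = """
<ul>
  <li><h2>Example Branch A</h2><p class='mapIco'>1 Example Road</p></li>
  <li><h2>Example Branch B</h2><p class='mapIco'>2 Example Street</p></li>
</ul>
"""


def names_with_regex(html):
    """Return the text between each <h2>...</h2> pair using a non-greedy regex."""
    return re.findall(r"<h2>(.*?)</h2>", html)


def names_with_bs4(html):
    """Return the same text by parsing the HTML and collecting the <h2> tags."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    return [tag.text for tag in soup.find_all("h2")]


print(names_with_regex(SAMPLE_HTML))
print(names_with_bs4(SAMPLE_HTML))
```

Both functions return the same two names here. The regex version depends on the exact markup, while the parser version tolerates extra attributes or whitespace inside the tag.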

Case 1: Red Bull's branch offices in China

Website: http://www.redbull.com.cn/about/branch

Method 1:

import re
import requests

url = 'http://www.redbull.com.cn/about/branch'
response = requests.get(url)

# Company names
company = re.findall('<h2>(.*?)</h2>', response.text)
for i in company:
    print(i)

# Company addresses
address = re.findall("<p class='mapIco'>(.*?)</p>", response.text)
for i in address:
    print(i)
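
One caveat with the regex approach: by default `.` does not match a newline, so a pattern like `<p class='mapIco'>(.*?)</p>` silently skips any element whose text spans several lines. Whether that happens on the Red Bull page is not known; the HTML string below is invented to show the effect.

```python
import re

# Hypothetical markup in which the address is split across two lines.
html = "<p class='mapIco'>1 Example Road,\nExample City</p>"

pattern = "<p class='mapIco'>(.*?)</p>"

# Without re.S, '.' stops at the newline and nothing matches.
single_line = re.findall(pattern, html)

# With re.S (DOTALL), '.' also matches '\n', so the whole address is captured.
multi_line = re.findall(pattern, html, flags=re.S)

# Collapse internal whitespace so each address prints on one line.
cleaned = [" ".join(text.split()) for text in multi_line]
print(single_line, cleaned)
```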

Method 2:

import bs4

# Parse the page once; `response` is the object fetched in Method 1
soup = bs4.BeautifulSoup(response.text, features='html.parser')

# Company postal codes
postcodes = [i.text for i in soup.find_all(name='p', attrs={'class': 'mailIco'})]
for i in postcodes:
    print(i)

# Contact phone numbers
phones = [i.text for i in soup.find_all(name='p', attrs={'class': 'telIco'})]
for i in phones:
    print(i)
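
To view the scraped fields side by side, one option is to zip the four lists into rows and write them out as CSV. This is only a sketch: it assumes the lists have the same length and order, which holds only if every branch on the page carries all four fields. The function name, column names, and sample values are hypothetical.

```python
import csv
import io


def to_csv(companies, addresses, postcodes, phones):
    """Combine four parallel lists into CSV text, one row per branch."""
    # zip() silently truncates to the shortest list, so check lengths first.
    lengths = {len(companies), len(addresses), len(postcodes), len(phones)}
    if len(lengths) != 1:
        raise ValueError("field lists have different lengths")

    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["company", "address", "postcode", "phone"])
    writer.writerows(zip(companies, addresses, postcodes, phones))
    return buffer.getvalue()


# Made-up values standing in for the scraped lists.
print(to_csv(["Branch A"], ["1 Example Road"], ["000000"], ["000-0000000"]))
```

The length check matters because a single branch with a missing field would otherwise shift every later row out of alignment without any error.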


When a requests call fails

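Some websites use anti-crawler measures, so the server returns data when we visit with a browser but rejects a request sent directly from Python. A common workaround is to copy the User-Agent value from the browser's request headers and send it with our own request.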
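
A minimal sketch of setting the user agent with requests follows. The User-Agent string is only an example value; copy the real one from the request headers shown in your browser's developer tools. The function name `fetch` and the 10-second timeout are illustrative choices, not part of the original article.

```python
import requests

# Example value only; replace it with the User-Agent your own browser sends.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/100.0.0.0 Safari/537.36"
    )
}


def fetch(url, timeout=10):
    """Send a GET request with browser-like headers and return the page text."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch('http://www.redbull.com.cn/about/branch') would then stand in for the bare requests.get(url) call. Some sites check more than the user agent, so this is not guaranteed to work everywhere.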

Dynamically loaded data

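For some paginated, dynamically loaded data, the URL stays the same while switching pages returns different data. In that case we need to find the file (request) that carries the asynchronously loaded data, then run the regex search on that file's contents.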
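
In practice the asynchronous data usually shows up in the Network panel of the browser's developer tools as a separate request, often with the page number as a query parameter. The sketch below assumes such an endpoint exists; the URL `https://example.com/api/list`, the `page` parameter, and the `"title"` field are all hypothetical and must be replaced with whatever the real site uses.

```python
import re

import requests

# Hypothetical endpoint found in the browser's Network panel.
API_URL = "https://example.com/api/list"


def extract_titles(text):
    """Pull every "title" value out of a JSON-like response body with a regex."""
    return re.findall(r'"title"\s*:\s*"(.*?)"', text)


def fetch_titles(page, timeout=10):
    """Request one page of the hypothetical endpoint and extract its titles."""
    response = requests.get(API_URL, params={"page": page}, timeout=timeout)
    response.raise_for_status()
    return extract_titles(response.text)


# Offline check against a made-up response body.
sample = '{"data": [{"title": "first"}, {"title": "second"}]}'
print(extract_titles(sample))
```

The regex keeps to the approach described above, but it breaks on values containing escaped quotes. When the body is valid JSON, response.json() is usually the more robust choice.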
