Python Web Crawler Project: Study Notes

Latest recommended article published on 2024-07-23 16:17:03

June_Wosen

Category column: Study. Article tags: python, web crawler, programming language

Article link: https://blog.csdn.net/June_Wosen/article/details/122795893

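This article introduces the basic concepts and design of a Python web crawler: sending HTTP requests with urllib, handling exceptions, and reading response headers and status codes. It also covers the BeautifulSoup library, regular expressions, and saving data to Excel and to an SQLite database. Worked snippets show how to imitate a browser when sending a request, parse HTML, extract data, and do simple error handling.

Summary generated automatically by CSDN.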

Python: Preparation

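1. What a crawler is: a program that fetches web content automatically according to a set of rules, targeted at what the user needs. In essence it imitates a browser opening a web page.
2. Tianyancha can be used to check how genuine a company's information is; Baidu Index shows the search volume of a keyword.
3. How a crawler is designed:
'1. The spider starts from the index area and crawls web pages, puts the fetched pages into a temporary store for processing, and repeats this cycle.
'2. Content in the temporary store that does not meet the rules is cleaned out, and content that does is placed in the designated index area. There it is classified, archived, and sorted, and the result is returned to the user.
4. Note on `if __name__ == "__main__":`. With this guard, the code under it (typically a call to main()) runs automatically when the file is executed directly, but does not run when another program imports the file.
5. Libraries to have ready: bs4 (parses web pages and extracts data); re (regular expressions for text matching); urllib.request and urllib.error (build the URL and fetch the page); xlwt (Excel output); sqlite3 (SQLite database access). Only bs4 and xlwt need to be installed; the others ship with Python.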
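
The fetch, parse, save flow and the `if __name__ == "__main__":` guard described above could be laid out roughly as follows. This is only a minimal sketch: the function names `get_data` and `save_data` and the sample page are made up for illustration, and the parsing and saving steps are placeholders.

```python
# Hypothetical skeleton: function names and the sample page are illustrative only.

def get_data(html):
    """Stand-in for the parsing step: return the non-empty lines of the page."""
    return [line.strip() for line in html.splitlines() if line.strip()]


def save_data(rows):
    """Stand-in for the saving step: here the rows are only counted."""
    return len(rows)


def main():
    html = "<html>\n  <title>demo</title>\n</html>"  # pretend this was downloaded
    rows = get_data(html)
    saved = save_data(rows)
    print("saved", saved, "rows")
    return saved


if __name__ == "__main__":
    # Runs only when the file is executed directly,
    # not when another module imports it.
    main()
```

Keeping each stage in its own function makes it possible to import and test the stages one at a time without triggering a full crawl.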

Test code, using the http://httpbin.org test service
# Send a GET request
# import urllib.request
# response=urllib.request.urlopen("http://httpbin.org/get")
# print(response.read().decode("utf-8"))

# Send a POST request. The form data must be encoded as bytes and passed in through the data argument; this is often used to simulate a user login
# import urllib.parse
# data=bytes(urllib.parse.urlencode({"hello":"world"}),encoding="utf-8")
# response=urllib.request.urlopen("http://httpbin.org/post",data=data)
# print(response.read().decode("utf-8"))

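6. Use exception handling to deal with timeouts:

import urllib.request
import urllib.error

try:
    # a deliberately tiny timeout so that the request fails
    response=urllib.request.urlopen("http://httpbin.org/get",timeout=0.01)
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
    print("time out")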
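
A timeout does not always arrive as a URLError. When the connection succeeds but the server is slow to answer, urllib may raise `socket.timeout` directly. The sketch below handles both cases. So that it runs without internet access, it talks to a small local server; `SlowHandler` and `fetch` are hypothetical names used only for this demonstration.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local test server that answers after a delay."""

    def do_GET(self):
        time.sleep(0.5)  # slower than the short client timeout used below
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late reply")
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the console quiet


def fetch(url, timeout):
    """Return the page text, or None if the request timed out or failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except socket.timeout:
        # The connection was made, but the reply took too long.
        print("time out")
    except urllib.error.URLError as e:
        # The connection itself failed; e.reason holds the cause.
        if isinstance(e.reason, socket.timeout):
            print("time out")
        else:
            print("request failed:", e.reason)
    return None


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

slow = fetch(url, timeout=0.1)  # the server needs 0.5 s, so this gives up
fast = fetch(url, timeout=5)    # a generous timeout succeeds
server.shutdown()
server.server_close()
print(slow, fast)
```

Returning None on failure lets the caller decide whether to retry or skip the page.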

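7. The response returned by the GET request above exposes the response headers through response.getheaders(), the status code through response.status, and so on.
8. Modify the request headers to disguise the request, so that it looks as if a browser sent it: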

import urllib.parse
import urllib.request
data=bytes(urllib.parse.urlencode({"hello":"world"}),encoding="utf-8")
url="http://httpbin.org/post"
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43"
}
req=urllib.request.Request(url=url,data=data,headers=headers,method="POST")  # build a request object
response=urllib.request.urlopen(req)                                         # send the request
print(response.read().decode("utf-8"))
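
Before sending anything, the Request object itself can be inspected to confirm that the method, headers, and body are what was intended. The following sketch only builds the request and reads it back, so no network access is needed; the shortened User-Agent string is a made-up value.

```python
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({"hello": "world"}).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0 (demo browser string)"}  # made-up value
req = urllib.request.Request(
    url="http://httpbin.org/post", data=data, headers=headers, method="POST")

# Nothing has been sent yet; these calls only read the Request object.
method = req.get_method()
user_agent = req.get_header("User-agent")  # urllib stores the name as "User-agent"
print(method, req.full_url)
print(user_agent)
print(req.data)
```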

Python: Fetching Data

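# Fetch the content of the page at a given URL
import urllib.request
import urllib.error

def askUrl(url):
    headers = {  # imitate a browser's headers when contacting the Douban server
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43"
    }  # the user agent tells the Douban server what kind of client we are (a browser), and so what kind of content we can accept
    request=urllib.request.Request(url,headers=headers)
    html=""
    try:
        response=urllib.request.urlopen(request)
        html=response.read().decode("utf-8")
        print(html)
    except urllib.error.URLError as e:
        if hasattr(e,"code"):    # HTTP status code, present when the server replied with an error
            print(e.code)
        if hasattr(e,"reason"):  # cause of the failure
            print(e.reason)
    return html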
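
An alternative to the hasattr checks is to catch urllib.error.HTTPError (a subclass of URLError that carries the status code) before the more general URLError. The sketch below shows that variant and also reads response.status on success. It runs against a small local server so that it works offline; `DemoHandler` and `ask_url` are hypothetical names, and the returned tuple is a design choice made for this example.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: "/" returns a page, anything else returns 404."""

    def do_GET(self):
        if self.path == "/":
            body = "<html><title>demo</title></html>".encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the console quiet


def ask_url(url):
    """Return (status, html); html is "" when the request fails."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (demo)"})
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as e:  # the server answered with an error status
        return e.code, ""
    except urllib.error.URLError as e:   # no HTTP answer at all (DNS failure, refused, ...)
        print(e.reason)
        return None, ""


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

ok = ask_url(base + "/")
missing = ask_url(base + "/no-such-page")
server.shutdown()
server.server_close()
print(ok, missing)
```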

9. Learning BeautifulSoup:

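# Setup
import re
from bs4 import BeautifulSoup
with open("./baidu.html","rb") as file:
    html=file.read().decode("utf-8")
bs=BeautifulSoup(html,"html.parser")

# print(bs.title)
# print(bs.a)
# print(bs.head)
# 1. Tag: a tag together with its contents; attribute access returns only the first matching tag

# print(bs.title.string)
# print(bs.a.attrs)
# 2. NavigableString: the text inside a tag (.string); .attrs returns the tag's attributes as a dict

# print(bs)
# 3. BeautifulSoup: represents the whole document

# print(bs.a.string)
# 4. Comment: a special kind of NavigableString; when printed, the comment markers are not included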

#--------------------------------------------------

# Traversing the document
# print(bs.head.contents[1])

# Searching the document
# 1. find_all()
# String filter: finds tags whose name matches the string exactly
# t_list=bs.find_all("a")

# Regular expression filter: tag names are matched with the pattern's search() method
# t_list=bs.find_all(re.compile("a"))

# Function filter: pass in a function and search by the condition it implements (for reference)
# def name_is_exist(tag):
#     return tag.has_attr("name")
# t_list=bs.find_all(name_is_exist)
# for item in t_list:
#     print(item)

# 2. Keyword arguments (kwargs)
# t_list=bs.find_all(id="head")
# t_list=bs.find_all(class_=True)
# t_list=bs.find_all(class_="edge-translate-notifier-center")

# 3. text argument (called string in newer versions of bs4)
# t_list=bs.find_all(text="hao123")
# t_list=bs.find_all(text=["hao123","地图","贴吧"])
# t_list=bs.find_all(text=re.compile(r"\d")) # use a regular expression to find strings inside tags that contain specific text

# 4. limit argument
# t_list =bs.find_all("a",limit=3)

# Extra: CSS selectors
# t_list=bs.select("title")  # by tag name
# t_list=bs.select(".mnav")  # by class name
# t_list=bs.select("#u1")  # by id
# t_list=bs.select("a[class='bri']")  # by attribute
# t_list=bs.select("head > title")  # by child tag
# t_list=bs.select(".mnav~.bri")  # by sibling tag

10. Regular expressions: a pattern that serves as the standard against which strings are tested

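# Regular expressions
import re
# Create a matching pattern
# 1. search
# pat=re.compile("AA")  # "AA" is the regular expression used to test other strings
# print(pat.search("ABVAA"))      # search looks for the first occurrence. Alternative form: m=re.search("regular expression","string to test")
# 2. findall
# print(re.findall("[A-Z]+","ASdDasdfASsDsSLJ")) # findall(regular expression, string to test) returns every match
# 3. sub
# print(re.sub("a","A","adbafbabd"))  # find every "a" in the third argument and replace it with "A"
# Tip: put r in front of the pattern string (a raw string) so that escape characters are not a concern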
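
The three calls above can be tried directly. The sketch below runs them on the same sample strings and adds one extra pattern, `\d+`, to show why raw strings are convenient; the sentence "page 12 of 250" is made-up input.

```python
import re

pat = re.compile(r"AA")
match = pat.search("ABVAA")
print(match.span())  # start and end position of the first match

words = re.findall(r"[A-Z]+", "ASdDasdfASsDsSLJ")  # every run of capital letters
replaced = re.sub(r"a", "A", "adbafbabd")          # replace each "a" with "A"
digits = re.findall(r"\d+", "page 12 of 250")      # raw string keeps \d intact

print(words)
print(replaced)
print(digits)
```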

Python: Parsing Content

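11. Extracting with regular expressions takes two key steps
'1. Define the regular expression rules as global variables, e.g.:

# Rule for the link to a film's detail page
findLink=re.compile(r'<a href="(.*?)">')  # create a regular expression object that describes the rule (the string pattern)

'2. Apply each rule during parsing to extract the matching content, e.g.:

# Link to the film's detail page
link=re.findall(findLink,item)[0]  # use the re library to find the strings that match the pattern
data.append(link)  # add the link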
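
The two steps can be tried on a small piece of sample HTML. In the sketch below the HTML in `item`, the second rule `find_title`, and the `example.com` link are all made up for illustration. It also guards against an empty result, because indexing `[0]` on an empty list raises IndexError.

```python
import re

# Step 1: rules defined once, at module level
find_link = re.compile(r'<a href="(.*?)">')
find_title = re.compile(r'<span class="title">(.*?)</span>')  # hypothetical second rule

# Hypothetical HTML for one list entry
item = '''
<div class="item">
  <a href="https://example.com/subject/1/">
    <span class="title">Example Movie</span>
  </a>
</div>
'''

# Step 2: apply each rule to the entry and collect the results
data = []
links = re.findall(find_link, item)
data.append(links[0] if links else "")    # "" when the rule finds nothing
titles = find_title.findall(item)
data.append(titles[0] if titles else "")
print(data)
```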

Python: Saving Data

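12. Saving to Excel, mainly with the xlwt library

import xlwt

workbook=xlwt.Workbook(encoding="utf-8")  # create the workbook object
worksheet=workbook.add_sheet('sheet1')  # create a worksheet
worksheet.write(0,0,'hello')  # write data: the first argument is the row, the second is the column, the third is the content
workbook.save("stu.xls")  # save the workbook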

13. Using SQLite: common statements and how to write them

import sqlite3
# 1. Create the database
# conn=sqlite3.connect("test.db")  # open the database file, creating it if it does not exist
# print("Opened database successfully")

# 2. Create a table
# conn=sqlite3.connect("test.db")  # open the database file, creating it if it does not exist
# print("Opened database successfully")
# c=conn.cursor()  # get a cursor
# sql='''
#     create table company
#         (id int primary key not null,
#         name text not null,
#         age int not null,
#         address char(50),
#         salary real);
#
# '''
#
# c.execute(sql)   # execute the SQL statement
# conn.commit()    # commit the change
# conn.close()     # close the database connection
#
# print("Table created successfully")

# 3. Insert data
# conn=sqlite3.connect("test.db")  # open the database file, creating it if it does not exist
# print("Opened database successfully")
# c=conn.cursor()  # get a cursor
# sql1='''
#     insert into company(id,name,age,address,salary)
#     values (1,'张三',32,'河南',8888);
# '''
# sql2='''
#     insert into company(id,name,age,address,salary)
#     values (2,'李四',22,'河北',8848);
# '''
# c.execute(sql1)
# c.execute(sql2)   # execute the SQL statements
# conn.commit()    # commit the change
# conn.close()     # close the database connection

# 4. Query data
# conn=sqlite3.connect("test.db")  # open the database file, creating it if it does not exist
# print("Opened database successfully")
# c=conn.cursor()  # get a cursor
# sql='''
#     select * from company
# '''
# 
# cursor=c.execute(sql)   # execute the SQL statement
# 
# for row in cursor:
#     print("id=",row[0])
#     print("name=", row[1])
#     print("address=", row[3])
#     print("salary=", row[4],"\n")
# 
# conn.commit()    # commit (not required for a read-only query)
# conn.close()     # close the database connection
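
When the values come from scraped pages rather than from fixed strings, it is generally safer to pass them as parameters with `?` placeholders instead of writing them into the SQL text. The sketch below repeats the create, insert, and query steps that way. It uses an in-memory database (`":memory:"`) so that no file is left behind, which is a choice made for this example and not part of the original notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database that lives only in RAM
c = conn.cursor()
c.execute("""
    create table company
        (id int primary key not null,
         name text not null,
         age int not null,
         address char(50),
         salary real)
""")

rows = [(1, "张三", 32, "河南", 8888), (2, "李四", 22, "河北", 8848)]
# Each ? is filled in by sqlite3, which takes care of quoting.
c.executemany(
    "insert into company (id, name, age, address, salary) values (?, ?, ?, ?, ?)",
    rows)
conn.commit()

result = c.execute(
    "select id, name, salary from company where age > ? order by id",
    (25,)).fetchall()
total = c.execute("select count(*) from company").fetchone()[0]
conn.close()

print(result)
print(total)
```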