1. The pathlib module
(1) The Path class
.exists() checks whether a path exists
.mkdir() creates a new directory
.glob() finds the directories and files under the current path that match a pattern
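A minimal sketch of these three Path methods; the directory and file names here are made up for illustration, and a temporary directory keeps the example self-contained:

```python
from pathlib import Path
import tempfile

# work inside a fresh temporary directory
base = Path(tempfile.mkdtemp())

new_dir = base / "reports"
print(new_dir.exists())   # False: nothing created yet
new_dir.mkdir()           # create the directory
print(new_dir.exists())   # True

# create a couple of files, then find them with glob
(new_dir / "a.txt").write_text("hello")
(new_dir / "b.csv").write_text("1,2,3")
txt_files = list(new_dir.glob("*.txt"))  # only .txt files match the pattern
print([p.name for p in txt_files])       # ['a.txt']
```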
2. Packages from pypi.org (example: openpyxl)
pypi.org hosts many packages you can use; openpyxl is the example below.
Install: type a command in the terminal window, e.g. pip3 install openpyxl
Adding a data file to the project: HelloWorld → right-click → Show in Explorer → paste the needed file into HelloWorld
Processing a single Excel file:
import openpyxl as xl
from openpyxl.chart import BarChart, Reference  # BarChart and Reference are classes

wb = xl.load_workbook("transaction.xlsx")  # load the .xlsx file; note it must sit in the project folder, e.g. HelloWorld
sheet = wb["Sheet1"]  # select the worksheet by name with square brackets
print(f"{sheet.max_row} rows in total")  # show how many rows the sheet has

for row in range(2, sheet.max_row + 1):  # range excludes the stop value, so add 1; row runs over [2, sheet.max_row]
    cell = sheet.cell(row, 3)  # the cell in row `row`, column 3
    corrected_price = cell.value * 0.9
    corrected_price_cell = sheet.cell(row, 4)  # fill a new column, here column 4
    corrected_price_cell.value = corrected_price  # assign the value to the new column

values = Reference(sheet, min_row=2, max_row=sheet.max_row, min_col=4, max_col=4)  # values covers all the numbers in column 4
chart = BarChart()  # create a bar chart
chart.add_data(values)  # the chart's data comes from values
sheet.add_chart(chart, "e2")  # add the chart to the sheet, anchored at cell E2
wb.save("transaction2.xlsx")  # save as a new file
Processing multiple files: wrap the code above in a function and use a filename parameter in place of the hard-coded "transaction.xlsx".
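One way to sketch that refactor, combining it with pathlib's glob from section 1. The function name `process_workbook` and the `transaction*.xlsx` pattern are my own choices, and here each workbook is saved back over itself rather than under a new name:

```python
from pathlib import Path

import openpyxl as xl
from openpyxl.chart import BarChart, Reference

def process_workbook(filename):
    """Apply the 10% discount and add a bar chart, as in the single-file version."""
    wb = xl.load_workbook(filename)
    sheet = wb["Sheet1"]
    for row in range(2, sheet.max_row + 1):
        cell = sheet.cell(row, 3)
        corrected_price_cell = sheet.cell(row, 4)
        corrected_price_cell.value = cell.value * 0.9
    values = Reference(sheet, min_row=2, max_row=sheet.max_row, min_col=4, max_col=4)
    chart = BarChart()
    chart.add_data(values)
    sheet.add_chart(chart, "e2")
    wb.save(filename)  # overwrite in place

# run it over every matching workbook in the current folder
for path in Path(".").glob("transaction*.xlsx"):
    process_workbook(path)
```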
3. Web scraping
Done with a package called selenium.
4. Libraries:
NumPy: multidimensional arrays
Pandas: data-analysis library, built around DataFrames
Matplotlib: 2-D plotting library, used to create graphs and plots
Scikit-Learn: provides algorithms such as decision trees and neural networks
5. The requests library
(1) Using get():
r = requests.get(url)
# get() builds a Request object that asks the server for a resource
# r is the Response object, holding everything the server sent back
r.encoding: if the header contains no charset, the encoding defaults to ISO-8859-1, which cannot decode Chinese
r.apparent_encoding: guesses the encoding by analysing the content;
when r.encoding does not decode correctly,
set r.encoding = r.apparent_encoding
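A small offline sketch of that gotcha. It builds a Response by hand via `requests.models.Response`, which is an implementation detail rather than public API, so treat it only as a demonstration:

```python
import requests

resp = requests.models.Response()
# simulate a response body containing Chinese text, sent as UTF-8
resp._content = "人生苦短，我用Python。这是一段用来测试编码的中文。".encode("utf-8")

resp.encoding = "ISO-8859-1"  # what requests assumes when the header has no charset
print(resp.text)              # mojibake: each UTF-8 byte decoded as a Latin-1 character

resp.encoding = resp.apparent_encoding  # guess the encoding from the bytes themselves
print(resp.text)              # with a correct guess, the Chinese text comes back intact
```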
requests.get(url, params=None, **kwargs)
# url: the URL of the page to fetch
# params: extra query parameters appended to the url
# **kwargs: 12 optional parameters that control the request
import requests
r = requests.get("https://www.baidu.com/?tn=18029102_3_dg")
r.status_code  # check the status code of the request
200  # 200 means the request succeeded; anything else means it failed
r.encoding = 'utf-8'
r.text
r.headers  # return the response headers
Fetching a page's source code:
import requests
r = requests.get("https://python123.io/ws/demo.html")
r.text
(2) Exceptions in the Requests library
r.raise_for_status()  # raises requests.HTTPError if the status code indicates failure
(3) A generic framework for fetching a web page
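The notes name the framework without spelling it out; a common shape for it, combining the pieces above (the function name `get_html_text` and the fallback string are my own), is:

```python
import requests

def get_html_text(url):
    """Fetch a page, handling common failures; returns the page text or an error marker."""
    try:
        r = requests.get(url, timeout=30)  # don't hang forever on a slow server
        r.raise_for_status()               # raise requests.HTTPError on 4xx/5xx status codes
        r.encoding = r.apparent_encoding   # fix the encoding guess (see section 5)
        return r.text
    except requests.RequestException:      # covers HTTPError, timeouts, connection errors
        return "fetch failed"
```

Usage: `print(get_html_text("https://python123.io/ws/demo.html"))` prints either the page source or the fallback string, never an unhandled traceback.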
6. The HTTP protocol
PATCH only needs to submit the changed part
PUT needs the full, updated resource
7. The post() method of the Requests library
and its put() method
8. requests.request()
(1) params — GET
Appends key-value pairs to the url, so the request carries extra query parameters
import requests
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('GET', 'https://search.bilibili.com/all?keyword=python%20mosh', params=kv)
print(r.url)
https://search.bilibili.com/all?keyword=python%20mosh&key1=value1&key2=value2&rt=V%2FymTlOu4ow%2Fy4xxNWPUZ9n3u3FeQUhkmTW3nVhxVWw%3D
(2) data — POST
Submits a resource to the server
body = 'aaaa'
s = requests.request('POST', 'https://editor.csdn.net/md?articleId=105171311', data=body)
The string is stored at the location the link points to
(3) headers
Custom request headers
hd = {'user-agent': 'Chrome/10'}
# pretend the request comes from a Chrome/10 browser
r = requests.request('POST', 'https://space.bilibili.com/271922391/favlist?fid=285042691&ftype=create', headers=hd)
9. HTML5 is information wrapped in a series of <>-delimited tags
10. BeautifulSoup: a library for parsing, traversing, and maintaining a tag tree
Converting a tag tree into a BeautifulSoup object:
from bs4 import BeautifulSoup  # BeautifulSoup is a class
soup = BeautifulSoup('<p>data</p>', 'html.parser')
soup = BeautifulSoup(open("D://demo.html"), 'html.parser')  # or parse from an opened file
(1) Checking that bs4 was installed successfully
import requests
r = requests.get("https://python123.io/ws/demo.html")
r.text  # the HTML source of the demo page, which introduces several Python courses
demo = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")  # brew the page into a pot of soup 😄
# parse demo as HTML
print(soup.prettify())  # prints the indented tag tree; if the page text appears, parsing succeeded
(2) Basic elements of the bs4 library
Tag
<p class="title"> .... </p> forms a tag pair  # p is the tag's name
(3) The 4 parsers: 'html.parser', 'lxml', 'xml', and 'html5lib'
(4) Accessing tags, attributes, and strings
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
>>> soup.title  # the page's title
<title>This is a python demo page</title>
>>> tag = soup.a  # .a is the (first) link tag
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.tag  # nothing echoed: there is no <tag> tag in the page, so the attribute is None
>>> print(soup.tag)
None
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name  # the name of the a tag
'a'
>>> soup.append.parent.name  # typo: soup.append is a method, not a tag
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    soup.append.parent.name
AttributeError: 'function' object has no attribute 'parent'
>>> soup.a.parent.name  # the name of a's parent
'p'
>>> soup.a.parent.parent.name  # the name of a's grandparent
'body'
>>> tag = soup.a
>>> tag.attrs  # the tag's attributes, as a dict
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']  # the value of the class attribute
['py1']
>>> tag.attrs['href']  # the link itself
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)  # the type of the attribute dict
<class 'dict'>
>>> type(tag)  # the type of the tag
<class 'bs4.element.Tag'>
>>> soup.a.string  # the string between the tag's opening and closing markers
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string  # no <b> in the result: NavigableString can cross tag levels
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string  # the comment's content comes out as an ordinary string
'This is a comment'
>>> print(newsoup.b.string)
This is a comment
>>> type(newsoup.b.string)  # the type of a comment
<class 'bs4.element.Comment'>
>>> type(newsoup.p.string)  # the type of p's string
<class 'bs4.element.NavigableString'>
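Since Comment is a subclass of NavigableString, printing alone cannot tell them apart; a small sketch of filtering comments out with an isinstance check (the tag list here matches the example above):

```python
from bs4 import BeautifulSoup, Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)

for tag in newsoup.find_all(["b", "p"]):
    s = tag.string
    if isinstance(s, Comment):           # comments masquerade as strings
        print(f"<{tag.name}> holds a comment, skipping it")
    else:
        print(f"<{tag.name}> holds real text: {s}")
```

This prints that `<b>` holds a comment and `<p>` holds real text, which is exactly the distinction the type() calls above reveal.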