Using the pandas module

weixin_50170972

Last modified: 2023-10-07 18:02:25

Category: Excel spreadsheet operations. Tags: windows, excel, pandas, python

First published: 2023-08-29 17:00:02

Article link: https://blog.csdn.net/weixin_50170972/article/details/132564527

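1. read_excel

1. Common parameters

Parameter	Meaning
io	Path to the Excel file
sheet_name	Default: 0; type: int or str; the sheet to read, by position or by name
header	Default: 0; type: int; which row (0-based) to use as the column header
dtype	Default: None; type: a single type or a dict; data type for the whole table or for specific columns
converters	Default: None; type: dict; functions applied to column values while reading
parse_dates	Default: False; type: bool or list; columns to parse from strings into dates
date_parser	Default: None; type: function; custom date-parsing function (deprecated since pandas 2.0 in favor of date_format)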

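2. Using dtype

Passing a single type applies it to every column; passing a dict sets the type column by column.

df = pd.read_excel(path, dtype=str)
or
df = pd.read_excel(path, dtype={"id": int, "日期": str, "目的地": str})
df = df.fillna("")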
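A minimal sketch of what dtype=str changes. To avoid depending on an Excel file, it reads a made-up in-memory CSV with read_csv, which accepts dtype in the same way; the sample rows are assumptions, not the author's data.

```python
import io

import pandas as pd

# Hypothetical sample data; the leading zeros in "id" are the point of interest.
csv_text = "id,日期,目的地\n001,2022/4/25,亚洲\n002,2022/4/26,\n"

# Default type inference: "id" is read as integers, so the zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every cell as written; fillna("") turns missing cells into "".
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str).fillna("")

print(inferred["id"].tolist())
print(as_text["id"].tolist())
```

Reading everything as text is a reasonable default when codes or IDs must round-trip unchanged.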

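3. How converters works

Meaning: a dict of functions used to convert the values in certain columns. Keys can be integers or column labels; each value is a function that takes one argument, the Excel cell content, and returns the converted content.

df = pd.read_excel(path, converters={'类别编码': str})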
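As an illustration of a converter that does more than cast, the sketch below pads a code column to four characters. It uses read_csv on an in-memory string so it runs without a file; pad_code and the four-character rule are hypothetical.

```python
import io

import pandas as pd

csv_text = "类别编码,名称\n7,A\n42,B\n"


def pad_code(cell):
    # Hypothetical rule: category codes are four characters wide.
    return str(cell).strip().zfill(4)


# The converter is called once per cell of the named column.
df = pd.read_csv(io.StringIO(csv_text), converters={"类别编码": pad_code})
print(df)
```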

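4. date_parser and parse_dates

First name the date columns in parse_dates, then use a lambda in date_parser to convert them to datetime. Note that date_parser is deprecated since pandas 2.0, where date_format='%Y-%m-%d' serves the same purpose.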

import pandas as pd


def read(path):
    # date_parser is deprecated since pandas 2.0; date_format='%Y-%m-%d' replaces it there
    df = pd.read_excel(path, parse_dates=['日期'], date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'))
    df = df.fillna("")
    df_dict = df.to_dict("records")
    for content in df_dict:
        print(content)


path = r"D:\百度网盘\百度网盘文件\pandas培训\file\read_excel.xlsx"
read(path)
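An alternative that does not rely on date_parser is to read the column as text and convert it afterwards with pd.to_datetime. This is a different technique from the one above, sketched here on made-up in-memory data.

```python
import io

import pandas as pd

csv_text = "id,日期\n1,2022/4/25\n2,2022/4/26\n"

# Read the date column as plain text first.
df = pd.read_csv(io.StringIO(csv_text), dtype={"日期": str})

# Then convert it with an explicit format string.
df["日期"] = pd.to_datetime(df["日期"], format="%Y/%m/%d")
print(df.dtypes)
```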

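2. read_csv

Parameter	Meaning
CSV path	Path of the CSV file to read
open object	An already-opened file object
sep	Delimiter used when reading the file

1. filepath or buffer

Open the CSV file and read its contents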

import pandas as pd


def read(path):
    # read_csv accepts an open file object as well as a path
    with open(path) as file_csv:
        df = pd.read_csv(file_csv, dtype=str)
    df = df.fillna("")
    print(df)
    df_dict = df.to_dict("records")
    for idx, value in enumerate(df_dict):
        print(idx)
        print(value)


path = r"D:\百度网盘\百度网盘文件\pandas培训\file\demo.csv"
read(path)

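2. Using sep

sep sets the delimiter. The default is a comma, so a comma-separated file is split into columns without passing sep.

df = pd.read_csv(csv_path, dtype=str)
print(df)
Output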
  id         日期 目的州别
0  1  2022/4/25   亚洲
1  2  2022/4/26   亚洲
2  1  2022/4/25   亚洲
3  2  2022/4/26   亚洲
 
 
# A tab delimiter does not match this file, so each line is read as a single column
df = pd.read_csv(csv_path, dtype=str, sep='\t')
print(df)
Output
       id,日期,目的州别
0  1,2022/4/25,亚洲
1  2,2022/4/26,亚洲
2  1,2022/4/25,亚洲
3  2,2022/4/26,亚洲
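The two outputs above can be reproduced without a file. The sketch below feeds the same kind of comma-separated text to read_csv twice and compares the resulting shapes; the sample text is an assumption modeled on the output shown.

```python
import io

import pandas as pd

csv_text = "id,日期,目的州别\n1,2022/4/25,亚洲\n2,2022/4/26,亚洲\n"

# Default delimiter (comma): three separate columns.
by_comma = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Wrong delimiter (tab): every line collapses into one column.
by_tab = pd.read_csv(io.StringIO(csv_text), dtype=str, sep="\t")

print(by_comma.shape)
print(by_tab.shape)
```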

3. read_html

Parameter	Meaning
io	URL of a web page or path to a local HTML file
flavor	Parsing engine to use

1. Getting the tables from a web page

import pandas as pd

def get_incode(path):
    # Parse the local HTML file; read_html always returns a list of DataFrames
    df = pd.read_html(path, encoding='utf-8')
    print(type(df))
    pages = []
    for i in range(30):
        url = "http://47.114.52.15/log/?page={}".format(i+1)
        pages.append(pd.read_html(url)[0])  # take the first table on each page
    df2 = pd.concat(pages)
    # The encoding argument of to_excel was removed in pandas 2.0, so it is not passed here
    df2.to_excel(r"C:\Users\zzy\Desktop\记录.xlsx", index=False)

path = r"C:\Users\zzy\Desktop\1.html"
get_incode(path)
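The page-by-page accumulation can be sketched offline: collect one DataFrame per page in a list and concatenate once at the end. fetch_page is a hypothetical stand-in for pd.read_html(url)[0], and its rows are invented.

```python
import pandas as pd


def fetch_page(page):
    # Hypothetical stand-in for pd.read_html(url)[0]: two rows per page.
    first = (page - 1) * 2 + 1
    return pd.DataFrame({"ID": [first, first + 1], "访问平台": ["Windows 10"] * 2})


# One DataFrame per page, concatenated in a single call.
pages = [fetch_page(p) for p in range(1, 4)]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1 per page.
combined = pd.concat(pages, ignore_index=True)
print(combined)
```

Concatenating once avoids copying the growing DataFrame on every iteration.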

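4. to_excel

Parameter	Meaning
excel_writer	File path
sheet_name	Name of the sheet to write; default 'Sheet1'
columns	Columns to write
header	Whether to write the column names; a list of strings replaces them with aliases
index	Whether to write the row index

1. The columns parameter

Writes only the specified columns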

import pandas as pd

def get_incode(path):
    df = pd.read_html(path, encoding='utf-8')
    print(type(df))
    pages = []
    for i in range(30):
        url = "http://47.114.52.15/log/?page={}".format(i+1)
        pages.append(pd.read_html(url)[0])
    df2 = pd.concat(pages)
    # columns lists the columns to write: if the DataFrame has five columns and columns names four, the Excel file has four
    df2.to_excel(r"C:\Users\zzy\Desktop\记录.xlsx", index=False, sheet_name='记录', columns=['ID','IP地址','访问地址','访问时刻','访问设备','访问平台'])

path = r"C:\Users\zzy\Desktop\1.html"
get_incode(path)

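2. The index parameter

Whether to write the row index: False (or 0) for no, True (or 1) for yes

5. to_csv

Apart from sep, the parameters path, columns, header and index work the same way as in to_excel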
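A short sketch of these shared parameters on to_csv. Called without a path, to_csv returns the text instead of writing a file, which makes the effect easy to inspect; the sample rows are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2],
        "IP地址": ["101.80.32.157", "101.80.32.157"],
        "访问平台": ["Windows 10", "Windows 7"],
    }
)

# Keep two of the three columns, drop the row index, use ';' as the delimiter.
text = df.to_csv(columns=["ID", "访问平台"], index=False, sep=";")
lines = text.splitlines()
print(lines)
```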

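6. to_html

Parameter	Meaning
buf	Path of the HTML file to write
columns	Columns to write
col_space	Minimum width of each column
bold_rows	Make the row labels bold in the output
classes	CSS class(es) to apply to the table
border	Value of the table's border attribute
header	Whether to write the column labels
index	Whether to write the row index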

import pandas as pd

def get_incode(path):
    df = pd.read_html(path, encoding='utf-8')
    print(type(df))
    pages = []
    for i in range(10):
        url = "http://47.114.52.15/log/?page={}".format(i+1)
        pages.append(pd.read_html(url)[0])
    df2 = pd.concat(pages)
    # columns lists the columns to write: if the DataFrame has five columns and columns names four, the HTML table has four
    df2.to_html(r"C:\Users\zzy\Desktop\记录.html", col_space=3, bold_rows=True, border=1, index=False, columns=['ID','IP地址','访问地址','访问时刻','访问设备','访问平台'])

path = r"C:\Users\zzy\Desktop\1.html"
get_incode(path)

The generated HTML looks like this

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th style="min-width: 3px;">ID</th>
      <th style="min-width: 3px;">IP地址</th>
      <th style="min-width: 3px;">访问地址</th>
      <th style="min-width: 3px;">访问时刻</th>
      <th style="min-width: 3px;">访问设备</th>
      <th style="min-width: 3px;">访问平台</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>101.80.32.157</td>
      <td>/</td>
      <td>2021年5月24日 22:20</td>
      <td>谷歌浏览器</td>
      <td>Windows 10</td>
    </tr>
    <tr>
      <td>2</td>
      <td>101.80.32.157</td>
      <td>/get_valid_code/</td>
      <td>2021年5月24日 22:20</td>
      <td>谷歌浏览器</td>
      <td>Windows 10</td>
    </tr>
    <tr>
      <td>3</td>
      <td>101.80.32.157</td>
      <td>/login/</td>
      <td>2021年5月24日 22:20</td>
      <td>谷歌浏览器</td>
      <td>Windows 10</td>
    </tr>
    <tr>
      <td>4</td>
      <td>101.80.32.157</td>
      <td>/index/1.html</td>
      <td>2021年5月24日 22:20</td>
      <td>谷歌浏览器</td>
      <td>Windows 10</td>
    </tr>
    <tr>
      <td>5</td>
      <td>101.80.32.157</td>
      <td>/media/avatar/default.png</td>
      <td>2021年5月24日 22:20</td>
      <td>谷歌浏览器</td>
      <td>Windows 10</td>
    </tr>
    <tr>
      <td>6</td>
      <td>101.80.32.157</td>
      <td>/media/bg_img/default_bg.png</td>
      <td>2021年5月24日 22:20</td>
      <td>谷歌浏览器</td>
      <td>Windows 10</td>
    </tr>
  </tbody>
</table>
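The same kind of markup can be produced from a small in-memory DataFrame. Called without a path, to_html returns the HTML as a string; the two sample rows below are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "访问地址": ["/", "/login/"]})

# index=False omits the row-index column; border=1 sets the table's border attribute.
html = df.to_html(index=False, border=1)
print(html)
```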

7. Filtering

1. Finding null values

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df_content = df[df['访问平台'].isnull()]
    print(df_content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)

2. Finding non-null values

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df_content = df[df['访问平台'].notnull()]
    print(df_content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)
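The two filters are complements of each other, which a small in-memory example makes visible. The rows are invented; only the column name comes from the examples above.

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": ["1", "2", "3"], "访问平台": ["Windows 10", None, "Windows 7"]}
)

# Rows where the platform is missing, and rows where it is present.
missing = df[df["访问平台"].isnull()]
present = df[df["访问平台"].notnull()]

print(missing)
print(present)
```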

3. Finding specific values in a column (exact match)

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df_content = df[df['访问平台'].isin(['Windows 7'])]
    print(df_content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)

Several values can be matched at once

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df_content = df[df['访问平台'].isin(['Windows 7','Windows 10'])]
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)
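A minimal sketch of isin on made-up data. It shows that the match is on whole cell values and that missing cells never match.

```python
import pandas as pd

df = pd.DataFrame({"访问平台": ["Windows 7", "Windows 10", "Linux", None]})

# True only where the cell equals one of the listed values exactly.
mask = df["访问平台"].isin(["Windows 7", "Windows 10"])
matched = df[mask]
print(matched)
```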

4. Filtering on column values

1. Rows where a column equals a value (exact match)

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df_content = df[df['访问平台']=='Windows 10']
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)

2. Rows where a column starts with a given prefix (partial match)

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df = df[df['访问平台'].notnull()]
    df_content = df[df['访问平台'].str.startswith('Windo')]
    print(df_content)
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)

3. Filtering on string length

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df = df[df['访问平台'].notnull()]
    df_content = df[df['访问平台'].str.len()<=8]
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)

4. Rows where a column contains a substring (fuzzy match)

def read_code(path):
    df = pd.read_excel(path,dtype=str)
    df = df[df['访问平台'].notnull()]
    df_content = df[df['访问平台'].str.contains('indows')]
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)
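The string filters above drop null rows first because the .str methods return a missing value for them. As an alternative sketch, startswith and contains accept na=False, which treats missing cells as non-matching; the sample values are assumptions.

```python
import pandas as pd

s = pd.Series(["Windows 10", "Windows 7", "Linux", None], name="访问平台")

# na=False makes missing cells count as "no match" instead of propagating.
starts = s.str.startswith("Windo", na=False)
contains = s.str.contains("indows", na=False)

# str.len() gives NaN for missing cells, and NaN <= 8 evaluates to False.
short = s.str.len() <= 8

print(s[starts].tolist())
print(s[contains].tolist())
print(s[short].tolist())
```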

5. Filtering on numeric values

def read_code(path):
    # Convert the ID column to int while reading
    df = pd.read_excel(path,dtype=str,converters={'ID':int})
    df = df[df['ID'].notnull()]
    df_content = df[df['ID']<=8]
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)
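If the column was already read as text, it can also be converted afterwards with pd.to_numeric. This is a different route from the converters approach above, sketched on invented values that include one unparseable cell.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["3", "12", "x", None]})

# errors="coerce" turns anything that is not a number into NaN.
df["ID"] = pd.to_numeric(df["ID"], errors="coerce")

# NaN never satisfies the comparison, so bad rows fall out of the result.
small = df[df["ID"] <= 8]
print(small)
```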

6. Inverting a filter (~ or -)

def read_code(path):
    # Convert the ID column to int while reading
    df = pd.read_excel(path,dtype=str,converters={'ID':int})
    df = df[df['ID'].notnull()]
    df_content = df[~(df['访问平台'].str.len()>8)]
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)
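A small sketch of inversion with ~ on made-up data: the mask selects long platform names, and ~mask selects everything else.

```python
import pandas as pd

df = pd.DataFrame({"访问平台": ["Windows 10", "Linux", "macOS"]})

# Mask for names longer than 8 characters.
long_name = df["访问平台"].str.len() > 8

# ~ flips every True/False in the mask.
kept = df[~long_name]
print(kept)
```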

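5. Multiple filter conditions

Combine several conditions with the logical operators & and |, wrapping each condition in parentheses

def read_code(path):
    # Convert the ID column to int while reading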
    df = pd.read_excel(path,dtype=str,converters={'ID':int})
    df = df[df['ID'].notnull()]
    df_content = df[(df['访问地址'].str.contains('avatar')) & (df['访问设备']=='IE浏览器')]
    df_dict = df_content.to_dict('records')
    for id,content in enumerate(df_dict):
        print(content)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
read_code(path)
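The difference between & and | is easiest to see side by side. The sketch below names each condition first and then combines them; the three rows are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "访问地址": ["/media/avatar/default.png", "/login/", "/media/avatar/a.png"],
        "访问设备": ["IE浏览器", "IE浏览器", "谷歌浏览器"],
    }
)

is_avatar = df["访问地址"].str.contains("avatar")
is_ie = df["访问设备"] == "IE浏览器"

both = df[is_avatar & is_ie]    # rows satisfying both conditions
either = df[is_avatar | is_ie]  # rows satisfying at least one
print(both)
print(either)
```

Naming the masks also sidesteps the operator-precedence problem that makes the parentheses necessary in the one-line form.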

6. Custom filter functions with apply

1. apply on a single column

def has_avatar(d):
    # True when the cell text contains 'avatar'; named so it does not shadow the built-in filter
    return 'avatar' in str(d)

def filter_order(path):
    df = pd.read_excel(path, dtype=str, converters={'ID': int})
    df = df[df['ID'].notnull()]
    df_content = df[df['访问地址'].apply(has_avatar)]
    df_dict = df_content.to_dict('records')
    for id,value in enumerate(df_dict):
        print(value)
 
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
filter_order(path)

2. apply across several columns

def is_match(d):
    # d is one row; both conditions must hold
    return 'avatar' in str(d['访问地址']) and d['访问设备'] == '谷歌浏览器'

def filter_order(path):
    df = pd.read_excel(path, dtype=str, converters={'ID': int})
    df = df[df['ID'].notnull()]
    df_content = df[df.apply(is_match, axis=1)]
    df_dict = df_content.to_dict('records')
    for id,value in enumerate(df_dict):
        print(value)
 
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
filter_order(path)
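For comparison, here is a sketch showing that a row-wise apply and the equivalent vectorized mask select the same rows. The data and the function name match are made up; apply remains the more flexible choice when the rule cannot be expressed with column operations.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "访问地址": ["/media/avatar/default.png", "/login/", "/media/avatar/default.png"],
        "访问设备": ["谷歌浏览器", "谷歌浏览器", "IE浏览器"],
    }
)


def match(row):
    # Called once per row because of axis=1.
    return "avatar" in str(row["访问地址"]) and row["访问设备"] == "谷歌浏览器"


by_apply = df[df.apply(match, axis=1)]

# The same rule written with column-wise operations.
vectorized = df[
    df["访问地址"].str.contains("avatar", na=False) & (df["访问设备"] == "谷歌浏览器")
]
print(by_apply)
```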

7. Common operations

1. Iterating over a DataFrame

The values.tolist() approach

def get_data(path):
    df = pd.read_excel(path,dtype=str)
    df = df.fillna("")
    df_list = df.values.tolist()
    print(df_list)
    for value in df_list:
        print(value)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
get_data(path)

The to_dict approach

def get_data(path):
    df = pd.read_excel(path,dtype=str)
    df = df.fillna("")
    df_dict = df.to_dict("records")
    for value in df_dict:
        print(value)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
get_data(path)
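The sketch below puts the two approaches next to each other on a tiny invented DataFrame and adds itertuples as a third option, which is not used in the original code.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["1", "2"], "访问平台": ["Windows 10", None]}).fillna("")

# Each row as a plain list, in column order.
rows_as_lists = df.values.tolist()

# Each row as a dict keyed by column name.
rows_as_dicts = df.to_dict("records")

# Each row as a plain tuple (name=None skips the namedtuple wrapper).
rows_as_tuples = list(df.itertuples(index=False, name=None))

for row in rows_as_dicts:
    print(row)
```

Dicts are convenient when columns are accessed by name; lists and tuples are lighter when position is enough.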

2. Filling missing values with fillna

1. Filling every missing cell with one value

def get_data(path):
    df = pd.read_excel(path,dtype=str)
    df = df.fillna("AB")
    df_dict = df.to_dict("records")
    for value in df_dict:
        print(value)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
get_data(path)

2. Filling missing cells with a different value per column

def get_data(path):
    df = pd.read_excel(path,dtype=str)
    df = df.fillna({'IP地址':'A','访问地址':'B','访问平台':'AB'})
    df_dict = df.to_dict("records")
    for value in df_dict:
        print(value)
 
path = r"C:\Users\zzy\Desktop\记录.xlsx"
get_data(path)
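Both forms of fillna can be compared on a small in-memory DataFrame. The rows are made up; note that fillna returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {"IP地址": ["101.80.32.157", None], "访问平台": [None, "Windows 10"]}
)

# One value for every missing cell.
same = df.fillna("AB")

# A different value per column, given as a dict.
per_col = df.fillna({"IP地址": "A", "访问平台": "AB"})

print(same)
print(per_col)
```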