Python pandas development notes

Latest recommended article published on 2023-10-31 16:23:55

weixin_39540834

Article link: https://blog.csdn.net/weixin_39540834/article/details/111861901

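1. Reading and writing Excel files in Python (xlrd, xlwt, openpyxl)

Python has many libraries for reading and writing Excel files: pandas, xlrd, xlwt, openpyxl, xlwings, and others.

Main modules:

xlrd: reads data from Excel files. It supports xls; xlsx is supported only in versions before 2.0.

xlwt: writes xls files; it does not support the xlsx format.

xlutils: builds on xlwt and xlrd to modify an existing file.

openpyxl: mainly reads and edits Excel files in xlsx format.

xlwings: reads, writes, and reformats xlsx, xls, and xlsm files.

xlsxwriter: generates Excel workbooks (inserting data, charts, and other sheet operations); it cannot read files.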
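
As a minimal sketch of the openpyxl entry above, the following writes a small xlsx workbook and reads it back. It assumes openpyxl is installed; the file name demo.xlsx and the helper roundtrip_xlsx are made up for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def roundtrip_xlsx(rows):
    """Write rows to a temporary xlsx file and read them back (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.xlsx")  # illustrative file name

        wb = Workbook()
        ws = wb.active
        for row in rows:
            ws.append(row)  # one worksheet row per inner list
        wb.save(path)

        wb2 = load_workbook(path)
        result = [list(r) for r in wb2.active.iter_rows(values_only=True)]
        wb2.close()
        return result


print(roundtrip_xlsx([["name", "test"], [1, 2], [2, 4]]))
```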

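2. Converting a list to a DataFrame and saving it to Excel

import pandas as pd

a = [[1, 2], [2, 4], [4, 8]]
b = pd.DataFrame(a)
print(b)
# pandas 2.0+ can no longer write .xls (the xlwt engine was removed),
# so write .xlsx instead (this needs openpyxl installed).
b.to_excel('./test.xlsx', index=False, header=['name', 'test'])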

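3. When pd.read_excel reads a column whose values are strings of digits, it parses them as integers, so an id such as 00314 becomes 314. Passing the converters parameter avoids this.

a = pd.read_excel('./data/topic.xlsx', converters={'id': str})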
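
The effect is easy to reproduce without an Excel file. The sketch below uses pd.read_csv on an in-memory CSV, an assumption made only to keep the example self-contained; read_csv accepts the same converters parameter, and the sample data is invented.

```python
import io

import pandas as pd

# Invented sample data: the id column has leading zeros.
csv_text = "id,topic\n00314,pandas\n00027,excel\n"

# Default parsing infers an integer column, so the zeros are lost.
as_int = pd.read_csv(io.StringIO(csv_text))

# A converter keeps every id exactly as it was written.
as_str = pd.read_csv(io.StringIO(csv_text), converters={"id": str})

print(as_int["id"].tolist())
print(as_str["id"].tolist())
```

Passing dtype={'id': str} instead of a converter should have the same effect in this case.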

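4. Garbled text when opening a CSV file

Excel detects a CSV file's encoding from the BOM at the start of the file. If there is no BOM, Excel falls back to the system's default (ANSI) code page rather than UTF-8, so a UTF-8 file without a BOM shows up garbled.

Open the BOM-less CSV file in a text editor (Notepad++ is recommended) and convert it to an encoding with a BOM (any BOM-bearing encoding works, for example UTF-8 with BOM). This fixes the problem.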
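
When the CSV is produced by pandas, the BOM can be written directly instead of converting the file by hand. The following is a minimal sketch using Python's "utf-8-sig" codec (UTF-8 with a BOM); the file name and sample data are invented for illustration.

```python
import codecs
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["张三", "李四"], "score": [90, 85]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "with_bom.csv")  # illustrative file name

    # "utf-8-sig" is UTF-8 with a BOM written at the start of the file.
    df.to_csv(path, index=False, encoding="utf-8-sig")

    # Check the first three bytes of the file against the UTF-8 BOM.
    with open(path, "rb") as f:
        starts_with_bom = f.read(3) == codecs.BOM_UTF8

    # Reading with the same codec strips the BOM again.
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(starts_with_bom)
print(restored.columns.tolist())
```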

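5. Merging multiple files (CSV)

import os
import re
import pandas as pd

filenames = os.listdir("./")
# Write the first file together with its header row.
data = pd.read_csv('a1.csv', encoding='utf-8')
data.to_csv('t1.csv', index=False)
for o in filenames:
    # Append every other a*.csv file without repeating the header.
    if re.match(r'^a.*\.csv$', o) and o != 'a1.csv':
        # print(o)
        data = pd.read_csv(o)
        # Appending CSV text only works when the output is itself a CSV file.
        data.to_csv('t1.csv', index=False, header=False, mode='a')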
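
An alternative to appending file by file is to load everything and concatenate once with pd.concat. The sketch below is one way to do that; the helper merge_csv_files, the temporary directory, and the sample data are assumptions for illustration, and it presumes all files share the same columns.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_csv_files(pattern):
    """Read every CSV matching pattern and stack them into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    # Create two small input files (invented data).
    pd.DataFrame({"id": [1, 2], "v": ["a", "b"]}).to_csv(
        os.path.join(tmp, "a1.csv"), index=False)
    pd.DataFrame({"id": [3], "v": ["c"]}).to_csv(
        os.path.join(tmp, "a2.csv"), index=False)

    merged = merge_csv_files(os.path.join(tmp, "a*.csv"))

print(merged)
```

The merged frame can then be written once with to_csv, or with to_excel if an xlsx file is wanted.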

6. Splitting Excel data into train (80%), dev (10%), and test (10%) sets

data = df.sample(frac=1)
split_1 = int(0.8 * len(data))
split_2 = int(0.9 * len(data))
train_data = data[:split_1]
dev_data = data[split_1:split_2]
test_data = data[split_2:]

df.sample(frac=1) shuffles df. The frac parameter is the fraction of rows to return; for example, if df has 10 rows and only 30% of them are wanted, use frac=0.3.
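
For a repeatable split, the shuffle can be seeded. The following sketch wraps the same idea in a helper; split_dataframe, its parameters, and the seed value are hypothetical names chosen for illustration.

```python
import pandas as pd


def split_dataframe(df, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle df and cut it into train/dev/test parts (hypothetical helper)."""
    # random_state makes the shuffle repeatable; reset_index drops the old labels.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(train_frac * len(shuffled))
    n_dev = int(dev_frac * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test


df = pd.DataFrame({"x": range(100)})
train_data, dev_data, test_data = split_dataframe(df)
print(len(train_data), len(dev_data), len(test_data))
```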

7. Reordering DataFrame columns in pandas

import pandas

dict_a = {'user_id':['webbang','webbang','webbang'],'book_id':['3713327','4074636','26873486'],'rating':['4','4','4'],'mark_date':['2017-03-07','2017-03-07','2017-03-07']}
df = pandas.DataFrame(dict_a)
print(df)

# Reorder the columns
df = df[['user_id','book_id','rating','mark_date']]
print(df)
## finished

8. Combining two pandas DataFrames

# Row-wise concatenation (stack df2 below df1)
pd.concat([df1, df2], ignore_index=True, sort=False)

# Column-wise concatenation (place df4 beside df1)
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
                   columns=['animal', 'name'])
df = pd.concat([df1, df4], axis=1)

  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george

9. Reading a CSV file that has no header row

df = pd.read_csv("path/to/file.csv",
                 names=["Sequence", "Start", "End", "Coverage"])

The CSV file originally has no header; reading it this way assigns the column names.
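
A small self-contained sketch of the same call, using an in-memory CSV with invented rows in place of the file path:

```python
import io

import pandas as pd

# Invented headerless data standing in for the CSV file.
raw = "chr1,100,200,0.8\nchr2,300,450,0.5\n"

# With names given and no header argument, the first line is treated as data.
df = pd.read_csv(io.StringIO(raw),
                 names=["Sequence", "Start", "End", "Coverage"])

print(df)
```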

10. Applying an operation to a single DataFrame column

def square(x):
    return x ** 2

df['col2'] = df['col1'].map(square)
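
For simple arithmetic like this, the operation can also be applied to the whole column at once. The sketch below compares the two approaches on invented data; the column name col3 is introduced only for the comparison.

```python
import pandas as pd


def square(x):
    return x ** 2


df = pd.DataFrame({"col1": [1, 2, 3]})

# Element-wise: map calls square once per value.
df["col2"] = df["col1"].map(square)

# Vectorized: the same arithmetic applied to the whole column at once.
df["col3"] = df["col1"] ** 2

print(df)
```

The vectorized form is usually faster on large columns, while map remains useful for functions that have no vectorized equivalent.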

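11. Iterating over a DataFrame by row and by column

iterrows(): iterates by row, yielding each row as an (index, Series) pair; elements can be accessed with row[name].

itertuples(): iterates by row, yielding each row as a namedtuple; elements can be accessed with getattr(row, name) or row.name. It is faster than iterrows().

items() (named iteritems() before pandas 2.0, where that name was removed): iterates by column, yielding each column as a (column name, Series) pair; elements can be accessed with content[index].

df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64

Because iterrows() returns each row as a Series, dtypes are not preserved across the row: the int column comes back as a float here.

df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
                   'population': [1864, 22000, 80000]},
                  index=['panda', 'polar', 'koala'])

         species  population
panda       bear        1864
polar       bear       22000
koala  marsupial       80000

for label, content in df.items():
    print('label:', label)
    print('content:', content, sep='\n')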
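
The three iteration styles can be compared side by side. This is a minimal sketch on the same species table; the variable names by_iterrows, by_itertuples, and by_items are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"species": ["bear", "bear", "marsupial"],
                   "population": [1864, 22000, 80000]},
                  index=["panda", "polar", "koala"])

# iterrows: (index, Series) pairs, fields accessed by column name.
by_iterrows = [(idx, row["population"]) for idx, row in df.iterrows()]

# itertuples: namedtuples, fields accessed as attributes;
# row.Index holds the index label.
by_itertuples = [(row.Index, row.population) for row in df.itertuples()]

# items: (column name, Series) pairs, values accessed by index label.
by_items = {label: content["koala"] for label, content in df.items()}

print(by_iterrows)
print(by_itertuples)
print(by_items)
```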

12. DataFrame.drop_duplicates

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
print(df)

     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

Remove duplicate rows, comparing all columns:

df.drop_duplicates()

     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

Remove duplicates based on a single column:

df.drop_duplicates(subset=['brand'])

     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

Remove duplicates based on several columns, keeping the last occurrence:

df.drop_duplicates(subset=['brand', 'style'], keep='last')

     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0

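13. Subtracting one DataFrame from another

The difference can be computed on one attribute shared by the two DataFrames, for example id.

import pandas as pd
import numpy as np

a = np.array([[1, 3], [4, 6], [7, 9]])
b = np.array([[1, 3]])
df1 = pd.DataFrame(a, columns=['id', 'value'])
print(df1)
df2 = pd.DataFrame(b, columns=['id', 'value'])

# df1[df1['id'].isin(df2['id'].tolist())] selects the rows of df1 whose id appears in df2's id column
# Negate the mask with ~
df1[~df1['id'].isin(df2['id'].tolist())]
# This is the difference of the two DataFrames that we want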
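
When rows should count as equal only if several columns match, a left merge with indicator=True is one possible alternative to isin. This is a sketch under that assumption, not the method used above; the names merged and diff are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 3], [4, 6], [7, 9]], columns=["id", "value"])
df2 = pd.DataFrame([[1, 3]], columns=["id", "value"])

# indicator=True adds a _merge column saying where each row was found.
merged = df1.merge(df2, how="left", on=["id", "value"], indicator=True)

# Keep rows that appear only in df1, then drop the helper column.
diff = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(diff)
```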
