Common questions in NumPy and pandas data processing

weixin_34111819

Original link: https://my.oschina.net/zhiyonghe/blog/1475673

Implementing VLOOKUP in pandas (it is really fast): https://www.168seo.cn/python/23694.html

A ten-minute quick start to pandas: https://www.168seo.cn/python/23708.html

A ten-minute introduction to Matplotlib: https://www.168seo.cn/python/23712.html

This post focuses on questions that come up often in practice but are not covered in textbooks, to save you a trip to Stack Overflow.

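Question 1: What is the difference between .values, .iloc, .ix and .loc on a DataFrame?

1. Only .values converts the DataFrame's data to a NumPy array, which is then indexed as a plain array; the other three index the DataFrame itself. iloc indexes by integer position (the same positions the values array uses), loc indexes by the labels in index and columns, and ix first tries loc-style lookup and falls back to iloc-style lookup if that fails. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so it only applies to older versions.

Stack Overflow explanation: https://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation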

First, a recap:

loc works on labels in the index.
iloc works on the positions in the index (so it only takes integers).
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.

It's important to note some subtleties that can make ix slightly tricky to use:

if the index is of integer type, ix will only use label-based indexing and not fall back to position-based indexing. If the label is not in the index, an error is raised.
if the index does not contain only integers, then given an integer, ix will immediately use position-based indexing rather than label-based indexing. If however ix is given another type (e.g. a string), it can use label-based indexing.

2. iloc indexes by integer position.
loc indexes by label (string labels, for example).
ix is a combination of iloc and loc.

For example:
import pandas as pd
index_loc = ['a', 'b']
index_iloc = [1, 2]
data = [[1, 2, 3, 4], [5, 6, 7, 8]]

columns = ['one', 'two', 'three', 'four']
df1 = pd.DataFrame(data=data, index=index_loc, columns=columns)
df2 = pd.DataFrame(data=data, index=index_iloc, columns=columns)

# df1 has string labels: .loc['a'] works, but .iloc['a'] raises a TypeError
print(df1.loc['a'])
#-------------------
# one      1
# two      2
# three    3
# four     4
# Name: a, dtype: int64
#-------------------

# df2 has the integer labels 1 and 2: .loc[1] looks up the label 1 (the first row),
# whereas .iloc[1] would return the row at position 1 (the second row)
print(df2.loc[1])
#-------------------
# one      1
# two      2
# three    3
# four     4
# Name: 1, dtype: int64
#-------------------

# .ix accepted both kinds of key on both frames (pandas older than 1.0 only)
# df1.ix['a']
# df2.ix[1]
With ix, both string labels and integer labels could be looked up.
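Because ix is no longer available in pandas 1.0 and later, the same lookups can be written with loc and iloc alone. The following is a minimal sketch that rebuilds df1 and df2 from the example above; the variable names on the left-hand side are made up for illustration.

```python
import pandas as pd

columns = ['one', 'two', 'three', 'four']
data = [[1, 2, 3, 4], [5, 6, 7, 8]]
df1 = pd.DataFrame(data, index=['a', 'b'], columns=columns)
df2 = pd.DataFrame(data, index=[1, 2], columns=columns)

# Label-based lookup: the key must be a label present in the index.
row_a = df1.loc['a']
row_label_1 = df2.loc[1]      # label 1 is the first row of df2

# Position-based lookup: the key is a 0-based integer position.
row_pos_0 = df1.iloc[0]       # same row as df1.loc['a']
row_pos_1 = df2.iloc[1]       # second row of df2, whose label is 2

# .values drops the labels and returns a plain NumPy array.
first_cell = df1.values[0, 0]

# A string key is not a valid position, so iloc rejects it.
try:
    df1.iloc['a']
    iloc_accepts_string = True
except TypeError:
    iloc_accepts_string = False
```

The df2 case is the one that most often causes confusion: with an integer index, loc[1] and iloc[1] are both valid but return different rows.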

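Question 2: Can columns or index have two or more levels? If so, how are they indexed?

Yes. With loc (or ix on old pandas versions), a plain key selects on the first level of the index or columns. The simplest approach is to chain lookups, like this: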

example.loc[index1, columns1].loc[index2, columns2]
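To make the chained lookup concrete, here is a small sketch with a two-level row index. The level names, labels and values are invented for the example; the tuple form at the end is an alternative that addresses both levels in one step.

```python
import pandas as pd

# Hypothetical two-level row index: (group, item).
index = pd.MultiIndex.from_tuples(
    [('g1', 'x'), ('g1', 'y'), ('g2', 'x')], names=['group', 'item']
)
example = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=index)

# A single key selects on the first level; the result is indexed by the second level.
g1 = example.loc['g1']

# Chained lookup: first level, then second level plus column.
chained = example.loc['g1'].loc['x', 'one']

# A tuple key addresses both levels at once.
direct = example.loc[('g1', 'x'), 'one']
```

Chaining is easy to read, while the tuple form avoids creating the intermediate frame.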

Question 3: How to convert between list, dict, numpy.ndarray and DataFrame?

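1. list to numpy.ndarray:

np.array(example)

2. numpy.ndarray to list (tolist() also converts nested rows, whereas list(example) leaves the rows of a 2-D array as ndarrays):

example.tolist()

3. dict to DataFrame:

example = {'a': {'bb': 2, 'cc': 3}}
eee = pd.DataFrame(example)

4. numpy.ndarray to DataFrame:

pd.DataFrame(example)

5. DataFrame to numpy.ndarray (example.to_numpy() is the equivalent on pandas 0.24 and later):

example.values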
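The conversions above can be tried end to end with a short sketch. All variable names and the row and column labels are made up for illustration.

```python
import numpy as np
import pandas as pd

nested = [[1, 2], [3, 4]]

arr = np.array(nested)            # list -> ndarray
back_to_list = arr.tolist()       # ndarray -> nested Python list

# dict of dicts -> DataFrame: outer keys become columns, inner keys the index
mapping = {'a': {'bb': 2, 'cc': 3}}
frame_from_dict = pd.DataFrame(mapping)

# ndarray -> DataFrame; the labels here are invented for the example
frame_from_array = pd.DataFrame(arr, index=['r0', 'r1'], columns=['c0', 'c1'])

# DataFrame -> ndarray
arr_again = frame_from_array.to_numpy()
```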

Question 4: How to fill NaN and inf in a numpy.ndarray and a DataFrame?

1. For a numpy.ndarray:

example = np.where(np.isnan(example), 0, example)
example = np.where(np.isinf(example), 0, example)

2. For a DataFrame:

Use example.fillna() for NaN, or example.replace(a, b) to swap any value a (inf, for instance) for b.
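As a sketch of both cases, the snippet below replaces NaN and infinite values with 0. The fill value and the column name x are assumptions for the example; np.isfinite is used here as a one-pass alternative to calling isnan and isinf separately.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, np.inf, -np.inf])

# ndarray: mask everything that is not finite (NaN, +inf, -inf) in one pass
cleaned = np.where(np.isfinite(values), values, 0.0)

# DataFrame: fillna only touches NaN, so map +/-inf to NaN first
frame = pd.DataFrame({'x': values})
cleaned_frame = frame.replace([np.inf, -np.inf], np.nan).fillna(0.0)
```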

Question 5: How do the various I/O formats compare in speed?

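1. npy is the fastest to read and write, but it takes the most disk space, e.g. np.load(), np.save();

2. csv comes next, e.g. pd.DataFrame.to_csv(), pd.read_csv();

3. Plain txt can also be read and written quickly, but it needs frequent split calls, which is cumbersome for structured data;

4. Simple Excel files can be handled with xlrd (reading) and xlwt (writing); note that these libraries target Excel workbooks, not Word documents;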
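The following sketch shows a round trip through npy and csv so the trade-off is visible: npy keeps only the raw array, while csv keeps the labels. The file names and the sample frame are invented, and the files go into a temporary directory. It does not measure speed; timing depends on data size and hardware.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.5, 2.5], 'b': [3.0, 4.0]}, index=['r0', 'r1'])

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, 'example.npy')
    csv_path = os.path.join(tmp_dir, 'example.csv')

    # npy stores only the raw array; index and column labels are not saved
    np.save(npy_path, frame.to_numpy())
    array_back = np.load(npy_path)

    # csv keeps the labels; index_col=0 restores the first column as the index
    frame.to_csv(csv_path)
    frame_back = pd.read_csv(csv_path, index_col=0)
```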
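
Question 6: Common os operations, such as creating a folder and traversing a folder?

1. Create a folder:

if not os.path.isdir(path_out):
    os.makedirs(path_out)

2. Traverse all files, including those in subfolders:

for a, b, filenames in os.walk(path_data):
    for filename in filenames:
        print(os.path.join(a, filename))

Traverse only the files directly in the folder, skipping subfolders:

for a, b, filenames in os.walk(path_data):
    for filename in filenames:
        if a == path_data:
            print(os.path.join(a, filename))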
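A self-contained sketch of these folder operations follows. It builds a tiny tree in a temporary directory so it can be run anywhere; the file names are invented. For the top-level-only case it uses os.listdir instead of filtering os.walk, which avoids descending into subfolders at all.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as path_data:
    # Build a small tree: one file at the top level, one in a subfolder
    path_out = os.path.join(path_data, 'sub')
    os.makedirs(path_out, exist_ok=True)   # no error if it already exists
    for path in (os.path.join(path_data, 'top.txt'),
                 os.path.join(path_out, 'nested.txt')):
        with open(path, 'w') as handle:
            handle.write('x')

    # Recursive traversal: os.walk yields (folder, subfolders, files)
    all_files = sorted(
        filename
        for _folder, _subfolders, filenames in os.walk(path_data)
        for filename in filenames
    )

    # Top level only: list the folder and keep regular files
    top_files = sorted(
        name for name in os.listdir(path_data)
        if os.path.isfile(os.path.join(path_data, name))
    )
```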

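Question 7: How to select the rows and columns of a numpy.ndarray or DataFrame that satisfy a condition?

1. Select by a new list of columns:

frame_[newcolumns]

2. Select by a new list of index labels:

frame_[frame_.index.isin(newindex)]

3. Select by a condition on one row or one column (the original used .ix, which is gone in pandas 1.0 and later; loc and iloc do the same job):

To keep the columns whose value in the first row is at least the constant start_date: frame_ = frame_.loc[:, frame_.iloc[0, :] >= start_date]
To keep the rows whose value in the first column is at least the constant start_date: frame_ = frame_.loc[frame_.iloc[:, 0] >= start_date, :]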
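The three kinds of selection can be checked on a small frame. This is a sketch with invented labels and integer thresholds standing in for start_date; the last lines show the same boolean-mask idea on a plain ndarray.

```python
import pandas as pd

frame_ = pd.DataFrame(
    [[1, 5, 9], [2, 6, 10], [3, 7, 11]],
    index=['r0', 'r1', 'r2'],
    columns=['c0', 'c1', 'c2'],
)
threshold = 5  # hypothetical constant playing the role of start_date

by_columns = frame_[['c0', 'c2']]
by_index = frame_[frame_.index.isin(['r0', 'r2'])]

# Keep the columns whose value in the first row is at least the threshold
cols_kept = frame_.loc[:, frame_.iloc[0, :] >= threshold]

# Keep the rows whose value in the first column is at least 2
rows_kept = frame_.loc[frame_.iloc[:, 0] >= 2, :]

# The same boolean-mask idea on a plain ndarray
array = frame_.to_numpy()
array_rows = array[array[:, 0] >= 2, :]
```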

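Question 8: How to compute a correlation matrix?

Put y and all the x variables into one array, sample = numpy.ndarray, then call np.corrcoef(sample) directly. By default np.corrcoef treats each row as one variable and returns the Pearson correlation coefficients. You can also use a rank correlation, i.e. the Spearman correlation, directly via scipy.stats.spearmanr.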
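As a rough illustration, the sketch below computes both matrices on invented data. To stay within NumPy and pandas, Spearman is computed here as the Pearson correlation of the ranks, which is its definition; scipy.stats.spearmanr is the ready-made route mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: each ROW is one variable (y first, then the x's),
# which matches np.corrcoef's default rowvar=True.
sample = np.array([
    [1.0, 2.0, 3.0, 4.0],     # y
    [1.0, 4.0, 9.0, 16.0],    # x1 = y ** 2, monotonic in y
    [4.0, 3.0, 2.0, 1.0],     # x2 decreases as y increases
])

pearson = np.corrcoef(sample)

# Spearman = Pearson correlation of the ranks (ranked within each variable)
ranks = pd.DataFrame(sample).rank(axis=1).to_numpy()
spearman = np.corrcoef(ranks)
```

If the variables are stored as columns instead, pass rowvar=False to np.corrcoef.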

Question 9: How to extract the letters or digits from a string?

1. Extract the digits in example:
int(''.join(x for x in example if x.isdigit()))
2. Extract the letters in example:
(''.join(x for x in example if x.isalpha()))
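A quick sketch on an invented string shows both extractions. It also guards the int() call, since int('') raises a ValueError when the string has no digits.

```python
example = 'abc123def45'

digits = ''.join(ch for ch in example if ch.isdigit())
letters = ''.join(ch for ch in example if ch.isalpha())

# int() fails on an empty string, so guard the case with no digits at all
number = int(digits) if digits else None
```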

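Question 10: The various merge operations?

1. Merge numpy.ndarray data side by side (horizontally, adding columns):
np.hstack((example1, example2))

2. Merge DataFrames side by side, aligned on the index. All index labels from both frames are kept, and a column that lacks a given label gets NaN there:
pd.concat([example1, example2], axis=1)

3. Merge side by side, but keep only the index labels the frames share (the sort afterwards is optional):

example = pd.concat([example1, example2], axis=1, join='inner')
example.sort_index(axis=1, inplace=True)

4. Stack numpy.ndarray data vertically (adding rows):
np.vstack((example1, example2))

5. Stack DataFrames vertically, aligned on the columns. The original index and columns are kept, and a cell with no source value gets np.nan:

pd.concat([example1, example2], axis=0)

6. Stack vertically, but keep only the columns the frames share (the sort afterwards is optional):

example = pd.concat([example1, example2], axis=0, join='inner')
example.sort_index(axis=0, inplace=True)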
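The six cases can be compared on two small frames that overlap in one index label and one column. The data and labels below are invented for the sketch.

```python
import numpy as np
import pandas as pd

example1 = pd.DataFrame({'a': [1, 2], 'c': [5, 6]}, index=['r0', 'r1'])
example2 = pd.DataFrame({'b': [3, 4], 'c': [7, 8]}, index=['r1', 'r2'])

# axis=1 places the frames side by side, aligned on the index
outer_cols = pd.concat([example1, example2], axis=1)                 # union of index
inner_cols = pd.concat([example1, example2], axis=1, join='inner')   # shared index only

# axis=0 stacks the frames on top of each other, aligned on the columns
outer_rows = pd.concat([example1, example2], axis=0)                 # union of columns
inner_rows = pd.concat([example1, example2], axis=0, join='inner')   # shared columns only

# The ndarray counterparts need matching shapes and do no label alignment
side_by_side = np.hstack((example1.to_numpy(), example2.to_numpy()))  # shape (2, 4)
stacked = np.vstack((example1.to_numpy(), example2.to_numpy()))       # shape (4, 2)
```

Note that the ndarray functions simply glue the blocks together by position, so the rows of side_by_side do not correspond to matching labels the way the concat results do.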

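Question 11: Adding the same suffix to every index label of a DataFrame

Suppose a DataFrame has index=['aa', 'cc', 'dddddd'] and every label should get the suffix _5m. The usual approach is simply example.index = [x + '_5m' for x in example.index]. According to the author this can cause small problems: the default index type is pandas.indexes.base.Index, which may assume that the strings in the index have a fixed length, so appending _5m can fail. The workaround is to convert the index to a list first, like this: example.index = [x + '_5m' for x in list(example.index)]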
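A minimal sketch of the workaround, on an invented frame with a single value column. It also shows DataFrame.rename with a function as an alternative that leaves the original frame untouched.

```python
import pandas as pd

labels = ['aa', 'cc', 'dddddd']
example = pd.DataFrame({'value': [1, 2, 3]}, index=labels)

# Going through a plain list, as suggested in the text
example.index = [label + '_5m' for label in list(example.index)]

# Alternative: rename applies the function to every index label
# and returns a new frame
alternative = pd.DataFrame({'value': [1, 2, 3]}, index=labels).rename(
    index=lambda label: label + '_5m'
)
```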

Reposted from: https://my.oschina.net/zhiyonghe/blog/1475673