Python notes: percentile, filter, linspace, array[array > 0] = 1, pop, startswith(), and pandas data operations

yangdeshun888

Article link: https://blog.csdn.net/yangdashi888/article/details/92634297

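1. percentile: get the value at a given percentile:

Given an array a sorted in increasing order, find its median:
np.percentile(a, 50)
The median is the value at the 50% position; the values at 0% and 100% (the minimum and maximum) can be obtained the same way.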

import numpy as np

a = np.array([1, 2, 3, 6, 7, 11, 13])
for ind, v in enumerate(a):
    print("index", ind, "value", v, "percentile", ind / (len(a) - 1) * 100)
    percen = np.percentile(a, ind / (len(a) - 1) * 100)
    assert np.abs(percen - v) < 1e-3, "%s!=%s" % (percen, v)
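
The loop above only queries percentiles that land exactly on an element. As a small sketch of what happens in between, the example below reuses the same array and relies on NumPy's default linear interpolation; the variable names are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3, 6, 7, 11, 13])

# 0%, 50% and 100% fall exactly on elements of the sorted array
p0 = np.percentile(a, 0)      # minimum
p50 = np.percentile(a, 50)    # median (index 3)
p100 = np.percentile(a, 100)  # maximum

# 25% corresponds to position 0.25 * (len(a) - 1) = 1.5, halfway between
# a[1] = 2 and a[2] = 3, so the default linear interpolation averages them
p25 = np.percentile(a, 25)

print(p0, p50, p100, p25)
```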

2. filter is used to filter data; it is quick and convenient and saves a lot of code:

formatted_res = filter(lambda y: y['dist'] < self.distance_cutoff, formatted_res)

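In Python 3 the return value is a filter object, a lazy iterator that produces matching items on demand (much like a generator using yield). You can loop over it like a list, but it does not support indexing or len() and is exhausted after one pass; wrap it in list() if you need a real list.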
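
A minimal sketch of that behaviour follows. The records and the cutoff value are made up for illustration; they stand in for formatted_res and self.distance_cutoff from the snippet above.

```python
# Hypothetical records and cutoff (not from the original code)
records = [{"id": 1, "dist": 0.2}, {"id": 2, "dist": 0.9}, {"id": 3, "dist": 0.4}]
distance_cutoff = 0.5

# filter() returns a lazy iterator; nothing is evaluated yet
close = filter(lambda y: y["dist"] < distance_cutoff, records)

# Materialize it to index it, call len(), or iterate more than once
close_list = list(close)
close_ids = [r["id"] for r in close_list]

# The filter object is now exhausted, so a second pass yields nothing
leftover = list(close)

# An equivalent list comprehension builds the list directly
close_ids_lc = [r["id"] for r in records if r["dist"] < distance_cutoff]

print(close_ids, leftover, close_ids_lc)
```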

3. linspace: get the center coordinates of image blocks:

x_coords = np.linspace(window[0][0], window[0][1], n + 2, dtype=int)[1:-1]

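This places n center points inside the window, so linspace has to generate n + 2 evenly spaced points; the slice [1:-1] then drops the two endpoints that lie on the window edges.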
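
A worked sketch with concrete numbers. The layout of window as ((x_min, x_max), (y_min, y_max)) and the value of n are assumptions made for this example; the original only shows the x axis.

```python
import numpy as np

# Hypothetical window and number of centers
window = ((0, 100), (0, 50))
n = 4

# n + 2 evenly spaced points including both edges
x_all = np.linspace(window[0][0], window[0][1], n + 2, dtype=int)

# Drop the two edge points to keep the n interior centers
x_coords = x_all[1:-1]
y_coords = np.linspace(window[1][0], window[1][1], n + 2, dtype=int)[1:-1]

print(x_all, x_coords, y_coords)
```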

4. array[array > 0] = 1: assign directly to the elements that satisfy a condition:
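
A minimal sketch of this boolean-mask assignment on a made-up array, with np.where shown as a non-mutating alternative:

```python
import numpy as np

array = np.array([[-2, 0, 3], [5, -1, 0]])

mask = array > 0   # boolean array with the same shape as `array`
array[mask] = 1    # in-place assignment wherever the mask is True

# np.where builds a new array instead of modifying the input
binary = np.where(array > 0, 1, 0)

print(array)
print(binary)
```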

5. pop removes the given key from a dict and returns the value that was removed:

signature = rec.pop('signature')
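
A short sketch with a hypothetical rec dict. Calling pop on a missing key raises KeyError unless a default is supplied as the second argument.

```python
# Hypothetical record standing in for `rec`
rec = {"signature": "abc123", "path": "img.png"}

signature = rec.pop("signature")      # returns the value and deletes the key
missing = rec.pop("signature", None)  # key is gone; the default avoids KeyError

print(signature, missing, rec)
```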

6. startswith() and endswith(): match the beginning or end of a string:

filename = 'spam.txt'
filename.startswith('spam')
Out[3]: True
filename.endswith('.txt')
Out[4]: True
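
Both methods also accept a tuple of candidates, which is handy when several prefixes or extensions are acceptable. A small sketch with made-up file names:

```python
filename = "spam.txt"

# A tuple argument matches if any one of the candidates matches
is_text = filename.endswith((".txt", ".md"))
is_image = filename.endswith((".png", ".jpg"))

# Typical use: select file names by prefix
names = ["spam.txt", "eggs.png", "spam_notes.md", "ham.txt"]
spam_files = [name for name in names if name.startswith("spam")]

print(is_text, is_image, spam_files)
```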

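7. pandas data operations:

Reference blog post on automated feature engineering:

Automated Feature Engineering with the Python Featuretools Library

Drop rows that contain missing (None/NaN) values, or replace those values:

# drop rows that contain a null in any column (pass subset=[...] to check only specific columns)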
data = data.dropna(axis=0)

Remove duplicate data, i.e. rows whose values are identical in every column:

# drop the duplicates
data = data.drop_duplicates()
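
To make the effect of these two calls concrete, here is a minimal sketch on a made-up DataFrame; the column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical, rows 2 and 3 each contain a null
data = pd.DataFrame({
    "order_id": ["A1", "A1", "C2", "A3"],
    "customer_id": [10.0, 10.0, None, 30.0],
    "price": [2.5, 2.5, 4.0, None],
})

any_null_dropped = data.dropna(axis=0)                 # drops rows with a null anywhere
id_null_dropped = data.dropna(subset=["customer_id"])  # only checks customer_id
deduped = data.drop_duplicates()                       # keeps the first of identical rows

print(len(any_null_dropped), len(id_null_dropped), len(deduped))
```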

Treat a column read by pandas as strings through the .str accessor and match on a prefix:

# use the .str accessor to flag order ids that start with "C"
data['cancelled'] = data['order_id'].str.startswith('C')
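
One detail worth knowing: if the column contains missing values, .str.startswith returns NaN for them by default. A minimal sketch on made-up order ids, using the na argument to treat missing ids as not cancelled:

```python
import pandas as pd

# Hypothetical order ids, one of them missing
data = pd.DataFrame({"order_id": ["536365", "C536379", None]})

# na=False makes missing ids evaluate to False instead of NaN
data["cancelled"] = data["order_id"].str.startswith("C", na=False)

print(data)
```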

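pandas can convert a DataFrame into a list of dicts with the built-in to_dict method. The snippet below is a method excerpted from a class; it needs `from io import StringIO`, `from typing import Dict, List` and `import pandas as pd`: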

    def _format_grid_data(self, data: str) -> List[Dict]:
        with open(data, encoding="gbk", errors="replace") as f:
            content = f.read()

        df = pd.read_csv(
            StringIO(content),
            delimiter="\t",
            dtype=self._trader.config.GRID_DTYPE,
            na_filter=False,
        )
        return df.to_dict("records")
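
A self-contained sketch of the same idea without the surrounding class. The tab-separated content and the dtype mapping are made up here; they stand in for the file contents and for self._trader.config.GRID_DTYPE, whose actual value is not shown in the original.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated content
content = "code\tname\tqty\n000001\tAlpha\t100\n600000\tBeta\t200\n"

df = pd.read_csv(
    StringIO(content),
    delimiter="\t",
    dtype={"code": str},  # assumed mapping; keeps leading zeros in `code`
    na_filter=False,      # do not turn empty fields into NaN
)

# "records" gives one dict per row, keyed by column name
records = df.to_dict("records")
print(records)
```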

Draw a bar chart:

data['cancelled'].value_counts().plot.bar(figsize = (6, 5));

plt.title('Cancelled Purchases Breakdown');

Fill missing (None/NaN) values with 0:

    # Set the total for any customer who did not have a purchase in the timeframe equal to 0
    totals['total'] = totals['total'].fillna(0)
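
As a sketch of where such missing totals can come from, the example below assumes the totals were built by left-joining a customer list with per-customer purchase sums; both tables are made up for illustration.

```python
import pandas as pd

# Hypothetical customers and purchases; customer 2 bought nothing
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
purchases = pd.DataFrame({"customer_id": [1, 1, 3], "total": [10.0, 15.0, 7.5]})

# Sum purchases per customer, then left-join so every customer keeps a row
sums = purchases.groupby("customer_id", as_index=False)["total"].sum()
totals = customers.merge(sums, on="customer_id", how="left")

# Customers without purchases have NaN after the join; set them to 0
totals["total"] = totals["total"].fillna(0)
print(totals)
```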

Parse dates into the pandas datetime format; once the data is read back, the .dt accessor can quickly extract the year, day, and so on:

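from datetime import datetime

import pandas as pd

# format the current time as a string, then parse it into a pandas Timestamp
testdate = []
dates = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
testdate.append(pd.to_datetime(dates, format="%Y-%m-%d %H:%M:%S"))
res = pd.DataFrame(testdate, columns=["order_date"])
res.to_csv("testone.csv", index=False)
# parse_dates turns the column into datetime64 on read, which enables .dt
reads = pd.read_csv("testone.csv", parse_dates=["order_date"])
print(reads["order_date"].dt.year)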

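Print all rows that satisfy a condition:

import pandas as pd

data = pd.read_csv("test.csv", parse_dates=["order_date"])
print("testing loc")
# print every row that satisfies the condition
# (the original compared the datetime column with the string "5", which is not a
# meaningful date comparison; assuming the intent was day-of-month 5, compare .dt.day)
print(data.loc[data["order_date"].dt.day == 5])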

When selecting data, several conditions can also be combined, as follows:

plot_labels = labels.copy()
plot_labels['month'] = plot_labels['cutoff_time'].dt.month

plt.figure(figsize = (12, 6))
sns.boxplot(x = 'month', y = 'total', 
            data = plot_labels[(plot_labels['total'] > 0) & (plot_labels['total'] < 1000)]);
plt.title('Customer Spending Distribution by Month');
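
The row selection in the plot above can be tried without any plotting. The sketch below uses a made-up labels table with the same column names; note that each comparison needs its own parentheses because & binds more tightly than > and <.

```python
import pandas as pd

# Hypothetical labels table
labels = pd.DataFrame({
    "cutoff_time": pd.to_datetime(
        ["2011-01-01", "2011-02-01", "2011-02-01", "2011-03-01"]
    ),
    "total": [0.0, 250.0, 1500.0, 40.0],
})

plot_labels = labels.copy()
plot_labels["month"] = plot_labels["cutoff_time"].dt.month

# Combine two boolean masks element-wise with &
in_range = plot_labels[(plot_labels["total"] > 0) & (plot_labels["total"] < 1000)]

# The same selection written with DataFrame.query
same = plot_labels.query("total > 0 and total < 1000")

print(in_range)
```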