python 数据处理 - Pandas模块

python 数据处理 - Pandas模块

1.一些函数

1.1 pd.DataFrame

class pandas.``DataFrame(data=None, index=None**,** columns=None**,** dtype=None**,** copy=False**)**

Parameters
  • data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame

  • index: Index or array-like

  • columns: Index or array-like

  • dtype: dtype, default None

  • copy: bool, default False

    Copy data from inputs. Only affects DataFrame / 2d ndarray input

# 示例1
voltage = pd.DataFrame(data=np.array(voltage),
			index=pd.MultiIndex.from_product([id, ['real', 'pred']], names=['vin', 'datatype']))
1.2 reset_index, set_index, reindex
1.2.1. reset_index:

DataFrame.``reset_index(level=None, drop=False**,** inplace=False**,** col_level=0**,** col_fill=’')

Definition:

Reset the index, or a level of it. Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

Parameters
  • level: int, str, tuple, or list, default None

    Only remove the given levels from the index. Removes all levels by default.

  • drop: bool, default False

    Do not try to insert index into dataframe columns. This resets the index to the default integer index.

  • inplace: bool, default False

    Modify the DataFrame in place (do not create a new object).

  • col_level: int or str, default 0

    If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.

  • col_fill: object, default ‘’

    If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

# 示例1
df.reset_index(inplace=True, drop=True)
1.2.2. set_index:

DataFrame.``set_index(keys, drop=True**,** append=False**,** inplace=False**,** verify_integrity=False**)**

Definition:

Set the DataFrame index using existing columns. Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.

Parameters
  • keys: label or array-like or list of labels/arrays

    This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray, and instances of Iterator.

  • drop: bool, default True

    Delete columns to be used as the new index.

  • append: bool, default False

    Whether to append columns to existing index.

  • inplace: bool, default False

    Modify the DataFrame in place (do not create a new object).

  • verify_integrity: bool, default False

    Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.

# 示例1
df.set_index('vin', inplace=True)
# 示例2, multiindex
df.set_index(['a', 'b'], inplace=True)
# 示例3, series
df.set_index(pd.Series(range(6)))
1.2.3 reindex
Definition:

Conform Series/DataFrame to new index with optional filling logic. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

Parameters
  • keywords for axes: array-like, optional

    New labels / index to conform to, should be specified using keywords. Preferably an Index object to avoid duplicating data.

  • method: {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}

    Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index. None (default): don’t fill gaps pad / ffill: Propagate last valid observation forward to next valid. backfill / bfill: Use next valid observation to fill gap. nearest: Use nearest valid observations to fill gap.

  • copy: bool, default True

    Return a new object, even if the passed indexes are the same.

  • level: int or name

    Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value: scalar, default np.NaN

    Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • limit: int, default None

    Maximum number of consecutive elements to forward or backward fill.

  • tolerance: optional

    Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations most satisfy the equation abs(index[indexer] - target) <= tolerance. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.

# 示例1
df = df.reindex(index=car_df.index[::-1])
# 示例2
df = df.reindex(columns=['a', 'b'])
# 示例3 - 使用df.columns重新命名dataframe
image-20201015173427308
1.3. replace

DataFrame.``replace(to_replace=None, value=None**,** inplace=False**,** limit=None**,** regex=False**,** method='pad’)

Parameters

  • to_replace: str, regex, list, dict, Series, int, float, or None

  • value: scalar, dict, list, str, regex, default None

    Value to replace any values matching to_replace with.

# 示例1
df['TermIndex'].replace([1,2],['一','二'], inplace=True) #替换“TermIndex”的值,将数字转为中文
1.4. pd.to_datetime

pandas.to_datetime(arg,errors =‘raise’,utc = None,format = None,unit = None )
pandas中的to_datetime( )有和datetime( )类似的功能。获取指定的时间和日期,将Str和Unicode转化为datetimedatetime(str,format)。

# 示例1
df['date_formatted']=pd.to_datetime(df['date'],format='%Y-%m-%d')
1.5. pd.read_csv
Parameters
  • filepath_or_buffer: str, path object or file-like object

  • header**: int, list of int, default ‘infer’**

  • names****array-like, optional

    List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.

  • index_col****int, str, sequence of int / str, or False, default None

  • usecols****list-like or callable, optional

  • dtype****Type name or dict of column -> type, optional

  • nrows****int, optional

  • encodingstr, optional

    Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings

  • chunksizeint, optional

    Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.

  • parse_datesbool or list of int or names or list of lists or dict, default False

    The behavior is as follows:

    • boolean. If True -> try parsing the index.
    • list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
    • list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
    • dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

    If a column or index cannot be represented as an array of datetimes, say because of an unparseable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. To parse an index or column with a mixture of timezones, specify date_parser to be a partially-applied pandas.to_datetime() with utc=True. See Parsing a CSV with mixed timezones for more.

    Note: A fast-path exists for iso8601-formatted dates.

  • date_parserfunction, optional

    Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

# 示例1
data = pd.read_csv(r"D:\ThinkEnergy_Data\车辆网样例数据\15.csv",usecols=['created','voltage','total_current','soc_display','year','month','day'])
# 示例2 - 输入设置index,如下将第一列作为索引
data = pd.read_csv(filepath, index_col=0)
# 示例3 - 输出没有index
data.to_csv(filepath, index=False)

2.一些tricks

2.1. 取满足条件的行 & 取满足条件行的相应索引
# 示例1
data = data[((data.index) % 50 == 0)]
# 示例2
index = data[(data['status'] == 2) & (data['charging'] == 1)].index.tolist()
# 示例3 - 找到包含nan值的行
temp_indices = data[data.isnull().T.any()].index.tolist()
2.2. 处理缺失值
# 示例1
data.fillna(0, inplace=True)
# 示例2
data.dropna(subset=['volt'], axis=0, how='any', inplace=True)
# 示例3
data.replace(nan, 0)
2.3. 丢弃 drop
# 示例1
data.drop(temp_indices, axis=1,inplace=True)
# 示例2
data.drop_duplicates(inplace=True)
2.4. 拷贝
# 示例1
df1 = df.copy(deep=True)
2.5. 其他
df1 = df.describe()

a=pd.cut(df['压差'],[0,0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,
0.05,0.055,0.06,0.065,0.07,0.075,0.08,0.085,0.09,0.095,0.1,1000])
df_V = pd.Series(df['压差']).groupby(a).agg(['count']).reset_index()#groupby提取汇总统计值
print(df_V)
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值