Getting Started with Pandas: 100 Must-Practice Exercises, Basics to Advanced | Tutorial Part 1



Author: 郭震

36. How to import only specified columns from a csv file?

# input
import pandas as pd

# code that generates the housing_preprocessed.csv file
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
    # the raw file is whitespace-delimited, one record per line
    data_ = [line.split() for line in f.read().split("\n")]

df = pd.DataFrame(data_, columns=names)
df.to_csv("housing_preprocessed.csv")
del df

# use the /kaggle/input/boston-house-prices/housing_preprocessed.csv file
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

file = "housing_preprocessed.csv"
# using index positions; skipfooter drops the trailing blank record
# and requires the python engine
df = pd.read_csv(file, usecols=[1, 2, 4], skipfooter=1, engine="python")
df.head()
# using column names
df = pd.read_csv(file, usecols = ["CRIM", "ZN", "CHAS"])
df.head()
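Besides a list of positions or names, `usecols` also accepts a callable that is tested against each column name. A minimal sketch, using an in-memory CSV with made-up values in place of the housing file:

```python
import io

import pandas as pd

# tiny in-memory stand-in for housing_preprocessed.csv (values are made up)
csv_file = io.StringIO("CRIM,ZN,CHAS\n0.1,18.0,0\n0.2,0.0,1\n")

# the callable receives each column name; only matching columns are loaded
df = pd.read_csv(csv_file, usecols=lambda c: c in {"CRIM", "CHAS"})
print(df.columns.tolist())  # ['CRIM', 'CHAS']
```

This is handy when the keep-list is long or rule-based (e.g. "every column not starting with an underscore").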
37. How to get the nrows, ncolumns, datatype, and summary stats of each column of a dataframe? Also get the array and list equivalent.

# input
# use the "housing_preprocessed.csv" file

df = pd.read_csv("housing_preprocessed.csv", index_col=0, skipfooter=1, engine="python")
# number of rows and columns
df.shape

# each type of column
df.dtypes

# info() gives a combined overview: row count, columns, dtypes and memory usage
df.info()

# how many columns under each dtype
# (get_dtype_counts() was removed in pandas 1.0; use value_counts on dtypes)
df.dtypes.value_counts()

# all the statistics
df.describe()
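The question also asks for the array and list equivalents, which the solution above does not show. A minimal sketch on a small stand-in dataframe (made-up values):

```python
import numpy as np
import pandas as pd

# small stand-in for the housing dataframe (values are made up)
df = pd.DataFrame({"CRIM": [0.1, 0.2], "ZN": [18.0, 0.0]})

# array equivalent: to_numpy() is the modern spelling of df.values
arr = df.to_numpy()

# list equivalent: nested lists, one inner list per row
lst = arr.tolist()
print(lst)  # [[0.1, 18.0], [0.2, 0.0]]
```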
38. How to extract the row and column number of a particular cell with given criterion?
# input
# use the "housing_preprocessed.csv" file

# solution 1
import numpy as np

df = pd.read_csv("housing_preprocessed.csv", skipfooter=1, engine="python", index_col=0)
# let's get the maximum value
max_tax = df["TAX"].max()
max_tax

# now let's find the column and cell that has the maximum value
df[df["TAX"] == max_tax]

# solution 2
df.loc[df["TAX"] == np.max(df["TAX"]), ["CRIM", "ZN", "TAX"]]

# solution 3
# get the row and column number
row, col = np.where(df.values == np.max(df["TAX"]))
for i, j in zip(row, col):
    print(i , j)

# Get the value
df.iat[row[0], col[0]]
df.iloc[row[0], col[0]]

# Alternates
df.at[row[0], 'TAX']
# df.get_value(row[0], 'TAX') was removed in pandas 1.0; use at/iat instead

# The difference between `iat`/`iloc` and `at`/`loc` is:
# `iat` and `iloc` accept row and column numbers,
# whereas `at` and `loc` accept index labels and column names.
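An alternative not in the solutions above: `idxmax` plus `Index.get_loc` recovers the row and column numbers without scanning the whole values array. A sketch on a made-up frame:

```python
import pandas as pd

# made-up stand-in for the housing data
df = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3], "TAX": [296, 666, 300]})

row_label = df["TAX"].idxmax()          # label of the row holding the max TAX
row_num = df.index.get_loc(row_label)   # positional row number
col_num = df.columns.get_loc("TAX")     # positional column number
print(row_num, col_num, df.iat[row_num, col_num])  # 1 1 666
```

Unlike the `np.where(df.values == ...)` approach, this cannot accidentally match an equal value in a different column.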
39. How to rename a specific column in a dataframe?
# input
# Rename the column Type as CarType in df and replace the ‘.’ in column names with ‘_’.
cars93 = pd.read_csv("../input/cars93/Cars93.csv", index_col=0)
cars93.head()

'''
Desired Output

Index(['Manufacturer', 'Model', 'CarType', 'Min_Price', 'Price', 'Max_Price',
        'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders',
        'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail',
        'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
        'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin',
        'Make'],
       dtype='object')
'''

# Solution 1: in 2 steps
# Step1
# first let's rename the Type to CarType
cars93 = pd.read_csv("../input/cars93/Cars93.csv", index_col=0)
cars93.rename(columns={"Type":"CarType"}, inplace = True)
cols = cars93.columns
# or overwrite the name by position directly on the columns array
cars93.columns.values[2] = "CarType"
# Step2
# replace the "." with "_"
cols = list(map(lambda x: x.replace(".", "_"), cols))
cars93.columns = cols
cars93.head()

# Solution 2: working only with lists
cars93 = pd.read_csv("../input/cars93/Cars93.csv", index_col=0)
cols = cars93.columns
cols = list(map(lambda x: x.replace(".", "_"), cols))
cols[cols.index("Type")] = "CarType"
cars93.columns = cols
cars93.head()
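Both steps can also be written with the vectorized string methods on the columns `Index`. A sketch on a stub frame carrying just a few of the Cars93 column names:

```python
import pandas as pd

# stub frame with a few of the Cars93 column names
df = pd.DataFrame(columns=["Type", "Min.Price", "Max.Price"])

df = df.rename(columns={"Type": "CarType"})
# Index.str.replace works on every column name at once
df.columns = df.columns.str.replace(".", "_", regex=False)
print(df.columns.tolist())  # ['CarType', 'Min_Price', 'Max_Price']
```

`regex=False` matters here: without it the `.` would match any character.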
40. How to check if a dataframe has any missing values?
# input
df = pd.read_csv("../input/cars93/Cars93.csv")
df

# Solution 1
print("Our df has a total of {} null values".format(df.isnull().sum().sum()))
print()

# Solution 2
df.isnull().values.any()
print()

# Solution 3
# A more detailed one
def report_nulls(df):
    '''
    Show a fast report of the DF.
    '''
    rows = df.shape[0]
    columns = df.shape[1]
    null_cols = 0
    list_of_nulls_cols = []
    for col in list(df.columns):
        null_values_rows = df[col].isnull().sum()
        null_rows_pcn = round(((null_values_rows)/rows)*100, 2)
        col_type = df[col].dtype
        if null_values_rows > 0:
            print("The column {} has {} null values. It is {}% of total rows.".format(col, null_values_rows, null_rows_pcn))
            print("The column {} is of type {}.\n".format(col, col_type))
            null_cols += 1
            list_of_nulls_cols.append(col)
    null_cols_pcn = round((null_cols/columns)*100, 2)
    print("The DataFrame has {} columns with null values. It is {}% of total columns.".format(null_cols, null_cols_pcn))
    return list_of_nulls_cols

report_nulls(df)
41. How to count the number of missing values in each column?

Count the number of missing values in each column of df. Which column has the maximum number of missing values?

# input
df = pd.read_csv("../input/cars93/Cars93.csv")

# Solution 1
df_null = pd.DataFrame(df.isnull().sum())
# idxmax returns the label of the worst column; argmax would return its position
df_null[df_null[0] > 0][0].idxmax()

# Solution 2
# find the total number of nulls per column
n_missings_each_col = df.apply(lambda x: x.isnull().sum())

# find the column with the most nulls (idxmax returns its label)
n_missings_each_col.idxmax()
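Since `isnull().sum()` already works column-wise, the `apply` in solution 2 is optional. A compact sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# toy frame with a known null pattern
df = pd.DataFrame({"a": [1, np.nan, np.nan], "b": [1, 2, np.nan], "c": [1, 2, 3]})

counts = df.isnull().sum()   # nulls per column
worst = counts.idxmax()      # column with the most nulls
print(counts.tolist(), worst)  # [2, 1, 0] a
```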
42. How to replace missing values of multiple numeric columns with the mean?

Replace missing values in the Luggage.room column with its mean.

# input
df = pd.read_csv("../input/cars93/Cars93.csv")

# Solution 1
beg_null = df.isnull().sum().sum()
print(beg_null)
# note: selecting with a list (double brackets) keeps a DataFrame, so apply works column-wise
df[["Luggage.room"]] = df[["Luggage.room"]].apply(lambda x: x.fillna(x.mean()))
end_null = df.isnull().sum().sum()
print(end_null)

print("We have got rid of {} null values, filling them with the mean.".format(beg_null - end_null))
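For many numeric columns at once, `fillna` also accepts a Series of per-column fill values, so the means can be applied in a single call. A sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# made-up numbers with one null per column
df = pd.DataFrame({"Luggage.room": [10.0, np.nan, 14.0],
                   "Weight": [2705.0, 3560.0, np.nan]})

# df.mean() returns a Series indexed by column name; fillna aligns on it
df = df.fillna(df.mean(numeric_only=True))
print(df["Luggage.room"].tolist())  # [10.0, 12.0, 14.0]
```

`numeric_only=True` keeps the call safe when string columns are present.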
43. How to use apply function on existing columns with global variables as additional arguments?

In df, use the apply method to replace the missing values in Rear.seat.room with the mean and in Luggage.room with the median, passing the fill strategy as an argument to the function.

# input
df = pd.read_csv("../input/cars93/Cars93.csv")

# Solution 1
print("We have a total of {} nulls".format(df.isnull().sum().sum()))

d = {'Rear.seat.room': np.nanmean, 'Luggage.room': np.nanmedian}
df[['Rear.seat.room', 'Luggage.room']] = df[['Rear.seat.room', 'Luggage.room']].apply(lambda x, d: x.fillna(d[x.name](x)), args=(d, ))

print("We have a total of {} nulls".format(df.isnull().sum().sum()))

df["Rear.seat.room"].sum()
df["Luggage.room"].sum()


# Solution 2
# import the df
df = pd.read_csv("../input/cars93/Cars93.csv")

# check nulls
print("We have a total of {} nulls".format(df.isnull().sum().sum()))

# define a custom function
def num_inputer(x, strategy):
    if strategy.lower() == "mean":
        x = x.fillna(value = np.nanmean(x))
    if strategy.lower() == "median":
        x = x.fillna(value = np.nanmedian(x))
    return x

# apply the custom function; using args we can pass the strategy we want
df['Rear.seat.room'] = df[['Rear.seat.room']].apply(num_inputer, args = ["mean"])
df['Luggage.room'] = df[['Luggage.room']].apply(num_inputer, args = ["median"])

# check for nulls
print("We have a total of {} nulls".format(df.isnull().sum().sum()))

df["Rear.seat.room"].sum()
df["Luggage.room"].sum()
44. How to select a specific column from a dataframe as a dataframe instead of a series?

Get the first column (a) in df as a dataframe (rather than as a Series).

# input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Solution
# using to_frame()
type(df["a"].to_frame())
# using pandas DataFrame
type(pd.DataFrame(df["a"]))

# Other solutions
# Solution
type(df[['a']])
type(df.loc[:, ['a']])
type(df.iloc[:, [0]])

# Alternately, the following return a Series
type(df.a)
type(df['a'])
type(df.loc[:, 'a'])
type(df.iloc[:, 0])
45. How to change the order of columns of a dataframe?

Actually 3 questions.

  1. In df, interchange columns 'a' and 'c'.

  2. Create a generic function to interchange two columns, without hardcoding column names.

3. Sort the columns in reverse alphabetical order, that is, column 'e' first through column 'a' last.

# input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Solution to question 1
# we pass a list with the desired order, BUT THIS DOESN'T change df in place
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df[["c", "b", "a", "d", "e"]]
df

# if we reassign, it works
df = df[["c", "b", "a", "d", "e"]]
df

# Solution to question 2
def change_cols(df, col1, col2):
    df_columns = df.columns.to_list()
    index1 = df_columns.index(col1)
    index2 = df_columns.index(col2)
    # swapping values
    df_columns[index1], df_columns[index2] = col2, col1

    return df[df_columns]


df = change_cols(df, "b", "e")
df


# Solution to question 3
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
col_list = list(df.columns)
col_list_reversed = col_list[::-1]
col_list
col_list_reversed
# using the trick from solution 1
df = df[col_list_reversed]
df


print("Solution from the website")
print("-------------------------")
# Others solution from the website

# Input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Solution Q1
df[list('cbade')]

# Solution Q2 - No hard coding
def switch_columns(df, col1=None, col2=None):
    colnames = df.columns.tolist()
    i1, i2 = colnames.index(col1), colnames.index(col2)
    colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
    return df[colnames]

df1 = switch_columns(df, 'a', 'c')

# Solution Q3
df[sorted(df.columns, reverse=True)]
# or
df.sort_index(axis=1, ascending=False, inplace=True)
46. How to set the number of rows and columns displayed in the output?

Change the pandas display settings so that printing the dataframe df shows a maximum of 10 rows and 10 columns.

# input
df = pd.read_csv("../input/cars93/Cars93.csv")

# we use set_option to set the maximum rows and columns to display
pd.set_option("display.max_columns",10)
pd.set_option("display.max_rows",10)
df
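`set_option` changes the settings globally; if the limits are only needed temporarily, `option_context` restores them automatically when the block exits. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(400).reshape(20, 20))

# the display limits apply only inside the with-block
with pd.option_context("display.max_rows", 10, "display.max_columns", 10):
    text = repr(df)

# the truncated repr marks the hidden rows/columns with "..."
print("..." in text)  # True
```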
47. How to format or suppress scientific notations in a pandas dataframe?

Suppress scientific notations like ‘e-03’ in df and print upto 4 numbers after decimal.

# input
df = pd.DataFrame(np.random.random(4)**10, columns=['random'])

'''
Desired Output
#>    random
#> 0  0.0035
#> 1  0.0000
#> 2  0.0747
#> 3  0.0000
'''

print("Initial DF")
df
print("Using solution 1")
# Solution 1
df.round(4)  # returns a rounded copy; df itself is unchanged
pd.reset_option('^display.', silent=True)

print("Using solution 2")
# Solution 2
# format each value element-wise as a string
df['random'].apply(lambda x: '%.4f' % x).to_frame()
df
pd.reset_option('^display.', silent=True)

print("Using solution 3")
# Solution 3
pd.set_option('display.float_format', lambda x: '%.4f'%x)
df
pd.reset_option('^display.', silent=True)
df
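Note that `display.float_format` only changes how values are printed; the stored floats are untouched, which can be checked directly (sketch with fixed values instead of random ones):

```python
import pandas as pd

df = pd.DataFrame({"random": [3.4e-3, 7.47e-2]})

with pd.option_context("display.float_format", "{:.4f}".format):
    text = repr(df)

print("0.0034" in text)  # True: printed with 4 decimals
# the underlying value is unchanged
assert df["random"].iloc[0] == 3.4e-3
```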
48. How to format all the values in a dataframe as percentages?

Format the values in column 'random' of df as percentages.

# input
df = pd.DataFrame(np.random.random(4), columns=['random'])
df

# Solution 1
# Using style.format we can pass a dict mapping column names to formatters
out = df.style.format({
    'random': '{0:.2%}'.format,
})
out

# This float_format applies to the display of every dataframe
pd.options.display.float_format = '{:,.2f}%'.format
# to get the % multiply by 100
df*100
pd.reset_option('^display.', silent=True)
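To keep the formatted percentages as data rather than as a display setting, `map` with a format string is a simple alternative (sketch):

```python
import pandas as pd

df = pd.DataFrame({"random": [0.25, 0.5]})

# "{:.2%}" multiplies by 100 and appends the percent sign
df["pct"] = df["random"].map("{:.2%}".format)
print(df["pct"].tolist())  # ['25.00%', '50.00%']
```

The trade-off: the new column holds strings, so it can no longer be used in arithmetic.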
49. How to filter every nth row in a dataframe?

From df, filter the 'Manufacturer', 'Model' and 'Type' columns for every 20th row, starting from the 1st (row 0).

# input
df = pd.read_csv("../input/cars93/Cars93.csv")
df

# First let's import only the columns we need
df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type"])

# Solution 1
# Using normal python slicing
df[::20]

df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type"])

# Solution 2
# Using iloc
df.iloc[::20, :][['Manufacturer', 'Model', 'Type']]
50. How to create a primary key index by combining relevant columns?

In df, replace NaNs with 'missing' in columns 'Manufacturer', 'Model' and 'Type', create an index as the combination of these three columns, and check whether that index is a primary key.

# input
df = pd.read_csv("../input/cars93/Cars93.csv")
df

# Solution
df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type", "Min.Price", "Max.Price"])

# let's check if we have nulls
df.isnull().sum().sum()
# fillna returns a copy, so reassign it
df = df.fillna("missing")
# create new index
df["new_index"] = df["Manufacturer"] + df["Model"] + df["Type"]
# set new index
df.set_index("new_index", inplace = True)
# a primary key must be unique across all rows
df.index.is_unique
df
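If the uniqueness check fails, `Index.duplicated` pinpoints the offending keys. A sketch on made-up rows containing a deliberate duplicate:

```python
import pandas as pd

# made-up rows with a deliberate duplicate key
df = pd.DataFrame({"Manufacturer": ["Acura", "Acura", "Acura"],
                   "Model": ["Integra", "Legend", "Integra"],
                   "Type": ["Small", "Midsize", "Small"]})
df["new_index"] = df["Manufacturer"] + df["Model"] + df["Type"]
df = df.set_index("new_index")

print(df.index.is_unique)                        # False
# duplicated() marks repeats after the first occurrence
print(df.index[df.index.duplicated()].tolist())  # ['AcuraIntegraSmall']
```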

Thanks for your support; original writing takes effort, so shares, likes, and bookmarks are appreciated. More Python, data analysis, and AI tutorials by the author (郭震) are available at https://ai-jupyter.com/
