Python Data Analysis Exercises (Based on the pandas & numpy Modules) (Part 2)

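Exercise Index

Exercise number | Content | Dataset

  • Exercise 6 - Statistics | Exploring wind speed data | wind.data
  • Exercise 7 - Visualization | Exploring the Titanic disaster data | train.csv
  • Exercise 8 - Creating DataFrames | Exploring Pokemon data | data entered by hand in the exercise
  • Exercise 9 - Time series | Exploring Apple stock price data | Apple_stock.csv
  • Exercise 10 - Deleting data | Exploring the Iris flower data | iris.csv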

Exercise 6 - Statistics

Exploring Wind Speed Data

Step 1 Import the necessary libraries
import pandas as pd
import datetime
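Step 2 Load the data from the following path
path6 = "./exercise_data/wind.data"
Step 3 Store the data and set the first three columns as a suitable index
# Columns 0, 1 and 2 (year, month, day) are merged into a single date column, Yr_Mo_Dy.
# Note: the nested-list form of parse_dates is deprecated in recent pandas releases.
data = pd.read_table(path6, sep = r"\s+", parse_dates = [[0,1,2]])
data.head()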
     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  2061-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  2061-01-02  14.71    NaN  10.83   6.50  12.62   7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  2061-01-03  18.50  16.88  12.33  10.13  11.17   6.17  11.25    NaN   8.50   7.67  12.75  12.71
3  2061-01-04  10.58   6.63  11.75   4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46  10.88
4  2061-01-05  13.33  13.25  11.42   6.17  10.71   8.21  11.92   6.54  10.92  10.34  12.92  11.83
Step 4 The year 2061? Do we really have data for that year? Write a function and use it to fix this bug
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)

# data.info()
data.head()
     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  1961-01-02  14.71    NaN  10.83   6.50  12.62   7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  1961-01-03  18.50  16.88  12.33  10.13  11.17   6.17  11.25    NaN   8.50   7.67  12.75  12.71
3  1961-01-04  10.58   6.63  11.75   4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46  10.88
4  1961-01-05  13.33  13.25  11.42   6.17  10.71   8.21  11.92   6.54  10.92  10.34  12.92  11.83
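The same correction can also be written without a row-by-row apply. The following is a minimal sketch on a few made-up dates (the sample values and the names dates, needs_fix and fixed are illustrative assumptions, not part of the exercise data); it reuses the 1989 cutoff from fix_century.

```python
import pandas as pd

# Hypothetical sample: two-digit years that were parsed into the wrong century,
# plus two dates that are already correct.
dates = pd.Series(pd.to_datetime(["2061-01-01", "2068-12-31", "1969-01-01", "1978-12-31"]))

# Flag the implausibly late years (same 1989 cutoff as fix_century).
needs_fix = dates.dt.year > 1989

# Keep the dates that are fine, shift the flagged ones back by 100 years.
fixed = dates.where(~needs_fix, dates - pd.DateOffset(years=100))

print(fixed.dt.year.tolist())  # [1961, 1968, 1969, 1978]
```

Because the result stays a datetime column, this variant would not need the extra pd.to_datetime conversion before the column is used as an index.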
Step 5 Set the date as the index; mind the data type, which should be datetime64[ns]
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')

data.head()
# data.info()
              RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04
1961-01-02  14.71    NaN  10.83   6.50  12.62   7.67  11.50  10.04   9.79   9.67  17.54  13.83
1961-01-03  18.50  16.88  12.33  10.13  11.17   6.17  11.25    NaN   8.50   7.67  12.75  12.71
1961-01-04  10.58   6.63  11.75   4.58   4.54   2.88   8.63   1.79   5.83   5.88   5.46  10.88
1961-01-05  13.33  13.25  11.42   6.17  10.71   8.21  11.92   6.54  10.92  10.34  12.92  11.83
Step 6 How many values are missing for each location?
data.isnull().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
Step 7 How many non-missing values are there for each location?
data.shape[0] - data.isnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
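Subtracting the missing counts from the row count is one way to get there; pandas also counts non-missing values directly. A small sketch on a made-up frame (toy and its values are assumptions for illustration) showing that both routes agree:

```python
import numpy as np
import pandas as pd

# Hypothetical two-location frame with one missing reading.
toy = pd.DataFrame({"RPT": [1.0, np.nan, 3.0], "VAL": [1.0, 2.0, 3.0]})

# Route used above: total rows minus missing values per column.
by_subtraction = toy.shape[0] - toy.isnull().sum()

# Direct route: count() returns the number of non-missing values per column.
by_count = toy.count()

print(by_count.to_dict())  # {'RPT': 2, 'VAL': 3}
```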
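Step 8 Compute the mean wind speed over the whole dataset
# This is the mean of the per-location means; because the locations have different
# numbers of missing values, it can differ slightly from the mean of all individual readings.
data.mean().mean()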
10.227982360836924
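To see why the mean of column means and the mean of all individual values need not coincide when columns have different numbers of missing entries, here is a deliberately exaggerated sketch (toy and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column B has only one valid reading.
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, np.nan, np.nan]})

# Mean of the column means: (2.0 + 10.0) / 2
mean_of_column_means = toy.mean().mean()

# Mean over every non-missing value: (1 + 2 + 3 + 10) / 4
mean_of_all_values = np.nanmean(toy.to_numpy())

print(mean_of_column_means, mean_of_all_values)  # 6.0 4.0
```

In the wind data only a handful of values are missing per location, so the two numbers would be expected to be very close.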
Step 9 Create a DataFrame named loc_stats that computes and stores the minimum, maximum, mean and standard deviation of the wind speed at each location
loc_stats = pd.DataFrame()

loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max 
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations

loc_stats
      min    max       mean       std
RPT  0.67  35.80  12.362987  5.618413
VAL  0.21  33.37  10.644314  5.267356
ROS  1.50  33.84  11.660526  5.008450
KIL  0.00  28.46   6.306468  3.605811
SHA  0.13  37.54  10.455834  4.936125
BIR  0.00  26.16   7.092254  3.968683
DUB  0.00  30.37   9.797343  4.977555
CLA  0.00  31.08   8.495053  4.499449
MUL  0.00  25.88   8.493590  4.166872
CLO  0.04  28.21   8.707332  4.503954
BEL  0.13  42.38  13.121007  5.835037
MAL  0.67  42.54  15.599079  6.699794
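The four column assignments can also be collapsed into a single agg call. A minimal sketch on a made-up frame (toy and its values are assumptions; the column names merely mimic two of the locations):

```python
import pandas as pd

# Hypothetical readings for two locations.
toy = pd.DataFrame({"RPT": [1.0, 3.0, 5.0], "VAL": [2.0, 2.0, 8.0]})

# agg with a list of names returns one row per statistic;
# transposing gives one row per location, as in loc_stats.
loc_stats_toy = toy.agg(["min", "max", "mean", "std"]).T

print(loc_stats_toy.loc["RPT"].tolist())  # [1.0, 5.0, 3.0, 2.0]
```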
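Step 10 Create a DataFrame named day_stats that computes and stores the minimum, maximum, mean and standard deviation of the wind speed across all locations for each day
day_stats = pd.DataFrame()

# axis = 1 aggregates across the columns, giving one value per row (one per day)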
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max 
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations

day_stats.head()
             min    max       mean       std
Yr_Mo_Dy
1961-01-01  9.29  18.50  13.018182  2.808875
1961-01-02  6.50  17.54  11.336364  3.188994
1961-01-03  6.17  18.50  11.641818  3.681912
1961-01-04  1.79  11.75   6.619167  3.198126
1961-01-05  6.17  13.33  10.630000  2.445356
Step 11 For each location, compute the average wind speed in January

Note: the code below pools the January records of every year, so January 1961 and January 1962 both count as January.

data['date'] = data.index

# create one column for each component of the date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)

# select all rows from month 1 and assign them to january_winds
january_winds = data.query('month == 1')

# take the mean of january_winds; .loc keeps month, year and day out of the result
january_winds.loc[:,'RPT':"MAL"].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
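Since the index is already a DatetimeIndex, the January rows can also be selected without adding helper columns. A small sketch on made-up data (toy and its values are assumptions), which also shows how one could keep the years apart if that were wanted:

```python
import pandas as pd

# Hypothetical readings: two January days in different years, plus two other months.
idx = pd.to_datetime(["1961-01-01", "1961-02-01", "1962-01-01", "1962-03-01"])
toy = pd.DataFrame({"RPT": [10.0, 99.0, 20.0, 99.0]}, index=idx)

# Boolean mask built from the month of each index entry.
january = toy[toy.index.month == 1]

# Pooled January mean (all years together).
pooled = january.mean()

# January mean per year, in case the years should be treated separately.
per_year = january.groupby(january.index.year).mean()

print(pooled["RPT"])               # 15.0
print(per_year["RPT"].to_dict())   # {1961: 10.0, 1962: 20.0}
```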
Step 12 Downsample the records to a yearly frequency
data.query('month == 1 and day == 1')
              RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL        date  month  year  day
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04  1961-01-01      1  1961    1
1962-01-01   9.29   3.42  11.54   3.50   2.21   1.96  10.41   2.79   3.54   5.17   4.38   7.92  1962-01-01      1  1962    1
1963-01-01  15.59  13.62  19.79   8.38  12.25  10.00  23.45  15.71  13.59  14.37  17.58  34.13  1963-01-01      1  1963    1
1964-01-01  25.80  22.13  18.21  13.25  21.29  14.79  14.12  19.58  13.25  16.75  28.96  21.00  1964-01-01      1  1964    1
1965-01-01   9.54  11.92   9.00   4.38   6.08   5.21  10.25   6.08   5.71   8.63  12.04  17.41  1965-01-01      1  1965    1
1966-01-01  22.04  21.50  17.08  12.75  22.17  15.59  21.79  18.12  16.66  17.83  28.33  23.79  1966-01-01      1  1966    1
1967-01-01   6.46   4.46   6.50   3.21   6.67   3.79  11.38   3.83   7.71   9.08  10.67  20.91  1967-01-01      1  1967    1
1968-01-01  30.04  17.88  16.25  16.25  21.79  12.54  18.16  16.62  18.75  17.62  22.25  27.29  1968-01-01      1  1968    1
1969-01-01   6.13   1.63   5.41   1.08   2.54   1.00   8.50   2.42   4.58   6.34   9.17  16.71  1969-01-01      1  1969    1
1970-01-01   9.59   2.96  11.79   3.42   6.13   4.08   9.00   4.46   7.29   3.50   7.33  13.00  1970-01-01      1  1970    1
1971-01-01   3.71   0.79   4.71   0.17   1.42   1.04   4.63   0.75   1.54   1.08   4.21   9.54  1971-01-01      1  1971    1
1972-01-01   9.29   3.63  14.54   4.25   6.75   4.42  13.00   5.33  10.04   8.54   8.71  19.17  1972-01-01      1  1972    1
1973-01-01  16.50  15.92  14.62   7.41   8.29  11.21  13.54   7.79  10.46  10.79  13.37   9.71  1973-01-01      1  1973    1
1974-01-01  23.21  16.54  16.08   9.75  15.83  11.46   9.54  13.54  13.83  16.66  17.21  25.29  1974-01-01      1  1974    1
1975-01-01  14.04  13.54  11.29   5.46  12.58   5.58   8.12   8.96   9.29   5.17   7.71  11.63  1975-01-01      1  1975    1
1976-01-01  18.34  17.67  14.83   8.00  16.62  10.13  13.17   9.04  13.13   5.75  11.38  14.96  1976-01-01      1  1976    1
1977-01-01  20.04  11.92  20.25   9.13   9.29   8.04  10.75   5.88   9.00   9.00  14.88  25.70  1977-01-01      1  1977    1
1978-01-01   8.33   7.12   7.71   3.54   8.50   7.50  14.71  10.00  11.83  10.00  15.09  20.46  1978-01-01      1  1978    1
Step 13 Downsample the records to a monthly frequency
data.query('day == 1')
              RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL        date  month  year  day
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04  1961-01-01      1  1961    1
1961-02-01  14.25  15.12   9.04   5.88  12.08   7.17  10.17   3.63   6.50   5.50   9.17   8.00  1961-02-01      2  1961    1
1961-03-01  12.67  13.13  11.79   6.42   9.79   8.54  10.25  13.29    NaN  12.21  20.62    NaN  1961-03-01      3  1961    1
1961-04-01   8.38   6.34   8.33   6.75   9.33   9.54  11.67   8.21  11.21   6.46  11.96   7.17  1961-04-01      4  1961    1
1961-05-01  15.87  13.88  15.37   9.79  13.46  10.17   9.96  14.04   9.75   9.92  18.63  11.12  1961-05-01      5  1961    1
1961-06-01  15.92   9.59  12.04   8.79  11.54   6.04   9.75   8.29   9.33  10.34  10.67  12.12  1961-06-01      6  1961    1
1961-07-01   7.21   6.83   7.71   4.42   8.46   4.79   6.71   6.00   5.79   7.96   6.96   8.71  1961-07-01      7  1961    1
1961-08-01   9.59   5.09   5.54   4.63   8.29   5.25   4.21   5.25   5.37   5.41   8.38   9.08  1961-08-01      8  1961    1
1961-09-01   5.58   1.13   4.96   3.04   4.25   2.25   4.63   2.71   3.67   6.00   4.79   5.41  1961-09-01      9  1961    1
1961-10-01  14.25  12.87   7.87   8.00  13.00   7.75   5.83   9.00   7.08   5.29  11.79   4.04  1961-10-01     10  1961    1
1961-11-01  13.21  13.13  14.33   8.54  12.17  10.21  13.08  12.17  10.92  13.54  20.17  20.04  1961-11-01     11  1961    1
1961-12-01   9.67   7.75   8.00   3.96   6.00   2.75   7.25   2.50   5.58   5.58   7.79  11.17  1961-12-01     12  1961    1
1962-01-01   9.29   3.42  11.54   3.50   2.21   1.96  10.41   2.79   3.54   5.17   4.38   7.92  1962-01-01      1  1962    1
1962-02-01  19.12  13.96  12.21  10.58  15.71  10.63  15.71  11.08  13.17  12.62  17.67  22.71  1962-02-01      2  1962    1
1962-03-01   8.21   4.83   9.00   4.83   6.00   2.21   7.96   1.87   4.08   3.92   4.08   5.41  1962-03-01      3  1962    1
1962-04-01  14.33  12.25  11.87  10.37  14.92  11.00  19.79  11.67  14.09  15.46  16.62  23.58  1962-04-01      4  1962    1
1962-05-01   9.62   9.54   3.58   3.33   8.75   3.75   2.25   2.58   1.67   2.37   7.29   3.25  1962-05-01      5  1962    1
1962-06-01   5.88   6.29   8.67   5.21   5.00   4.25   5.91   5.41   4.79   9.25   5.25  10.71  1962-06-01      6  1962    1
1962-07-01   8.67   4.17   6.92   6.71   8.17   5.66  11.17   9.38   8.75  11.12  10.25  17.08  1962-07-01      7  1962    1
1962-08-01   4.58   5.37   6.04   2.29   7.87   3.71   4.46   2.58   4.00   4.79   7.21   7.46  1962-08-01      8  1962    1
1962-09-01  10.00  12.08  10.96   9.25   9.29   7.62   7.41   8.75   7.67   9.62  14.58  11.92  1962-09-01      9  1962    1
1962-10-01  14.58   7.83  19.21  10.08  11.54   8.38  13.29  10.63   8.21  12.92  18.05  18.12  1962-10-01     10  1962    1
1962-11-01  16.88  13.25  16.00   8.96  13.46  11.46  10.46  10.17  10.37  13.21  14.83  15.16  1962-11-01     11  1962    1
1962-12-01  18.38  15.41  11.75   6.79  12.21   8.04   8.42  10.83   5.66   9.08  11.50  11.50  1962-12-01     12  1962    1
1963-01-01  15.59  13.62  19.79   8.38  12.25  10.00  23.45  15.71  13.59  14.37  17.58  34.13  1963-01-01      1  1963    1
1963-02-01  15.41   7.62  24.67  11.42   9.21   8.17  14.04   7.54   7.54  10.08  10.17  17.67  1963-02-01      2  1963    1
1963-03-01  16.75  19.67  17.67   8.87  19.08  15.37  16.21  14.29  11.29   9.21  19.92  19.79  1963-03-01      3  1963    1
1963-04-01  10.54   9.59  12.46   7.33   9.46   9.59  11.79  11.87   9.79  10.71  13.37  18.21  1963-04-01      4  1963    1
1963-05-01  18.79  14.17  13.59  11.63  14.17  11.96  14.46  12.46  12.87  13.96  15.29  21.62  1963-05-01      5  1963    1
1963-06-01  13.37   6.87  12.00   8.50  10.04   9.42  10.92  12.96  11.79  11.04  10.92  13.67  1963-06-01      6  1963    1
...           ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...         ...    ...   ...  ...
1976-07-01   8.50   1.75   6.58   2.13   2.75   2.21   5.37   2.04   5.88   4.50   4.96  10.63  1976-07-01      7  1976    1
1976-08-01  13.00   8.38   8.63   5.83  12.92   8.25  13.00   9.42  10.58  11.34  14.21  20.25  1976-08-01      8  1976    1
1976-09-01  11.87  11.00   7.38   6.87   7.75   8.33  10.34   6.46  10.17   9.29  12.75  19.55  1976-09-01      9  1976    1
1976-10-01  10.96   6.71  10.41   4.63   7.58   5.04   5.04   5.54   6.50   3.92   6.79   5.00  1976-10-01     10  1976    1
1976-11-01  13.96  15.67  10.29   6.46  12.79   9.08  10.00   9.67  10.21  11.63  23.09  21.96  1976-11-01     11  1976    1
1976-12-01  13.46  16.42   9.21   4.54  10.75   8.67  10.88   4.83   8.79   5.91   8.83  13.67  1976-12-01     12  1976    1
1977-01-01  20.04  11.92  20.25   9.13   9.29   8.04  10.75   5.88   9.00   9.00  14.88  25.70  1977-01-01      1  1977    1
1977-02-01  11.83   9.71  11.00   4.25   8.58   8.71   6.17   5.66   8.29   7.58  11.71  16.50  1977-02-01      2  1977    1
1977-03-01   8.63  14.83  10.29   3.75   6.63   8.79   5.00   8.12   7.87   6.42  13.54  13.67  1977-03-01      3  1977    1
1977-04-01  21.67  16.00  17.33  13.59  20.83  15.96  25.62  17.62  19.41  20.67  24.37  30.09  1977-04-01      4  1977    1
1977-05-01   6.42   7.12   8.67   3.58   4.58   4.00   6.75   6.13   3.33   4.50  19.21  12.38  1977-05-01      5  1977    1
1977-06-01   7.08   5.25   9.71   2.83   2.21   3.50   5.29   1.42   2.00   0.92   5.21   5.63  1977-06-01      6  1977    1
1977-07-01  15.41  16.29  17.08   6.25  11.83  11.83  12.29  10.58  10.41   7.21  17.37   7.83  1977-07-01      7  1977    1
1977-08-01   4.33   2.96   4.42   2.33   0.96   1.08   4.96   1.87   2.33   2.04  10.50   9.83  1977-08-01      8  1977    1
1977-09-01  17.37  16.33  16.83   8.58  14.46  11.83  15.09  13.92  13.29  13.88  23.29  25.17  1977-09-01      9  1977    1
1977-10-01  16.75  15.34  12.25   9.42  16.38  11.38  18.50  13.92  14.09  14.46  22.34  29.67  1977-10-01     10  1977    1
1977-11-01  16.71  11.54  12.17   4.17   8.54   7.17  11.12   6.46   8.25   6.21  11.04  15.63  1977-11-01     11  1977    1
1977-12-01  13.37  10.92  12.42   2.37   5.79   6.13   8.96   7.38   6.29   5.71   8.54  12.42  1977-12-01     12  1977    1
1978-01-01   8.33   7.12   7.71   3.54   8.50   7.50  14.71  10.00  11.83  10.00  15.09  20.46  1978-01-01      1  1978    1
1978-02-01  27.25  24.21  18.16  17.46  27.54  18.05  20.96  25.04  20.04  17.50  27.71  21.12  1978-02-01      2  1978    1
1978-03-01  15.04   6.21  16.04   7.87   6.42   6.67  12.29   8.00  10.58   9.33   5.41  17.00  1978-03-01      3  1978    1
1978-04-01   3.42   7.58   2.71   1.38   3.46   2.08   2.67   4.75   4.83   1.67   7.33  13.67  1978-04-01      4  1978    1
1978-05-01  10.54  12.21   9.08   5.29  11.00  10.08  11.17  13.75  11.87  11.79  12.87  27.16  1978-05-01      5  1978    1
1978-06-01  10.37  11.42   6.46   6.04  11.25   7.50   6.46   5.96   7.79   5.46   5.50  10.41  1978-06-01      6  1978    1
1978-07-01  12.46  10.63  11.17   6.75  12.92   9.04  12.42   9.62  12.08   8.04  14.04  16.17  1978-07-01      7  1978    1
1978-08-01  19.33  15.09  20.17   8.83  12.62  10.41   9.33  12.33   9.50   9.92  15.75  18.00  1978-08-01      8  1978    1
1978-09-01   8.42   6.13   9.87   5.25   3.21   5.71   7.25   3.50   7.33   6.50   7.62  15.96  1978-09-01      9  1978    1
1978-10-01   9.50   6.83  10.50   3.88   6.13   4.58   4.21   6.50   6.38   6.54  10.63  14.09  1978-10-01     10  1978    1
1978-11-01  13.59  16.75  11.25   7.08  11.04   8.33   8.17  11.29  10.75  11.25  23.13  25.00  1978-11-01     11  1978    1
1978-12-01  21.29  16.29  24.04  12.79  18.21  19.29  21.54  17.21  16.71  17.83  17.75  25.70  1978-12-01     12  1978    1

216 rows × 16 columns
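Steps 12 and 13 pick the first day of each year and of each month through the helper columns. A DatetimeIndex also exposes these conditions directly; the sketch below uses a made-up daily index (toy and its values are assumptions) to show the idea:

```python
import pandas as pd

# Hypothetical daily series covering two full years.
idx = pd.date_range("1961-01-01", "1962-12-31", freq="D")
toy = pd.DataFrame({"RPT": range(len(idx))}, index=idx)

# One row per year: keep only the rows that fall on January 1st.
yearly = toy[toy.index.is_year_start]

# One row per month: keep only the rows that fall on the 1st of a month.
monthly = toy[toy.index.is_month_start]

print(len(yearly), len(monthly))  # 2 24
```

Like the query-based version, this selects existing rows rather than aggregating, so a month whose first day is absent from the data would simply be missing from the result.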

Exercise 7 - Visualization

Exploring the Titanic Disaster Data

Step 1 Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
Step 2 Load the data from the following path
path7 = './exercise_data/train.csv' 
Step 3 Name the DataFrame titanic
titanic = pd.read_csv(path7)
titanic.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
Step 4 Set PassengerId as the index
# assign the result back so that the new index is actually kept
titanic = titanic.set_index('PassengerId')
titanic.head()
             Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
Step 5 Draw a pie chart showing the proportion of male and female passengers
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()

# put them into a list called proportions
proportions = [males, females]

# Create a pie chart
plt.pie(
    # using proportions
    proportions,
    
    # with the labels being the two sexes
    labels = ['Males', 'Females'],
    
    # with no shadows
    shadow = False,
    
    # with colors
    colors = ['blue','red'],
    
    # with one slice exploded out
    explode = (0.15 , 0),
    
    # with the start angle at 90 degrees
    startangle = 90,
    
    # with the percentage printed on each slice
    autopct = '%1.1f%%'
    )

# Use an equal aspect ratio so the pie is drawn as a circle
plt.axis('equal')

# Set the title
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()
plt.show()
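The two counts that feed the pie chart can also be obtained in one call. A minimal sketch on a made-up Sex column (the sample values and the names sex, counts and shares are assumptions for illustration):

```python
import pandas as pd

# Hypothetical Sex column with three men and one woman.
sex = pd.Series(["male", "female", "male", "male"], name="Sex")

# Absolute counts per category, most frequent first.
counts = sex.value_counts()

# Relative shares, i.e. the fractions a pie chart would display.
shares = sex.value_counts(normalize=True)

print(counts.to_dict())  # {'male': 3, 'female': 1}
print(shares.to_dict())  # {'male': 0.75, 'female': 0.25}
```

The counts Series could then be handed to plt.pie together with counts.index as labels, which avoids hard-coding the category order.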


Step 6 Draw a scatter plot of the ticket fare (Fare) against passenger age, colored by sex
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)

# set title
lm.set(title = 'Fare x Age')

# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)


Step 7 How many people survived?
titanic.Survived.sum()
342
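Because Survived is coded as 0 and 1, its sum is the number of survivors and its mean is the survival rate. A short sketch on a made-up frame (toy and its values are assumptions) that also splits the rate by sex:

```python
import pandas as pd

# Hypothetical passengers: 0 = did not survive, 1 = survived.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female"],
    "Survived": [0, 1, 1, 1],
})

survivors = toy["Survived"].sum()    # number of survivors
rate = toy["Survived"].mean()        # share of survivors

# Survival rate within each sex.
by_sex = toy.groupby("Sex")["Survived"].mean()

print(survivors, rate)    # 3 0.75
print(by_sex.to_dict())   # {'female': 1.0, 'male': 0.5}
```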
Step 8 Draw a histogram of the ticket fares
df = titanic.Fare.sort_values(ascending = False)
df

# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal

# create the plot
plt.hist(df, bins = binsVal)

# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')

# show the plot
plt.show()
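To check what the histogram is counting without drawing it, the same bin edges can be passed to np.histogram. A minimal sketch, assuming a handful of sample fares (the array below is illustrative, not the full column):

```python
import numpy as np

# A few sample fares (illustrative values only).
fares = np.array([7.25, 71.28, 7.92, 53.10, 8.05])

# Same bin edges as in the plot: 0, 10, 20, ..., 590.
bins = np.arange(0, 600, 10)

# counts[i] is the number of fares in [edges[i], edges[i+1]).
counts, edges = np.histogram(fares, bins=bins)

print(counts[0], counts[5], counts[7])  # 3 1 1
```

Note that np.arange(0, 600, 10) stops at 590, so there are 59 bins and any fare above 590 would fall outside them.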


Exercise 8 - Creating DataFrames

Exploring Pokemon Data

Step 1 Import the necessary libraries
import pandas as pd
Step 2 Create a data dictionary
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']                        
            }
Step 3 Store the data dictionary in a DataFrame named pokemon
pokemon = pd.DataFrame(raw_data)
pokemon.head()
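    evolution  hp        name pokedex   type
0     Ivysaur  45   Bulbasaur     yes  grass
1  Charmeleon  39  Charmander      no   fire
2   Wartortle  44    Squirtle     yes  water
3     Metapod  45    Caterpie      no    bug
Step 4 The columns above are in alphabetical order (as produced by older pandas versions; recent versions keep the dictionary's insertion order). Reorder them as name, type, hp, evolution, pokedex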
pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon
         name   type  hp   evolution pokedex
0   Bulbasaur  grass  45     Ivysaur     yes
1  Charmander   fire  39  Charmeleon      no
2    Squirtle  water  44   Wartortle     yes
3    Caterpie    bug  45     Metapod      no
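The column order can also be fixed at construction time through the columns argument of pd.DataFrame. A minimal sketch with a shortened, made-up dictionary (raw and df are illustrative names):

```python
import pandas as pd

# Shortened illustrative dictionary; the keys are deliberately not in the target order.
raw = {
    "hp": [45, 39],
    "name": ["Bulbasaur", "Charmander"],
    "type": ["grass", "fire"],
}

# columns= selects and orders the dictionary keys in one step.
df = pd.DataFrame(raw, columns=["name", "type", "hp"])

print(list(df.columns))  # ['name', 'type', 'hp']
```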
Step 5 Add a column named place
pokemon['place'] = ['park','street','lake','forest']
pokemon
         name   type  hp   evolution pokedex   place
0   Bulbasaur  grass  45     Ivysaur     yes    park
1  Charmander   fire  39  Charmeleon      no  street
2    Squirtle  water  44   Wartortle     yes    lake
3    Caterpie    bug  45     Metapod      no  forest
Step 6 Check the data type of each column
pokemon.dtypes
name         object
type         object
hp            int64
evolution    object
pokedex      object
place        object
dtype: object

Exercise 9 - Time Series

Exploring Apple Stock Price Data

Step 1 Import the necessary libraries
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt

%matplotlib inline
Step 2 Dataset path
path9 = './exercise_data/Apple_stock.csv'
Step 3 Read the data into a DataFrame named apple
apple = pd.read_csv(path9)
apple.head()
         Date   Open   High    Low  Close    Volume  Adj Close
0  2014-07-08  96.27  96.80  93.92  95.35  65130000      95.35
1  2014-07-07  94.14  95.99  94.10  95.97  56305400      95.97
2  2014-07-03  93.67  94.10  93.20  94.03  22891800      94.03
3  2014-07-02  93.87  94.06  93.09  93.48  28420900      93.48
4  2014-07-01  93.52  94.07  93.13  93.52  38170200      93.52
Step 4 Check the data type of each column
apple.dtypes
Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object
Step 5 Convert the Date column to a datetime type
apple.Date = pd.to_datetime(apple.Date)
apple['Date'].head()
0   2014-07-08
1   2014-07-07
2   2014-07-03
3   2014-07-02
4   2014-07-01
Name: Date, dtype: datetime64[ns]
Step 6 Set Date as the index
apple = apple.set_index('Date')
apple.head()
             Open   High    Low  Close    Volume  Adj Close
Date
2014-07-08  96.27  96.80  93.92  95.35  65130000      95.35
2014-07-07  94.14  95.99  94.10  95.97  56305400      95.97
2014-07-03  93.67  94.10  93.20  94.03  22891800      94.03
2014-07-02  93.87  94.06  93.09  93.48  28420900      93.48
2014-07-01  93.52  94.07  93.13  93.52  38170200      93.52
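Steps 5 and 6 can also be folded into the read itself, since read_csv accepts parse_dates and index_col. The sketch below reads from an in-memory string so that it runs without the data file; csv_text is a two-row excerpt with only two of the columns, used purely for illustration:

```python
import io
import pandas as pd

# Two-row excerpt standing in for Apple_stock.csv (illustrative, columns reduced).
csv_text = "Date,Close\n2014-07-08,95.35\n2014-07-07,95.97\n"

# Parse the Date column as datetimes and use it as the index in one call.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

print(type(df.index).__name__)  # DatetimeIndex
```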
Step 7 Are there any duplicate dates?
apple.index.is_unique
True
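is_unique only answers yes or no. If the answer were False, duplicated() would point at the offending entries. A small sketch on a made-up index (idx and its dates are assumptions):

```python
import pandas as pd

# Hypothetical index in which 2014-07-07 appears twice.
idx = pd.to_datetime(["2014-07-08", "2014-07-07", "2014-07-07"])

unique = idx.is_unique

# duplicated() marks every repeat after the first occurrence.
dup_mask = idx.duplicated()

print(unique)             # False
print(dup_mask.tolist())  # [False, False, True]
```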
Step 8 Sort the index in ascending order
# assign the result back so that apple itself is sorted
apple = apple.sort_index(ascending = True)
apple.head()
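             Open   High    Low  Close     Volume  Adj Close
Date
1980-12-12  28.75  28.87  28.75  28.75  117258400       0.45
1980-12-15  27.38  27.38  27.25  27.25   43971200       0.42
1980-12-16  25.37  25.37  25.25  25.25   26432000       0.39
1980-12-17  25.87  26.00  25.87  25.87   21610400       0.40
1980-12-18  26.63  26.75  26.63  26.63   18362400       0.41
Step 9 Get the last business day of each month
# 'BM' labels each month by its last business day; the values shown are that month's means.
# Recent pandas releases spell this alias 'BME'.
apple_month = apple.resample('BM').mean()
apple_month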
                  Open        High         Low       Close        Volume  Adj Close
Date
1980-12-31   30.481538   30.567692   30.443077   30.443077  2.586252e+07   0.473077
1981-01-30   31.754762   31.826667   31.654762   31.654762  7.249867e+06   0.493810
1981-02-27   26.480000   26.572105   26.407895   26.407895  4.231832e+06   0.411053
1981-03-31   24.937727   25.016818   24.836364   24.836364  7.962691e+06   0.387727
1981-04-30   27.286667   27.368095   27.227143   27.227143  6.392000e+06   0.423333
...                ...         ...         ...         ...           ...        ...
2014-03-31  533.593333  536.453810  530.070952  533.214286  5.954403e+07  75.750000
2014-04-30  540.081905  544.349048  536.262381  541.074286  7.660787e+07  76.867143
2014-05-30  601.301905  606.372857  598.332857  603.195714  6.828177e+07  86.058571
2014-06-30  222.360000  224.084286  220.735714  222.658095  5.745506e+07  91.885714
2014-07-31   94.294000   95.004000   93.488000   94.470000  4.218366e+07  94.470000

404 rows × 6 columns

Step 10 How many days lie between the earliest and the latest date in the dataset?
(apple.index.max() - apple.index.min()).days
12261
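Subtracting two timestamps yields a Timedelta, and its days attribute gives the whole-day difference. The sketch below repeats the calculation on the first and last dates shown in the outputs above (1980-12-12 and 2014-07-08), written out as literals so that it runs without the data file:

```python
import pandas as pd

# First and last dates as they appear in the outputs above.
first = pd.Timestamp("1980-12-12")
last = pd.Timestamp("2014-07-08")

# Timestamp - Timestamp gives a Timedelta.
span = last - first

print(span.days)  # 12261
```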
Step 11 How many months are there in the data?
apple_months = apple.resample('BM').mean()
len(apple_months.index)
404
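Resampling counts every calendar month between the first and the last record, including months that happen to contain no data. The following sketch contrasts that with the number of months that actually hold records, on a made-up index with a gap (idx and its dates are assumptions):

```python
import pandas as pd

# Hypothetical trading dates; February 1981 has no record at all.
idx = pd.to_datetime(["1980-12-12", "1980-12-15", "1981-01-02", "1981-03-02"])

# Months that contain at least one record.
months_with_data = idx.to_period("M").nunique()

# Calendar months spanned from the first to the last record, inclusive.
first, last = idx.min(), idx.max()
months_spanned = (last.year - first.year) * 12 + (last.month - first.month) + 1

print(months_with_data, months_spanned)  # 3 4
```

Applied to the Apple data's range from 1980-12 to 2014-07, the span formula gives 34 × 12 + (7 − 12) + 1 = 404, which agrees with the result above.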
Step 12 Plot the Adj Close values in chronological order
appl_open = apple['Adj Close'].plot(title = "Apple Stock")

# changes the size of the graph
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)


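Exercise 10 - Deleting Data

Exploring the Iris Flower Data

Step 1 Import the necessary libraries
import pandas as pd
import numpy as np  # needed for np.nan in Steps 6 and 9
Step 2 Dataset path
path10 ='./exercise_data/iris.csv' 
Step 3 Store the dataset in a variable named iris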
iris = pd.read_csv(path10)
iris.head()
   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa
Step 4 Assign column names to the DataFrame
iris = pd.read_csv(path10,names = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class'])
iris.head()
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
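The difference between the outputs of Steps 3 and 4 comes from how read_csv treats the first line: without names it is consumed as the header, with names it is kept as data. A minimal sketch on a two-line in-memory excerpt (csv_text and cols are illustrative names):

```python
import io
import pandas as pd

# Two-line excerpt of a header-less file, standing in for iris.csv.
csv_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# Without names: the first record is swallowed as column labels.
no_names = pd.read_csv(io.StringIO(csv_text))

# With names: both records are kept as data rows.
with_names = pd.read_csv(io.StringIO(csv_text), names=cols)

print(len(no_names), len(with_names))  # 1 2
```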
Step 5 Are there any missing values in the DataFrame?
pd.isnull(iris).sum()
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64
Step 6 Set rows 10 through 19 of the column petal_length to missing values
iris.iloc[10:20,2:3] = np.nan
iris.head(20)
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           NaN          0.2  Iris-setosa
11           4.8          3.4           NaN          0.2  Iris-setosa
12           4.8          3.0           NaN          0.1  Iris-setosa
13           4.3          3.0           NaN          0.1  Iris-setosa
14           5.8          4.0           NaN          0.2  Iris-setosa
15           5.7          4.4           NaN          0.4  Iris-setosa
16           5.4          3.9           NaN          0.4  Iris-setosa
17           5.1          3.5           NaN          0.3  Iris-setosa
18           5.7          3.8           NaN          0.3  Iris-setosa
19           5.1          3.8           NaN          0.3  Iris-setosa
Step 7 Replace all missing values with 1.0
# Assign the result back instead of calling fillna(inplace = True) on a selected column,
# which may not update iris under pandas' copy-on-write behaviour.
iris['petal_length'] = iris['petal_length'].fillna(1)
iris
iris
     sepal_length  sepal_width  petal_length  petal_width           class
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
5             5.4          3.9           1.7          0.4     Iris-setosa
6             4.6          3.4           1.4          0.3     Iris-setosa
7             5.0          3.4           1.5          0.2     Iris-setosa
8             4.4          2.9           1.4          0.2     Iris-setosa
9             4.9          3.1           1.5          0.1     Iris-setosa
10            5.4          3.7           1.0          0.2     Iris-setosa
11            4.8          3.4           1.0          0.2     Iris-setosa
12            4.8          3.0           1.0          0.1     Iris-setosa
13            4.3          3.0           1.0          0.1     Iris-setosa
14            5.8          4.0           1.0          0.2     Iris-setosa
15            5.7          4.4           1.0          0.4     Iris-setosa
16            5.4          3.9           1.0          0.4     Iris-setosa
17            5.1          3.5           1.0          0.3     Iris-setosa
18            5.7          3.8           1.0          0.3     Iris-setosa
19            5.1          3.8           1.0          0.3     Iris-setosa
20            5.4          3.4           1.7          0.2     Iris-setosa
21            5.1          3.7           1.5          0.4     Iris-setosa
22            4.6          3.6           1.0          0.2     Iris-setosa
23            5.1          3.3           1.7          0.5     Iris-setosa
24            4.8          3.4           1.9          0.2     Iris-setosa
25            5.0          3.0           1.6          0.2     Iris-setosa
26            5.0          3.4           1.6          0.4     Iris-setosa
27            5.2          3.5           1.5          0.2     Iris-setosa
28            5.2          3.4           1.4          0.2     Iris-setosa
29            4.7          3.2           1.6          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
120           6.9          3.2           5.7          2.3  Iris-virginica
121           5.6          2.8           4.9          2.0  Iris-virginica
122           7.7          2.8           6.7          2.0  Iris-virginica
123           6.3          2.7           4.9          1.8  Iris-virginica
124           6.7          3.3           5.7          2.1  Iris-virginica
125           7.2          3.2           6.0          1.8  Iris-virginica
126           6.2          2.8           4.8          1.8  Iris-virginica
127           6.1          3.0           4.9          1.8  Iris-virginica
128           6.4          2.8           5.6          2.1  Iris-virginica
129           7.2          3.0           5.8          1.6  Iris-virginica
130           7.4          2.8           6.1          1.9  Iris-virginica
131           7.9          3.8           6.4          2.0  Iris-virginica
132           6.4          2.8           5.6          2.2  Iris-virginica
133           6.3          2.8           5.1          1.5  Iris-virginica
134           6.1          2.6           5.6          1.4  Iris-virginica
135           7.7          3.0           6.1          2.3  Iris-virginica
136           6.3          3.4           5.6          2.4  Iris-virginica
137           6.4          3.1           5.5          1.8  Iris-virginica
138           6.0          3.0           4.8          1.8  Iris-virginica
139           6.9          3.1           5.4          2.1  Iris-virginica
140           6.7          3.1           5.6          2.4  Iris-virginica
141           6.9          3.1           5.1          2.3  Iris-virginica
142           5.8          2.7           5.1          1.9  Iris-virginica
143           6.8          3.2           5.9          2.3  Iris-virginica
144           6.7          3.3           5.7          2.5  Iris-virginica
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

150 rows × 5 columns
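When only one column should be filled, fillna also accepts a dictionary that maps column names to fill values and returns a new frame. A small sketch on made-up data (toy and its values are assumptions) showing that other columns are left alone:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a gap in each column.
toy = pd.DataFrame({
    "petal_length": [1.4, np.nan],
    "petal_width": [np.nan, 0.2],
})

# Fill only petal_length; petal_width keeps its missing value.
filled = toy.fillna({"petal_length": 1.0})

print(filled["petal_length"].tolist())     # [1.4, 1.0]
print(filled["petal_width"].isna().sum())  # 1
```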

Step 8 Delete the column class
del iris['class']
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
Step 9 Set the first three rows of the DataFrame to missing values
iris.iloc[0:3 ,:] = np.nan
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
0           NaN          NaN           NaN          NaN
1           NaN          NaN           NaN          NaN
2           NaN          NaN           NaN          NaN
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
Step 10 Delete the rows that contain missing values
iris = iris.dropna(how='any')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
5           5.4          3.9           1.7          0.4
6           4.6          3.4           1.4          0.3
7           5.0          3.4           1.5          0.2
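In this exercise the affected rows are missing in every column, so how='any' and how='all' would remove the same rows. On data with partial gaps they differ, as this sketch on a made-up frame illustrates (toy and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 complete, row 1 partly missing, row 2 entirely missing.
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [1.0, 2.0, np.nan]})

drop_any = toy.dropna(how="any")       # drops rows with at least one NaN
drop_all = toy.dropna(how="all")       # drops only rows that are NaN everywhere
drop_on_b = toy.dropna(subset=["b"])   # looks at column b only

print(len(drop_any), len(drop_all), len(drop_on_b))  # 1 2 2
```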
Step 11 Reset the index
iris = iris.reset_index(drop = True)
iris.head()
   sepal_length  sepal_width  petal_length  petal_width
0           4.6          3.1           1.5          0.2
1           5.0          3.6           1.4          0.2
2           5.4          3.9           1.7          0.4
3           4.6          3.4           1.4          0.3
4           5.0          3.4           1.5          0.2
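The drop = True argument matters here: without it, the old index would be kept as an extra column. A minimal sketch on a made-up frame whose index starts at 3, as after the dropna step (toy and its values are assumptions):

```python
import pandas as pd

# Hypothetical frame whose index no longer starts at 0.
toy = pd.DataFrame({"x": [10, 20]}, index=[3, 4])

# drop=True discards the old index labels.
dropped = toy.reset_index(drop=True)

# The default keeps them in a new column named 'index'.
kept = toy.reset_index()

print(dropped.index.tolist())  # [0, 1]
print(list(kept.columns))      # ['index', 'x']
```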