Data Cleaning 4

1. Read the data:

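  1.1 If the data is not in a plain .csv file, we have to find the right read method or read options. Here the survey file is tab-delimited and encoded in windows-1252, so we pass delimiter and encoding to read_csv().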

  import pandas as pd
  all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding="windows-1252")  # See http://kunststube.net/encoding/ for an introduction to encodings.
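  As a minimal sketch of the step above, the snippet below writes a small tab-separated, windows-1252-encoded file to a temporary directory and reads it back. The file name, column names, and values are made up for illustration and are not part of the original dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: two tab-separated columns containing a character
# ("é") that is stored differently in windows-1252 and in UTF-8.
sample_text = "dbn\tschool\n01M015\tCafé School\n"

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_survey.txt")
with open(sample_path, "w", encoding="windows-1252") as handle:
    handle.write(sample_text)

# delimiter tells read_csv how the columns are separated;
# encoding tells it how to turn the bytes on disk back into text.
sample_survey = pd.read_csv(sample_path, delimiter="\t", encoding="windows-1252")
print(sample_survey)
```

  If the encoding argument does not match the file, read_csv may raise a decoding error or return garbled characters, which is usually the first hint that an encoding has to be specified.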

  1.2 When there are many data files to read, we can use a for loop to read them one by one.

  data = {}  # maps each file name (without ".csv") to its DataFrame
  for f in data_files:
      file = pd.read_csv("schools/{0}".format(f))  # format() inserts the value of f; a variable name written inside the quotes would be taken literally.
      f = f.replace(".csv", "")
      data[f] = file
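  The loop above can be tried out on throwaway files. The sketch below is only an illustration: the file names and contents are hypothetical, and os.path.splitext is used here instead of replace() to drop the extension.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file names and contents, created only for this example.
tmp_dir = tempfile.mkdtemp()
example_files = ["ap_2010.csv", "class_size.csv"]
for name in example_files:
    with open(os.path.join(tmp_dir, name), "w") as handle:
        handle.write("DBN,value\n01M015,1\n")

frames = {}
for name in example_files:
    frame = pd.read_csv(os.path.join(tmp_dir, name))
    key = os.path.splitext(name)[0]  # "ap_2010.csv" -> "ap_2010"
    frames[key] = frame

print(sorted(frames))
```

  Storing the frames in a dictionary keyed by name makes it easy to refer to each dataset later, for example frames["class_size"].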

  1.3 Combine several DataFrames into one with the concat() function.

  survey = pd.concat([all_survey, d75_survey], axis=0)
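  To see what concat() does with axis=0, here is a small sketch with two made-up frames (the column names and values are assumptions for the example). Rows are stacked, and a column that exists in only one frame is filled with NaN for the other frame's rows.

```python
import pandas as pd

# Two small hypothetical frames that share only the "dbn" column.
first = pd.DataFrame({"dbn": ["01M015"], "rr_s": [89]})
second = pd.DataFrame({"dbn": ["75K004"], "rr_t": [60]})

# axis=0 stacks the rows; the result has the union of the columns.
stacked = pd.concat([first, second], axis=0)
print(stacked)
```

  Note that the original row labels are kept, so both rows here have index 0. Passing ignore_index=True to concat() renumbers the rows if that is not wanted.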

2. Cleaning up the data:

  In the combined DataFrame it is inevitable that many values are NaN, for example wherever a column exists in only one of the source DataFrames. So we need to deal with these NaN values.

  2.1 Figure out which columns are relevant and extract them from the original DataFrame.

  2.2 Some columns have different names but hold the same content. Rename them so that they share one name.

  2.3 To unify the strings, we can add characters, remove characters, change them, or convert the columns to numeric values.
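  The three clean-up steps above can be sketched as follows. The frame, its column names, and its values are hypothetical, and the target name "DBN" is only assumed to be the name used by the other datasets.

```python
import pandas as pd

# Hypothetical survey frame; the column names and values are illustrative only.
survey_example = pd.DataFrame(
    {"dbn": ["01M015", "01M019"], "rr_s": ["89", "s"], "extra": [None, None]}
)

# 2.1 Keep only the columns we care about.
relevant = survey_example[["dbn", "rr_s"]].copy()

# 2.2 Give the key column the same name it has in the other frames.
relevant = relevant.rename(columns={"dbn": "DBN"})

# 2.3 Convert a text column to numbers; values that cannot be parsed become NaN.
relevant["rr_s"] = pd.to_numeric(relevant["rr_s"], errors="coerce")

print(relevant)
```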

3. Filtering the data:

  3.1 We can use re.findall() and re.split() to extract the substring we need from a longer string.

  import re

  def extract_lat(location):
      lat_lon = re.findall(r"\(.+,.+\)", location)  # find the "(lat, lon)" part of the string
      lat = re.split(",", lat_lon[0])  # split it into the latitude and longitude parts
      final_lat = lat[0].replace("(", "")  # drop the opening parenthesis
      return final_lat

  data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(extract_lat)  # apply() calls the function on every value of the column.
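  As an alternative sketch (not the method used above), a single regular expression with capture groups can return the latitude and the longitude at once and also cope with strings that contain no coordinates. The function name, the pattern, and the sample address are assumptions made for this example.

```python
import re

# Matches "(number, number)" and captures the two numbers separately.
COORD_PATTERN = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")

def extract_lat_lon(text):
    """Return (lat, lon) as strings, or None when no "(lat, lon)" pair is found."""
    match = COORD_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical value in the style of an address followed by coordinates.
sample_location = "123 Example St\nNew York, NY 10001\n(40.7, -73.9)"
print(extract_lat_lon(sample_location))
```

  Returning None for unmatched strings avoids the IndexError that indexing an empty findall() result would raise.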

  3.2 Find the relevant data in each column and store it in another DataFrame.
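  One common way to do this is a boolean mask. The sketch below assumes a made-up frame and a made-up condition (GRADE equal to "09-12"); the real columns and values depend on the dataset.

```python
import pandas as pd

# Hypothetical rows; column names and values are illustrative.
class_size_example = pd.DataFrame(
    {
        "DBN": ["01M015", "01M015", "01M292"],
        "GRADE": ["0K", "09-12", "09-12"],
        "AVERAGE CLASS SIZE": [19.0, 22.5, 25.0],
    }
)

# The comparison yields True/False per row; indexing with it keeps the True rows.
high_school_rows = class_size_example[class_size_example["GRADE"] == "09-12"]
print(high_school_rows)
```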

4. Combining the data:

  4.1 Sometimes we want each category (here each DBN) to appear only once in a column; otherwise the data is difficult to categorize. So we group the rows by that column and calculate the mean of each group.

  import numpy as np
  group_by = class_size.groupby('DBN')  # groupby() groups the rows that share the same DBN.
  class_size = group_by.aggregate(np.mean)  # aggregate() applies the function to each group of rows.
  class_size.reset_index(inplace=True)  # Turn DBN from the index back into a regular column.
  data['class_size'] = class_size
  print(data['class_size'].head(5))
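  The effect of grouping and averaging is easier to see on a tiny frame. This is a minimal sketch with made-up values; it uses the mean() shortcut with numeric_only=True, which is a swapped-in variant of the aggregate(np.mean) call above and skips non-numeric columns.

```python
import pandas as pd

# Hypothetical data: the first DBN appears twice, the second once.
scores = pd.DataFrame(
    {"DBN": ["01M015", "01M015", "01M292"], "size": [20.0, 30.0, 25.0]}
)

# One row per DBN, holding the mean of every numeric column.
per_school = scores.groupby("DBN").mean(numeric_only=True).reset_index()
print(per_school)
```

  After this step every DBN occurs exactly once, which is what makes the frame usable as one side of a later combination on DBN.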

5. http://boundingbox.klokantech.com/ is useful for looking up the coordinates of a city.

Reposted from: https://www.cnblogs.com/kingoscar/p/5995284.html
