Basic Data Exploration(2)

kaggle菜市场菜鸟

Last modified 2024-04-04 16:06:44


Tags: python, data mining, artificial intelligence, random forest, decision tree, machine learning

First published 2024-03-10 01:59:17

Article link: https://blog.csdn.net/weixin_59907082/article/details/136594137


1—Selecting Data for Modeling

Your dataset has too many variables to wrap your head around🤷🏼, or even to print out nicely🥀. How can you pare down this overwhelming amount of data to something you can understand🧐?

Let's start by selecting some variables using our intuition🤓.

To choose variables/columns, we’ll need to look at a list of all the columns🤯 in the dataset. This is done with the columns property of the DataFrame👇🏻:

# To choose variables/columns,
# we need to look at a list of all the columns in the dataset.
import pandas as pd
melbourne_file_path = '/Users/mac/Desktop/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns

Out[1]: 
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

2—Selecting The Prediction Target

We can pull out a single variable with dot-notation. The column comes back as a Series, which is broadly like a DataFrame with only a single column of data.

This time we choose the variable not by intuition but by what we want to predict💁🏼‍♀️. The column we pull out is called the prediction target. By convention, the prediction target is called y.

# Select the prediction target: the 'Price' column
y = melbourne_data.Price
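
Dot-notation is convenient, but it only works when the column name is a valid Python attribute name. The sketch below compares it with bracket-notation on a tiny made-up DataFrame; `toy_data`, its values, and the `Building Area` column name are hypothetical stand-ins, not part of the Melbourne file.

```python
import pandas as pd

# Hypothetical miniature table; names and values are made up for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [100.0, 200.0, 300.0],
    "Building Area": [50.0, 80.0, 120.0],  # hypothetical name containing a space
})

# Dot-notation and bracket-notation return the same Series.
y_dot = toy_data.Price
y_bracket = toy_data["Price"]
same = y_dot.equals(y_bracket)

# A name with a space cannot be written as an attribute,
# so bracket-notation is the only option here.
area = toy_data["Building Area"]

print(type(y_dot).__name__, same)
```

Bracket-notation is also the safer habit when a column name happens to collide with an existing DataFrame attribute or method.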

(The interesting part😄)

3—Choosing "Features"

In the column list above, the columns other than 'Price' are candidate “features”: the columns that are fed into our model and used to predict the home price. By convention, the feature data is called X.

Sometimes we will use all columns except the target as features. Other times we'll be better off with fewer features.

#For now, we'll build a model with only a few features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 
                      'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
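
Both ways of choosing features can be sketched on a small made-up table. The `toy_data` frame and its values are hypothetical; `drop(columns=...)` is one common way to keep every column except the target, and it returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical miniature table; values are made up for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Landsize": [202.0, 134.0, 120.0],
    "Price": [100.0, 200.0, 300.0],
})

# The prediction target
y = toy_data["Price"]

# Option 1: a hand-picked list of features
X_few = toy_data[["Rooms", "Bathroom"]]

# Option 2: every column except the target
X_all = toy_data.drop(columns=["Price"])

print(list(X_few.columns))
print(list(X_all.columns))
```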

- Is this the end of it? -No😂

Let's quickly review the data we'll be using to predict house prices, first with the describe method and then with the head method.

The describe method👀:

# Load the data
import pandas as pd
melbourne_file_path = '/Users/mac/Desktop/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Select the prediction target: the 'Price' column
y = melbourne_data.Price
# For now, we'll build a model with only a few features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize',
                      'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
# Summarize the feature data with the describe method
X.describe()

Out[2]: 
              Rooms      Bathroom       Landsize     Lattitude    Longtitude
count  13580.000000  13580.000000   13580.000000  13580.000000  13580.000000
mean       2.937997      1.534242     558.416127    -37.809203    144.995216
std        0.955748      0.691712    3990.669241      0.079260      0.103916
min        1.000000      0.000000       0.000000    -38.182550    144.431810
25%        2.000000      1.000000     177.000000    -37.856822    144.929600
50%        3.000000      1.000000     440.000000    -37.802355    145.000100
75%        3.000000      2.000000     651.000000    -37.756400    145.058305
max       10.000000      8.000000  433014.000000    -37.408530    145.526350
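
The count row deserves a second look, because describe counts only non-missing values. In the output above every feature has the same count, 13580, which suggests these five columns have no gaps (assuming the file has 13580 rows, which the post does not state). A minimal sketch of that check on hypothetical toy data, with one value deliberately missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; the values are made up for illustration.
toy_X = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [202.0, np.nan, 0.0, 94.0],  # one missing value on purpose
})

summary = toy_X.describe()

# describe() counts only non-missing values, so the gap between the
# number of rows and the "count" row is the number of missing entries.
missing_from_describe = len(toy_X) - summary.loc["count"]

# The same check done directly on the data.
missing_direct = toy_X.isna().sum()

print(missing_from_describe.astype(int).to_dict())
print(missing_direct.to_dict())
```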

The head method👀:

# Load the data
import pandas as pd
melbourne_file_path = '/Users/mac/Desktop/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Select the prediction target: the 'Price' column
y = melbourne_data.Price
# For now, we'll build a model with only a few features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize',
                      'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
# Show the first five rows of the feature data with the head method
X.head()

Out[3]: 
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941

Visually checking the data with these commands is an important part of any data project. You will often find surprises💡 in a dataset that are worth a closer look.
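
One such surprise is already visible in the describe output: the minimum Landsize is 0, and the minimum Bathroom count is 0 as well. A possible way to follow up is to filter the rows in question with a boolean mask. The sketch below does this on hypothetical toy data; `toy_X` and its values are made up, and whether a zero is a real value or a placeholder is something only the dataset's documentation can settle.

```python
import pandas as pd

# Hypothetical toy features; the values are made up for illustration.
toy_X = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [202.0, 0.0, 120.0],
})

# Boolean mask: True for rows whose Landsize is exactly 0
is_zero_land = toy_X["Landsize"] == 0

# Keep only the suspicious rows for a closer look
zero_land = toy_X[is_zero_land]
n_zero = len(zero_land)

print(n_zero)
print(zero_land)
```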