Week1 - Intro
Course content for this module
- Intro
- Regression
- Classification
- Clustering & Retrieval
- Matrix Factorization & Dimensionality Reduction
- Capstone: Build an Intelligent Application with Deep Learning
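Turi Create & SFrame basics
Official documentation: turicreate.SFrame (Turi Create API 6.4.1 documentation)
Creating an SFrame from a CSV file
# Import the turicreate library
import turicreate
# Create an SFrame object from people-example.csv
sf = turicreate.SFrame('people-example.csv')
# Inspect the table
sf  # show the first few rows
sf.tail()  # show the last few rows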
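Data visualization
Showing data for all columns
# Show the data for all columns
sf.show()
This displays interactive histograms.
Showing a specific column
# sf['target_column'].show()
sf['age'].show()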
Column operations
Basic operations
# Mean of a column
sf['age'].mean()
# Maximum / minimum of a column
sf['age'].max()
sf['age'].min()
# Create a new column and assign values to it
sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']
Applying a function to a column with apply()
# Define transform_country, which converts every 'USA' to 'United States'
def transform_country(country):
    if country == 'USA':
        return 'United States'
    else:
        return country
# Apply transform_country to every value in the Country column
sf['Country'] = sf['Country'].apply(transform_country)
# Show the result
sf
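To make the element-wise behaviour of apply() concrete, here is a minimal sketch in plain Python that needs no Turi Create installation. The list `countries` is hypothetical sample data standing in for the Country column; the point is only that the function is called once per value and the output keeps the same length and order.

```python
def transform_country(country):
    """Return 'United States' for 'USA'; leave every other value unchanged."""
    if country == 'USA':
        return 'United States'
    return country

# Hypothetical sample values standing in for the Country column.
countries = ['USA', 'Canada', 'USA', 'Poland']

# Element-wise application: one call per value, same length and order as the input.
transformed = [transform_country(c) for c in countries]
print(transformed)
```

Only the values equal to 'USA' change; everything else passes through untouched.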
Quiz
Task 1: Load and read the SFrame file
import turicreate
sf = turicreate.SFrame('people_wiki.sframe/')
sf
Task 2: Count the number of rows in sf -> Answer: 50971
num_rows = sf.num_rows()
print(num_rows)
Task 3: The name value in the last row -> Answer: Fawaz Damrah
last_row = sf[-1]
name_in_last_row = last_row['name']
print(name_in_last_row)
Task 4: Read the text column for Harpdog Brown
# Filter the SFrame for rows where the name column is 'Harpdog Brown'
harpdog_brown_row = sf[sf['name'] == 'Harpdog Brown']
# Access the text column in the filtered SFrame
if len(harpdog_brown_row) > 0:
    harpdog_brown_text = harpdog_brown_row['text'][0]  # Assuming 'text' is the column name
    print(harpdog_brown_text)
else:
    print("Harpdog Brown not found in the SFrame")
Task 5: After sorting by text, whose name comes first? -> Answer: 108 (artist)
# Sort the SFrame by the 'text' column in ascending order
sorted_sf = sf.sort('text')
# Get the name from the first row of the sorted SFrame
first_name = sorted_sf['name'][0]
print(first_name)
Week2 - Predicting house prices with linear regression
Dataset: Sales
Loading the dataset
import turicreate
sales = turicreate.SFrame('home_data.sframe/')
Data visualization with turicreate.show
turicreate.show(sales[1:5000]['sqft_living'],sales[1:5000]['price'])
建立线性回归模型
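Step 1: Split the data into a training set (training_set) and a test set (test_set)
# .8: roughly 80% of the rows go to the training set; seed=0 makes the split reproducible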
training_set, test_set = sales.random_split(.8,seed=0)
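The idea behind a seeded random split can be sketched in plain Python. This is an illustration of the concept, not Turi Create's implementation: `random_split` below is a hypothetical helper that sends each row to the first list with probability `fraction`, so the training share is only approximately 80%, and the same seed reproduces the same split.

```python
import random

def random_split(rows, fraction, seed):
    """Hypothetical helper: send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a private generator, so the seed fully determines the split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of a table
train, test = random_split(rows, 0.8, seed=0)
train_again, _ = random_split(rows, 0.8, seed=0)

print(len(train), len(test))   # roughly an 80/20 split
print(train == train_again)    # same seed, same split
```

Fixing the seed matters when comparing models: every model is then trained and evaluated on exactly the same rows.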
Step 2: Build a single-feature linear regression model
Building the model
# Use sqft_living to predict price
sqft_model = turicreate.linear_regression.create(training_set,target='price',features=['sqft_living'])
Evaluating the model
# max_error: the largest absolute difference between a predicted and an actual value
# rmse: root mean squared error
print(sqft_model.evaluate(test_set))
Result:
{'max_error': 4129161.6293016262, 'rmse': 255237.0877033326}
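For reference, the two metrics can be computed by hand. The sketch below uses the standard definitions (largest absolute residual, and the square root of the mean squared residual) on made-up numbers; the `actual` and `predicted` lists are hypothetical and are not taken from the house data.

```python
import math

def max_error(actual, predicted):
    """Largest absolute difference between an actual and a predicted value."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the average squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices, for illustration only.
actual = [300000.0, 450000.0, 520000.0, 610000.0]
predicted = [310000.0, 420000.0, 520000.0, 650000.0]

print(max_error(actual, predicted))
print(rmse(actual, predicted))
```

Because the residuals are squared before averaging, RMSE is pulled up by a few large misses, which is why the max_error reported above is so much larger than the rmse.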
Model coefficients
Print the model's intercept, the coefficient of each feature, and their standard errors
sqft_model.coefficients
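To see where an intercept and a slope come from, here is a minimal sketch of ordinary least squares for one feature, using the textbook closed-form solution. It is not how Turi Create fits the model (its solver and default settings are not covered in these notes, so its coefficients may differ slightly), and the `sqft` and `price` lists are hypothetical data constructed to lie exactly on a line.

```python
def fit_simple_linear_regression(x, y):
    """Closed-form least squares for y ≈ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data generated from price = 50000 + 200 * sqft.
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [250000.0, 350000.0, 450000.0, 550000.0]

intercept, slope = fit_simple_linear_regression(sqft, price)
print(intercept, slope)
```

The slope reads as "price change per additional square foot", which is the same interpretation as the sqft_living row in the coefficients table.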
Visualizing the model
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_set['sqft_living'], test_set['price'], '.',
         test_set['sqft_living'], sqft_model.predict(test_set), '-')
Step 3: Build a multi-feature linear regression model
# Multiple features
my_features = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','zipcode']
# Build the model
my_features_model = turicreate.linear_regression.create(training_set,target='price',features=my_features)
# Evaluate the model
print(my_features_model.evaluate(test_set))
# Predict the price of a house
house1 = sales[sales['id']=='5309101200']
print(my_features_model.predict(house1))
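Quiz
Task 1: Find the zip code with the highest average house price, and print that average price
import turicreate as tc
sales = tc.SFrame('home_data.sframe/')
# Group by zip code and compute the average price
average_prices_by_zip = sales.groupby('zipcode', {'avg_price': tc.aggregate.MEAN('price')})
# Sort by average price in descending order and take the top row
highest_avg_price_zip = average_prices_by_zip.sort('avg_price', ascending=False)[0]
# Print the result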
print(highest_avg_price_zip)
{'zipcode': '98039', 'avg_price': 2160606.5999999996}
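The group-then-average step can be sketched without Turi Create. The code below shows the same idea in plain Python: accumulate a sum and a count per zip code, divide, then pick the largest mean. The `rows` list is hypothetical sample data, not the real house sales.

```python
from collections import defaultdict

# Hypothetical sample rows, for illustration only.
rows = [
    {'zipcode': '98039', 'price': 2000000.0},
    {'zipcode': '98004', 'price': 1200000.0},
    {'zipcode': '98039', 'price': 2400000.0},
    {'zipcode': '98178', 'price': 300000.0},
    {'zipcode': '98004', 'price': 1500000.0},
]

# Accumulate the running sum and count of prices for each zip code.
totals = defaultdict(float)
counts = defaultdict(int)
for row in rows:
    totals[row['zipcode']] += row['price']
    counts[row['zipcode']] += 1

# Mean price per zip code.
avg_price = {z: totals[z] / counts[z] for z in totals}

# Zip code with the highest mean price.
best_zip = max(avg_price, key=avg_price.get)
print(best_zip, avg_price[best_zip])
```

Taking the maximum directly avoids sorting the whole table when only the top entry is needed.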
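Task 2: What fraction of the houses have a living area between 2000 sq.ft. and 4000 sq.ft.?
# Keep the houses whose living area is between 2000 and 4000 sq.ft. (inclusive)
houses_in_range = sales[(sales['sqft_living'] >= 2000) & (sales['sqft_living'] <= 4000)]
# Compute the fraction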
fraction = len(houses_in_range) / len(sales)
print(fraction)
0.4266413732475825
Task 3: Build a new linear regression model using advanced_features, and compare the RMSE of my_features_model and advanced_features_model
training_set, test_set = sales.random_split(.8,seed=0)
# Define the feature sets
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',  # condition of house
    'grade',  # measure of quality of construction
    'waterfront',  # waterfront property
    'view',  # type of view
    'sqft_above',  # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',  # the year built
    'yr_renovated',  # the year renovated
    'lat', 'long',  # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',  # average lot size of 15 nearest neighbors
]
# Train my_features_model
my_features_model = tc.linear_regression.create(training_set, target='price', features=my_features)
# Train advanced_features_model
advanced_features_model = tc.linear_regression.create(training_set, target='price', features=advanced_features)
# Evaluate both models on the test set
my_features_evaluation = my_features_model.evaluate(test_set)
advanced_features_evaluation = advanced_features_model.evaluate(test_set)
# Compute the RMSE difference (a positive value means advanced_features_model has the lower error)
rmse_difference = my_features_evaluation['rmse'] - advanced_features_evaluation['rmse']
print(rmse_difference)
25088.23651349137