Second Example: Jambalaya

 

1. What's Cooking? - Exploratory Data Analysis

This notebook provides a step-by-step analysis and solution to the given problem. It can also serve as a good starting point for learning how to explore, manipulate, transform, and learn from text data. It is divided into three main parts:
+ Exploratory analysis - as a first step, we explore the main characteristics of the data with the help of graphical visualizations;
+ Text processing - here we apply some basic text processing techniques in order to clean the data and prepare it for model development;
+ Feature engineering and data modeling - in this part we extract features from the data and build a predictive model for the cuisine.

In [1]:

 

# Data processing
import pandas as pd
import numpy as np
import json
from collections import Counter
import re
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

Data source: https://github.com/woshizhangrong/train_raw

train_df = pd.read_json('E:/Whats_Cooking/train.json') # store as dataframe objects
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39774 entries, 0 to 39773
Data columns (total 3 columns):
cuisine        39774 non-null object
id             39774 non-null int64
ingredients    39774 non-null object
dtypes: int64(1), object(2)
memory usage: 932.3+ KB

In [3]:

 

print("The training data consists of {} recipes".format(len(train_df)))
train_df.head()
The training data consists of 39774 recipes

Out[3]:

   cuisine      id     ingredients
0  greek        10259  [romaine lettuce, black olives, grape tomatoes...
1  southern_us  25693  [plain flour, ground pepper, salt, tomatoes, g...
2  filipino     20130  [eggs, pepper, salt, mayonaise, cooking oil, g...
3  indian       22213  [water, vegetable oil, wheat, salt]
4  indian       13162  [black pepper, shallots, cornflour, cayenne pe...

We have imported the data as a DataFrame object, and the output above shows what the training sample initially looks like. We observe that each recipe is a separate row and has:

  • a unique identifier in the "id" column;
  • the type of cuisine, which is our target variable;
  • a list object containing the ingredients (the recipe) - this will be the main source of explanatory variables for our classification problem.

Problem statement: predict the type of cuisine from the given data (the ingredients). This is a classification task that requires text processing and analysis.

In [4]:

 

#Now let's explore a little bit more about the target variable
print("Number of cuisine categories: {}".format(len(train_df.cuisine.unique())))
train_df.cuisine.unique()
Number of cuisine categories: 20

Out[4]:

array(['greek', 'southern_us', 'filipino', 'indian', 'jamaican',
       'spanish', 'italian', 'mexican', 'chinese', 'british', 'thai',
       'vietnamese', 'cajun_creole', 'brazilian', 'french', 'japanese',
       'irish', 'korean', 'moroccan', 'russian'], dtype=object)

There are 20 different categories (cuisines) to predict, which means the problem at hand is a multi-class classification.

In [5]:

 

sns.countplot(y=train_df.cuisine, order=train_df.cuisine.value_counts().index)
plt.title("Cuisine Distribution")
plt.show()

In [6]:

 

train_df.cuisine.value_counts()

Out[6]:

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

In [7]:

 

print('Maximum Number of Ingredients in a Dish: ',train_df['ingredients'].str.len().max())
print('Minimum Number of Ingredients in a Dish: ',train_df['ingredients'].str.len().min())
Maximum Number of Ingredients in a Dish:  65
Minimum Number of Ingredients in a Dish:  1
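To see how recipe lengths are distributed between these two extremes, a quick histogram can help. This is a minimal sketch added for illustration; it only reuses train_df and the already-imported matplotlib:

# Sketch: distribution of recipe lengths (number of ingredients per recipe)
recipe_lengths = train_df['ingredients'].str.len()
plt.hist(recipe_lengths, bins=30)
plt.xlabel('Number of ingredients per recipe')
plt.ylabel('Number of recipes')
plt.title('Recipe Length Distribution')
plt.show()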

What are the most common ingredients in the training sample, and how many unique ingredients can we find in the dataset? Both questions are answered below, after a cleaning pass (the unique-ingredient count is sketched right after the counter is built).

2. Text Processing

We continue the analysis with some simple data processing. The goal is to clean the data and prepare it for model development.

In [8]:

 

# Prepare the data
features = [] # list of lists containing the recipes
for item in train_df['ingredients']:
    features.append(item)

In [9]:

 

ingrCounter = Counter()
features_processed = [] # here we will store the preprocessed training features
for item in features:
    newitem = []
    for ingr in item:
        ingr = ingr.lower() # Case normalization - convert everything to lower case
        ingr = re.sub("[^a-zA-Z]", " ", ingr) # Remove punctuation, digits and special characters
        ingr = re.sub(r'\b(oz|ounc|ounce|pound|lb|inch|inches|kg|to)\b', ' ', ingr) # Remove measurement units
        ingrCounter[ingr] += 1
        newitem.append(ingr)
    features_processed.append(newitem)
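The counter built above also answers the earlier question about how many unique ingredients the dataset contains. A minimal sketch (assuming the cleaning cell above has been run):

# Sketch: number of distinct (cleaned) ingredient strings
print("Number of unique ingredients after cleaning: {}".format(len(ingrCounter)))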

In [10]:

 

ingr_df = pd.DataFrame(ingrCounter.most_common(15),columns=['ingredient','count'])
ingr_df

Out[10]:

    ingredient           count
0   salt                 18049
1   onions                7972
2   olive oil             7972
3   water                 7457
4   garlic                7380
5   sugar                 6434
6   garlic cloves         6237
7   butter                4848
8   ground black pepper   4785
9   all purpose flour     4632
10  pepper                4438
11  vegetable oil         4385
12  eggs                  3388
13  soy sauce             3296
14  kosher salt           3113

In [11]:

 

#f, ax=plt.subplots(figsize=(12,20))
sns.barplot(y=ingr_df['ingredient'].values, x=ingr_df['count'].values,orient='h')
#plt.ylabel('Ingredient', fontsize=12)
#plt.xlabel('Count', fontsize=12)
#plt.xticks(rotation='horizontal')
#plt.yticks(fontsize=12)
plt.title("Ingredient Count")
plt.show()

Salt seems to be the most frequently used ingredient, which is no surprise at all! We also find water, onions, garlic, and olive oil - no surprise either. :)

  • Salt, water, onions, and garlic are so common that we expect them to have poor predictive power for identifying the type of cuisine (a quick check is sketched right after this list).
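One way to sanity-check this intuition is to look at how the recipes containing such an ingredient are spread across cuisines; if that spread roughly mirrors the overall cuisine distribution, the ingredient carries little signal. A minimal sketch for illustration (not part of the original analysis, using "salt" as the example):

# Sketch: cuisine distribution among recipes whose raw ingredient list contains "salt"
contains_salt = train_df['ingredients'].apply(lambda ings: 'salt' in ings)
print(train_df.loc[contains_salt, 'cuisine'].value_counts(normalize=True).head(10))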

3. Feature Engineering and Data Modeling

In [12]:

 

train_df['seperated_ingredients'] = train_df['ingredients'].apply(','.join)
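Joining each ingredient list into a single comma-separated string gives the vectorizer in the next cell plain text to work with. A minimal sketch to see what the result looks like (just inspecting the first row):

# Sketch: the first recipe as a single comma-separated string
print(train_df['seperated_ingredients'].iloc[0])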

In [13]:

 

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(binary=True).fit(train_df['seperated_ingredients'].values)
X_train_vectorized = vect.transform(train_df['seperated_ingredients'].values)
X_train_vectorized = X_train_vectorized.astype('float')
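TF-IDF weighting also addresses the concern raised earlier: terms that appear in almost every recipe (salt, water) receive low inverse-document-frequency weights, while rarer, cuisine-specific terms are emphasized. A minimal sketch to inspect the fitted vectorizer (assuming the cell above has been run):

# Sketch: size of the learned vocabulary and shape of the feature matrix
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Feature matrix shape: {}".format(X_train_vectorized.shape))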

In [14]:

 

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_transformed = encoder.fit_transform(train_df.cuisine)
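The encoder maps each cuisine name to an integer label; the mapping is stored in encoder.classes_ and is useful later for decoding predictions back into cuisine names. A minimal sketch:

# Sketch: integer label -> cuisine name mapping
for label, name in enumerate(encoder.classes_):
    print(label, name)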

In [15]:

 

print(X_train_vectorized)
y_transformed
  (0, 2798)	0.15183517837377775
  (0, 2427)	0.23007896012035983
  (0, 2318)	0.3426671291173114
  (0, 2202)	0.23913220198081458
  (0, 2017)	0.10208411357610164
  (0, 1889)	0.1645493089953018
  (0, 1885)	0.26100924108701357
  (0, 1541)	0.2663871237012894
  (0, 1180)	0.35031170238526027
  (0, 1103)	0.10531073154596084
  (0, 1097)	0.38853112215987895
  (0, 967)	0.3040361765035925
  (0, 745)	0.3343204746101372
  (0, 528)	0.14568369866765699
  (0, 251)	0.1398962004921347
  (0, 185)	0.20748802168948122
  (1, 3012)	0.30913470576050534
  (1, 2905)	0.23719808692764152
  (1, 2798)	0.20426659039473835
  (1, 2775)	0.3034717400305941
  (1, 2373)	0.12082052495781231
  (1, 2100)	0.3831099504645736
  (1, 2017)	0.1373355900588895
  (1, 1877)	0.1300036033814326
  (1, 1724)	0.23580432530539203
  :	:
  (39772, 350)	0.1941573519292017
  (39772, 303)	0.27894483473192366
  (39772, 287)	0.13398798263813363
  (39772, 149)	0.13758614056520396
  (39773, 2971)	0.1975464041226418
  (39773, 2798)	0.17872306445833877
  (39773, 2672)	0.15854644611127255
  (39773, 2373)	0.1057119249320061
  (39773, 2316)	0.4290979107163017
  (39773, 2017)	0.12016178204711027
  (39773, 1898)	0.2568680228381645
  (39773, 1890)	0.15485523022733863
  (39773, 1368)	0.2873688348522114
  (39773, 1215)	0.14683027334043636
  (39773, 1201)	0.1848275373976143
  (39773, 1103)	0.12395978892263136
  (39773, 1053)	0.1468886450663615
  (39773, 869)	0.22475778151656522
  (39773, 602)	0.20502059327274608
  (39773, 583)	0.19438870133941094
  (39773, 556)	0.2554909855209906
  (39773, 551)	0.2728016336867552
  (39773, 496)	0.27507217175058174
  (39773, 251)	0.16466986060689143
  (39773, 205)	0.23693690350347973

Out[15]:

array([ 6, 16,  4, ...,  8,  3, 13], dtype=int64)

Logistic Regression

In [16]:

 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X_train_vectorized, y_transformed , random_state = 0)
lr1 = LogisticRegression(C=10,dual=False)
lr1.fit(X_train , y_train)
lr1.score(X_test, y_test)

Out[16]:

0.794147224456959
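The fitted model can also be applied to a new ingredient list by reusing the same vectorizer and decoding the predicted label with the encoder. A minimal sketch (the example recipe below is made up for illustration):

# Sketch: classify a hypothetical recipe (not from the dataset)
sample = ','.join(['soy sauce', 'ginger', 'garlic', 'rice', 'sesame oil'])
sample_vec = vect.transform([sample])
pred = lr1.predict(sample_vec)
print(encoder.inverse_transform(pred))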