Getting started with machine learning in Python

Machine learning is a field that uses algorithms to learn from data and make predictions. Practically, this means that we can feed data into an algorithm, and use it to make predictions about what might happen in the future. This has a vast range of applications, from self-driving cars to stock price prediction. Not only is machine learning interesting, it’s also starting to be widely used, making it an extremely practical skill to learn.

In this tutorial, we’ll guide you through the basic principles of machine learning, and how to get started with machine learning in Python. Luckily for us, Python has an amazing ecosystem of libraries that make machine learning easy to get started with. We’ll be using the excellent Scikit-learn, Pandas, and Matplotlib libraries in this tutorial.

If you want to dive more deeply into machine learning, and apply algorithms in your browser, check out our courses here.

The dataset

Before we dive into machine learning, we’re going to explore a dataset, and figure out what might be interesting to predict. The dataset is from BoardGameGeek, and contains data on 80000 board games. Here’s a single boardgame on the site. This information was kindly scraped into csv format by Sean Beck, and can be downloaded here.

The dataset contains several data points about each board game. Here’s a list of the interesting ones:

  • name – name of the board game.
  • playingtime – the playing time (given by the manufacturer).
  • minplaytime – the minimum playing time (given by the manufacturer).
  • maxplaytime – the maximum playing time (given by the manufacturer).
  • minage – the minimum recommended age to play.
  • users_rated – the number of users who rated the game.
  • average_rating – the average rating given to the game by users. (0-10)
  • total_weights – Number of weights given by users. Weight is a subjective measure that is made up by BoardGameGeek. It’s how “deep” or involved a game is. Here’s a full explanation.
  • average_weight – the average of all the subjective weights (0-5).

Introduction to Pandas

The first step in our exploration is to read in the data and print some quick summary statistics. In order to do this, we’ll use the Pandas library. Pandas provides data structures and data analysis tools that make manipulating data in Python much quicker and more effective. The most common data structure is called a dataframe. A dataframe is an extension of a matrix, so we’ll talk about what a matrix is before coming back to dataframes.

Our data file looks like this (we removed some columns to make it easier to look at):

id,type,name,yearpublished,minplayers,maxplayers,playingtime
12333,boardgame,Twilight Struggle,2005,2,2,180
120677,boardgame,Terra Mystica,2012,2,5,150

This is in a format called csv, or comma-separated values, which you can read more about here. Each row of the data is a different board game, and different data points about each board game are separated by commas within the row. The first row is the header row, and describes what each data point is. The entire set of one data point, going down, is a column.
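As a rough illustration of the header row, data rows, and columns described above, here is a minimal sketch that parses the sample lines with Python's built-in csv module. The sample is held in a string rather than a file, and the variable names are only for illustration.

```python
import csv
import io

# The sample rows shown above, held in a string instead of a file.
sample = """id,type,name,yearpublished,minplayers,maxplayers,playingtime
12333,boardgame,Twilight Struggle,2005,2,2,180
120677,boardgame,Terra Mystica,2012,2,5,150
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)   # first row: the names of the data points
rows = list(reader)     # remaining rows: one list of values per board game

# A column is the same position read from every row, going down.
name_position = header.index("name")
names = [row[name_position] for row in rows]

print(header)
print(names)
```

Note that the csv module returns every value as a string, including the year and the player counts.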

We can easily conceptualize a csv file as a matrix:

    1       2           3                   4
1   id      type        name                yearpublished
2   12333   boardgame   Twilight Struggle   2005
3   120677  boardgame   Terra Mystica       2012

We removed some of the columns here for display purposes, but you can still get a sense of how the data looks visually. A matrix is a two-dimensional data structure, with rows and columns. We can access elements in a matrix by position. The first row starts with id, the second row starts with 12333, and the third row starts with 120677. The first column is id, the second is type, and so on. Matrices in Python can be used via the NumPy library.

A matrix has some downsides, though. You can’t easily access columns and rows by name, and each column has to have the same datatype. This means that we can’t effectively store our board game data in a matrix – the name column contains strings, and the yearpublished column contains integers, which means that we can’t store them both in the same matrix.

A dataframe, on the other hand, can have different datatypes in each column. It also has a lot of built-in niceties for analyzing data, such as looking up columns by name. Pandas gives us access to these features, and generally makes working with data much simpler.
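To make the difference concrete, here is a small sketch using two of the sample games. The three-column layout and variable names are assumptions chosen for the illustration, not part of the real dataset loading code.

```python
import numpy
import pandas

rows = [
    [12333, "Twilight Struggle", 2005],
    [120677, "Terra Mystica", 2012],
]

# A NumPy array holds a single datatype, so the integers end up stored as strings.
matrix = numpy.array(rows)
print(matrix.dtype)

# A dataframe keeps a separate datatype for each column.
frame = pandas.DataFrame(rows, columns=["id", "name", "yearpublished"])
print(frame.dtypes)

# Columns can be looked up by name, and numeric columns stay numeric.
latest_year = frame["yearpublished"].max()
print(latest_year)
```

Because the year column stays numeric in the dataframe, operations like max work as expected, while the same values in the array would have to be converted back from strings first.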

Reading in our data

We’ll now read in our data from a csv file into a Pandas dataframe, using the read_csv function.

# Import the pandas library.
import pandas

# Read in the data.
games = pandas.read_csv("board_games.csv")

# Print the names of the columns in games.
print(games.columns)
Index(['id', 'type', 'name', 'yearpublished', 'minplayers', 'maxplayers',
       'playingtime', 'minplaytime', 'maxplaytime', 'minage', 'users_rated',
       'average_rating', 'bayes_average_rating', 'total_owners',
       'total_traders', 'total_wanters', 'total_wishers', 'total_comments',
       'total_weights', 'average_weight'],
      dtype='object')

The code above read the data in, and shows us all of the column names. The columns that are in the data but aren’t listed above should be fairly self-explanatory.

print(games.shape)
(81312, 20)

We can also see the shape of the data, which shows that it has 81312 rows, or games, and 20 columns, or data points describing each game.

Plotting our target variables

It could be interesting to predict the average score that a human would give to a new, unreleased, board game. This is stored in the average_rating column, which is the average of all the user ratings for a board game. Predicting this column could be useful to board game manufacturers who are thinking of what kind of game to make next, for instance.

We can access a column in a dataframe with Pandas using games["average_rating"]. This will extract a single column from the dataframe.

Let’s plot a histogram of this column so we can visualize the distribution of ratings. We’ll use Matplotlib to generate the visualization. Matplotlib is the main plotting infrastructure in Python, and several other plotting libraries, like seaborn, are built on top of it.

We import Matplotlib’s plotting functions with import matplotlib.pyplot as plt. We can then draw and show plots.

# Import matplotlib
import matplotlib.pyplot as plt

# Make a histogram of all the ratings in the average_rating column.
plt.hist(games["average_rating"])

# Show the plot.
plt.show()

[Figure: histogram of the average_rating column]

What we see here is that there are quite a few games with a 0 rating. There’s a fairly normal distribution of ratings, with some right skew, and a mean rating around 6 (if you remove the zeros).

Exploring the 0 ratings

Are there truly so many terrible games that were given a 0 rating? Or is something else happening? We’ll need to dive into the data a bit more to check on this.

With Pandas, we can select subsets of data using Boolean series (vectors, or one column/row of data, are known as series in Pandas). Here’s an example:

games[games["average_rating"] == 0]

The code above will create a new dataframe, with only the rows in games where the value of the average_rating column equals 0.

We can then index the resulting dataframe to get the values we want. There are two ways to index in Pandas – we can index by the name of the row or column, or we can index by position. Indexing by names looks like games["average_rating"] – this will return the whole average_rating column of games. Indexing by position looks like games.iloc[0] – this will return the whole first row of the dataframe. We can also pass in multiple index values at once – games.iloc[0,0] will return the value in the first column of the first row of games. Read more about Pandas indexing here.
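The selection and indexing styles described above can be tried on a tiny made-up dataframe. The three games and their values below are invented for illustration and are not rows from the real dataset.

```python
import pandas

# A made-up miniature version of the games dataframe.
toy_games = pandas.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "users_rated": [0, 250, 40],
    "average_rating": [0.0, 7.5, 6.1],
})

# A Boolean series: True wherever the rating equals 0.
mask = toy_games["average_rating"] == 0

# Passing the Boolean series back in keeps only the rows marked True.
unrated = toy_games[mask]

# Indexing by name returns a whole column.
rating_column = toy_games["average_rating"]

# Indexing by position: the whole first row, then a single value.
first_row = toy_games.iloc[0]
first_value = toy_games.iloc[0, 0]

print(unrated)
print(first_value)
```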

# Print the first row of all the games with zero scores.
# The .iloc method on dataframes allows us to index by position.
print(games[games["average_rating"] == 0].iloc[0])

# Print the first row of all the games with scores greater than 0.
print(games[games["average_rating"] > 0].iloc[0])
id                             318
type                     boardgame
name                    Looney Leo
users_rated                      0
average_rating                   0
bayes_average_rating             0
Name: 13048, dtype: object
id                                  12333
type                            boardgame
name                    Twilight Struggle
users_rated                         20113
average_rating                    8.33774
bayes_average_rating              8.22186
Name: 0, dtype: object

This shows us that the main difference between a game with a 0 rating and a game with a rating above 0 is that the 0 rated game has no reviews. The users_rated column is 0. By filtering out any board games with 0 reviews, we can remove much of the noise.

Removing games without reviews

# Remove any rows without user reviews.
games = games[games["users_rated"] > 0]

# Remove any rows with missing values.
games = games.dropna(axis=0)

We just filtered out all of the rows without user reviews. While we were at it, we also took out any rows with missing values. Many machine learning algorithms can’t work with missing values, so we need some way to deal with them. Filtering them out is one common technique, but it means that we may potentially lose valuable data. Other techniques for dealing with missing data are listed here.
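As a hedged sketch of the trade-off, the example below compares dropping rows with one common alternative, filling a missing value with the column mean. The toy values are invented, and mean imputation is only one of several possible techniques, not necessarily the best choice for this dataset.

```python
import numpy
import pandas

# A made-up dataframe with one missing minage value.
toy_games = pandas.DataFrame({
    "minage": [8.0, numpy.nan, 12.0, 10.0],
    "average_rating": [6.5, 7.0, 5.5, 8.0],
})

# Option 1: drop any row that has a missing value (loses one game).
dropped = toy_games.dropna(axis=0)

# Option 2: fill the gap with the mean of the known minage values (keeps every game).
minage_mean = toy_games["minage"].mean()
filled = toy_games.fillna({"minage": minage_mean})

print(len(dropped), len(filled))
print(filled)
```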

Clustering games

We’ve seen that there may be distinct sets of games. One set (which we just removed) was the set of games without reviews. Another set could be a set of highly rated games. One way to figure out more about these sets of games is a technique called clustering. Clustering enables you to find patterns within your data easily by grouping similar rows (in this case, games), together.

We’ll use a particular type of clustering called k-means clustering. Scikit-learn has an excellent implementation of k-means clustering that we can use. Scikit-learn is the primary machine learning library in Python, and contains implementations of most common algorithms, including random forests, support vector machines, and logistic regression. Scikit-learn has a consistent API for accessing these algorithms.

# Import the kmeans clustering model.
from sklearn.cluster import KMeans

# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=5, random_state=1)

# Get only the numeric columns from games.
good_columns = games.select_dtypes(include="number")

# Fit the model using the good columns.
kmeans_model.fit(good_columns)

# Get the cluster assignments.
labels = kmeans_model.labels_

In order to use the clustering algorithm in Scikit-learn, we’ll first initialize it using two parameters – n_clusters defines how many clusters of games we want, and random_state is a random seed we set in order to reproduce our results later. Here’s more information on the implementation.

We then only get the numeric columns from our dataframe. Most machine learning algorithms can’t directly operate on text data, and can only take numbers as input. Getting only the numeric columns removes type and name, which aren’t usable by the clustering algorithm.

Finally, we fit our kmeans model to our data, and get the cluster assignment labels for each row.
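To see what the cluster labels mean on a small scale, here is a sketch that runs k-means on six made-up points forming two obvious groups. The points, the choice of two clusters, and the explicit n_init=10 setting are assumptions made for this illustration.

```python
import numpy
from sklearn.cluster import KMeans

# Six made-up points: three near (1, 1) and three near (9, 9).
points = numpy.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [9.0, 9.0], [9.2, 8.8], [8.9, 9.1],
])

# n_init is set explicitly so the behavior does not depend on the library default.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_model.fit(points)

# One label per row; rows that share a label were grouped together.
toy_labels = toy_model.labels_
print(toy_labels)

# The center of each cluster, in the same units as the input columns.
print(toy_model.cluster_centers_)
```

The label numbers themselves are arbitrary. What matters is which rows share a label.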

Plotting clusters

Now that we have cluster labels, let’s plot the clusters. One sticking point is that our data has many columns – it’s outside of the realm of human understanding and physics to be able to visualize things in more than 3 dimensions. So we’ll have to reduce the dimensionality of our data, without losing too much information. One way to do this is a technique called principal component analysis, or PCA. PCA takes multiple columns, and turns them into fewer columns while trying to preserve the unique information in each column. To simplify, say we have two columns, total_owners, and total_traders. There is some correlation between these two columns, and some overlapping information. PCA will compress this information into one column with new numbers while trying not to lose any information.
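The two-column example can be sketched with made-up numbers. The owner and trader counts below are invented so that the columns are strongly correlated. The explained_variance_ratio_ attribute then reports how much of the original variation the single new column keeps.

```python
import numpy
from sklearn.decomposition import PCA

# Made-up, strongly correlated values for total_owners and total_traders.
owners = [100, 200, 300, 400, 500, 600]
traders = [12, 19, 31, 42, 48, 61]
two_columns = numpy.column_stack([owners, traders])

# Compress the two columns into one.
pca_1 = PCA(n_components=1)
one_column = pca_1.fit_transform(two_columns)

print(one_column.shape)
# Fraction of the original variance kept by the single new column.
print(pca_1.explained_variance_ratio_)
```

When the columns overlap this much, very little is lost. With weakly related columns, the ratio would be noticeably lower.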

We’ll try to turn our board game data into two dimensions, or columns, so we can easily plot it out.

# Import the PCA model.
from sklearn.decomposition import PCA

# Create a PCA model.
pca_2 = PCA(2)

# Fit the PCA model on the numeric columns from earlier.
plot_columns = pca_2.fit_transform(good_columns)

# Make a scatter plot of each game, shaded according to cluster assignment.
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)

# Show the plot.
plt.show()

[Figure: scatter plot of the two PCA columns, shaded by cluster assignment]

We first initialize a PCA model from Scikit-learn. PCA doesn’t make predictions itself, but Scikit-learn includes tools like it that support machine learning. Dimensionality reduction techniques like PCA are widely used when preprocessing data for machine learning algorithms.

We then turn our data into 2 columns, and plot the columns. When we plot the columns, we shade them according to their cluster assignment.

The plot shows the 5 clusters that we asked k-means to find. We could dive more into which games are in each cluster to learn more about what factors cause games to be clustered.

Figuring out what to predict

There are two things we need to determine before we jump into machine learning – how we’re going to measure error, and what we’re going to predict. We thought earlier that average_rating might be good to predict on, and our exploration reinforces this notion.

There are a variety of ways to measure error (many are listed here). Generally, when we’re doing regression, and predicting continuous variables, we’ll need a different error metric than when we’re performing classification, and predicting discrete values.

For this, we’ll use mean squared error – it’s easy to calculate, and simple to understand. It shows us the average squared distance between our predictions and the actual values.

Finding correlations

Now that we want to predict average_rating, let’s see what columns might be interesting for our prediction. One way is to find the correlation between average_rating and each of the other columns. This will show us which other columns might predict average_rating the best. We can use the corr method on Pandas dataframes to easily find correlations. This will give us the correlation between each column and each other column. Since the result of this is a dataframe, we can index it and only get the correlations for the average_rating column.

games.select_dtypes(include="number").corr()["average_rating"]
id                      0.304201
yearpublished           0.108461
minplayers             -0.032701
maxplayers             -0.008335
playingtime             0.048994
minplaytime             0.043985
maxplaytime             0.048994
minage                  0.210049
users_rated             0.112564
average_rating          1.000000
bayes_average_rating    0.231563
total_owners            0.137478
total_traders           0.119452
total_wanters           0.196566
total_wishers           0.171375
total_comments          0.123714
total_weights           0.109691
average_weight          0.351081
Name: average_rating, dtype: float64

We see that the average_weight and id columns correlate most strongly with rating. ids are presumably assigned when the game is added to the database, so this likely indicates that games created later score higher in the ratings. Maybe reviewers were not as nice in the early days of BoardGameGeek, or older games were of lower quality. average_weight indicates the “depth” or complexity of a game, so it may be that more complex games are reviewed better.
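To get a feel for what these numbers mean, here is a small sketch on invented values. One column rises with the rating and another falls, so the correlations come out with opposite signs. The column names mirror the dataset, but the values are made up.

```python
import pandas

# Made-up values: rating rises with average_weight and falls with minplayers.
toy = pandas.DataFrame({
    "average_weight": [1.0, 2.0, 3.0, 4.0],
    "average_rating": [5.0, 6.0, 6.5, 8.0],
    "minplayers": [4, 3, 2, 1],
})

# Correlation of every column with average_rating.
correlations = toy.corr()["average_rating"]
print(correlations)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.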

Picking predictor columns

Before we get started predicting, let’s only select the columns that are relevant when training our algorithm. We’ll want to remove certain columns that aren’t numeric.

We’ll also want to remove columns that can only be computed if you already know the average rating. Including these columns will defeat the purpose of the model, which is to predict the rating without any previous knowledge. Using columns that can only be computed with knowledge of the target can lead to overfitting, where your model is good in a training set, but doesn’t generalize well to future data.

The bayes_average_rating column appears to be derived from average_rating in some way, so let’s remove it.

# Get all the columns from the dataframe.
columns = games.columns.tolist()

# Filter the columns to remove ones we don't want.
columns = [c for c in columns if c not in ["bayes_average_rating", "average_rating", "type", "name"]]

# Store the variable we'll be predicting on.
target = "average_rating"

Splitting into train and test sets

We want to be able to figure out how accurate an algorithm is using our error metrics. However, evaluating the algorithm on the same data it has been trained on hides overfitting, because a model that has simply memorized the training data will appear to have very low error. We want the algorithm to learn generalized rules to make predictions, not memorize how to make specific predictions. An example is learning math. If you memorize that 1+1=2, and 2+2=4, you’ll be able to perfectly answer any questions about 1+1 and 2+2. You’ll have 0 error. However, the second anyone asks you something outside of your training set, like 3+3, you won’t be able to solve it. On the other hand, if you’re able to generalize and learn addition, you’ll make occasional mistakes because you haven’t memorized the solutions – maybe you’ll get 3453 + 353535 off by one, but you’ll be able to solve any addition problem thrown at you.

If your error looks surprisingly low when you’re training a machine learning algorithm, you should always check to see if you’re overfitting.

To check for overfitting, we’ll train our algorithm on a set consisting of 80% of the data, and test it on another set consisting of 20% of the data. To do this, we first randomly sample 80% of the rows to be in the training set, then put everything else in the testing set.

# Generate the training set. Set random_state to be able to replicate results.
train = games.sample(frac=0.8, random_state=1)

# Select anything not in the training set and put it in the testing set.
test = games.loc[~games.index.isin(train.index)]

# Print the shapes of both sets.
print(train.shape)
print(test.shape)
(45515, 20)
(11379, 20)

Above, we exploit the fact that every Pandas row has a unique index to select any row not in the training set to be in the testing set.
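Scikit-learn also provides a helper, train_test_split in sklearn.model_selection, that performs a comparable random split in one call. The sketch below applies it to a made-up ten-row dataframe. It is an alternative to the sampling approach used in this tutorial, not the method used to produce the shapes shown above.

```python
import pandas
from sklearn.model_selection import train_test_split

# A made-up dataframe with ten games.
toy_games = pandas.DataFrame({
    "minage": [8, 10, 12, 14, 6, 8, 10, 12, 14, 16],
    "average_rating": [6.1, 6.8, 7.2, 7.9, 5.5, 6.0, 6.9, 7.4, 8.1, 7.0],
})

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
toy_train, toy_test = train_test_split(toy_games, test_size=0.2, random_state=1)

# No row index should appear in both sets.
overlap = toy_train.index.intersection(toy_test.index)

print(toy_train.shape, toy_test.shape)
print(len(overlap))
```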

Fitting a linear regression

Linear regression is a powerful and commonly used machine learning algorithm. It predicts the target variable using linear combinations of the predictor variables. Let’s say we have two values, 3 and 4. A linear combination would be 3 * .5 + 4 * .5. A linear combination involves multiplying each number by a constant, and adding the results. You can read more here.
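The linear combination from the paragraph above can be written directly in code. This short sketch uses NumPy's dot product, which multiplies each value by its constant and adds the results.

```python
import numpy

values = numpy.array([3, 4])
weights = numpy.array([0.5, 0.5])

# Multiply each value by its constant, then add the results: 3 * .5 + 4 * .5.
combination = numpy.dot(values, weights)
print(combination)  # 3.5
```

In a linear regression, the values are a game's predictor columns and the constants are what the model learns during training.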

Linear regression only works well when the predictor variables and the target variable are linearly correlated. As we saw earlier, a few of the predictors are correlated with the target, so linear regression should work well for us.

We can use the linear regression implementation in Scikit-learn, just as we used the k-means implementation earlier.

# Import the linearregression model.
from sklearn.linear_model import LinearRegression

# Initialize the model class.
model = LinearRegression()

# Fit the model to the training data.
model.fit(train[columns], train[target])

When we fit the model, we pass in the predictor matrix, which consists of all the columns from the dataframe that we picked earlier. If you pass a list to a Pandas dataframe when you index it, it will generate a new dataframe with all of the columns in the list. We also pass in the target variable, which we want to make predictions for.

The model learns the equation that maps the predictors to the target with minimal error.
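As a small illustration of what "learning the equation" means, the sketch below fits a linear regression to four made-up points that lie exactly on the line y = 2x + 1 and then reads the learned constants back. The data and variable names are invented for this example.

```python
import numpy
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly.
x = numpy.array([[0.0], [1.0], [2.0], [3.0]])
y = numpy.array([1.0, 3.0, 5.0, 7.0])

toy_model = LinearRegression()
toy_model.fit(x, y)

# The learned constants: one coefficient per predictor column, plus an intercept.
slope = toy_model.coef_[0]
intercept = toy_model.intercept_

# Use the learned equation on a value the model has not seen.
prediction = toy_model.predict(numpy.array([[4.0]]))[0]

print(slope, intercept, prediction)
```

On real data the points will not fall exactly on a line, so the model settles on the constants that give the smallest squared error.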

Predicting error

After we train the model, we can make predictions on new data with it. This new data has to be in the exact same format as the training data, or the model won’t make accurate predictions. Our testing set is identical to the training set (except the rows contain different board games). We select the same subset of columns from the test set, and then make predictions on it.

# Import the scikit-learn function to compute error.
from sklearn.metrics import mean_squared_error

# Generate our predictions for the test set.
predictions = model.predict(test[columns])

# Compute error between our test predictions and the actual values.
mean_squared_error(test[target], predictions)
1.8239281903519875

Once we have the predictions, we’re able to compute error between the test set predictions and the actual values. Mean squared error has the formula $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. Basically, we subtract each predicted value from the actual value, square the differences, and add them together. Then we divide the result by the total number of predicted values. This will give us the average squared error for each prediction.
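The calculation can be followed step by step on three made-up predictions. The sketch below computes the value by hand with NumPy and checks it against Scikit-learn's mean_squared_error. The numbers are invented for illustration.

```python
import numpy
from sklearn.metrics import mean_squared_error

# Made-up actual values and predictions.
actual = numpy.array([3.0, 5.0, 7.0])
predicted = numpy.array([2.0, 5.0, 9.0])

# Subtract each predicted value from the actual value: 1, 0, -2.
differences = actual - predicted

# Square the differences, add them, and divide by the count: (1 + 0 + 4) / 3.
manual_mse = numpy.mean(differences ** 2)

# The library function should give the same number.
library_mse = mean_squared_error(actual, predicted)

print(manual_mse, library_mse)
```

Squaring means that the prediction that was off by 2 contributes four times as much as the one that was off by 1.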

Trying a different model

One of the nice things about Scikit-learn is that it enables us to try more powerful algorithms very easily. One such algorithm is called random forest. The random forest algorithm can find nonlinearities in data that a linear regression wouldn’t be able to pick up on. Say, for example, that if the minage of a game is less than 5, the rating is low, if it’s 5-10, it’s high, and if it’s 10-15, it’s low. A linear regression algorithm wouldn’t be able to pick up on this because there isn’t a linear relationship between the predictor and the target. Predictions made with a random forest often have less error than predictions made by a linear regression.

# Import the random forest model.
from sklearn.ensemble import RandomForestRegressor

# Initialize the model with some parameters.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=1)

# Fit the model to the data.
model.fit(train[columns], train[target])

# Make predictions.
predictions = model.predict(test[columns])

# Compute the error.
mean_squared_error(test[target], predictions)
1.4144905030983794
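The minage scenario described earlier can be simulated with made-up data. In the sketch below, the rating is high only for ages 5 to 9. Both models are fitted and scored on the same toy data, purely to show which shapes each model can represent. This is not a real error estimate, and the ratings of 3 and 8 are invented.

```python
import numpy
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up data: ages 0-14 repeated 20 times, with a high rating only for ages 5-9.
ages = numpy.tile(numpy.arange(15), 20)
ratings = numpy.where((ages >= 5) & (ages < 10), 8.0, 3.0)
features = ages.reshape(-1, 1)

linear = LinearRegression().fit(features, ratings)
forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(features, ratings)

# Score each model on the toy data it was fitted to (illustration only).
linear_mse = mean_squared_error(ratings, linear.predict(features))
forest_mse = mean_squared_error(ratings, forest.predict(features))

print(linear_mse, forest_mse)
```

The straight line cannot go low, then high, then low again, so its error stays large, while the forest can split the age range into separate pieces.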

Further exploration

We’ve managed to go from data in csv format to making predictions. Here are some ideas for further exploration:

  • Try a support vector machine.
  • Try ensembling multiple models to create better predictions.
  • Try predicting a different column, such as average_weight.
  • Generate features from the text, such as length of the name of the game, number of words, etc.
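As a starting point for the last idea, here is a minimal sketch that derives two numeric features from a name column. The three names and the new column names (name_length, name_words) are assumptions made for this illustration.

```python
import pandas

# A made-up dataframe with only a name column.
toy_games = pandas.DataFrame({
    "name": ["Twilight Struggle", "Terra Mystica", "Go"],
})

# Number of characters in each name.
toy_games["name_length"] = toy_games["name"].str.len()

# Number of whitespace-separated words in each name.
toy_games["name_words"] = toy_games["name"].str.split().str.len()

print(toy_games)
```

Because the new columns are numeric, they could be added to the list of predictor columns like any other.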

Want to learn more about machine learning?

At Dataquest, we offer interactive lessons on machine learning and data science. We believe in learning by doing, and you’ll learn interactively in your browser by analyzing real data and building projects. Check out our machine learning lessons here.
