How to Analyze Tweet Sentiments with PHP Machine Learning

This article was peer reviewed by Wern Ancheta. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!

As of late, it seems everyone and their proverbial grandma is talking about Machine Learning. Your social media feeds are inundated with posts about ML, Python, TensorFlow, Spark, Scala, Go and so on; and if you are anything like me, you might be wondering, what about PHP?

Yes, what about Machine Learning and PHP? Fortunately, someone was crazy enough not only to ask that question, but also to develop a generic machine learning library that we can use in our next project. In this post we are going to take a look at PHP-ML, a machine learning library for PHP, and we’ll write a sentiment analysis class that we can later reuse for our own chat or tweet bot. The main goals of this post are:

  • Explore the general concepts around Machine Learning and Sentiment Analysis
  • Review the capabilities and shortcomings of PHP-ML
  • Define the problem we are going to work on
  • Prove that trying to do Machine Learning in PHP isn’t a completely crazy goal (optional)
A robot elephpant

What is Machine Learning?

Machine learning is a subset of Artificial Intelligence that focuses on giving “computers the ability to learn without being explicitly programmed”. This is achieved by using generic algorithms that can “learn” from a particular set of data.

For example, one common usage of machine learning is classification. Classification algorithms are used to put data into different groups or categories. Some examples of classification applications are:

  • Email spam filters
  • Market segmentation
  • Fraud detection

Machine learning is something of an umbrella term that covers many generic algorithms for different tasks. There are two main types of algorithm, classified by how they learn: supervised learning and unsupervised learning.

Supervised Learning

In supervised learning, we train our algorithm using labelled data in the form of an input object (vector) and a desired output value; the algorithm analyzes the training data and produces what is referred to as an inferred function which we can apply to a new, unlabelled dataset.
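
To make the idea of labelled inputs and an inferred function concrete, here is a minimal sketch in plain PHP (7.1+). It is not how PHP-ML works internally: the training vectors, the labels, and the predictNearest helper are all made up for illustration, and the "inferred function" is just a 1-nearest-neighbour lookup.

```php
<?php
// Hypothetical labelled training data: [input vector, desired output].
$training = [
    [[1.0, 1.0], 'negative'],
    [[1.5, 0.5], 'negative'],
    [[5.0, 5.0], 'positive'],
    [[6.0, 4.5], 'positive'],
];

// The "inferred function": label a new input with the label of the
// closest training vector (squared Euclidean distance).
function predictNearest(array $training, array $input): string
{
    $bestLabel = '';
    $bestDistance = INF;

    foreach ($training as [$vector, $label]) {
        $distance = 0.0;
        foreach ($vector as $i => $value) {
            $distance += ($value - $input[$i]) ** 2;
        }
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $bestLabel = $label;
        }
    }

    return $bestLabel;
}

// Apply the function to new, unlabelled inputs.
echo predictNearest($training, [5.5, 5.0]), PHP_EOL;
echo predictNearest($training, [1.2, 0.9]), PHP_EOL;
```

The important part is the shape of the data: every training example pairs an input with a known answer, which is exactly what the labelled tweets will give us later on.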

For the remainder of this post we will focus on supervised learning, simply because it’s easier to see and validate the relationship between inputs and outputs. Keep in mind that both approaches are equally important and interesting; one could argue that unsupervised learning is more useful because it removes the need for labelled data.

Unsupervised Learning

Unsupervised learning, on the other hand, works with unlabelled data from the get-go. We don’t know the desired output values of the dataset, so we let the algorithm draw inferences from the data itself; unsupervised learning is especially handy when doing exploratory data analysis to find hidden patterns in the data.

PHP-ML

Meet PHP-ML, a library that claims to be a fresh approach to Machine Learning in PHP. The library implements algorithms, neural networks, and tools to do data pre-processing, cross validation, and feature extraction.

I’ll be the first to admit PHP is an unusual choice for machine learning, as the language’s strengths are not that well suited for Machine Learning applications. That said, not every machine learning application needs to process petabytes of data and do massive calculations – for simple applications, we should be able to get away with using PHP and PHP-ML.

The best use case that I can see for this library right now is the implementation of a classifier, be it something like a spam filter or even sentiment analysis. We are going to define a classification problem and build a solution step by step to see how we can use PHP-ML in our projects.

The Problem

To exemplify the process of implementing PHP-ML and adding some machine learning to our applications, I wanted to find a fun problem to tackle. What better way to showcase a classifier than building a tweet sentiment analysis class?

One of the key requirements needed to build successful machine learning projects is a decent starting dataset. Datasets are critical since they will allow us to train our classifier against already classified examples. As there has recently been significant noise in the media around airlines, what better dataset to use than tweets from customers to airlines?

Fortunately, a dataset of tweets is already available to us thanks to Kaggle.io. The Twitter US Airline Sentiment dataset can be downloaded from their site using this link.

The Solution

Let’s begin by taking a look at the dataset we will be working on. The raw dataset has the following columns:

  • tweet_id
  • airline_sentiment
  • airline_sentiment_confidence
  • negativereason
  • negativereason_confidence
  • airline
  • airline_sentiment_gold
  • name
  • negativereason_gold
  • retweet_count
  • text
  • tweet_coord
  • tweet_created
  • tweet_location
  • user_timezone

The data looks like the following example:

| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
| 570306133677760513 | neutral | 1.0 |  |  | Virgin America |  | cairdin |  | 0 | @VirginAmerica What @dhepburn said. |  | 2015-02-24 11:35:52 -0800 |  | Eastern Time (US & Canada) |
| 570301130888122368 | positive | 0.3486 |  | 0.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica plus you’ve added commercials to the experience… tacky. |  | 2015-02-24 11:15:59 -0800 |  | Pacific Time (US & Canada) |
| 570301083672813571 | neutral | 0.6837 |  |  | Virgin America |  | yvonnalynn |  | 0 | @VirginAmerica I didn’t today… Must mean I need to take another trip! |  | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America |  | jnardino |  | 0 | “@VirginAmerica it’s really aggressive to blast obnoxious “”entertainment”” in your guests’ faces & they have little recourse” |  | 2015-02-24 11:15:36 -0800 |  | Pacific Time (US & Canada) |
| 570300817074462722 | negative | 1.0 | Can’t Tell | 1.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica and it’s a really big bad thing about it |  | 2015-02-24 11:14:45 -0800 |  | Pacific Time (US & Canada) |
| 570300767074181121 | negative | 1.0 | Can’t Tell | 0.6842 | Virgin America |  | jnardino |  | 0 | “@VirginAmerica seriously would pay $30 a flight for seats that didn’t have this playing. it’s really the only bad thing about flying VA” |  | 2015-02-24 11:14:33 -0800 |  | Pacific Time (US & Canada) |
| 570300616901320704 | positive | 0.6745 |  | 0.0 | Virgin America |  | cjmcginnis |  | 0 | “@VirginAmerica yes nearly every time I fly VX this “ear worm” won’t go away :)” |  | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 570300248553349120 | neutral | 0.634 |  |  | Virgin America |  | pilot |  | 0 | “@VirginAmerica Really missed a prime opportunity for Men Without Hats parody there. https://t.co/mWpG7grEZP” |  | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |

The file contains 14,640 tweets, so it’s a decent dataset for us to work with. With all of these columns available, we have far more data than we need for our example; for practical purposes we only care about the following columns:

  • text
  • airline_sentiment

Here, text will become our feature and airline_sentiment will become our target. The rest of the columns can be discarded as they will not be used in our exercise. Let’s start by creating the project and initializing Composer with the following composer.json file:

{
    "name": "amacgregor/phpml-exercise",
    "description": "Example implementation of a Tweet sentiment analysis with PHP-ML",
    "type": "project",
    "require": {
        "php-ai/php-ml": "^0.4.1"
    },
    "license": "Apache License 2.0",
    "authors": [
        {
            "name": "Allan MacGregor",
            "email": "amacgregor@allanmacgregor.com"
        }
    ],
    "autoload": {
        "psr-4": {"PhpmlExercise\\": "src/"}
    },
    "minimum-stability": "dev"
}

composer install

If you need an introduction to Composer, see here.

To make sure we are set up correctly, let’s create a quick script that will load our Tweets.csv data file and make sure it has the data we need. Copy the following code as reviewDataset.php in the root of our project:

<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Dataset\CsvDataset;

$dataset = new CsvDataset('datasets/raw/Tweets.csv',1);

foreach ($dataset->getSamples() as $sample) {
    print_r($sample);
}

Now, run the script with php reviewDataset.php, and let’s review the output:

Array( [0] => 569587371693355008 )
Array( [0] => 569587242672398336 )
Array( [0] => 569587188687634433 )
Array( [0] => 569587140490866689 )

Now that doesn’t look useful, does it? Let’s take a look at the CsvDataset class to get a better idea of what’s happening internally:

<?php 

    public function __construct(string $filepath, int $features, bool $headingRow = true)
    {
        if (!file_exists($filepath)) {
            throw FileException::missingFile(basename($filepath));
        }

        if (false === $handle = fopen($filepath, 'rb')) {
            throw FileException::cantOpenFile(basename($filepath));
        }

        if ($headingRow) {
            $data = fgetcsv($handle, 1000, ',');
            $this->columnNames = array_slice($data, 0, $features);
        } else {
            $this->columnNames = range(0, $features - 1);
        }

        while (($data = fgetcsv($handle, 1000, ',')) !== false) {
            $this->samples[] = array_slice($data, 0, $features);
            $this->targets[] = $data[$features];
        }
        fclose($handle);
    }

The CsvDataset constructor takes 3 arguments:

  • A file path to the source CSV
  • An integer that specifies the number of features in our file
  • A boolean to indicate whether the first row is a header

If we look a little closer we can see that the class is mapping out the CSV file into two internal arrays: samples and targets. Samples contains all the features provided by the file and targets contains the known values (negative, positive, or neutral).
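
The slicing logic from the constructor can be reproduced on a small in-memory array to see how each row is divided. This is only a sketch of the same array_slice pattern shown above; the $rows values are hypothetical stand-ins for lines read from a CSV file.

```php
<?php
// Hypothetical in-memory rows standing in for a CSV file:
// feature columns first, target column right after them.
$rows = [
    ['text', 'airline_sentiment'], // heading row
    ['@VirginAmerica What @dhepburn said.', 'neutral'],
    ['@VirginAmerica and it’s a really big bad thing about it', 'negative'],
];

$features = 1; // number of leading columns treated as features

// The first row supplies the column names, as when $headingRow is true.
$columnNames = array_slice(array_shift($rows), 0, $features);

$samples = [];
$targets = [];
foreach ($rows as $data) {
    $samples[] = array_slice($data, 0, $features); // first $features columns
    $targets[] = $data[$features];                 // the column right after them
}

print_r($columnNames);
print_r($samples);
print_r($targets);
```

This also explains the earlier output: with the raw file and a feature count of 1, the only feature is the first column (tweet_id), and the target is whatever happens to sit in the second column.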

Based on the above, our CSV file needs to follow this format:

| feature_1 | feature_2 | feature_n | target |

We will need to generate a clean dataset containing only the columns we need. Let’s call this script generateCleanDataset.php:

<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Exception\FileException;

$sourceFilepath         = __DIR__ . '/datasets/raw/Tweets.csv';
$destinationFilepath    = __DIR__ . '/datasets/clean_tweets.csv';

$rows =[];

$rows = getRows($sourceFilepath, $rows);
writeRows($destinationFilepath, $rows);


/**
 * @param $filepath
 * @param $rows
 * @return array
 */
function getRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath);

    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $rows[] = [$data[10], $data[1]];
    }
    fclose($handle);
    return $rows;
}

/**
 * @param $filepath
 * @param string $mode
 * @return bool|resource
 * @throws FileException
 */
function checkFilePermissions($filepath, $mode = 'rb')
{
    if (!file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }

    if (false === $handle = fopen($filepath, $mode)) {
        throw FileException::cantOpenFile(basename($filepath));
    }
    return $handle;
}

/**
 * @param $filepath
 * @param $rows
 * @internal param $list
 */
function writeRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath, 'wb');

    foreach ($rows as $row) {
        fputcsv($handle, $row);
    }

    fclose($handle);
}

Nothing too complex, just enough to do the job. Let’s execute it with php generateCleanDataset.php.

Now, let’s go ahead and point our reviewDataset.php script at the clean dataset (datasets/clean_tweets.csv) and run it again:

Array
(
    [0] => @AmericanAir That will be the third time I have been called by 800-433-7300 an hung on before anyone speaks. What do I do now???
)
Array
(
    [0] => @AmericanAir How clueless is AA. Been waiting to hear for 2.5 weeks about a refund from a Cancelled Flightled flight &amp; been on hold now for 1hr 49min
)

BAM! This is data we can work with! So far, we have been creating simple scripts to manipulate the data. Next, we are going to start creating a new class under src/classification/SentimentAnalysis.php.

<?php
namespace PhpmlExercise\Classification;

/**
 * Class SentimentAnalysis
 * @package PhpmlExercise\Classification
 */
class SentimentAnalysis { 
    public function train() {}
    public function predict() {}
}

Our SentimentAnalysis class will need two methods:

  • A train function, which will take our dataset training samples and labels and some optional parameters.

  • A predict function, which will take an unlabelled dataset and assign a set of labels based on the training data.

In the root of the project, create a script called classifyTweets.php. We will use this script to instantiate and test our sentiment analysis class. Here is the template that we will use:

<?php

namespace PhpmlExercise;
use PhpmlExercise\Classification\SentimentAnalysis;

require __DIR__ . '/vendor/autoload.php';

// Step 1: Load the Dataset

// Step 2: Prepare the Dataset

// Step 3: Generate the training/testing Dataset

// Step 4: Train the classifier 

// Step 5: Test the classifier accuracy

Step 1: Load the Dataset

We already have the basic code that we can use for loading a CSV into a dataset object from our earlier examples. We are going to use the same code with a few tweaks:

<?php
...
use Phpml\Dataset\CsvDataset;
...
$dataset = new CsvDataset('datasets/clean_tweets.csv',1);

$samples = [];
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}

This generates a flat array with only the features – in this case the tweet text – which we are going to use to train our classifier.
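
To see what the flattening does, the sketch below runs the same loop on a small hypothetical array shaped like the samples printed earlier (one single-element array per row), and compares it with PHP's built-in array_column, which should give the same result here.

```php
<?php
// Hypothetical samples: one single-element array per tweet.
$nested = [
    ['@AmericanAir How clueless is AA.'],
    ['@VirginAmerica What @dhepburn said.'],
];

// Loop version, as in the script above.
$samples = [];
foreach ($nested as $sample) {
    $samples[] = $sample[0];
}

// Equivalent one-liner: pull column 0 out of every row.
$sameSamples = array_column($nested, 0);

var_dump($samples === $sameSamples);
```

Either form leaves us with a plain list of strings, one per tweet.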

Step 2: Prepare the Dataset

Now, having the raw text and passing that to a classifier wouldn’t be useful or accurate since every tweet is essentially different. Fortunately, there are ways of dealing with text when trying to apply classification or machine learning algorithms. For this example, we are going to make use of the following two classes:

  • Token Count Vectorizer: This transforms a collection of text samples into vectors of token counts. Essentially, every word in our tweets is assigned a unique index, and each vector records how many times each word occurs in a specific text sample.

  • Tf-idf Transformer: Tf-idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
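
As a rough illustration of token counting, here is a plain PHP sketch that builds a vocabulary and one count vector per sample. It is not PHP-ML's implementation: the tokenCounts function, the lowercasing, and the split on non-word characters are assumptions made for this example, and PHP-ML's tokenizer may split text differently.

```php
<?php
// Build a vocabulary (word => column index) and a count vector per sample.
function tokenCounts(array $samples): array
{
    $vocabulary = [];
    $tokenized = [];

    foreach ($samples as $text) {
        // Assumed tokenization: lowercase, split on runs of non-word characters.
        $tokens = preg_split('/\W+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary); // next free index
            }
        }
        $tokenized[] = $tokens;
    }

    $vectors = [];
    foreach ($tokenized as $tokens) {
        $vector = array_fill(0, count($vocabulary), 0);
        foreach ($tokens as $token) {
            $vector[$vocabulary[$token]]++;
        }
        $vectors[] = $vector;
    }

    return [$vocabulary, $vectors];
}

[$vocabulary, $vectors] = tokenCounts(['bad flight bad service', 'great flight']);

print_r($vocabulary); // bad => 0, flight => 1, service => 2, great => 3
print_r($vectors);    // [2, 1, 1, 0] and [0, 1, 0, 1]
```

Every sample ends up as a vector of the same length, which is what lets a classifier compare tweets that share no exact wording.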

Let’s start with our text vectorizer:

<?php
...
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;

...
$vectorizer = new TokenCountVectorizer(new WordTokenizer());

$vectorizer->fit($samples);
$vectorizer->transform($samples);

Next, apply the Tf-idf Transformer:

<?php
...

use Phpml\FeatureExtraction\TfIdfTransformer;
...
$tfIdfTransformer = new TfIdfTransformer();

$tfIdfTransformer->fit($samples);
$tfIdfTransformer->transform($samples);
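
For intuition about what a tf-idf step does to count vectors, here is a plain PHP sketch using one common formulation, weight = count × ln(N / df), where N is the number of samples and df is the number of samples containing the word. The tfIdf function is hypothetical, and PHP-ML's exact formula (for example its logarithm base or smoothing) may differ.

```php
<?php
// Re-weight token counts so that words appearing in every sample count for less.
function tfIdf(array $countVectors): array
{
    $documentCount = count($countVectors);

    // Document frequency: in how many samples does each column appear at all?
    $documentFrequency = [];
    foreach ($countVectors as $vector) {
        foreach ($vector as $index => $count) {
            $documentFrequency[$index] = ($documentFrequency[$index] ?? 0) + ($count > 0 ? 1 : 0);
        }
    }

    $weighted = [];
    foreach ($countVectors as $vector) {
        $row = [];
        foreach ($vector as $index => $count) {
            $df = $documentFrequency[$index];
            $idf = $df > 0 ? log($documentCount / $df) : 0.0; // natural log
            $row[$index] = $count * $idf;
        }
        $weighted[] = $row;
    }

    return $weighted;
}

// Columns: bad, flight, service, great
$weighted = tfIdf([[2, 1, 1, 0], [0, 1, 0, 1]]);
print_r($weighted);
```

In this toy input the word in column 1 occurs in both samples, so its weight drops to zero, while words unique to one sample keep a positive weight.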

Our samples array is now in a format that can easily be understood by our classifier. We are not done yet, though: we need to label each sample with its corresponding sentiment.

Step 3: Generate the Training Dataset

Fortunately, PHP-ML has this need already covered and the code is quite simple:

<?php
...
use Phpml\Dataset\ArrayDataset;
...
$dataset = new ArrayDataset($samples, $dataset->getTargets());

We could go ahead and use this dataset and train our classifier. We are missing a testing dataset to use as validation, however, so we are going to “cheat” a little bit and split our original dataset into two: a training dataset and a much smaller dataset that will be used for testing the accuracy of our model.

<?php
...
use Phpml\CrossValidation\StratifiedRandomSplit;
...
$randomSplit = new StratifiedRandomSplit($dataset, 0.1);

$trainingSamples = $randomSplit->getTrainSamples();
$trainingLabels     = $randomSplit->getTrainLabels();

$testSamples = $randomSplit->getTestSamples();
$testLabels      = $randomSplit->getTestLabels();

This approach is called cross-validation. The term comes from statistics and can be defined as follows:

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. — Wikipedia.com
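
Strictly speaking, the split used in this step is a single stratified hold-out split; fuller cross-validation schemes such as k-fold repeat the split several times and average the results. The plain PHP sketch below shows the stratified idea only: each label keeps roughly the same proportion in the training and test sets. The stratifiedSplit function, its seed parameter, and the rounding rule are assumptions for illustration, not PHP-ML's code.

```php
<?php
// Split samples into train/test sets while preserving label proportions.
function stratifiedSplit(array $samples, array $labels, float $testSize, int $seed): array
{
    mt_srand($seed); // make the shuffle repeatable for a given seed

    // Group sample indices by label.
    $indicesByLabel = [];
    foreach ($labels as $index => $label) {
        $indicesByLabel[$label][] = $index;
    }

    $split = ['trainSamples' => [], 'trainLabels' => [], 'testSamples' => [], 'testLabels' => []];

    foreach ($indicesByLabel as $indices) {
        shuffle($indices);
        // Assumed rounding rule: nearest whole number of test samples per label.
        $testCount = (int) round(count($indices) * $testSize);

        foreach ($indices as $position => $index) {
            $target = $position < $testCount ? 'test' : 'train';
            $split[$target . 'Samples'][] = $samples[$index];
            $split[$target . 'Labels'][] = $labels[$index];
        }
    }

    return $split;
}

$samples = range(1, 10);
$labels = array_merge(array_fill(0, 6, 'negative'), array_fill(0, 4, 'positive'));

$split = stratifiedSplit($samples, $labels, 0.5, 42);
print_r(array_count_values($split['testLabels']));
```

With six negative and four positive labels and a test size of 0.5, the test set receives three negative and two positive samples, whichever individual samples the shuffle happens to pick.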

Step 4: Train the Classifier

Finally, we are ready to go back and implement our SentimentAnalysis class. If you haven’t noticed by now, a huge part of machine learning is about gathering and manipulating the data; the actual implementation of the Machine learning models tends to be a lot less involved.

To implement our sentiment analysis class, we have three classification algorithms available:

  • Support Vector Classification
  • KNearestNeighbors
  • NaiveBayes

For this exercise we are going to use the simplest of them all, the NaiveBayes classifier, so let’s go ahead and update our class to implement the train method:

<?php

namespace PhpmlExercise\Classification;
use Phpml\Classification\NaiveBayes;

class SentimentAnalysis
{
    protected $classifier;

    public function __construct()
    {
        $this->classifier = new NaiveBayes();
    }
    public function train($samples, $labels)
    {
        $this->classifier->train($samples, $labels);
    }
}
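
For a feel of what a Naive Bayes classifier does when it trains and predicts, here is a minimal multinomial variant over token-count vectors in plain PHP (7.4+). It is a teaching sketch under stated assumptions, not PHP-ML's implementation: the TinyNaiveBayes class, the Laplace smoothing, and the toy data are all hypothetical, and PHP-ML's NaiveBayes may model features differently.

```php
<?php
// Minimal multinomial Naive Bayes over token-count vectors (illustrative only).
final class TinyNaiveBayes
{
    private array $logPriors = [];      // label => log P(label)
    private array $logLikelihoods = []; // label => [feature index => log P(feature | label)]

    public function train(array $samples, array $labels): void
    {
        $featureCount = count($samples[0]);
        $classTotals = []; // label => number of samples
        $tokenTotals = []; // label => summed counts per feature

        foreach ($samples as $i => $vector) {
            $label = $labels[$i];
            $classTotals[$label] = ($classTotals[$label] ?? 0) + 1;
            $tokenTotals[$label] = $tokenTotals[$label] ?? array_fill(0, $featureCount, 0);
            foreach ($vector as $j => $count) {
                $tokenTotals[$label][$j] += $count;
            }
        }

        foreach ($classTotals as $label => $total) {
            $this->logPriors[$label] = log($total / count($samples));
            // Laplace (add-one) smoothing so unseen words never get probability zero.
            $denominator = array_sum($tokenTotals[$label]) + $featureCount;
            foreach ($tokenTotals[$label] as $j => $count) {
                $this->logLikelihoods[$label][$j] = log(($count + 1) / $denominator);
            }
        }
    }

    public function predict(array $vector): string
    {
        $bestLabel = '';
        $bestScore = -INF;

        foreach ($this->logPriors as $label => $logPrior) {
            $score = $logPrior;
            foreach ($vector as $j => $count) {
                $score += $count * $this->logLikelihoods[$label][$j];
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestLabel = $label;
            }
        }

        return $bestLabel;
    }
}

// Columns: bad, flight, service, great
$model = new TinyNaiveBayes();
$model->train(
    [[2, 1, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 0, 2]],
    ['negative', 'positive', 'negative', 'positive']
);

echo $model->predict([1, 0, 0, 0]), PHP_EOL; // a tweet containing only "bad"
echo $model->predict([0, 0, 0, 1]), PHP_EOL; // a tweet containing only "great"
```

Training boils down to counting how often each word appears under each label; prediction picks the label under which the observed words are most probable.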

As you can see, we are letting PHP-ML do all the heavy lifting for us. We are just creating a nice little abstraction for our project. But how do we know if our classifier is actually training and working? Time to use our testSamples and testLabels.

Step 5: Test the Classifier’s Accuracy

Before we can proceed with testing our classifier, we do have to implement the prediction method:

<?php
...
class SentimentAnalysis
{
...
    public function predict($samples)
    {
        return $this->classifier->predict($samples);
    }
}

And again, PHP-ML is doing us a solid and doing all the heavy lifting for us. Let’s update our classifyTweets.php script accordingly:

<?php
...
$classifier = new SentimentAnalysis();
$classifier->train($trainingSamples, $trainingLabels);

$predictedLabels = $classifier->predict($testSamples);

Finally, we need a way to test the accuracy of our trained model; thankfully PHP-ML has that covered too, and they have several metrics classes. In our case, we are interested in the accuracy of the model. Let’s take a look at the code:

<?php
...
use Phpml\Metric\Accuracy;
...
echo 'Accuracy: '.Accuracy::score($testLabels, $predictedLabels);

We should see something along the lines of:

Accuracy: 0.73651877133106
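
The score is a fraction between 0 and 1, so a value near 0.74 means roughly 74% of the test tweets were labelled correctly. As a sketch of what an accuracy metric computes, here is a plain PHP version; the accuracy function and the sample labels are hypothetical, not PHP-ML's code.

```php
<?php
// Fraction of predictions that match the known labels.
function accuracy(array $actualLabels, array $predictedLabels): float
{
    $total = count($actualLabels);
    if ($total === 0 || $total !== count($predictedLabels)) {
        throw new InvalidArgumentException('Both label arrays must be non-empty and the same size.');
    }

    // Re-index so both arrays can be compared position by position.
    $actualLabels = array_values($actualLabels);
    $predictedLabels = array_values($predictedLabels);

    $correct = 0;
    foreach ($actualLabels as $i => $label) {
        if ($label === $predictedLabels[$i]) {
            $correct++;
        }
    }

    return $correct / $total;
}

$score = accuracy(
    ['negative', 'positive', 'neutral', 'negative'],
    ['negative', 'positive', 'negative', 'negative']
);

// Multiply by 100 when a percentage is wanted.
printf("Accuracy: %.2f%%\n", $score * 100); // prints "Accuracy: 75.00%"
```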

Conclusion

This article fell a bit on the long side, so let’s do a recap of what we’ve learned so far:

  • Having a good dataset from the start is critical for implementing machine learning algorithms.
  • The difference between supervised learning and unsupervised learning.
  • The meaning and use of cross-validation in machine learning.
  • That vectorization and transformation are essential to prepare text datasets for machine learning.
  • How to implement a Twitter sentiment analysis by using PHP-ML’s NaiveBayes classifier.

This post also served as an introduction to the PHP-ML library and hopefully gave you a good idea of what the library can do and how it can be embedded in your own projects.

Finally, this post is by no means comprehensive and there is plenty to learn, improve and experiment with; here are some ideas to get you started on how to improve things further:

  • Replace the NaiveBayes algorithm with the Support Vector Classification algorithm.
  • If you run this against the full dataset (more than 14,000 rows), you’ll probably notice how memory intensive the process can be. Try implementing model persistence so the model doesn’t have to be trained on each run.
  • Move the dataset generation to its own helper class.
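
As a starting point for the persistence idea, one generic approach in PHP is to serialize the trained object to a file and restore it on later runs. The sketch below (PHP 7.2+) uses a made-up TrainedModelStub class in place of a real classifier, and the saveModel and loadModel helpers are hypothetical. PHP-ML's documentation also describes a model manager for this purpose, so check what your installed version offers before rolling your own.

```php
<?php
// Stand-in for a trained classifier; hypothetical, used only to keep the sketch self-contained.
final class TrainedModelStub
{
    public array $weights;

    public function __construct(array $weights)
    {
        $this->weights = $weights;
    }
}

// Write the serialized model to disk.
function saveModel(object $model, string $path): void
{
    if (file_put_contents($path, serialize($model)) === false) {
        throw new RuntimeException("Could not write model to $path");
    }
}

// Read the model back, only allowing the class we expect to be instantiated.
function loadModel(string $path, string $expectedClass): object
{
    $payload = file_get_contents($path);
    if ($payload === false) {
        throw new RuntimeException("Could not read model from $path");
    }

    $model = unserialize($payload, ['allowed_classes' => [$expectedClass]]);
    if (!$model instanceof $expectedClass) {
        throw new RuntimeException("File $path does not contain a $expectedClass");
    }

    return $model;
}

$path = tempnam(sys_get_temp_dir(), 'model');
saveModel(new TrainedModelStub(['bad' => -1.5, 'great' => 0.25]), $path);

$restored = loadModel($path, TrainedModelStub::class);
unlink($path); // clean up the temporary file

print_r($restored->weights);
```

Restricting allowed_classes is a small safety measure: unserialize should only ever be fed files you created yourself.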

I hope you found this article useful. If you have some application ideas regarding PHP-ML or any questions, don’t hesitate to drop them below into the comments area!

Translated from: https://www.sitepoint.com/how-to-analyze-tweet-sentiments-with-php-machine-learning/
