Using Open Source Data and Machine Learning to Predict Ocean Temperatures

In this tutorial, we’re going to show you how to take open source data from the National Oceanic and Atmospheric Administration (NOAA), clean it, and forecast future temperatures using no-code machine learning methods.

This particular data comes from the Harmful Algal BloomS Observing System (HABSOS). There are several interesting questions to ask of this data, such as how algal blooms relate to fluctuations in water temperature. For this tutorial, we’re going to start with a basic question: can we predict what temperatures will be over the next five months?

The first part of this tutorial deals with acquiring and cleaning the dataset. There are a lot of approaches to this; what is shown below is just one approach. Further, if your dataset is already clean, you can skip all that “data engineering” and jump straight into no-code AI bliss :)

Step 1: Download & Clean the Data

First, we download the data from the HABSOS site linked above. For convenience, we are posting the file here as well.

This CSV has 21 columns, which we discovered with this bash command.

$ awk '{print NF}' habsos_20200310.csv | sort -nu | tail -n 1
21
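
As a cross-check on the column count, here is a minimal Python sketch. Note that awk splits on whitespace by default, so a CSV-aware count can be a useful second opinion. The helper name `max_column_count` and the sample rows (including the `description` column) are assumptions made for illustration; they are not taken from the HABSOS file.

```python
import csv
import io

def max_column_count(lines):
    """Return the largest number of fields found on any CSV row."""
    # csv.reader honours quoting, so a comma inside "..." is not a separator
    return max((len(row) for row in csv.reader(lines)), default=0)

# Hypothetical two-row sample; the quoted field contains a comma
sample = io.StringIO(
    'sample_date,description,water_temp\n'
    '1954-10-27 00:00:00,"bloom, dense",24.5\n'
)
print(max_column_count(sample))
```

To run it on the real file, open it with `open('habsos_20200310.csv', newline='')` and pass the file object to `max_column_count`.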

We’ll explore the rest of the data in subsequent tutorials, but, of these 21 columns, the only columns I’m interested in for now are:

  • sample_date
  • sample_depth
  • water_temp

In addition to only needing a subset of the columns in the data, there are other issues to deal with in order to get the data ready for analysis. We need to:

  • Remove rows with NaN values (i.e. empty values) in the water_temp column,
  • Select only the measurements made at a depth of 0.5 meters (to remove temperature variability due to ocean depth), and
  • Regularize the data periods by turning the datetime values into date values.

import pandas as pd

# Read every column as a string and skip malformed rows.
# on_bad_lines='skip' replaces error_bad_lines=False (pandas 1.3 and later).
df = pd.read_csv('habsos_20200310.csv', sep=',', on_bad_lines='skip', index_col=False, dtype=str)
pd.set_option('display.max_rows', None)

# Get only the columns we care about
dfSub = df[['sample_date', 'sample_depth', 'water_temp']]

# Remove the NaN values
dfClean = dfSub.dropna()

# Select 0.5 depth measurements only (copy to avoid SettingWithCopyWarning)
dfClean2 = dfClean.loc[dfClean['sample_depth'] == '0.5'].copy()

# Split the datetime values and keep only the date part
dfClean2['sample_date'] = dfClean2['sample_date'].str.split(expand=True)[0]

dfClean2.to_csv(r'/PATH/TO/YOUR/OUTPUT/out.csv', index=False)

There’s another big problem with this data: on certain days, there are multiple sensor readings; on other days, there are no sensor readings. Sometimes there are entire months without readings.

These problems are quicker to address in spreadsheets by using pivot tables. And, now that we have reduced the size of the data with the preceding Python script, we are able to load it into a Google Sheet.

We ended up making a pivot table with one entry for each month of each year (1954 to 2020) and taking the median water temperature for that month. We used median instead of average values in case there were wild outlier measurements that might skew our summarized data.
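
For readers who prefer to stay in Python, the same monthly-median summary can be sketched with pandas instead of a spreadsheet pivot table. This is only an illustration under assumptions: the function name `monthly_median` and the four sample rows are made up, and the real file may need a different date format.

```python
import pandas as pd

def monthly_median(df):
    """Collapse irregular readings into one median value per calendar month."""
    out = df.copy()
    # The cleaned CSV stores everything as text, so convert before aggregating
    out["sample_date"] = pd.to_datetime(out["sample_date"], errors="coerce")
    out["water_temp"] = pd.to_numeric(out["water_temp"], errors="coerce")
    out = out.dropna(subset=["sample_date", "water_temp"])
    # "MS" bins by month start; months without readings come out as NaN
    return out.set_index("sample_date")["water_temp"].resample("MS").median()

# Hypothetical readings: three in January (one outlier), none in February
sample = pd.DataFrame({
    "sample_date": ["2019-01-03", "2019-01-03", "2019-01-20", "2019-03-05"],
    "water_temp": ["18.0", "19.0", "30.0", "21.5"],
})
monthly = monthly_median(sample)
print(monthly)
```

In this sample the January outlier of 30.0 barely moves the median, whereas it would pull a mean upward, and the empty February shows up as a gap rather than silently disappearing.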

Our results are available for viewing in the third tab of this Google Sheet.

Let’s take those results and bring them into Monument!

Step 2: Chart the Data & Use No-Code Machine Learning to Generate a Forecast

To chart the data, we’re first going to load it into Monument (www.monument.ai). Monument is an artificial intelligence/machine learning platform that allows you to use advanced algorithms without touching a line of code.

First, we’re going to import our freshly cleaned data into Monument as a CSV file. In the INPUT tab, you’ll see the data as it exists in the source file on the top and the data as it will be imported into Monument on the bottom. If you’re satisfied with how it will be imported, click OK in the bottom right.

Figure: Load the data!

When you click OK, you’ll be brought into the MODEL tab. You can drag the “data pills” from the far left into the COLS(X) and ROWS(Y) areas to chart the data. You will clearly see the gaps in the data, where there were months with no temperature readings.

Figure: Monument’s algorithms can handle missing data.

This data has a visually recognizable pattern: it resembles a sine wave. In general — and especially when data has a repetitive pattern — it’s good to start an analysis with AutoRegression (AR). AR is one of the more “primitive” algorithms, but it often learns obvious patterns quickly.
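
To make the idea behind autoregression concrete, here is a minimal NumPy sketch that fits an AR model by least squares and rolls it forward. It is not Monument’s implementation, which is not described here; the function names, the order of 2, and the synthetic sine-wave "temperatures" are all assumptions chosen so the example is easy to check.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of x[t] = c + a1*x[t-1] + ... + a_order*x[t-order]."""
    x = np.asarray(series, dtype=float)
    # Each row holds the `order` values before the target, most recent first
    rows = [x[t - order:t][::-1] for t in range(order, len(x))]
    design = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coeffs, *_ = np.linalg.lstsq(design, x[order:], rcond=None)
    return coeffs  # [c, a1, ..., a_order]

def forecast_ar(series, coeffs, steps):
    """Roll the fitted model forward, feeding each prediction back in."""
    order = len(coeffs) - 1
    history = list(np.asarray(series, dtype=float))
    predictions = []
    for _ in range(steps):
        lags = history[-order:][::-1]
        nxt = coeffs[0] + float(np.dot(coeffs[1:], lags))
        predictions.append(nxt)
        history.append(nxt)
    return np.array(predictions)

# Synthetic, noise-free yearly cycle standing in for monthly temperatures
months = np.arange(240)
temps = 24.0 + 6.0 * np.sin(2 * np.pi * months / 12)

coeffs = fit_ar(temps[:-5], order=2)
pred = forecast_ar(temps[:-5], coeffs, steps=5)
print(np.round(pred, 2), np.round(temps[-5:], 2))
```

A clean sine wave is the easiest possible case for AR, which is why even a tiny model tracks it; real readings with noise and gaps would likely need a higher order and some handling of missing months.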

When we apply AR to the water temperature data by dragging it into the chart, we see a sharp divergence from the actual historical data early in the training period, but the algorithm quickly gets a handle on what is occurring in the dataset.

By the end of the training data, it almost perfectly overlays onto the training set. When an algorithm does a good job anticipating known historical data in the training period, it can be an indication that the algorithm will do well forecasting the future. (However, a concern is “overfitting,” which we will explore in future articles.)
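
One common way to guard against reading too much into a good training fit is to hold back the most recent months and score the forecast against them. The sketch below assumes a simple seasonal-naive forecaster (repeat the last yearly cycle) as a stand-in model; the function names and the synthetic series are illustrative and are not part of Monument.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def seasonal_naive(train, steps, period=12):
    """Forecast by repeating the last full seasonal cycle of the training data."""
    last_cycle = np.asarray(train, dtype=float)[-period:]
    return np.array([last_cycle[i % period] for i in range(steps)])

# Synthetic monthly series: a yearly cycle plus a little random noise
rng = np.random.default_rng(0)
months = np.arange(120)
temps = 24.0 + 6.0 * np.sin(2 * np.pi * months / 12) + rng.normal(0.0, 0.5, size=120)

holdout = 12
train, test = temps[:-holdout], temps[-holdout:]

# Error on months the model has "seen" versus months it has not
in_sample = mean_absolute_error(train[12:], train[:-12])
out_of_sample = mean_absolute_error(test, seasonal_naive(train, holdout))
print(f"in-sample MAE: {in_sample:.2f}, held-out MAE: {out_of_sample:.2f}")
```

If the held-out error is much larger than the in-sample error, the close overlay during training is probably flattering the model.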

Figure: Off to a good start!

Now, let’s try a Dynamic Linear Model (DLM). DLM is a slightly more complex algorithm — let’s see if it gets us even better results. When we drag DLM into the chart, we notice immediately that something seems off: DLM appears out of sync with the training data. It has trouble anticipating where the peaks and troughs are in the historical data.

Figure: Uh oh…

If we zoom in by dragging the windowing widget below the chart and mute the AR results by clicking the color box above the chart, the effect is even more pronounced. The historical data and DLM are out of sync, so it’s unlikely that the forecasted results beyond the historical data will be reliable.

Figure: Not looking good…

Let’s try Time-Varying AutoRegression (TVAR). It looks like it produces similar results to AR.

Figure: Looking good.

Now, let’s try Long Short-Term Memory (LSTM). This is way off! An LSTM often produces great results for “noisier” data that has less regular patterns. However, on highly patterned data like this dataset, it has trouble.

There are ways to improve the performance of the LSTM (and any algorithm) by adjusting the algorithm’s parameters, but we already have algorithms performing well, so it doesn’t seem worth the effort.

Figure: The LSTM has forsaken us…

Now, let’s zoom in to see what we are working with by using the windowing widget on the bottom of the chart. Let’s also click the circles icon in the top right of Monument and select “forecast” to remove the training period and only show the prediction.

TVAR looked good when zoomed out, but up close our algorithms seem to agree with one another, with the exception of TVAR. Let’s drop TVAR.

Figure: TVAR does not look so good up close.

Let’s bring back “training+forecast,” remove everything but AR, and apply the Gaussian Dynamic Boltzmann Machine (G-DyBM). Things are looking pretty good now :)

Figure: The sweet spot.

Let’s flip over to the OUTPUT tab and scroll to the bottom to see our forecasts. Because we made our data periods monthly, p1, p2, p3, p4, and p5 are Month-1, Month-2, Month-3, Month-4, and Month-5 into the future.
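
To tie the period labels back to calendar months, a small pandas sketch can help. The helper name `label_forecast_periods` and the last observed month of March 2020 are assumptions for illustration; substitute the final month that actually appears in your own data.

```python
import pandas as pd

def label_forecast_periods(last_observed_month, steps=5):
    """Map p1..pN to the calendar months that follow the last observed month."""
    # MonthBegin(1) moves to the first day of the next month
    start = pd.Timestamp(last_observed_month) + pd.offsets.MonthBegin(1)
    months = pd.date_range(start=start, periods=steps, freq="MS")
    return {f"p{i + 1}": m.strftime("%Y-%m") for i, m in enumerate(months)}

# Assumed last month of data; not confirmed by the dataset itself
labels = label_forecast_periods("2020-03-01")
print(labels)
```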

In this tutorial, we took open source data from the internet, cleaned it, loaded it into Monument, and — in minutes! — used advanced data science methods to get forecasts for future median monthly water temperatures in the Gulf of Mexico at a depth of 0.5 meters.

You can download the .mai file of our results from this link.

In the next tutorial, we’ll look deeper at the error rates for each of the algorithms we tried above and discuss why we might select one algorithm over another. We’ll also calculate the standard deviation for the outliers and discuss why this is important.

Translated from: https://medium.com/swlh/using-open-source-data-machine-learning-to-predict-ocean-temperatures-2c8d65165665
