Classifying Galaxies Using a Multilayered Perceptron Model

Latest recommended article published 2024-06-09 09:53:44

weixin_26704853

Tags: machine learning, tensorflow, artificial intelligence, neural networks, java

Original link: https://medium.com/swlh/classifying-galaxies-using-multilayered-perceptron-model-6e1c8fe5b044


The cosmos is an intriguing space to observe and analyse; it underpins every science we have discovered so far. Galaxies are mines of immense data clusters, waiting to be explored in this infinite space. The picture above is the result of a one-million-second exposure by the Hubble Space Telescope, known as the Hubble Ultra Deep Field, which reveals some of the earliest galaxies. Each star-like object in the photograph is actually an entire galaxy.

The first galaxy was observed by the Persian astronomer Abd al-Rahman over 1,000 years ago, and it was at first believed to be an unknown extended structure. It is now known as Messier 31, the famous Andromeda Galaxy. From that point on, such unknown structures were observed and recorded more frequently, but it took astronomers over nine centuries to agree that they were not ordinary astronomical objects but entire galaxies. As more of these galaxies were discovered, astronomers noticed their divergent morphologies. They then started grouping the previously reported and newly discovered galaxies by morphological characteristics, which later formed a significant classification scheme.

Modern Advancements

Astronomy in this contemporary age has evolved massively, in parallel with advances in computation. Sophisticated computational techniques such as machine learning models are far more effective now because of substantially increased computer performance and the enormous amount of data we have. Centuries ago, classification tasks like these were done by hand by large groups of people, who evaluated the results through cross-validation and collective agreement.

Why I chose to work on this Data

I recently started working on theoretical deep learning concepts and wanted my first practical application of those concepts to be a task driven by passion, yet still relevant to my core goal of learning neural networks.

Data Collection

The Galaxy Zoo project hosts diverse sky-survey data catalogues online, which astronomers around the world can access, study and analyse. For this classification task, I picked the data whose features define the classes well. For a description of the features, see [METADATA].

The Classification Schema

Hubble's tuning fork is the most famous classification scheme. Edwin Hubble divided galaxies into three main types; simplified, they are the Elliptical, Spiral and Merger classes.

Elliptical galaxies have a smooth, spheroidal appearance with little internal structure. They are dominated by a spheroidal bulge and have no prominent thin disk. Spiral galaxies all show spiral arms. The third class, merger galaxies, look irregular and are often described as chaotic in appearance; they are most likely the remnants of a collision between two galaxies, as described in The Realm of the Nebulae.

Diving Deep into the Neural Networks

The first step is to read the astronomical data from the flat file.
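The original loading code is not shown, so the following is only a minimal sketch of how a flat file might be read with pandas. The three rows are made up, the `P_EL` column name is an assumption, and the real catalogue would be read from its own path rather than from an in-memory string.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real flat file: three made-up rows using the
# column names mentioned in this article plus one assumed feature (P_EL).
sample_csv = io.StringIO(
    "OBJID,RA,DEC,P_EL,SPIRAL,ELLIPTICAL,UNCERTAIN\n"
    "1,00:00:00.41,-10:22:25.7,0.61,0,1,0\n"
    "2,00:00:00.74,-09:13:20.2,0.05,1,0,0\n"
    "3,00:00:01.03,-10:56:48.0,0.42,0,0,1\n"
)

# With the real catalogue this would be pd.read_csv("<path to the CSV file>").
data = pd.read_csv(sample_csv)

print(data.shape)   # (rows, columns)
print(data.head())  # first few rows, to eyeball the columns
```

Printing `data.shape` right after loading is a quick way to confirm that the number of rows and columns matches what the catalogue is expected to contain.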

The dimensionality of the entire dataset, as (rows, columns), is (667944, 13).

Data Preprocessing

The first column has no effect on the final performance of this classification model because it is not correlated with the classes: OBJID is a unique ID assigned to each object of interest in the dataset. RA (Right Ascension) and Dec (Declination), on the other hand, record the absolute positions of these objects and are also unique to every data point. Dropping these three columns is therefore the better choice when aiming for good accuracy. We can do that using:

data = data.drop(['OBJID', 'RA', 'DEC'], axis=1)

Next, look out for null values and NaNs.
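The check itself is not shown in the original; one common way to do it in pandas is sketched below. The toy frame and the `P_EL` column name are assumptions used only to make the snippet self-contained.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value; P_EL is an assumed column name.
data = pd.DataFrame({
    "P_EL": [0.61, np.nan, 0.42],
    "SPIRAL": [0, 1, 0],
})

# Number of missing entries (None/NaN) in each column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# One possible remedy: drop every row that has at least one missing value.
clean = data.dropna()
print(len(clean))
```

If every count is zero, as one would hope for a curated catalogue, no further cleaning is needed at this step.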

Since this is a classification task, we need to check for class imbalance. In any classification dataset, even a binary one, class imbalance can have a major effect on the training phase and ultimately on accuracy. The following code snippet plots the value counts for the three class columns.

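import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

plt.figure(figsize=(10,7))
plt.title('Count plot for Galaxy types ')
# Melt the three one-hot class columns into long form so each class gets its own bar group
countplt = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']]
sns.countplot(x="variable",hue='value', data=pd.melt(countplt))
plt.xlabel('Classes')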

Reading off the plot and considering only the 1's, there is some imbalance between the classes, but the difference is not large enough to affect the model's performance, given that the labels are in a one-hot encoded format. Checking for imbalance is part of my typical workflow for classification tasks, and I believe it is good practice. (Note: we only consider the 1's in the plot above, since a 1 marks the class that a row belongs to, while the other class columns for the same row are 0.)
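The same check can be done numerically. This is a small sketch with made-up labels, not the article's data: because each row of a one-hot encoding contains exactly one 1, summing each class column gives the number of objects in that class.

```python
import pandas as pd

class_cols = ["SPIRAL", "ELLIPTICAL", "UNCERTAIN"]

# Made-up one-hot labels, only to show the mechanics.
labels = pd.DataFrame(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]],
    columns=class_cols,
)

# Each row has exactly one 1, so the column sums are the class counts.
class_counts = labels.sum()
class_share = class_counts / len(labels)

print(class_counts)
print(class_share.round(2))
```

Looking at the shares rather than the raw counts makes it easier to judge whether one class dominates the others.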

Normalisation and train_test_split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = data.drop(['SPIRAL','ELLIPTICAL','UNCERTAIN'],axis=1).values
y = data[['SPIRAL','ELLIPTICAL','UNCERTAIN']].values

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=101)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

For a machine learning model to learn from data, the conventional approach is to split the original data into a training set and a testing set; here the split is 80% training and 20% testing. As a rough rule of thumb, the whole dataset should have at least 1,000 data points to reduce the risk of overfitting and to give the model enough examples to learn from. Note that the scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.
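To make the normalisation step concrete, here is a minimal NumPy sketch of what min-max scaling to the range [0, 1] does, using made-up numbers. The minimum and maximum are taken from the training set only and then reused for the test set.

```python
import numpy as np

# Made-up feature matrices: 3 training rows, 1 test row, 2 features.
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
X_test = np.array([[2.0, 40.0]])

# Per-column statistics learned from the training set only.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# x_scaled = (x - min) / (max - min), applied column by column.
X_train_scaled = (X_train - col_min) / (col_max - col_min)
X_test_scaled = (X_test - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)  # unseen data can land outside [0, 1]
```

The second test feature is larger than anything seen in training, so its scaled value exceeds 1. That is expected behaviour and is the price of keeping test information out of the fitting step.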

Instantiating Neural Network and setting Hyper-parameters

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Sequential, in Keras, lets us build the multilayered perceptron model layer by layer. Each layer is added with a number of units as the argument to Dense, and that number is how many densely connected neurons the layer contains.

model = Sequential()
# Input Layer
model.add(Dense(10, activation='relu'))
# Hidden Layer
model.add(Dense(5, activation='relu'))
# Output Layer
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Explanation:

The arrangement of my neural network is 10–5–3: 10 input neurons since we have 10 feature columns, and 3 output neurons because we have 3 output classes. Everything in between is chosen arbitrarily and is called the hidden layers. (Keras infers the number of input features from the training data, so the first Dense(10) call creates a layer of 10 neurons that receives those features.)

Each neuron (perceptron) computes a weighted sum of its inputs plus a bias and passes the result through an activation function. In my case I used the most common activation, ReLU (Rectified Linear Unit), and for the last three output neurons the activation is softmax, which returns a probability distribution over the three classes.
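As an illustration of what a single neuron and the two activations compute, here is a small NumPy sketch. The inputs, weights and bias are made-up numbers, and Keras performs the equivalent operations internally on whole layers at once.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: negative values become 0, positives pass through."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

def neuron(x, w, b):
    """One neuron: weighted sum of the inputs plus a bias, then ReLU."""
    return relu(np.dot(w, x) + b)

# Made-up inputs, weights and bias for a single hidden neuron.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, 0.3])
b = 0.1
hidden_output = neuron(x, w, b)

# Made-up scores for the three output neurons.
probs = softmax(np.array([2.0, 1.0, 0.1]))

print(hidden_output)
print(probs, probs.sum())
```

The class with the largest softmax probability is the one the network predicts for that galaxy.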

The Adam optimizer is used to perform gradient descent efficiently, i.e. to search for the weights that minimise the loss, and the conventional loss function for a multi-class classification task is categorical_crossentropy. The metric tracked during training is accuracy, a common choice for classification tasks.
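To show what the loss measures, here is a minimal NumPy sketch of categorical cross-entropy on made-up predictions. It is a simplified illustration of the formula, not the Keras implementation, and the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob)) across the classes."""
    y_prob = np.clip(y_prob, eps, 1.0)  # guard against log(0)
    per_sample = -np.sum(y_true * np.log(y_prob), axis=1)
    return per_sample.mean()

# Made-up one-hot targets and predicted probabilities for two galaxies.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = categorical_crossentropy(y_true, y_prob)
print(loss)
```

Because the targets are one-hot, only the probability assigned to the correct class contributes to each sample's loss, so confident correct predictions drive the loss towards zero.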

The following is the theoretical model description for the Sequential setup above.
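As a rough cross-check of such a description, the number of trainable parameters in a stack of Dense layers can be worked out by hand. The sketch below assumes 10 input features, as stated in the text; `dense_param_count` is a hypothetical helper written for this illustration, not a Keras function.

```python
def dense_param_count(n_inputs, layer_units):
    """Return the parameter count of each Dense layer in a stack.

    Every neuron has one weight per incoming value plus one bias,
    so a layer has fan_in * units + units parameters.
    """
    per_layer = []
    fan_in = n_inputs
    for units in layer_units:
        per_layer.append(fan_in * units + units)
        fan_in = units  # this layer's outputs feed the next layer
    return per_layer

n_features = 10  # assumption: the 10 feature columns stated in the text
per_layer = dense_param_count(n_features, [10, 5, 3])

print(per_layer)       # parameters in each of the three Dense layers
print(sum(per_layer))  # total trainable parameters
```

With a different number of input features only the first layer's count changes, since the later layers depend solely on the sizes of the layers before them.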

Deploying the Model

import time

start = time.perf_counter()
model.fit(x=X_train, y=y_train, epochs=20)
print('\nTIME ELAPSED {} Seconds'.format(time.perf_counter() - start))

Note: the elapsed time will differ between computer configurations, and measuring it is completely optional. I only did it because I was curious how long all the epochs would take.

Plotting Accuracy at each Epoch

From the accuracy plot we can infer that after a certain point, approximately from the 6th epoch, the accuracy stayed constant for the remaining epochs. (A single epoch means one full pass through the entire training set.)
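The plotting code is not shown in the original. One way such a plot might be produced is sketched below, assuming the per-epoch values come from the `history` dictionary of the History object that Keras returns from `model.fit`. The `plot_accuracy` helper is hypothetical and the numbers are placeholders, not the article's results.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

def plot_accuracy(history_dict):
    """Plot training accuracy against the epoch number and return the axes."""
    acc = history_dict["accuracy"]
    epochs = range(1, len(acc) + 1)
    fig, ax = plt.subplots(figsize=(10, 7))
    ax.plot(epochs, acc, marker="o")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Training accuracy")
    ax.set_title("Accuracy at each epoch")
    return ax

# Placeholder values standing in for history.history; not real results.
placeholder_history = {"accuracy": [0.5, 0.6, 0.7]}
ax = plot_accuracy(placeholder_history)
```

With the real model, one would keep the return value of the training call, for example `history = model.fit(...)`, and then call `plot_accuracy(history.history)`.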

The achieved model accuracy is 0.90, i.e. 90%.

Classification Report
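The report itself is not reproduced here. As a hedged sketch of how one could be generated, the snippet below converts one-hot targets and softmax outputs to class indices with argmax and passes them to scikit-learn's `classification_report`. The arrays are made up; with the real model, the probabilities would come from `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import classification_report

class_names = ["SPIRAL", "ELLIPTICAL", "UNCERTAIN"]

# Made-up one-hot targets and softmax outputs for four galaxies.
y_true_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])

# The index of the largest entry in each row is the class label.
y_true = y_true_onehot.argmax(axis=1)
y_pred = y_prob.argmax(axis=1)

report = classification_report(
    y_true, y_pred, labels=[0, 1, 2], target_names=class_names
)
print(report)
```

The report lists precision, recall and F1-score per class, which gives more detail than a single overall accuracy figure.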

Improvement?

Well-structured feature data is paramount for any machine learning or deep learning model: the more defining features there are, the better the anticipated performance. Feature engineering is the most essential stage in any data analysis task, but one cannot perform feature extraction efficiently without domain knowledge of the task and data at hand; without it, there is no way of knowing how the features relate to one another. This is core astronomical data with features I cannot even begin to interpret. If I had an astronomy background to study, organise and add more features, I am confident this model could perform much better than it did.

We've always defined ourselves by the ability to overcome the impossible.

For the source code, visit my GitHub, where I put this task together along with the other computational astronomy concepts I have worked on. I call it The Anecdote of Computational-Astronomy ~ GitHub