How to Get Started with Any Kind of Data: Part 2

Original article: https://towardsdatascience.com/how-to-get-started-with-any-kind-of-data-part-2-c4ccdbcc8b13

START WITH DATA

Data Science is an Art. You need practice to perfect it. I was first associated with it in 2017. You can read my story and the first part of this article here. In this article I continue with the next two phases of your data project: Analysis by building models, and Presentation using graphics.

Analysis:

Choosing the right data fields lets the right results show up.

Optimizing variable selection

Method 1:

Selecting the optimum variables for your analysis can have a huge impact on your results. When using a regression model, you can gauge the significance of each variable from the asterisks next to its p-value in the model's statistical results.

In the snapshot of linear regression results below, look at the last column on the right:

Each predictor variable is significant because its p-value is at most 0.05. A low p-value suggests that the slope is not zero, which in turn suggests that changes in the predictor variable are associated with changes in the response variable. The multiple R-squared and the adjusted R-squared values should be close to 1. Together, low p-values and a high R-squared value suggest that the model fits the data well and is likely to be predictive.

Image by Priyanka Mane, from Alteryx software
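
The quantities read off a regression summary can also be computed by hand. The following is a minimal sketch for a single predictor, using made-up numbers rather than the credit data from the snapshot; the function name fit_simple_regression and the sample values are assumptions for illustration only.

```python
import math

def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)

    r_squared = 1 - ss_res / ss_tot
    p = 1  # number of predictors
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

    # t-statistic of the slope: estimate divided by its standard error
    std_err = math.sqrt(ss_res / (n - 2) / sxx)
    t_stat = slope / std_err
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "t_stat": t_stat,
    }

# Illustrative data only (not from the article)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

result = fit_simple_regression(x, y)
for name, value in result.items():
    print(f"{name}: {value:.4f}")

# With n - 2 = 6 degrees of freedom, |t| above about 2.447
# corresponds to a two-sided p-value below 0.05.
```

A statistics package reports the p-value directly; the t-statistic here is the number that p-value is derived from.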

Method 2:

Use a crosstab! A contingency table, or cross-tabulation, shows how often each combination of groups appears in the data. It is easy to read, and it lets you compare several fields side by side so you can clearly decide on the most effective ones.

If you are familiar with Pandas in Python, the pandas.crosstab function builds these tables for you.
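
To make the idea concrete, here is a small sketch of a contingency table built with only the Python standard library. The records and the field names (region, purchased) are invented for illustration; with a DataFrame, pandas.crosstab would produce the same counts.

```python
from collections import Counter

# Hypothetical records: (region, purchased) pairs invented for illustration
records = [
    ("North", "yes"), ("North", "no"), ("North", "yes"),
    ("South", "no"), ("South", "no"), ("South", "yes"),
    ("East", "yes"), ("East", "yes"),
]

def crosstab(pairs):
    """Count how often each (row, column) combination occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    return table, rows, cols

table, rows, cols = crosstab(records)

# Print the table as a small grid
print(f"{'':<8}" + "".join(f"{c:>6}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{table[r][c]:>6}" for c in cols))
```

Reading across a row shows how one group splits over the other field, which is what makes the fields easy to compare at a glance.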

Analyze in smaller chunks

If the dataset is very large, divide it into smaller chunks. Use these chunks for training and validation. Compare the overall accuracy of your model across these chunks to see how it performs on different subsets.
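
One way to sketch this is to hold out each chunk in turn, train on the rest, and compare the accuracies. The example below uses synthetic data and a deliberately simple threshold "model"; the helper names (split_into_chunks, train_threshold, accuracy) are assumptions for illustration, not part of any library.

```python
import random

def split_into_chunks(rows, k):
    """Split rows into k contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

def train_threshold(rows):
    """Toy model: a cut-off halfway between the two class means."""
    xs0 = [x for x, label in rows if label == 0]
    xs1 = [x for x, label in rows if label == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, rows):
    """Share of rows whose label matches the threshold prediction."""
    correct = sum(int(x > threshold) == label for x, label in rows)
    return correct / len(rows)

# Synthetic data: the label is 1 when the feature exceeds 0.5
rng = random.Random(0)
rows = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(200))]
chunks = split_into_chunks(rows, 5)

accuracies = []
for i, validation in enumerate(chunks):
    # Train on every chunk except the one held out for validation
    training = [row for j, chunk in enumerate(chunks) if j != i for row in chunk]
    threshold = train_threshold(training)
    accuracies.append(accuracy(threshold, validation))
    print(f"chunk {i}: validation accuracy = {accuracies[-1]:.3f}")

print(f"spread across chunks: {max(accuracies) - min(accuracies):.3f}")
```

A large spread between chunks would suggest the model behaves differently on different parts of the data.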

Building and running ‘Models’

Once data cleaning is done, store the cleaned data in a single separate file to use for analysis. This keeps your model lean, and it runs faster because the preprocessing has already been done.

Divide the cleaned data into training and validation samples. Apply the model to both samples. This step helps you understand how accurate your predictions will be on new data, and it helps you spot overfitting. Here, you must compare the training dataset results with the validation dataset results. Overall accuracy is a good metric for comparison. Select the model with the highest accuracy for your final results.
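
The comparison described above might look like the following sketch. It uses synthetic data with 10% label noise and two toy models, a simple threshold rule and a nearest-neighbour rule that memorizes the training set; all names and numbers are assumptions for illustration. A large gap between training and validation accuracy is the sign of overfitting.

```python
import random

rng = random.Random(42)

def make_row():
    """One synthetic (feature, label) pair with 10% label noise."""
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    return (x, label)

rows = [make_row() for _ in range(400)]
split = int(0.75 * len(rows))
train, valid = rows[:split], rows[split:]

def fit_threshold(data):
    """Simple model: predict 1 when x exceeds the midpoint of the class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > cut)

def fit_nearest_neighbour(data):
    """Flexible model: copy the label of the closest training point."""
    return lambda x: min(data, key=lambda row: abs(row[0] - x))[1]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

candidates = [("threshold", fit_threshold),
              ("nearest_neighbour", fit_nearest_neighbour)]

results = {}
for name, fit in candidates:
    model = fit(train)
    results[name] = (accuracy(model, train), accuracy(model, valid))
    train_acc, valid_acc = results[name]
    print(f"{name}: train={train_acc:.3f} valid={valid_acc:.3f} "
          f"gap={train_acc - valid_acc:.3f}")

# Pick the model by its validation accuracy, not its training accuracy
best = max(results, key=lambda name: results[name][1])
print("selected:", best)
```

The nearest-neighbour model scores perfectly on the data it memorized, so only the validation column says anything about how it will do on new data.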

In the model comparison results for the analysis above, look at the ‘Accuracy’ column. FM_Credit is the name I gave to the Forest Model. It has the highest accuracy of the models compared, so I selected it for the final analysis. The AUC (Area Under the Curve) should also be high.
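
For readers who want to see what the AUC measures, here is a minimal sketch that computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels and scores are invented for illustration and are not the FM_Credit results.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and model scores
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

model_auc = auc(labels, scores)
print(f"AUC = {model_auc:.4f}")  # 15 of 16 positive/negative pairs are ranked correctly
```

An AUC of 0.5 means the scores rank cases no better than chance, while 1.0 means every positive outranks every negative.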

Presentation:

An often-quoted claim is that the human brain processes images 60,000 times faster than text, and that 90 percent of the information transmitted to the brain is visual.

The better your visualizations, the more influential your work!

Data Visualization

Present this data in meaningful ways: graphs, visualizations, charts, tables, etc. Data analysts may report their findings to project managers, department heads, and senior-level business executives to help them make decisions and spot patterns and trends. Tableau, mentioned in Part 1 of this article, is in my view by far the best and most user-friendly software for persuasive visualizations.

Matplotlib is very famous but not necessarily the easiest to use. Its advantage is that it can be used from the comfort of Python. Some lesser-known but highly influential options for R and Python are Plotly (plot.ly), which can embed interactive JavaScript charts in a web page and also works offline, Bokeh, and ggplot.

As a Beginner

A major portion of a data analyst’s work hours is spent understanding and interpreting business cases for management. Therefore, a rational step is to read as many data science cases as you can get your hands on. You will come to understand different scenarios and trade-offs. At the same time, you will gain expertise in thinking about a problem in a structured way.
If you don’t know where to get data from in order to get started, simply look for data science projects and competitions on Kaggle.
Upload these projects to GitHub and get insights from other users.
Stack Overflow is also a good place to learn from the problems and mistakes of others.

Do let me know if you found this recipe useful, or if you would like a detailed step-by-step explanation of any of the projects I used in this article series.

Priyanka Mane | Learner | Digital Content Creator

Instagram: Wanderess_Priyanka | LinkedIn: Priyanka Mane
