How I Got Started in Data Science

Contents

  1. Why Learn Data Science?
  2. What Exactly is Machine Learning (ML) in Data Science?
  3. An ML ‘Syllabus’
  4. Online Courses
  5. Step by Step ML Process
  6. Example ML Algorithms
  7. ML Applications
  8. Taking this Further

1. Why Learn Data Science?

Data Science is a skill set formed from multiple disciplines including maths, statistics, programming, and business knowledge, which can be applied to a broad range of problems. It is a field for organising large data sets, analysing data and coding solutions to address business challenges.

With this knowledge, you could:

  • Build a music recommendation system like Spotify’s ‘Discover Weekly’ playlist

  • Predict future stock prices

  • Build a facial recognition system

Not experienced in these fields? No problem! Now it’s easier than ever to learn Data Science, with a wealth of online resources, Medium articles and YouTube tutorials!

Just a note about this article: it is quite a detailed overview — you don’t need to read the whole thing at once! Treat this as a guide to keep checking back to as you learn, and feel free to skip bits you don’t need or are already familiar with!

2. What Exactly is Machine Learning (ML) in Data Science?

Machine learning is where a computer takes a series of inputs, learns patterns, and produces outputs. Returning to our examples:

  • Spotify: input features describing the songs you like (e.g. time signature, key, lyrics), learn patterns (often by looking at patterns derived from listeners similar to you), output your ‘Discover Weekly’ playlist

  • Predict future stock prices: input previous stock market data, detect underlying trends, output prediction

  • Build a facial recognition system: input an image of someone’s face, compare it to a stored database of faces, identify the person

In order to do these things, a computer must first learn from train data, and then be checked using test data. You usually provide input (x) and output (y) data in both cases.

When starting out, there are 2 broad categories of ML, supervised and unsupervised:

(Figure: Types of ML — an overview)

Supervised — labelled train data. The algorithm learns the rules connecting an input to a given output, and uses those rules to make predictions, e.g. using data on which past job applicants got hired to decide whether to hire a new applicant; each record has a yes/no label indicating whether the applicant was hired.

Generally 5 variables to deal with (a minimal code sketch follows this list):

- x_train: features based on someone’s CV e.g. number of GCSEs, number of job requirements matched

- y_train: binary (yes/no) labels for whether or not someone got hired

- x_test: same as x_train but for new data

- y_pred: the predicted output when the model is trained on x_train and y_train, and then fed the input data x_test

- y_test: same as y_train but for the new data. y_test is compared to y_pred to evaluate how well the model performs

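As a minimal sketch of how these five variables fit together in code (the tiny hiring dataset and the choice of logistic regression here are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical CV features: [number of GCSEs, number of job requirements matched]
x_train = np.array([[5, 2], [9, 6], [7, 4], [10, 8], [6, 1], [8, 7]])
y_train = np.array([0, 1, 0, 1, 0, 1])        # 1 = hired, 0 = not hired

x_test = np.array([[9, 5], [5, 1]])           # new applicants
y_test = np.array([1, 0])                     # what actually happened

model = LogisticRegression()
model.fit(x_train, y_train)                   # learn the hiring rules
y_pred = model.predict(x_test)                # predicted hire / no-hire

print(accuracy_score(y_test, y_pred))         # compare y_pred with y_test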

(Figure: Supervised Learning — Regression vs Classification)

Unsupervised — unlabelled train data. The algorithm finds structures and patterns in the inputs on its own, e.g. clustering customers into types based on demographic info and their spending habits. There are no labels because the customer segments are not yet known in this example.

Generally 2 variables (see the sketch after this list):

- X: features about a customer

- y: the model’s output when trained on the X data

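A minimal unsupervised sketch along the same lines (the data and the choice of k-means with 2 clusters are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, monthly spend]
X = np.array([[21, 40], [23, 55], [45, 300], [50, 320], [24, 60], [48, 280]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
y = model.fit_predict(X)     # cluster label assigned to each customer
print(y)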

(Figure: Unsupervised Learning — clustering in 3D, with 3 features used to identify types of flower)

There are also a number of other types of ML. You often start with a simple artificial neural network (ANN), which is loosely modelled on the human brain, with connections between neurons that strengthen the more often they fire. Then there’s reinforcement learning, natural language processing (NLP), convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), and many more!

(Figure: A simple neural network)

3. An ML ‘Syllabus’

This is the most comprehensive video I’ve found so far on how to get started. These steps are outlined in the video:

1. Install Python/R and the relevant libraries

It’s up to you which you choose — R was developed for statisticians, whilst Python is more general-purpose and the most popular option.

Let’s go with Python. Download Anaconda, and it’s common to use Jupyter Notebook. You can find lots of useful Python libraries on PyPI. You will want to install the following (a quick import check follows this list):

  • numpy — indexing, basic operations on arrays, reshaping, broadcasting arrays

  • pandas — dataframes, series, feature engineering. For more advanced pandas tips, see this video

  • Visualisation libraries: matplotlib, seaborn

  • Also helpful: scikit-learn — for ML models. Later on you might want tensorflow, keras, pytorch, but leave these for the moment

  • Just for fun: geopandas — for plotting maps and spatial coordinates

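Once these are installed, a quick sanity check that the core libraries import correctly (a minimal sketch; your version numbers will differ):

# Verify the core libraries are installed and print their versions
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import sklearn

for lib in (np, pd, matplotlib, sns, sklearn):
    print(lib.__name__, lib.__version__)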

2. Statistics

  • Mean, median, mode

  • Normal and standard distributions

  • Correlations

No need to learn formulae by heart. You mostly want a general understanding of the data you’ve got and how its features are related.

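In practice, pandas gives you most of these summary statistics directly. A minimal sketch, using a small hypothetical dataset:

import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "years_experience": [1, 3, 5, 7, 9],
    "salary": [25000, 32000, 41000, 48000, 60000],
})

print(df.describe())      # count, mean, std, min/max and quartiles per column
print(df.median())        # medians
print(df.mode().iloc[0])  # modes (first row if there are ties)
print(df.corr())          # pairwise correlations between features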

3. Exploratory Data Analysis (EDA)

  • Understanding the features in a data set

  • Data scaling — MinMax, Standard, LogNormal (a rough sketch follows this list)

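As a rough sketch of what those scaling options look like in code (the scaler names follow scikit-learn; the LogNormal case is approximated here by a log transform followed by standard scaling):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [10.0], [100.0], [1000.0]])   # heavily skewed toy feature

print(MinMaxScaler().fit_transform(X))             # squashes values into [0, 1]
print(StandardScaler().fit_transform(X))           # zero mean, unit variance
print(StandardScaler().fit_transform(np.log(X)))   # log first to reduce skew, then standardise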

4. Understanding ML algorithms, focussing on

  • Intuition — watch videos with good diagrams of what’s going on

  • Implementation with python libraries — you don’t need to code all the inner workings of an ML model yourself, someone has already done it for you!

5. Deployment

  • Cloud computing deployment: AWS (Amazon Web Services), GCP (Google Cloud Platform), Microsoft Azure. You may need to pay for these services, but there are free online tutorials to get started

  • Flask & Django — Python frameworks to turn your model into an interactive web interface. Flask is easier to get started with than Django (a minimal Flask sketch follows this list)

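A minimal Flask sketch of the idea (the model file name, feature format and route are all hypothetical, and this assumes a scikit-learn model was saved with joblib beforehand):

import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")    # hypothetical pre-trained model saved earlier

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [5, 2]} and return the model's prediction
    features = np.array(request.get_json()["features"]).reshape(1, -1)
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(debug=True)

You could then POST feature values to /predict from a web page or another script and get predictions back as JSON.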

6. Databases

  • SQL for structured data (in table format). Click here for a SQL tutorial website, and see the sketch after this list

  • MongoDB for unstructured data (e.g. in json format)

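A quick sketch of querying structured data from Python, using the built-in sqlite3 module and pandas (the table and column names are made up for illustration):

import sqlite3
import pandas as pd

# Create a throwaway in-memory database with one table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Alice", 40000), ("Bob", 35000), ("Cara", 52000)])

# Query it straight into a DataFrame, ready for analysis
df = pd.read_sql("SELECT name, salary FROM employees WHERE salary > 38000", conn)
print(df)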

7. Other visualisation software

  • Tableau — free on Tableau Public

  • PowerBI — paid

  • Qlik Sense — free trial

4. Online Courses

Next I opted for these two Udemy courses on Machine Learning and Deep Learning. You can find the syllabus info for these courses here and here. Note there is some overlap between the courses, and only buy them when they’re on offer (about £15 each; try an incognito tab or check back regularly, as Udemy often has sales on). They have a great overview of different ML techniques.

5. Step by Step Machine Learning (ML) Process

(Figure: Step by step ML process)

Start with a dataset — this can be provided for you on Kaggle, or you can use other online datasets, APIs, or collect your own data. You can start with an Excel file or CSV. For unstructured data, use JSON. For large files, HDF files are helpful. If you want, you can query your data from a database and use that directly.

  • You will need a sufficient amount of data to be able to train a model — a good rule of thumb is roughly 10 times as many data points as features (inputs) in your model. For example, if you have 4 features in a salary prediction algorithm (job sector, role title, years of experience, performance review score), you will want data on at least 40 people

Data cleaning — Check how much data you actually have and its quality

  • For each feature, what % of the data has a NaN (missing or unknown) value? You can often handle NaN values by replacing the NaN with the mean

# Replace NaN values in each column with the column mean
df = df.fillna(df.mean())

  • Check for and potentially remove anomalies — anomalies may skew the data when calculating the mean or training a general model, but may be useful if the purpose of your model is to spot these anomalies (e.g. in a fraud detection system aiming to spot unusual behaviour)

  • Ensure that the data you have is of the correct type (sometimes numbers get stored as strings and need to be converted back to int/float); see the sketch after this list

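A minimal cleaning sketch covering those checks, assuming a pandas DataFrame (the columns and values are hypothetical):

import pandas as pd

# Hypothetical raw data: a missing value, and numbers stored as strings
df = pd.DataFrame({
    "years_experience": [1.0, 3.0, None, 7.0],
    "salary": ["25000", "32000", "41000", "48000"],
})

print(df.isna().mean() * 100)                # % of NaN values per feature
df["salary"] = pd.to_numeric(df["salary"])   # convert strings back to numbers
df = df.fillna(df.mean())                    # replace NaNs with the column mean

# Very rough anomaly check: flag rows more than 3 standard deviations from the mean
z = (df - df.mean()) / df.std()
print(df[(z.abs() > 3).any(axis=1)])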

Feature engineering — decide which features will go into your model (a short sketch follows this section)

  • You can create new features by manipulating the data e.g. for a car, take the ratio of fuel usage to distance travelled, or for a time series problem, calculate rolling averages or find the mean/standard deviation/frequency as features

  • Dimensionality reduction — cutting down on the number of features you’ve got. For this you can use Principal Component Analysis (PCA): this method lets you see which variables are most important in explaining the variance of the dataset

  • If you have any categorical variables (e.g. favourite colour), you will want to OneHotEncode them, meaning split them into a separate column for each category (e.g. red, yellow, blue), each having a value of 0 or 1

  • You can produce correlation and covariance plots. If two features are correlated, you may not want to include them both in the model as the second feature will be redundant (e.g. how many products you sell, how much profit you make from that product type). This could result in over-fitting, where a model works very well on train data but not on test data

(Figure: Correlation matrix — red indicates positive correlation, blue indicates negative correlation)
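
A rough sketch of those feature-engineering steps in pandas and scikit-learn (the columns and data are hypothetical):

import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "fuel_used": [5.0, 9.0, 4.0, 12.0, 6.0],
    "distance": [60.0, 100.0, 55.0, 150.0, 70.0],
    "favourite_colour": ["red", "blue", "red", "yellow", "blue"],
})

# Create a new feature as a ratio of two existing ones
df["fuel_per_km"] = df["fuel_used"] / df["distance"]

# One-hot encode a categorical variable into separate 0/1 columns
df = pd.concat([df, pd.get_dummies(df["favourite_colour"], prefix="colour")], axis=1)

numeric = df[["fuel_used", "distance", "fuel_per_km"]]
print(numeric.corr())                 # correlation matrix, as in the figure above

# PCA: how much of the variance each principal component explains
pca = PCA(n_components=2).fit(numeric)
print(pca.explained_variance_ratio_)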

Next is the train-test split. Split up the data (often randomly, but sometimes in time-series chronological order), often with an 80:20 train:test split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  • You’ll need to feature scale the data, which often means normalising it first (if you have a skewed distribution, you can use log(x) to normalise) and then scaling. There are many types of scaling, but a common one is min-max scaling between 0 and 1. This is so that the raw magnitude of a feature doesn’t affect the model too much (e.g. when looking at car safety, a speed of 60mph shouldn’t be weighted more highly than an acceleration of 2m/s^2 just because speed values generally tend to be higher)

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the train set only, then apply the same transform to the test set
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

(Figure: Comparing different feature scaling methods)
  • With this normalisation fitted on the train set, you’ll want to apply that same (already fitted) transformation to the test set. Don’t scale all the data together or there will be data leakage effects, where knowledge from the test set creeps into the train set so the ML model can cheat

Apply the learning algorithm (examples outlined below) to the train set, using:

model.fit(X_train, y_train)

Then apply the model to the test set:

y_pred = model.predict(X_test)

  • You can then make a plot of the output, or compare y_pred to y_test — there are a number of metrics for model scoring, which use these terms:

True Positive (TP) — e.g. is ill and tests positive (correct)

False Negative (FN) — e.g. is ill but tests negative (incorrect) — type 2 error (usually the worse kind)

False Positive (FP) — e.g. is not ill but tests positive (incorrect) — type 1 error (usually less serious)

True Negative (TN) — e.g. is not ill and tests negative (correct)

  • There are lots of metrics you can use, as shown in the diagrams: accuracy is very important, and then either precision or recall, depending on the context of your model

(Figure: ML model evaluation metrics — precision & recall)
  • It can also be helpful to make a confusion matrix, formed from the 4 white squares in the middle of the diagram below, with each square containing a number. In general you want high numbers of data points in TP and TN, and low numbers in FP and FN (a short sklearn sketch follows the figure)

(Figure: Confusion matrix showing the numbers of TP, TN, FP, FN values)
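
A short sklearn sketch of these metrics, assuming y_test and y_pred are the binary label arrays from earlier:

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

print(confusion_matrix(y_test, y_pred))   # [[TN, FP], [FN, TP]] for binary labels
print(accuracy_score(y_test, y_pred))     # (TP + TN) / all predictions
print(precision_score(y_test, y_pred))    # TP / (TP + FP)
print(recall_score(y_test, y_pred))       # TP / (TP + FN)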

Evaluate the model — try out a few different models (outlined below) and compare the confusion matrix and metrics, to see which performs best. It depends on the problem you are trying to solve.

  • Also try some k-fold cross validation, where you shuffle which data points are put in the train and test sets, retrain the model, and see if it performs in a similar way and didn’t just get lucky the first time (sketched in code after the figure below)

(Figure: K-fold cross validation)
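
A minimal cross-validation sketch, assuming model, X and y as defined earlier:

from sklearn.model_selection import cross_val_score

# Train and score the model on 5 different train/test splits of the data
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())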

6. Example ML Algorithms

Within the ML categories, we have a number of different algorithms and libraries:

(Figure: Types of ML algorithms)

To discuss some of these further:

Regression — find the straight line or curve to fit the data (a short sketch follows this subsection)

  • Linear — find the coefficients of the relationship: y = b0 + b1x1

  • Polynomial — find the coefficients of the relationship: y = b0 + b1x1 + b2x1^2 + … + bnx1^n

  • Decision Tree — find the hidden rules that determine an outcome

(Figure: Decision tree for deciding whether or not to grant a loan)

Random Forest — an ensemble learning method combining results from many decision trees to produce a single output

(Figure: Random forest made up of decision trees)
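
As a rough illustration of fitting these regression models with scikit-learn (the toy data here is hypothetical):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: years of experience vs salary (in £1000s)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([25, 30, 34, 41, 47, 55])

linear = LinearRegression().fit(X, y)                    # y = b0 + b1*x1
print(linear.intercept_, linear.coef_)

X_poly = PolynomialFeatures(degree=2).fit_transform(X)   # adds an x1^2 term
poly = LinearRegression().fit(X_poly, y)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[7]]))                             # prediction for a new input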

Classification

  • KNN — K-Nearest Neighbours — k is a parameter set by you. For each data point in a distribution, pick the k nearest points, and the data point in question gets assigned the same label as the majority of its k-nearest-neighbours

  • Logistic Regression — like linear regression but a sigmoid curve is used to produce an output between 0 and 1, which can be converted to a binary output by rounding up or down

(Figure: Linear regression vs logistic regression)
  • Naïve-Bayes — a probabilistic classifier based on applying Bayes’ theorem with strong independence assumptions between features

  • SVM — a supervised method for classification, based on finding a line or hyperplane that nicely divides clusters of data points (a comparison sketch follows the figure below)

(Figure: Support vector machine)
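
A small sketch comparing these classifiers on the same synthetic data (the dataset and parameters are arbitrary, just to show the shared scikit-learn interface):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification problem for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # accuracy on the test set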

And here’s a higher-level flowchart for deciding which to use:

(Figure: Deciding which ML method to use)

7. ML Applications

There are a number of exciting applications of these types of ML:

(Figure: Applications of different types of ML)

8. Taking this Further

The easiest next step is probably Kaggle, which has example data sets, model solutions, and competitions to take part in!

There are a number of popular datasets online:

For further reading, these websites are great for tutorials or project ideas:

And you can go in search of your own data sets for more personalised projects:

Hope you found this helpful, and good luck in your Data Science journey!

Translated from: https://medium.com/swlh/how-i-got-started-in-data-science-9d726ba72354
