27 Hands-On Examples of Python Data Science Libraries (with Code)

Article link: https://blog.csdn.net/2301_76268112/article/details/138884976


    # Random color 2:
    def rndColor2():
        return (random.randint(32, 127), random.randint(32, 127), random.randint(32, 127))

    # Canvas size: 360 x 360 (6 characters, 60 px each):
    width = 60 * 6
    height = 60 * 6
    image = Image.new('RGB', (width, height), (255, 255, 255))
    # Create the Font object:
    font = ImageFont.truetype('/usr/share/fonts/wps-office/simhei.ttf', 60)
    # Create the Draw object:
    draw = ImageDraw.Draw(image)
    # Fill every pixel:
    for x in range(width):
        for y in range(height):
            draw.point((x, y), fill=rndColor())
    # Draw the text:
    for t in range(6):
        draw.text((60 * t + 10, 150), rndChar(), font=font, fill=rndColor2())
    # Blur:
    image = image.filter(ImageFilter.BLUR)
    image.save('code.jpg', 'jpeg')

![Image](https://img-blog.csdnimg.cn/img_convert/1c3173832a07143085328e457ef36f12.jpeg)

Captcha


### **6. SimpleCV**

SimpleCV is an open-source framework for building computer vision applications. It gives access to high-performance computer vision libraries such as OpenCV without first having to learn about bit depths, file formats, color spaces, buffer management, eigenvalues, or matrices. Its Python 3 support, however, is very poor. Running the following code under Python 3.7:

    from SimpleCV import Image, Color, Display
    # load an image from imgur
    img = Image('http://i.imgur.com/lfAeZ4n.png')
    # use a keypoint detector to find areas of interest
    feats = img.findKeypoints()
    # draw the list of keypoints
    feats.draw(color=Color.RED)
    # show the resulting image.
    img.show()
    # apply the stuff we found to the image.
    output = img.applyLayers()
    # save the results.
    output.save('juniperfeats.png')

raises the following error, so using SimpleCV with Python 3 is not recommended:

    SyntaxError: Missing parentheses in call to 'print'. Did you mean print('unit test')?


### **7. Mahotas**

Mahotas is a fast computer vision library built on top of NumPy. It currently offers more than 100 image processing and computer vision functions, and the number keeps growing. The following loads an image with Mahotas and operates on its pixels:

    import numpy as np
    import mahotas
    import mahotas.demos
    from mahotas.thresholding import soft_threshold
    from matplotlib import pyplot as plt
    from os import path

    f = mahotas.demos.load('lena', as_grey=True)
    f = f[128:, 128:]
    plt.gray()
    # Show the data:
    print("Fraction of zeros in original image: {0}".format(np.mean(f == 0)))
    plt.imshow(f)
    plt.show()

![Image](https://img-blog.csdnimg.cn/img_convert/5967e7e71e70d41b492af7fe7a97b474.png)


Mahotas


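### **8. Ilastik**

Ilastik provides machine-learning-based analysis of bioimages. It uses machine learning algorithms to segment, classify, track, and count cells or other experimental data. Most operations are interactive and require no machine learning expertise.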


### **9. Scikit-Learn**

Scikit-learn is a free machine learning library for the Python programming language. It offers a range of classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN. The following implements the KMeans algorithm with Scikit-learn:

    import time

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.cluster import MiniBatchKMeans, KMeans
    from sklearn.metrics.pairwise import pairwise_distances_argmin
    from sklearn.datasets import make_blobs

    # Generate sample data
    np.random.seed(0)

    batch_size = 45
    centers = [[1, 1], [-1, -1], [1, -1]]
    n_clusters = len(centers)
    X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

    # Compute clustering with KMeans
    k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
    t0 = time.time()
    k_means.fit(X)
    t_batch = time.time() - t0

    # Compute clustering with MiniBatchKMeans
    mbk = MiniBatchKMeans(init='k-means++', n_clusters=3, batch_size=batch_size,
                          n_init=10, max_no_improvement=10, verbose=0)
    t0 = time.time()
    mbk.fit(X)
    t_mini_batch = time.time() - t0

    # Plot result
    fig = plt.figure(figsize=(8, 3))
    fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)
    colors = ['#4EACC5', '#FF9C34', '#4E9A06']

    # We want to have the same colors for the same cluster from the
    # MiniBatchKMeans and the KMeans algorithm. Let's pair the cluster centers per
    # closest one.
    k_means_cluster_centers = k_means.cluster_centers_
    order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
    mbk_means_cluster_centers = mbk.cluster_centers_[order]

    k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)
    mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)

    # KMeans
    for k, col in zip(range(n_clusters), colors):
        my_members = k_means_labels == k
        cluster_center = k_means_cluster_centers[k]
        plt.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
        plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
    plt.title('KMeans')
    plt.xticks(())
    plt.yticks(())

    plt.show()

![Image](https://img-blog.csdnimg.cn/img_convert/53135d8cd68a04438bffd3bcb09c96f7.png)


KMeans
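
The center-pairing step in the listing relies on `pairwise_distances_argmin`, which returns, for each row of its first argument, the index of the nearest row of its second argument. A minimal plain-Python sketch of that idea follows. The helper name `nearest_index` and the sample coordinates are made up for illustration, and this is not how scikit-learn implements it.

```python
import math


def nearest_index(points, candidates):
    """For each point, return the index of the closest candidate (Euclidean)."""
    result = []
    for p in points:
        distances = [math.dist(p, c) for c in candidates]
        result.append(distances.index(min(distances)))
    return result


# Hypothetical centers found by two clustering runs, listed in different orders.
kmeans_centers = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
minibatch_centers = [(0.9, -1.1), (1.1, 0.9), (-1.0, -0.9)]

order = nearest_index(kmeans_centers, minibatch_centers)
# Reorder the second list so that index k refers to the same cluster in both.
paired = [minibatch_centers[i] for i in order]
print(order)
print(paired)
```

After the reordering, cluster `k` can be drawn in the same color for both algorithms.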


### **10. SciPy**

The SciPy library provides many user-friendly and efficient numerical routines, such as numerical integration, interpolation, optimization, and linear algebra. It also defines many special functions of mathematical physics, including elliptic functions, Bessel functions, gamma functions, beta functions, hypergeometric functions, and parabolic cylinder functions.

    from scipy import special
    import matplotlib.pyplot as plt
    import numpy as np

    def drumhead_height(n, k, distance, angle, t):
        kth_zero = special.jn_zeros(n, k)[-1]
        return np.cos(t) * np.cos(n * angle) * special.jn(n, distance * kth_zero)

    theta = np.r_[0:2*np.pi:50j]
    radius = np.r_[0:1:50j]
    x = np.array([r * np.cos(theta) for r in radius])
    y = np.array([r * np.sin(theta) for r in radius])
    z = np.array([drumhead_height(1, 1, r, theta, 0.5) for r in radius])

    fig = plt.figure()
    ax = fig.add_axes(rect=(0, 0.05, 0.95, 0.95), projection='3d')
    ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap='RdBu_r', vmin=-0.5, vmax=0.5)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_xticks(np.arange(-1, 1.1, 0.5))
    ax.set_yticks(np.arange(-1, 1.1, 0.5))
    ax.set_zlabel('Z')
    plt.show()

![Image](https://img-blog.csdnimg.cn/img_convert/d43c2630e77a91c4ed10e81461081cbc.png)


SciPy


### **11. NLTK**

NLTK is a library for building Python programs that work with natural language. It provides easy-to-use interfaces to more than 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength natural language processing (NLP) libraries. NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python".

    import nltk
    from nltk.corpus import treebank

    # Download the required resources on first use
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    nltk.download('treebank')

    sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""
    # Tokenize
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)

    # Identify named entities
    entities = nltk.chunk.ne_chunk(tagged)

    # Display a parse tree
    t = treebank.parsed_sents('wsj_0001.mrg')[0]
    t.draw()

![Image](https://img-blog.csdnimg.cn/img_convert/353ad9ebe15204baa402667074e9c0f7.jpeg)


NLTK


### **12. spaCy**

spaCy is a free, open-source library for advanced NLP in Python. It can be used to build applications that process large volumes of text, to build information extraction or natural language understanding systems, or to preprocess text for deep learning.

    import spacy

    texts = [
        "Net income was $9.4 million compared to the prior year of $2.7 million.",
        "Revenue exceeded twelve billion dollars, with a loss of $1b.",
    ]
    nlp = spacy.load("en_core_web_sm")
    for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
        # Do something with the doc here
        print([(ent.text, ent.label_) for ent in doc.ents])

`nlp.pipe` yields `Doc` objects, so we can iterate over them and access the named entity predictions:

    [('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]
    [('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]


### **13. LibROSA**

librosa is a Python library for music and audio analysis. It provides the building blocks needed to create music information retrieval systems.

    # Beat tracking example
    import librosa

    # 1. Get the file path to an included audio example
    filename = librosa.example('nutcracker')

    # 2. Load the audio as a waveform `y`
    #    Store the sampling rate as `sr`
    y, sr = librosa.load(filename)

    # 3. Run the default beat tracker
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    print('Estimated tempo: {:.2f} beats per minute'.format(tempo))

    # 4. Convert the frame indices of beat events into timestamps
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
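
The frame-to-time conversion in step 4 is simple arithmetic: each frame index is multiplied by the hop length (in samples) and divided by the sampling rate. The sketch below reproduces that calculation in plain Python, assuming librosa's default values of `sr=22050` and `hop_length=512`. The function name `frames_to_seconds` and the frame values are illustrative, not real tracker output.

```python
def frames_to_seconds(frames, sr=22050, hop_length=512):
    """Convert frame indices to timestamps in seconds.

    Assumes frame k starts at sample k * hop_length (no extra offset).
    """
    return [frame * hop_length / sr for frame in frames]


# Hypothetical beat frames chosen for illustration only.
beat_frames = [0, 43, 86]
beat_times = frames_to_seconds(beat_frames)
print([round(t, 3) for t in beat_times])
```

With these defaults, 43 frames correspond to just under one second of audio.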


### **14. Pandas**

Pandas is a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and manipulation. It can import data from many formats, such as CSV, JSON, SQL, and Microsoft Excel, and supports operations such as merging, reshaping, and selection, as well as data cleaning and data wrangling. Pandas is widely used for data analysis in academia, finance, statistics, and other fields.

    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
    ts = ts.cumsum()

    df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
    df = df.cumsum()
    df.plot()
    plt.show()

![Image](https://img-blog.csdnimg.cn/img_convert/2572c48cf1df2e91ddff06ec478ac229.png)


Pandas
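
The `cumsum()` calls in the listing turn independent random steps into a running total, which is what gives the plotted curves their random-walk shape. A minimal plain-Python sketch of the same idea, using `itertools.accumulate` in place of pandas (the variable names and the five-step length are illustrative only):

```python
import itertools
import random

random.seed(0)

# Five independent steps drawn from a standard normal distribution.
steps = [random.gauss(0, 1) for _ in range(5)]

# Running total: element i is the sum of steps[0..i].
walk = list(itertools.accumulate(steps))

for step, position in zip(steps, walk):
    print(f"step {step:+.3f} -> position {position:+.3f}")
```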


### **15. Matplotlib**

Matplotlib is Python's plotting library. It offers a full set of command-style APIs similar to MATLAB's and can produce publication-quality figures. Matplotlib makes plotting very simple and strikes a good balance between ease of use and performance. The following plots several curves with Matplotlib:

    # plot_multi_curve.py
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0.1, 2 * np.pi, 100)
    y_1 = x
    y_2 = np.square(x)
    y_3 = np.log(x)
    y_4 = np.sin(x)
    plt.plot(x, y_1)
    plt.plot(x, y_2)
    plt.plot(x, y_3)
    plt.plot(x, y_4)
    plt.show()

![Image](https://img-blog.csdnimg.cn/img_convert/866b8fae0a83ed8c2655055da465341a.png)


Matplotlib


### **16. Seaborn**

Seaborn is a Python data visualization library that wraps Matplotlib in a higher-level API, which makes plotting easier. Seaborn should be seen as a complement to Matplotlib rather than a replacement.

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.set_theme(style="ticks")

    df = sns.load_dataset("penguins")
    sns.pairplot(df, hue="species")
    plt.show()

![Image](https://img-blog.csdnimg.cn/img_convert/c9b03274ea927312a43eb97a0419f70c.png)


seaborn


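### **17. Orange**

Orange is open-source data mining and machine learning software that provides components for data exploration, visualization, preprocessing, and modeling. It has a clean, intuitive interactive user interface that suits beginners doing exploratory data analysis and visualization, while advanced users can also use it as a Python module for data manipulation and widget development. Conveniently, Orange can be installed with pip:

    $ pip install orange3

After installation, run the `orange-canvas` command to launch the Orange GUI:

    $ orange-canvas

Once it has started, the Orange GUI appears and is ready for use.

![Image](https://img-blog.csdnimg.cn/img_convert/ae7943b04307c962e90819a2b3353201.png)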


Orange


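### **18. PyBrain**

PyBrain is a modular machine learning library for Python. Its goal is to offer flexible, easy-to-use yet powerful algorithms for machine learning tasks, along with a variety of predefined environments for testing and comparing those algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. We will use a simple example to show how PyBrain works by building a multilayer perceptron (MLP). First, we create a new feed-forward network object:

    from pybrain.structure import FeedForwardNetwork
    n = FeedForwardNetwork()

Next, we construct the input, hidden, and output layers:

    from pybrain.structure import LinearLayer, SigmoidLayer

    inLayer = LinearLayer(2)
    hiddenLayer = SigmoidLayer(3)
    outLayer = LinearLayer(1)

To use these layers, we must add them to the network:

    n.addInputModule(inLayer)
    n.addModule(hiddenLayer)
    n.addOutputModule(outLayer)

Multiple input and output modules can be added. For forward computation and backward error propagation, the network has to know which layers are inputs and which are outputs. We also have to state explicitly how the layers are connected. For this we use the most common connection type, the full connection, implemented by the `FullConnection` class:

    from pybrain.structure import FullConnection
    in_to_hidden = FullConnection(inLayer, hiddenLayer)
    hidden_to_out = FullConnection(hiddenLayer, outLayer)

As with the layers, we must explicitly add the connections to the network:

    n.addConnection(in_to_hidden)
    n.addConnection(hidden_to_out)

All elements are now in place. Finally, we call the `.sortModules()` method to make the MLP usable:

    n.sortModules()

This call performs some internal initialization that is required before the network can be used.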
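
To make the structure concrete, here is a library-independent sketch of the forward pass such a 2-3-1 network computes. It assumes each full connection is a plain weight matrix without a bias term, that the linear layers pass values through unchanged, and that the weights are random; the names `w_in_hidden`, `w_hidden_out`, and `forward` are illustrative and not part of PyBrain.

```python
import math
import random


def sigmoid(v):
    """Logistic activation used by the hidden layer."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


random.seed(0)
# Hypothetical random weights: 3x2 for input->hidden, 1x3 for hidden->output.
w_in_hidden = [[random.gauss(0, 1) for _ in range(2)] for _ in range(3)]
w_hidden_out = [[random.gauss(0, 1) for _ in range(3)]]


def forward(x):
    """Linear input layer -> sigmoid hidden layer -> linear output layer."""
    hidden = [sigmoid(v) for v in matvec(w_in_hidden, x)]
    return matvec(w_hidden_out, hidden)


output = forward([1.0, 2.0])
print(output)
```

The result is a list with one value, matching the single unit of the output layer.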


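### **19. Milk**

Milk (Machine Learning Toolkit) is a machine learning toolkit for Python. It focuses on supervised classification, with several classifiers available, including SVMs, k-NN, random forests, and decision trees. It also performs feature selection, and these components can be combined into different classification systems. For unsupervised learning, Milk supports k-means clustering and affinity propagation. The following trains a classifier with Milk:

    import numpy as np
    import milk

    features = np.random.rand(100, 10)
    labels = np.zeros(100)
    features[50:] += .5
    labels[50:] = 1
    learner = milk.defaultclassifier()
    model = learner.train(features, labels)

    # Now you can use the model on new examples:
    example = np.random.rand(10)
    print(model.apply(example))
    example2 = np.random.rand(10)
    example2 += .5
    print(model.apply(example2))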


### **20. TensorFlow**

TensorFlow is an end-to-end open-source machine learning platform with a comprehensive, flexible ecosystem. It is usually divided into TensorFlow 1.x and TensorFlow 2.x; the main difference is that TF 1.x uses static graphs, while TF 2.x uses eager mode (dynamic graphs). The example here uses TensorFlow 2.x to build a convolutional neural network (CNN).

    import tensorflow as tf

    from tensorflow.keras import datasets, layers, models

    # Load the data
    (train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

    # Preprocess the data
    train_images, test_images = train_images / 255.0, test_images / 255.0

    # Build the model
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10))

    # Compile and train the model
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    history = model.fit(train_images, train_labels, epochs=10,
                        validation_data=(test_images, test_labels))
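
It can help to trace how the feature-map size and parameter count evolve through this network. The following plain-Python sketch does that arithmetic, assuming the Keras defaults the listing relies on: unpadded ("valid") convolutions with stride 1, and 2x2 max pooling with stride 2. The helper names are illustrative and not part of TensorFlow.

```python
def conv_out(size, kernel=3):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window


def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


size = 32                # CIFAR-10 images are 32x32
size = conv_out(size)    # Conv2D(32)
size = pool_out(size)    # MaxPooling2D
size = conv_out(size)    # Conv2D(64)
size = pool_out(size)    # MaxPooling2D
size = conv_out(size)    # Conv2D(64)
flat = size * size * 64  # Flatten

total = (conv_params(3, 3, 32) + conv_params(3, 32, 64) + conv_params(3, 64, 64)
         + dense_params(flat, 64) + dense_params(64, 10))
print(size, flat, total)
```

Most of the parameters sit in the first dense layer, which is typical for small CNNs that flatten before classifying.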


### **21. PyTorch**

PyTorch's predecessor is Torch. It shares the same underlying backend as the Torch framework, but much of it has been rewritten in Python. It is more flexible, supports dynamic graphs, and provides a Python interface.

    # Import libraries
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets
    from torchvision.transforms import ToTensor, Lambda, Compose
    import matplotlib.pyplot as plt

    # Build the model
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("Using {} device".format(device))

    # Define model
    class NeuralNetwork(nn.Module):
        def __init__(self):
            super(NeuralNetwork, self).__init__()
            self.flatten = nn.Flatten()
            self.linear_relu_stack = nn.Sequential(
                nn.Linear(28*28, 512),
                nn.ReLU(),
                nn.Linear(512, 512),
                nn.ReLU(),
                nn.Linear(512, 10),
                nn.ReLU()
            )

        def forward(self, x):
            x = self.flatten(x)
            logits = self.linear_relu_stack(x)
            return logits

    model = NeuralNetwork().to(device)

    # Loss function and optimizer
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Train the model
    def train(dataloader, model, loss_fn, optimizer):
        size = len(dataloader.dataset)
        for batch, (X, y) in enumerate(dataloader):
            X, y = X.to(device), y.to(device)

            # Compute prediction error
            pred = model(X)
            loss = loss_fn(pred, y)

            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch % 100 == 0:
                loss, current = loss.item(), batch * len(X)
                print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
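
The training loop measures prediction error with `nn.CrossEntropyLoss`, which takes raw logits and integer class labels. As a rough illustration of the quantity involved, the plain-Python sketch below computes the negative log-softmax of the target class and averages it over a batch. The function names and sample logits are made up, and this is a simplified stand-in rather than PyTorch's implementation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class for one sample."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def batch_loss(batch_logits, targets):
    """Mean cross-entropy over a batch."""
    losses = [cross_entropy(row, t) for row, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Hypothetical logits for a 4-class problem; the true class is index 2.
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
confident = cross_entropy([0.0, 0.0, 8.0, 0.0], target=2)
mean_loss = batch_loss([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 8.0, 0.0]], [2, 2])
print(uniform, confident, mean_loss)
```

A model that assigns equal scores to all four classes pays a loss of log(4), while a confident correct prediction drives the loss toward zero.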


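### **22. Theano**

Theano is a Python library, built on top of NumPy, for defining, optimizing, and efficiently evaluating mathematical expressions that involve multidimensional arrays. A representative task is computing a Jacobian matrix with Theano.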
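
As a library-independent illustration of what a Jacobian matrix contains, the sketch below approximates it with central finite differences in plain Python. Theano itself differentiates symbolically, so this is a different technique used only to show the result; the function `numerical_jacobian` and the test function `square` are hypothetical names chosen for this example.

```python
def numerical_jacobian(func, x, eps=1e-6):
    """Approximate J[i][j] = d func(x)[i] / d x[j] by central differences."""
    n_out = len(func(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[j] += eps
        backward[j] -= eps
        f_plus, f_minus = func(forward), func(backward)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


def square(x):
    """Element-wise square, whose exact Jacobian is diag(2 * x)."""
    return [v ** 2 for v in x]


J = numerical_jacobian(square, [4.0, 4.0])
for row in J:
    print([round(v, 4) for v in row])
```

For the element-wise square at the point (4, 4), the diagonal entries come out close to 8 and the off-diagonal entries are zero, since each output depends on only one input.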



