Python Study Notes 5: MOSH 6-Hour Course, Projects

Adore_rui

Last modified: 2023-08-04 11:35:32

First published: 2023-07-26 23:39:33

Article link: https://blog.csdn.net/Adore_rui/article/details/131949912

Project 1: Excel Spreadsheets

If you are on Windows, PyCharm has no Reveal in Finder option, because Finder is the macOS file manager. On Windows, press Alt+F1 and then choose the Show in Explorer option to open the folder that contains the file.

import openpyxl as xl
wb=xl.load_workbook('transaction.xlsx')
sheet=wb['Sheet1']
cell=sheet['a1']  #cell=sheet.cell(1,1)
print(cell.value)

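This code reads cell A1. There are two ways to do it: address the cell by its coordinate with sheet['a1'], or locate it by row and column with sheet.cell(1,1).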
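To make the link between the two addressing styles concrete, here is a minimal sketch that converts an A1-style reference into the (row, column) pair that sheet.cell() takes. The helper name a1_to_row_col is hypothetical and is not part of openpyxl; it only illustrates the mapping.

```python
import re


def a1_to_row_col(ref):
    """Convert an A1-style reference such as 'a1' or 'AB12' to (row, column), both 1-based.

    Hypothetical helper for illustration only; it is not an openpyxl function.
    """
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    # Column letters work like a base-26 number where A=1 ... Z=26
    for ch in letters.upper():
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return int(digits), column


print(a1_to_row_col("a1"))  # (1, 1)
print(a1_to_row_col("D2"))  # (2, 4)
```

So sheet['a1'] and sheet.cell(1,1) point at the same cell, and column 4 used later in these notes is column D.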

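For more operations, see the openpyxl documentation.

import openpyxl as xl
from openpyxl.chart import BarChart,Reference
wb=xl.load_workbook('transaction.xlsx')
sheet=wb['Sheet1']

# Start from row 2 (row 1 is assumed to hold the column headers)
for row in range(2,sheet.max_row+1):
    cell=sheet.cell(row,3)  # price in column C
    correct_price=cell.value*0.9  # corrected price is 90% of the original
    correct_price_cell=sheet.cell(row,4)  # write the result to column D
    correct_price_cell.value=correct_price

# Select the corrected prices in column D as the chart data
values=Reference(sheet,
          min_row=2,
          max_row=sheet.max_row,
          min_col=4,
          max_col=4)
chart=BarChart()
chart.add_data(values)
sheet.add_chart(chart,'e2')  # place the chart at cell E2

# Save under a new name so the original file is left untouched
wb.save('transaction2.xlsx')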

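New version revised with ChatGPT's help. It now prompts for the name of the Excel file to process and for the name of the new file to save.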

import openpyxl as xl
from openpyxl.chart import BarChart, Reference


def process_workbook():
    # Ask which workbook to open
    filename = input("Enter the name of the Excel file to process: ")
    wb = xl.load_workbook(filename)
    sheet = wb['Sheet1']

    # Column 3 holds the price; write 90% of it to column 4
    for row in range(2, sheet.max_row + 1):
        cell = sheet.cell(row, 3)
        correct_price = cell.value * 0.9
        correct_price_cell = sheet.cell(row, 4)
        correct_price_cell.value = correct_price

    # Chart the corrected prices in column 4
    values = Reference(sheet,
                       min_row=2,
                       max_row=sheet.max_row,
                       min_col=4,
                       max_col=4)
    chart = BarChart()
    chart.add_data(values)
    sheet.add_chart(chart, 'e2')

    # Ask where to save the result
    new_filename = input("Enter the name of the new Excel file to save: ")
    wb.save(new_filename)


process_workbook()
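One thing to watch in the loop above: cell.value * 0.9 raises a TypeError when a cell in column 3 is empty (None) or holds text. Below is a minimal sketch of a guard, assuming you would rather skip such cells than stop the script. The helper name discounted and the sample values are made up for illustration.

```python
def discounted(value, rate=0.9):
    """Return value * rate, or None when the cell does not hold a number.

    Hypothetical helper; 'value' stands for whatever cell.value returned.
    """
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return value * rate


# Made-up column contents: two prices, an empty cell, and a text cell
column_c = [10, None, "n/a", 20]
print([discounted(v) for v in column_c])
```

Inside the loop you could then write the result only when discounted(cell.value) is not None.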

Project 2: A Subset of Machine Learning

Machine learning in action

Import Data
Clean Data
Split the data into training/test sets (the 80/20 rule)
Create a model
Train the model
Evaluate and improve

Libraries and tools

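Numpy: multi-dimensional arrays

Pandas: a data analysis library. Its DataFrame is two-dimensional (like an Excel sheet, with rows and columns)

MatPlotLib: a 2D plotting library for creating graphs

Scikit-Learn: machine learning algorithms such as decision trees and neural networks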

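The code is written in an environment called Jupyter: Project Jupyter | Home

Go to Anaconda.com and open the download page. Its appeal is that it installs all of the popular libraries above in one go, so there is no need to set them up by hand with pip. During installation, tick all five of the optional checkboxes, including the second one even though it is marked as not recommended.

After the installation succeeded, I registered an account. Mosh then typed a command in a terminal window. He uses a Mac, so here is how I adapted it for Windows 11.

First enter cd /d D: to switch to drive D, because that is where I installed the software.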

D:\>jupyter notebook

The browser did not open automatically, but copying the URL from the terminal output into the browser works: http://localhost:8888/?token=03c6dca12a6ba13affaa360eee1aae3abd889637baa94447

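The following tutorial did not help:

http://t.csdn.cn/XKA5v

None of the tutorials I read worked. In the end I fixed it by chance, and I do not fully understand why.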

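Jupyter could not save notebooks or rename them. The fix was as follows:

1. First uninstall pyzmq:

pip uninstall pyzmq

2. Then install it again:

pip install pyzmq

The output is shown below. After copying the link into the browser again, save and run worked normally. Note that the uninstall step reports that pyzmq was not installed in the first place, so it was apparently the install step that fixed the problem: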

(base) C:\Users\郎蕊>pip uninstall pyzmq
WARNING: Skipping pyzmq as it is not installed.

(base) C:\Users\郎蕊>pip install pyzmq
Collecting pyzmq
  Downloading pyzmq-25.1.0-cp311-cp311-win_amd64.whl (1.1 MB)
     ---------------------------------------- 1.1/1.1 MB 21.7 kB/s eta 0:00:00
Installing collected packages: pyzmq
Successfully installed pyzmq-25.1.0

(base) C:\Users\郎蕊>jupyter notebook

  _   _          _      _
 | | | |_ __  __| |__ _| |_ ___
 | |_| | '_ \/ _` / _` |  _/ -_)
  \___/| .__/\__,_\__,_|\__\___|
       |_|

Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.

https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html

Please note that updating to Notebook 7 might break some of your extensions.

[W 22:38:36.105 NotebookApp] Loading JupyterLab as a classic notebook (v6) extension.
[C 22:38:36.105 NotebookApp] You must use Jupyter Server v1 to load JupyterLab as notebook extension. You have v2.5.0 installed.
    You can fix this by executing:
        pip install -U "jupyter-server<2.0.0"
[I 22:38:38.090 NotebookApp] Serving notebooks from local directory: C:\Users\郎蕊
[I 22:38:38.090 NotebookApp] Jupyter Notebook 6.5.4 is running at:
[I 22:38:38.090 NotebookApp] http://localhost:8888/?token=fcee76f188e58259df8bcb09142177121a2de8898cd3d79b
[I 22:38:38.106 NotebookApp]  or http://127.0.0.1:8888/?token=fcee76f188e58259df8bcb09142177121a2de8898cd3d79b
[I 22:38:38.106 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 22:38:38.168 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///C:/Users/%E9%83%8E%E8%95%8A/AppData/Roaming/jupyter/runtime/nbserver-8404-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=fcee76f188e58259df8bcb09142177121a2de8898cd3d79b
     or http://127.0.0.1:8888/?token=fcee76f188e58259df8bcb09142177121a2de8898cd3d79b
This version of python seems to be incorrectly compiled
(internal generated filenames are not absolute).
This may make the debugger miss breakpoints.
Related bug: http://bugs.python.org/issue1666807
[I 22:39:45.030 NotebookApp] 302 GET /?token=fcee76f188e58259df8bcb09142177121a2de8898cd3d79b (::1) 0.000000ms
[I 22:39:45.071 NotebookApp] Saving file at /PycharmProjects/Untitled.ipynb
C:\anaconda\Lib\site-packages\nbformat\__init__.py:128: MissingIDFieldWarning: Code cell is missing an id field, this will become a hard error in future nbformat versions. You may want to use `normalize()` on your notebooks before validations (available since nbformat 5.1.4). Previous versions of nbformat are fixing this issue transparently, and will stop doing so in the future.
  validate(nb)
C:\anaconda\Lib\site-packages\notebook\services\contents\manager.py:353: MissingIDFieldWarning: Code cell is missing an id field, this will become a hard error in future nbformat versions. You may want to use `normalize()` on your notebooks before validations (available since nbformat 5.1.4). Previous versions of nbformat are fixing this issue transparently, and will stop doing so in the future.
  validate_nb(model['content'])
[I 22:40:19.999 NotebookApp] Creating new notebook in /PycharmProjects
[I 22:40:21.597 NotebookApp] Kernel started: db2a5c36-1b4d-4408-9b4e-2727cf20ab2a, name: python3
This version of python seems to be incorrectly compiled
(internal generated filenames are not absolute).
This may make the debugger miss breakpoints.
Related bug: http://bugs.python.org/issue1666807
[I 22:42:22.063 NotebookApp] Saving file at /PycharmProjects/helloworld.ipynb

Continuing with the project, following Mosh:

import pandas as pd
df =pd.read_csv('vgsales.csv')
df.shape

df.describe()

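Jupyter Shortcuts

A green border means the cell is in edit mode. Click the blank area beside the cell so that the blue frame appears, then press h to display the keyboard shortcuts.

Jupyter Notebook has two keyboard input modes. Edit mode lets you type code or text into a cell and is indicated by a green cell border. Command mode binds the keyboard to notebook-level commands and is indicated by a grey cell border with a blue left margin.

Tab autocompletes a name, and Shift+Tab shows its documentation. On Windows, Ctrl+/ turns the line into a comment, and pressing Ctrl+/ again turns it back into a normal statement.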

A Real Problem

Steps:

Import Data
Clean Data
Split the data into training/test sets (the 80/20 rule)
Create a model
Train the model
Make Predictions
Evaluate and Improve

Step 1: Import Data

https://www.dropbox.com/s/81s0t9ayh8m4sq8/music.csv?dl=0

Download the CSV file, put it in the Jupyter working folder, and enter the following in the notebook:

import pandas as pd
music_data = pd.read_csv('music.csv')
music_data

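Step 2: Preparing the Data

This step removes duplicate rows, null values, and so on. Our data set does not need that cleaning, but it does need to be split into two separate data sets: the input and the output.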

import pandas as pd
music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
Y = music_data['genre']
Y
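To show what the cleaning and the input/output split amount to without needing pandas, here is a minimal standard-library sketch. The three-column layout (age, gender, genre) follows music.csv, but the sample rows are made up, and the helper names load_rows and split_xy are hypothetical.

```python
import csv
import io

# Made-up sample rows in the same three-column layout as music.csv
SAMPLE = """age,gender,genre
20,1,HipHop
20,1,HipHop
26,0,Acoustic
"""


def load_rows(text):
    """Parse CSV text and drop exact duplicate rows, keeping first occurrences."""
    seen, rows = set(), []
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


def split_xy(rows, target="genre"):
    """Return (X, y): the feature columns as lists of ints, and the target column."""
    X = [[int(v) for k, v in row.items() if k != target] for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_xy(load_rows(SAMPLE))
print(X)  # [[20, 1], [26, 0]]
print(y)  # ['HipHop', 'Acoustic']
```

X plays the role of music_data.drop(columns=['genre']) and y the role of music_data['genre'].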

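Step 3: Learning and Predicting

We use a decision tree as the model. scikit-learn, the most popular machine learning library in Python, already implements it.

The Python package is called sklearn, and the class is called DecisionTreeClassifier.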

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
Y = music_data['genre']

model=DecisionTreeClassifier()
model.fit(X,Y)
predictions = model.predict([[21,1],[22,0]])
predictions

Step 4: Calculating the Accuracy

First split the data set into two groups, one for training and one for testing. Typically 70-80% of the data goes to training and 20-30% to testing.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
Y = music_data['genre']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

model=DecisionTreeClassifier()
model.fit(X_train,Y_train)
predictions = model.predict(X_test)


score = accuracy_score(Y_test,predictions)
score

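A model needs a large amount of clean data to be trained accurately. For example, when only 20% of the samples were used for training, the accuracy was only about 40%.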
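The split and the score can be sketched by hand to see what is being measured. This is a simplified stand-in written with the standard library, not scikit-learn's implementation. The helper names, the seed, and the label lists are made up, and the 18 comes from the sample count of this data set.

```python
import math
import random


def split_indices(n, test_size=0.2, seed=0):
    """Shuffle the row indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


train_idx, test_idx = split_indices(18)
print(len(train_idx), len(test_idx))  # 14 4

# Made-up labels: three of the four predictions are right
print(accuracy(["Jazz", "Dance", "HipHop", "Jazz"],
               ["Jazz", "Dance", "Jazz", "Jazz"]))  # 0.75
```

With only four test rows, one wrong prediction moves the score by 25 points, which is one reason the accuracy jumps around between runs on a data set this small.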

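Step 5: Persisting Models

With a large data set, training the model every time takes too long. Instead, build the model once and save it to a file. The next time we need a prediction, we just load the model from the file and ask it to predict.

The joblib object has methods for saving and loading models.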

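Error: ImportError: cannot import name 'joblib' from 'sklearn.externals' (C:\anaconda\Lib\site-packages\sklearn\externals\__init__.py)

After reading a post by a blogger on 51CTO and some trial and error, I finally found the right fix. At first I thought the package was missing, so I tried:

pip install joblib

That alone did not help. What fixed it was changing from sklearn.externals import joblib to import joblib. Recent scikit-learn releases no longer ship joblib under sklearn.externals (it was removed in version 0.23), so joblib has to be imported as a standalone package.

1. Version that raises the error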

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.externals import joblib

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
Y = music_data['genre']

model=DecisionTreeClassifier()
model.fit(X,Y)

joblib.dump(model,'music_recommender.joblib')

# predictions = model.predict([[21,1],[22,0]])

2. Working version

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
Y = music_data['genre']

model=DecisionTreeClassifier()
model.fit(X,Y)

joblib.dump(model,'music_recommender.joblib')

# predictions = model.predict([[21,1],[22,0]])

The model file is saved in the same folder as the notebook.

Then load the model and make a prediction:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib

model=joblib.load('music_recommender.joblib')

predictions = model.predict([[21,1]])
predictions
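The save-then-load round trip can also be tried with the standard library alone. joblib's persistence is built on Python's pickle protocol, so the sketch below swaps in pickle and uses a plain dictionary as a stand-in for the trained model. The dictionary contents and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: a plain lookup table, not a real classifier
model = {(21, 1): "HipHop", (22, 0): "Dance"}

# Write to a temporary folder so the sketch leaves the working folder alone
path = os.path.join(tempfile.mkdtemp(), "music_recommender.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)  # counterpart of joblib.dump(model, path)

with open(path, "rb") as f:
    restored = pickle.load(f)  # counterpart of joblib.load(path)

print(restored == model)  # True
print(restored[(21, 1)])  # HipHop
```

As with any pickle-based format, only load model files that you created yourself or that come from a source you trust.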

I wrote this wrong at first and did not understand why, so I asked ChatGPT, which explained it well:

The distinction between using model.predict([[21, 1]]) and model.predict([21, 1]) lies in the shape and format of the input data that the predict method expects.

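In scikit-learn, the input to a model's predict method should be structured as a two-dimensional array (a matrix), even when you are predicting for a single instance. The reason is that scikit-learn models are designed to process many samples at once, and always using a two-dimensional array keeps the interface consistent.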

Here's why model.predict([[21, 1]]) is correct, and model.predict([21, 1]) is wrong:

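1. `model.predict([[21, 1]])`: this correctly supplies a two-dimensional array with one sample and two features. The outer brackets create a list of lists, and the inner list `[21, 1]` holds the feature values of a single instance. This form matches what scikit-learn expects.

2. `model.predict([21, 1])`: this is a one-dimensional list containing the two values 21 and 1. It lacks the outer brackets needed to form a two-dimensional array. scikit-learn cannot tell whether this means one sample with two features or two samples with one feature each, so it raises a ValueError (Expected 2D array, got 1D array instead) rather than guessing.

In short, scikit-learn's predict method expects the input to be structured as a two-dimensional array, even for a single instance, so wrap the instance's feature values in `[[...]]` when making a prediction.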
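The difference in shape is easy to see with a small helper. This is a minimal sketch in plain Python; shape_of is a hypothetical function written for illustration, not something scikit-learn provides.

```python
def shape_of(data):
    """Return the shape of a rectangular nested list, e.g. [[21, 1]] -> (1, 2)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        if not data:
            break
        data = data[0]  # descend one nesting level
    return tuple(shape)


print(shape_of([[21, 1]]))           # (1, 2): one sample, two features
print(shape_of([[21, 1], [22, 0]]))  # (2, 2): two samples, two features
print(shape_of([21, 1]))             # (2,): one-dimensional, no sample axis
```

Only the first two inputs have the (samples, features) layout that predict is documented to accept.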

Step 6: Visualizing a Decision Tree

1. Export the decision tree to a .dot file:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
Y = music_data['genre']

model=DecisionTreeClassifier()
model.fit(X,Y)

tree.export_graphviz(model,
                     out_file='music-recommender.dot',
                     feature_names=['age','gender'],
                     class_names=sorted(Y.unique()),
                     label='all',
                     rounded=True,
                     filled=True)

2. The file contents are as follows:

digraph Tree {
node [shape=box, style="filled, rounded", color="black", fontname="helvetica"] ;
edge [fontname="helvetica"] ;
0 [label="age <= 30.5\ngini = 0.778\nsamples = 18\nvalue = [3, 6, 3, 3, 3]\nclass = Classical", fillcolor="#e5fad7"] ;
1 [label="gender <= 0.5\ngini = 0.75\nsamples = 12\nvalue = [3, 0, 3, 3, 3]\nclass = Acoustic", fillcolor="#ffffff"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="age <= 25.5\ngini = 0.5\nsamples = 6\nvalue = [3, 0, 3, 0, 0]\nclass = Acoustic", fillcolor="#ffffff"] ;
1 -> 2 ;
3 [label="gini = 0.0\nsamples = 3\nvalue = [0, 0, 3, 0, 0]\nclass = Dance", fillcolor="#39e5c5"] ;
2 -> 3 ;
4 [label="gini = 0.0\nsamples = 3\nvalue = [3, 0, 0, 0, 0]\nclass = Acoustic", fillcolor="#e58139"] ;
2 -> 4 ;
5 [label="age <= 25.5\ngini = 0.5\nsamples = 6\nvalue = [0, 0, 0, 3, 3]\nclass = HipHop", fillcolor="#ffffff"] ;
1 -> 5 ;
6 [label="gini = 0.0\nsamples = 3\nvalue = [0, 0, 0, 3, 0]\nclass = HipHop", fillcolor="#3c39e5"] ;
5 -> 6 ;
7 [label="gini = 0.0\nsamples = 3\nvalue = [0, 0, 0, 0, 3]\nclass = Jazz", fillcolor="#e539c0"] ;
5 -> 7 ;
8 [label="gini = 0.0\nsamples = 6\nvalue = [0, 6, 0, 0, 0]\nclass = Classical", fillcolor="#7be539"] ;
0 -> 8 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}

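After wrestling with GPT for a while, I found that an online tool or desktop software can render the file. Paste the contents of the .dot file into an online Graphviz viewer (for example http://www.webgraphviz.com/) to get a preview of the graph.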
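Even without rendering it, the .dot file can be read by hand as a set of nested conditions. The sketch below transcribes the tree above into plain if statements. It assumes that, for each split, the lower-numbered child node is the branch where the condition is true, which is how the root's True and False edges are labeled in the file. The function name predict_genre is made up.

```python
def predict_genre(age, gender):
    """Hand transcription of the exported tree; not generated by scikit-learn."""
    if age <= 30.5:                # node 0
        if gender <= 0.5:          # node 1
            # node 2: leaves 3 (Dance) and 4 (Acoustic)
            return "Dance" if age <= 25.5 else "Acoustic"
        # node 5: leaves 6 (HipHop) and 7 (Jazz)
        return "HipHop" if age <= 25.5 else "Jazz"
    return "Classical"             # node 8


print(predict_genre(21, 1))  # HipHop
print(predict_genre(22, 0))  # Dance
print(predict_genre(35, 1))  # Classical
```

Because the model learned nothing more than these thresholds, a tree this small can be checked line by line against the file.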

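feature_names=['age','gender'] labels the split conditions in the tree with the feature names age and gender.
class_names=sorted(Y.unique()) supplies the class name shown in each node.
label='all' shows the descriptive labels (gini, samples, value, class) on every node.
rounded=True draws the node boxes with rounded corners.
filled=True fills each node box with a color that indicates its majority class.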
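The gini numbers in the .dot file can be reproduced from the value lists in the same nodes. Gini impurity is 1 minus the sum of the squared class proportions. The sketch below is a minimal implementation of that formula, with the counts copied from nodes 0, 1, and 8 of the file. The function name gini is just a local helper.

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    if total == 0:
        raise ValueError("node has no samples")
    return 1.0 - sum((c / total) ** 2 for c in counts)


print(round(gini([3, 6, 3, 3, 3]), 3))  # 0.778 (node 0, the root)
print(round(gini([3, 0, 3, 3, 3]), 3))  # 0.75  (node 1)
print(round(gini([0, 6, 0, 0, 0]), 3))  # 0.0   (node 8, a pure leaf)
```

A value of 0.0 means every sample in the node belongs to one class, which is why the leaves all show gini = 0.0.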