Lab 7: Comprehensive Experiment

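This post describes how to install Python libraries with pip in an Anaconda environment for use in Jupyter, including basic pip checks and the installation of third-party libraries such as numpy and pandas. It also shows how to load and preprocess a dataset, adult.data, including describing the data and visualizing the feature distributions.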

I.

Download Anaconda and confirm that Jupyter Notebook and Spyder start successfully.

Enter a test statement to check the setup: print("hello")

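II.

Install the libraries from the Anaconda Prompt:

  1. Locate Anaconda's Scripts directory and copy its path for use in the installation commands later:

D:\Program Files\anaconda3\Scripts

  2. Open the Prompt command window and run pip:

① First pip command

Check that pip works

② Second pip command

③ Third and fourth pip commands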
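The pip commands themselves are not reproduced above. Once they have run, one way to confirm that the libraries are visible to the interpreter is a quick check like the sketch below. The `REQUIRED` list is taken from the imports used later in this post, and `find_missing` is a hypothetical helper name, not part of pip or Anaconda.

```python
import importlib.util

# Import names used later in this lab (assumed from the import cell below).
# Note: the import name "sklearn" is installed with "pip install scikit-learn".
REQUIRED = ["numpy", "pandas", "matplotlib", "missingno", "seaborn",
            "sklearn", "xgboost", "lightgbm", "scipy", "requests"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing:", missing if missing else "none")
```

Anything listed as missing can then be installed from the same prompt before moving on to the code in the next part.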

III. Enter the code and run it step by step

1.

import io, os, sys, types, time, datetime, math, random

import requests, subprocess, tempfile

2.

# Import third-party libraries

# Data processing

import numpy as np

import pandas as pd

# Data visualization

import matplotlib.pyplot as plt

import missingno

import seaborn as sns

from pandas.plotting import scatter_matrix

from mpl_toolkits.mplot3d import Axes3D

# Feature selection and encoding

from sklearn.feature_selection import RFE, RFECV

from sklearn.svm import SVR

from sklearn.decomposition import PCA

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine learning

import sklearn.ensemble as ske

from sklearn import datasets, model_selection, tree, preprocessing, metrics

from sklearn import linear_model

from sklearn.svm import LinearSVC

from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.naive_bayes import GaussianNB

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier

from sklearn.tree import DecisionTreeClassifier

import xgboost as xgb

import lightgbm as lgb

# Grid search and randomized search

import scipy.stats as st

from scipy.stats import randint as sp_randint

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import RandomizedSearchCV

# Model metrics (classification)

from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc

# Warning handling

import warnings

warnings.filterwarnings('ignore')

# Draw plots inline in Jupyter

%matplotlib inline
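The blanket `warnings.filterwarnings('ignore')` in the import cell hides every warning, including deprecation notices that can explain why code stops working after a version upgrade. A narrower alternative is to silence one category only around the call that needs it. The sketch below uses only the standard library; `noisy` is a hypothetical stand-in for a library call and is not from the original code.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a deprecation warning."""
    warnings.warn("old_function is deprecated", DeprecationWarning)
    return 42


# Suppress one warning category inside this block only;
# the previous filters are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    quiet_result = noisy()

# Record warnings in a second block to confirm they are still raised elsewhere.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy()

print(quiet_result, loud_result, [str(w.message) for w in caught])
```

Scoping the filter this way keeps the notebook output clean without losing warnings from the rest of the session.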

3.

# Column names

headers = ['age', 'workclass', 'fnlwgt',

           'education', 'education-num',

           'marital-status', 'occupation',

           'relationship', 'race', 'sex',

           'capital-gain', 'capital-loss',

           'hours-per-week', 'native-country',

           'predclass']

# Load the training set

# na_values controls how missing values are handled on read: "?" is treated as missing

training_raw = pd.read_csv('C://Users//Administrator//Desktop//adult.data',

                           header=None,

                           names=headers,

                           sep=r',\s',

                           na_values=["?"],

                           engine='python')

# Load the test set

test_raw = pd.read_csv('C://Users//Administrator//Desktop//adult.test',

                       header=None,

                       names=headers,

                       sep=r',\s',

                       na_values=["?"],

                       engine='python',

                       skiprows=1)
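To see what the regex separator and `na_values` do, the same options can be tried on a couple of in-memory lines instead of the real file. This is a minimal sketch: the two sample rows and the shortened column list are made up for illustration and are not taken from adult.data.

```python
import io

import pandas as pd

# Two made-up rows in the same "comma + space" layout; "?" marks a missing value.
sample = (
    "39, State-gov, 77516, <=50K\n"
    "50, ?, 83311, <=50K\n"
)
cols = ["age", "workclass", "fnlwgt", "predclass"]  # shortened, illustrative column list

df = pd.read_csv(io.StringIO(sample),
                 header=None,
                 names=cols,
                 sep=r",\s",        # regex: a comma followed by one whitespace character
                 na_values=["?"],   # treat "?" as missing
                 engine="python")   # regex separators need the Python engine

print(df)
print(df.isna().sum())  # missing values per column
```

Because the separator consumes the space after each comma, the field is read as exactly `?` and is matched by `na_values`; with a plain `,` separator the field would keep its leading space and would likely not be recognized as missing.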

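test_raw.shape  # (16281, 15): the dimensions of the test set

# Combine the training and test sets for analysis

dataset_raw = pd.concat([training_raw, test_raw])  # merge the two datasets (public replacement for the private _append)

# Rebuild the index to avoid errors caused by duplicate index labels

dataset_raw.reset_index(inplace=True)  # move the old index into a column named 'index'

dataset_raw.drop('index', inplace=True, axis=1)  # drop that column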
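The reason for rebuilding the index is easier to see on a toy example. The sketch below stacks two small made-up frames with `pd.concat` and compares the index before and after; it uses `ignore_index=True` as a one-step alternative to resetting and dropping the index, which is a substitution and not the code used in this lab.

```python
import pandas as pd

# Two made-up frames standing in for the training and test sets.
train = pd.DataFrame({"age": [39, 50]})
test = pd.DataFrame({"age": [25, 38]})

# Plain concatenation keeps each frame's own labels, so they repeat.
stacked = pd.concat([train, test])
print(stacked.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 in one step.
combined = pd.concat([train, test], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```

With repeated labels, a lookup such as `stacked.loc[0]` returns two rows, which is the kind of surprise the index cleanup avoids.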

# Check how much memory the DataFrame uses

def convert_size(size_bytes):

    if size_bytes == 0:

        return "0B"

    size_name = ("Bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB","YB")

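    i = int(math.floor(math.log(size_bytes, 1024))) # index of the unit (rounded down)

    p = math.pow(1024, i)

    s = round(size_bytes / p, 2) # size expressed in that unit, rounded to two decimal places

    # Return the size as a human-readable string.

    # memory_usage (used in the call below): returns the memory used by each Series in the DataFrame.

    # sum: adds those values up.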

    return "%s %s" % (s, size_name[i])

convert_size(dataset_raw.memory_usage().sum())

  • Output
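The output for the real dataset is not reproduced here, but the arithmetic in `convert_size` can be checked by hand on known byte counts. The sketch below repeats the function so that it runs on its own; the input values are chosen for illustration.

```python
import math


def convert_size(size_bytes):
    """Format a byte count using 1024-based units, as in the function above."""
    if size_bytes == 0:
        return "0B"
    size_name = ("Bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))  # which unit to use
    p = math.pow(1024, i)                            # size of that unit in bytes
    s = round(size_bytes / p, 2)                     # value in that unit, two decimals
    return "%s %s" % (s, size_name[i])


print(convert_size(0))          # 0B
print(convert_size(1536))       # 1536 / 1024 = 1.5        -> "1.5 KB"
print(convert_size(5_500_000))  # 5500000 / 1024**2 ≈ 5.245 -> "5.25 MB"
```

For 1536 bytes, log base 1024 is about 1.06, which floors to 1, so the unit is KB and the value is 1536 / 1024 = 1.5.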

V.

dataset_raw.describe()

dataset_raw.describe(include=['O'])
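The two `describe` calls summarize different kinds of columns: the default call covers the numeric columns, while `include=['O']` covers the object (string) columns. A small sketch on made-up data shows the difference; the values below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up frame with one numeric and one object (string) column.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "sex": pd.Series(["Male", "Male", "Male", "Female"], dtype=object),
})

numeric_summary = df.describe()               # count, mean, std, min, quartiles, max
object_summary = df.describe(include=["O"])   # count, unique, top, freq

print(numeric_summary)
print(object_summary)
```

For the object column, `top` is the most frequent value and `freq` is how many times it occurs, which gives a quick first look at categorical fields such as workclass or occupation.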

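VI. (One more bonus point) Visualization. This did not run in the previous class; it worked this time after the teacher adjusted the code for the current Python version.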

def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):

    plt.style.use('seaborn-v0_8-whitegrid')

    fig = plt.figure(figsize=(width, height))

    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace) # adjust the spacing between subplots

    rows = math.ceil(float(dataset.shape[1]) / cols) # ceil rounds up

    for i, column in enumerate(dataset.columns): # yields the index and the column name

        ax = fig.add_subplot(rows, cols, i + 1) # create a subplot, similar to plt.subplot

        ax.set_title(column) # set the subplot title

        if dataset.dtypes[column] == object: # choose the plot type based on the column dtype

            g = sns.countplot(y=column, data=dataset)

            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]

            g.set(yticklabels=substrings)

            plt.xticks(rotation=25)

        else:

            g = sns.histplot(dataset[column], kde=True, stat="density") # histplot replaces the deprecated distplot

            plt.xticks(rotation=25)

plot_distribution(dataset_raw, cols=3, width=20, height=20,hspace=0.45, wspace=0.5)
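The grid layout in `plot_distribution` comes down to a little arithmetic: the number of rows is the column count divided by `cols`, rounded up, and subplot `i + 1` fills the grid row by row. The sketch below works this out for the 15 columns of this dataset with `cols=3`; `subplot_grid` is a hypothetical helper written for illustration.

```python
import math


def subplot_grid(n_plots, cols):
    """Return the row count and, per plot, (1-based subplot index, row, column)."""
    rows = math.ceil(n_plots / cols)  # round up so a partial last row still fits
    positions = [(i + 1, i // cols, i % cols) for i in range(n_plots)]
    return rows, positions


rows, positions = subplot_grid(15, 3)
print(rows)            # 5
print(positions[3])    # (4, 1, 0): the fourth plot starts the second row
print(positions[-1])   # (15, 4, 2): the last plot fills the final cell
```

With 15 columns and 3 plots per row the grid is exactly full; with `cols=5`, the default in the function signature, the same data would need 3 rows.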
