Seaborn Plotting Cheat Sheet for Data Analysis | Multiple Ways to Plot Trends, Relationships, and Distributions | 20-Minute Crash Course | Kaggle Learning (1)


  • Trends

    - A trend is defined as a pattern of change.

    • sns.lineplot - Line charts are best to show trends over a period of time, and multiple lines can be used to show trends in more than one group.
  • Relationship

    - There are many different chart types that you can use to understand relationships between variables in your data.

    • sns.barplot - Bar charts are useful for comparing quantities corresponding to different groups.
    • sns.heatmap - Heatmaps can be used to find color-coded patterns in tables of numbers.
    • sns.scatterplot - Scatter plots show the relationship between two continuous variables; if color-coded, we can also show the relationship with a third categorical variable.
    • sns.regplot - Including a regression line in the scatter plot makes it easier to see any linear relationship between two variables.
    • sns.lmplot - This command is useful for drawing multiple regression lines, if the scatter plot contains multiple, color-coded groups.
    • sns.swarmplot - Categorical scatter plots show the relationship between a continuous variable and a categorical variable.
  • Distribution

    - We visualize distributions to show the possible values that we can expect to see in a variable, along with how likely they are.

    • sns.distplot - Histograms show the distribution of a single numerical variable.
    • sns.kdeplot - KDE plots (or 2D KDE plots) show an estimated, smooth distribution of a single numerical variable (or two numerical variables).
    • sns.jointplot - This command is useful for simultaneously displaying a 2D KDE plot with the corresponding KDE plots for each individual variable.

1. Line Chart

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Path of the file to read
spotify_filepath = "../input/spotify.csv"

# Read the file into a variable spotify_data
spotify_data = pd.read_csv(spotify_filepath, index_col="Date", parse_dates=True)

spotify_data.tail()
            Shape of You  Despacito  Something Just Like This    HUMBLE.  Unforgettable
Date
2018-01-05       4492978  3450315.0                 2408365.0  2685857.0      2869783.0
2018-01-06       4416476  3394284.0                 2188035.0  2559044.0      2743748.0
2018-01-07       4009104  3020789.0                 1908129.0  2350985.0      2441045.0
2018-01-08       4135505  2755266.0                 2023251.0  2523265.0      2622693.0
2018-01-09       4168506  2791601.0                 2058016.0  2727678.0      2627334.0
# Line chart showing daily global streams of each song
sns.lineplot(data=spotify_data)
<matplotlib.axes._subplots.AxesSubplot at 0x7fc8b2bb6f98>


# Set the width and height of the figure
# sets the size of the figure to 14 inches (in width) by 6 inches (in height)
plt.figure(figsize=(14, 6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of each song
sns.lineplot(data=spotify_data)
<matplotlib.axes._subplots.AxesSubplot at 0x7fc8b2a74780>


Changing styles

# Seaborn has five different themes:(1)"darkgrid", (2)"whitegrid", (3)"dark", (4)"white", and (5)"ticks"
# Change the style of the figure to the "dark" theme
sns.set_style("dark")

# Line chart 
plt.figure(figsize=(12,6))
sns.lineplot(data=spotify_data)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5faa4bc828>


Plot a subset of the data

# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'], label="Shape of You")

# Line chart showing daily global streams of 'Despacito'
sns.lineplot(data=spotify_data['Despacito'], label="Despacito")

# Add label for horizontal axis
plt.xlabel("Date")


2. Bar Charts

# Print the data
flight_data
              AA         AS         B6         DL         EV         F9         HA         MQ         NK         OO         UA         US         VX         WN
Month
1       6.955843  -0.320888   7.347281  -2.043847   8.537497  18.357238   3.512640  18.164974  11.398054  10.889894   6.352729   3.107457   1.420702   3.389466
2       7.530204  -0.782923  18.657673   5.614745  10.417236  27.424179   6.029967  21.301627  16.474466   9.588895   7.260662   7.114455   7.784410   3.501363
3       6.693587  -0.544731  10.741317   2.077965   6.730101  20.074855   3.468383  11.018418  10.039118   3.181693   4.892212   3.330787   5.348207   3.263341
4       4.931778  -3.009003   2.780105   0.083343   4.821253  12.640440   0.011022   5.131228   8.766224   3.223796   4.376092   2.660290   0.995507   2.996399
5       5.173878  -1.716398  -0.709019   0.149333   7.724290  13.007554   0.826426   5.466790  22.397347   4.141162   6.827695   0.681605   7.102021   5.680777
6       8.191017  -0.220621   5.047155   4.419594  13.952793  19.712951   0.882786   9.639323  35.561501   8.338477  16.932663   5.766296   5.779415  10.743462
7       3.870440   0.377408   5.841454   1.204862   6.926421  14.464543   2.001586   3.980289  14.352382   6.790333  10.262551        NaN   7.135773  10.504942
8       3.193907   2.503899   9.280950   0.653114   5.154422   9.175737   7.448029   1.896565  20.519018   5.606689   5.014041        NaN   5.106221   5.532108
9      -1.432732  -1.813800   3.539154  -3.703377   0.851062   0.978460   3.696915  -2.167268   8.000101   1.530896  -1.794265        NaN   0.070998  -1.336260
10     -0.580930  -2.993617   3.676787  -5.011516   2.303760   0.082127   0.467074  -3.735054   6.810736   1.750897  -2.456542        NaN   2.254278  -0.688851
11      0.772630  -1.916516   1.418299  -3.175414   4.415930  11.164527  -2.719894   0.220061   7.543881   4.925548   0.281064        NaN   0.116370   0.995684
12      4.149684  -1.846681  13.839290   2.504595   6.685176   9.346221  -1.706475   0.662486  12.733123  10.947612   7.012079        NaN  13.498720   6.720893
# Set the width and height of the figure
plt.figure(figsize=(10,6))

# Add title
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")

# Bar chart showing average arrival delay for Spirit Airlines flights by month
sns.barplot(x=flight_data.index, y=flight_data['NK'])

# Add label for vertical axis
plt.ylabel("Arrival delay (in minutes)")
Text(0, 0.5, 'Arrival delay (in minutes)')
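The post does not show how flight_data is loaded, so the bar chart can be reproduced from the table alone. The sketch below copies the 'NK' column by hand into a Series (the name `nk_delay` is introduced here for illustration) and also checks numerically which month has the tallest bar.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# 'NK' (Spirit Airlines) column copied from the flight_data table above
nk_delay = pd.Series(
    [11.398054, 16.474466, 10.039118, 8.766224, 22.397347, 35.561501,
     14.352382, 20.519018, 8.000101, 6.810736, 7.543881, 12.733123],
    index=pd.Index(range(1, 13), name="Month"),
    name="NK",
)

plt.figure(figsize=(10, 6))
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")

# One bar per month; the bar height is the average delay
ax = sns.barplot(x=nk_delay.index, y=nk_delay)
ax.set_ylabel("Arrival delay (in minutes)")

# The month with the tallest bar can be confirmed without reading the chart
worst_month = nk_delay.idxmax()
print(worst_month, nk_delay.max())
```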


3. Heatmap

# Set the width and height of the figure
plt.figure(figsize=(14,7))

# Add title
plt.title("Average Arrival Delay for Each Airline, by Month")

# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flight_data, annot=True)

# Add label for horizontal axis
plt.xlabel("Airline")
Text(0.5, 42.0, 'Airline')
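A smaller version of the heatmap makes the effect of `annot=True` easier to see. This sketch uses a 3x3 corner of the table above (first three months, first three airlines); the `fmt=".1f"` argument is an addition for readability and is not part of the original call.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First three months and first three airlines, copied from the table above
subset = pd.DataFrame(
    {
        "AA": [6.955843, 7.530204, 6.693587],
        "AS": [-0.320888, -0.782923, -0.544731],
        "B6": [7.347281, 18.657673, 10.741317],
    },
    index=pd.Index([1, 2, 3], name="Month"),
)

plt.figure(figsize=(6, 4))

# annot=True writes each value into its cell; fmt controls the number format
ax = sns.heatmap(data=subset, annot=True, fmt=".1f")
ax.set_xlabel("Airline")

# One text annotation is drawn per cell
cell_labels = [t.get_text() for t in ax.texts]
print(cell_labels)
```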


4. Scatter Plots

insurance_data.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f44f2300048>


# Add a regression line
sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f44f222c588>
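The line drawn by regplot can also be summarized as numbers. As a rough sketch, assuming the default fit is an ordinary least-squares straight line, `np.polyfit` with `deg=1` gives a comparable slope and intercept. The data here is synthetic and made up for illustration, not the insurance data.

```python
import numpy as np

# Synthetic data for illustration: y depends linearly on x plus noise
rng = np.random.default_rng(42)
x = rng.uniform(15, 50, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 3.0, size=200)

# Ordinary least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# The Pearson correlation summarizes how tightly the points follow a line
r = np.corrcoef(x, y)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A slope close to zero or a weak correlation suggests that the line in the plot should not be over-interpreted.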


Color-coded scatter plots

# color-code the points by 'smoker', plot the other two columns('bmi', 'charges') on the axes 
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f44f19b49e8>


# add two regression lines, corresponding to smokers and nonsmokers
# Instead of setting x=insurance_data['bmi'] to select the 'bmi' column in insurance_data, we set x="bmi" to specify the name of the column only.
# Similarly, y="charges" and hue="smoker" also contain the names of columns.
# We specify the dataset with data=insurance_data.

sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)
<seaborn.axisgrid.FacetGrid at 0x7f44f192d668>
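The output above shows that lmplot returns a FacetGrid rather than an Axes. The following sketch illustrates what that means in practice, using a made-up stand-in for insurance_data (the `demo` frame and its numbers are assumptions for illustration only).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for insurance_data (values are made up for illustration)
rng = np.random.default_rng(1)
n = 100
demo = pd.DataFrame({
    "bmi": rng.uniform(18, 40, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
# Smokers get a steeper made-up slope so the two fitted lines differ
slope = np.where(demo["smoker"] == "yes", 1500.0, 300.0)
demo["charges"] = slope * demo["bmi"] + rng.normal(0, 2000, size=n)

# Column names are passed as strings together with data=...;
# ci=None skips the bootstrapped confidence band to keep the sketch fast
grid = sns.lmplot(x="bmi", y="charges", hue="smoker", data=demo, ci=None)

# lmplot is figure-level: titles and labels go through the grid or grid.ax
grid.ax.set_title("Charges vs. BMI (synthetic data)")
print(type(grid).__name__, grid.ax.get_xlabel())
```

Because lmplot creates its own figure, a preceding `plt.figure(figsize=...)` call does not size it; the `height` and `aspect` arguments of lmplot serve that purpose.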


# Categorical scatter plot: charges for smokers vs. nonsmokers
sns.swarmplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])


5. Histograms

    Sepal Length (cm)  Sepal Width (cm)  Petal Length (cm)  Petal Width (cm)      Species
Id
1                 5.1               3.5                1.4               0.2  Iris-setosa
2                 4.9               3.0                1.4               0.2  Iris-setosa
3                 4.7               3.2                1.3               0.2  Iris-setosa
4                 4.6               3.1                1.5               0.2  Iris-setosa
5                 5.0               3.6                1.4               0.2  Iris-setosa
# 'a' selects the column of data to plot
# kde=False is something we'll always provide when creating a histogram, as leaving it out also draws a KDE curve over the bars.
sns.distplot(a=iris_data['Petal Length (cm)'], kde=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f96c5b1da20>


Color-coded plots
# Histograms for each species
sns.distplot(a=iris_set_data['Petal Length (cm)'], label="Iris-setosa", kde=False)
sns.distplot(a=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor", kde=False)
sns.distplot(a=iris_vir_data['Petal Length (cm)'], label="Iris-virginica", kde=False)

# Add title
plt.title("Histogram of Petal Lengths, by Species")

# Force legend to appear
plt.legend()
<matplotlib.legend.Legend at 0x7f96c5849470>


6. Density plots

# A kernel density estimate (KDE) plot is like a smoothed histogram
# 'shade=True' colors the area below the curve (newer seaborn releases use 'fill=True' instead)
sns.kdeplot(data=iris_data['Petal Length (cm)'], shade=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f96c5a664e0>
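The "smoothed histogram" idea can be made concrete with a hand-rolled Gaussian KDE. This is a sketch on a synthetic sample; the bandwidth comes from Scott's rule of thumb, which is an assumption made here, and seaborn's own bandwidth selection may differ in detail.

```python
import numpy as np

# Synthetic sample for illustration
rng = np.random.default_rng(0)
sample = rng.normal(loc=4.0, scale=1.0, size=150)

# Bandwidth from Scott's rule of thumb (an assumption made for this sketch)
n = sample.size
bandwidth = sample.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid wide enough to cover both tails
grid = np.linspace(sample.min() - 5 * bandwidth,
                   sample.max() + 5 * bandwidth, 1000)

# Gaussian KDE: the average of one normal bump centered on each observation
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to about 1; check with a simple Riemann sum
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that can hide separate peaks; a smaller one follows individual observations more closely.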


# 2D KDE plot
sns.jointplot(x=iris_data['Petal Length (cm)'], y=iris_data['Sepal Width (cm)'], kind="kde")
<seaborn.axisgrid.JointGrid at 0x7f96c59cbef0>

The color-coding shows us how likely we are to see different combinations of sepal width and petal length, where darker parts of the figure are more likely.

In addition to the 2D KDE plot in the center of the figure,

  • the curve at the top of the figure is a KDE plot for the data on the x-axis (in this case, iris_data['Petal Length (cm)']), and
  • the curve on the right of the figure is a KDE plot for the data on the y-axis (in this case, iris_data['Sepal Width (cm)']).
Color-coded plots
# KDE plots for each species
sns.kdeplot(data=iris_set_data['Petal Length (cm)'], label="Iris-setosa", shade=True)
sns.kdeplot(data=iris_ver_data['Petal Length (cm)'], label="Iris-versicolor", shade=True)
sns.kdeplot(data=iris_vir_data['Petal Length (cm)'], label="Iris-virginica", shade=True)

# Add title
plt.title("Distribution of Petal Lengths, by Species")
Text(0.5, 1.0, 'Distribution of Petal Lengths, by Species')
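The variables iris_set_data, iris_ver_data, and iris_vir_data are used above without being defined in this post. One plausible way to obtain them, assuming they are per-species subsets of iris_data, is to filter on the Species column. The names ending in `_demo` and the tiny table below are hypothetical and exist only for this sketch.

```python
import pandas as pd

# Three setosa rows taken from the iris table shown earlier, plus two
# illustrative rows for the other species so that all three groups exist
iris_demo = pd.DataFrame(
    {
        "Petal Length (cm)": [1.4, 1.4, 1.3, 4.7, 6.0],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
                    "Iris-versicolor", "Iris-virginica"],  # last two: illustrative
    },
    index=pd.Index([1, 2, 3, 51, 101], name="Id"),
)

# Boolean masks select the rows that belong to one species
iris_set_demo = iris_demo[iris_demo["Species"] == "Iris-setosa"]
iris_ver_demo = iris_demo[iris_demo["Species"] == "Iris-versicolor"]
iris_vir_demo = iris_demo[iris_demo["Species"] == "Iris-virginica"]

# groupby produces the same split without writing one line per species
group_sizes = {species: len(rows) for species, rows in iris_demo.groupby("Species")}
print(group_sizes)
```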

