Analysing pharmaceutical sales data in Python

Latest recommended article published on 2022-07-14 22:05:24

weixin_26714477

Tags: python, artificial intelligence, machine learning, java, big data

Original article: https://towardsdatascience.com/analysing-pharmaceutical-sales-data-in-python-6ce74da818ab

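This article walks through a pharmaceutical sales data analysis project in Python, covering data loading, problem definition, a step-by-step solution, preprocessing, and models and techniques. It uses the Pandas library to load and process the data, finds the three best-selling drugs in January 2015, July 2016 and September 2017, and identifies the drug sold most often on Mondays in 2017. It then uses Linear Regression, Polynomial Regression and Support Vector Regression to predict drug sales in January 2020. The project aims to build data science skills such as data loading, ensemble learning and model prediction.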

Introduction

The purpose of this project is to analyse pharmaceutical sales data. Analysing sales data and predicting future sales based on historical data is a very common data science task. This is a great way to start working with data science.

What will you learn?

In this project, you will learn how to load data sets from text files into Pandas, a widely used Python library for data manipulation and analysis, and how to find specific information in different sales data sets, such as when a specific drug was sold most often. We will also predict future sales from the existing data using Linear Regression, Polynomial Regression and Support Vector Regression. Along the way we will do some data preprocessing and standardisation. To get better results, we will also use an important and useful data science technique: ensemble learning.
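
As a rough preview of the ensemble idea, the sketch below combines the three kinds of regressors named above with scikit-learn's VotingRegressor, which averages their predictions. The sales figures are synthetic, and the hyperparameters (degree=2, C=100) are arbitrary assumptions chosen for illustration, not values from this project.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR

# Synthetic monthly sales with an upward trend (illustrative only).
rng = np.random.default_rng(0)
months = np.arange(24).reshape(-1, 1)  # feature: month index 0..23
sales = 100 + 3 * months.ravel() + rng.normal(0, 5, size=24)

# Three base models: linear, polynomial, and support vector regression.
linear = LinearRegression()
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100))

# The voting ensemble fits a copy of each model and averages their predictions.
ensemble = VotingRegressor([("linear", linear), ("poly", poly), ("svr", svr)])
ensemble.fit(months, sales)

# Predict the month that follows the training data.
next_month = np.array([[24]])
prediction = ensemble.predict(next_month)
print(prediction)
```

Averaging several models tends to smooth out the errors of any single one, which is the motivation for using an ensemble here.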

You will also learn how to test your model and plot results using Matplotlib.

Let’s get started.

Problem definition

Here are the specific questions we will be answering in this exercise:

On which day of the week is the second drug (M01AE) most often sold?
Which three drugs had the highest sales in January 2015, July 2016 and September 2017?
Which drug was sold most often on Mondays in 2017?
What might the drug sales be in January 2020? (Our data set only contains sales from January 2014 to October 2019.)
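
The first two questions mostly come down to grouping and filtering by date. The sketch below shows one possible approach on a few made-up rows. The column names (datum for the date, plus one column per drug code) are an assumption about the Kaggle files, the helper names best_weekday and top_drugs are hypothetical, and "most often sold" is interpreted here as the highest total sales.

```python
import pandas as pd

# Made-up daily sales rows (illustrative only; not taken from the Kaggle files).
daily = pd.DataFrame({
    "datum": pd.to_datetime(["2019-01-07", "2019-01-08", "2019-01-14", "2019-02-04"]),
    "M01AB": [2.0, 1.0, 4.0, 9.0],
    "M01AE": [3.0, 5.5, 4.0, 1.0],
    "N02BA": [1.0, 0.5, 2.0, 1.0],
    "N02BE": [10.0, 12.0, 8.0, 1.0],
})


def best_weekday(df, drug, date_col="datum"):
    """Return the weekday name with the highest total sales of one drug."""
    weekday = df[date_col].dt.day_name()
    totals = df.groupby(weekday)[drug].sum()
    return totals.idxmax()


def top_drugs(df, year, month, n=3, date_col="datum"):
    """Return the n drug columns with the highest total sales in a given month."""
    in_month = (df[date_col].dt.year == year) & (df[date_col].dt.month == month)
    totals = df.loc[in_month].drop(columns=[date_col]).sum()
    return totals.nlargest(n).index.tolist()


print(best_weekday(daily, "M01AE"))
print(top_drugs(daily, 2019, 1))
```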

Step by step solution

Create a project folder

Create a folder for the project on your computer called “Analysing-pharmaceutical-sales-data”.

Download the data sets from this Kaggle project:

https://www.kaggle.com/milanzdravkovic/pharma-sales-data

Place these data sets in a folder called “data” inside your project folder.
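
Once the files are in place, something like the following can load every CSV file in the data folder into a dictionary of DataFrames keyed by file name. This is a minimal sketch: it makes no assumption about the file names and simply produces an empty dictionary if the folder is missing.

```python
from pathlib import Path

import pandas as pd

# Folder that holds the downloaded CSV files, relative to the notebook.
data_dir = Path("data")

# Map each file name (without the extension) to its DataFrame.
datasets = {}
if data_dir.is_dir():
    for csv_path in sorted(data_dir.glob("*.csv")):
        datasets[csv_path.stem] = pd.read_csv(csv_path)

# Report what was loaded: name plus (rows, columns).
for name, df in datasets.items():
    print(name, df.shape)
```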

If you’ve never used Python or Jupyter Notebook on your computer, read my article How to set up your computer for Data Science to check that you have everything you need to run the analysis below.

Start a new notebook

Start Jupyter Notebook by typing a command in the Terminal/Command Prompt:

$ jupyter notebook

Click New in the top right corner and select Python 3.

This will open a new Jupyter Notebook in your browser. Rename the notebook from Untitled to your project name and you are ready to start.

If you have Anaconda installed, all the libraries needed for this project are already available on your computer.

If you are using Google Colab, open a new notebook there instead.
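
Whichever environment you use, a quick check like the one below can confirm that the libraries used in this project are available before you start. It is only a sketch: the list contains the import names used later in this article, and a library reported as missing would still need to be installed with your usual package manager.

```python
import importlib.util

# Import names of the libraries used in this project.
required = ["pandas", "numpy", "matplotlib", "sklearn"]

# find_spec returns None when a top-level module is not installed.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```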

Loading libraries and setup

The first thing we usually do in a new notebook is import the libraries we will need for the project.

# Pandas - Data manipulation and analysis library
import pandas as pd
# NumPy - mathematical functions on multi-dimensional arrays and matrices
import numpy as np
# Matplotlib - plotting library to create graphs and charts
import matplotlib.pyplot as plt
# Re - regular expression module for Python
import re
# Calendar - Python functions related to the calendar
import calendar


# Manipulating dates and times for Python
from datetime import datetime


# Scikit-learn algorithms and functions
from sklearn import linear_model
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor


# Settings for Matplotlib graphs and charts
plt.rcParams['figure.figsize'] = (12, 8)


# Display Matplotlib output inline
%matplotlib inline


# Additional configuration
np.set_printoptions(precision=2)

Now we are ready to solve our first question.

On which day of the week is the second drug (M01AE) most often sold?

Once all the libraries are loaded, the next step is usually to load the data set. Because we need to find out on which day of the week is the second drug most o
