Analysing pharmaceutical sales data in Python

Introduction

The purpose of this project is to analyse pharmaceutical sales data. Analysing sales data and predicting future sales based on historical data is a very common data science task. This is a great way to start working with data science.

What will you learn?

In this project, you will learn how to load data sets from text files into Pandas, the most popular Python library for data manipulation and analysis, and how to find specific information in different sales data sets, such as when a specific drug was sold most often. We will also predict future sales from the existing data using Linear Regression, Polynomial Regression and Support Vector Regression. We will do some data preprocessing and standardisation. To get better results, we will also learn an important and useful data science technique: ensemble learning.
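
To make the ensemble idea concrete, here is a minimal sketch of how the three regression models could be combined with scikit-learn's VotingRegressor. It is not the model built later in this project: the data is synthetic, and the variable names, the polynomial degree and the SVR settings are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR

# Synthetic "monthly sales": an upward trend plus noise (made-up data).
rng = np.random.default_rng(0)
months = np.arange(24).reshape(-1, 1)
sales = 100 + 2.5 * months.ravel() + rng.normal(0, 5, size=24)

# Three base models: linear, polynomial (degree 2) and support vector regression.
linear = LinearRegression()
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))

# The ensemble averages the predictions of the three fitted models.
ensemble = VotingRegressor([("linear", linear), ("poly", poly), ("svr", svr)])
ensemble.fit(months, sales)

# Predict the month that follows the training range.
next_month = np.array([[24]])
prediction = ensemble.predict(next_month)
print(prediction)
```

With default settings VotingRegressor simply averages the individual predictions, so one model that extrapolates badly is damped by the other two.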

You will also learn how to test your model and plot results using Matplotlib.

Let’s get started.

Problem definition

Here are the specific questions we will be answering in this exercise:

  1. On which day of the week is the second drug (M01AE) most often sold?
  2. Which three drugs had the highest sales in January 2015, July 2016 and September 2017?
  3. Which drug was sold most often on Mondays in 2017?
  4. What might drug sales be in January 2020? (Our data set only contains sales from January 2014 to October 2019.)
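
As a rough preview of how questions like the first two can be expressed in Pandas, the sketch below uses tiny made-up tables rather than the real Kaggle files. The column name datum and the drug codes other than M01AE are assumptions about the data layout, and all the values are invented.

```python
import pandas as pd

# Made-up daily sales for one drug (a stand-in for the real daily data).
daily = pd.DataFrame({
    "datum": pd.to_datetime(
        ["2017-01-02", "2017-01-03", "2017-01-09", "2017-01-10", "2017-01-14"]
    ),
    "M01AE": [3.0, 5.5, 4.0, 6.0, 2.0],
})

# Question 1 style: total sales per weekday, then pick the largest.
daily["weekday"] = daily["datum"].dt.day_name()
totals_by_weekday = daily.groupby("weekday")["M01AE"].sum()
best_day = totals_by_weekday.idxmax()
print(best_day)

# Made-up monthly sales for several drugs (hypothetical column names).
monthly = pd.DataFrame({
    "datum": pd.to_datetime(["2015-01-31", "2015-02-28"]),
    "M01AB": [120.0, 110.0],
    "M01AE": [95.0, 100.0],
    "N02BA": [130.0, 90.0],
    "N02BE": [900.0, 850.0],
    "N05B": [300.0, 280.0],
})

# Question 2 style: keep one month, sum each drug column, take the top three.
is_jan_2015 = (monthly["datum"].dt.year == 2015) & (monthly["datum"].dt.month == 1)
top_three = monthly[is_jan_2015].drop(columns="datum").sum().nlargest(3)
print(top_three)
```

Here "most often sold" is read as the highest total quantity per weekday; counting the days with non-zero sales would be another reasonable reading.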

Step by step solution

Create a project folder

Create a folder for the project on your computer called “Analysing-pharmaceutical-sales-data”.

Download data sets from this Kaggle project:

https://www.kaggle.com/milanzdravkovic/pharma-sales-data

Place these data sets in a folder called “data” in your project folder.
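
Before starting the analysis, it can help to confirm that the files ended up where the notebook expects them. The helper below is a small sketch, assuming the downloaded data sets are CSV files and that the notebook runs from the project folder; list_data_files is a hypothetical name, not part of any library.

```python
from pathlib import Path


def list_data_files(project_dir):
    """Return the sorted names of the CSV files in <project_dir>/data."""
    data_dir = Path(project_dir) / "data"
    if not data_dir.is_dir():
        # The "data" folder has not been created yet.
        return []
    return sorted(path.name for path in data_dir.glob("*.csv"))


# "." assumes the current working directory is the project folder.
print(list_data_files("."))
```

An empty list means either the data folder is missing or the files were saved somewhere else.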

If you’ve never used Python or Jupyter Notebook on your computer, read my article “How to set up your computer for Data Science” to check that you have everything you need to run the analysis below.

Start a new notebook

Start Jupyter Notebook by typing a command in the Terminal/Command Prompt:

$ jupyter notebook

Click New in the top right corner and select Python 3.

This will open a new Jupyter Notebook in your browser. Rename the notebook from Untitled to your project name and you are ready to start.

If you have Anaconda installed on your computer, all the libraries needed for this project are already installed.

If you are using Google Colab, open a new notebook.

Loading libraries and setup

The first thing we usually do in a new notebook is import the libraries we will need for the project.

# Pandas - Data manipulation and analysis library
import pandas as pd
# NumPy - mathematical functions on multi-dimensional arrays and matrices
import numpy as np
# Matplotlib - plotting library to create graphs and charts
import matplotlib.pyplot as plt
# Re - regular expression module for Python
import re
# Calendar - Python functions related to the calendar
import calendar


# Manipulating dates and times for Python
from datetime import datetime


# Scikit-learn algorithms and functions
from sklearn import linear_model
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor


# Settings for Matplotlib graphs and charts (default figure size in inches)
plt.rcParams['figure.figsize'] = (12, 8)


# Display Matplotlib output inline
%matplotlib inline


# Additional configuration
np.set_printoptions(precision=2)

Now we are ready to answer our first question.

On which day of the week is the second drug (M01AE) most often sold?

Once all the libraries are loaded, the next step is usually to load our data set. Because we need to find out on which day of the week is the second drug most o
