Pandas: First Step Towards Data Science


张_伟_杰

Tags: python, artificial intelligence, machine learning, big data

Original link: https://medium.com/swlh/pandas-first-step-towards-data-science-91b39beb825c


Like everyone else, I started learning data science by creating my first model with some machine learning technique. My first line of code was:

import pandas as pd

Apart from noticing the cuddly bear name, I didn't pay much attention to this library, although I used it a lot while creating models. Soon I realized that I was underestimating the power of Pandas: it can do more than kung fu. That is what we will learn in this series of articles, in which I explore the Pandas library to build skills that help us analyze data in depth.

In this article, we will learn:

How to read data using Pandas
How data is stored
How to access data

What is Pandas?

Pandas is a Python library for data analysis and manipulation. In other words, pandas revolves around data. The data we read through pandas is most commonly in comma-separated values (CSV) format.

How to read data?

We use the read_csv() function to read a CSV file. It is often the first line of code we come across when we start using the Pandas library. Remember to import pandas before you start coding.

import pandas as pd
titanic_data = pd.read_csv("../Dataset/titanic.csv")

In this article we use the Titanic dataset, which is linked from the original article. After reading the data with pd.read_csv(), we store it in a variable, titanic_data, which is of type DataFrame.
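To try read_csv() without the Titanic file on disk, a minimal sketch follows. The three-row CSV text and the variable names csv_text and sample_data are made up for illustration and only mimic a few of the Titanic columns.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for titanic.csv.
csv_text = """PassengerId,Survived,Pclass,Name
1,0,3,"Braund, Mr. Owen Harris"
2,1,1,"Cumings, Mrs. John Bradley"
3,1,3,"Heikkinen, Miss. Laina"
"""

# read_csv accepts a file path or any file-like object.
sample_data = pd.read_csv(io.StringIO(csv_text))

print(type(sample_data))    # the result is a DataFrame
print(sample_data.shape)    # (number of rows, number of columns)
print(sample_data.head(2))  # first two rows
```

With a real file, the io.StringIO wrapper is replaced by the path to the CSV, as in the line above that reads titanic.csv.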

What is a DataFrame?

A DataFrame is a collection of data in rows and columns. Technically, DataFrames are made up of individual Series. A Series is essentially a one-dimensional list of values, each paired with an index label. Let's understand this with some example code.

# We use pd.Series() to create a series in Pandas
>> colors = pd.Series(['Blue','Green'])
>> print(colors)
output:
0     Blue
1    Green
dtype: object

>> names_list = ['Ram','Shyam']
>> names = pd.Series(names_list)
>> print(names)
output:
0      Ram
1    Shyam
dtype: object

We pass a list as a parameter to pd.Series(), which creates a Series with an index. By default, the index starts at 0. However, we can supply our own index labels, for example as another Series.

>> index = pd.Series(["One","Two"])
>> colors = pd.Series(['Blue','Green'],index = index)
>> print(colors)
output:
One     Blue
Two    Green
dtype: object
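As a small sketch of what the custom index gives us: the labels are stored in a pandas Index object, and values can then be looked up either by label or by position. The example passes the labels as a plain list, which is an equivalent shortcut and not something the original code does.

```python
import pandas as pd

# Same data as above, with the labels passed as a plain list.
colors = pd.Series(["Blue", "Green"], index=["One", "Two"])

print(type(colors.index))  # the labels are held in a pandas Index object
print(colors["Two"])       # look up a value by its label
print(colors.iloc[0])      # look up a value by its position
```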

Now, coming back to our definition, a DataFrame is a collection of individual Series. Let us use the colors and names Series that we initialized above to create a DataFrame.

>> df = pd.DataFrame({"Colors":colors,"Names":names})
>> print(df)
output:
  Colors  Names
0   Blue    Ram
1  Green  Shyam

We used pd.DataFrame() to create a DataFrame and passed a dictionary to it. The keys of this dictionary are the column names, and the values are the corresponding column data, each of which is a Series. So from the above example you can see that a DataFrame is a collection of Series. We can also change the index of a DataFrame in the same manner as we did with a Series.

>> index = pd.Series(["One","Two"])
>> colors = pd.Series(['Blue','Green'],index = index)
>> names = pd.Series(['Ram','Shyam'],index = index)
# Creating a Dataframe
>> data = pd.DataFrame({"Colors":colors,"Names":names},index=index)
>> print(data)
output:
    Colors  Names
One   Blue    Ram
Two  Green  Shyam
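One way to check the claim that a DataFrame is built from Series is to pull a column back out and inspect it. The sketch below rebuilds the same small frame (with the labels passed as a plain list for brevity) and assumes nothing beyond the data shown above.

```python
import pandas as pd

colors = pd.Series(["Blue", "Green"], index=["One", "Two"])
names = pd.Series(["Ram", "Shyam"], index=["One", "Two"])
data = pd.DataFrame({"Colors": colors, "Names": names})

column = data["Colors"]
print(type(column))                     # each column is a Series
print(column.index.equals(data.index))  # the column shares the frame's index
```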

So far we have seen how to read CSV data and how this data is represented. Let's move on to how we can access this data.

How to access data from DataFrames?

There are two ways to access data from DataFrames:

Column-wise
Row-wise

Column-wise

First of all, let us check the columns in our Titanic data.

>> print(titanic_data.columns)
output:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

We can now access data by column name in two ways: by using the column name as an attribute of the DataFrame object, or by using the column name as a key in square brackets. The advantage of the bracket form is that it also works for column names such as "First Name" or "Last Name", which contain a space and therefore cannot be used as attributes.

# Using column name as property
>> print(titanic_data.Name)
output:
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
....
Name: Name, Length: 891, dtype: object

# Using column name as index
>> print(titanic_data['Name'])
output:
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
....
Name: Name, Length: 891, dtype: object

>> print(titanic_data['Name'][0])
output:
Braund, Mr. Owen Harris
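The difference between the two access styles shows up with a column name that contains a space. The frame below is hypothetical (the Titanic data has no such column) and is only meant as a sketch.

```python
import pandas as pd

# Hypothetical data; "First Name" contains a space on purpose.
people = pd.DataFrame({"First Name": ["Ram", "Shyam"], "Age": [30, 25]})

print(people.Age)            # attribute access works for simple names
print(people["First Name"])  # bracket access works for any column name

# Bracket access with a list of names returns a DataFrame with those columns.
subset = people[["First Name", "Age"]]
print(subset.shape)
```

Writing people.First Name is a syntax error in Python, so bracket access is the only option for that column.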

Row-wise

To access data row-wise we use the loc and iloc indexers, which are used with square brackets rather than called like methods. Let's look at some examples to understand them.

# Using loc to display a row
>> print(titanic_data.loc[0])
output:
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

# Using iloc to display a row
>> print(titanic_data.iloc[0])
output: same as above

>> print(titanic_data.loc[0,'Name'])
output:
Braund, Mr. Owen Harris

>> print(titanic_data.iloc[0,3])
output: same as above

As we saw in the code above, we access rows using their index values, and to drill down to a specific value in a row we use either the column name or the column position. Column positions also start from 0, so the first column, "PassengerId", is at position 0 and "Name" is at position 3. The example also shows the difference between loc and iloc: loc selects by label (index label and column name), while iloc selects by integer position. With the default index 0, 1, 2, ... the row labels and positions coincide, which is why both return the same row here.
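The label versus position distinction is easier to see when the index is not 0, 1, 2. The following is a minimal sketch on a made-up frame; the names df, by_label and by_position are illustrative only.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2.
df = pd.DataFrame(
    {"Name": ["Ram", "Shyam", "Sita"], "Age": [30, 25, 28]},
    index=[10, 20, 30],
)

by_label = df.loc[10, "Name"]  # row labeled 10, column labeled "Name"
by_position = df.iloc[0, 0]    # first row, first column
print(by_label, by_position)   # both refer to the same cell

# No row is labeled 0 here, so loc raises KeyError while iloc[0] still works.
try:
    df.loc[0]
except KeyError:
    print("no row labeled 0")
```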

We can also access more than one row at a time, with all or some of the columns. Let's see how.

# To display whole dataset
>> print(titanic_data.loc[:]) # or titanic_data.iloc[:]
output:
     PassengerId  Survived  Pclass  .....
0              1         0       3
1              2         1       1
2              3         1       3
3              4         1       1
...
[891 rows x 12 columns]

# To display first four rows with Name and Ticket
>> print(titanic_data.loc[:3,["Name","Ticket"]]) # or titanic_data.iloc[:4,[3,8]]
output:
                                  Name            Ticket
0              Braund, Mr. Owen Harris         A/5 21171
1  Cumings, Mrs. John Bradley (Flor...          PC 17599
2               Heikkinen, Miss. Laina  STON/O2. 3101282
3     Futrelle, Mrs. Jacques Heath....            113803
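One detail worth checking when slicing: a loc slice includes its end label, while an iloc slice excludes its end position, so loc[:3] returns four rows on a default index and iloc[:3] returns three. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical five-row frame with the default index 0..4.
df = pd.DataFrame({"Name": ["a", "b", "c", "d", "e"], "Ticket": [1, 2, 3, 4, 5]})

label_slice = df.loc[:3, ["Name", "Ticket"]]  # labels 0..3, end included
position_slice = df.iloc[:3, [0, 1]]          # positions 0..2, end excluded
print(len(label_slice), len(position_slice))

# With a default index, iloc[:4] selects the same rows as loc[:3].
print(df.iloc[:4].equals(label_slice))
```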

I hope you now have an idea of how to use loc and iloc and understand the difference between the two. With this we come to the end of this article. We will continue exploring the Pandas library in the second part, but until then keep practicing. Happy coding!
