Learning Summary:
PS: Some of the code comes from Kaggle's Learn Pandas course. I have only just started learning, so if I have made mistakes, please point them out.
1. What is a DataFrame object: A DataFrame object is a table that stores data in labeled rows and columns
2. What is a Series object: A Series object is a one-dimensional, labeled sequence of values, similar to a list
3. Relationship between DataFrame and Series: Each column of a DataFrame is a Series
4. Create a DataFrame:
(In all code blocks, pd is the usual alias for pandas, i.e. import pandas as pd.)
In [1]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Out [1]:
  | Yes | No |
0 | 50 | 131 |
1 | 21 | 2 |
In [2]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Out [2]:
  | Bob | Sue |
0 | I liked it. | Pretty good. |
1 | It was awful. | Bland. |
In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
Out [3]:
  | Bob | Sue |
Product A | I liked it. | Pretty good. |
Product B | It was awful. | Bland. |
The index parameter replaces the default row numbers (0, 1, ...) with the values you pass in, which are shown in the leftmost column
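To make the effect of index concrete, here is a minimal sketch that rebuilds the DataFrame from the example above and then looks a value up by its row label. The variable names (df, labels, bob_on_a) are only for illustration.

```python
import pandas as pd

# Same data as the example above, with custom row labels.
df = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

# The row labels are stored in df.index.
labels = list(df.index)

# .loc looks up a value by row label and column name.
bob_on_a = df.loc['Product A', 'Bob']

print(labels)
print(bob_on_a)
```

Once the rows have meaningful labels, .loc can address them by name instead of by number.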
5. Create a Series:
In [1]:
pd.Series([1, 2, 3, 4, 5])
Out [1]:
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
dtype: int64
In [2]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Out [2]:
2015 Sales | 30 |
2016 Sales | 35 |
2017 Sales | 40 |
Name: Product A, dtype: int64
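The relationship between the two objects (point 3) can be checked directly: selecting a single column from a DataFrame gives back a Series. This is a small sketch using the Yes/No data from section 4; the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# Selecting one column returns a Series named after that column.
yes_column = df['Yes']
is_series = isinstance(yes_column, pd.Series)
column_name = yes_column.name

# The Series keeps the same row index as the DataFrame it came from.
same_index = yes_column.index.equals(df.index)

print(type(yes_column).__name__, column_name, same_index)
```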
6. Read the CSV file:
You can use pandas's read_csv function to read data from a CSV file. It returns a DataFrame object
reviews = pd.read_csv('path')
read_csv can also use one of the columns in the data as the index
reviews = pd.read_csv('path', index_col=0)
With index_col=0, the data in the first column is used as the index
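The difference index_col makes is easiest to see side by side. The sketch below reads the same CSV text twice; the CSV content is made up for illustration (it is not the Kaggle file), and io.StringIO stands in for a real file path so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV text. The first header field is empty,
# so the first column has no name.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Without index_col, the first column is loaded as ordinary data.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.shape)
print(indexed.shape)
print(list(indexed.columns))
```

The second DataFrame has one column fewer, because the first column of the file is now used for the row labels rather than stored as data.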
7. shape:
A DataFrame object exposes its number of rows and columns through the shape attribute (it is not a method, so there are no parentheses)
reviews.shape
The code above returns a tuple whose first element is the number of rows and whose second element is the number of columns
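As a quick check of what shape returns, here is a sketch on a small hand-made DataFrame (the data is invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'Yes': [50, 21, 7], 'No': [131, 2, 40]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

print(n_rows, n_cols)
```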
8. head():
reviews.head()
The code above returns the first five rows of data from reviews
reviews.head(2)
The above code returns the first two rows of data from reviews
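The behaviour of head can be tried out without the Kaggle data. This sketch uses a made-up ten-row DataFrame with a single column n:

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})

# With no argument, head returns the first five rows.
first_five = df.head()

# With an argument, it returns that many rows from the top.
first_two = df.head(2)

print(len(first_five), len(first_two))
print(first_two['n'].tolist())
```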
Problems encountered in learning:
In one of the Kaggle exercises, I was asked to read a CSV file. When I read it, I did not pass index_col=0 to use the first column as the index, so my result did not match the expected one
Solution:
At the time I did not know what index_col does; I was only thinking about how to remove the extra column. Searching online, I found a post introducing the drop method of DataFrame, as shown below:
df.drop(['column name 1', 'column name 2', 'column name 3'], axis=1)
However, this did not fit my case, because the column I needed to delete had no name. Still, it introduced me to the drop method. I started looking for more information about drop, and finally found a way to delete columns by position:
df.drop(df.columns[[0]], axis=1)
The above code returns a new DataFrame without the first column (df itself is not modified), where axis=1 refers to columns and axis=0 refers to rows
The same approach can also delete more than one column at a time:
df.drop(df.columns[[0, 1, 2]], axis=1)
The above code removes the first, second, and third columns (positions 0, 1, and 2)
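The whole problem can be reproduced in a few lines. This is a sketch with made-up CSV text rather than the Kaggle file. It also shows that pandas does give the nameless first column a generated label, 'Unnamed: 0', so dropping by label would have worked too.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column has no header.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas labels the nameless first column 'Unnamed: 0'.
first_label = df.columns[0]

# Drop by position. drop returns a new DataFrame and leaves df unchanged.
by_position = df.drop(df.columns[[0]], axis=1)

# Equivalent: drop by the generated label, using the columns= keyword.
by_label = df.drop(columns=['Unnamed: 0'])

print(first_label)
print(list(by_position.columns))
```

Passing index_col=0 to read_csv avoids the extra column in the first place, which is what the exercise expected.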
Reference:
Creating, Reading and Writing | Kaggle
python - About how to drop first columns from DataFrame? - Stack Overflow