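【Problem 1】 A case study in string discretization
Case:
Given this movie dataset, how should we process the data if we want to count movies by category (genre)?
For example, the number of comedies, adventure films, romance films, and so on.
Approach:
(1) Build a new array filled with 0, using the genres as column names.
(2) If a genre appears in a given row of the data, change that 0 to 1.
(3) Finally, count the movies in each genre (each column of the array), i.e. the number of 1s.
Notes:
(1) The new array has the same number of rows as the original data.
(2) The new array has one column per distinct genre.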
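Before the full script, the three steps above can be tried on a tiny frame. The sketch below is not the method used in the rest of this section: it swaps the manual zero-array loop for pandas' built-in `Series.str.get_dummies`, and the three-row `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical three-row sample; the real data comes from the CSV file.
sample = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["Action,Comedy", "Comedy", "Action,Drama,Comedy"],
})

# Steps (1) and (2): one 0/1 column per distinct genre, one row per movie.
indicators = sample["Genre"].str.get_dummies(sep=",")

# Step (3): summing each column counts the 1s, i.e. the movies in that genre.
genre_count = indicators.sum(axis=0)

print(indicators)
print(genre_count)
```

The indicator frame keeps the original row count and has one column per distinct genre, which matches the two notes above.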
import pandas as pd
df = pd.read_csv('./code2/datasets_IMDB-Movie-Data.csv')
print('\n【df.info()】')
print(df.info())
print('\n【df.head(1)】')
print(df.head(1))
print('\n【df["Genre"]】')
print(df['Genre'])
print('\n【df["Genre"].str】')
print(df['Genre'].str)
print('\n【df["Genre"].str.split(",")】')
print(df['Genre'].str.split(','))
temp_list = df['Genre'].str.split(',').tolist()
genre_list = [i for j in temp_list for i in j]
【df.info()】
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 1000 non-null int64
1 Title 1000 non-null object
2 Genre 1000 non-null object
3 Description 1000 non-null object
4 Director 1000 non-null object
5 Actors 1000 non-null object
6 Year 1000 non-null int64
7 Runtime (Minutes) 1000 non-null int64
8 Rating 1000 non-null float64
9 Votes 1000 non-null int64
10 Revenue (Millions) 872 non-null float64
11 Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None
【df.head(1)】
Rank Title Genre \
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi
Description Director \
0 A group of intergalactic criminals are forced ... James Gunn
Actors Year Runtime (Minutes) \
0 Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121
Rating Votes Revenue (Millions) Metascore
0 8.1 757074 333.13 76.0
【df["Genre"]】
0 Action,Adventure,Sci-Fi
1 Adventure,Mystery,Sci-Fi
2 Horror,Thriller
3 Animation,Comedy,Family
4 Action,Adventure,Fantasy
...
995 Crime,Drama,Mystery
996 Horror
997 Drama,Music,Romance
998 Adventure,Comedy
999 Comedy,Family,Fantasy
Name: Genre, Length: 1000, dtype: object
【df["Genre"].str】
<pandas.core.strings.StringMethods object at 0x0000020818FBC910>
【df["Genre"].str.split(",")】
0 [Action, Adventure, Sci-Fi]
1 [Adventure, Mystery, Sci-Fi]
2 [Horror, Thriller]
3 [Animation, Comedy, Family]
4 [Action, Adventure, Fantasy]
...
995 [Crime, Drama, Mystery]
996 [Horror]
997 [Drama, Music, Romance]
998 [Adventure, Comedy]
999 [Comedy, Family, Fantasy]
Name: Genre, Length: 1000, dtype: object
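The output above shows that `str.split(',')` turns each cell into a list. As a side note, here is a minimal sketch of another way to flatten those lists and count genres, assuming pandas 0.25 or later for `Series.explode`. The three sample strings are copied from the output above; the variable names are illustrative.

```python
import pandas as pd

# Three genre strings taken from the printed output above.
genres = pd.Series(
    ["Action,Adventure,Sci-Fi", "Horror,Thriller", "Horror"], name="Genre"
)

# Each cell becomes a list of genre names.
split = genres.str.split(",")

# explode() gives every list element its own row (the index value repeats).
flat = split.explode()

# Distinct genres, and how often each one occurs.
distinct = sorted(set(flat))
counts = flat.value_counts()

print(distinct)
print(counts)
```

This gives the same counts as the indicator-matrix approach but does not build the 0/1 table, so it is only a substitute when the table itself is not needed.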
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv('./code2/datasets_IMDB-Movie-Data.csv')
print("\n【df['Genre'].head(3)】")
print(df['Genre'].head(3))
temp_list = df['Genre'].str.split(',').tolist()
temp_genre_list = [i for j in temp_list for i in j]
genre_list = list(set(temp_genre_list))
'''
(1) np.zeros((rows, cols)) ------- build an array filled with 0
    np.ones((rows, cols))  ------- build an array filled with 1
(2) df.shape    ------ returns (rows, cols) as a tuple
    df.shape[0] ------ number of rows
    df.shape[1] ------ number of columns
(3) index:   row index
    columns: column index
'''
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)
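'''
Indexing a DataFrame
(1) df.loc selects data by label (an ordinary slice excludes its end point, but a loc slice includes it)
    t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('WXYZ'))
    t1 = t.loc['a','Z']    # element at row "a", column "Z" ----- one row, one column ------- NumPy scalar
    t2 = t.loc['a']        # row "a" ---- a whole row ------ Series
    t3 = t.loc['a',:]      # same as above
(2) df.iloc selects data by position
    t1 = t.iloc[1:]        # from position 1 to the end -------- rows
    t2 = t.iloc[:,2]       # column at position 2 -------- column
    t3 = t.iloc[:,[2,1]]   # columns at positions 2 and 1 ----- multiple (non-contiguous) columns
    t.iloc[1:,:2] = 30     # set the selected cells to 30, leave the rest unchanged ----- assignment
    print(t)
'''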
for i in range(df.shape[0]):
    # temp_list[i] holds the genres of row i; set those columns to 1
    zeros_df.loc[i, temp_list[i]] = 1
print('\n【zeros_df.head(3)】')
print(zeros_df.head(3))
'''
df.sum(axis=1)   # sum across each row
df.sum(axis=0)   # sum down each column
'''
genre_count = zeros_df.sum(axis=0)
print('\n【genre_count】')
print(genre_count)
genre_count = genre_count.sort_values()
print('\n【type(genre_count)】')
print(type(genre_count))
print('\n【genre_count: genre_count after sorting】')
print(genre_count)
plt.figure(figsize=(20, 8), dpi=80)
_x = genre_count.index
_y = genre_count.values
plt.bar(range(len(_x)), _y)
plt.xticks(range(len(_x)), _x)
plt.show()
print('\n【genre_count】')
print(genre_count)
print('\n【genre_count.index】')
print(genre_count.index)
print('\n【genre_count.values】')
print(genre_count.values)
【df['Genre'].head(3)】
0 Action,Adventure,Sci-Fi
1 Adventure,Mystery,Sci-Fi
2 Horror,Thriller
Name: Genre, dtype: object
【zeros_df.head(3)】
Biography Music Comedy Crime Fantasy Sport Family Horror Romance \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Action Mystery Sci-Fi Western Animation Adventure Drama War \
0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
1 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Musical Thriller History
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 1.0 0.0
【genre_count】
Biography 81.0
Music 16.0
Comedy 279.0
Crime 150.0
Fantasy 101.0
Sport 18.0
Family 51.0
Horror 119.0
Romance 141.0
Action 303.0
Mystery 106.0
Sci-Fi 120.0
Western 7.0
Animation 49.0
Adventure 259.0
Drama 513.0
War 13.0
Musical 5.0
Thriller 195.0
History 29.0
dtype: float64
【type(genre_count)】
<class 'pandas.core.series.Series'>
【genre_count: genre_count after sorting】
Musical 5.0
Western 7.0
War 13.0
Music 16.0
Sport 18.0
History 29.0
Animation 49.0
Family 51.0
Biography 81.0
Fantasy 101.0
Mystery 106.0
Horror 119.0
Sci-Fi 120.0
Romance 141.0
Crime 150.0
Thriller 195.0
Adventure 259.0
Comedy 279.0
Action 303.0
Drama 513.0
dtype: float64
【genre_count】
Musical 5.0
Western 7.0
War 13.0
Music 16.0
Sport 18.0
History 29.0
Animation 49.0
Family 51.0
Biography 81.0
Fantasy 101.0
Mystery 106.0
Horror 119.0
Sci-Fi 120.0
Romance 141.0
Crime 150.0
Thriller 195.0
Adventure 259.0
Comedy 279.0
Action 303.0
Drama 513.0
dtype: float64
【genre_count.index】
Index(['Musical', 'Western', 'War', 'Music', 'Sport', 'History', 'Animation',
'Family', 'Biography', 'Fantasy', 'Mystery', 'Horror', 'Sci-Fi',
'Romance', 'Crime', 'Thriller', 'Adventure', 'Comedy', 'Action',
'Drama'],
dtype='object')
【genre_count.values】
[ 5. 7. 13. 16. 18. 29. 49. 51. 81. 101. 106. 119. 120. 141.
150. 195. 259. 279. 303. 513.]
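The `loc` and `iloc` notes embedded in the script's comments can be checked on their own. The sketch below reuses the small frame `t` from those notes; the variable names `by_label`, `by_position`, `single` and `row` are illustrative.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ"))
#    W  X   Y   Z
# a  0  1   2   3
# b  4  5   6   7
# c  8  9  10  11

# Label slices include both end points: rows a..b, columns W..Y.
by_label = t.loc["a":"b", "W":"Y"]

# Positional slices exclude the stop: row 0 only, columns 0 and 1 only.
by_position = t.iloc[0:1, 0:2]

single = t.loc["a", "Z"]   # one element
row = t.loc["a"]           # a whole row, returned as a Series

# Assignment through iloc changes only the selected cells.
t.iloc[1:, :2] = 30

print(by_label.shape, by_position.shape)
print(t)
```

The inclusive end point of `loc` is why `zeros_df.loc[i, temp_list[i]] = 1` in the script is written with an explicit list of column labels rather than a slice: only the genres named for row `i` are touched.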
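【Problem 2】 Merging data
(1) Merging with join
join: by default it combines rows that share the same row index (the merge is driven by the row index).
(2) Merging with merge
merge: combines data on the specified column(s) in the chosen way (the merge is driven by column values).
(3) merge when the key column has the same name in both frames:
1. The default is how='inner' (intersection of the keys).
2. how='outer' gives the union of the keys; missing values are filled with NaN.
(4) merge when the key columns have different names:
1. Pass left_on='O' and right_on='X' to name the key column on each side.
2. how='left' keeps every row of the left frame and how='right' every row of the right frame; where one side has a key and the other does not, the missing values are filled with NaN.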
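The options in (3) and (4) can be compared side by side. The sketch below assumes two made-up frames whose key columns are called `O` and `X`, as in the notes above; the value columns `left_val` and `right_val` are hypothetical.

```python
import pandas as pd

# Made-up frames: the key column is named "O" on the left and "X" on the right.
left = pd.DataFrame({"O": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"X": ["a", "b", "d"], "right_val": [10, 20, 30]})

# left_on/right_on only say which columns to match; how= decides which rows survive.
inner = pd.merge(left, right, left_on="O", right_on="X")                 # keys a, b
left_join = pd.merge(left, right, left_on="O", right_on="X", how="left")   # keys a, b, c
right_join = pd.merge(left, right, left_on="O", right_on="X", how="right") # keys a, b, d
outer = pd.merge(left, right, left_on="O", right_on="X", how="outer")    # keys a, b, c, d

for name, frame in [("inner", inner), ("left", left_join),
                    ("right", right_join), ("outer", outer)]:
    print(f"\n{name}")
    print(frame)
```

In the left merge, key `c` has no partner, so its `X` and `right_val` cells are NaN; in the right merge the same happens to `O` and `left_val` for key `d`.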
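'''
Note: this example merges on the row index
(1) join: by default it combines rows that share the same row index (the merge is driven by the row index).
(2) Build an array filled with 0 ---- np.zeros((rows, cols))
(3) a.shape    --- returns (rows, cols) of a as a tuple
    a.shape[0] --- number of rows
    a.shape[1] --- number of columns
'''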
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((2, 4)), index=['A', 'B'], columns=list('abcd'))
df2 = pd.DataFrame(np.ones((3, 3)), index=['A', 'B', 'C'], columns=list('xyz'))
print('\n【df1】')
print(df1)
print('\n【df2】')
print(df2)
t1 = df1.join(df2)
t2 = df2.join(df1)
print('\n【df1.join(df2)】')
print(t1)
print('\n【df2.join(df1)】')
print(t2)
【df1】
a b c d
A 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
【df2】
x y z
A 1.0 1.0 1.0
B 1.0 1.0 1.0
C 1.0 1.0 1.0
【df1.join(df2)】
a b c d x y z
A 1.0 1.0 1.0 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0 1.0 1.0 1.0
【df2.join(df1)】
x y z a b c d
A 1.0 1.0 1.0 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0 1.0 1.0 1.0
C 1.0 1.0 1.0 NaN NaN NaN NaN
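The two results above differ because `join` keeps the rows of the calling frame by default (`how='left'`). A minimal sketch of the other `how` values, using the same `df1` and `df2`; the result names `left`, `outer` and `inner` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
df2 = pd.DataFrame(np.ones((3, 3)), index=["A", "B", "C"], columns=list("xyz"))

# Default how="left": only the caller's index labels (A, B) are kept.
left = df1.join(df2)

# how="outer": union of both indexes; row C gets NaN in columns a..d.
outer = df1.join(df2, how="outer")

# how="inner": intersection of both indexes, so row C is dropped.
inner = df2.join(df1, how="inner")

print(left)
print(outer)
print(inner)
```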
import pandas as pd
df1 = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({
    'key': ['a', 'b', 'd'], 'data2': range(3)})
print('\n【df1】')
print(df1)
print('\n【df2】')
print(df2)
t1 = pd.merge(df1, df2)
t2 = pd.merge(df1, df2, on='key')
t3 = pd.merge(df1, df2, on='key', how='outer')
print('\n【t1】')
print(t1)
print('\n【t2】')
print(t2)
print('\n【t3】')
print(t3)
t4 = pd.merge(df1, df2, on='key', how='left')
t5 = pd.merge(df1, df2, on='key', how='right')
print