WeRateDogs: Analyzing Twitter Data

Data Gathering

Import the required libraries

In [60]:

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import requests
import json
import os

Load and assess twitter-archive-enhanced

In [61]:twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

In [62]:twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 non-null   object 
 14  floofer                     2356 non-null   object 
 15  pupper                      2356 non-null   object 
 16  puppo                       2356 non-null   object 
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

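The info() output above shows that tweet_id and timestamp have the wrong types, and that in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values. expanded_urls contains nulls; those rows are tweets without photos and, per the project requirements, they will be deleted later.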

In [63]:twitter_archive_enhanced.retweeted_status_id.notnull().value_counts()

Out[63]:

False    2175
True      181
Name: retweeted_status_id, dtype: int64

Rows where retweeted_status_id is not NaN are retweets. There are 181 of them and, per the project requirements, they will be deleted later.

In [64]:twitter_archive_enhanced.name.value_counts()

Out[64]:

None        745
a            55
Charlie      12
Oliver       11
Lucy         11
           ... 
Karll         1
Tiger         1
old           1
Meatball      1
Stormy        1
Name: name, Length: 957, dtype: int64

In [65]:twitter_archive_enhanced.text[twitter_archive_enhanced.name=='a'].iloc[1]

Out[65]:

'Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq'

*55 dogs are named "a"; inspecting one of these tweets shows that "a" is clearly not a dog's name, so this is a quality issue
*text contains links

In [66]:twitter_archive_enhanced.rating_denominator.value_counts()

Out[66]:

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

As shown, rating_denominator is not always 10, and one row even has a denominator of 0.

In [67]:twitter_archive_enhanced.source.iloc[0]

Out[67]:

'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

source contains HTML markup

In addition, this dataset has a tidiness issue: dog stage is a single variable, so doggo, floofer, pupper and puppo should be one column

Gather and assess image-predictions

In [68]:

folder_name = 'pred-image'

# create the download folder if it does not exist yet
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'

response = requests.get(url)

response

Out[68]:

<Response [200]>

In [69]:

with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [70]:os.listdir(folder_name)

Out[70]:

['image-predictions.tsv']

In [71]:image_predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')

In [72]:image_predictions.head()

Out[72]:

   tweet_id            jpg_url                                          img_num  p1                      p1_conf   p1_dog  p2                  p2_conf   p2_dog  p3                   p3_conf   p3_dog
0  666020888022790149  https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg  1        Welsh_springer_spaniel  0.465074  True    collie              0.156665  True    Shetland_sheepdog    0.061428  True
1  666029285002620928  https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg  1        redbone                 0.506826  True    miniature_pinscher  0.074192  True    Rhodesian_ridgeback  0.072010  True
2  666033412701032449  https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg  1        German_shepherd         0.596461  True    malinois            0.138584  True    bloodhound           0.116197  True
3  666044226329800704  https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg  1        Rhodesian_ridgeback     0.408143  True    redbone             0.360687  True    miniature_pinscher   0.222752  True
4  666049248165822465  https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg  1        miniature_pinscher      0.560311  True    Rottweiler          0.243682  True    Doberman             0.154629  True

In [73]:image_predictions.jpg_url.duplicated().value_counts()

Out[73]:

False    2009
True       66
Name: jpg_url, dtype: int64

There are 66 duplicated image links

tweet_id has the wrong type
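The duplicate check above can be reproduced on a toy frame. This is a minimal sketch, assuming made-up URLs and IDs; it also shows drop_duplicates, which is one equivalent way to remove the repeats during cleaning.

```python
import pandas as pd

# Toy stand-in for image_predictions; the URLs and IDs are made up
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'jpg_url': ['https://example.com/a.jpg',
                'https://example.com/b.jpg',
                'https://example.com/a.jpg'],
})

# duplicated() marks every repeat after the first occurrence
n_duplicates = int(toy['jpg_url'].duplicated().sum())

# drop_duplicates keeps the first row for each URL
deduplicated = toy.drop_duplicates(subset='jpg_url')
```

Keeping the first occurrence matches the behaviour of filtering with ~duplicated().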

Load and assess tweet_json

In [74]:tweet_json = pd.DataFrame()

In [75]:

# each line of tweet_json.txt holds one tweet as a JSON object
with open('tweet_json.txt', 'r') as file:
    for line in file:
        dic = json.loads(line)
        tweet_id = dic['id']
        retweet_count = dic['retweet_count']
        favorite_count = dic['favorite_count']
        tem_df = pd.DataFrame({'tweet_id': tweet_id,
                               'retweet_count': retweet_count,
                               'favorite_count': favorite_count}, index=[0])
        tweet_json = pd.concat([tweet_json, tem_df])

In [76]:

tweet_json

Out[76]:

    tweet_id            retweet_count  favorite_count
0   892420643555336193  8842           39492
0   892177421306343426  6480           33786
0   891815181378084864  4301           25445
0   891689557279858688  8925           42863
0   891327558926688256  9721           41016
..  ...                 ...            ...
0   666049248165822465  41             111
0   666044226329800704  147            309
0   666033412701032449  47             128
0   666029285002620928  48             132
0   666020888022790149  530            2528

2352 rows × 3 columns

tweet_id has the wrong type
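Every row in the output above carries index 0 because each one-row frame is concatenated with index=[0]. A possible alternative, sketched here on an in-memory sample rather than the real tweet_json.txt, collects plain dictionaries and builds the frame once. The sample lines contain only the three fields read above, with values taken from the first two rows of the output.

```python
import io
import json
import pandas as pd

# Two sample lines in a one-JSON-object-per-line layout (illustrative stand-in)
sample = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 8842, "favorite_count": 39492}\n'
    '{"id": 892177421306343426, "retweet_count": 6480, "favorite_count": 33786}\n'
)

rows = []
for line in sample:
    tweet = json.loads(line)
    rows.append({'tweet_id': tweet['id'],
                 'retweet_count': tweet['retweet_count'],
                 'favorite_count': tweet['favorite_count']})

# Build the frame once; the index runs 0..n-1 instead of repeating 0
tweet_json_sketch = pd.DataFrame(rows)
```

Building the frame once also avoids copying the growing frame on every iteration.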

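To summarize:

#*Quality issues:

  1. tweet_id and timestamp have the wrong types
  2. jpg_url contains 66 duplicated links
  3. source contains HTML markup
  4. rating_denominator is not always 10, and one row has a denominator of 0
  5. 55 dogs are named "a"; inspecting one of these tweets shows that "a" is clearly not a dog's name
  6. text contains links
  7. Rows with a non-null retweeted_status_id are retweets; there are 181 of them, and the project requires deleting them later
  8. in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values
  9. Some tweets have no photos; the project requires deleting them later

#*Tidiness issues:

  1. Dog stage is a single variable, so doggo, floofer, pupper and puppo should be one column
  2. The three datasets share one observational unit keyed by tweet_id and can be merged into one dataset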
Data Cleaning

In [77]:

twitter_archive_enhanced_clean = twitter_archive_enhanced.copy()

image_predictions_clean = image_predictions.copy()

tweet_json_clean = tweet_json.copy()

issue: tweet_id has the wrong type

define: convert tweet_id to str

code:

In [78]:twitter_archive_enhanced_clean['tweet_id'] = twitter_archive_enhanced_clean['tweet_id'].astype('str')

In [79]:image_predictions_clean['tweet_id'] = image_predictions_clean['tweet_id'].astype('str')

In [80]:tweet_json_clean['tweet_id'] = tweet_json_clean['tweet_id'].astype('str')

Test

In [81]:twitter_archive_enhanced_clean['tweet_id']

Out[81]:

0       892420643555336193
1       892177421306343426
2       891815181378084864
3       891689557279858688
4       891327558926688256
               ...        
2351    666049248165822465
2352    666044226329800704
2353    666033412701032449
2354    666029285002620928
2355    666020888022790149
Name: tweet_id, Length: 2356, dtype: object

In [82]:image_predictions_clean['tweet_id']

Out[82]:

0       666020888022790149
1       666029285002620928
2       666033412701032449
3       666044226329800704
4       666049248165822465
               ...        
2070    891327558926688256
2071    891689557279858688
2072    891815181378084864
2073    892177421306343426
2074    892420643555336193
Name: tweet_id, Length: 2075, dtype: object

In [83]:tweet_json_clean['tweet_id']

Out[83]:

0    892420643555336193
0    892177421306343426
0    891815181378084864
0    891689557279858688
0    891327558926688256
            ...        
0    666049248165822465
0    666044226329800704
0    666033412701032449
0    666029285002620928
0    666020888022790149
Name: tweet_id, Length: 2352, dtype: object
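One reason to store tweet IDs as text is that they are too large for a 64-bit float to hold exactly. The sketch below illustrates this with the first ID from the output above; the one-row CSV is a stand-in for the real file, and reading with dtype is an alternative to converting afterwards, not the approach used in this notebook.

```python
import io
import pandas as pd

tweet_id = 892420643555336193

# A 64-bit float cannot represent this ID exactly, so any step that turns
# the column into float (for example NaN values appearing after a merge)
# would silently change it
round_tripped = int(float(tweet_id))

# Reading the column as text from the start avoids the problem
csv_text = io.StringIO('tweet_id,rating_numerator\n892420643555336193,13\n')
ids = pd.read_csv(csv_text, dtype={'tweet_id': str})['tweet_id']
```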

issue: timestamp has the wrong type

define: convert timestamp to datetime

code:

In [84]:twitter_archive_enhanced_clean['timestamp'] = pd.to_datetime(twitter_archive_enhanced_clean['timestamp'])

Test

In [85]:twitter_archive_enhanced_clean['timestamp']

Out[85]:

0      2017-08-01 16:23:56+00:00
1      2017-08-01 00:17:27+00:00
2      2017-07-31 00:18:03+00:00
3      2017-07-30 15:58:51+00:00
4      2017-07-29 16:00:24+00:00
                  ...           
2351   2015-11-16 00:24:50+00:00
2352   2015-11-16 00:04:52+00:00
2353   2015-11-15 23:21:54+00:00
2354   2015-11-15 23:05:30+00:00
2355   2015-11-15 22:32:08+00:00
Name: timestamp, Length: 2356, dtype: datetime64[ns, UTC]
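The conversion can be tried on two values in isolation. This is a small sketch; the raw string layout with a trailing +0000 offset is an assumption about the CSV, and utc=True is added here to make the timezone handling explicit.

```python
import pandas as pd

# Assumed raw format of the timestamp column: 'YYYY-MM-DD HH:MM:SS +0000'
raw = pd.Series(['2017-08-01 16:23:56 +0000', '2015-11-15 22:32:08 +0000'])

# utc=True yields a timezone-aware UTC column
ts = pd.to_datetime(raw, utc=True)

# Once converted, the .dt accessor exposes calendar fields,
# and subtraction gives a Timedelta
years = ts.dt.year.tolist()
span_days = (ts.max() - ts.min()).days
```

With a datetime column, grouping by month or computing the time span of the archive becomes a one-liner.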

issue: 55 dogs are named "a"; inspecting one of these tweets shows that "a" is clearly not a dog's name

define: replace "a" with NaN

code:

In [86]:twitter_archive_enhanced_clean['name']= twitter_archive_enhanced_clean['name'].replace('a',np.nan)

Test

In [88]:twitter_archive_enhanced_clean['name'].value_counts()

Out[88]:

None        745
Charlie      12
Lucy         11
Oliver       11
Cooper       11
           ... 
Karll         1
Tiger         1
old           1
Meatball      1
Stormy        1
Name: name, Length: 956, dtype: int64
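The value counts above still list "old" as a name, which suggests that other stray words were captured too. One possible extension, not part of the original cleaning, is to treat every all-lowercase token as a non-name. The rule itself is an assumption and could discard a genuine lowercase name.

```python
import numpy as np
import pandas as pd

# Toy names; 'a' and 'old' both appear in the value counts above
names = pd.Series(['Charlie', 'a', 'old', 'None', 'Lucy'])

# Assumption: an all-lowercase token is a stray word rather than a name
is_lowercase_word = names.str.match(r'^[a-z]+$', na=False)

# Replace the flagged entries with NaN and keep everything else
cleaned = names.mask(is_lowercase_word, np.nan)
```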

Issue: rating_denominator is not always 10

define: drop the row whose denominator is 0, create a new column rating = rating_numerator / rating_denominator, then drop rating_numerator and rating_denominator.

Code:

In [90]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator!= 0]

In [91]:twitter_archive_enhanced_clean['rating']=twitter_archive_enhanced_clean.rating_numerator/twitter_archive_enhanced_clean.rating_denominator

In [92]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean.drop(['rating_numerator','rating_denominator'],axis=1)

Test:

In [93]:twitter_archive_enhanced_clean

Out[93]:

      tweet_id            in_reply_to_status_id  in_reply_to_user_id  timestamp                  source                                             text                                               retweeted_status_id  retweeted_status_user_id  retweeted_status_timestamp  expanded_urls                                      name      doggo  floofer  pupper  puppo  rating
0     892420643555336193  NaN                    NaN                  2017-08-01 16:23:56+00:00  <a href="http://twitter.com/download/iphone" r...  This is Phineas. He's a mystical boy. Only eve...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/892420643...  Phineas   None   None     None    None   1.3
1     892177421306343426  NaN                    NaN                  2017-08-01 00:17:27+00:00  <a href="http://twitter.com/download/iphone" r...  This is Tilly. She's just checking pup on you....  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/892177421...  Tilly     None   None     None    None   1.3
2     891815181378084864  NaN                    NaN                  2017-07-31 00:18:03+00:00  <a href="http://twitter.com/download/iphone" r...  This is Archie. He is a rare Norwegian Pouncin...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/891815181...  Archie    None   None     None    None   1.2
3     891689557279858688  NaN                    NaN                  2017-07-30 15:58:51+00:00  <a href="http://twitter.com/download/iphone" r...  This is Darla. She commenced a snooze mid meal...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/891689557...  Darla     None   None     None    None   1.3
4     891327558926688256  NaN                    NaN                  2017-07-29 16:00:24+00:00  <a href="http://twitter.com/download/iphone" r...  This is Franklin. He would like you to stop ca...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/891327558...  Franklin  None   None     None    None   1.2
...
2351  666049248165822465  NaN                    NaN                  2015-11-16 00:24:50+00:00  <a href="http://twitter.com/download/iphone" r...  Here we have a 1949 1st generation vulpix. Enj...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/666049248...  None      None   None     None    None   0.5
2352  666044226329800704  NaN                    NaN                  2015-11-16 00:04:52+00:00  <a href="http://twitter.com/download/iphone" r...  This is a purebred Piers Morgan. Loves to Netf...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/666044226...  NaN       None   None     None    None   0.6
2353  666033412701032449  NaN                    NaN                  2015-11-15 23:21:54+00:00  <a href="http://twitter.com/download/iphone" r...  Here is a very happy pup. Big fan of well-main...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/666033412...  NaN       None   None     None    None   0.9
2354  666029285002620928  NaN                    NaN                  2015-11-15 23:05:30+00:00  <a href="http://twitter.com/download/iphone" r...  This is a western brown Mitsubishi terrier. Up...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/666029285...  NaN       None   None     None    None   0.7
2355  666020888022790149  NaN                    NaN                  2015-11-15 22:32:08+00:00  <a href="http://twitter.com/download/iphone" r...  Here we have a Japanese Irish Setter. Lost eye...  NaN                  NaN                       NaN                         https://twitter.com/dog_rates/status/666020888...  None      None   None     None    None   0.8

2355 rows × 16 columns
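The three cleaning steps for the rating can be checked on a toy frame. This is a minimal sketch with invented numerators and denominators; the last row mimics the zero-denominator case.

```python
import pandas as pd

# Toy ratings; the values are invented for illustration
toy = pd.DataFrame({
    'rating_numerator': [13, 84, 960],
    'rating_denominator': [10, 70, 0],
})

# Remove rows that would divide by zero, then normalize to a single ratio
valid = toy[toy['rating_denominator'] != 0].copy()
valid['rating'] = valid['rating_numerator'] / valid['rating_denominator']
valid = valid.drop(columns=['rating_numerator', 'rating_denominator'])
```

Calling .copy() on the filtered frame avoids pandas warning about assigning to a slice when the new column is added.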

Issue: jpg_url contains duplicates

define: delete the duplicated rows

code:

In [94]:image_predictions_clean=image_predictions_clean[~image_predictions_clean.jpg_url.duplicated()]

Test:

In [95]:sum(image_predictions_clean.jpg_url.duplicated())

Out[95]:

0

Issue: in_reply_to_status_id and in_reply_to_user_id are almost entirely null (78 non-null values in the raw archive)

Define: drop both columns

Code:

In [96]:twitter_archive_enhanced_clean.drop(columns=['in_reply_to_status_id','in_reply_to_user_id'],inplace=True)

Test:

In [97]:twitter_archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2355 non-null   object             
 1   timestamp                   2355 non-null   datetime64[ns, UTC]
 2   source                      2355 non-null   object             
 3   text                        2355 non-null   object             
 4   retweeted_status_id         181 non-null    float64            
 5   retweeted_status_user_id    181 non-null    float64            
 6   retweeted_status_timestamp  181 non-null    object             
 7   expanded_urls               2297 non-null   object             
 8   name                        2300 non-null   object             
 9   doggo                       2355 non-null   object             
 10  floofer                     2355 non-null   object             
 11  pupper                      2355 non-null   object             
 12  puppo                       2355 non-null   object             
 13  rating                      2355 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(3), object(10)
memory usage: 276.0+ KB

Issue: source contains HTML markup

define: keep only the text inside the anchor tag

Code:

In [98]:twitter_archive_enhanced_clean['source'] = twitter_archive_enhanced_clean['source'].str.extract('>(.+)<', expand=False)

Test

In [99]:twitter_archive_enhanced_clean['source'].value_counts()

Out[99]:

Twitter for iPhone     2220
Vine - Make a Scene      91
Twitter Web Client       33
TweetDeck                11
Name: source, dtype: int64
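To see what the pattern captures, it can be applied to the raw source string shown during assessment. This is a small sketch on a one-element Series.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
])

# Capture whatever sits between the opening tag's '>' and the closing tag's '<';
# expand=False returns a Series when there is a single capture group
label = source.str.extract(r'>(.+)<', expand=False)
```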

Issue: the text column contains URLs

define: delete the URLs

code:

In [100]:twitter_archive_enhanced_clean['text'] = twitter_archive_enhanced_clean['text'].replace(r'https.*','',regex=True)

test

In [101]:twitter_archive_enhanced_clean.text

Out[101]:

0       This is Phineas. He's a mystical boy. Only eve...
1       This is Tilly. She's just checking pup on you....
2       This is Archie. He is a rare Norwegian Pouncin...
3       This is Darla. She commenced a snooze mid meal...
4       This is Franklin. He would like you to stop ca...
                              ...                        
2351    Here we have a 1949 1st generation vulpix. Enj...
2352    This is a purebred Piers Morgan. Loves to Netf...
2353    Here is a very happy pup. Big fan of well-main...
2354    This is a western brown Mitsubishi terrier. Up...
2355    Here we have a Japanese Irish Setter. Lost eye...
Name: text, Length: 2355, dtype: object
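The pattern 'https.*' removes everything from the first link to the end of the string. A narrower variant, shown here as an alternative and not as the notebook's method, removes only the link itself and trims the leftover whitespace. The sample text is the tweet inspected earlier.

```python
import pandas as pd

text = pd.Series([
    'Here is a perfect example of someone who has their priorities in order. '
    '13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq'
])

# \S+ stops at the next whitespace, so any words after a link would survive,
# unlike 'https.*', which removes everything up to the end of the string
without_links = text.str.replace(r'https\S+', '', regex=True).str.strip()
```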

issue: the data contains retweets

define: delete the retweets

code:

In [102]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.retweeted_status_id.isnull()]

twitter_archive_enhanced_clean=twitter_archive_enhanced_clean.drop(['retweeted_status_id'],axis=1)

Test

In [103]:twitter_archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 0 to 2355
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2174 non-null   object             
 1   timestamp                   2174 non-null   datetime64[ns, UTC]
 2   source                      2174 non-null   object             
 3   text                        2174 non-null   object             
 4   retweeted_status_user_id    0 non-null      float64            
 5   retweeted_status_timestamp  0 non-null      object             
 6   expanded_urls               2117 non-null   object             
 7   name                        2119 non-null   object             
 8   doggo                       2174 non-null   object             
 9   floofer                     2174 non-null   object             
 10  pupper                      2174 non-null   object             
 11  puppo                       2174 non-null   object             
 12  rating                      2174 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(2), object(10)
memory usage: 237.8+ KB
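The info() output above still lists retweeted_status_user_id and retweeted_status_timestamp, both with 0 non-null values. A possible follow-up, sketched on a toy frame with made-up IDs, drops all three retweet columns in one step.

```python
import numpy as np
import pandas as pd

# Toy archive: the second row is a retweet (all values are made up)
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'retweeted_status_id': [np.nan, 7.0, np.nan],
    'retweeted_status_user_id': [np.nan, 9.0, np.nan],
    'retweeted_status_timestamp': [np.nan, '2017-01-01 00:00:00 +0000', np.nan],
})

retweet_columns = ['retweeted_status_id',
                   'retweeted_status_user_id',
                   'retweeted_status_timestamp']

# Keep original tweets only, then drop all three now-empty columns
originals = toy[toy['retweeted_status_id'].isnull()].drop(columns=retweet_columns)
```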

issue: dog stage is a single variable and should be one column

define: extract the stage words from text into a single stage column, then drop the four original columns

code

In [104]:

twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean.text.str.findall('(doggo|pupper|puppo|floofer)')

twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean['stage'].apply(lambda x: ','.join(set(x)))

In [105]:

twitter_archive_enhanced_clean['stage']=twitter_archive_enhanced_clean['stage'].replace('',np.nan)

In [106]:

twitter_archive_enhanced_clean.drop(columns=['doggo','puppo','pupper','floofer'],inplace=True)

Test

In [107]:

twitter_archive_enhanced_clean.stage.value_counts()

Out[107]:

pupper          242
doggo            78
puppo            30
pupper,doggo      8
floofer           4
puppo,doggo       2
Name: stage, dtype: int64
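Joining a set gives no guaranteed order, so a label such as 'pupper,doggo' could also come out as 'doggo,pupper' in another run. The sketch below shows a different route that builds the stage from the four existing columns and sorts the labels. The toy rows and the helper combine_stages are illustrative, and this route can give different counts from the text-based extraction used above.

```python
import pandas as pd

# Toy rows using the same 'None' placeholder strings as the archive
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

def combine_stages(row):
    # Collect the stages that are set; sorting gives a stable label order
    found = sorted(value for value in row if value != 'None')
    return ','.join(found) if found else None

toy['stage'] = toy[stage_columns].apply(combine_stages, axis=1)
```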

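ISSUE: the three datasets describe the same observational unit and can be merged into one dataset. Tweets without photos can also be removed.

define: merge the three datasets and remove the tweets without photos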

code

In [108]:

df1_clean = twitter_archive_enhanced_clean.merge(image_predictions_clean,how='inner',on='tweet_id')

In [109]:

df_clean = df1_clean.merge(tweet_json_clean,how='left',on='tweet_id')

test

In [110]:

df_clean

Out[110]:

      tweet_id            timestamp                  source              text                                               retweeted_status_user_id  retweeted_status_timestamp  expanded_urls                                      name      rating  stage  ...  p1_conf   p1_dog  p2                  p2_conf   p2_dog  p3                           p3_conf   p3_dog  retweet_count  favorite_count
0     892420643555336193  2017-08-01 16:23:56+00:00  Twitter for iPhone  This is Phineas. He's a mystical boy. Only eve...  NaN                       NaN                         https://twitter.com/dog_rates/status/892420643...  Phineas   1.3     NaN    ...  0.097049  False   bagel               0.085851  False   banana                       0.076110  False   8842           39492
1     892177421306343426  2017-08-01 00:17:27+00:00  Twitter for iPhone  This is Tilly. She's just checking pup on you....  NaN                       NaN                         https://twitter.com/dog_rates/status/892177421...  Tilly     1.3     NaN    ...  0.323581  True    Pekinese            0.090647  True    papillon                     0.068957  True    6480           33786
2     891815181378084864  2017-07-31 00:18:03+00:00  Twitter for iPhone  This is Archie. He is a rare Norwegian Pouncin...  NaN                       NaN                         https://twitter.com/dog_rates/status/891815181...  Archie    1.2     NaN    ...  0.716012  True    malamute            0.078253  True    kelpie                       0.031379  True    4301           25445
3     891689557279858688  2017-07-30 15:58:51+00:00  Twitter for iPhone  This is Darla. She commenced a snooze mid meal...  NaN                       NaN                         https://twitter.com/dog_rates/status/891689557...  Darla     1.3     NaN    ...  0.170278  False   Labrador_retriever  0.168086  True    spatula                      0.040836  False   8925           42863
4     891327558926688256  2017-07-29 16:00:24+00:00  Twitter for iPhone  This is Franklin. He would like you to stop ca...  NaN                       NaN                         https://twitter.com/dog_rates/status/891327558...  Franklin  1.2     NaN    ...  0.555712  True    English_springer    0.225770  True    German_short-haired_pointer  0.175219  True    9721           41016
...
1989  666049248165822465  2015-11-16 00:24:50+00:00  Twitter for iPhone  Here we have a 1949 1st generation vulpix. Enj...  NaN                       NaN                         https://twitter.com/dog_rates/status/666049248...  None      0.5     NaN    ...  0.560311  True    Rottweiler          0.243682  True    Doberman                     0.154629  True    41             111
1990  666044226329800704  2015-11-16 00:04:52+00:00  Twitter for iPhone  This is a purebred Piers Morgan. Loves to Netf...  NaN                       NaN                         https://twitter.com/dog_rates/status/666044226...  NaN       0.6     NaN    ...  0.408143  True    redbone             0.360687  True    miniature_pinscher           0.222752  True    147            309
1991  666033412701032449  2015-11-15 23:21:54+00:00  Twitter for iPhone  Here is a very happy pup. Big fan of well-main...  NaN                       NaN                         https://twitter.com/dog_rates/status/666033412...  NaN       0.9     NaN    ...  0.596461  True    malinois            0.138584  True    bloodhound                   0.116197  True    47             128
1992  666029285002620928  2015-11-15 23:05:30+00:00  Twitter for iPhone  This is a western brown Mitsubishi terrier. Up...  NaN                       NaN                         https://twitter.com/dog_rates/status/666029285...  NaN       0.7     NaN    ...  0.506826  True    miniature_pinscher  0.074192  True    Rhodesian_ridgeback          0.072010  True    48             132
1993  666020888022790149  2015-11-15 22:32:08+00:00  Twitter for iPhone  Here we have a Japanese Irish Setter. Lost eye...  NaN                       NaN                         https://twitter.com/dog_rates/status/666020888...  None      0.8     NaN    ...  0.465074  True    collie              0.156665  True    Shetland_sheepdog            0.061428  True    530            2528

1994 rows × 23 columns
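The choice of join type is what removes tweets without photos. The following sketch uses three tiny made-up frames to show how the inner join drops a tweet with no image prediction, while the left join keeps a tweet that has no counts.

```python
import pandas as pd

# Made-up stand-ins for the three cleaned datasets
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating': [1.3, 1.2, 1.1]})
images = pd.DataFrame({'tweet_id': ['1', '3'], 'p1': ['pug', 'bagel']})
counts = pd.DataFrame({'tweet_id': ['1', '2'], 'retweet_count': [10, 20]})

# Inner join: tweet '2' has no image prediction, so it is dropped
with_images = archive.merge(images, how='inner', on='tweet_id')

# Left join: tweet '3' is kept even though it has no counts (NaN is filled in)
merged = with_images.merge(counts, how='left', on='tweet_id')
```

Because a left join can introduce NaN, integer count columns may become float after the merge.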

Save the dataset

In [112]:

#save the file

save_file_name = 'twitter_archive_master.csv'

df_clean.to_csv(save_file_name, encoding='utf-8',index=False)
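A CSV file does not store column types, so the string tweet_id is inferred as an integer again when the file is read back. The sketch below demonstrates this with an in-memory buffer in place of twitter_archive_master.csv; passing dtype on read is one way to keep the IDs as text.

```python
import io
import pandas as pd

saved = pd.DataFrame({'tweet_id': ['892420643555336193'], 'rating': [1.3]})

# Write to an in-memory buffer instead of a file for this illustration
buffer = io.StringIO()
saved.to_csv(buffer, index=False)

# Default read: the ID column is inferred as an integer
buffer.seek(0)
as_inferred = pd.read_csv(buffer)

# Explicit dtype: the ID column stays text
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'tweet_id': str})
```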

Analysis and Visualization

In [114]:

# data analysis

data = pd.read_csv('twitter_archive_master.csv', encoding='utf-8')

In [115]:

data.head(10)

Out[115]:

   tweet_id            timestamp                  source              text                                               retweeted_status_user_id  retweeted_status_timestamp  expanded_urls                                      name      rating  stage  ...  p1_conf   p1_dog  p2                        p2_conf   p2_dog  p3                           p3_conf   p3_dog  retweet_count  favorite_count
0  892420643555336193  2017-08-01 16:23:56+00:00  Twitter for iPhone  This is Phineas. He's a mystical boy. Only eve...  NaN                       NaN                         https://twitter.com/dog_rates/status/892420643...  Phineas   1.3     NaN    ...  0.097049  False   bagel                     0.085851  False   banana                       0.076110  False   8842           39492
1  892177421306343426  2017-08-01 00:17:27+00:00  Twitter for iPhone  This is Tilly. She's just checking pup on you....  NaN                       NaN                         https://twitter.com/dog_rates/status/892177421...  Tilly     1.3     NaN    ...  0.323581  True    Pekinese                  0.090647  True    papillon                     0.068957  True    6480           33786
2  891815181378084864  2017-07-31 00:18:03+00:00  Twitter for iPhone  This is Archie. He is a rare Norwegian Pouncin...  NaN                       NaN                         https://twitter.com/dog_rates/status/891815181...  Archie    1.2     NaN    ...  0.716012  True    malamute                  0.078253  True    kelpie                       0.031379  True    4301           25445
3  891689557279858688  2017-07-30 15:58:51+00:00  Twitter for iPhone  This is Darla. She commenced a snooze mid meal...  NaN                       NaN                         https://twitter.com/dog_rates/status/891689557...  Darla     1.3     NaN    ...  0.170278  False   Labrador_retriever        0.168086  True    spatula                      0.040836  False   8925           42863
4  891327558926688256  2017-07-29 16:00:24+00:00  Twitter for iPhone  This is Franklin. He would like you to stop ca...  NaN                       NaN                         https://twitter.com/dog_rates/status/891327558...  Franklin  1.2     NaN    ...  0.555712  True    English_springer          0.225770  True    German_short-haired_pointer  0.175219  True    9721           41016
5  891087950875897856  2017-07-29 00:08:17+00:00  Twitter for iPhone  Here we have a majestic great white breaching ...  NaN                       NaN                         https://twitter.com/dog_rates/status/891087950...  None      1.3     NaN    ...  0.425595  True    Irish_terrier             0.116317  True    Indian_elephant              0.076902  False   3240           20548
6  890971913173991426  2017-07-28 16:27:12+00:00  Twitter for iPhone  Meet Jax. He enjoys ice cream so much he gets ...  NaN                       NaN                         https://gofundme.com/ydvmve-surgery-for-jax,ht...  Jax       1.3     NaN    ...  0.341703  True    Border_collie             0.199287  True    ice_lolly                    0.193548  False   2142           12053
7  890729181411237888  2017-07-28 00:22:40+00:00  Twitter for iPhone  When you watch your owner call another dog a g...  NaN                       NaN                         https://twitter.com/dog_rates/status/890729181...  None      1.3     NaN    ...  0.566142  True    Eskimo_dog                0.178406  True    Pembroke                     0.076507  True    19548          66596
8  890609185150312448  2017-07-27 16:25:51+00:00  Twitter for iPhone  This is Zoey. She doesn't want to be one of th...  NaN                       NaN                         https://twitter.com/dog_rates/status/890609185...  Zoey      1.3     NaN    ...  0.487574  True    Irish_setter              0.193054  True    Chesapeake_Bay_retriever     0.118184  True    4403           28187
9  890240255349198849  2017-07-26 15:59:51+00:00  Twitter for iPhone  This is Cassie. She is a college pup. Studying...  NaN                       NaN                         https://twitter.com/dog_rates/status/890240255...  Cassie    1.4     doggo  ...  0.511319  True    Cardigan                  0.451038  True    Chihuahua                    0.029248  True    7684           32467

10 rows × 23 columns

In [116]:data.favorite_count.describe()

Out[116]:

count      1994.000000
mean       8923.133400
std       12400.238808
min          81.000000
25%        1972.250000
50%        4117.000000
75%       11275.500000
max      132318.000000
Name: favorite_count, dtype: float64

In [117]:data.retweet_count.describe()

Out[117]:

count     1994.000000
mean      2770.021063
std       4715.961325
min         15.000000
25%        622.250000
50%       1348.500000
75%       3202.750000
max      79116.000000
Name: retweet_count, dtype: float64

In [118]:

import matplotlib.pyplot as plt

%matplotlib inline

In [119]:

plt.bar(x=['favorite_count','retweet_count'], height = [data.favorite_count.sum(),data.retweet_count.sum()])

plt.title('Number of Favorite count VS Retweet Count')

Out[119]:

Text(0.5, 1.0, 'Number of Favorite count VS Retweet Count')

* So the first conclusion is: in total, tweets receive more favorites than retweets.
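The size of the gap can be estimated from the two describe() outputs. Both columns have the same count (1994), so the ratio of the means equals the ratio of the sums plotted in the bar chart. A quick arithmetic sketch:

```python
# Means copied from the two describe() outputs above
mean_favorites = 8923.133400
mean_retweets = 2770.021063

# Average number of favorites per retweet
favorites_per_retweet = mean_favorites / mean_retweets
print(round(favorites_per_retweet, 2))
```

In other words, a tweet in this dataset collects roughly three favorites for every retweet.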

In [120]:data[data.p1_conf > 0.5].p1.value_counts()

Out[120]:

golden_retriever       116
Pembroke                70
Labrador_retriever      65
Chihuahua               47
pug                     43
                      ... 
scorpion                 1
Appenzeller              1
flamingo                 1
axolotl                  1
Irish_water_spaniel      1
Name: p1, Length: 245, dtype: int64

The second conclusion: among predictions with confidence above 0.5, the most common breed is golden_retriever.
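The counts above include labels such as scorpion and flamingo, because p1_conf alone does not say whether the prediction is a dog. A possible refinement, sketched on toy rows with invented confidence values, also requires p1_dog to be True.

```python
import pandas as pd

# Toy predictions; the confidence values are invented
toy = pd.DataFrame({
    'p1': ['golden_retriever', 'golden_retriever', 'flamingo', 'pug', 'bagel'],
    'p1_conf': [0.91, 0.77, 0.83, 0.64, 0.30],
    'p1_dog': [True, True, False, True, False],
})

# Require both a confident prediction and a label that is a dog breed
confident_dogs = toy[(toy['p1_conf'] > 0.5) & toy['p1_dog']]
breed_counts = confident_dogs['p1'].value_counts()
```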

In [121]:data['rating'].value_counts()

Out[121]:

1.200000      454
1.000000      421
1.100000      402
1.300000      261
0.900000      151
0.800000       95
0.700000       51
1.400000       35
0.500000       34
0.600000       32
0.300000       19
0.400000       15
0.200000       10
0.100000        4
0.000000        2
177.600000      1
2.600000        1
3.428571        1
0.636364        1
0.818182        1
42.000000       1
7.500000        1
2.700000        1
Name: rating, dtype: int64

The third conclusion: most ratings are above 1.0, meaning the numerator is usually greater than the denominator.
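The share of ratings above 1.0 can be worked out directly from the value counts. The sketch below copies the counts from the output above into a dictionary and sums them.

```python
# Counts copied from the value_counts() output above, keyed by rating
rating_counts = {
    1.2: 454, 1.0: 421, 1.1: 402, 1.3: 261, 0.9: 151, 0.8: 95, 0.7: 51,
    1.4: 35, 0.5: 34, 0.6: 32, 0.3: 19, 0.4: 15, 0.2: 10, 0.1: 4, 0.0: 2,
    177.6: 1, 2.6: 1, 3.428571: 1, 0.636364: 1, 0.818182: 1, 42.0: 1,
    7.5: 1, 2.7: 1,
}

total = sum(rating_counts.values())
above_one = sum(n for rating, n in rating_counts.items() if rating > 1.0)
share_above_one = above_one / total
```

This gives 1158 of 1994 tweets, about 58%, with a rating strictly above 1.0; a further 421 sit exactly at 1.0.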
