Udacity Data Analysis (Advanced): Wrangling and Analyzing Data (Twitter Dataset)

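This post walks through wrangling and analyzing a Twitter dataset in four stages: gathering, assessing, cleaning, and analyzing. The data comes from @dog_rates and includes tweet ratings, dog breed predictions, and more. The project focuses on fixing 8 quality issues and 2 tidiness issues, such as removing null values and duplicates and converting data types. The final analysis looks at questions such as the most popular dog names and breeds.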

Project Overview

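Real-world data is rarely clean. With Python and its libraries, you can gather data from a variety of sources and formats, assess its quality and tidiness, and then clean it. This process is called data wrangling. You can document and showcase the wrangling in a Jupyter Notebook, then analyze and visualize the result using Python (and its libraries) and/or SQL.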

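The dataset to be wrangled (and analyzed and visualized) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with humorous comments. The ratings almost always have a denominator of 10, while the numerators are usually greater than 10: 11/10, 12/10, 13/10, and so on. Why such ratings? Because "They're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.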

Datasets

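  1. The WeRateDogs Twitter archive. This file is provided directly, so it can be treated as a file on hand. Download it from this link: twitter_archive_enhanced.csv
  2. The tweet image predictions, that is, a neural network's prediction of the dog breed (or other object, animal, etc.) shown in each tweet. This file must be downloaded programmatically with Python's Requests library from the following URL: https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv
  3. Additional data for each tweet, including at least the retweet count (retweet_count) and the favorite count (favorite_count), plus any other columns you find interesting (note: if your analysis does not involve other columns, there is no need to collect them). Using the tweet IDs in the WeRateDogs archive, query the API for each tweet's JSON data with Python's Tweepy library and store all of it in a file named tweet_json.txt, with each tweet's JSON on its own line. Then read this .txt file line by line into a pandas DataFrame containing at least the tweet ID, retweet_count, and favorite_count fields.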
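
The one-JSON-object-per-line storage described in item 3 can be sketched as follows. This is a minimal sketch rather than the code used in this post: `save_tweets_as_json_lines` and `fetch_tweet_json` are hypothetical names, and `fetch_tweet_json` stands in for whatever wrapper you write around the Tweepy call that returns a tweet's JSON as a dict.

```python
import json


def save_tweets_as_json_lines(tweet_ids, fetch_tweet_json, path='tweet_json.txt'):
    """Write one JSON object per line; return the IDs that could not be fetched.

    fetch_tweet_json is a hypothetical callable (tweet ID -> dict), for example
    a thin wrapper around a Tweepy API call. It is an assumption of this sketch.
    """
    failed_ids = []
    with open(path, 'w', encoding='utf-8') as f:
        for tweet_id in tweet_ids:
            try:
                tweet = fetch_tweet_json(tweet_id)
            except Exception:
                # A tweet that can no longer be fetched is recorded and skipped.
                failed_ids.append(tweet_id)
                continue
            # json.dumps never emits raw newlines, so each tweet stays on one line.
            f.write(json.dumps(tweet) + '\n')
    return failed_ids
```

Passing the fetcher in as a parameter keeps the file-writing logic testable without network access, and the returned list shows which tweet IDs are missing from the file.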

Key Points

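  • We only want original ratings (no retweets) that have images. Although the dataset contains 5000+ tweets, not all of them are dog ratings, and some are retweets. (Note: drop the retweets.)
  • Fully assessing and cleaning the entire dataset would take a lot of time and is not necessary for practicing and demonstrating data wrangling skills. The project therefore only requires assessing and cleaning at least 8 quality issues and at least 2 tidiness issues.
  • Following the rules of tidy data, the cleaning should include merging the three pieces of data. (Note: read up on tidy data.)
  • If a rating's numerator exceeds its denominator, it does not need cleaning. This unique rating system is a big part of WeRateDogs' popularity. Likewise, ratings whose numerator is smaller than the denominator need not be removed. (Note: this is not a quality issue.)
  • You do not need to gather tweets from after August 1, 2017. You could collect basic information for those tweets, but not their image predictions, because you do not have access to the image prediction algorithm. (Note: additional data is limited to tweets up to August 1.)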
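
The retweet and date rules above can be sketched on a tiny made-up sample. This is an illustration under assumptions, not the cleaning code of this post: the three rows are invented, and the sketch relies on retweets having a non-null retweeted_status_id, which matches the column names in the archive.

```python
import pandas as pd

# Tiny made-up sample that mimics the relevant archive columns.
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'timestamp': ['2017-07-30 15:58:51 +0000',
                  '2017-07-29 16:00:24 +0000',
                  '2017-08-02 09:00:00 +0000'],
    'retweeted_status_id': [None, 8.9e17, None],
})

# Retweets carry a non-null retweeted_status_id, so keep only the null rows.
originals = archive[archive['retweeted_status_id'].isnull()].copy()

# Parse the timestamps as UTC and drop anything after August 1, 2017.
originals['timestamp'] = pd.to_datetime(originals['timestamp'], utc=True)
cutoff = pd.Timestamp('2017-08-02', tz='UTC')
originals = originals[originals['timestamp'] < cutoff]

print(originals['tweet_id'].tolist())
```

Only the first row survives: the second is a retweet and the third falls after the cutoff. The "has an image" condition is not covered here, since it depends on the image prediction table.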

Project Workflow

1. Gathering

# Import the required libraries
import numpy as np
import pandas as pd
import requests
import os
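  1. Gather the file on hand, twitter_archive_enhanced.csv. It contains the main tweet information and is the primary data for this cleaning exercise. The ratings, dog stages, and names in it were extracted from the original text, but the extraction is imperfect: not all ratings are correct, and some names and stages are wrong as well. If you want to use the ratings, stages, and names for analysis and visualization, you need to assess and clean these columns, and doing so teaches more practical skills.
  2. Programmatically download the file from the internet: image-predictions.tsv. It contains the image predictions, that is, the dog breed predicted from the picture in each tweet.
  3. Query the API to gather the additional tweet information, tweet_json.txt. If you cannot access Twitter, you can read the tweet_json.txt file provided for download with the project and extract the required data from it. At a minimum, extract the retweet count (retweet_count) and the favorite count (favorite_count). If your analysis does not need other columns, do not collect them; extracting columns only to clean them serves no purpose.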
# Load the raw archive data from 'twitter-archive-enhanced.txt'
df_one = pd.read_csv('twitter-archive-enhanced.txt', sep=',')
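# Download the neural-network image predictions from the web, then load them
url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)  # write the raw bytes to avoid encoding issues
df_two = pd.read_csv('image-predictions.tsv', sep='\t')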
# Gather file 3: read tweet_json.txt line by line
import json
df_three = []
with open('tweet_json.txt', 'r') as f:
    for row in f:
        text = json.loads(row)  # each line holds one tweet's JSON
        message = {'tweet_id': text['id_str'],
                   'retweet_count': text['retweet_count'],
                   'favorite_count': text['favorite_count']}
        df_three.append(message)
df_three = pd.DataFrame(df_three, columns=['tweet_id', 'retweet_count', 'favorite_count'])
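
One detail worth checking before the three tables are merged: the tweet_id read from id_str is a string, while pd.read_csv will normally parse the tweet_id column of the other two files as integers. The sketch below shows one way to align the types and merge. The miniature tables and their values are made up for illustration, and the choice of an inner join on the predictions (to keep only tweets with images) is an assumption, not necessarily the approach taken later in this post.

```python
import pandas as pd

# Made-up miniature versions of the three tables.
archive = pd.DataFrame({'tweet_id': [892420643555336193, 892177421306343426],
                        'rating_numerator': [13, 13]})
predictions = pd.DataFrame({'tweet_id': [892420643555336193],
                            'p1': ['breed_a']})
counts = pd.DataFrame({'tweet_id': ['892420643555336193', '892177421306343426'],
                       'retweet_count': [100, 200],
                       'favorite_count': [1000, 2000]})

# Align the key type: IDs are labels, not quantities, so strings are a safe choice.
for frame in (archive, predictions, counts):
    frame['tweet_id'] = frame['tweet_id'].astype(str)

# Inner join keeps only tweets that have an image prediction;
# left join then attaches the counts where they exist.
merged = (archive
          .merge(predictions, on='tweet_id', how='inner')
          .merge(counts, on='tweet_id', how='left'))

print(merged)
```

Merging on mismatched key types either raises an error or silently matches nothing, depending on the pandas version, so converting the keys first avoids both outcomes.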

2. Assessing

Visual assessment of df_one

# Visual assessment: widen the columns so the tweet text is not cut off
pd.set_option('display.max_colwidth', 200)
df_one
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU NaN NaN NaN https://twitter.com/dog_rates/status/892420643555336193/photo/1 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV NaN NaN NaN https://twitter.com/dog_rates/status/892177421306343426/photo/1 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB NaN NaN NaN https://twitter.com/dog_rates/status/891815181378084864/photo/1 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ NaN NaN NaN https://twitter.com/dog_rates/status/891689557279858688/photo/1 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f NaN NaN NaN https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 12 10 Franklin None None None None
5 891087950875897856 NaN NaN 2017-07-29 00:08:17 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh NaN NaN NaN https://twitter.com/dog_rates/status/891087950875897856/photo/1 13 10 None None None None None
6 890971913173991426 NaN NaN 2017-07-28 16:27:12 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl NaN NaN NaN https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1 13 10 Jax None None None None
7 890729181411237888 NaN NaN 2017-07-28 00:22:40 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq NaN NaN NaN https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1 13 10 None None None None None
8 890609185150312448 NaN NaN 2017-07-27 16:25:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b NaN NaN NaN https://twitter.com/dog_rates/status/890609185150312448/photo/1 13 10 Zoey None None None None
9 890240255349198849 NaN NaN 2017-07-26 15:59:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A NaN NaN NaN https://twitter.com/dog_rates/status/890240255349198849/photo/1 14 10 Cassie doggo None None None
10 890006608113172480 NaN NaN 2017-07-26 00:31:25 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Koda. He is a South Australian deckshark. Deceptively deadly. Frighteningly majestic. 13/10 would risk a petting #BarkWeek https://t.co/dVPW0B0Mme NaN NaN NaN https://twitter.com/dog_rates/status/890006608113172480/photo/1,https://twitter.com/dog_rates/status/890006608113172480/photo/1 13 10 Koda None None None None
11 889880896479866881 NaN NaN 2017-07-25 16:11:53 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Bruno. He is a service shark. Only gets out of the water to assist you. 13/10 terrifyingly good boy https://t.co/u1XPQMl29g NaN NaN NaN https://twitter.com/dog_rates/status/889880896479866881/photo/1 13 10 Bruno None None None None
12 889665388333682689 NaN NaN 2017-07-25 01:55:32 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm NaN NaN NaN https://twitter.com/dog_rates/status/889665388333682689/photo/1 13 10 None None None None puppo
13 889638837579907072 NaN NaN 2017-07-25 00:10:02 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Ted. He does his best. Sometimes that's not enough. But it's ok. 12/10 would assist https://t.co/f8dEDcrKSR NaN NaN NaN https://twitter.com/dog_rates/status/889638837579907072/photo/1,https://twitter.com/dog_rates/status/889638837579907072/photo/1 12 10 Ted None None None None
14 889531135344209921 NaN NaN 2017-07-24 17:02:04 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Stuart. He's sporting his favorite fanny pack. Secretly filled with bones only. 13/10 puppared puppo #BarkWeek https://t.co/y70o6h3isq NaN NaN NaN https://twitter.com/dog_rates/status/889531135344209921/photo/1 13 10 Stuart None None None puppo
15 889278841981685760 NaN NaN 2017-07-24 00:19:32 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Oliver. You're witnessing one of his many brutal attacks. Seems to be playing with his victim. 13/10 fr*ckin frightening #BarkWeek https://t.co/WpHvrQedPb NaN NaN NaN https://twitter.com/dog_rates/status/889278841981685760/video/1 13 10 Oliver None None None None
16 888917238123831296 NaN NaN 2017-07-23 00:22:39 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Jim. He found a fren. Taught him how to sit
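
Visual inspection can be complemented by programmatic checks on the columns that were extracted from the text. The sketch below is an illustration under assumptions, not part of the original assessment: the second sample row is invented, a decimal rating is assumed as one plausible way the extraction could go wrong, and an all-lowercase name is used as a rough signal that a word such as "a" was picked up instead of a real name.

```python
import pandas as pd

# Two sample rows: the first is taken from the archive, the second is made up.
sample = pd.DataFrame({
    'text': ["This is Phineas. He's a mystical boy. 13/10",
             "This is a rare dog. 9.75/10 would pet"],
    'rating_numerator': [13, 75],
    'name': ['Phineas', 'a'],
})

# Re-extract the first "numerator/denominator" pair, allowing decimals.
extracted = sample['text'].str.extract(r'(\d+(?:\.\d+)?)/(\d+)')
sample['numerator_from_text'] = extracted[0].astype(float)

# Rows where the stored numerator disagrees with the text.
rating_mismatch = sample[sample['numerator_from_text'] != sample['rating_numerator']]

# Names that are entirely lowercase are unlikely to be real names.
suspicious_names = sample[sample['name'].str.islower()]

print(rating_mismatch[['text', 'rating_numerator', 'numerator_from_text']])
print(suspicious_names[['text', 'name']])
```

Rows flagged this way still need a manual look, since the first fraction in a tweet is not guaranteed to be the rating.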