Python Practice Data: Python Data Analysis Practice (Part 1)


weixin_39976499


Article link: https://blog.csdn.net/weixin_39976499/article/details/111442744


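I recently got hold of a copy of Python for Data Analysis and ran through its examples in Jupyter Notebook. I want to write them up here so I can look back at them later. (PS: I skipped class this morning and got caught at roll call. I could cry.)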

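The examples use three datasets:

USAbitlyData: information about users who visited US government websites

MovieLens: users' ratings of movies

BabyNames: names given to babies in the US from 1880 to 2010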

Below we run some simple operations on the first dataset. I will cover the examples for the other two datasets in the next post.

USAbitly data

First, read the data:

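import json

path = 'LearningExercise/usagov_bitly_data2012-03-16-1331923249.txt'

# Each line of the file is one JSON object; the with block closes the file afterwards
# (encoding='utf-8' is an assumption about the file, added so the default platform encoding is not used)
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]

records[0]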

Printing the first record shows what the dataset looks like:

{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'al': 'en-US,en;q=0.8',
 'c': 'US',
 'cy': 'Danvers',
 'g': 'A6qOVH',
 'gr': 'MA',
 'h': 'wfLQtf',
 'hc': 1331822918,
 'hh': '1.usa.gov',
 'l': 'orofrog',
 'll': [42.576698, -70.954903],
 'nk': 1,
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 't': 1331923247,
 'tz': 'America/New_York',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

Each record is a set of keys and values. To get the value of the 'tz' key of the first record (tz is short for time zone), we can write:

records[0]['tz']

This is the tz value of the first record, and it returns:

'America/New_York'

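In the same way, we can loop over all the records to collect their 'tz' values. The if 'tz' in rec condition skips records that have no tz key: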

time_zones = [rec['tz'] for rec in records if 'tz' in rec]
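To see why the if 'tz' in rec guard matters, here is a minimal sketch on three made-up records. The names sample_lines, sample_records and sample_time_zones and all of their values are illustrative assumptions, not data from the real file.

```python
import json

# Made-up JSON lines standing in for the real file; the second one has no 'tz' key
sample_lines = [
    '{"tz": "America/New_York", "c": "US"}',
    '{"c": "US"}',
    '{"tz": "", "c": "BR"}',
]
sample_records = [json.loads(line) for line in sample_lines]

# Without the guard, rec['tz'] raises KeyError on the record that lacks the key
try:
    [rec['tz'] for rec in sample_records]
    raised = False
except KeyError:
    raised = True

# With the guard, records lacking the key are skipped; empty strings are kept
sample_time_zones = [rec['tz'] for rec in sample_records if 'tz' in rec]
print(raised, sample_time_zones)
```

Note that a record whose tz is an empty string still passes the guard, which is where the empty time zone in the counts further down comes from.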

To count how many times each value occurs, for example how many records have a tz value of 'America/New_York', we can write a function using only built-in Python:

# Using only built-in Python
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

It can also be written with defaultdict from the collections module:

from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x] += 1
    return counts

The functions are used like this:

counts = get_counts2(time_zones)

counts['America/New_York']

To find the ten most frequent time zones, we can write a function like this:

# Top n time zones and their counts, in ascending order of count
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

top_counts(counts)

Output:

[(33, 'America/Sao_Paulo'),
 (35, 'Europe/Madrid'),
 (36, 'Pacific/Honolulu'),
 (37, 'Asia/Tokyo'),
 (74, 'Europe/London'),
 (191, 'America/Denver'),
 (382, 'America/Los_Angeles'),
 (400, 'America/Chicago'),
 (521, ''),
 (1251, 'America/New_York')]
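The same kind of ranking can be sketched with collections.Counter from the standard library, which combines the counting and the top-n step. The sample_time_zones list below is made up for illustration only.

```python
from collections import Counter

# Made-up time zone values, only for illustration
sample_time_zones = [
    'America/New_York', 'America/Chicago', 'America/New_York',
    '', 'America/New_York', '',
]

# Counter counts each distinct value; most_common(n) returns the n largest
sample_counts = Counter(sample_time_zones)
top_two = sample_counts.most_common(2)
print(top_two)
```

Unlike top_counts, most_common returns (value, count) pairs with the largest count first.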

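Of course, this function can also be written with the Python standard library (for example collections.Counter), or the whole task can be done with pandas. Let's look at the pandas version. Here all the records are stored in a DataFrame called frame: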

# Counting time zones with pandas
import pandas as pd

frame = pd.DataFrame(records)

# value_counts() returns the counts sorted in descending order
tz_counts = frame['tz'].value_counts()

tz_counts[:10]

Output:

America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
Name: tz, dtype: int64

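The output above shows that some of the data is missing (521 records have an empty time zone, and records with no tz at all are not counted), so let's clean it up: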

# Replace the missing values (NA) and the unknown values (empty strings)
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'

tz_counts = clean_tz.value_counts()

tz_counts[:10]

Output:

America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64
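The difference between the two kinds of bad values, NA and the empty string, is easier to see on a tiny Series. This is a minimal sketch with made-up values, and the names tz, clean and counts are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values: one NA (key absent in the record) and one empty string
tz = pd.Series(['America/New_York', np.nan, '', 'America/New_York'])

# fillna only touches NA values; the empty string is a real value and is left alone
clean = tz.fillna('Missing')

# A boolean mask then picks out the empty strings so they can be relabeled
clean[clean == ''] = 'Unknown'

counts = clean.value_counts()
print(counts)
```

Keeping the two labels separate preserves the distinction between records that had no tz field and records whose tz field was blank.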

Next we plot the counts above with matplotlib. Plots may not show up in Jupyter Notebook by default. I got them to display by running this line first:

%matplotlib inline

Then simply run:

tz_counts[:10].plot(kind='barh', rot=0)

Now let's group the data by time zone (tz) and operating system (taken from the content of the a field):

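import numpy as np

# Keep only the records that have a user-agent string in field 'a'
cframe = frame[frame.a.notnull()]

# Label each record by whether its user-agent string mentions Windows
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')

# Count records per (time zone, operating system) pair, then pivot the OS labels into columns
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)

agg_counts[:10]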

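fillna(0) replaces the missing (NA) values with 0. The result is a table with one row per time zone and the two columns Not Windows and Windows.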
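As a rough illustration of what the groupby, size() and unstack() steps produce, here is a sketch on a made-up four-row frame. The names toy, toy_os and toy_counts are hypothetical, and the user-agent strings are shortened inventions.

```python
import numpy as np
import pandas as pd

# Made-up records: three from New York, one from Tokyo
toy = pd.DataFrame({
    'tz': ['America/New_York', 'America/New_York', 'Asia/Tokyo', 'America/New_York'],
    'a': ['Mozilla (Windows NT 6.1)', 'Mozilla (Macintosh)',
          'Mozilla (Windows NT 5.1)', 'Opera (Windows)'],
})

# One label per row, aligned with the rows of toy
toy_os = np.where(toy['a'].str.contains('Windows'), 'Windows', 'Not Windows')

# size() counts rows per (tz, OS) pair; unstack() turns the OS labels into columns;
# pairs that never occur become NA, which fillna(0) turns into 0
toy_counts = toy.groupby(['tz', toy_os]).size().unstack().fillna(0)
print(toy_counts)
```

Tokyo has no Not Windows record in this toy data, so that cell is the one that fillna(0) fills in.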

Sum the counts for each time zone, then work out their sort order:

# argsort() returns the integer positions that would sort the row totals in ascending order
indexer = agg_counts.sum(axis=1).argsort()

indexer[:10]

Output (the numbers are sort positions, not counts):

Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64
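The values returned by argsort() are easy to misread as counts, so here is a small sketch of what they mean. The totals Series and its tz_a, tz_b, tz_c labels are made up for illustration.

```python
import pandas as pd

# Made-up row totals for three hypothetical time zones
totals = pd.Series([30, 10, 20], index=['tz_a', 'tz_b', 'tz_c'])

# argsort() gives the positions that would sort the values in ascending order:
# position 1 (value 10) comes first, then position 2 (20), then position 0 (30)
order = totals.argsort()

# Selecting by those positions yields the totals in ascending order
sorted_totals = totals.iloc[order.to_numpy()]

# sort_values() reaches the same ordering in one step
same_totals = totals.sort_values()
print(order.tolist(), sorted_totals.tolist())
```

The labels shown next to the positions are carried over from the input and do not describe the numbers beside them, which is why the output reads oddly.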

Now let's look at the share of Windows users in the ten most frequent time zones:

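# Take the rows in ascending order of total and keep the last ten (the largest)
count_subset = agg_counts.take(indexer)[-10:]

# Divide each row by its total so that the two columns sum to 1
normed_subset = count_subset.div(count_subset.sum(axis=1), axis=0)

normed_subset.plot(kind='barh', stacked=True)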
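The row normalization can be checked by hand on a small table. This is a minimal sketch with made-up counts, and toy_counts, row_totals and toy_normed are hypothetical names.

```python
import pandas as pd

# Made-up counts for two hypothetical time zones
toy_counts = pd.DataFrame(
    {'Not Windows': [1.0, 30.0], 'Windows': [3.0, 10.0]},
    index=['tz_a', 'tz_b'],
)

# Total number of records per row (time zone)
row_totals = toy_counts.sum(axis=1)

# axis=0 aligns row_totals with the row labels, so each row is divided by its own total
toy_normed = toy_counts.div(row_totals, axis=0)
print(toy_normed)
```

After this step every row sums to 1, so the stacked bars all have the same length and only the Windows share differs between time zones.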

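That covers the first dataset. Along the way we used pandas, NumPy, matplotlib and the DataFrame, which are the tools you reach for most often when doing data analysis in Python. Hopefully next time I skip class I won't get caught at roll call _(:з」∠)_ or, better, hopefully I won't skip class next time xDDDDDDD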
