Study Notes on "Python for Data Analysis": ch02-1 (1)

Preface

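This is my first blog. Its main purpose is to record my progress in learning Python and to serve as my study notes. I will type out every example in the book "Python for Data Analysis" myself. For many statements I do not yet know why they are written the way they are, and some may behave differently because of version differences. I will do my best to understand each part. I hope this blog will bear witness to my learning, and that I can stick with it.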


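Since this is my first time blogging, I don't know how to make a post look as good as other people's, so I'll take it one step at a time. Let's get started.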


Chapter 2: Introduction

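This chapter presents some sample datasets and shows what we can do with them. It does not explain every statement in detail; it is only a broad overview.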

1.usa.gov data from bit.ly

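Each line of the file is in JSON format (JavaScript Object Notation, a common web data format). Python has many built-in and third-party modules that can convert a JSON string into a Python dictionary. Here I use the json module and its loads function to load the downloaded data file line by line: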

import json
path='C:\\pytest\\ch02\\usagov_bitly_data2012-03-16-1331923249.txt'
records=[json.loads(line) for line in open(path)]
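The last expression above is called a list comprehension, a concise way of applying the same operation (such as json.loads) to a collection of strings (or other objects). Iterating over an open file handle yields a sequence of its lines.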
records[0]
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'al': 'en-US,en;q=0.8',
 'c': 'US',
 'cy': 'Danvers',
 'g': 'A6qOVH',
 'gr': 'MA',
 'h': 'wfLQtf',
 'hc': 1331822918,
 'hh': '1.usa.gov',
 'l': 'orofrog',
 'll': [42.576698, -70.954903],
 'nk': 1,
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 't': 1331923247,
 'tz': 'America/New_York',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

Now a value in the current record can be accessed by giving its key as a string:

records[0]['tz']
'America/New_York'
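The loading step can be tried without the data file. The following is a minimal sketch, assuming a few made-up lines (sample_lines and load_records are hypothetical names, not from the book); it also skips blank lines, on which json.loads would raise an error.

```python
import json

# Hypothetical sample lines standing in for the downloaded data file.
sample_lines = [
    '{"tz": "America/New_York", "c": "US"}\n',
    '\n',  # a blank line; json.loads would raise on it, so it is filtered out
    '{"_heartbeat_": 1331923261}\n',
]

def load_records(lines):
    """Parse one JSON object per non-blank line into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]

records_demo = load_records(sample_lines)
print(records_demo[0]["tz"])      # key access on a parsed record
print(records_demo[1].get("tz"))  # .get returns None when the key is absent
```

With a real file, passing the handle from `with open(path) as f:` to load_records(f) would also close the file once loading is done.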

Counting time zones in pure Python

time_zones=[rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]
['America/New_York',
 'America/Denver',
 'America/New_York',
 'America/Sao_Paulo',
 'America/New_York',
 'America/New_York',
 'Europe/Warsaw',
 '',
 '',
 '']

Next, do the counting using only the Python standard library

Method 1

def get_counts(sequence):
    counts={}
    for x in sequence:
        if x in counts:
            counts[x] +=1
        else:
            counts[x] =1
    return counts
Method 2: if you know the Python standard library well, the code can be written more concisely
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)  # every missing key starts at 0
    for x in sequence:
        counts[x] +=1
    return counts
counts=get_counts(time_zones)
counts['America/New_York']
Output: 1251
counts=get_counts2(time_zones)
counts['America/New_York']
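Also outputs 1251
len(time_zones)  # number of time zone entries (records that have a tz field), not the number of distinct time zones

Output: 3440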
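To check that the two counting approaches agree, and to see the difference between the number of entries and the number of distinct zones, here is a small sketch on a made-up list (count_plain, count_default, and zones are illustrative names only).

```python
from collections import defaultdict

def count_plain(sequence):
    """Count with an ordinary dict, using .get to supply a default of 0."""
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts

def count_default(sequence):
    """Count with defaultdict(int), where missing keys start at 0."""
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

# Made-up sample; the empty string plays the role of an unknown time zone.
zones = ["America/New_York", "", "America/Denver", "America/New_York"]
plain = count_plain(zones)
default = count_default(zones)

print(plain == dict(default))  # both approaches produce the same counts
print(len(zones))              # number of entries
print(len(set(zones)))         # number of distinct values
```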

To get the top 10 time zones and their counts, some dictionary handling is needed:

def top_counts(count_dict,n=10):
    value_key_pairs=[(count,tz) for tz,count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
top_counts(counts)
[(33, 'America/Sao_Paulo'),
 (35, 'Europe/Madrid'),
 (36, 'Pacific/Honolulu'),
 (37, 'Asia/Tokyo'),
 (74, 'Europe/London'),
 (191, 'America/Denver'),
 (382, 'America/Los_Angeles'),
 (400, 'America/Chicago'),
 (521, ''),
 (1251, 'America/New_York')]
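As an alternative to sorting the whole list of (count, tz) pairs, the standard library's heapq.nlargest can pick the top entries directly. This is only a sketch: top_counts_desc is a hypothetical helper, and demo_counts reuses a few of the counts shown above.

```python
import heapq

def top_counts_desc(count_dict, n=10):
    """Return the n (key, count) pairs with the largest counts, largest first."""
    return heapq.nlargest(n, count_dict.items(), key=lambda kv: kv[1])

# A handful of the counts from the output above.
demo_counts = {
    "America/New_York": 1251,
    "": 521,
    "America/Chicago": 400,
    "Europe/London": 74,
}

print(top_counts_desc(demo_counts, n=2))
```

Unlike top_counts, this returns the pairs in descending order and keeps the key first in each pair.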

The collections.Counter class in the Python standard library makes this task even simpler:

from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
[('America/New_York', 1251),
 ('', 521),
 ('America/Chicago', 400),
 ('America/Los_Angeles', 382),
 ('America/Denver', 191),
 ('Europe/London', 74),
 ('Asia/Tokyo', 37),
 ('Pacific/Honolulu', 36),
 ('Europe/Madrid', 35),
 ('America/Sao_Paulo', 33)]

Counting time zones with pandas

DataFrame is the most important data structure in pandas; it represents data as a table. Creating a DataFrame from a set of raw records is simple:

from pandas import DataFrame,Series
import pandas as pd
import numpy as np
frame = DataFrame(records)
frame
[frame display omitted: the notebook prints rows 0-29 and 3530-3559 of a table with the columns _heartbeat_, a, al, c, cy, g, gr, h, hc, hh, kw, l, ll, nk, r, t, tz, u]

3560 rows × 18 columns

frame['tz'][:10]
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz, dtype: object

The Series object returned by frame['tz'] has a value_counts method that gives us the information we need:

tz_counts = frame['tz'].value_counts()
tz_counts[:10]
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
Name: tz, dtype: int64
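One detail worth seeing on a small example: by default value_counts leaves out missing values (NaN), while empty strings are counted like any other value. The sketch below uses a made-up Series (tz_demo is a hypothetical name).

```python
import numpy as np
import pandas as pd

# Made-up sample: two real zones, two empty strings, one missing value.
tz_demo = pd.Series(["America/New_York", "", np.nan, "America/New_York", ""])

default_counts = tz_demo.value_counts()          # NaN is dropped by default
all_counts = tz_demo.value_counts(dropna=False)  # NaN gets its own row

print(default_counts)
print(all_counts)
```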

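First, fill in a substitute value for unknown or missing time zones in the records. The fillna function replaces missing values (NA), while unknown values (empty strings) can be replaced using boolean array indexing: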

clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz=='']='Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64
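The same two-step cleanup can be sketched on a tiny Series. Here Series.mask is used in place of the boolean-indexed assignment; it returns a new Series rather than modifying one in place. The names raw and clean are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value and two empty strings.
raw = pd.Series(["America/New_York", "", np.nan, ""])

clean = raw.fillna("Missing")               # step 1: missing values (NA)
clean = clean.mask(clean == "", "Unknown")  # step 2: empty strings

print(clean.value_counts())
```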

A pandas Series has a plot method, and tz_counts is a Series, so it can be plotted directly. Import matplotlib.pyplot first and call plt.show() at the end to display the figure.

import matplotlib.pyplot as plt
tz_counts[:10].plot(kind='barh',rot=0)
plt.show()

Figure 2-1: The most frequent time zones in the sample data

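The information in a string can be parsed out using Python's built-in string functions and regular expressions. Here the first whitespace-separated token of each agent string (the a field) is extracted: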

results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object
results.value_counts()[:8]
Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
dtype: int64
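The code above relies on str.split only. As a sketch of the regular-expression route, the pattern below splits the leading token of an agent string into a name and an optional version. The pattern and the parse_agent helper are my own assumptions for illustration, not something taken from the book.

```python
import re

# Leading token: a name, optionally followed by "/" and a version string.
AGENT_RE = re.compile(r"^(?P<name>[^/\s]+)(?:/(?P<version>\S+))?")

def parse_agent(agent):
    """Return (name, version) for the leading token, or (None, None) if nothing matches."""
    match = AGENT_RE.match(agent)
    if match is None:
        return None, None
    return match.group("name"), match.group("version")

print(parse_agent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11"))
print(parse_agent("GoogleMaps/RochesterNY"))
print(parse_agent("GoogleProducer"))
```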

Some agent strings are missing, so those records are removed from the data first:

cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')
operating_system[:5]
array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'], 
      dtype='<U11')
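A variation, shown here only as a sketch: str.contains accepts na=False, which treats a missing agent as "no match" instead of requiring the rows to be dropped first. Note that this changes the meaning slightly, since records with a missing agent end up labelled 'Not Windows' rather than being excluded. The sample strings are made up.

```python
import numpy as np
import pandas as pd

# Made-up agent strings, including one missing value.
agents = pd.Series(["Mozilla/5.0 (Windows NT 6.1)", "GoogleMaps/RochesterNY", None])

is_windows = agents.str.contains("Windows", na=False)  # missing -> False
os_label = np.where(is_windows, "Windows", "Not Windows")

print(os_label)
```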

The data can now be grouped by time zone and the newly derived list of operating systems:

by_tz_os = cframe.groupby(['tz',operating_system])
size counts the members of each group (similar to value_counts above), and unstack reshapes the counts into a table:
agg_counts=by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
                                Not Windows  Windows
tz
                                      245.0    276.0
Africa/Cairo                            0.0      3.0
Africa/Casablanca                       0.0      1.0
Africa/Ceuta                            0.0      2.0
Africa/Johannesburg                     0.0      1.0
Africa/Lusaka                           0.0      1.0
America/Anchorage                       4.0      1.0
America/Argentina/Buenos_Aires          1.0      0.0
America/Argentina/Cordoba               0.0      1.0
America/Argentina/Mendoza               0.0      1.0
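The group, count, and reshape steps are easier to follow on a frame small enough to check by hand. This is a minimal sketch with made-up rows (df, table, and int_table are illustrative names). It also shows unstack(fill_value=0), which fills the gaps during reshaping so the counts stay integers.

```python
import pandas as pd

# Made-up records: three from New York, one from London.
df = pd.DataFrame({
    "tz": ["America/New_York", "America/New_York", "Europe/London", "America/New_York"],
    "os": ["Windows", "Not Windows", "Windows", "Windows"],
})

# size() counts rows per (tz, os) pair; unstack() moves 'os' into the columns.
table = df.groupby(["tz", "os"]).size().unstack().fillna(0)

# Filling during the reshape avoids the detour through NaN.
int_table = df.groupby(["tz", "os"]).size().unstack(fill_value=0)

print(table)
print(int_table)
```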

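Finally, select the most frequent time zones. To do this, build an indirect index array from the row totals of agg_counts, which sorts the rows in ascending order: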

indexer = agg_counts.sum(1).argsort()
indexer[:10]
tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64

Then use take to select rows in that order and keep the last 10:

count_subset = agg_counts.take(indexer)[-10:]
count_subset
Output:
                     Not Windows  Windows
tz
America/Sao_Paulo           13.0     20.0
Europe/Madrid               16.0     19.0
Pacific/Honolulu             0.0     36.0
Asia/Tokyo                   2.0     35.0
Europe/London               43.0     31.0
America/Denver             132.0     59.0
America/Los_Angeles        130.0    252.0
America/Chicago            115.0    285.0
                           245.0    276.0
America/New_York           339.0    912.0
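The indirect sort can be sketched on three of the rows above. The argsort call returns the positions that would sort the row totals in ascending order, so the last positions point at the largest totals. The names counts, order, top2, and same are illustrative, and the second selection (via sort_values) is simply another way to reach the same rows.

```python
import pandas as pd

# Three rows taken from the output above, deliberately not in sorted order.
counts = pd.DataFrame(
    {"Not Windows": [339.0, 13.0, 43.0], "Windows": [912.0, 20.0, 31.0]},
    index=pd.Index(
        ["America/New_York", "America/Sao_Paulo", "Europe/London"], name="tz"
    ),
)

totals = counts.sum(axis=1)          # row totals: 1251, 33, 74
order = totals.to_numpy().argsort()  # positions that sort the totals ascending
top2 = counts.take(order[-2:])       # the two rows with the largest totals

# Equivalent selection by label instead of by position.
same = counts.loc[totals.sort_values().index[-2:]]

print(top2)
```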

Pass stacked=True to produce a stacked bar chart:

count_subset.plot(kind='barh',stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1d7cad929e8>  # the shell echoing the Axes object that plot() returns; it can be ignored
plt.show()

Most frequent time zones, split into Windows and non-Windows users

The rows can also be normalized so that each sums to 1, then plotted again:

normed_subset = count_subset.div(count_subset.sum(1),axis=0)
normed_subset.plot(kind='barh',stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1d7cae55c18>  # the same echoed return value as before
plt.show()

Proportion of Windows and non-Windows users in the most frequent time zones
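The normalization divides every row by its own total, which is what axis=0 in div achieves: the Series of row sums is aligned with the row index. A small sketch using two rows from the earlier output (subset and normed are illustrative names):

```python
import pandas as pd

# Two rows taken from the count_subset output above.
subset = pd.DataFrame(
    {"Not Windows": [43.0, 339.0], "Windows": [31.0, 912.0]},
    index=pd.Index(["Europe/London", "America/New_York"], name="tz"),
)

row_totals = subset.sum(axis=1)          # 74 and 1251
normed = subset.div(row_totals, axis=0)  # each row divided by its own total

print(normed)
print(normed.sum(axis=1))  # every row now sums to 1 (up to rounding)
```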

Finally:

All the methods used here are explained in detail in later chapters of the book.
