Reading Notes on Data Science from Scratch: Introduction to Data Science

Introduction to Data Science

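What data is for: finding answers to questions hidden in the data.
What data science is: extracting truth from data.
Data matters in everyday life. For example, the hometown and current-location fields on Facebook not only help friends find you; the site can also analyze this geographic information to study global migration and where the fans of different sports teams live.
These notes learn data science thinking by working through practical problems.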


Problem 1: Finding key connectors in a given set of relationships

Friendship pairs (each tuple links two user ids):
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
User records, one dict of attributes per user:
users = [{"id": 0, "name": "Hero"}, {"id": 1, "name": "Dunn"},
         {"id": 2, "name": "Sue"}, {"id": 3, "name": "Chi"},
         {"id": 4, "name": "Thor"}, {"id": 5, "name": "Clive"},
         {"id": 6, "name": "Hicks"}, {"id": 7, "name": "Devin"},
         {"id": 8, "name": "Kate"}, {"id": 9, "name": "Klein"}]

Add an empty friends list to each user:
for user in users:
    user["friends"] = []

Populate the lists from friendships:
for i, j in friendships:
    users[i]["friends"].append(users[j])  # add j as a friend of i
    users[j]["friends"].append(users[i])  # add i as a friend of j
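The friends lists above hold references to whole user dicts, which in turn refer back to each other. A lighter variant, sketched here under the assumption that only ids are needed, keeps a separate mapping from user id to friend ids. The name friend_ids_by_user is hypothetical and does not appear in the original notes.

```python
from collections import defaultdict

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Hypothetical name: maps each user id to the ids of that user's friends.
friend_ids_by_user = defaultdict(list)
for i, j in friendships:
    friend_ids_by_user[i].append(j)  # j is a friend of i
    friend_ids_by_user[j].append(i)  # friendship is mutual

print(friend_ids_by_user[3])  # ids of user 3's friends
```

The trade-off is one extra lookup when a friend's name or other attributes are needed.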

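To find the total number of connections, sum the lengths of all users' friends lists:
def number_of_friends(user):
    """How many friends does user have?"""
    return len(user["friends"])
Total number of connections (each friendship is counted once per endpoint):
total_connections = sum(number_of_friends(user) for user in users)  # 24
Divide this by the number of users to get the average:
num_users = len(users)  # 10
avg_connections = total_connections / num_users  # 2.4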

With so few users, it is easy to sort them by number of friends:
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
The list looks like this:
[(0, 2), (1, 3), (2, 3), (3, 3), (4, 2), (5, 3), (6, 2), (7, 2), (8, 3), (9, 1)]
Sort it from most friends to fewest:
sorted(num_friends_by_id, key=lambda pair: pair[1], reverse=True)
The result:
[(1, 3), (2, 3), (3, 3), (5, 3), (8, 3), (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]

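How to actually pick out the central connectors needs further investigation; the friend count is only a first measure.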
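The friend count used here is the degree of each user in the friendship graph. As a self-contained check, the sketch below recomputes it directly from the friendship pairs with Counter. The names degree and most_connected are illustrative, not from the original notes.

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Each appearance of an id in a pair is one connection for that user.
degree = Counter(user_id for pair in friendships for user_id in pair)

num_users = 10
avg_connections = sum(degree.values()) / num_users

# Users tied for the highest number of friends.
top = max(degree.values())
most_connected = sorted(uid for uid, n in degree.items() if n == top)

print(avg_connections, most_connected)
```

Five users tie at three friends each, which is one reason the count alone does not settle who the key connector is.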


Problem 2: Digging further into the relationship data to find shared interests

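We can also mine the data for friends of friends:
def friends_of_friend_ids_bad(user):
    # "foaf" is short for "friend of a friend"
    return [foaf["id"]
            for friend in user["friends"]   # for each of the user's friends
            for foaf in friend["friends"]]  # collect each of their friends
As the _bad suffix suggests, this version is naive: the result includes the user's own id and the ids of existing friends, and the same id can appear several times.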
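One way to address those problems is sketched below. It assumes friendships are stored as sets of ids per user (the friend_ids mapping and the function name friends_of_friend_counts are illustrative choices), and it counts how many mutual friends lead to each friend of a friend.

```python
from collections import Counter, defaultdict

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Illustrative helper structure: user id -> set of friend ids.
friend_ids = defaultdict(set)
for i, j in friendships:
    friend_ids[i].add(j)
    friend_ids[j].add(i)

def friends_of_friend_counts(user_id):
    """Map each friend-of-a-friend id to the number of mutual friends."""
    return Counter(
        foaf
        for friend in friend_ids[user_id]
        for foaf in friend_ids[friend]
        if foaf != user_id                    # skip the user themself
        and foaf not in friend_ids[user_id]   # skip existing friends
    )

print(friends_of_friend_counts(3))
```

For user 3 (Chi) this reports two mutual friends with user 0 and one with user 5.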

The data can be analyzed further. To gather statistics on people with shared interests, start with the list of (user id, interest) pairs:
interests = [(0,"Hadoop"),(0,"Big Data"),(0,"HBase"),(0,"JAVA"),
             (0,"SPARK"),(0,"STORM"),(0,"CASSANDRA"),
             (1,"NOSQL"),(1,"MONGONDB"),(1,"CASSANDRA"),(1,"HBASE"),
             (1,"POSTGRES"),(2,"PYTHON"),(2,"SCIKIT-LEARN"),(2,"SCIPY"),
             (3,"R"),(3,"PYTHON"),(3,"STATISTICS"),(3,"regression"),
             (4,"machine learning"),(4,"regression"),(5,"python"),(5,"R"),
             (5,"Java"),(5,"C++"),(6,"statistics"),(6,"theory"),
             (7,"neural networdks"),(8,"deep learning"),(8,"Big Data"),
             (9,"Hadoop"),(9,"Java"),(9,"Big Data")
]

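To find the users who share a given interest:
def data_scientists_who_like(target_interest):
    # The comparison is case-sensitive, so "Java" and "JAVA" do not match.
    return [user_id for user_id, user_interest in interests
            if user_interest == target_interest]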

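The function above scans the entire interests list on every call. Building an index from interest to users allows direct lookup instead:
from collections import defaultdict
# Keys are interests; values are lists of user_ids with that interest.
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)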
This produces the following index:
defaultdict(<class 'list'>, {'Hadoop': [0, 9], 'Big Data': [0, 8, 9], 'HBase': [0], 'JAVA': [0], 'SPARK': [0], 'STORM': [0], 'CASSANDRA': [0, 1], 'NOSQL': [1], 'MONGONDB': [1], 'HBASE': [1], 'POSTGRES': [1], 'PYTHON': [2, 3], 'SCIKIT-LEARN': [2], 'SCIPY': [2], 'R': [3, 5], 'STATISTICS': [3], 'regression': [3, 4], 'machine learning': [4], 'python': [5], 'Java': [5, 9], 'C++': [5], 'statistics': [6], 'theory': [6], 'neural networdks': [7], 'deep learning': [8]})

In the same way, we can build an index from user to interests:
# Keys are user_ids; values are lists of that user's interests.
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)
This gives the following index:
defaultdict(<class 'list'>, {0: ['Hadoop', 'Big Data', 'HBase', 'JAVA', 'SPARK', 'STORM', 'CASSANDRA'], 1: ['NOSQL', 'MONGONDB', 'CASSANDRA', 'HBASE', 'POSTGRES'], 2: ['PYTHON', 'SCIKIT-LEARN', 'SCIPY'], 3: ['R', 'PYTHON', 'STATISTICS', 'regression'], 4: ['machine learning', 'regression'], 5: ['python', 'R', 'Java', 'C++'], 6: ['statistics', 'theory'], 7: ['neural networdks'], 8: ['deep learning', 'Big Data'], 9: ['Hadoop', 'Java', 'Big Data']})

With these indexes in place, we can tally how many data scientists share each interest.
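A minimal sketch of that tally, using a short excerpt of the interests list to keep it small. The name interest_counts is illustrative and not from the original notes.

```python
from collections import defaultdict

# A short excerpt of the (user_id, interest) pairs, for illustration only.
interests = [(0, "Hadoop"), (0, "Big Data"), (8, "Big Data"),
             (9, "Hadoop"), (9, "Java"), (9, "Big Data"), (5, "Java")]

# Index from interest to the users who list it.
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# (interest, number of users) pairs, most popular first.
interest_counts = sorted(
    ((interest, len(ids)) for interest, ids in user_ids_by_interest.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

print(interest_counts)
```

Because the index is keyed on the exact string, differently capitalized spellings of the same interest are still counted separately at this stage.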

Problem 3: Salary and tenure statistics

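Can we provide some interesting facts about what data scientists earn? Salary data is sensitive, so it is given only as (salary, tenure in years) pairs:
salaries_and_tenures = [(83000,8.7),(88000,8.1),
                        (48000,0.7),(76000,6),
                        (69000,6.5),(76000,7.5),
                        (60000,2.5),(83000,10),
                        (48000,1.9),(63000,4.2)]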

Group the salaries by tenure:
from collections import defaultdict
# Keys are tenures in years; values are lists of salaries for that tenure.
salary_by_tenure = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)
This gives a dictionary keyed by tenure in years:
defaultdict(<class 'list'>, {8.7: [83000], 8.1: [88000], 0.7: [48000], 6: [76000], 6.5: [69000], 7.5: [76000], 2.5: [60000], 10: [83000], 1.9: [48000], 4.2: [63000]})
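We can also compute the average salary for each tenure:
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}
Because no two users share the same tenure, each "average" here is just one person's salary, so this is not very informative.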

A more useful approach is to group the tenures into buckets:
def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"
# Keys are tenure buckets; values are lists of salaries in that bucket.
salary_by_tenure_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    bucket = tenure_bucket(tenure)
    salary_by_tenure_bucket[bucket].append(salary)
This gives the salaries grouped by tenure bucket.
Finally, compute the average salary for each bucket:
>>> average_salary_by_bucket = {
...     bucket: sum(salaries) / len(salaries)
...     for bucket, salaries in salary_by_tenure_bucket.items()
... }
>>> average_salary_by_bucket
{'more than five': 79166.66666666667, 'less than two': 48000.0, 'between two and five': 61500.0}
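To put the bucket averages in perspective, the self-contained sketch below recomputes them with statistics.mean and compares the most and least experienced buckets. The names salaries_by_bucket, average_by_bucket and ratio are illustrative, not from the original notes.

```python
from collections import defaultdict
from statistics import mean

salaries_and_tenures = [(83000, 8.7), (88000, 8.1), (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5), (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]

def tenure_bucket(tenure):
    """Assign a tenure in years to one of three buckets."""
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    return "more than five"

# Collect the salaries that fall into each bucket.
salaries_by_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salaries_by_bucket[tenure_bucket(tenure)].append(salary)

average_by_bucket = {b: mean(s) for b, s in salaries_by_bucket.items()}

# Average of the most experienced bucket relative to the least experienced.
ratio = average_by_bucket["more than five"] / average_by_bucket["less than two"]
print(f"{ratio:.2f}")
```

On this small sample the "more than five" average is roughly 65% higher than the "less than two" average; with only ten data points that is a rough indication, not a firm conclusion.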


Problem 4: Counting the words that appear in interests

How do we count interests whose capitalization is inconsistent? First convert every interest to lowercase, then split it into words:
from collections import Counter
words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())
Then list the words that occur more than once:
for word, count in words_and_counts.most_common():
    if count > 1:
        print(word, count)

big 3
data 3
java 3
python 3
hadoop 2
hbase 2
cassandra 2
r 2
statistics 2
regression 2
learning 2
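Splitting on whitespace counts "big" and "data" as separate words. If whole phrases are wanted instead, one option is to count the lowercased interest strings directly. The sketch below uses a short excerpt of the interests list, and the name phrase_counts is illustrative.

```python
from collections import Counter

# A short excerpt of the (user_id, interest) pairs, for illustration only.
interests = [(0, "Big Data"), (0, "JAVA"), (2, "PYTHON"), (5, "python"),
             (5, "Java"), (8, "Big Data"), (9, "Java"), (9, "Big Data")]

# Count whole interests, ignoring capitalization but keeping phrases intact.
phrase_counts = Counter(interest.lower() for _, interest in interests)

for phrase, count in phrase_counts.most_common():
    print(phrase, count)
```

Which form is more useful depends on the question: per-word counts surface shared vocabulary such as "learning", while per-phrase counts keep "big data" together as a single topic.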


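Summary: this chapter is mainly about getting comfortable using Python to process data and compute statistics on the data of interest, so that more information can be drawn out of the analysis.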





