Reading and Visualizing Data with spark-sql in Python: Data Processing and Feature Extraction with Spark Python

Latest recommended article published on 2024-06-28 10:31:12

weixin_39683176


Article tags: reading and visualizing data with spark-sql in Python

Article link: https://blog.csdn.net/weixin_39683176/article/details/111986243


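Below, each line of data is split on the "|" character. This produces an RDD in which every record is a Python list made up of five attributes: user ID, age, gender, occupation, and ZIP code. We then count the number of users, genders, occupations, and ZIP codes, which the following code does. The dataset is small, so it is not cached here.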

user_fields = user_data.map(lambda line: line.split('|'))

num_users = user_fields.map(lambda fields: fields[0]).count()  # number of users

num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()  # number of distinct genders

num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()  # number of distinct occupations

num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()  # number of distinct ZIP codes

print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes))

Output: Users: 943, genders: 2, occupations: 21, ZIP codes: 795
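To make the split-and-count step concrete without a Spark cluster, here is a minimal plain-Python sketch of what the map, distinct, and count calls compute. The three records are hypothetical lines in the same user ID|age|gender|occupation|ZIP code layout, not rows quoted from the dataset, and `sample_lines` and `fields_list` are illustrative names.

```python
# Hypothetical sample records in the "user ID|age|gender|occupation|ZIP code" layout.
sample_lines = [
    "1|30|M|engineer|10001",
    "2|25|F|student|10002",
    "3|41|M|student|10001",
]

# Plain-Python counterpart of user_data.map(lambda line: line.split('|'))
fields_list = [line.split("|") for line in sample_lines]

# Counterpart of .count(): one record per user
num_users = len(fields_list)

# Counterpart of .map(...).distinct().count(): size of the set of values in one column
num_genders = len({fields[2] for fields in fields_list})
num_occupations = len({fields[3] for fields in fields_list})
num_zipcodes = len({fields[4] for fields in fields_list})

print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d"
      % (num_users, num_genders, num_occupations, num_zipcodes))
```

On a real RDD the same counts are computed across partitions, but the result matches what a set-based count gives on a single machine.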

Plot the distribution of user ages:

%matplotlib inline

import matplotlib.pyplot as plt

from matplotlib.pyplot import hist

ages = user_fields.map(lambda x: int(x[1])).collect()

hist(ages, bins=20, color='lightblue', density=True)  # density=True replaces normed=True, which newer matplotlib releases no longer accept

fig = plt.gcf()

fig.set_size_inches(12,6)

plt.show()
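A normalized histogram divides each bin count by the number of values times the bin width, so that the bar areas sum to 1. The following plain-Python sketch works through that calculation; the `ages_sample` list and the choice of 4 bins are made up for illustration and are not taken from the dataset.

```python
# Hypothetical ages, used only to illustrate the normalization.
ages_sample = [18, 22, 25, 25, 31, 34, 40, 47, 52, 60]
num_bins = 4

lo, hi = min(ages_sample), max(ages_sample)
bin_width = (hi - lo) / float(num_bins)

# Count how many values fall in each equal-width bin;
# the maximum value is placed in the last bin.
counts = [0] * num_bins
for age in ages_sample:
    index = min(int((age - lo) / bin_width), num_bins - 1)
    counts[index] += 1

# Density = count / (number of values * bin width), so the areas sum to 1.
densities = [c / (len(ages_sample) * bin_width) for c in counts]
total_area = sum(d * bin_width for d in densities)

print(counts)
print(total_area)
```

With this normalization the y axis of the plot shows a density rather than a raw user count.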

Plot the distribution of user occupations:

# Plot the distribution of user occupations

import numpy as np

count_by_occupation = user_fields.map(lambda fields: (fields[3],1)).reduceByKey(lambda x,y:x+y).collect()

print(count_by_occupation)

x_axis1 = np.array([c[0] for c in count_by_occupation])

y_axis1 = np.array([c[1] for c in count_by_occupation])

x_axis = x_axis1[np.argsort(y_axis1)]

y_axis = y_axis1[np.argsort(y_axis1)]

pos = np.arange(len(x_axis))

width = 1.0

ax = plt.axes()

ax.set_xticks(pos+(width)/2)

ax.set_xticklabels(x_axis)

plt.bar(pos, y_axis, width, color='lightblue', align='edge')  # align='edge' keeps the ticks at pos + width/2 centered on each bar

plt.xticks(rotation=30)

fig = plt.gcf()

fig.set_size_inches(12,6)

plt.show()

Output:

[(u'administrator', 79), (u'retired', 14), (u'lawyer', 12), (u'none', 9), (u'student', 196), (u'technician', 27), (u'programmer', 66), (u'salesman', 12), (u'homemaker', 7), (u'executive', 32), (u'doctor', 7), (u'entertainment', 18), (u'marketing', 26), (u'writer', 45), (u'scientist', 31), (u'educator', 95), (u'healthcare', 16), (u'librarian', 51), (u'artist', 28), (u'other', 105), (u'engineer', 67)]
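The two np.argsort calls in the plotting code order the occupations by count. As a sketch of the same ordering without NumPy, the pairs can be sorted once with `sorted`. The five pairs below are copied from the printed output, and `sample_counts`, `x_labels`, and `y_values` are illustrative names.

```python
# Five (occupation, count) pairs taken from the printed output.
sample_counts = [
    ("administrator", 79),
    ("retired", 14),
    ("lawyer", 12),
    ("none", 9),
    ("student", 196),
]

# Sort by count in ascending order, as np.argsort on the counts would.
sorted_counts = sorted(sample_counts, key=lambda pair: pair[1])

# Split back into labels (x axis) and heights (y axis) for a bar chart.
x_labels = [occupation for occupation, _ in sorted_counts]
y_values = [count for _, count in sorted_counts]

print(x_labels)
print(y_values)
```

Sorting the pairs together keeps each label attached to its count, so the two axes cannot drift out of step.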

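Spark provides a convenience function on RDDs called countByValue. It counts how many times each distinct value occurs in the RDD and returns the result to the driver as a Python dict (or as a Map in Scala or Java):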

count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()

print("Map-reduce approach:")

print(dict(count_by_occupation))  # result of the earlier map + reduceByKey pipeline

print("========================")

print("countByValue approach:")

print(dict(count_by_occupation2))  # result of countByValue

Output:

Map-reduce approach:

{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}

========================

countByValue approach:

{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, ...
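To illustrate why the two approaches agree, here is a minimal plain-Python sketch: a manual key-wise sum stands in for reduceByKey, and `collections.Counter` stands in for countByValue. The `occupations` list is a hypothetical sample, and neither stand-in is Spark's actual implementation.

```python
from collections import Counter

# Hypothetical occupation column, one entry per user.
occupations = ["student", "engineer", "student", "writer", "engineer", "student"]

# Map-reduce style: emit (key, 1) pairs, then sum the values per key.
pairs = [(occupation, 1) for occupation in occupations]
reduced = {}
for key, value in pairs:
    reduced[key] = reduced.get(key, 0) + value

# countByValue style: count occurrences of each distinct value directly.
counted = Counter(occupations)

print(reduced == dict(counted))
```

Both routes build the same mapping from value to count; only the order in which the keys are printed may differ.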
