PySpark Summary

1. Data Processing

Import the required libraries

from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType
from pyspark.sql import HiveContext
from pyspark.context import SparkContext
from pyspark.sql.functions import col

2. Loading Data (parquet, CSV, JSON, etc.)
Loading parquet

sc = SparkContext("local", "test")
hive_context = HiveContext(sc)
local_path = "hdfs path"

total_data = hive_context.read.load(local_path)

Loading CSV

sc = SparkContext("local", "test2")
hive_context = HiveContext(sc)
local_path = 'hdfs path'
total_data = hive_context.read.csv(local_path, header=True)

3. Basic Data Exploration
The examples below use the San Francisco crime classification dataset from Kaggle, available at:
https://www.kaggle.com/c/sf-crime/data.

# Show the first 5 rows
data.show(5)


# Print the schema
data.printSchema()


# Inspect the column names, row count, and column count

data.columns
data.count()
len(data.columns)


# Summarize a specified column with describe
# To see summary statistics for a given column in the DataFrame, use the describe method.
# It returns summary statistics for the specified column; with no column name, it summarizes every column of the DataFrame.

data.describe('Category').show()
data.describe().show()


# Query multiple columns with select
data.select('Category','Descript').show()


# Query distinct combinations of multiple columns
data.select('Category','Descript').distinct().show()


# Filter rows with filter

data.filter(data.Category == 'WEAPON LAWS').show()
# How many records match the filter
data.filter(data.Category == 'WEAPON LAWS').count()
# Filter on multiple conditions
data.filter((data.Category == 'WEAPON LAWS') & (data.DayOfWeek == 'Wednesday')).show()


# Group rows with groupBy
data.groupBy("Category").count().show()


# Sort rows with orderBy
# Ascending by default
data.groupBy("Category") \
    .count() \
    .orderBy(col("count")) \
    .show()

# Or descending
data.groupBy("Category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

