1. Your first data program in PySpark
Data-driven applications, no matter how complex, all boil down to what we can
think of as three meta steps, which are easy to distinguish in a program:
1 We start by loading or reading the data we wish to work with.
2 We transform the data, either via a few simple instructions or a very complex
machine learning model.
3 We then export (or sink) the resulting data, either into a file or by summarizing our findings into a visualization.
NOTE REPL stands for read, evaluate, print, and loop. In the case of Python, it represents the interactive prompt in which we input commands and read results.
1. Configuring how chatty Spark is: the log level
2. The DataFrameReader object
3. Splitting our lines of text into lists of words
4. Two ways to rename a column
5. Exploding a column of arrays into rows of elements
6. Lowering the case of the words in the data frame
7. Using regexp_extract to keep what looks like a word
8. Filtering rows in your data frame using where or filter