Azure Databricks in Action
Main usage patterns of Azure Databricks.
The syntax and API are almost identical to Spark's, so most of this also applies on other cloud platforms such as AWS.
wangyanglongcc
Data warehouse engineer with years of data processing and analysis experience, specializing in warehouse ETL and warehouse model design and construction.
Familiar with Microsoft cloud products such as Azure Data Factory, Azure Databricks, and SQL Server.
Also proficient with Python, SQL, and Excel.
Optimize and Vacuum Delta Tables
Posted 2023-05-05.
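The preview shows no code, but the two commands in the title are standard Delta Lake maintenance. A minimal sketch, assuming a Delta table named events and a frequently filtered column event_date:

# compact small files and co-locate data by a frequently filtered column
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
# delete data files no longer referenced by the table, keeping 7 days (168 hours) of history
spark.sql("VACUUM events RETAIN 168 HOURS")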
Logger to record sqlcmd status
Posted 2023-05-01.
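No detail survives in the preview; a plausible sketch, assuming the post logs the exit status of a sqlcmd invocation with Python's logging module (all names here are assumptions):

import logging
import subprocess

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("sqlcmd")

def run_sqlcmd(args):
    # run sqlcmd and record whether it succeeded
    result = subprocess.run(["sqlcmd"] + args, capture_output=True, text=True)
    if result.returncode == 0:
        logger.info("sqlcmd succeeded: %s", result.stdout.strip())
    else:
        logger.error("sqlcmd failed (rc=%s): %s", result.returncode, result.stderr.strip())
    return result.returncode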
Install Python Library in Databricks
Posted 2023-05-01.
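A minimal sketch of the usual notebook-scoped approach; the package and version are placeholders:

# notebook-scoped install, visible only to this notebook's session
%pip install requests==2.31.0

# restart the Python process so the newly installed library is importable
dbutils.library.restartPython()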
Mapping Common Data Types to Spark Data Types
Defines a data type mapping dictionary:

def __init__(self):
    self.DataTypeMappings = {
        'STRING_TYPES': ['char', 'comment', 'nchar', 'ntext', 'nvarchar',
                         'shortdescription', 'string', 'tokencode', 'varchar',
                         StringType()],
        … (preview truncated)

Posted 2023-04-19.
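Only the STRING_TYPES group survives the truncation. A hedged completion of the idea: each list holds source type names with the target Spark type as the last element, and a lookup method walks the groups. Every group except STRING_TYPES, and the lookup itself, are assumptions:

from pyspark.sql.types import StringType, IntegerType, DoubleType, TimestampType

class DataTypeMapper:
    def __init__(self):
        # source type names per group; the matching Spark type is the last element
        self.DataTypeMappings = {
            'STRING_TYPES': ['char', 'nchar', 'ntext', 'nvarchar', 'string', 'varchar', StringType()],
            'INT_TYPES': ['int', 'smallint', 'tinyint', IntegerType()],          # assumed
            'FLOAT_TYPES': ['decimal', 'float', 'real', DoubleType()],           # assumed
            'DATETIME_TYPES': ['datetime', 'smalldatetime', TimestampType()],    # assumed
        }

    def to_spark_type(self, source_type):
        for names in self.DataTypeMappings.values():
            if source_type.lower() in names[:-1]:
                return names[-1]
        return StringType()  # fallback for unknown types

DataTypeMapper().to_spark_type('nvarchar')  # -> StringType()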
Common PySpark Methods for Offline Data Processing
Posted 2023-04-14.
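The preview carries no code; a short sketch of the kind of batch transformations such a post usually collects (paths and column names are placeholders):

from pyspark.sql.functions import col, when

df = spark.read.parquet("/mnt/data/orders")
cleaned = (df
           .filter(col("amount") > 0)                                 # drop bad rows
           .withColumn("is_large", when(col("amount") > 1000, True)
                                   .otherwise(False))                 # derive a flag
           .dropDuplicates(["order_id"]))                             # de-duplicate
cleaned.write.mode("overwrite").parquet("/mnt/data/orders_clean")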
DBFS CLI : 03-Load and Use Secrets
Load and Use Secrets

List current secret scopes:
databricks secrets list-scopes

Create a scope (databricks secrets create-scope --scope <scope-name>):
databricks secrets create-scope --scope demo-scope

Delete a scope:
databricks secrets delete-scope --scope demo-scope

Create or update a secret:
databricks secrets put … (preview truncated)

Posted 2022-04-23.
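Reading a secret back inside a notebook goes through dbutils.secrets; the key name below is a placeholder:

# the returned value is redacted if printed in a notebook
password = dbutils.secrets.get(scope="demo-scope", key="db-password")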
DBFS CLI : 02-Common File Operation Commands
DBFS CLI

List files on DBFS (databricks fs ls):
databricks fs ls
databricks fs ls dbfs:/mnt

Show the contents of a file (databricks fs cat):
databricks fs cat dbfs:/tmp/my-file.txt
Apache Spark is awesome!

Copy files (databricks fs cp):
databricks fs cp dbfs:/tmp/your_file.txt dbfs:/parent… (preview truncated)

Posted 2022-04-23.
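The same operations are available from a notebook through dbutils.fs, which avoids leaving Databricks; the paths are placeholders:

display(dbutils.fs.ls("dbfs:/mnt"))                              # list files
print(dbutils.fs.head("dbfs:/tmp/my-file.txt"))                  # show file contents
dbutils.fs.cp("dbfs:/tmp/your_file.txt", "dbfs:/tmp/copy.txt")   # copy a file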
DBFS CLI : 01-Setting up the CLI
Setting up the CLI

Install:
pip install databricks-cli

Connect to Databricks:
Get the host URL from the portal; it generally looks like
https://adb-79479539573402579589.1.databricks.azure.cn/
Create a token via Settings > User Settings > Access Tokens > Generate New Token.
Run databricks configure --token, then enter the host and token when prompted.

Posted 2022-04-22.
14-Spark Dynamic Partition Overwrite
Notes
First adjust the configuration:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

When writing to a partitioned table, be careful with column order: the partition columns must go last, and if there are multiple partition columns their order must match the table's as well.

def re_arrange_partition_columns(df, partition_columns):
    '''
    df : input DataFrame (spark.DataFrame)
    partition_columns : … (preview truncated)

Posted 2022-04-22.
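The helper is cut off in the preview; a hedged completion consistent with its docstring, simply moving the partition columns to the end:

def re_arrange_partition_columns(df, partition_columns):
    # keep non-partition columns first, then append partition columns in their given order
    other_columns = [c for c in df.columns if c not in partition_columns]
    return df.select(other_columns + partition_columns)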
13-Set Time Zone
Common syntax
SET TIME ZONE LOCAL
SET TIME ZONE 'timezone_value'
SET TIME ZONE INTERVAL interval_literal

Parameters
LOCAL: set the time zone to the one specified in the java user.timezone property, or to the environment variable TZ if user.timezone is undefined, or to… (preview truncated)

Posted 2022-04-22.
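The three forms, sketched from a notebook; the zone values are examples:

spark.sql("SET TIME ZONE LOCAL")
spark.sql("SET TIME ZONE 'Asia/Shanghai'")
spark.sql("SET TIME ZONE INTERVAL 8 HOURS")
spark.conf.get("spark.sql.session.timeZone")   # check the current session time zone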
12-Delta Lake
Create/Write a Delta Table
eventsDF = spark.read.parquet(eventsPath)

Convert data to a Delta table using the schema provided by the DataFrame. Save the Delta table to a path:
deltaPath = f"{delta_path}/delta-events"
eventsDF.write.format("delta").mode("overwrite… (preview truncated)

Posted 2022-04-22.
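A hedged completion of the truncated write, plus reading the table back:

# finish the truncated chain: write the DataFrame out as a Delta table
eventsDF.write.format("delta").mode("overwrite").save(deltaPath)

# read it back
deltaDF = spark.read.format("delta").load(deltaPath)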
11-Aggregating Streams
Reading Data
display(dbutils.fs.ls('/mnt/training/ecommerce/events/events-2020-07-03.json'))

schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_pre… (preview truncated)

Posted 2022-04-22.
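A sketch of a streaming aggregation over this data; the path and schema come from the preview, the aggregation itself is an assumption:

df = (spark.readStream
      .schema(schema)                      # the schema string defined above
      .option("maxFilesPerTrigger", 1)     # simulate a stream from static files
      .json("/mnt/training/ecommerce/events/"))

countsDF = df.groupBy("device").count()    # running count per device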
10-Streaming Query
readStream
schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: S… (preview truncated)

Posted 2022-04-22.
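A sketch of starting and stopping a streaming query against this schema; the sink and query name are assumptions:

streamingDF = (spark.readStream
               .schema(schema)
               .json("/mnt/training/ecommerce/events/"))

query = (streamingDF.writeStream
         .format("memory")           # in-memory sink, convenient for exploration
         .queryName("events_stream")
         .start())

query.stop()                         # stop the stream when done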
09-Partitioning
Get partitions and cores
Use an rdd method to get the number of DataFrame partitions:
df = spark.read.parquet(eventsPath)
df.rdd.getNumPartitions()

Access SparkContext through SparkSession to get the number of cores or slots. SparkContext is also provided… (preview truncated)

Posted 2022-04-22.
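The cores line is cut off; the usual way to read it, plus adjusting partitions to match, as a short sketch:

cores = spark.sparkContext.defaultParallelism   # number of cores/slots available
repartitionedDF = df.repartition(cores)         # full shuffle to a new partition count
coalescedDF = df.coalesce(4)                    # reduce partitions without a full shuffle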
08-UDFs
User-Defined Functions
- Define a function
- Create and apply UDF
- Register UDF to use in SQL
- Use Decorator Syntax (Python Only)
- Use Vectorized UDF (Python Only)

Methods
- UDF Registration (spark.udf): register
- Built-In Functions: udf
- Python UD… (preview truncated)

Posted 2022-04-22.
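A compact sketch covering the listed steps; the DataFrame, column, and table names are placeholders:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# define a plain Python function
def first_letter(email):
    return email[0]

# create and apply a UDF
first_letter_udf = udf(first_letter, StringType())
df.select(first_letter_udf(col("email")))

# register the UDF so it can be used in SQL (assumes a users view exists)
spark.udf.register("first_letter_sql", first_letter, StringType())
spark.sql("SELECT first_letter_sql(email) FROM users")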
07-Complex Types
Extract item details
df = spark.read.parquet(salesPath).select('email', 'items')
display(df)

Here the items column is an array containing one or more structs.
explode: expands an array so that one row becomes multiple rows, one per element.
split: splits a string on a delimiter.

from pyspark.sql.functions import *
detailsDF = (df.withColumn("items", explode… (preview truncated)

Posted 2022-04-22.
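A hedged completion of the truncated chain: explode the array, then reach into the resulting struct (the struct field names are assumptions):

from pyspark.sql.functions import explode, split, col

detailsDF = (df
             .withColumn("items", explode("items"))                  # one row per array element
             .select("email", "items.item_name", "items.price")      # struct fields, names assumed
             .withColumn("details", split(col("item_name"), " ")))   # split the name on spaces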
06-Datetimes
Datetime Functions
- Current Date/Timestamp
- Cast to timestamp
- Format datetimes
- Extract from timestamp
- Convert to date
- Manipulate datetimes

Methods
- Column: cast
- Built-In Functions: date_format, to_date, date_add, year, month, dayofweek, mi… (preview truncated)

Posted 2022-04-22.
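A sketch exercising the listed methods; the column names are placeholders:

from pyspark.sql.functions import date_format, to_date, date_add, year, col

datetimeDF = (df
              .withColumn("ts", col("event_timestamp").cast("timestamp"))   # cast to timestamp
              .withColumn("formatted", date_format("ts", "yyyy-MM-dd"))     # format
              .withColumn("year", year("ts"))                               # extract
              .withColumn("date", to_date("ts"))                            # convert to date
              .withColumn("plus_2_days", date_add("ts", 2)))                # manipulate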
05-Aggregation
Grouping data
Use the DataFrame groupBy method to create a grouped data object. This grouped data object is called RelationalGroupedDataset in Scala and GroupedData in Python.
df.groupBy("geo.state", "geo.city")

Grouped data methods
Various aggregate metho… (preview truncated)

Posted 2022-04-22.
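The aggregate methods are cut off; a sketch of the usual follow-up, using the ecommerce columns seen elsewhere in this series (user_id is a placeholder):

from pyspark.sql.functions import sum, avg, approx_count_distinct

stateAggDF = (df.groupBy("geo.state")
              .agg(sum("ecommerce.total_item_quantity").alias("total_purchases"),
                   avg("ecommerce.purchase_revenue_in_usd").alias("avg_revenue"),
                   approx_count_distinct("user_id").alias("distinct_users")))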
04-Functions
join
Joins two DataFrames on a given condition. The related crossJoin returns a Cartesian product.

Parameters
- other: DataFrame. Right side of the join.
- on: str, list or Column, optional. A string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a str… (preview truncated)

Posted 2022-04-22.
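The three accepted forms of on, sketched with placeholder DataFrames:

# a single column name
orders.join(users, "user_id")

# a list of column names
orders.join(users, ["user_id", "region"])

# a join expression (Column), with an explicit join type
orders.join(users, orders.user_id == users.user_id, "left")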
03-DataFrame & Column
Construct columns
A column is a logical construction that will be computed based on the data in a DataFrame using an expression.
Construct a new column based on the input columns existing in a DataFrame:
from pyspark.sql.functions import col
col("device")
d… (preview truncated)

Posted 2022-04-21.
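A short sketch of building a column expression and attaching it to a DataFrame (column names are placeholders):

from pyspark.sql.functions import col

rev = col("ecommerce.purchase_revenue_in_usd")    # an expression; nothing is computed yet
revDF = df.withColumn("revenue_k", rev / 1000)   # evaluated when the DataFrame is
revDF.select("device", "revenue_k").show()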
02-SparkSQL
Common methods

1. spark.sql: run a SQL statement
sqlstr = """
select store_code, store_name, location
from store
where country = 'CN'
order by id
"""
spark.sql(sqlstr)

2. show & display: inspect data
df.show()
display(df)   # recommended

3. Create a DataFrame from an existing table: spark.table & spar… (preview truncated)

Posted 2022-04-21.
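The third item is cut off at spark.table; its usual form (the truncated second name is likely spark.read.table, which behaves the same):

storeDF = spark.table("store")   # DataFrame from an existing registered table
display(storeDF)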
01-Read&Write
Reader
Read from CSV files. spark.read.csv can also read CSV files and is actually the more common form.
Read from CSV with DataFrameReader's csv method and the following options: tab separator, use first line as header, infer schema.
file_csv = "/mnt/training/ecommerce/users/users-500k.csv"
df = (sp… (preview truncated)

Posted 2022-04-21.
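A hedged completion of the truncated reader chain using the three options named above:

df = (spark.read
      .option("sep", "\t")            # tab separator
      .option("header", True)         # use first line as header
      .option("inferSchema", True)    # infer schema
      .csv(file_csv))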