Spark Load Data and Export Data

Load_Data_Command

1. Importing data: load a CSV file as a Spark temporary-table DataSource (no need to create the table in advance, which makes ad-hoc analysis easy)

This command imports a CSV file into a temporary table. The syntax is:

load data '<file_path>' table <table_name> [options(key=value)]

-- In a SQL job, select both statements and run them together
load data '/user/datacompute/platformtool/resources/169/latest/dc_load_data.csv' table tdl_spark_test options(header=true, inferSchema=true, delimiter=',');
select * from tdl_spark_test where type='login';
Note: the file path must be an HDFS path, and the CSV column names must not contain spaces.

The options clause is optional; its keys and values are the same as the option values accepted by spark.read (see the PySpark sketch after this list):

  • header: when set to true, the first line of the files is used to name the columns and is not included in the data. All types are assumed to be string. Default value is false.
  • delimiter: by default columns are delimited using ',', but the delimiter can be set to any character.
  • quote: by default the quote character is ", but it can be set to any character. Delimiters inside quotes are ignored.
  • escape: by default the escape character is \, but it can be set to any character. Escaped quote characters are ignored.
  • parserLib: by default it is "commons", but it can be set to "univocity" to use that library for CSV parsing.
  • mode: determines the parsing mode. By default it is PERMISSIVE. Possible values are:
    • PERMISSIVE: tries to parse all lines; nulls are inserted for missing tokens and extra tokens are ignored.
    • DROPMALFORMED: drops lines that have fewer or more tokens than expected, or tokens that do not match the schema.
    • FAILFAST: aborts with a RuntimeException if it encounters any malformed line.
  • charset: defaults to 'UTF-8', but can be set to other valid charset names.
  • inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default.
  • comment: skip lines beginning with this character. Default is "#". Disable comments by setting this to null.
  • nullValue: specifies a string that indicates a null value; any fields matching this string will be set to null in the DataFrame.
  • dateFormat: specifies the date format to use when reading dates or timestamps. Custom date formats follow the patterns in java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default it is null, which means dates and timestamps are parsed via java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
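For orientation, the load above corresponds roughly to the following PySpark read (a minimal sketch, assuming the platform simply forwards these options to spark.read; the variable names are illustrative):

# Sketch: a plausible spark.read equivalent of the load data example above
# (an assumption about the implementation, not the platform's actual code).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
      .option("header", "true")       # first line supplies column names
      .option("inferSchema", "true")  # one extra pass to infer column types
      .option("delimiter", ",")       # field separator
      .csv("/user/datacompute/platformtool/resources/169/latest/dc_load_data.csv"))
df.createOrReplaceTempView("tdl_spark_test")  # now queryable from SQL, as above
spark.sql("select * from tdl_spark_test where type='login'").show()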

Reference: https://github.com/databricks/spark-csv

2. Exporting table data to a CSV file

export syntax:

export table tablename [PARTITION (part_column="value"[, ...])] TO 'export_file_name.csv' [options(key=value)]

The table can be either a Hive table or a temporary table created with createOrReplaceTempView in the same job. After the export, the file is saved under Resource Management -> My Resources. The exported CSV uses a comma as the field delimiter by default; to use a different delimiter, set delimiter in options. The available options are the same as listed above.

Example:

export table raw_activity_flat PARTITION (year=2018, month=3, day=12) TO 'activity_20180312.csv' options(delimiter=';')
# This code can only be used in a Python job
sparkSession.sql("select email, idnumber, wifi from raw_activity_flat where year=2018 and month=3 and partnerCode='qunaer'").createOrReplaceTempView("tdl_raw_activity_qunaer")
sparkSession.sql("export table tdl_raw_activity_qunaer TO 'activity_20180312.csv'")
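Where the export command is not available, a comparable result can be approximated with the standard DataFrame writer (a sketch under stated assumptions: coalesce(1) to produce a single part file and the /tmp output directory are illustrative choices, not the platform's actual export logic):

# Rough open-source-Spark equivalent of the export example above
# (an assumption, not the platform's implementation).
df = sparkSession.sql(
    "select * from raw_activity_flat "
    "where year=2018 and month=3 and day=12")
(df.coalesce(1)                     # one partition -> one output part file
   .write
   .option("delimiter", ";")        # matches options(delimiter=';')
   .option("header", "true")
   .mode("overwrite")
   .csv("/tmp/activity_20180312"))  # hypothetical output directory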

 

Reposted from: https://my.oschina.net/u/3204727/blog/2251083
