Writing PySpark data to Hive: saving a list of Rows to a Hive table in PySpark

I have a PySpark app. I copied a Hive table to my HDFS directory, and in Python I run a sqlContext.sql query against this table. The result is a DataFrame I call rows. I need to randomly shuffle the rows, so I converted them to a list of rows with rows_list = rows.collect(). I then call shuffle(rows_list), which shuffles the list in place, and take the number of random rows I need, x:

for r in range(x):
    allrows2add.append(rows_list[r])
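
For reference, the collect-shuffle-take step described above, written out end to end as a minimal sketch (the query, sqlContext, and x are assumed from the surrounding context):

from random import shuffle

rows = sqlContext.sql("SELECT * FROM hivetemp")   # DataFrame of Rows
rows_list = rows.collect()                        # pull everything to the driver
shuffle(rows_list)                                # in-place shuffle

allrows2add = []
for r in range(x):                                # keep x random rows
    allrows2add.append(rows_list[r])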

Now I want to save allrows2add as a Hive table OR append it to an existing Hive table (whichever is easier). The problem is that I cannot do this:

all_df = sc.parallelize(allrows2add).toDF()  # can't do this, the schema can't be inferred

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

without supplying the whole schema. The schema of rows has 117 columns, so I don't want to type them out by hand. Is there a way to extract the schema of rows to help me turn allrows2add into a DataFrame, or otherwise save it as a Hive table?

I can run rows.printSchema(), but I'm not sure how to turn that output into a schema I can store in a variable and pass to toDF() without having to parse all of that text.
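
For reference, the schema printed by printSchema() is also available programmatically as rows.schema, a StructType that can be kept in a variable and passed to createDataFrame, roughly like this sketch:

schema = rows.schema          # StructType describing all 117 columns
all_df = sqlContext.createDataFrame(allrows2add, schema=schema)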

Thanks

Edit: adding the for-loop info.

# Table is a list of Rows from a small Hive table I loaded using:
# query = "SELECT * FROM Table"
# Table = sqlContext.sql(query).collect()

from pyspark.sql.functions import lit

for i in range(len(Table)):
    val1 = Table[i][0]
    val2 = Table[i][1]
    count = Table[i][2]
    x = 100 - count

    # hivetemp is a table that I copied from Hive to my HDFS using:
    # CREATE EXTERNAL TABLE IF NOT EXISTS hivetemp LIKE hivetableIwant2copy LOCATION "/user/name/hiveBackup";
    # INSERT OVERWRITE TABLE hivetemp SELECT * FROM hivetableIwant2copy;

    query = "SELECT * FROM hivetemp WHERE col1<>\"" + val1 + "\" AND col2 ==\"" + val2 + "\" ORDER BY RAND() LIMIT " + str(x)
    rows = sqlContext.sql(query)
    rows = rows.withColumn("col4", lit(10))
    rows = rows.withColumn("col5", lit(some_string))

    # writing to parquet is painfully slow, and I can't work with pandas because the library isn't installed on the server
    rows.saveAsParquetFile("rows" + str(i) + ".parquet")

    # tried this before and it was also painfully slow
    # rows_list = rows.collect()
    # shuffle(rows_list)

Solution

When the schema can't be inferred, there's usually a reason. toDF is syntactic sugar for the createDataFrame function, which by default only uses the first 100 rows (despite the docs saying it only uses the first row) to determine what the schema should be. To change this, you can increase the sampling ratio to look at a greater percentage of your data:

df = rdd.toDF(sampleRatio=0.2)
# or...
df = sqlContext.createDataFrame(rdd, samplingRatio=0.2)

It's also possible that your random sample happened to only take rows with empty values for some particular columns. If this is the case, you can either create a schema from scratch like so:

from pyspark.sql.types import *

# all DataFrame rows are StructType
# can create a new StructType with combinations of StructField
schema = StructType([
    StructField("column_1", StringType(), True),
    StructField("column_2", IntegerType(), True),
    # etc.
])

df = sqlContext.createDataFrame(rdd, schema=schema)

Or, you can get the schema from the previous DataFrame you created by accessing the schema value:

df2 = sqlContext.createDataFrame(rdd, schema=df1.schema)
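
Applied to the original problem, the whole thing would look roughly like the sketch below; the table names here are placeholders, and it assumes sqlContext is a HiveContext on Spark 1.4+ so that the DataFrame writer API (write.saveAsTable / write.insertInto) is available:

all_df = sqlContext.createDataFrame(sc.parallelize(allrows2add), schema=rows.schema)

# create (or overwrite) a new Hive table...
all_df.write.mode("overwrite").saveAsTable("my_new_hive_table")

# ...or append to an existing Hive table with a matching column layout
all_df.write.insertInto("existing_hive_table")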

Note that if your RDD's rows are dictionaries or lists rather than StructType (a.k.a. Row) objects, you won't be able to create a DataFrame from them directly. If your RDD rows are dictionaries, you can convert them to Row objects like this:

# ** unpacks the dictionary, since the Row constructor
# only takes keyword arguments
rdd = rdd.map(lambda x: pyspark.sql.Row(**x))
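
For example, a small usage sketch of that conversion with made-up column names:

from pyspark.sql import Row

dict_rdd = sc.parallelize([{"column_1": "a", "column_2": 1},
                           {"column_1": "b", "column_2": 2}])
row_rdd = dict_rdd.map(lambda x: Row(**x))
df = sqlContext.createDataFrame(row_rdd)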
