python - Load CSV file with Spark

Latest recommended article published on 2021-06-23 17:48:25

Rucaz

Article link: https://blog.csdn.net/weixin_30722051/article/details/111959302

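I am new to Spark and I am trying to read CSV data from a file with Spark. Here is what I am doing:

    (sc.textFile('file.csv')
        .map(lambda line: (line.split(',')[0], line.split(',')[1]))
        .collect())

I expected this call to give me a list of the first two columns of my file, but I am getting this error:

    File "", line 1, in
    IndexError: list index out of range

even though my CSV file has more than one column.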

10 answers

126 votes

Spark 2.0.0+

You can use the built-in csv data source directly:

    spark.read.csv(
        "some_input_file.csv", header=True, mode="DROPMALFORMED", schema=schema
    )

or

    (spark.read
        .schema(schema)
        .option("header", "true")
        .option("mode", "DROPMALFORMED")
        .csv("some_input_file.csv"))

without including any external dependencies.

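Spark < 2.0.0:

Instead of manual parsing, which is far from trivial in the general case, I would recommend spark-csv:

Make sure that Spark CSV is included in the path (--packages, --jars, --driver-class-path)

and load your data as follows:

    df = (sqlContext
        .read.format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferschema", "true")
        .option("mode", "DROPMALFORMED")
        .load("some_input_file.csv"))

It handles loading, schema inference and dropping malformed lines, and it doesn't require passing data from Python to the JVM.

Note:

If you know the schema, it is better to avoid schema inference and pass the schema to DataFrameReader. Assuming you have three columns - integer, double and string: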

    from pyspark.sql.types import StructType, StructField
    from pyspark.sql.types import DoubleType, IntegerType, StringType

    schema = StructType([
        StructField("A", IntegerType()),
        StructField("B", DoubleType()),
        StructField("C", StringType())
    ])

    (sqlContext
        .read
        .format("com.databricks.spark.csv")
        .schema(schema)
        .option("header", "true")
        .option("mode", "DROPMALFORMED")
        .load("some_input_file.csv"))
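To make the combination of a fixed schema and DROPMALFORMED more concrete, here is a rough pure-Python analogy. It is not how Spark implements it, and Spark's exact rules (for example around nulls) differ in detail. Each row is cast to the three assumed column types, and rows that do not fit are dropped. The name `parse_with_schema` and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical schema mirroring the A/B/C example: (column name, caster) pairs.
SCHEMA = [("A", int), ("B", float), ("C", str)]

def parse_with_schema(lines, schema=SCHEMA):
    """Yield one dict per well-formed row and silently skip malformed rows."""
    for fields in csv.reader(lines):
        if len(fields) != len(schema):
            continue  # wrong number of columns
        try:
            yield {name: cast(value)
                   for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue  # a value could not be converted to its declared type

# Made-up sample: the second row has a bad integer, the third is too short.
sample = io.StringIO("1,2.5,foo\nnot_an_int,3.0,bar\n4,5.0\n7,8.0,baz\n")
rows = list(parse_with_schema(sample))
print(rows)
```

Only the first and last sample rows survive, which is roughly the behaviour to expect from DROPMALFORMED when the declared types do not match the data.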

zero323 answered 2019-04-12T12:32:17Z

51 votes

Are you sure that all the lines have at least 2 columns? Can you try something like this, just to check?

    sc.textFile("file.csv") \
        .map(lambda line: line.split(",")) \
        .filter(lambda line: len(line) > 1) \
        .map(lambda line: (line[0], line[1])) \
        .collect()

Alternatively, you could print the culprit (if any):

    sc.textFile("file.csv") \
        .map(lambda line: line.split(",")) \
        .filter(lambda line: len(line) <= 1) \
        .collect()
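One plausible cause, though not confirmed for the asker's file, is a blank line or a line without a comma: splitting such a line gives a single-element list, so index 1 is out of range. The two filters above can be reproduced without Spark on a made-up list of lines (`first_two` and `lines` are names invented for this sketch):

```python
# Made-up lines: one blank line and one line without any comma.
lines = ["a,b,c", "", "only_one_field", "x,y"]

def first_two(line):
    parts = line.split(",")
    return (parts[0], parts[1])  # raises IndexError when len(parts) < 2

# Same conditions as the two Spark filters above.
culprits = [line for line in lines if len(line.split(",")) <= 1]
good = [first_two(line) for line in lines if len(line.split(",")) > 1]

print(culprits)
print(good)
```

Note that `"".split(",")` returns `[""]`, a list of length 1, which is why an empty trailing line is enough to trigger the error.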

G Quintana answered 2019-04-12T12:30:28Z

12 votes

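Simply splitting by comma will also split commas that are within fields (e.g. a,b,"1,2,3",c), so it's not recommended. zero323's answer is good if you want to use the DataFrames API, but if you want to stick to base Spark, you can parse CSVs in base Python with the csv module: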

    # works for both python 2 and 3
    import csv

    rdd = sc.textFile("file.csv")
    rdd = rdd.mapPartitions(lambda x: csv.reader(x))
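The difference between the two approaches can be checked without Spark. The sketch below runs a naive split and csv.reader over one made-up line with a quoted field:

```python
import csv

line = 'a,b,"1,2,3",c'

# Naive split breaks the quoted field apart and keeps the quote characters.
naive = line.split(",")

# csv.reader accepts any iterable of strings and honours the quoting.
parsed = next(csv.reader([line]))

print(naive)
print(parsed)
```

Because csv.reader takes any iterable of strings, it can be handed the iterator that mapPartitions passes for each partition.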

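Edit: As @muon mentioned in the comments, this will treat the header like any other row, so you'll need to extract it manually. For example, header = rdd.first(); rdd = rdd.filter(lambda x: x != header) (make sure not to modify header before the filter evaluates). But at this point, you're probably better off using a built-in csv parser.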
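The filter compares every row against the header. A different approach, not from the answer above, is to drop only the first line of the first partition with mapPartitionsWithIndex. This assumes the header is the first line of partition 0, which should hold for a single file read with sc.textFile but not for several files that each carry their own header. The helper name `drop_header` is hypothetical, and the Spark call appears only in a comment so that the sketch runs without Spark:

```python
import csv
from itertools import islice

def drop_header(index, lines):
    """Skip the first line of partition 0; pass other partitions through."""
    return islice(lines, 1, None) if index == 0 else lines

# With Spark this would be used roughly as follows (assumes an existing
# SparkContext named `sc`):
#   rdd = (sc.textFile("file.csv")
#            .mapPartitionsWithIndex(drop_header)
#            .mapPartitions(csv.reader))

# Plain-Python simulation of two partitions of the same made-up file.
partitions = [["name,age", "alice,30"], ["bob,25"]]
rows = [row
        for i, part in enumerate(partitions)
        for row in csv.reader(drop_header(i, iter(part)))]
print(rows)
```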

Galen Long answered 2019-04-12T12:32:53Z

11 votes

Another option is to read the CSV file with Pandas and then import the Pandas DataFrame into Spark.

For example:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    import pandas as pd

    sc = SparkContext('local', 'example')  # if using locally
    sql_sc = SQLContext(sc)

    pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
    # pandas_df = pd.read_csv('file.csv', names=['column 1', 'column 2'])  # if no header
    s_df = sql_sc.createDataFrame(pandas_df)

JP Mercier answered 2019-04-12T12:33:23Z

11 votes

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    # the keyword for the field delimiter is `sep`, not `separator`
    df = spark.read.csv("/home/stp/test1.csv", header=True, sep="|")
    print(df.collect())

y durga prasad answered 2019-04-12T12:33:45Z

3 votes

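This is in line with what JP Mercier initially suggested about using Pandas, but with a major modification: if you read the data into Pandas in chunks, it should be more malleable. That means you can parse a file much larger than Pandas could handle in a single piece and pass it to Spark in smaller portions. (This also answers the comment about why one would want to use Spark if everything can be loaded into Pandas anyway.)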

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    import pandas as pd

    sc = SparkContext('local', 'example')  # if using locally
    sql_sc = SQLContext(sc)

    Spark_Full = sc.emptyRDD()
    chunk_100k = pd.read_csv("Your_Data_File.csv", chunksize=100000)
    # if you have headers in your csv file:
    headers = list(pd.read_csv("Your_Data_File.csv", nrows=0).columns)

    for chunky in chunk_100k:
        Spark_Full += sc.parallelize(chunky.values.tolist())

    YourSparkDataFrame = Spark_Full.toDF(headers)
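The chunking idea itself does not depend on Pandas. As an illustration only, here is a standard-library sketch that yields fixed-size batches of rows from a CSV source. The name `iter_chunks`, the chunk size and the sample data are hypothetical. Each batch corresponds to what would be handed to sc.parallelize in the loop above.

```python
import csv
import io
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield (header, rows) batches holding at most chunk_size rows each."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    while True:
        rows = list(islice(reader, chunk_size))
        if not rows:
            return
        yield header, rows

# Made-up data: three rows, read two at a time.
source = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
chunks = [rows for _, rows in iter_chunks(source, chunk_size=2)]
print(chunks)
```

Unlike pd.read_csv, this keeps every value as a string, so any type conversion would have to happen before or after the data reaches Spark.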