Ways to process large files of 50-100 GB

1. Use a distributed framework, such as Spark, covered in the previous post

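Only a cluster really has the advantage here. In local single-machine mode Spark could use only 8 GB of memory, so the strengths of RDDs did not come into play; fortunately it still runs multiple partitions and multiple tasks.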

2. Use pandas chunks, which is no slower than single-machine Spark

import pandas as pd

# With chunksize set, read_json returns an iterator of DataFrames rather than a single DataFrame
df_chunk = pd.read_json('F://total.json', chunksize=1000000, lines=True, encoding='utf-8')
chunk_list = []  # append each chunk df here
#%%
# Each chunk is in df format
for i, chunk in enumerate(df_chunk, start=1):
    # perform data filtering
    # chunk_filter = chunk_preprocessing(chunk)

    # Once the data filtering is done, append the chunk to list
    # chunk_list.append(chunk_filter)
    chunk_list.append(chunk)
    print("current chunk: {}".format(i))

# concat the list into dataframe
df_concat = pd.concat(chunk_list)

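With 1,000,000 rows per chunk, this filled 16 GB of memory. The method above collects the chunks in a list, so the processed data held in that list cannot exceed your machine's memory, which is a limitation.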
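
If only an aggregate is needed, one way around this limit is to reduce each chunk as it arrives and keep just the running result, so no list of chunks stays in memory. The following is a minimal sketch under assumptions: the `kind` column, the tiny in-memory sample, and the helper name `count_by_key` are all hypothetical and only stand in for the real file and its fields.

```python
import io
from collections import Counter

import pandas as pd


def count_by_key(source, key, chunksize=1000000):
    """Stream a JSON-lines source and count rows per value of `key`.

    Only one chunk plus the running Counter is held in memory at a time.
    """
    totals = Counter()
    for chunk in pd.read_json(source, lines=True, chunksize=chunksize):
        # Reduce the chunk to a small per-value count, then discard the chunk
        for value, count in chunk[key].value_counts().items():
            totals[value] += int(count)
    return dict(totals)


# Hypothetical sample standing in for the large file
sample = io.StringIO(
    '{"kind": "a"}\n{"kind": "b"}\n{"kind": "a"}\n{"kind": "a"}\n{"kind": "b"}\n'
)
print(count_by_key(sample, "kind", chunksize=2))
```

The same pattern applies to filtering: writing each filtered chunk to disk inside the loop, rather than appending it to a list, keeps memory use bounded by the chunk size.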

3. Use Dask DataFrame, a distributed pandas

import dask
import dask.dataframe as dd
from dask.distributed import Client
client = Client(processes=False, threads_per_worker=4, n_workers=4, memory_limit='12GB')
#%%
df = dd.read_csv("F://total2.csv", blocksize=25e6,encoding='utf-8',dtype='object')
#%%
first_row = df.head(1)  # read the first row once instead of once per column
for i in df.columns:
    print("{}".format(first_row[i]))

#%%
logs =  'Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.\n\n+----------------------------------+--------+----------+\n| Column                           | Found  | Expected |\n+----------------------------------+--------+----------+\n| check.0.reportorphone            | object | float64  |\n| damagetypecode                   | object | float64  |\n| lossmain.0.handlercode           | object | float64  |\n| lossmain.0.repairbrandcode       | object | float64  |\n| lossmain.0.repairbrandname       | object | float64  |\n| lossmain.0.repairfactorycode     | object | float64  |\n| lossmain.0.repairfactoryname     | object | float64  |\n| lossthirdparty.0.insurecomcode   | object | float64  |\n| lossthirdparty.0.losscarkindname | object | float64  |\n| lossthirdparty.0.thirdcarlinker  | object | float64  |\n| lossthirdparty.0.vinno           | object | float64  |\n| phonenumber                      | object | int64    |\n| prplcitemcar.0.brandid           | object | float64  |\n| prplcitemcar.0.brandname1  '
print(logs)

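The fix for the error in the log above: pass dtype='object' to dd.read_csv, as in the call above, so every column is read as a string and no dtype has to be inferred.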
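
Reading everything as `object` is the bluntest fix. A narrower alternative is to override only the columns named in the error table and let the rest be inferred. The sketch below parses such a table into a `dtype` mapping. The helper `dtype_overrides` and the regular expression are illustrative assumptions, not part of Dask, and the sample uses only a few complete rows from the log above.

```python
import re

# One table row: | column name | found dtype | expected dtype |
ROW = re.compile(
    r"^\|\s*([^\s|]+)\s*\|\s*([^\s|]+)\s*\|\s*([^\s|]+)\s*\|\s*$",
    re.MULTILINE,
)


def dtype_overrides(log_text):
    """Map each mismatched column to the dtype that was actually found."""
    overrides = {}
    for column, found, _expected in ROW.findall(log_text):
        if column == "Column":  # skip the table header row
            continue
        overrides[column] = found
    return overrides


sample_log = (
    "+----------------------------------+--------+----------+\n"
    "| Column                           | Found  | Expected |\n"
    "+----------------------------------+--------+----------+\n"
    "| check.0.reportorphone            | object | float64  |\n"
    "| damagetypecode                   | object | float64  |\n"
    "| phonenumber                      | object | int64    |\n"
)
overrides = dtype_overrides(sample_log)
print(overrides)

# The mapping can then be passed instead of dtype='object', for example:
# df = dd.read_csv("F://total2.csv", blocksize=25e6, encoding='utf-8', dtype=overrides)
```

Rows that are cut off in the log, such as the last one above, do not match the pattern and are skipped, so their columns would still need to be added by hand.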
