How do I open a CSV inside a ZIP that is inside another ZIP, using Python?

Latest recommended article published on 2023-12-02 13:28:41

张大新



I have been using a user-defined function to open CSV files contained within a ZIP file, which has been working very well for me.

Now I am trying to open a CSV file which is contained within a ZIP, which is contained in another ZIP, and have run into some trouble.

Instead of getting the expected output of a dataframe with the data from a CSV, I am getting this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

That sort of makes sense, because I am passing a ZIP file, not a CSV file, to read_csv().

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
from io import BytesIO
import zipfile

import pandas as pd

def fetch_multi_csv_zip_from_url(url, filenames=(), *args, **kwargs):
    assert kwargs.get('compression') is None
    req = urlopen(url)
    zip_file = zipfile.ZipFile(BytesIO(req.read()))
    if filenames:
        names = zip_file.namelist()
        for filename in filenames:
            if filename not in names:
                raise ValueError(
                    'filename {} not in {}'.format(filename, names))
    else:
        filenames = zip_file.namelist()
    return {name: pd.read_csv(zip_file.open(name), *args, **kwargs)
            for name in filenames}

final_links_list = [
    'http://www.nemweb.com.au/REPORTS/ARCHIVE/Dispatch_SCADA/PUBLIC_DISPATCHSCADA_20170523.zip',
    'http://www.nemweb.com.au/REPORTS/ARCHIVE/Dispatch_SCADA/PUBLIC_DISPATCHSCADA_20170524.zip']

l = len(final_links_list)
for j in range(0, l):
    print(j)
    dfs = fetch_multi_csv_zip_from_url(final_links_list[j])

This is the code that I have been using, and I gather that I have to change the line starting with:

return {name: pd.read_csv(zip_file.open(name)

because zip_file.open(name) no longer returns a CSV file, but a ZIP file.
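The situation described above can be reproduced without any download. The following is only a minimal sketch: the helper names (make_zip, describe_members) and the sample file names are hypothetical and not part of the original code. It builds a nested archive in memory and uses zipfile.is_zipfile to show which members are themselves ZIP archives and therefore cannot be handed to read_csv() directly.

```python
import zipfile
from io import BytesIO


def make_zip(members):
    """Build an in-memory ZIP archive from a {name: bytes} mapping (hypothetical helper)."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def describe_members(data):
    """Return {member name: True if that member is itself a ZIP archive}."""
    with zipfile.ZipFile(BytesIO(data)) as archive:
        return {
            name: zipfile.is_zipfile(BytesIO(archive.read(name)))
            for name in archive.namelist()
        }


# Made-up sample data: one CSV wrapped in an inner ZIP, plus one plain CSV.
inner = make_zip({"sample.csv": b"a,b\n1,2\n"})
outer = make_zip({"inner.zip": inner, "plain.csv": b"x,y\n3,4\n"})

report = describe_members(outer)
print(report)
```

Members reported as True need to be opened with zipfile.ZipFile again before their contents can be parsed as CSV.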

Solution

This could be done with a bit of recursion. If a file inside a ZIP is found to be a ZIP file, then make a recursive call to extract CSV files:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
from io import BytesIO
import zipfile

import pandas as pd

# Dictionary holding all the dataframes from all zip/zip/csvs
dfs = {}

def zip_to_dfs(data):
    zip_file = zipfile.ZipFile(BytesIO(data))
    for name in zip_file.namelist():
        if name.lower().endswith('.csv'):
            dfs[name] = pd.read_csv(zip_file.open(name))
        elif name.lower().endswith('.zip'):
            zip_to_dfs(zip_file.open(name).read())

def get_zip_data_from_url(url):
    req = urlopen(url)
    zip_to_dfs(req.read())

final_links_list = [
    'http://www.nemweb.com.au/REPORTS/ARCHIVE/Dispatch_SCADA/PUBLIC_DISPATCHSCADA_20170523.zip',
    'http://www.nemweb.com.au/REPORTS/ARCHIVE/Dispatch_SCADA/PUBLIC_DISPATCHSCADA_20170524.zip']

for link in final_links_list:
    print(link)
    get_zip_data_from_url(link)

# Display the first couple of dataframes
for name, df in sorted(dfs.items())[:2]:
    print('\n', name, '\n')
    print(df)

This would display the following:

http://www.nemweb.com.au/REPORTS/ARCHIVE/Dispatch_SCADA/PUBLIC_DISPATCHSCADA_20170524.zip

PUBLIC_DISPATCHSCADA_201705240010_0000000283857084.CSV

C NEMP.WORLD DISPATCHSCADA AEMO PUBLIC 2017/05/24 \
0 I DISPATCH UNIT_SCADA 1.0 SETTLEMENTDATE DUID
1 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:10:00 BARCSF1
2 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:10:00 BUTLERSG
.. .. ... ... ... ... ...
263 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:10:00 YWPS3
264 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:10:00 YWPS4
265 C END OF REPORT 267 NaN NaN NaN

00:05:08 0000000283857084 DISPATCHSCADA.1 0000000283857078
0 SCADAVALUE NaN NaN NaN
1 0 NaN NaN NaN
2 8.299998 NaN NaN NaN
.. ... ... ... ...
263 388.745570 NaN NaN NaN
264 391.568360 NaN NaN NaN
265 NaN NaN NaN NaN

[266 rows x 10 columns]

PUBLIC_DISPATCHSCADA_201705240015_0000000283857169.CSV

C NEMP.WORLD DISPATCHSCADA AEMO PUBLIC 2017/05/24 \
0 I DISPATCH UNIT_SCADA 1.0 SETTLEMENTDATE DUID
1 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:15:00 BARCSF1
2 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:15:00 BUTLERSG
.. .. ... ... ... ... ...
263 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:15:00 YWPS3
264 D DISPATCH UNIT_SCADA 1.0 2017/05/24 00:15:00 YWPS4
265 C END OF REPORT 267 NaN NaN NaN

00:10:08 0000000283857169 DISPATCHSCADA.1 0000000283857163
0 SCADAVALUE NaN NaN NaN
1 0 NaN NaN NaN
2 8.299998 NaN NaN NaN
.. ... ... ... ...
263 386.205080 NaN NaN NaN
264 389.592410 NaN NaN NaN
265 NaN NaN NaN NaN

[266 rows x 10 columns]
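As a variation on the recursive idea, the sketch below returns a dictionary instead of filling a module-level one, and keys each result by its full path inside the nested archives so that equal file names in different ZIPs do not overwrite each other. Everything named here (collect_csvs, rows_from_csv, make_zip, the sample file names) is hypothetical and introduced only for illustration. To keep the example runnable without pandas or a network connection, it parses with the standard-library csv module and builds its test archive in memory.

```python
import csv
import io
import zipfile


def rows_from_csv(handle):
    """Stand-in parser: decode a binary file handle and return its rows as lists."""
    text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
    return list(csv.reader(text))


def collect_csvs(data, parse=rows_from_csv, prefix=""):
    """Recursively parse every CSV in a ZIP archive, descending into nested ZIPs.

    Keys are member paths, with nested archive names joined by '/'.
    """
    results = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            key = prefix + name
            lowered = name.lower()
            if lowered.endswith(".csv"):
                with archive.open(name) as handle:
                    results[key] = parse(handle)
            elif lowered.endswith(".zip"):
                # The member is another archive: recurse on its raw bytes.
                results.update(collect_csvs(archive.read(name), parse, key + "/"))
    return results


def make_zip(members):
    """Build an in-memory ZIP archive from a {name: bytes} mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Made-up sample data: a CSV inside an inner ZIP, plus a CSV at the top level.
inner = make_zip({"DATA.CSV": b"a,b\n1,2\n"})
outer = make_zip({"inner.zip": inner, "top.csv": b"x\n9\n"})

tables = collect_csvs(outer)
print(sorted(tables))
```

Since pd.read_csv accepts binary file-like objects, calling collect_csvs(data, parse=pd.read_csv) on the downloaded bytes should produce DataFrames in place of row lists, although that combination is not exercised here.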
