python 合并dataframe有多条key对应只保留,Python Pandas：仅旋转DataFrame中的某些列，同时保留其他列...

最新推荐文章于 2024-09-05 20:27:39 发布

Yangtze扬子

最新推荐文章于 2024-09-05 20:27:39 发布

阅读量120

点赞数

文章标签： python 合并dataframe有多条key对应只保留

I am trying to re-arrange a DataFrame that I automatically read in from a json using Pandas. I've searched but have had no success.

I have the following json (saved as a string for copy/paste convenience) with a bunch of json objects/dictionarys under the tag 'value'

json_str = '''{"preferred_timestamp": "internal_timestamp",

"internal_timestamp": 3606765503.684,

"stream_name": "ctdpf_j_cspp_instrument",

"values": [{

"value_id": "temperature",

"value": 9.8319

}, {

"value_id": "conductivity",

"value": 3.58847

}, {

"value_id": "pressure",

"value": 22.963

}]

}'''

I use the function 'json_normalize' in order to load the json into a flattened Pandas dataframe.

>>> from pandas.io.json import json_normalize

>>> import simplejson as json

>>> df = json_normalize(json.loads(json_str), 'values', ['preferred_timestamp', 'stream_name', 'internal_timestamp'])

>>> df

value value_id preferred_timestamp internal_timestamp \

0 9.83190 temperature internal_timestamp 3.606766e+09

1 3.58847 conductivity internal_timestamp 3.606766e+09

2 22.96300 pressure internal_timestamp 3.606766e+09

3 32.89470 salinity internal_timestamp 3.606766e+09

stream_name

0 ctdpf_j_cspp_instrument

1 ctdpf_j_cspp_instrument

2 ctdpf_j_cspp_instrument

3 ctdpf_j_cspp_instrument

Here is where I am stuck. I want to take the value and value_id columns and pivot these into new columns based off of value_id.

I want the dataframe to look like the following:

stream_name preferred_timestamp internal_timestamp conductivity pressure salinity temperature

ctdpf_j_cspp_instrument internal_timestamp 3.606766e+09 3.58847 22.96300 32.89470 9.83190

I've tried both the pivot and pivot_table Pandas functions and even tried to manually pivot the tables by using 'set_index' and 'stack' but it's not quite how I want it.

>>> df.pivot_table(values='value', index=['stream_name', 'preferred_timestamp', 'internal_timestamp', 'value_id'])

stream_name preferred_timestamp internal_timestamp value_id

ctdpf_j_cspp_instrument internal_timestamp 3.606766e+09 conductivity 3.58847

pressure 22.96300

salinity 32.89470

temperature 9.83190

Name: value, dtype: float64

This is close, but it didn't seem to pivot the values in 'value_id' into separate columns.

and

>>> df.pivot('stream_name', 'value_id', 'value')

value_id conductivity pressure salinity temperature

stream_name

ctdpf_j_cspp_instrument 3.58847 22.963 32.8947 9.8319

Close again, but it lacks the other columns that I want to be associated with this line.

I'm stuck here. Is there an elegant way of doing this or should I split the DataFrames and re-merge them to how I want?

解决方案

Your first attempt was nearly correct, just use columns='value_id' instead of including it in the index.

# Perform the pivot.

df = df.pivot_table(

values='value',

index=['stream_name', 'preferred_timestamp', 'internal_timestamp'],

columns='value_id'

)

# Formatting.

df.reset_index(inplace=True)

df.columns.name = None

This isn't an issue in your example data, but keep in mind that pivot_table will aggregate values if multiple values are pivoted to the same position (taking the mean by default).