How to merge two CSV files by the values in a column using pandas in Python

weixin_39966941

I have two CSV files, price and performance.

Here is the data layout of each:

Price:

Performance:

I import them into python using:

import pandas as pd

price = pd.read_csv("cpu.csv")
performance = pd.read_csv("geekbench.csv")

This works as intended. However, I am unsure how to create a new CSV file containing the rows where Price[brand + model] matches Performance[name].

I want to take:

Cores, tdp and price from Price

Score, multicore_score and name from Performance

and create a new CSV file from the columns above. The problem I've been having is finding a good way to match that ignores minor differences such as capitalization. I was looking into algorithms such as fuzzy string matching, but was not sure what the best option is.

This is my current attempt, which throws errors:

for i in range(len(price.index)):
    brand = (price.iloc[i, 0])
    model = (price.iloc[i, 1])
    print(model)
    print(performance)
    print(performance.query('name == brand+model'))

Thanks
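
A likely cause of the error in the loop above is that `DataFrame.query` treats bare names in the expression as column names, so `brand` and `model` are not found in `performance`; local Python variables have to be prefixed with `@`. The following is a minimal sketch of an exact, case-insensitive match. It assumes the performance name is simply the brand, a space, and the model. The sample rows and the `name_lower` helper column are hypothetical and stand in for the real files.

```python
import pandas as pd

# Hypothetical sample data; the real inputs are cpu.csv and geekbench.csv.
price = pd.DataFrame({"brand": ["AMD", "Intel"],
                      "model": ["Ryzen 5 3600", "Core i5-9400"]})
performance = pd.DataFrame({"name": ["amd ryzen 5 3600", "Intel Core i7-9700"],
                            "score": [1200, 1300]})

# Lowercased helper column so the comparison ignores capitalization.
performance["name_lower"] = performance["name"].str.lower()

matches = []
for brand, model in zip(price["brand"], price["model"]):
    full_name = f"{brand} {model}".lower()
    # '@' tells query() to read a local Python variable instead of a column.
    matches.append(performance.query("name_lower == @full_name"))

result = pd.concat(matches, ignore_index=True)
print(result)
```

This only finds names that are identical apart from case, so it does not help when the performance name contains extra words.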

Solution

I suggest the following:

import nltk
import pandas as pd

tokenizer = nltk.RegexpTokenizer(r'\w+')

price = pd.DataFrame({"brand": ["AMD", "AMD", "AMD", "AMD"],
                      "model": ["2650", "3800", "5150", "4200"],
                      "cores": [2, 4, 4, 4],
                      "tdp": [25, 25, 25, 25]})

performance = pd.DataFrame({"name": ["AMD Athlon 64 3200+",
                                     "AMD Athlon 64 X2 3800+",
                                     "AMD Athlon 64 X2 4000+",
                                     "AMD Athlon 64 X2 4200+"],
                            "score": [6, 5, 6, 18]})

# I break down the name in performance and suppress capital letters
performance["tokens"] = (performance["name"].str.lower()
                         .apply(tokenizer.tokenize))

# And the same for price
price["tokens"] = price["brand"] + " " + price["model"]
price["tokens"] = (price["tokens"].str.lower()
                   .apply(tokenizer.tokenize))

# cartesian product
price["key"] = 1
performance["key"] = 1
df = pd.merge(price, performance, on="key")

# define my criteria for match
n_match = 2
df["intersection"] = [len(set(a) & set(b))
                      for a, b in zip(df.tokens_x, df.tokens_y)]
df = df.loc[df["intersection"] >= n_match, :]

I redefined your datasets so that in this example we would have some matches. Here is what I have as a result:

   brand model  cores  ...  score                     tokens_y  intersection
5    AMD  3800      4  ...      5  [amd, athlon, 64, x2, 3800]             2
15   AMD  4200      4  ...     18  [amd, athlon, 64, x2, 4200]             2

[2 rows x 10 columns]
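
The same idea can probably be written without nltk. This is a sketch under a few assumptions: pandas 1.2 or later (for `how="cross"`), the `\w+` pattern applied through `Series.str.findall`, and a hypothetical `tokenize` helper and suffix names. It also keeps only the columns asked for in the question that exist in this sample data (`price` and `multicore_score` are not in the sample) and renders them as CSV.

```python
import pandas as pd

price = pd.DataFrame({"brand": ["AMD", "AMD", "AMD", "AMD"],
                      "model": ["2650", "3800", "5150", "4200"],
                      "cores": [2, 4, 4, 4],
                      "tdp": [25, 25, 25, 25]})
performance = pd.DataFrame({"name": ["AMD Athlon 64 3200+",
                                     "AMD Athlon 64 X2 3800+",
                                     "AMD Athlon 64 X2 4000+",
                                     "AMD Athlon 64 X2 4200+"],
                            "score": [6, 5, 6, 18]})

def tokenize(series):
    # Hypothetical helper: lowercase, then collect runs of word characters.
    return series.str.lower().str.findall(r"\w+")

price["tokens"] = tokenize(price["brand"] + " " + price["model"])
performance["tokens"] = tokenize(performance["name"])

# how="cross" builds the cartesian product without a dummy key column.
pairs = price.merge(performance, how="cross", suffixes=("_price", "_perf"))

n_match = 2
pairs["intersection"] = [len(set(a) & set(b))
                         for a, b in zip(pairs["tokens_price"], pairs["tokens_perf"])]
matched = pairs.loc[pairs["intersection"] >= n_match,
                    ["name", "cores", "tdp", "score"]]

# With no path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = matched.to_csv(index=False)
print(csv_text)
```

Note that the cartesian product grows as rows(price) × rows(performance), so this approach suits small to medium files.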

You can change the n_match criterion as needed. I set it to two because that seemed to be what this dataset requires.
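
If counting shared tokens is too coarse, a similarity cutoff is one alternative, since the question mentions fuzzy string matching. The sketch below uses `difflib.get_close_matches` from the standard library. The sample rows, the `best_match` helper, the `name_key` column, and the 0.6 cutoff are all hypothetical choices for illustration, and the cutoff would need tuning against the real data.

```python
import difflib
import pandas as pd

# Hypothetical sample data with small spelling differences.
price = pd.DataFrame({"brand": ["AMD", "Intel", "VIA"],
                      "model": ["Ryzen 5 3600", "Core i5 9400", "C7"],
                      "cores": [6, 6, 1]})
performance = pd.DataFrame({"name": ["amd ryzen 5 3600",
                                     "Intel Core i5-9400",
                                     "Intel Core i7-9700"],
                            "score": [1200, 1100, 1300]})

# Lowercased key so capitalization does not affect the similarity score.
performance["name_key"] = performance["name"].str.lower()
candidates = performance["name_key"].tolist()

def best_match(text, cutoff=0.6):
    # Return the most similar candidate, or None if nothing reaches the cutoff.
    found = difflib.get_close_matches(text.lower(), candidates, n=1, cutoff=cutoff)
    return found[0] if found else None

price["name_key"] = (price["brand"] + " " + price["model"]).map(best_match)

# Rows without a sufficiently close name are dropped before the join.
merged = price.dropna(subset=["name_key"]).merge(performance, on="name_key")
print(merged[["name", "cores", "score"]])
```

Unlike the token-overlap rule, this picks at most one performance row per price row, which may or may not be what the data calls for.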

Hope it helps
