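Dear readers: I am glad that my articles are being read, but neither original writing nor editing is easy, so if you repost this article, please credit the source and include a hyperlink to this article as well as to the author's blog: https://blog.csdn.net/vensmallzeng. If you find this article helpful, please give it a like as encouragement. My thanks to every reader. To contact the author, please note the email address: zengzenghe@gmail.com. Thank you for your cooperation!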
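I. Inserting a string into a string
Structure of the original table all_available_features_plus_new.txt: user id \001 feature...\001feature...feature \001 label
Problem: append features to the original feature table all_available_features_plus_new.txt, that is, insert the features to be appended right after the last feature before the label, then write the resulting samples to the new feature table all_available_features_plus_new_add.txt.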
# user_time_new is not defined in this snippet; it is assumed to be a dict
# built earlier that maps a user id to the list of features to append
new_lines = []
with open("all_available_features_plus_new.txt", 'r', encoding='utf-8') as f:
    for line in f:
        line_cut = line.split('\001')
        # Collect all features to append and join them into the string line_tmp,
        # each one followed by the \001 separator
        line_tmp = ''.join(str(i) + '\001' for i in user_time_new[line_cut[0]])
        # Insert line_tmp between the last existing feature and the label.
        # The label starts right after the last \001, so locating that position
        # with rfind works for labels of any length
        pos = line.rfind('\001') + 1
        line_new = line[:pos] + line_tmp + line[pos:]
        new_lines.append(line_new)
# Build the full text of all samples with the new features appended
lines_new = ''.join(new_lines)
Write the resulting samples to the new feature table all_available_features_plus_new_add.txt:
with open('all_available_features_plus_new_add.txt', 'w', encoding='utf-8') as file_handle:
    file_handle.write(lines_new)
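The insertion step can also be isolated into a small function, which makes it easy to check on a single line. This is a minimal sketch rather than the script above: the function name insert_before_label and the sample values are hypothetical, and it assumes every line contains at least one \001 separator with the label as the last field.

```python
SEP = '\001'

def insert_before_label(line, new_features, sep=SEP):
    """Insert new_features between the last existing feature and the label.

    Assumes the label is the last sep-separated field of the line.
    """
    # Position right after the last separator, i.e. where the label starts
    pos = line.rfind(sep) + 1
    # Each new feature is followed by a separator, so the label stays last
    addition = ''.join(str(x) + sep for x in new_features)
    return line[:pos] + addition + line[pos:]

# Hypothetical sample: user id, two features, label
sample = SEP.join(['u1', '0.5', '3', '1']) + '\n'
result = insert_before_label(sample, [7, 8])
print(result.rstrip('\n').split(SEP))
```

Slicing at the position of the last separator avoids converting the string to a list and does not depend on the length of the label.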
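II. Two ways to draw box plots
I recently needed box plots to show the distribution of orders across cities and used two approaches. The source code is kept here for future reuse.
1. Plotting each city separately (using plt.boxplot)
Code for this approach: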
#-*-coding:utf-8-*-
# ch is assumed to be a local helper module (not part of matplotlib) whose
# set_ch() configures a font that can render Chinese labels
import ch
import matplotlib.pyplot as plt

ch.set_ch()

filename = 'consume_order_city_top20.txt'
line_tmp = {}
with open(filename, 'r', encoding='UTF-8') as f:
    for line in f:
        # Three fields: city_name, city_order_num, mhotel_avg_nights_price
        line_cuts = line.strip().split('\001')
        # The prices are '&'-separated numbers; float() is safer than eval()
        line_tmp[line_cuts[0]] = [float(x) for x in line_cuts[2].split('&')]

for k, v in line_tmp.items():
    print(k, v)
    # Start a fresh figure so that each city gets its own plot
    plt.figure()
    plt.xlabel(k, fontsize=20.0)
    plt.boxplot(v)
    plt.savefig(k + ".png")
    plt.show()
Result of this approach (figure not shown):
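The parsing step of the script above can be sketched on its own, without any plotting. The function name parse_city_line and the sample record are hypothetical; the sketch assumes each line has exactly three \001-separated fields and that the third field holds '&'-separated numbers.

```python
def parse_city_line(line, sep='\001', price_sep='&'):
    """Split one record into (city_name, city_order_num, prices)."""
    # Raises ValueError if the line does not have exactly three fields
    city_name, city_order_num, prices = line.strip().split(sep)
    # float() only accepts numeric text, unlike eval(), which runs arbitrary code
    return city_name, city_order_num, [float(p) for p in prices.split(price_sep)]

# Hypothetical record, not taken from the real data file
record = 'Beijing\001120\001199.0&256.5&310\n'
print(parse_city_line(record))
```

The order count is kept as a string here because the plotting code only uses it as text in a title.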
2. Plotting pairwise comparisons (using df.plot.box)
Code for this approach:
#-*-coding:utf-8-*-
# ch is assumed to be a local helper module (not part of matplotlib) whose
# set_ch() configures a font that can render Chinese labels
import ch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ch.set_ch()

def read_txt(filename):
    """Return (city -> list of prices, city -> order count as a string)."""
    line_tmp = {}
    order_nums = {}
    with open(filename, 'r', encoding='UTF-8') as f:
        for line in f:
            # Three fields: city_name, city_order_num, mhotel_avg_nights_price
            line_cuts = line.strip().split('\001')
            order_nums[line_cuts[0]] = line_cuts[1]
            # The prices are '&'-separated numbers; float() is safer than eval()
            line_tmp[line_cuts[0]] = [float(x) for x in line_cuts[2].split('&')]
    return line_tmp, order_nums

# order-hotel-city is method 1 ("方法1")
# order-city is method 2 ("方法2")
filename1 = 'consume_order_hotel_city_tail20.txt'
filename2 = 'consume_order_city_tail20.txt'
line_tmp1, dict1 = read_txt(filename1)
line_tmp2, dict2 = read_txt(filename2)
# Optional color scheme; pass color=color to df.plot.box to apply it
color = dict(boxes='DarkGreen', whiskers='DarkOrange', medians='DarkBlue', caps='Gray')
for k, v in line_tmp1.items():
    s1 = pd.Series(np.array(v))
    s2 = pd.Series(np.array(line_tmp2[k]))
    df = pd.DataFrame({"方法1": s1, "方法2": s2})
    df.plot.box(title="单量为" + dict1[k],  # title: "order count is ..."
                # ylim=[0, 1.2],  # y-axis range
                vert=True,
                positions=[4, 6],  # x positions of the boxes, i.e. the spacing between them
                grid=True,
                # color=color,  # color scheme for the boxes
                )
    plt.ylabel("平均间夜价", fontsize=12.0)  # "average price per room-night"
    plt.ylim(0, 500)
    plt.xlabel(k, fontsize=12.0)
    plt.grid(linestyle="--", alpha=0.2)
    plt.savefig(k + ".png")
    plt.show()
Result of this approach (figure not shown):
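The loop above looks up line_tmp2[k] for every city in the first file, which raises KeyError if a city is missing from the second file. One possible guard, sketched here with a hypothetical function name and made-up data, is to pair only the cities that appear in both dictionaries:

```python
def pair_common_cities(prices1, prices2):
    """Return {city: (prices from method 1, prices from method 2)} for shared cities."""
    # Iterate over prices1 to keep its order; skip cities absent from prices2
    return {city: (v, prices2[city]) for city, v in prices1.items() if city in prices2}

# Hypothetical data standing in for line_tmp1 and line_tmp2
method1 = {'A': [100.0, 120.0], 'B': [90.0], 'C': [200.0, 210.0]}
method2 = {'A': [110.0], 'C': [190.0, 205.0, 220.0]}
pairs = pair_common_cities(method1, method2)
for city, (v1, v2) in pairs.items():
    print(city, v1, v2)
```

Whether a missing city should be skipped silently or reported depends on the data; skipping is only one option.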
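3. A few suggestions:
① If Chinese axis labels are not displayed correctly, see "https://blog.csdn.net/qingchunlizhi/article/details/59481608";
② If the range between the lower and upper quartiles of a box plot is too small to make out, use plt.ylim(*, *) to adjust the range of the y-axis so that the interquartile range becomes visible.
Learning a little every day and making progress together with you. These are Zengzeng's notes, to be continued.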
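For suggestion ②, the y-axis range can also be derived from the data instead of being chosen by hand. The sketch below is one possible way to do this, not the method used in this article: the function name quartile_ylim, the pad parameter, and the sample prices are all hypothetical.

```python
import statistics

def quartile_ylim(values, pad=1.0):
    """Return (low, high): the interquartile range widened by pad * IQR on each side."""
    # method='inclusive' interpolates linearly between the data points
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - pad * iqr, q3 + pad * iqr

# Hypothetical prices with one extreme outlier that would flatten the box
prices = [1, 2, 3, 4, 5, 6, 7, 8, 900]
low, high = quartile_ylim(prices)
print(low, high)
# With matplotlib, the limits could then be applied as: plt.ylim(low, high)
```

Limits chosen this way hide outliers that fall outside the range, so they suit plots whose purpose is to compare the boxes rather than the extremes.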