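Continuing the stock-concept clustering work: group individual stocks by concept, average each feature within a concept, and build a table that records one row of feature means per concept.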
import tushare as ts
from pandas import Series,DataFrame
import pandas as pd
from sqlalchemy import create_engine
import numpy as np
from sqlalchemy.types import VARCHAR
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/stock?charset=utf8')
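# Read the concept list and the per-stock table from the database
concept = pd.read_sql_query(''' select * from concept; ''' , engine)
totalBasic = pd.read_sql_query(''' select * from totalBasic; ''' , engine)
# t is an alias of totalBasic (not a copy); drop the stored 'index' column
t=totalBasic
t.drop(['index'],axis=1,inplace=True)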
# Extract the distinct concept names
c=concept.iloc[:,3].drop_duplicates(keep='first',inplace=False)
arrs=c.values
df=DataFrame(arrs)
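# Build a table to hold the per-concept information, seeded with one concept ('核电核能', nuclear power)
arrscode=((concept.loc[concept['c_name']=='核电核能']).iloc[:,1]).values
'''
The line above is equivalent to these three steps:
codes=concept.loc[concept['c_name']=='核电核能']
codesonly=codes.iloc[:,1]
arrscode=codesonly.values
'''
# Copy the matching rows so the in-place drop does not act on a slice of t
data=t[t["code"].isin(arrscode)].copy()
data.drop(['code'],axis=1,inplace=True)
mean=DataFrame(data.mean())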
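# Loop over every concept
for index, row in df.iterrows():
    # Collect the stock codes that belong to this concept
    codes2=concept.loc[concept['c_name']==row[0]]
    codesonly2=codes2.iloc[:,1]
    arrscode2=codesonly2.values
    # Select those stocks and average each feature
    # (copy first so the in-place drop does not act on a slice of t)
    data2=t[t["code"].isin(arrscode2)].copy()
    data2.drop(['code'],axis=1,inplace=True)
    mean2=data2.mean()
    # Add the result to the summary table as a column named after the concept
    mean[row[0]]=mean2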
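# Post-process the summary table: drop the duplicated first column
# (column 0 repeats the '核电核能' column added in the loop),
# transpose so that each row is a concept, and store it in the database
mean.drop([0],axis=1,inplace=True)
totalConcept1=mean.T
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/stock?charset=utf8')
totalConcept1.to_sql('totalConcept1',engine,if_exists='replace')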
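The per-concept loop can also be written as a single merge followed by a groupby. The following is a minimal sketch rather than the code used in this post: the tiny `concept` and `t` frames and the feature columns `pe` and `pb` are made up for illustration, and only the `code` and `c_name` column names come from the code above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (all values are made up)
concept = pd.DataFrame({
    "code": ["000001", "000002", "000003", "000001"],
    "c_name": ["A", "A", "B", "B"],
})
t = pd.DataFrame({
    "code": ["000001", "000002", "000003"],
    "pe": [10.0, 20.0, 30.0],
    "pb": [1.0, 2.0, 3.0],
})

# Attach each stock's features to every concept it belongs to,
# then average the features within each concept
merged = concept[["code", "c_name"]].merge(t, on="code", how="inner")
concept_means = merged.drop(columns=["code"]).groupby("c_name").mean()

print(concept_means)
```

The result already has one row per concept, so no transpose is needed. Unlike the `isin` filter, a merge counts a stock twice if the concept table lists the same (code, c_name) pair twice, so drop duplicates first if that can happen.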
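Selecting the rows of one table by the values in an array (the isin pattern above) comes from the post below; thanks to its author for sharing.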
https://blog.csdn.net/a19990412/article/details/79302501?utm_source=blogxgwz6
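For reference, the row-selection pattern in question is a boolean mask built with `Series.isin`. A minimal sketch with made-up data (the `stocks` frame and `wanted` array are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up table and an array of codes to look up
stocks = pd.DataFrame({
    "code": ["000001", "000002", "000003"],
    "pe": [10.0, 20.0, 30.0],
})
wanted = np.array(["000001", "000003"])

# isin returns a boolean Series; indexing with it keeps the matching rows
selected = stocks[stocks["code"].isin(wanted)]
print(selected)
```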
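The code above raises some errors, and in the end the table could not be written to the database. As a workaround I first saved the result as a CSV file, then read it back and wrote that to the database. I have not found the reason the direct write failed.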
totalConcept=pd.read_csv('F:/Result.csv')
totalConcept.rename(columns={'Unnamed: 0':'concept'}, inplace = True)
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/stock?charset=utf8')
totalConcept.to_sql('totalConcept',engine,if_exists='replace')
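The rename is needed because the CSV round trip turns the concept names into an unnamed first column. One way to name that column at write time is the `index_label` argument of `DataFrame.to_sql`. The sketch below uses an in-memory SQLite database purely so that it runs anywhere; it does not reproduce or explain the MySQL failure described above, and the sample values are made up.

```python
import sqlite3
import pandas as pd

# Made-up per-concept means, indexed by concept name
total_concept = pd.DataFrame(
    {"pe": [15.0, 20.0], "pb": [1.5, 2.0]},
    index=["A", "B"],
)

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL engine
# index_label sets the name of the index column in the SQL table
total_concept.to_sql("totalConcept", conn, if_exists="replace", index_label="concept")

read_back = pd.read_sql_query("select * from totalConcept;", conn)
conn.close()
print(read_back.columns.tolist())
```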
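That does not affect the next step, though: I now start clustering on the totalConcept table.
The clustering uses k-means. The code below draws on many people's shared posts, so I do not list them individually.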
# Clustering: load the per-concept table
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/stock?charset=utf8')
totalConcept = pd.read_sql_query(''' select * from totalConcept; ''' , engine)
totalConcept.drop(['index'],axis=1,inplace=True)
from sklearn.model_selection import train_test_split
# Split the table: X holds the feature columns, y the concept names (first column)
X, y = totalConcept.iloc[:, 1:].values, totalConcept.iloc[:, 0].values
from sklearn import preprocessing
# Min-max normalization (scale each feature to the range [0, 1])
min_max_scaler = preprocessing.MinMaxScaler()
X_norm = min_max_scaler.fit_transform(X)
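Min-max scaling maps each feature column to [0, 1] via (x - min) / (max - min). As a small illustration with made-up numbers (`X_demo` is hypothetical), the sketch below computes this by hand and compares it with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [4.0, 200.0]])

# Column-wise (x - min) / (max - min)
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(manual, scaled))
```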
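# k-means clustering
# inertia_: sum of squared distances from each sample to its nearest cluster centre
# Elbow method
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    # Note: this fits the raw X; the X_norm computed earlier is not used here.
    # Pass X_norm instead to cluster on the scaled features.
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1,11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()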
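As a check on what `inertia_` measures, the sketch below recomputes it by hand as the sum of squared distances from each sample to its assigned cluster centre. The data is synthetic (generated with `make_blobs`), not the concept table, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data, only for illustration
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

km_demo = KMeans(n_clusters=3, init="k-means++", n_init=10,
                 max_iter=300, random_state=0)
km_demo.fit(X_demo)

# Look up the centre each sample was assigned to, then sum the squared distances
assigned_centres = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = float(((X_demo - assigned_centres) ** 2).sum())

print(manual_inertia, km_demo.inertia_)
```

Because the distances are squared, features with large numeric ranges dominate the total unless the data is scaled first.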
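This produces the following plot.
Hmm, going by this plot the elbow method would pick k = 3, but something feels off. The next step is to take a closer look at the k-means algorithm itself.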
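When the elbow is hard to trust, one common cross-check is the silhouette score, which is a different criterion from the inertia curve used above. The sketch below is only an illustration on synthetic blobs with three well-separated, hand-picked centres; it says nothing about the right k for the concept table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three hand-picked, well-separated centres (made up for the demo)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centres,
                       cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest mean silhouette is the suggested cluster count
best_k = max(scores, key=scores.get)
print(best_k)
```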