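Stock Similarity Clustering
Data source: Tushare big-data community
Covariance-Based Clustering
We selected 35 stocks from the SSE 50 index. Their codes are: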
'600036sh', '600031sh', '601166sh', '600104sh', '600030sh', '601628sh', '601766sh', '601857sh', '601398sh', '601390sh', '600029sh', '600028sh', '601111sh', '600837sh', '600887sh', '600690sh', '600519sh', '600016sh', '601988sh', '601601sh', '600019sh', '601186sh', '600703sh', '600196sh', '601318sh', '600050sh', '600309sh', '600048sh', '600276sh', '601088sh', '600585sh', '600000sh', '601328sh', '601939sh', '600340sh'
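Of the many clustering techniques available in scikit-learn, we use Affinity Propagation, because it does not force clusters to be of equal size and it determines the number of clusters automatically from the data.
The daily price change (closing price minus opening price) serves as the information carrier. By adjusting the time window from which the samples are drawn, we can observe how similar the stocks are over a given period.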
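As a small illustration of that property, the sketch below runs scikit-learn's `AffinityPropagation` on synthetic 2-D points rather than on stock data. The blob centers, the `preference=-50` setting, and the random seed are assumptions chosen for the demo; they are not parameters taken from this project.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 300 points scattered around three centers.
X, _ = make_blobs(
    n_samples=300,
    centers=[[1, 1], [-1, -1], [1, -1]],
    cluster_std=0.5,
    random_state=0,
)

# No cluster count is passed in; the algorithm picks exemplars itself.
# `preference` only nudges how many exemplars emerge (lower means fewer).
model = AffinityPropagation(preference=-50, random_state=0).fit(X)

n_clusters = len(model.cluster_centers_indices_)
cluster_ids, cluster_sizes = np.unique(model.labels_, return_counts=True)

print("clusters found:", n_clusters)
print("cluster sizes:", dict(zip(cluster_ids.tolist(), cluster_sizes.tolist())))
```

The cluster sizes are whatever the data produces; nothing in the call constrains them to be equal.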
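import sys
import pandas as pd

# Local directory holding one CSV file per stock
url = 'D:/那四年/项目实训/暑期实训/git/sz50/{}.csv'
quotes = []
for symbol in symbols:
    print('Fetching quote history for %r' % symbol, file=sys.stderr)
    # nrows is the time window: one row per day, counting back from the most recent day
    quotes.append(pd.read_csv(url.format(symbol), nrows=500))

# Closing price minus opening price, used as the information carrier
variation = close_prices - open_prices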
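The excerpt above does not show how `close_prices` and `open_prices` are built from `quotes`, or how the covariance is obtained. The following is one possible sketch. It uses randomly generated frames in place of the CSV files, and it assumes the columns are named `open` and `close`. The placeholder symbols and the standardization step are likewise assumptions, not details confirmed by the source.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
symbols = ["AAA", "BBB", "CCC"]  # hypothetical placeholder symbols
n_days = 250

# Stand-in for pd.read_csv: one DataFrame per symbol, one row per trading day.
# The column names "open" and "close" are assumed.
quotes = []
for _ in symbols:
    open_ = 10 + rng.normal(size=n_days)
    close = open_ + rng.normal(scale=0.2, size=n_days)
    quotes.append(pd.DataFrame({"open": open_, "close": close}))

# Stack into arrays of shape (n_symbols, n_days).
open_prices = np.vstack([q["open"].to_numpy() for q in quotes])
close_prices = np.vstack([q["close"].to_numpy() for q in quotes])

# Close minus open, used as the information carrier.
variation = close_prices - open_prices

# Standardize each stock's series so that volatile stocks do not dominate,
# then estimate the empirical covariance between stocks.
X = variation.T.copy()  # shape (n_days, n_symbols)
X /= X.std(axis=0)
covariance = np.cov(X, rowvar=False)  # shape (n_symbols, n_symbols)

print(variation.shape, covariance.shape)
```

A matrix like `covariance` can then be handed to a clustering routine as a measure of similarity between stocks.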
The way the stocks cluster changes with the time window chosen.
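The window comparison can be sketched as follows, again on synthetic data. The helpers `make_fake_quotes` and `cluster_window` are hypothetical names introduced for this illustration. The correlation matrix (the covariance of the standardized series) is used here as the similarity measure, which is an assumption about the setup rather than the project's confirmed pipeline. The `open`/`close` column names, `damping`, and `max_iter` values are also assumed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation


def make_fake_quotes(n_symbols=6, n_days=500, seed=0):
    """Hypothetical helper: synthetic quotes driven by two hidden factors."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(2, n_days))
    frames = []
    for i in range(n_symbols):
        move = factors[i % 2] + 0.5 * rng.normal(size=n_days)
        open_ = np.full(n_days, 10.0)
        frames.append(pd.DataFrame({"open": open_, "close": open_ + move}))
    return frames


def cluster_window(frames, n_days):
    """Hypothetical helper: cluster stocks using only the first n_days rows.

    Rows are assumed to be ordered newest first, which mirrors how
    read_csv(..., nrows=n_days) selects the most recent days.
    """
    variation = np.vstack(
        [(f["close"] - f["open"]).to_numpy()[:n_days] for f in frames]
    )
    # Each row is one stock, so this is a stock-by-stock similarity matrix.
    similarity = np.corrcoef(variation)
    model = AffinityPropagation(
        affinity="precomputed", damping=0.9, max_iter=1000, random_state=0
    )
    return model.fit(similarity).labels_


frames = make_fake_quotes()
results = {n_days: cluster_window(frames, n_days) for n_days in (500, 100)}
for n_days, labels in results.items():
    print(f"past {n_days} days:", labels)
```

With real quotes, the loop would call `pd.read_csv` with a different `nrows` for each window instead of slicing synthetic frames.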
Time window of the past 500 days:
Time window of the past 100 days: