May 2020_韭浪

[Original] PCA on the Iris dataset

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
%matplotlib inline
# Column names
cols = [ 'sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm', 'class label']
df = pd.read_csv

2020-05-30 13:38:44 340

[Repost] LDA on the Iris dataset

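Introduction: In pattern classification and machine learning practice, Linear Discriminant Analysis (LDA) is often used as the dimensionality reduction step in data preprocessing. LDA projects the dataset onto a lower-dimensional space while preserving good class separability, which helps avoid overfitting (the "curse of dimensionality") and reduces computational cost. In 1936 (The Use of Multiple Measurements in Taxonomic Problems), Ronald A. Fisher proposed the linear discriminant (Linear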

2020-05-29 10:47:21 2419

[Original] Datasets used for machine learning

Shared on OneDrive; access may require a VPN: https://1drv.ms/u/s!AraXLAZSHa2okTclNU_hkV-yNKvS?e=E9Fxs6

2020-05-27 17:32:20 173

[Original] Computing correlation coefficients with numpy.corrcoef

numpy.corrcoef(x, y=None, rowvar=True)
x: (array_like). When rowvar=True, rows are features and columns are records; rowvar=False is the reverse.
y: (array_like, optional). An additional set of features and values, with the same shape as x.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
beer = pd.read_csv(r"...\data\beer_data.txt", sep='

2020-05-27 16:28:02 3628
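
As a small illustration of the `rowvar` behaviour described in the numpy.corrcoef entry above, the sketch below uses a made-up 3x5 array (not the post's beer dataset, whose contents are not shown here) and checks that transposing the data together with `rowvar=False` gives the same matrix.

```python
import numpy as np

# Hypothetical data: 3 features (rows) x 5 records (columns).
x = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with row 0
    [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with row 0
])

# rowvar=True (default): each row is a feature, so the result is 3x3.
corr_rows = np.corrcoef(x)

# rowvar=False: each column is a feature, so pass the transposed data.
corr_cols = np.corrcoef(x.T, rowvar=False)
```

Both calls return the same 3x3 matrix; entry `[0, 1]` is close to 1 and entry `[0, 2]` is close to -1 for this data.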

[Original] KMeans and DBSCAN clustering in practice

Unsupervised problems
KMeans algorithm
Optimization objective: $\min \sum_{i=1}^{K} \sum_{x \in C_i} dist(c_i, x)^2$

2020-05-27 14:11:12 603
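
The KMeans objective in the entry above can be evaluated directly for a given assignment. The sketch below assumes Euclidean distance for `dist` and uses a hypothetical helper name (`kmeans_objective`) and made-up points; it only computes the objective value and does not run the clustering itself.

```python
import numpy as np

def kmeans_objective(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = points - centers[labels]  # centers[labels] picks the center for each point
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two clusters of two points each.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

objective = kmeans_objective(points, labels, centers)
```

Every point here lies at distance 1 from its center, so the objective is 4.0.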

[Original] Spell checker

import re, collections
def get_words(text):
    """Extract all words, converted to lowercase"""
    return re.findall('[a-z]+', text.lower())
def train(words):
    """Count word frequencies"""
    model = collections.defaultdict(lambda: 1)  # unseen words default to value 1
    for w in words:
        model[w] += 1

2020-05-26 15:14:04 258

[Original] Machine learning project in practice: news classification

Corpus cleaning (stop words, duplicates)
Term Frequency: $TF = \frac{\text{number of times the term appears in the article}}{\text{article length}}$
Inverse Document Frequency: $IDF = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the term} + 1}\right)$
tf-idf = TF x IDF
Similarity..

2020-05-26 15:02:28 962
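
The TF and IDF formulas in the entry above can be written out in a few lines. This is a minimal sketch using only the standard library, with a made-up tokenized corpus and hypothetical function names (`tf`, `idf`, `tf_idf`); it follows the formulas as stated there, including the `+ 1` in the IDF denominator.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with +1 in the denominator, as in the formula."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus of already-tokenized documents.
corpus = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["market", "news", "today"],
    ["weather", "sunny", "today"],
]

score = tf_idf("stock", corpus[0], corpus)
```

For "stock" in the first document, TF is 1/3 and IDF is log(4 / 2), so the score is log(2) / 3.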

[Original] Machine learning project in practice: ensemble prediction of political donations

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
os.environ['PATH'] += ';...\\Graphviz2.38\\bin'  # temporary environment variable for Graphviz
Data preprocessing
### Import data
# Set the random seed
SEED = 222
np.random.seed(SEED)
df = pd.read_csv(r'..

2020-05-24 21:20:44 690

[Original] Machine learning project in practice: Titanic survival prediction

Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
%matplotlib inline
Data preprocessing
# Read the dataset
df = pd.read_csv(r'...\data\titanic_train.csv')
# df.head()
# Fill missing values with the median
df['Age'] = df['Age'].fillna(df['Ag

2020-05-23 14:55:22 512

[Original] Handling missing values in pandas

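# Drop rows in which every value is NaN
df.dropna(axis=0, how='all')
# Drop rows that contain any NaN
df.dropna(axis=0, how='any')
# Drop columns in which every value is NaN
df.dropna(axis=1, how='all')
# Drop columns that contain any NaN
df.dropna(axis=1, how='any')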

2020-05-23 12:49:30 268 1
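
To make the difference between the four `dropna` calls in the entry above concrete, the sketch below applies them to a small made-up DataFrame (the column names and values are assumptions chosen so that each call removes something different).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely NaN, column "c" is entirely NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

rows_all = df.dropna(axis=0, how='all')  # removes only the all-NaN row
rows_any = df.dropna(axis=0, how='any')  # every row has a NaN, so nothing is left
cols_all = df.dropna(axis=1, how='all')  # removes only the all-NaN column "c"
cols_any = df.dropna(axis=1, how='any')  # every column has a NaN, so none is left
```

None of these calls modifies `df`; each returns a new DataFrame unless `inplace=True` is passed.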

[Original] Study notes on decision tree models

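Computing entropy: $H(x) = -\sum p(i) \cdot \log(p(i))$. When p=0 or p=1, H=0 and the entropy is at its minimum; when p=0.5, H=1 and the entropy is at its maximum (for a two-class problem with a base-2 logarithm). Information gain measures how much a feature X reduces the uncertainty of the class Y. Suppose the entropy is originally 10 and drops to 8 after one decision; the information gain is then 2. We can therefore evaluate the entropy for every feature and see which one gives the largest information gain; that feature becomes the root node. In the same way, we keep searching the remaining features for the one with the largest information gain, and that feature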

2020-05-21 19:39:44 315
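
The entropy formula and the information gain idea from the entry above can be sketched as follows. This assumes a base-2 logarithm (which is what makes H=1 at p=0.5 for two classes) and uses hypothetical helper names and a made-up split; it is an illustration, not the post's own code.

```python
import math

def entropy(probs):
    """H = -sum(p * log2(p)); terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_probs, children):
    """Parent entropy minus the weighted entropy of the child nodes.

    children is a list of (weight, class_probabilities) pairs whose
    weights are the fraction of samples falling into each child.
    """
    weighted = sum(weight * entropy(probs) for weight, probs in children)
    return entropy(parent_probs) - weighted

h_even = entropy([0.5, 0.5])  # maximum uncertainty for two classes
h_pure = entropy([1.0, 0.0])  # a pure node has no uncertainty

# Hypothetical split that separates the two classes perfectly.
gain = information_gain([0.5, 0.5], [(0.5, [1.0, 0.0]), (0.5, [0.0, 1.0])])
```

A split that produces pure children removes all of the parent's uncertainty, so the gain here equals the parent entropy.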

[Original] Machine learning project in practice: credit card fraud detection (oversampling code)

import pandas as pd
from imblearn.over_sampling import SMOTE  # pip install imblearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
import

2020-05-21 18:08:45 558

[Repost] LogiReg_data.txt

34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489

2020-05-18 13:58:33 634

[Original] Running celery in the background

Using supervisor
vi supervisord.conf
[program:celeryd]
command=/home/.../python3/bin/celery -A celery_tasks.tasks worker -P eventlet --loglevel=INFO --concurrency=15
stdout_logfile={project directory}/celeryd.log
stderr_logfile={project directory}/celeryd.log
autostart=true
autorestart

2020-05-18 13:55:28 864 1

[Original] Machine learning project in practice: credit card fraud detection

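Model evaluation
Recall: $Recall = \frac{TP}{TP+FN}$
TP (true positives): positive samples classified as positive
FP (false positives): negative samples classified as positive
FN (false negatives): positive samples classified as negative
TN (true negatives): negative samples classified as negative
Regularization penalty
Keep the fluctuation of the model as small as possible; large fluctuation tends to cause overfitting (overfitting: the model performs well on the training set but poorly on the test set). By heavily penalizing models with large fluctuation, regularization can reduce the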

2020-05-15 22:43:54 488 4
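
The recall formula in the entry above follows directly from the TP and FN counts. Below is a minimal sketch with made-up labels (1 = positive, 0 = negative) and a hypothetical function name; scikit-learn's `recall_score` computes the same quantity, but plain Python keeps the counting visible.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Hypothetical labels: four positives, of which three are detected.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

score = recall(y_true, y_pred)
```

The false positive at index 4 does not affect recall, which only looks at how many of the true positives were found.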

[Original] Machine learning project in practice: predicting whether a student is admitted

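Modules to implement
sigmoid: function that maps a value to a probability
model: returns the predicted value (prediction function)
cost: computes the loss from the parameters
gradient: computes the gradient direction for each parameter
descent: updates the parameters
accuracy: computes the accuracy
Sigmoid function: $g(z) = \frac{1}{1 + e^{-z}}$
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Prediction function: hθ(x) = g(θᵀx) = 1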

2020-05-15 22:43:39 1692 1
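
The `sigmoid` and `model` modules listed in the entry above might fit together roughly as sketched here. The `model` body, the leading column of ones for the intercept, and the sample values are assumptions, since the post's own implementation is cut off in this listing.

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array) to the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

def model(X, theta):
    """Predicted probability for each row of X: g(theta^T x)."""
    return sigmoid(X @ theta)

# Hypothetical inputs: first column of ones acts as the intercept term.
X = np.array([[1.0, 34.6, 78.0],
              [1.0, 60.2, 86.3]])
theta = np.zeros(3)

probs = model(X, theta)  # with all-zero parameters every prediction is 0.5
```

Starting from all-zero parameters gives a probability of 0.5 for every sample, which is a common sanity check before running gradient descent.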

[Original] scrapy beginner study notes

1. Create a new project
scrapy startproject mySpider
# Project directory structure
mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
2. Define the target
vi mySpider/items.py
import scrapy
class

2020-05-15 22:43:14 67

[Original] Crawling Bing wallpapers with pyspider

Target page: https://bing.gifposter.com/list/new/desc/classic.html?p=1
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-05-13 09:56:10
# Project: bing
from pyspider.libs.base_handler import *
from time import strftime, strptime
import pymysql
cl

2020-05-15 22:43:03 145

[Original] CentOS 6 initial setup notes

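Switch to root
sudo su root
passwd
Create a new user
useradd -g root mj
root@: passwd mj
usermod -a -G root mj  # add the user to the administrative group
vi /etc/sudoers
mj ALL=(ALL:ALL) NOPASSWD: ALL
mj@: sudo chmod 777 -R mj  # change the permissions of the user's directory
The command line shows only **$*...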

2020-05-15 22:42:58 117

[Repost] Python time format conversion with time and datetime

2020-05-15 22:42:53 218

[Repost] Summary of common XPath and CSS selectors

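XPath selectors
Expression         Description
article            Selects all child nodes of every article element
/article           Selects the root element article
article/a          Selects all a elements that are children of article
//div              Selects all div elements (wherever they appear in the document)
article//div       Selects all div elements that are descendants of article, at any depth below it
//@class           Selects all attributes named class
/article/div[1]    Selects ... belonging to artic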

2020-05-15 22:42:48 231

[Repost] Installing pyspider on Linux and basic usage

Install pyspider
python==2.7
Upgrade pip
pip install --upgrade pip
Install pyspider
pip install pyspider
Install dependencies
yum install bzip2
yum install fontconfig
yum install curl
pip install mysql-connector
pip install redis
Open ports
phantomjs 25555
pyspider 5000
Install phantomjs
https://ph

2020-05-15 22:42:36 864

[Original] Opening ports with iptables-save and iptables-restore

# Generate the configuration file
/usr/sbin/iptables-save > /etc/sysconfig/iptables-config
# iptables-config contains five sections: nat, mangle, security, raw, filter
# vi /etc/sysconfig/iptables-config
# Add at the end of the filter section
-A IN_public_allow -p tcp -m tcp --dport {{port to open}} -m conntrack --ctstate NE

2020-05-15 22:42:29 877

[Repost] Several ways to implement the singleton pattern in Python, and their optimization

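Singleton pattern
The Singleton Pattern is a commonly used software design pattern whose main purpose is to ensure that only one instance of a given class exists. A singleton object is useful when a class should have exactly one instance across the whole system. For example, a server program stores its configuration in a file, and clients read that configuration through an AppConfig class. If many places need the configuration contents while the program is running, that is, many places need to create...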

2020-05-15 22:42:19 151
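
One possible way to get the single-instance behaviour described in the entry above is to override `__new__`. This is only a sketch: the reposted article covers several implementations that are not visible in this listing, the `settings` attribute is made up for illustration, and this version is not thread-safe.

```python
class AppConfig:
    """Configuration holder that always hands back the same instance."""

    _instance = None

    def __new__(cls):
        # Create the object only on the first call; reuse it afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.settings = {}  # hypothetical storage for config values
        return cls._instance


first = AppConfig()
second = AppConfig()
first.settings["debug"] = True  # visible through every reference
```

Because `first` and `second` refer to the same object, a value stored through one is immediately visible through the other.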

[Original] Basic crontab operations

# crontab log location
/var/log/cron
# System tasks
/etc/crontab
# Edit the crontab file
crontab -e
# Show the crontab file
crontab -l
# Restart crontab after making changes
crontab -u root /var/spool/cron/root
service crond restart
# error: /bin/sh^M:...

2020-05-15 22:42:10 166

[Repost] Scaling an image to its container size, centered vertically and horizontally

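.test {
    width: 350px;
    height: 350px;
    background-origin: content-box; /* start drawing the background from the content area */
    background-size: contain; /* scale the image proportionally to fit inside the container */
    background-position: 50% 50%; /* center the image vertically and horizontally*...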

2020-05-15 22:42:03 266

[Original] Uploading files to FastDFS with Python

Main settings in client.conf
base_path=/home/mj/fastdfs/tracker
tracker_server=120.26.176.89:22122
from fdfs_client.client import Fdfs_client
client = Fdfs_client(r'<path>/client.conf')
ret = client.upload...

2020-05-15 22:41:56 703

[Original] A flawless MySQL installation

Installing MySQL on Linux

2020-05-15 22:41:49 155

[Original] Installing nginx + uwsgi on CentOS 6

Environment setup
su root
# Install gcc, g++, make
apt-get install build-essential
# Install zlib, zlib-devel
apt-get install zlib1g zlib1g.dev
# Install openssl-dev
apt-get install libssl-dev
# Install the older openssl 1.0.1
cd /usr/local/src...

2020-05-15 22:41:41 188

[Original] Converting between arbitrary bases in Python

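Converting between arbitrary bases in Python
This example is practice with recursion: it converts numbers between different bases.
The results are for reference only, because the letter digits of base 16 are not handled.
def count(num, from_, to):
    s = []
    o_num = sum([int(i) * from_ ** n for n, i in enumerate(num[::-1])])
    print("base {}:{} --> base 10:{}".fo...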

2020-05-15 22:41:28 3885
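
The base-conversion entry above notes that hexadecimal letters are not handled. A recursive variant that does handle them could look like the sketch below; it is not the post's `count` function (which is truncated here), and the names `DIGITS`, `to_digits` and `convert` are made up. It relies on the built-in `int(text, base)` for parsing, which accepts bases 2 to 36.

```python
import string

# Digit characters for bases up to 36: '0'-'9' followed by 'a'-'z'.
DIGITS = string.digits + string.ascii_lowercase

def to_digits(value, base):
    """Recursively build the digit string of a non-negative integer."""
    if value < base:
        return DIGITS[value]
    return to_digits(value // base, base) + DIGITS[value % base]

def convert(num, from_base, to_base):
    """Convert a digit string from one base to another (bases 2 to 36)."""
    value = int(num, from_base)  # parse via base 10 as the intermediate form
    return to_digits(value, to_base)

binary = convert("ff", 16, 2)
hexadecimal = convert("255", 10, 16)
```

The recursion peels off the lowest digit at each step, so the most significant digit is produced by the innermost call.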

[Original] [python.threading/queue] Basic multithreading usage

import queue
import threading
import time
exitFlag = 0
class myThread (threading.Thread):
    def __init__(self, threadID, name, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.q = q

2020-05-15 22:41:03 89

weixin_43326122's blog