sklearn Data Preprocessing
Data Preprocessing
弎见
Collaborative Filtering and Latent Factor Model Recommender System, Example 1: Data Processing
Building a music recommender system
import pandas as pd
import numpy as np
import time
import sqlite3
data_home = 'F:/51学习/study/机器学习进阶/第14章Python从零开始构建音乐推荐系统/Python实现音乐推荐系统/'
triplet_dataset = pd.read_csv(filepath_or_buffe...
Original · 2020-02-15 18:13:36 · 638 views · 1 comment
LightGBM Restaurant Visitor Forecasting 1: Data Processing
Restaurant visitor data
import pandas as pd
air_visit = pd.read_csv('air_visit_data.csv')
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit.head()
# Aggregate by day
air_visit = air_visit.groupby('air_store_...
Original · 2019-12-22 23:21:33 · 1082 views · 2 comments
sklearn Text Feature Preprocessing 2: Similarity, Clustering, LDA, word2vec
Continued from the previous post, <sklearn Text Feature Preprocessing 1: WordPunctTokenizer, CountVectorizer, TF-IDF>
5. Similarity features
# Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix...
Original · 2019-11-15 22:07:11 · 907 views · 0 comments
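A minimal sketch of the cosine-similarity step named in this preview. The three-document corpus is made up for illustration, and `tv_matrix` is assumed to be a TF-IDF document-term matrix (the preview does not show how the post builds it).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; the post's own corpus is not shown in the preview.
docs = [
    "the sky is blue",
    "the blue sky is beautiful",
    "the quick brown fox",
]

# Assumption: tv_matrix is a TF-IDF document-term matrix.
tv_matrix = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between documents, shape (n_docs, n_docs).
similarity_matrix = cosine_similarity(tv_matrix)
print(similarity_matrix.round(3))
```

Each diagonal entry is 1 (a document compared with itself), and documents that share more terms score higher.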
sklearn Text Feature Preprocessing 1: WordPunctTokenizer, CountVectorizer, TF-IDF
Build a text dataset
import pandas as pd
import numpy as np
corpus = ['The sky is blue and beautiful.', 'Love this blue and beautiful sky!', 'The quick brown fox jumps over the lazy dog.', ...
Original · 2019-11-15 21:57:01 · 1193 views · 2 comments
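A short sketch of the bag-of-words and TF-IDF steps named in the title, using only the three sentences visible in the preview. The vectorizer settings are scikit-learn defaults, which may differ from what the full post uses.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Only the three sentences visible in the preview; the full corpus is longer.
corpus = [
    'The sky is blue and beautiful.',
    'Love this blue and beautiful sky!',
    'The quick brown fox jumps over the lazy dog.',
]

# Bag of words: one row per document, one column per lowercased token.
cv = CountVectorizer()
cv_matrix = cv.fit_transform(corpus)

# TF-IDF: the same layout, but counts are reweighted by inverse document
# frequency and each row is L2-normalized (the default norm).
tv = TfidfVectorizer()
tv_matrix = tv.fit_transform(corpus)

print(cv_matrix.shape)
print(cv_matrix[2, cv.vocabulary_['the']])  # count of "the" in the third sentence
```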
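sklearn Numeric Features: Date and Time Handling
import pandas as pd
import numpy as np
import datetime
from dateutil.parser import parse  # parse builds a datetime from a string; the format is flexible: English month and weekday names are accepted, with hyphens, commas, spaces, etc. as separators
import pytz  # time zones
time_stamps = ['2015-03-08 10:3...
Original · 2019-11-11 21:36:13 · 1688 views · 0 comments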
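A small sketch of what `parse` and `pytz` are typically used for here. The timestamp and the target time zone are made up, since the preview cuts off before the post's own data.

```python
from dateutil.parser import parse
import pytz

# Hypothetical timestamp written two ways; parse accepts both formats.
a = parse('2016-11-20 13:45:00')
b = parse('Nov 20, 2016 1:45 PM')

# Date and time parts that can serve as numeric features.
features = {
    'year': a.year,
    'month': a.month,
    'day': a.day,
    'hour': a.hour,
    'weekday': a.weekday(),  # Monday = 0 ... Sunday = 6
}

# Attach UTC to the naive datetime, then convert to another zone.
utc_dt = pytz.utc.localize(a)
tokyo_dt = utc_dt.astimezone(pytz.timezone('Asia/Tokyo'))

print(features)
print(tokyo_dt.isoformat())
```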
sklearn Numeric Features, Continuous Values 3: Log Transform and Box-Cox
import pandas as pd
import numpy as np
fcc_survey_df = pd.read_csv('fcc_2016_coder_survey_subset.csv',encoding='utf-8')
fcc_survey_df['Income_log'] = np.log(1 + fcc_survey_df['Income'])  # log transform
fcc_s...
Original · 2019-11-11 18:27:49 · 1746 views · 0 comments
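A minimal sketch of the two transforms in the title. The income values are invented, and `box_cox` is a hypothetical helper that writes out the standard Box-Cox formula by hand; the post itself may well call `scipy.stats.boxcox`, which can also estimate λ from the data.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)             # the limit of the formula as lam -> 0
    return (x ** lam - 1.0) / lam

# Hypothetical incomes; the survey data is not shown in the preview.
income = np.array([6000.0, 40000.0, 32000.0])

income_log = np.log(1 + income)      # log transform, as in the snippet
income_bc_0 = box_cox(income, 0)     # identical to a plain log
income_bc_half = box_cox(income, 0.5)

print(income_log.round(3))
print(income_bc_half.round(3))
```

Unlike `np.log(1 + x)`, Box-Cox requires strictly positive inputs, so zero incomes would need to be shifted or dropped first.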
sklearn Numeric Features, Continuous Values 2: Quantile Binning
import pandas as pd
fcc_survey_df = pd.read_csv('fcc_2016_coder_survey_subset.csv',encoding='utf-8')
fcc_survey_df[['ID.x','Age','Income']].iloc[2:7]
import matplotlib.pyplot as plt
import matplotl...
Original · 2019-11-11 18:22:02 · 1546 views · 0 comments
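One way quantile binning can be done in pandas, sketched on invented income values. The bin labels and the choice of quartiles are assumptions, not taken from the post.

```python
import pandas as pd

# Hypothetical incomes: 10000, 20000, ..., 80000.
income = pd.Series([1, 2, 3, 4, 5, 6, 7, 8]) * 10000

# Quartile boundaries of the data.
quantile_list = [0, 0.25, 0.5, 0.75, 1.0]
quantiles = income.quantile(quantile_list)

# Assign each value to its quartile; the labels are illustrative.
labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
income_bin = pd.qcut(income, q=quantile_list, labels=labels)

print(quantiles)
print(income_bin.tolist())
```

Because the edges come from the data's own quantiles, every bin receives roughly the same number of rows, which fixed-width bins do not guarantee on skewed data such as income.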
sklearn Numeric Features, Continuous Values 1: Binning Based on Rounding
import pandas as pd
fcc_survey_df = pd.read_csv('fcc_2016_coder_survey_subset.csv',encoding='utf-8')
fcc_survey_df[['ID.x','EmploymentField','Age','Income']].head()
import matplotlib.pyplot as plt...
Original · 2019-11-11 18:07:12 · 633 views · 0 comments
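A sketch of binning by rounding, assuming the idea is to group ages into decades by integer division. The ages are invented and the bin width of 10 is an assumption.

```python
import numpy as np

# Hypothetical ages; the survey's Age column is not shown in the preview.
ages = np.array([17, 38, 21, 53, 26, 60])

# Divide by the bin width and round down: 0-9 -> 0, 10-19 -> 1, 20-29 -> 2, ...
age_bin = np.floor(ages / 10.0).astype(int)

print(age_bin)
```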
sklearn Numeric Features, Polynomial Features: PolynomialFeatures
import pandas as pd
poke_df = pd.read_csv('Pokemon.csv', encoding='utf-8')
atk_df = poke_df[['Attack', 'Defense']]
atk_df.head()
from sklearn.preprocessing import PolynomialFeatures
pf = Polynomial...
Original · 2019-11-11 11:34:55 · 1165 views · 0 comments
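A small sketch of what `PolynomialFeatures` produces for two input columns. The numbers stand in for Attack and Defense, and the constructor arguments are assumptions, because the preview is cut off at `pf = Polynomial...`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical (Attack, Defense) pairs.
X = np.array([[2, 3],
              [4, 5]])

# Assumed settings: degree-2 terms, no constant column.
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(X)

# Output columns: x0, x1, x0^2, x0*x1, x1^2
print(res)
print(pf.powers_)  # exponent of each input column in each output column
```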
sklearn Numeric Features, Binarization: Binarizer
import pandas as pd
popsong_df = pd.read_csv('song_views.csv',encoding='utf-8')
popsong_df.head(10)
First, binarize directly with numpy:
import numpy as np
watched = np.array(popsong_df['listen_count'])
watched[watch...
Original · 2019-11-11 10:57:45 · 2167 views · 0 comments
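A sketch comparing the numpy route with `Binarizer`, on invented listen counts. The threshold of 0.9 is an assumption chosen so that any count of 1 or more becomes 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical listen counts; the song_views.csv data is not shown.
listen_count = np.array([0, 1, 12, 0, 3])

# Plain numpy: 1 if the song was played at least once, else 0.
watched = np.where(listen_count >= 1, 1, 0)

# Binarizer maps values strictly greater than the threshold to 1.
# It expects a 2-D array, hence the reshape.
bn = Binarizer(threshold=0.9)
bn_watched = bn.fit_transform(listen_count.reshape(-1, 1)).ravel()

print(watched)
print(bn_watched)
```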
sklearn Numeric Features, Discrete Values 4: get_dummies()
import pandas as pd
poke_df = pd.read_csv('Pokemon.csv', encoding='utf-8')
poke_df[['Name','Generation']].iloc[4:10]
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)  # drop the first...
Original · 2019-11-11 00:07:40 · 983 views · 0 comments
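A sketch of the effect of `drop_first=True`, on a made-up `Generation` column with three categories.

```python
import pandas as pd

# Hypothetical stand-in for the Pokemon data.
df = pd.DataFrame({'Generation': ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 1']})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(df['Generation'])

# drop_first=True removes the first category's column; a row of all
# zeros then stands for that dropped category ('Gen 1' here).
reduced = pd.get_dummies(df['Generation'], drop_first=True)

print(list(full.columns))
print(reduced.astype(int))
```

Dropping one column avoids perfectly collinear features, which matters mainly for linear models with an intercept.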
sklearn Numeric Features, Discrete Values 3: One-hot Encoding
import numpy as np
import pandas as pd
poke_df = pd.read_csv('Pokemon.csv', encoding='utf-8')
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]
from sklearn.preprocessing import LabelEncoder...
Original · 2019-11-10 23:50:37 · 1021 views · 0 comments
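A possible shape for the one-hot step, assuming the post first label-encodes the category and then passes the integer labels to `OneHotEncoder`. Only the `LabelEncoder` import is visible in the preview, so the second step is a guess, and the data is made up.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical Generation values.
generation = np.array(['Gen 1', 'Gen 3', 'Gen 2', 'Gen 1'])

# Step 1: map each category to an integer (in sorted order of the labels).
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(generation)

# Step 2: expand the integer labels into one binary column per category.
# fit_transform returns a sparse matrix by default; toarray() densifies it.
gen_ohe = OneHotEncoder()
gen_onehot = gen_ohe.fit_transform(gen_labels.reshape(-1, 1)).toarray()

print(gen_labels)
print(gen_onehot)
```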
sklearn Numeric Features, Discrete Values 2: Map
import numpy as np
import pandas as pd
poke_df = pd.read_csv('Pokemon.csv', encoding='utf-8')
poke_df.head(10)
# Random sampling
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
# pandas...
Original · 2019-11-10 22:07:26 · 852 views · 0 comments
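A sketch of ordinal encoding with `Series.map`, which is presumably what the title refers to. The dictionary `gen_ord_map` and the column name `GenerationLabel` are hypothetical, and the rows are invented.

```python
import pandas as pd

# Hypothetical stand-in for the shuffled Pokemon data.
df = pd.DataFrame({'Generation': ['Gen 2', 'Gen 1', 'Gen 6']})

# Hand-written mapping that preserves the natural order of the categories.
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

# map looks each value up in the dict; unmapped values would become NaN.
df['GenerationLabel'] = df['Generation'].map(gen_ord_map)

print(df)
```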
sklearn Numeric Features, Discrete Values 1: LabelEncoder
import pandas as pd
import numpy as np
vg_df = pd.read_csv('vgsales.csv', encoding = 'ISO-8859-1')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]
genres = np.unique(vg_df['Genre...
Original · 2019-11-10 21:45:19 · 913 views · 0 comments
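A minimal `LabelEncoder` sketch on invented genre values. The variable names are illustrative and not necessarily those used in the post.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical genres; the vgsales.csv data is not shown.
genres = ['Sports', 'Action', 'Racing', 'Action']

gle = LabelEncoder()
genre_labels = gle.fit_transform(genres)

# classes_ holds the sorted unique categories; the index is the label.
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}

print(genre_labels)
print(genre_mappings)
print(gle.inverse_transform([2, 0]))
```

The integers follow alphabetical order, not any meaningful ranking, so they suit tree-based models better than models that read the number as a magnitude.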
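sklearn Data Preprocessing: Standardization and Normalization (Study Notes)
Data preprocessing: standardization and normalization
Key points:
1: Standardization and normalization code
2: plt.tight_layout()
3: How to enter the math symbols μ and σ in a plot
Data preprocessing example
There are two ways to process the raw data:
1: Standardization (also called Z-score normalization): mean μ = 0, standard deviation σ = 1, z = (x − μ) / σ
2: Normalization M...
Original · 2019-09-13 23:48:59 · 734 views · 0 comments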
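A sketch of the two rescalings in plain numpy, on made-up values. The second method is assumed to be min-max scaling, since the preview is cut off at "M...".

```python
import numpy as np

# Hypothetical raw feature values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization (z-score): subtract the mean, divide by the standard
# deviation, so the result has mean 0 and standard deviation 1.
z = (x - x.mean()) / x.std()

# Assumed min-max normalization: rescale linearly into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

print(z.round(3))
print(x_minmax)
```

scikit-learn's `StandardScaler` and `MinMaxScaler` apply the same formulas column by column; `StandardScaler` uses the population standard deviation, which matches `np.std`'s default.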