Table of Contents
- Preparing data is time-consuming
- Data in the real world is dirty
- Preprocessing data to avoid "garbage in, garbage out"
- Data Preprocessing
- Standardization
- Standard Scaler $\frac{x_i - \mu}{\sigma}$
- Min-Max Scaler $\frac{x_i - \min(x)}{\max(x) - \min(x)}$
- Robust Scaler $\frac{x_i - \mathrm{median}(x)}{IQR_{(1,3)}(x)}$
- Feature Engineering
![figure](https://i-blog.csdnimg.cn/blog_migrate/770da8870ec72cfe09350089058e5624.png)
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools…
Preparing data is time-consuming
What data scientists spend the most time doing
Messy data is by far the most time-consuming aspect of the typical data scientist’s workflow
Data in the real world is dirty
Data is rarely clean, and data quality issues are common.
The typical data quality issues that arise are:
- Incomplete: Data lacks attribute values or contains missing values.
- Noisy: Data contains erroneous records or outliers.
- Inconsistent: Data contains conflicting records or discrepancies.
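As a quick illustration (a minimal pandas sketch; the toy DataFrame and the age thresholds are my own assumptions), each issue can be surfaced with a one-liner:

```python
import numpy as np
import pandas as pd

# toy data exhibiting all three issues
df = pd.DataFrame({'age': [25, np.nan, 31, 250],          # missing value and an outlier
                   'country': ['US', 'us', 'DE', 'DE']})  # inconsistent casing

print(df.isnull().sum())                          # incomplete: missing values per column
print(df[(df['age'] < 0) | (df['age'] > 120)])    # noisy: implausible ages
print(df['country'].str.upper().value_counts())   # inconsistent: normalize, then count
```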
Preprocessing data to avoid “garbage in, garbage out”
Preprocessing data - Clean your data
Why deal with missing values?
- Missing values in a dataset can be due to error or because observations were not recorded.
- When missing values are present, certain algorithms may not work, or you may not get the desired result.
- Missing data affects some models more than others.1
- Even models that can handle missing data may be sensitive to it (missing data for certain variables can result in poor predictions).2
1&2 Source: https://channel9.msdn.com/Events/OpenSourceTW/DevDays-Asia-2017/AI12
How to deal with missing values?
Typical missing value handling methods are (a short pandas sketch follows the list):
- Deletion: remove records with missing values.
- Dummy substitution: replace missing values with a dummy value, e.g., unknown for categorical or 0 for numerical values.
- Mean substitution: if the missing data is numerical, replace the missing values with the mean.
- Frequent substitution: if the missing data is categorical, replace the missing values with the most frequent item.
- Regression substitution: use a regression method to replace missing values with regressed values.
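A minimal pandas sketch of the first four strategies (the toy columns 'income' and 'city' are my own illustration; each line shows one independent option, not a pipeline):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [32000., np.nan, 48000., 15000.],
                   'city': ['Tokyo', None, 'Paris', 'Paris']})

dropped = df.dropna()                                   # deletion
city_dummy = df['city'].fillna('unknown')               # dummy substitution (categorical)
income_mean = df['income'].fillna(df['income'].mean())  # mean substitution (numerical)
city_mode = df['city'].fillna(df['city'].mode()[0])     # frequent substitution (categorical)
# regression substitution: train a regressor on the complete records and
# predict the missing entries (e.g., sklearn.impute.IterativeImputer)
```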
What you should know about the outliers/anomalies
- Outliers may bring about problems by distorting the predictive model.
- What counts as an outlier is somewhat subjective.1
- Outliers can be very common in multidimensional data.2
- Some models are less sensitive (more robust ) to outliers than others.3
- Outliers can be the result of bad data collection, or they can be legitimate extreme (or unusual) values.4
- Sometimes outliers are the interesting data points we want to model, and other times they just get in the way.5
1-5 Source: https://channel9.msdn.com/Events/OpenSourceTW/DevDays-Asia-2017/AI12
Causes of outliers
How to deal with outliers?
The choice of how to deal with an outlier should depend on the cause.
- Keep outliers
  Outliers should not necessarily be omitted from the analysis, as they may be genuine observations in the data. In many applications, outliers provide crucial information. For example, in a credit card fraud detection app, they indicate purchases that fall outside a customer’s usual buying patterns.1
- Exclude outliers
  There are two common approaches to excluding outliers:2
  - Trimming/Truncation: trimming discards the outliers.
  - Winsorising: winsorising replaces the outliers with the nearest “non-suspect” data.
1 Source: MathWorks; box plot source: http://www.physics.csbsju.edu/stats/box2.html
2 Source: https://en.wikipedia.org/wiki/Outlier#Working_with_outliers
Examples on how to deal with outliers
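A minimal sketch of both exclusion approaches with NumPy/SciPy (the array below is a hypothetical example of mine):

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1., 5., 25., 300., 1295., 19000.])   # hypothetical data, one extreme value

# trimming: discard everything outside the 5th-95th percentile range
lo, hi = np.percentile(x, [5, 95])
trimmed = x[(x >= lo) & (x <= hi)]

# winsorising: clip the top/bottom 20% to the nearest "non-suspect" value
winsorized = winsorize(x, limits=[0.2, 0.2])
print(trimmed, winsorized)
```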
Preprocessing data - Data normalization
How to normalize data?
Data normalization re-scales numerical values to a specified range. Popular methods include standard (z-score) scaling, min-max scaling, and robust scaling, each demonstrated with code in the Data Preprocessing section below.
Preprocessing data - Data discretization
How to discretize data?
A numeric variable may have many different values, and for some algorithms this can lead to very complex models. You can convert continuous attributes into categorical attributes by “binning” them, for ease of use with certain machine learning methods.
Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.1
Binning
- Binning helps to improve model performance by capturing the non-linear behavior of continuous variables.
- It minimizes the impact of outliers and removes “noise” from large numbers of distinct values.
- It makes models more explainable, since grouped values are easier to display and understand, and it improves model build speed, since predictive algorithms build much faster as the number of distinct values decreases.
Preprocessing data - Data reduction
How to reduce data?
There are various methods to reduce data size for easier data handling. Depending on data size and the domain, the following methods can be applied (a pandas sketch follows the list):
- Record sampling: sample the data records and choose only a representative subset of the data.
- Attribute sampling: select only a subset of the most important attributes from the data.
- Aggregation: divide the data into groups and store the numbers for each group. For example, the daily revenue numbers of a restaurant chain over the past 20 years can be aggregated to monthly revenue to reduce the size of the data.
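A minimal pandas sketch of all three reduction methods (the synthetic daily revenue data is my own illustration):

```python
import numpy as np
import pandas as pd

# synthetic daily revenue for a restaurant chain over ~20 years
days = pd.date_range('2000-01-01', periods=7300, freq='D')
df = pd.DataFrame({'date': days,
                   'store': np.random.choice(['A', 'B', 'C'], size=len(days)),
                   'revenue': np.random.gamma(2.0, 500.0, size=len(days))})

sampled = df.sample(frac=0.1, random_state=42)   # record sampling
selected = df[['date', 'revenue']]               # attribute sampling
monthly = (df.set_index('date')                  # aggregation: daily -> monthly
             .groupby('store')['revenue']
             .resample('M').sum())
```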
Sample Data
If the dataset you plan to analyze is large, it’s usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. This facilitates data understanding, exploration, and feature engineering.
More data can result in much longer running times for algorithms and larger computational and memory requirements. Working with a smaller representative sample of the selected data can be much faster for exploring and prototyping solutions before you move to the whole dataset.
Preprocessing data - Text cleaning
How to clean text data?
Improper handling of text encodings when writing or reading text leads to information loss and the inadvertent introduction of unreadable characters (e.g., nulls), and can also break text parsing.
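For instance (a minimal sketch; the file name is hypothetical), reading and writing with an explicit encoding keeps such corruption from going unnoticed:

```python
# always pass an explicit encoding when writing and reading text
with open('reviews.txt', 'w', encoding='utf-8') as f:
    f.write('café ☕\n')

# errors='strict' (the default) raises UnicodeDecodeError on bad bytes
# instead of silently inserting unreadable characters
with open('reviews.txt', encoding='utf-8') as f:
    text = f.read()
```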
Unstructured text such as tweets, product reviews, or search queries usually requires some preprocessing before it can be analyzed.
For example:
- replacing special characters and punctuation marks with spaces
- normalizing case
- removing duplicate characters
- removing user-defined or built-in stop-words
- word stemming

The rest of this article works through three topics:
- Data Preprocessing
- Feature Engineering
- Feature Selection
Data Preprocessing
Standardization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
import pandas as pd
views = pd.DataFrame([1295., 25., 19000., 5., 1., 300.], columns=['views'])
views
 | views |
---|---|
0 | 1295.0 |
1 | 25.0 |
2 | 19000.0 |
3 | 5.0 |
4 | 1.0 |
5 | 300.0 |
Standard Scaler $\frac{x_i - \mu}{\sigma}$
ss = StandardScaler()
views['zscore'] = ss.fit_transform(views[['views']])
views
 | views | zscore |
---|---|---|
0 | 1295.0 | -0.307214 |
1 | 25.0 | -0.489306 |
2 | 19000.0 | 2.231317 |
3 | 5.0 | -0.492173 |
4 | 1.0 | -0.492747 |
5 | 300.0 | -0.449877 |
vw = np.array(views['views'])
(vw[0] - np.mean(vw)) / np.std(vw)
-0.30721413311687235
Min-Max Scaler $\frac{x_i - \min(x)}{\max(x) - \min(x)}$
mms = MinMaxScaler()
views['minmax'] = mms.fit_transform(views[['views']])
views
 | views | zscore | minmax |
---|---|---|---|
0 | 1295.0 | -0.307214 | 0.068109 |
1 | 25.0 | -0.489306 | 0.001263 |
2 | 19000.0 | 2.231317 | 1.000000 |
3 | 5.0 | -0.492173 | 0.000211 |
4 | 1.0 | -0.492747 | 0.000000 |
5 | 300.0 | -0.449877 | 0.015738 |
(vw[0] - np.min(vw)) / (np.max(vw) - np.min(vw))
0.06810884783409653
Robust Scaler $\frac{x_i - \mathrm{median}(x)}{IQR_{(1,3)}(x)}$
rs = RobustScaler()
views['robust'] = rs.fit_transform(views[['views']])
views
 | views | zscore | minmax | robust |
---|---|---|---|---|
0 | 1295.0 | -0.307214 | 0.068109 | 1.092883 |
1 | 25.0 | -0.489306 | 0.001263 | -0.132690 |
2 | 19000.0 | 2.231317 | 1.000000 | 18.178528 |
3 | 5.0 | -0.492173 | 0.000211 | -0.151990 |
4 | 1.0 | -0.492747 | 0.000000 | -0.155850 |
5 | 300.0 | -0.449877 | 0.015738 | 0.132690 |
quartiles = np.percentile(vw, (25., 75.))
iqr = quartiles[1] - quartiles[0]
(vw[0] - np.median(vw)) / iqr
1.0928829915560916
Feature Engineering
Feature Engineering is the key task in machine learning
There is no formal definition of feature engineering; it means different things to different people. In Google’s definition, the process of extracting features from raw data is called feature engineering. In Microsoft’s documentation, feature engineering is more about feature construction.
Feature Engineering is a sort of art
Feature engineering requires a creative combination of domain expertise and insights obtained from the data exploration step.
This is a balancing act of finding and including informative variables while avoiding too many unrelated variables.
Informative variables improve our result; unrelated variables introduce unnecessary noise into the model.
What’s a feature?
Feature Engineering can augment your data
Why should you perform Feature Selection?
Curse of dimensionality
The curse of dimensionality refers to how certain learning algorithms may perform poorly on high-dimensional data.
For example, after a certain point, increasing the dimensionality of the problem by adding new features actually degrades the performance of a classifier. This phenomenon is often referred to as “the curse of dimensionality”.2
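A quick numerical sketch of this effect (my own illustration, not from the original figure): in high dimensions, distances between random points concentrate, so the “nearest” and “farthest” neighbors become almost indistinguishable:

```python
import numpy as np

rng = np.random.RandomState(42)
for d in [2, 10, 100, 1000]:
    X = rng.rand(1000, d)                         # random points in the unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances to the first point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f'd={d:4d}  relative contrast={contrast:.2f}')  # shrinks toward 0 as d grows
```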
Summary: the modern approaches for Feature Selection
- Filter methods use statistical measures to evaluate a subset of features, while wrapper methods use cross-validation.
- Filter methods are much faster than wrapper methods because they do not involve training models; wrapper methods, by contrast, are computationally very expensive.
- Filter methods may often fail to find the best subset of features, whereas wrapper methods can always provide the best subset of features.
- Using the subset of features from a wrapper method makes the model more prone to overfitting than using the subset from a filter method.
- Embedded methods combine the qualities of filter and wrapper methods; they are implemented by algorithms that have their own built-in feature selection methods. (A small scikit-learn sketch of all three families follows.)
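A rough scikit-learn sketch of the three families (the dataset and parameter choices are illustrative assumptions of mine, not from the original):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# filter: score each feature with a statistical test, keep the top k
X_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# wrapper: repeatedly fit a model and drop the weakest features
X_wrapper = RFE(LogisticRegression(max_iter=5000),
                n_features_to_select=10).fit_transform(X, y)

# embedded: an L1-regularized model zeroes out unhelpful coefficients
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
X_embedded = SelectFromModel(l1, prefit=True).transform(X)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```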
Numeric Features
Handling Discrete Values
import re
import nltk
import pytz
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import scipy.stats as spstats
import datetime
from dateutil.parser import parse
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
%matplotlib inline
mpl.style.reload_library()
mpl.style.use('classic')
mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
# mpl.rcParams['figure.figsize'] = [8.0, 5.0]
mpl.rcParams['figure.dpi'] = 100
%config InlineBackend.figure_format = 'retina'
vg_df = pd.read_csv('datasets/vgsales.csv', encoding = "ISO-8859-1")
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]
 | Name | Platform | Year | Genre | Publisher |
---|---|---|---|---|---|
1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo |
2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo |
3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo |
4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo |
5 | Tetris | GB | 1989.0 | Puzzle | Nintendo |
6 | New Super Mario Bros. | DS | 2006.0 | Platform | Nintendo |
genres = np.unique(vg_df['Genre'])
genres
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
'Strategy'], dtype=object)
LabelEncoder
gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings
{0: 'Action',
1: 'Adventure',
2: 'Fighting',
3: 'Misc',
4: 'Platform',
5: 'Puzzle',
6: 'Racing',
7: 'Role-Playing',
8: 'Shooter',
9: 'Simulation',
10: 'Sports',
11: 'Strategy'}
genre_labels
array([10, 4, 6, ..., 6, 5, 4])
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]
 | Name | Platform | Year | Genre | GenreLabel |
---|---|---|---|---|---|
1 | Super Mario Bros. | NES | 1985.0 | Platform | 4 |
2 | Mario Kart Wii | Wii | 2008.0 | Racing | 6 |
3 | Wii Sports Resort | Wii | 2009.0 | Sports | 10 |
4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 7 |
5 | Tetris | GB | 1989.0 | Puzzle | 5 |
6 | New Super Mario Bros. | DS | 2006.0 | Platform | 4 |
Map
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
np.unique(poke_df['Generation'])
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]
 | Name | Generation | GenerationLabel |
---|---|---|---|
4 | Octillery | Gen 2 | 2 |
5 | Helioptile | Gen 6 | 6 |
6 | Dialga | Gen 4 | 4 |
7 | DeoxysDefense Forme | Gen 3 | 3 |
8 | Rapidash | Gen 1 | 1 |
9 | Swanna | Gen 5 | 5 |
The mapping can likewise be inverted to recover the original values:
ord_gen_map = {v:k for k,v in gen_ord_map.items()}
poke_df['InvGeneration'] = poke_df['GenerationLabel'].map(ord_gen_map)
poke_df[['Name', 'Generation', 'InvGeneration']].head()
 | Name | Generation | InvGeneration |
---|---|---|---|
0 | CharizardMega Charizard Y | Gen 1 | Gen 1 |
1 | Abomasnow | Gen 4 | Gen 4 |
2 | Sentret | Gen 2 | Gen 2 |
3 | Litleo | Gen 6 | Gen 6 |
4 | Octillery | Gen 2 | Gen 2 |
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]
 | Name | Generation | Legendary |
---|---|---|---|
4 | Octillery | Gen 2 | False |
5 | Helioptile | Gen 6 | False |
6 | Dialga | Gen 4 | True |
7 | DeoxysDefense Forme | Gen 3 | True |
8 | Rapidash | Gen 1 | False |
9 | Swanna | Gen 5 | False |
# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels
# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels
poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]
 | Name | Generation | Gen_Label | Legendary | Lgnd_Label |
---|---|---|---|---|---|
4 | Octillery | Gen 2 | 1 | False | 0 |
5 | Helioptile | Gen 6 | 5 | False | 0 |
6 | Dialga | Gen 4 | 3 | True | 1 |
7 | DeoxysDefense Forme | Gen 3 | 2 | True | 1 |
8 | Rapidash | Gen 1 | 0 | False | 0 |
9 | Swanna | Gen 5 | 4 | False | 0 |
One-hot Encoding
# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
print (gen_feature_labels)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label) for cls_label in leg_le.classes_]
print (leg_feature_labels)
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)
['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']
['Legendary_False', 'Legendary_True']
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],gen_feature_labels,
['Legendary', 'Lgnd_Label'],leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]
 | Name | Generation | Gen_Label | Gen 1 | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 | Legendary | Lgnd_Label | Legendary_False | Legendary_True |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | Octillery | Gen 2 | 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 0 | 1.0 | 0.0 |
5 | Helioptile | Gen 6 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | False | 0 | 1.0 | 0.0 |
6 | Dialga | Gen 4 | 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | True | 1 | 0.0 | 1.0 |
7 | DeoxysDefense Forme | Gen 3 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | True | 1 | 0.0 | 1.0 |
8 | Rapidash | Gen 1 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 0 | 1.0 | 0.0 |
9 | Swanna | Gen 5 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | False | 0 | 1.0 | 0.0 |
Get Dummies
pandas offers a similar operation: get_dummies produces the corresponding one-hot features.
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]
 | Name | Generation | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 |
---|---|---|---|---|---|---|---|
4 | Octillery | Gen 2 | 1 | 0 | 0 | 0 | 0 |
5 | Helioptile | Gen 6 | 0 | 0 | 0 | 0 | 1 |
6 | Dialga | Gen 4 | 0 | 0 | 1 | 0 | 0 |
7 | DeoxysDefense Forme | Gen 3 | 0 | 1 | 0 | 0 | 0 |
8 | Rapidash | Gen 1 | 0 | 0 | 0 | 0 | 0 |
9 | Swanna | Gen 5 | 0 | 0 | 0 | 1 | 0 |
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]
 | Name | Generation | Gen 1 | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 |
---|---|---|---|---|---|---|---|---|
4 | Octillery | Gen 2 | 0 | 1 | 0 | 0 | 0 | 0 |
5 | Helioptile | Gen 6 | 0 | 0 | 0 | 0 | 0 | 1 |
6 | Dialga | Gen 4 | 0 | 0 | 0 | 1 | 0 | 0 |
7 | DeoxysDefense Forme | Gen 3 | 0 | 0 | 1 | 0 | 0 | 0 |
8 | Rapidash | Gen 1 | 1 | 0 | 0 | 0 | 0 | 0 |
9 | Swanna | Gen 5 | 0 | 0 | 0 | 0 | 1 | 0 |
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()
 | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | Gen 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | Gen 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | Gen 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | Gen 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | Gen 1 | False |
poke_df[['HP', 'Attack', 'Defense']].head()
 | HP | Attack | Defense |
---|---|---|---|
0 | 45 | 49 | 49 |
1 | 60 | 62 | 63 |
2 | 80 | 82 | 83 |
3 | 80 | 100 | 123 |
4 | 39 | 52 | 43 |
poke_df[['HP', 'Attack', 'Defense']].describe()
 | HP | Attack | Defense |
---|---|---|---|
count | 800.000000 | 800.000000 | 800.000000 |
mean | 69.258750 | 79.001250 | 73.842500 |
std | 25.534669 | 32.457366 | 31.183501 |
min | 1.000000 | 5.000000 | 5.000000 |
25% | 50.000000 | 55.000000 | 50.000000 |
50% | 65.000000 | 75.000000 | 70.000000 |
75% | 80.000000 | 100.000000 | 90.000000 |
max | 255.000000 | 190.000000 | 230.000000 |
popsong_df = pd.read_csv('datasets/song_views.csv', encoding='utf-8')
popsong_df.head(10)
 | user_id | song_id | title | listen_count |
---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 |
8 | e701a24d9b6c59f5ac37ab28462ca82470e27cfb | SOBONKR12A58A7A7E0 | You're The One | 68 |
9 | edc8b7b1fd592a3b69c3d823a742e1a064abec95 | SOBONKR12A58A7A7E0 | You're The One | 0 |
Binary Features
watched = np.array(popsong_df['listen_count'])
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(10)
 | user_id | song_id | title | listen_count | watched |
---|---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 | 1 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 | 1 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
8 | e701a24d9b6c59f5ac37ab28462ca82470e27cfb | SOBONKR12A58A7A7E0 | You're The One | 68 | 1 |
9 | edc8b7b1fd592a3b69c3d823a742e1a064abec95 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
# values greater than 0.9 become 1, values at or below 0.9 become 0
bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)
 | user_id | song_id | title | listen_count | watched | pd_watched |
---|---|---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 | 1 | 1 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 | 1 | 1 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
8 | e701a24d9b6c59f5ac37ab28462ca82470e27cfb | SOBONKR12A58A7A7E0 | You're The One | 68 | 1 | 1 |
9 | edc8b7b1fd592a3b69c3d823a742e1a064abec95 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
10 | fb41d1c374d093ab643ef3bcd70eeb258d479076 | SOBONKR12A58A7A7E0 | You're The One | 1 | 1 | 1 |
Polynomial Features
atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()
 | Attack | Defense |
---|---|---|
0 | 49 | 49 |
1 | 62 | 63 |
2 | 82 | 83 |
3 | 100 | 123 |
4 | 52 | 43 |
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res
array([[ 49., 49., 2401., 2401., 2401.],
[ 62., 63., 3844., 3906., 3969.],
[ 82., 83., 6724., 6806., 6889.],
...,
[ 110., 60., 12100., 6600., 3600.],
[ 160., 60., 25600., 9600., 3600.],
[ 110., 120., 12100., 13200., 14400.]])
intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)
 | Attack | Defense | Attack^2 | Attack x Defense | Defense^2 |
---|---|---|---|---|---|
0 | 49.0 | 49.0 | 2401.0 | 2401.0 | 2401.0 |
1 | 62.0 | 63.0 | 3844.0 | 3906.0 | 3969.0 |
2 | 82.0 | 83.0 | 6724.0 | 6806.0 | 6889.0 |
3 | 100.0 | 123.0 | 10000.0 | 12300.0 | 15129.0 |
4 | 52.0 | 43.0 | 2704.0 | 2236.0 | 1849.0 |
Binning Features
Discretizing continuous values
fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()
 | ID.x | EmploymentField | Age | Income |
---|---|---|---|---|
0 | cef35615d61b202f1dc794ef2746df14 | office and administrative support | 28.0 | 32000.0 |
1 | 323e5a113644d18185c743c241407754 | food and beverage | 22.0 | 15000.0 |
2 | b29a1027e5cd062e654a63764157461d | finance | 19.0 | 48000.0 |
3 | 04a11e4bcb573a1261eb0d9948d32637 | arts, entertainment, sports, or media | 26.0 | 43000.0 |
4 | 9368291c93d5d5f5c8cdb1a575e18bec | education | 20.0 | 6000.0 |
fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
Binning based on rounding
Age Range: Bin
---------------
0 - 9 : 0
10 - 19 : 1
20 - 29 : 2
30 - 39 : 3
40 - 49 : 4
50 - 59 : 5
60 - 69 : 6
... and so on
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
 | ID.x | Age | Age_bin_round |
---|---|---|---|
1071 | 6a02aa4618c99fdb3e24de522a099431 | 17.0 | 1.0 |
1072 | f0e5e47278c5f248fe861c5f7214c07a | 38.0 | 3.0 |
1073 | 6e14f6d0779b7e424fa3fdd9e4bd3bf9 | 21.0 | 2.0 |
1074 | c2654c07dc929cdf3dad4d1aec4ffbb3 | 53.0 | 5.0 |
1075 | f07449fc9339b2e57703ec7886232523 | 35.0 | 3.0 |
Quantile Binning
fcc_survey_df[['ID.x', 'Age', 'Income']].iloc[4:9]
 | ID.x | Age | Income |
---|---|---|---|
4 | 9368291c93d5d5f5c8cdb1a575e18bec | 20.0 | 6000.0 |
5 | dd0e77eab9270e4b67c19b0d6bbf621b | 34.0 | 40000.0 |
6 | 7599c0aa0419b59fd11ffede98a3665d | 23.0 | 32000.0 |
7 | 6dff182db452487f07a47596f314bddc | 35.0 | 40000.0 |
8 | 9dc233f8ed1c6eb2432672ab4bb39249 | 33.0 | 80000.0 |
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles
0.00 6000.0
0.25 20000.0
0.50 37000.0
0.75 60000.0
1.00 200000.0
Name: Income, dtype: float64
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)
ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
q=quantile_list, labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income',
'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]
 | ID.x | Age | Income | Income_quantile_range | Income_quantile_label |
---|---|---|---|---|---|
4 | 9368291c93d5d5f5c8cdb1a575e18bec | 20.0 | 6000.0 | (5999.999, 20000.0] | 0-25Q |
5 | dd0e77eab9270e4b67c19b0d6bbf621b | 34.0 | 40000.0 | (37000.0, 60000.0] | 50-75Q |
6 | 7599c0aa0419b59fd11ffede98a3665d | 23.0 | 32000.0 | (20000.0, 37000.0] | 25-50Q |
7 | 6dff182db452487f07a47596f314bddc | 35.0 | 40000.0 | (37000.0, 60000.0] | 50-75Q |
8 | 9dc233f8ed1c6eb2432672ab4bb39249 | 33.0 | 80000.0 | (60000.0, 200000.0] | 75-100Q |
Log Transform (Box-Cox)
With np.log, no base specified means the natural logarithm.
The transformed values have lower skewness and come closer to the normal distribution many models assume.
fcc_survey_df['Income_log'] = np.log(1 + fcc_survey_df['Income'])
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]
 | ID.x | Age | Income | Income_log |
---|---|---|---|---|
4 | 9368291c93d5d5f5c8cdb1a575e18bec | 20.0 | 6000.0 | 8.699681 |
5 | dd0e77eab9270e4b67c19b0d6bbf621b | 34.0 | 40000.0 | 10.596660 |
6 | 7599c0aa0419b59fd11ffede98a3665d | 23.0 | 32000.0 | 10.373522 |
7 | 6dff182db452487f07a47596f314bddc | 35.0 | 40000.0 | 10.596660 |
8 | 9dc233f8ed1c6eb2432672ab4bb39249 | 33.0 | 80000.0 | 11.289794 |
income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)
fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3')
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)
Text(11.5, 450, '$\\mu$=10.43')
Date Features
time_stamps = ['2015-03-08 10:30:00.360000+00:00', '2017-07-13 15:45:05.755000-07:00',
'2012-01-20 22:30:00.254000+05:30', '2016-12-25 00:30:00.000000+10:00']
df = pd.DataFrame(time_stamps, columns=['Time'])
df
 | Time |
---|---|
0 | 2015-03-08 10:30:00.360000+00:00 |
1 | 2017-07-13 15:45:05.755000-07:00 |
2 | 2012-01-20 22:30:00.254000+05:30 |
3 | 2016-12-25 00:30:00.000000+10:00 |
ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs
ts_objs
array([Timestamp('2015-03-08 10:30:00.360000+0000', tz='UTC'),
Timestamp('2017-07-13 15:45:05.755000-0700', tz='pytz.FixedOffset(-420)'),
Timestamp('2012-01-20 22:30:00.254000+0530', tz='pytz.FixedOffset(330)'),
Timestamp('2016-12-25 00:30:00+1000', tz='pytz.FixedOffset(600)')],
dtype=object)
df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)
df[['Time', 'Year', 'Month', 'Day', 'Quarter',
'DayOfWeek', 'DayOfYear', 'WeekOfYear']]
 | Time | Year | Month | Day | Quarter | DayOfWeek | DayOfYear | WeekOfYear |
---|---|---|---|---|---|---|---|---|
0 | 2015-03-08 10:30:00.360000+00:00 | 2015 | 3 | 8 | 1 | 6 | 67 | 10 |
1 | 2017-07-13 15:45:05.755000-07:00 | 2017 | 7 | 13 | 3 | 3 | 194 | 28 |
2 | 2012-01-20 22:30:00.254000+05:30 | 2012 | 1 | 20 | 1 | 4 | 20 | 3 |
3 | 2016-12-25 00:30:00.000000+10:00 | 2016 | 12 | 25 | 4 | 6 | 360 | 51 |
Time Features
df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
df['Minute'] = df['TS_obj'].apply(lambda d: d.minute)
df['Second'] = df['TS_obj'].apply(lambda d: d.second)
df['MUsecond'] = df['TS_obj'].apply(lambda d: d.microsecond)   # microseconds
df['UTC_offset'] = df['TS_obj'].apply(lambda d: d.utcoffset()) # UTC offset
df[['Time', 'Hour', 'Minute', 'Second', 'MUsecond', 'UTC_offset']]
 | Time | Hour | Minute | Second | MUsecond | UTC_offset |
---|---|---|---|---|---|---|
0 | 2015-03-08 10:30:00.360000+00:00 | 10 | 30 | 0 | 360000 | 00:00:00 |
1 | 2017-07-13 15:45:05.755000-07:00 | 15 | 45 | 5 | 755000 | -1 days +17:00:00 |
2 | 2012-01-20 22:30:00.254000+05:30 | 22 | 30 | 0 | 254000 | 05:30:00 |
3 | 2016-12-25 00:30:00.000000+10:00 | 0 | 30 | 0 | 0 | 10:00:00 |
Binning hours into time-of-day buckets
hour_bins = [-1, 5, 11, 16, 21, 23]
bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
df['TimeOfDayBin'] = pd.cut(df['Hour'],
bins=hour_bins, labels=bin_names)
df[['Time', 'Hour', 'TimeOfDayBin']]
 | Time | Hour | TimeOfDayBin |
---|---|---|---|
0 | 2015-03-08 10:30:00.360000+00:00 | 10 | Morning |
1 | 2017-07-13 15:45:05.755000-07:00 | 15 | Afternoon |
2 | 2012-01-20 22:30:00.254000+05:30 | 22 | Night |
3 | 2016-12-25 00:30:00.000000+10:00 | 0 | Late Night |
Text Features
Build a small text corpus
corpus = ['The sky is blue and beautiful.',
'Love this blue and beautiful sky!',
'The quick brown fox jumps over the lazy dog.',
'The brown fox is quick and the blue dog is lazy!',
'The sky is very blue and the sky is very beautiful today',
'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
 | Document | Category |
---|---|---|
0 | The sky is blue and beautiful. | weather |
1 | Love this blue and beautiful sky! | weather |
2 | The quick brown fox jumps over the lazy dog. | animals |
3 | The brown fox is quick and the blue dog is lazy! | animals |
4 | The sky is very blue and the sky is very beaut... | weather |
5 | The dog is lazy but the brown fox is quick! | animals |
Basic Preprocessing
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data] /home/sunchengquan/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
# tokenizer and stopword list
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
print (stop_words)
def normalize_document(doc):
    # lower case and remove special characters/whitespace
    # note: flags must be passed by keyword; the original re.sub(..., re.I)
    # passed re.I as the count argument by mistake
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
norm_corpus = normalize_corpus(corpus)
norm_corpus
array(['sky blue beautiful', 'love blue beautiful sky',
'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
'sky blue sky beautiful today', 'dog lazy brown fox quick'],
dtype='<U30')
Bag-of-Words Model
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
print (cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']
array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
[0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
[0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]])
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)
 | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
3 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
5 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
N-Gram Model
bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)
 | beautiful sky | beautiful today | blue beautiful | blue dog | blue sky | brown fox | dog lazy | fox jumps | fox quick | jumps lazy | lazy brown | lazy dog | love blue | quick blue | quick brown | sky beautiful | sky blue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
TF-IDF Model
TF-IDF (term frequency-inverse document frequency) is a weighting scheme widely used in information retrieval and text mining. TF (term frequency) measures how often a term occurs in a document; IDF (inverse document frequency) measures how rare the term is across the whole corpus. A term scores highly when it appears frequently in a given document yet is rare in the corpus overall.
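For reference (this matches scikit-learn's behavior as I understand it, under its defaults smooth_idf=True and norm='l2'), TfidfVectorizer computes

$tfidf(t, d) = tf(t, d) \times \left(\ln\frac{1 + n}{1 + df(t)} + 1\right)$

where $n$ is the number of documents and $df(t)$ is the number of documents containing term $t$, and then L2-normalizes each document vector, which is why every value in the matrix below lies between 0 and 1.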
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
 | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.60 | 0.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.60 | 0.00 |
1 | 0.46 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.66 | 0.00 | 0.46 | 0.00 |
2 | 0.00 | 0.00 | 0.38 | 0.38 | 0.38 | 0.54 | 0.38 | 0.00 | 0.38 | 0.00 | 0.00 |
3 | 0.00 | 0.36 | 0.42 | 0.42 | 0.42 | 0.00 | 0.42 | 0.00 | 0.42 | 0.00 | 0.00 |
4 | 0.36 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.72 | 0.52 |
5 | 0.00 | 0.00 | 0.45 | 0.45 | 0.45 | 0.00 | 0.45 | 0.00 | 0.45 | 0.00 | 0.00 |
Similarity Features
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
 | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 1.000000 | 0.753128 | 0.000000 | 0.185447 | 0.807539 | 0.000000 |
1 | 0.753128 | 1.000000 | 0.000000 | 0.139665 | 0.608181 | 0.000000 |
2 | 0.000000 | 0.000000 | 1.000000 | 0.784362 | 0.000000 | 0.839987 |
3 | 0.185447 | 0.139665 | 0.784362 | 1.000000 | 0.109653 | 0.933779 |
4 | 0.807539 | 0.608181 | 0.000000 | 0.109653 | 1.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.839987 | 0.933779 | 0.000000 | 1.000000 |
Clustering Features
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
 | Document | Category | ClusterLabel |
---|---|---|---|
0 | The sky is blue and beautiful. | weather | 0 |
1 | Love this blue and beautiful sky! | weather | 0 |
2 | The quick brown fox jumps over the lazy dog. | animals | 1 |
3 | The brown fox is quick and the blue dog is lazy! | animals | 1 |
4 | The sky is very blue and the sky is very beaut... | weather | 0 |
5 | The dog is lazy but the brown fox is quick! | animals | 1 |
Topic Model: LDA
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features
 | T1 | T2 |
---|---|---|
0 | 0.190548 | 0.809452 |
1 | 0.176804 | 0.823196 |
2 | 0.846184 | 0.153816 |
3 | 0.814863 | 0.185137 |
4 | 0.180516 | 0.819484 |
5 | 0.839172 | 0.160828 |
Topic-word weights
tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()
[('brown', 1.7273638692668467), ('dog', 1.7273638692668467), ('fox', 1.7273638692668467), ('lazy', 1.7273638692668467), ('quick', 1.7273638692668467), ('jumps', 1.0328325272484777), ('blue', 0.7731573162915626)]
[('sky', 2.264386643135622), ('beautiful', 1.9068269319456903), ('blue', 1.7996282104933266), ('love', 1.148127242397004), ('today', 1.0068251160429935)]
Word Embedding Models (word vectors): usually the first choice
Each vector carries real semantic meaning.
For example, the word vector for “like” and the word vector for “love” are similar.
from gensim.models import word2vec
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
# Set values for various parameters
feature_size = 10 # Word vector dimensionality
window_context = 10 # Context window size
min_word_count = 1 # Minimum word count
sample = 1e-3 # Downsample setting for frequent words
w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
window=window_context, min_count = min_word_count,
sample=sample)
w2v_model.wv['sky']
array([ 0.0051514 , 0.03431731, 0.02772758, 0.03358332, -0.00501862,
0.02063181, 0.00937138, 0.04554451, -0.02628018, 0.0292932 ],
dtype=float32)
Averaging word vectors is not optimal; are there better approaches? Yes, e.g., sequence models such as LSTMs.
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # index via model.wv; indexing the model directly is deprecated
            feature_vector = np.add(feature_vector, model.wv[word])
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
    return feature_vector

def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)
w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
num_features=feature_size)
pd.DataFrame(w2v_feature_array)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.024004 | 0.009261 | -0.000253 | 0.004115 | -0.019091 | -0.005033 | -0.002342 | -0.014454 | 0.002763 | 0.026265 |
1 | 0.010011 | 0.019334 | -0.010439 | 0.010748 | -0.012463 | -0.011657 | -0.007812 | -0.007604 | 0.001624 | 0.012943 |
2 | -0.011372 | -0.000493 | 0.009532 | 0.002820 | -0.008391 | 0.001193 | 0.003332 | -0.010040 | 0.007316 | -0.010241 |
3 | -0.012263 | -0.002740 | -0.000849 | -0.003037 | -0.005497 | -0.002302 | -0.011215 | -0.009798 | 0.013508 | -0.002283 |
4 | 0.021322 | 0.015926 | -0.000250 | 0.011049 | -0.002766 | 0.007039 | -0.000213 | -0.006395 | 0.003781 | 0.017381 |
5 | -0.021207 | 0.006429 | 0.004595 | 0.001258 | -0.001805 | 0.006114 | -0.004712 | -0.002169 | 0.014164 | -0.003220 |
Image Features
import skimage
from skimage import io
from skimage import color
# OpenCV or TensorFlow could be used for image I/O as well
Image shape
cat = io.imread('./images/cat.png')
dog = io.imread('./images/dog.png')
df = pd.DataFrame(['Cat', 'Dog'], columns=['Image'])
print(cat.shape, dog.shape)
(168, 300, 3) (168, 300, 3)
# pixel values range 0-255: smaller values are darker, larger values are brighter
cat
array([[[114, 105, 90],
[113, 104, 89],
[112, 103, 88],
...,
[127, 130, 121],
[130, 133, 124],
[133, 136, 127]],
[[113, 104, 89],
[112, 103, 88],
[111, 102, 87],
...,
[129, 132, 125],
[132, 135, 128],
[135, 138, 131]],
[[111, 102, 87],
[111, 102, 87],
[110, 101, 86],
...,
[132, 134, 133],
[136, 138, 137],
[139, 141, 140]],
...,
[[ 32, 26, 28],
[ 32, 26, 28],
[ 30, 24, 26],
...,
[131, 131, 131],
[131, 131, 131],
[130, 130, 130]],
[[ 33, 27, 29],
[ 32, 26, 28],
[ 31, 25, 27],
...,
[131, 131, 131],
[131, 131, 131],
[130, 130, 130]],
[[ 33, 27, 29],
[ 32, 26, 28],
[ 31, 25, 27],
...,
[131, 131, 131],
[131, 131, 131],
[130, 130, 130]]], dtype=uint8)
#coffee = skimage.transform.resize(coffee, (300, 451), mode='reflect')
fig = plt.figure(figsize = (8, 2.5))
ax1 = fig.add_subplot(1,2, 1)
ax1.imshow(cat)
ax2 = fig.add_subplot(1,2, 2)
ax2.imshow(dog)
<matplotlib.image.AxesImage at 0x7fa047fc3278>
dog_r = dog.copy() # Red Channel
dog_r[:,:,1] = dog_r[:,:,2] = 0 # set G,B pixels = 0
dog_g = dog.copy() # Green Channel
dog_g[:,:,0] = dog_g[:,:,2] = 0 # set R,B pixels = 0
dog_b = dog.copy() # Blue Channel
dog_b[:,:,0] = dog_b[:,:,1] = 0 # set R,G pixels = 0
plot_image = np.concatenate((dog_r, dog_g, dog_b), axis=1)
plt.figure(figsize = (12,2.5))
plt.imshow(plot_image)
<matplotlib.image.AxesImage at 0x7fa047ef4208>
dog_r[1,1]
array([160, 0, 0], dtype=uint8)
Grayscale
fig = plt.figure(figsize = (8,4))
ax1 = fig.add_subplot(2,2, 1)
ax1.imshow(color.rgb2gray(cat), cmap="gray" )
ax2 = fig.add_subplot(2,2, 2)
ax2.imshow(color.rgb2gray(dog), cmap='gray')
<matplotlib.image.AxesImage at 0x7fa0469ff2b0>
References:
- 梁劲, Machine Learning Study Notes (机器学习学习笔记)
- 唐宇坤, machine learning course