Table of Contents
- Preparing data is time-consuming
- Data in the real world is dirty
- Preprocessing data to avoid "garbage in, garbage out"
- Data Preprocessing
- Standardization
- Standard Scaler $\frac{x_i - \mu}{\sigma}$
- Min-Max Scaler $\frac{x_i - \min(x)}{\max(x) - \min(x)}$
- Robust Scaler $\frac{x_i - \mathrm{median}(x)}{IQR_{(1,3)}(x)}$
- Feature Engineering
![figure](https://i-blog.csdnimg.cn/blog_migrate/770da8870ec72cfe09350089058e5624.png)
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools…
Preparing data is time-consuming
What data scientists spend the most time doing
Messy data is by far the most time-consuming aspect of the typical data scientist’s workflow
Data in the real world is dirty
Data is rarely clean, and data quality issues are common.
The typical data quality issues that arise are:
- Incomplete: Data lacks attribute values or contains missing values.
- Noisy: Data contains erroneous records or outliers.
- Inconsistent: Data contains conflicting records or discrepancies.
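As a quick illustration (a minimal pandas sketch; the toy DataFrame and the age thresholds are my own assumptions), each issue can be surfaced with a one-liner:

```python
import numpy as np
import pandas as pd

# toy data exhibiting all three issues
df = pd.DataFrame({'age': [25, np.nan, 31, 250],          # missing value and an outlier
                   'country': ['US', 'us', 'DE', 'DE']})  # inconsistent casing

print(df.isnull().sum())                          # incomplete: missing values per column
print(df[(df['age'] < 0) | (df['age'] > 120)])    # noisy: implausible ages
print(df['country'].str.upper().value_counts())   # inconsistent: normalize, then count
```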
Preprocessing data to avoid “garbage in, garbage out”
Preprocessing data - Clean your data
Why deal with missing values?
- Missing values in a dataset can be due to error or because observations were not recorded.
- When missing values are present, certain algorithms may not work, or you may not get the desired result.
- Missing data affects some models more than others.1
- Even models that can handle missing data may be sensitive to it (missing data for certain variables can result in poor predictions).2
1&2 Source: https://channel9.msdn.com/Events/OpenSourceTW/DevDays-Asia-2017/AI12
How to deal with missing values?
Typical missing value handling methods are (a short pandas sketch follows the list):
- Deletion: remove records with missing values.
- Dummy substitution: replace missing values with a dummy value, e.g., unknown for categorical or 0 for numerical values.
- Mean substitution: if the missing data is numerical, replace the missing values with the mean.
- Frequent substitution: if the missing data is categorical, replace the missing values with the most frequent item.
- Regression substitution: use a regression method to replace missing values with regressed values.
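A minimal pandas sketch of the first four strategies (the toy columns 'income' and 'city' are my own illustration; each line shows one independent option, not a pipeline):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [32000., np.nan, 48000., 15000.],
                   'city': ['Tokyo', None, 'Paris', 'Paris']})

dropped = df.dropna()                                   # deletion
city_dummy = df['city'].fillna('unknown')               # dummy substitution (categorical)
income_mean = df['income'].fillna(df['income'].mean())  # mean substitution (numerical)
city_mode = df['city'].fillna(df['city'].mode()[0])     # frequent substitution (categorical)
# regression substitution: train a regressor on the complete records and
# predict the missing entries (e.g., sklearn.impute.IterativeImputer)
```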
What you should know about the outliers/anomalies
- Outliers may bring about problems by distorting the predictive model.
- What counts as an outlier is somewhat subjective.1
- Outliers can be very common in multidimensional data.2
- Some models are less sensitive (more robust ) to outliers than others.3
- Outliers can be the result of bad data collection, or they can be legitimate extreme (or unusual) values.4
- Sometimes outliers are the interesting data points we want to model, and other times they just get in the way.5
1-5 Source: https://channel9.msdn.com/Events/OpenSourceTW/DevDays-Asia-2017/AI12
Causes of outliers
How to deal with outliers?
The choice of how to deal with an outlier should depend on the cause.
- Keep outliers
  Outliers should not necessarily be omitted from the analysis, as they may be genuine observations in the data. In many applications, outliers provide crucial information. For example, in a credit card fraud detection app, they indicate purchases that fall outside a customer’s usual buying patterns.1
- Exclude outliers
  There are two common approaches to excluding outliers:2
  - Trimming/Truncation: trimming discards the outliers.
  - Winsorising: winsorising replaces the outliers with the nearest “non-suspect” data.
1 Source: MathWorks; box plot source: http://www.physics.csbsju.edu/stats/box2.html
2 Source: https://en.wikipedia.org/wiki/Outlier#Working_with_outliers
Examples on how to deal with outliers
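A minimal sketch of both exclusion approaches with NumPy/SciPy (the array below is a hypothetical example of mine):

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1., 5., 25., 300., 1295., 19000.])   # hypothetical data, one extreme value

# trimming: discard everything outside the 5th-95th percentile range
lo, hi = np.percentile(x, [5, 95])
trimmed = x[(x >= lo) & (x <= hi)]

# winsorising: clip the top/bottom 20% to the nearest "non-suspect" value
winsorized = winsorize(x, limits=[0.2, 0.2])
print(trimmed, winsorized)
```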
Preprocessing data - Data normalization
How to normalize data?
Data normalization re-scales numerical values to a specified range. Popular methods include standard (z-score) scaling, min-max scaling, and robust scaling, each demonstrated with code in the Data Preprocessing section below.
Preprocessing data - Data discretization
How to discretize data?
A numeric variable may have many different values, and for some algorithms this can lead to very complex models. You can convert continuous attributes into categorical attributes by “binning” them, for ease of use with certain machine learning methods.
Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.1
Binning
- Binning helps to improve model performance by capturing the non-linear behavior of continuous variables.
- It minimizes the impact of outliers and removes “noise” from large numbers of distinct values.
- It makes models more explainable, since grouped values are easier to display and understand, and it improves model build speed, since predictive algorithms build much faster as the number of distinct values decreases.
Preprocessing data - Data reduction
How to reduce data?
There are various methods to reduce data size for easier data handling. Depending on data size and the domain, the following methods can be applied (a pandas sketch follows the list):
- Record sampling: sample the data records and choose only a representative subset of the data.
- Attribute sampling: select only a subset of the most important attributes from the data.
- Aggregation: divide the data into groups and store the numbers for each group. For example, the daily revenue numbers of a restaurant chain over the past 20 years can be aggregated to monthly revenue to reduce the size of the data.
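A minimal pandas sketch of all three reduction methods (the synthetic daily revenue data is my own illustration):

```python
import numpy as np
import pandas as pd

# synthetic daily revenue for a restaurant chain over ~20 years
days = pd.date_range('2000-01-01', periods=7300, freq='D')
df = pd.DataFrame({'date': days,
                   'store': np.random.choice(['A', 'B', 'C'], size=len(days)),
                   'revenue': np.random.gamma(2.0, 500.0, size=len(days))})

sampled = df.sample(frac=0.1, random_state=42)   # record sampling
selected = df[['date', 'revenue']]               # attribute sampling
monthly = (df.set_index('date')                  # aggregation: daily -> monthly
             .groupby('store')['revenue']
             .resample('M').sum())
```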
Sample Data
If the dataset you plan to analyze is large, it’s usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. This facilitates data understanding, exploration, and feature engineering.
More data can result in much longer running times for algorithms and larger computational and memory requirements. Working with a smaller representative sample of the selected data can be much faster for exploring and prototyping solutions before you move to the whole dataset.
Preprocessing data - Text cleaning
How to clean text data?
Improper handling of text encodings when writing or reading text leads to information loss and the inadvertent introduction of unreadable characters (e.g., nulls), and can also break text parsing.
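For instance (a minimal sketch; the file name is hypothetical), reading and writing with an explicit encoding keeps such corruption from going unnoticed:

```python
# always pass an explicit encoding when writing and reading text
with open('reviews.txt', 'w', encoding='utf-8') as f:
    f.write('café ☕\n')

# errors='strict' (the default) raises UnicodeDecodeError on bad bytes
# instead of silently inserting unreadable characters
with open('reviews.txt', encoding='utf-8') as f:
    text = f.read()
```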
Unstructured text such as tweets, product reviews, or search queries usually requires some preprocessing before it can be analyzed.
For example:
- replacing special characters and punctuation marks with spaces
- normalizing case
- removing duplicate characters
- removing user-defined or built-in stop-words
- word stemming

The rest of this article works through three topics:
- Data Preprocessing
- Feature Engineering
- Feature Selection
Data Preprocessing
Standardization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
import pandas as pd
views = pd.DataFrame([1295., 25., 19000., 5., 1., 300.], columns=['views'])
views
 | views |
---|---|
0 | 1295.0 |
1 | 25.0 |
2 | 19000.0 |
3 | 5.0 |
4 | 1.0 |
5 | 300.0 |
Standard Scaler $\frac{x_i - \mu}{\sigma}$
ss = StandardScaler()
views['zscore'] = ss.fit_transform(views[['views']])
views
 | views | zscore |
---|---|---|
0 | 1295.0 | -0.307214 |
1 | 25.0 | -0.489306 |
2 | 19000.0 | 2.231317 |
3 | 5.0 | -0.492173 |
4 | 1.0 | -0.492747 |
5 | 300.0 | -0.449877 |
vw = np.array(views['views'])
(vw[0] - np.mean(vw)) / np.std(vw)
-0.30721413311687235
Min-Max Scaler $\frac{x_i - \min(x)}{\max(x) - \min(x)}$
mms = MinMaxScaler()
views['minmax'] = mms.fit_transform(views[['views']])
views
 | views | zscore | minmax |
---|---|---|---|
0 | 1295.0 | -0.307214 | 0.068109 |
1 | 25.0 | -0.489306 | 0.001263 |
2 | 19000.0 | 2.231317 | 1.000000 |
3 | 5.0 | -0.492173 | 0.000211 |
4 | 1.0 | -0.492747 | 0.000000 |
5 | 300.0 | -0.449877 | 0.015738 |
(vw[0] - np.min(vw)) / (np.max(vw) - np.min(vw))
0.06810884783409653
Robust Scaler $\frac{x_i - \mathrm{median}(x)}{IQR_{(1,3)}(x)}$
rs = RobustScaler()
views['robust'] = rs.fit_transform(views[['views']])
views
 | views | zscore | minmax | robust |
---|---|---|---|---|
0 | 1295.0 | -0.307214 | 0.068109 | 1.092883 |
1 | 25.0 | -0.489306 | 0.001263 | -0.132690 |
2 | 19000.0 | 2.231317 | 1.000000 | 18.178528 |
3 | 5.0 | -0.492173 | 0.000211 | -0.151990 |
4 | 1.0 | -0.492747 | 0.000000 | -0.155850 |
5 | 300.0 | -0.449877 | 0.015738 | 0.132690 |
quartiles = np.percentile(vw, (25., 75.))
iqr = quartiles[1] - quartiles[0]
(vw[0] - np.median(vw)) / iqr
1.0928829915560916
Feature Engineering
Feature Engineering is the key task in machine learning
There is no formal definition of feature engineering; it means different things to different people. In Google’s definition, the process of extracting features from raw data is called feature engineering. In Microsoft’s documentation, feature engineering is more about feature construction.
Feature Engineering is a sort of art
Feature engineering requires a creative combination of domain expertise and insights obtained from the data exploration step.
This is a balancing act of finding and including informative variables while avoiding too many unrelated variables.
Informative variables improve our result; unrelated variables introduce unnecessary noise into the model.
What’s a feature?
Feature Engineering can augment your data
Why should you perform Feature Selection?
Curse of dimensionality
The curse of dimensionality refers to how certain learning algorithms may perform poorly on high-dimensional data.
For example, after a certain point, increasing the dimensionality of the problem by adding new features actually degrades the performance of a classifier. This phenomenon is often referred to as “the curse of dimensionality”.2
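A quick numerical sketch of this effect (my own illustration, not from the original figure): in high dimensions, distances between random points concentrate, so the “nearest” and “farthest” neighbors become almost indistinguishable:

```python
import numpy as np

rng = np.random.RandomState(42)
for d in [2, 10, 100, 1000]:
    X = rng.rand(1000, d)                         # random points in the unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances to the first point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f'd={d:4d}  relative contrast={contrast:.2f}')  # shrinks toward 0 as d grows
```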
Summary: the modern approaches for Feature Selection
- Filter methods use statistical measures to evaluate a subset of features, while wrapper methods use cross-validation.
- Filter methods are much faster than wrapper methods because they do not involve training models; wrapper methods, by contrast, are computationally very expensive.
- Filter methods may often fail to find the best subset of features, whereas wrapper methods can always provide the best subset of features.
- Using the subset of features from a wrapper method makes the model more prone to overfitting than using the subset from a filter method.
- Embedded methods combine the qualities of filter and wrapper methods; they are implemented by algorithms that have their own built-in feature selection methods. (A small scikit-learn sketch of all three families follows.)
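A rough scikit-learn sketch of the three families (the dataset and parameter choices are illustrative assumptions of mine, not from the original):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# filter: score each feature with a statistical test, keep the top k
X_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# wrapper: repeatedly fit a model and drop the weakest features
X_wrapper = RFE(LogisticRegression(max_iter=5000),
                n_features_to_select=10).fit_transform(X, y)

# embedded: an L1-regularized model zeroes out unhelpful coefficients
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
X_embedded = SelectFromModel(l1, prefit=True).transform(X)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```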
Numeric Features
Handling Discrete Values
import re
import nltk
import pytz
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import scipy.stats as spstats
import datetime
from dateutil.parser import parse
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
%matplotlib inline
mpl.style.reload_library()
mpl.style.use('classic')
mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
# mpl.rcParams['figure.figsize'] = [8.0, 5.0]
mpl.rcParams['figure.dpi'] = 100
%config InlineBackend.figure_format = 'retina'
vg_df = pd.read_csv('datasets/vgsales.csv', encoding = "ISO-8859-1")
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]
 | Name | Platform | Year | Genre | Publisher |
---|---|---|---|---|---|
1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo |
2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo |
3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo |
4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo |
5 | Tetris | GB | 1989.0 | Puzzle | Nintendo |
6 | New Super Mario Bros. | DS | 2006.0 | Platform | Nintendo |
genres = np.unique(vg_df['Genre'])
genres
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
'Strategy'], dtype=object)
LabelEncoder
gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings
{0: 'Action',
1: 'Adventure',
2: 'Fighting',
3: 'Misc',
4: 'Platform',
5: 'Puzzle',
6: 'Racing',
7: 'Role-Playing',
8: 'Shooter',
9: 'Simulation',
10: 'Sports',
11: 'Strategy'}
genre_labels
array([10, 4, 6, ..., 6, 5, 4])
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]
 | Name | Platform | Year | Genre | GenreLabel |
---|---|---|---|---|---|
1 | Super Mario Bros. | NES | 1985.0 | Platform | 4 |
2 | Mario Kart Wii | Wii | 2008.0 | Racing | 6 |
3 | Wii Sports Resort | Wii | 2009.0 | Sports | 10 |
4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 7 |
5 | Tetris | GB | 1989.0 | Puzzle | 5 |
6 | New Super Mario Bros. | DS | 2006.0 | Platform | 4 |
Map
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
np.unique(poke_df['Generation'])
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]
 | Name | Generation | GenerationLabel |
---|---|---|---|
4 | Octillery | Gen 2 | 2 |
5 | Helioptile | Gen 6 | 6 |
6 | Dialga | Gen 4 | 4 |
7 | DeoxysDefense Forme | Gen 3 | 3 |
8 | Rapidash | Gen 1 | 1 |
9 | Swanna | Gen 5 | 5 |
The mapping can likewise be inverted to recover the original values:
ord_gen_map = {v:k for k,v in gen_ord_map.items()}
poke_df['InvGeneration'] = poke_df['GenerationLabel'].map(ord_gen_map)
poke_df[['Name', 'Generation', 'InvGeneration']].head()
 | Name | Generation | InvGeneration |
---|---|---|---|
0 | CharizardMega Charizard Y | Gen 1 | Gen 1 |
1 | Abomasnow | Gen 4 | Gen 4 |
2 | Sentret | Gen 2 | Gen 2 |
3 | Litleo | Gen 6 | Gen 6 |
4 | Octillery | Gen 2 | Gen 2 |
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]
 | Name | Generation | Legendary |
---|---|---|---|
4 | Octillery | Gen 2 | False |
5 | Helioptile | Gen 6 | False |
6 | Dialga | Gen 4 | True |
7 | DeoxysDefense Forme | Gen 3 | True |
8 | Rapidash | Gen 1 | False |
9 | Swanna | Gen 5 | False |
# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels
# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels
poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]
 | Name | Generation | Gen_Label | Legendary | Lgnd_Label |
---|---|---|---|---|---|
4 | Octillery | Gen 2 | 1 | False | 0 |
5 | Helioptile | Gen 6 | 5 | False | 0 |
6 | Dialga | Gen 4 | 3 | True | 1 |
7 | DeoxysDefense Forme | Gen 3 | 2 | True | 1 |
8 | Rapidash | Gen 1 | 0 | False | 0 |
9 | Swanna | Gen 5 | 4 | False | 0 |
One-hot Encoding
# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
print (gen_feature_labels)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label) for cls_label in leg_le.classes_]
print (leg_feature_labels)
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)
['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']
['Legendary_False', 'Legendary_True']
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],gen_feature_labels,
['Legendary', 'Lgnd_Label'],leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]
 | Name | Generation | Gen_Label | Gen 1 | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 | Legendary | Lgnd_Label | Legendary_False | Legendary_True |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | Octillery | Gen 2 | 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 0 | 1.0 | 0.0 |
5 | Helioptile | Gen 6 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | False | 0 | 1.0 | 0.0 |
6 | Dialga | Gen 4 | 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | True | 1 | 0.0 | 1.0 |
7 | DeoxysDefense Forme | Gen 3 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | True | 1 | 0.0 | 1.0 |
8 | Rapidash | Gen 1 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 0 | 1.0 | 0.0 |
9 | Swanna | Gen 5 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | False | 0 | 1.0 | 0.0 |
Get Dummies
pandas offers a similar operation: get_dummies produces the corresponding one-hot features.
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]
 | Name | Generation | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 |
---|---|---|---|---|---|---|---|
4 | Octillery | Gen 2 | 1 | 0 | 0 | 0 | 0 |
5 | Helioptile | Gen 6 | 0 | 0 | 0 | 0 | 1 |
6 | Dialga | Gen 4 | 0 | 0 | 1 | 0 | 0 |
7 | DeoxysDefense Forme | Gen 3 | 0 | 1 | 0 | 0 | 0 |
8 | Rapidash | Gen 1 | 0 | 0 | 0 | 0 | 0 |
9 | Swanna | Gen 5 | 0 | 0 | 0 | 1 | 0 |
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]
 | Name | Generation | Gen 1 | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 |
---|---|---|---|---|---|---|---|---|
4 | Octillery | Gen 2 | 0 | 1 | 0 | 0 | 0 | 0 |
5 | Helioptile | Gen 6 | 0 | 0 | 0 | 0 | 0 | 1 |
6 | Dialga | Gen 4 | 0 | 0 | 0 | 1 | 0 | 0 |
7 | DeoxysDefense Forme | Gen 3 | 0 | 0 | 1 | 0 | 0 | 0 |
8 | Rapidash | Gen 1 | 1 | 0 | 0 | 0 | 0 | 0 |
9 | Swanna | Gen 5 | 0 | 0 | 0 | 0 | 1 | 0 |
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()
 | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | Gen 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | Gen 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | Gen 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | Gen 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | Gen 1 | False |
poke_df[['HP', 'Attack', 'Defense']].head()
 | HP | Attack | Defense |
---|---|---|---|
0 | 45 | 49 | 49 |
1 | 60 | 62 | 63 |
2 | 80 | 82 | 83 |
3 | 80 | 100 | 123 |
4 | 39 | 52 | 43 |
poke_df[['HP', 'Attack', 'Defense']].describe()
 | HP | Attack | Defense |
---|---|---|---|
count | 800.000000 | 800.000000 | 800.000000 |
mean | 69.258750 | 79.001250 | 73.842500 |
std | 25.534669 | 32.457366 | 31.183501 |
min | 1.000000 | 5.000000 | 5.000000 |
25% | 50.000000 | 55.000000 | 50.000000 |
50% | 65.000000 | 75.000000 | 70.000000 |
75% | 80.000000 | 100.000000 | 90.000000 |
max | 255.000000 | 190.000000 | 230.000000 |
popsong_df = pd.read_csv('datasets/song_views.csv', encoding='utf-8')
popsong_df.head(10)
 | user_id | song_id | title | listen_count |
---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 |
8 | e701a24d9b6c59f5ac37ab28462ca82470e27cfb | SOBONKR12A58A7A7E0 | You're The One | 68 |
9 | edc8b7b1fd592a3b69c3d823a742e1a064abec95 | SOBONKR12A58A7A7E0 | You're The One | 0 |
Binary Features
watched = np.array(popsong_df['listen_count'])
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(10)
 | user_id | song_id | title | listen_count | watched |
---|---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 | 1 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 | 1 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
8 | e701a24d9b6c59f5ac37ab28462ca82470e27cfb | SOBONKR12A58A7A7E0 | You're The One | 68 | 1 |
9 | edc8b7b1fd592a3b69c3d823a742e1a064abec95 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
# values greater than 0.9 become 1, values at or below 0.9 become 0
bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)
 | user_id | song_id | title | listen_count | watched | pd_watched |
---|---|---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 | 1 | 1 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 | 1 | 1 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
8 | e701a24d9b6c59f5ac37ab28462ca82470e27cfb | SOBONKR12A58A7A7E0 | You're The One | 68 | 1 | 1 |
9 | edc8b7b1fd592a3b69c3d823a742e1a064abec95 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
10 | fb41d1c374d093ab643ef3bcd70eeb258d479076 | SOBONKR12A58A7A7E0 | You're The One | 1 | 1 | 1 |
Polynomial Features
atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()
 | Attack | Defense |
---|---|---|
0 | 49 | 49 |
1 | 62 | 63 |
2 | 82 | 83 |
3 | 100 | 123 |
4 | 52 | 43 |
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res
array([[ 49., 49., 2401., 2401., 2401.],
[ 62., 63., 3844., 3906., 3969.],
[ 82., 83., 6724., 6806., 6889.],
...,
[ 110., 60., 12100., 6600., 3600.],
[ 160., 60., 25600., 9600., 3600.],
[ 110., 120., 12100., 13200., 14400.]])
intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)
 | Attack | Defense | Attack^2 | Attack x Defense | Defense^2 |
---|---|---|---|---|---|
0 | 49.0 | 49.0 | 2401.0 | 2401.0 | 2401.0 |
1 | 62.0 | 63.0 | 3844.0 | 3906.0 | 3969.0 |
2 | 82.0 | 83.0 | 6724.0 | 6806.0 | 6889.0 |
3 | 100.0 | 123.0 | 10000.0 | 12300.0 | 15129.0 |
4 | 52.0 | 43.0 | 2704.0 | 2236.0 | 1849.0 |
Binning Features
Discretizing continuous values
fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()
 | ID.x | EmploymentField | Age | Income |
---|---|---|---|---|
0 | cef35615d61b202f1dc794ef2746df14 | office and administrative support | 28.0 | 32000.0 |
1 | 323e5a113644d18185c743c241407754 | food and beverage | 22.0 | 15000.0 |
2 | b29a1027e5cd062e654a63764157461d | finance | 19.0 | 48000.0 |
3 | 04a11e4bcb573a1261eb0d9948d32637 | arts, entertainment, sports, or media | 26.0 | 43000.0 |
4 | 9368291c93d5d5f5c8cdb1a575e18bec | education | 20.0 | 6000.0 |
fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
Binning based on rounding
Age Range: Bin
---------------
0 - 9 : 0
10 - 19 : 1
20 - 29 : 2
30 - 39 : 3
40 - 49 : 4
50 - 59 : 5
60 - 69 : 6
... and so on
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
 | ID.x | Age | Age_bin_round |
---|---|---|---|
1071 | 6a02aa4618c99fdb3e24de522a099431 | 17.0 | 1.0 |
1072 | f0e5e47278c5f248fe861c5f7214c07a | 38.0 | 3.0 |
1073 | 6e14f6d0779b7e424fa3fdd9e4bd3bf9 | 21.0 | 2.0 |
1074 | c2654c07dc929cdf3dad4d1aec4ffbb3 | 53.0 | 5.0 |
1075 | f07449fc9339b2e57703ec7886232523 | 35.0 | 3.0 |
Quantile Binning
fcc_survey_df[['ID.x', 'Age', 'Income']].iloc[4:9]
 | ID.x | Age | Income |
---|---|---|---|
4 | 9368291c93d5d5f5c8cdb1a575e18bec | 20.0 | 6000.0 |
5 | dd0e77eab9270e4b67c19b0d6bbf621b | 34.0 | 40000.0 |
6 | 7599c0aa0419b59fd11ffede98a3665d | 23.0 | 32000.0 |
7 | 6dff182db452487f07a47596f314bddc | 35.0 | 40000.0 |
8 | 9dc233f8ed1c6eb2432672ab4bb39249 | 33.0 | 80000.0 |
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles
0.00 6000.0
0.25 20000.0
0.50 37000.0
0.75 60000.0
1.00 200000.0
Name: Income, dtype: float64
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)
ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
q=quantile_list, labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income',
'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]
 | ID.x | Age | Income | Income_quantile_range | Income_quantile_label |
---|---|---|---|---|---|
4 | 9368291c93d5d5f5c8cdb1a575e18bec | 20.0 | 6000.0 | (5999.999, 20000.0] | 0-25Q |
5 | dd0e77eab9270e4b67c19b0d6bbf621b | 34.0 | 40000.0 | (37000.0, 60000.0] | 50-75Q |
6 | 7599c0aa0419b59fd11ffede98a3665d | 23.0 | 32000.0 | (20000.0, 37000.0] | 25-50Q |
7 | 6dff182db452487f07a47596f314bddc | 35.0 | 40000.0 | (37000.0, 60000.0] | 50-75Q |
8 | 9dc233f8ed1c6eb2432672ab4bb39249 | 33.0 | 80000.0 | (60000.0, 200000.0] | 75-100Q |
Log Transform (Box-Cox)
With np.log, no base specified means the natural logarithm.
The transformed values have lower skewness and come closer to the normal distribution many models assume.
fcc_survey_df['Income_log'] = np.log(1 + fcc_survey_df['Income'])
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]
 | ID.x | Age | Income | Income_log |
---|---|---|---|---|
4 | 9368291c93d5d5f5c8cdb1a575e18bec | 20.0 | 6000.0 | 8.699681 |
5 | dd0e77eab9270e4b67c19b0d6bbf621b | 34.0 | 40000.0 | 10.596660 |
6 | 7599c0aa0419b59fd11ffede98a3665d | 23.0 | 32000.0 | 10.373522 |
7 | 6dff182db452487f07a47596f314bddc | 35.0 | 40000.0 | 10.596660 |
8 | 9dc233f8ed1c6eb2432672ab4bb39249 | 33.0 | 80000.0 | 11.289794 |
income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)
fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3')
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)
Text(11.5, 450, '$\\mu$=10.43')
Date Features
time_stamps = ['2015-03-08 10:30:00.360000+00:00', '2017-07-13 15:45:05.755000-07:00',
'2012-01-20 22:30:00.254000+05:30', '2016-12-25 00:30:00.000000+10:00']
df = pd.DataFrame(time_stamps, columns=['Time'])
df
 | Time |
---|---|
0 | 2015-03-08 10:30:00.360000+00:00 |
1 | 2017-07-13 15:45:05.755000-07:00 |
2 | 2012-01-20 22:30:00.254000+05:30 |
3 | 2016-12-25 00:30:00.000000+10:00 |
ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs
ts_objs
array([Timestamp('2015-03-08 10:30:00.360000+0000', tz='UTC'),
Timestamp('2017-07-13 15:45:05.755000-0700', tz='pytz.FixedOffset(-420)'),
Timestamp('2012-01-20 22:30:00.254000+0530', tz='pytz.FixedOffset(330)'),
Timestamp('2016-12-25 00:30:00+1000', tz='pytz.FixedOffset(600)')],
dtype=object)
df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)
df[['Time', 'Year', 'Month', 'Day', 'Quarter',
'DayOfWeek', 'DayOfYear', 'WeekOfYear']]
 | Time | Year | Month | Day | Quarter | DayOfWeek | DayOfYear | WeekOfYear |
---|---|---|---|---|---|---|---|---|
0 | 2015-03-08 10:30:00.360000+00:00 | 2015 | 3 | 8 | 1 | 6 | 67 | 10 |
1 | 2017-07-13 15:45:05.755000-07:00 | 2017 | 7 | 13 | 3 | 3 | 194 | 28 |
2 | 2012-01-20 22:30:00.254000+05:30 | 2012 | 1 | 20 | 1 | 4 | 20 | 3 |
3 | 2016-12-25 00:30:00.000000+10:00 | 2016 | 12 | 25 | 4 | 6 | 360 | 51 |
Time Features
df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
df['Minute'] = df['TS_obj'].apply(lambda d: d.minute)
df['Second'] = df['TS_obj'].apply(lambda d: d.second)
df['MUsecond'] = df['TS_obj'].apply(lambda d: d.microsecond)   # microseconds
df['UTC_offset'] = df['TS_obj'].apply(lambda d: d.utcoffset()) # UTC offset
df[['Time', 'Hour', 'Minute', 'Second', 'MUsecond', 'UTC_offset']]
 | Time | Hour | Minute | Second | MUsecond | UTC_offset |
---|---|---|---|---|---|---|
0 | 2015-03-08 10:30:00.360000+00:00 | 10 | 30 | 0 | 360000 | 00:00:00 |
1 | 2017-07-13 15:45:05.755000-07:00 | 15 | 45 | 5 | 755000 | -1 days +17:00:00 |
2 | 2012-01-20 22:30:00.254000+05:30 | 22 | 30 | 0 | 254000 | 05:30:00 |
3 | 2016-12-25 00:30:00.000000+10:00 | 0 | 30 | 0 | 0 | 10:00:00 |
Binning hours into time-of-day buckets
hour_bins = [-1, 5, 11, 16, 21, 23]
bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
df['TimeOfDayBin'] = pd.cut(df['Hour'],
bins=hour_bins, labels=bin_names)
df[['Time', 'Hour', 'TimeOfDayBin']]
 | Time | Hour | TimeOfDayBin |
---|---|---|---|
0 | 2015-03-08 10:30:00.360000+00:00 | 10 | Morning |
1 | 2017-07-13 15:45:05.755000-07:00 | 15 | Afternoon |
2 | 2012-01-20 22:30:00.254000+05:30 | 22 | Night |
3 | 2016-12-25 00:30:00.000000+10:00 | 0 | Late Night |
Text Features
Build a small text corpus
corpus = ['The sky is blue and beautiful.',
'Love this blue and beautiful sky!',
'The quick brown fox jumps over the lazy dog.',
'The brown fox is quick and the blue dog is lazy!',
'The sky is very blue and the sky is very beautiful today',
'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
 | Document | Category |
---|---|---|
0 | The sky is blue and beautiful. | weather |
1 | Love this blue and beautiful sky! | weather |
2 | The quick brown fox jumps over the lazy dog. | animals |
3 | The brown fox is quick and the blue dog is lazy! | animals |
4 | The sky is very blue and the sky is very beaut... | weather |
5 | The dog is lazy but the brown fox is quick! | animals |
Basic Preprocessing
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data] /home/sunchengquan/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
# tokenizer and stopword list
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
print (stop_words)
def normalize_document(doc):
    # lower case and remove special characters/whitespace
    # note: flags must be passed by keyword; the original re.sub(..., re.I)
    # passed re.I as the count argument by mistake
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
norm_corpus = normalize_corpus(corpus)
norm_corpus
array(['sky blue beautiful', 'love blue beautiful sky',
'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
'sky blue sky beautiful today', 'dog lazy brown fox quick'],
dtype='<U30')
Bag-of-Words Model
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
print (cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']
array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
[0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
[0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]])
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)
 | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
3 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
5 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
N-Gram Model
bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)
 | beautiful sky | beautiful today | blue beautiful | blue dog | blue sky | brown fox | dog lazy | fox jumps | fox quick | jumps lazy | lazy brown | lazy dog | love blue | quick blue | quick brown | sky beautiful | sky blue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
TF-IDF Model
TF-IDF (term frequency-inverse document frequency) is a weighting scheme widely used in information retrieval and text mining. TF (term frequency) measures how often a term occurs in a document; IDF (inverse document frequency) measures how rare the term is across the whole corpus. A term scores highly when it appears frequently in a given document yet is rare in the corpus overall.
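For reference (this matches scikit-learn's behavior as I understand it, under its defaults smooth_idf=True and norm='l2'), TfidfVectorizer computes

$tfidf(t, d) = tf(t, d) \times \left(\ln\frac{1 + n}{1 + df(t)} + 1\right)$

where $n$ is the number of documents and $df(t)$ is the number of documents containing term $t$, and then L2-normalizes each document vector, which is why every value in the matrix below lies between 0 and 1.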
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
 | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.60 | 0.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.60 | 0.00 |
1 | 0.46 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.66 | 0.00 | 0.46 | 0.00 |
2 | 0.00 | 0.00 | 0.38 | 0.38 | 0.38 | 0.54 | 0.38 | 0.00 | 0.38 | 0.00 | 0.00 |
3 | 0.00 | 0.36 | 0.42 | 0.42 | 0.42 | 0.00 | 0.42 | 0.00 | 0.42 | 0.00 | 0.00 |
4 | 0.36 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.72 | 0.52 |
5 | 0.00 | 0.00 | 0.45 | 0.45 | 0.45 | 0.00 | 0.45 | 0.00 | 0.45 | 0.00 | 0.00 |
Similarity Features
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
 | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 1.000000 | 0.753128 | 0.000000 | 0.185447 | 0.807539 | 0.000000 |
1 | 0.753128 | 1.000000 | 0.000000 | 0.139665 | 0.608181 | 0.000000 |
2 | 0.000000 | 0.000000 | 1.000000 | 0.784362 | 0.000000 | 0.839987 |
3 | 0.185447 | 0.139665 | 0.784362 | 1.000000 | 0.109653 | 0.933779 |
4 | 0.807539 | 0.608181 | 0.000000 | 0.109653 | 1.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.839987 | 0.933779 | 0.000000 | 1.000000 |
Clustering Features
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
 | Document | Category | ClusterLabel |
---|---|---|---|
0 | The sky is blue and beautiful. | weather | 0 |
1 | Love this blue and beautiful sky! | weather | 0 |
2 | The quick brown fox jumps over the lazy dog. | animals | 1 |
3 | The brown fox is quick and the blue dog is lazy! | animals | 1 |
4 | The sky is very blue and the sky is very beaut... | weather | 0 |
5 | The dog is lazy but the brown fox is quick! | animals | 1 |
Topic Model: LDA
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features
 | T1 | T2 |
---|---|---|
0 | 0.190548 | 0.809452 |
1 | 0.176804 | 0.823196 |
2 | 0.846184 | 0.153816 |
3 | 0.814863 | 0.185137 |
4 | 0.180516 | 0.819484 |
5 | 0.839172 | 0.160828 |
Topic-word weights
tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()
[('brown', 1.7273638692668467), ('dog', 1.7273638692668467), ('fox', 1.7273638692668467), ('lazy', 1.7273638692668467), ('quick', 1.7273638692668467), ('jumps', 1.0328325272484777), ('blue', 0.7731573162915626)]
[('sky', 2.264386643135622), ('beautiful', 1.9068269319456903), ('blue', 1.7996282104933266), ('love', 1.148127242397004), ('today', 1.0068251160429935)]
Word Embedding Models (word vectors): usually the first choice
Each vector carries real semantic meaning.
For example, the word vector for “like” and the word vector for “love” are similar.
from gensim.models import word2vec
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
# Set values for various parameters
feature_size = 10 # Word vector dimensionality
window_context = 10 # Context window size
min_word_count = 1 # Minimum word count
sample = 1e-3 # Downsample setting for frequent words
w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
window=window_context, min_count = min_word_count,
sample=sample)
w2v_model.wv['sky']
array([ 0.0051514 , 0.03431731, 0.02772758, 0.03358332, -0.00501862,
0.02063181, 0.00937138, 0.04554451, -0.02628018, 0.0292932 ],
dtype=float32)
Averaging word vectors is not optimal; are there better approaches? Yes, e.g., sequence models such as LSTMs.
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # index via model.wv; indexing the model directly is deprecated
            feature_vector = np.add(feature_vector, model.wv[word])
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
    return feature_vector

def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)
w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
num_features=feature_size)
pd.DataFrame(w2v_feature_array)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.024004 | 0.009261 | -0.000253 | 0.004115 | -0.019091 | -0.005033 | -0.002342 | -0.014454 | 0.002763 | 0.026265 |
1 | 0.010011 | 0.019334 | -0.010439 | 0.010748 | -0.012463 | -0.011657 | -0.007812 | -0.007604 | 0.001624 | 0.012943 |
2 | -0.011372 | -0.000493 | 0.009532 | 0.002820 | -0.008391 | 0.001193 | 0.003332 | -0.010040 | 0.007316 | -0.010241 |
3 | -0.012263 | -0.002740 | -0.000849 | -0.003037 | -0.005497 | -0.002302 | -0.011215 | -0.009798 | 0.013508 | -0.002283 |
4 | 0.021322 | 0.015926 | -0.000250 | 0.011049 | -0.002766 | 0.007039 | -0.000213 | -0.006395 | 0.003781 | 0.017381 |
5 | -0.021207 | 0.006429 | 0.004595 | 0.001258 | -0.001805 | 0.006114 | -0.004712 | -0.002169 | 0.014164 | -0.003220 |
Image Features
import skimage
from skimage import io
from skimage import color
# OpenCV or TensorFlow could be used for image I/O as well
Image shape
cat = io.imread('./images/cat.png')
dog = io.imread('./images/dog.png')
df = pd.DataFrame(['Cat', 'Dog'], columns=['Image'])
print(cat.shape, dog.shape)
(168, 300, 3) (168, 300, 3)
# pixel values range 0-255: smaller values are darker, larger values are brighter
cat
array([[[114, 105, 90],
[113, 104, 89],
[112, 103, 88],
...,
[127, 130, 121],
[130, 133, 124],
[133, 136, 127]],
[[113, 104, 89],
[112, 103, 88],
[111, 102, 87],
...,
[129, 132, 125],
[132, 135, 128],
[135, 138, 131]],
[[111, 102, 87],
[111, 102, 87],
[110, 101, 86],
...,
[132, 134, 133],
[136, 138, 137],
[139, 141, 140]],
...,
[[ 32, 26, 28],
[ 32, 26, 28],
[ 30, 24, 26],
...,
[131, 131, 131],
[131, 131, 131],
[130, 130, 130]],
[[ 33, 27, 29],
[ 32, 26, 28],
[ 31, 25, 27],
...,
[131, 131, 131],
[131, 131, 131],
[130, 130, 130]],
[[ 33, 27, 29],
[ 32, 26, 28],
[ 31, 25, 27],
...,
[131, 131, 131],
[131, 131, 131],
[130, 130, 130]]], dtype=uint8)
#coffee = skimage.transform.resize(coffee, (300, 451), mode='reflect')
fig = plt.figure(figsize = (8, 2.5))
ax1 = fig.add_subplot(1,2, 1)
ax1.imshow(cat)
ax2 = fig.add_subplot(1,2, 2)
ax2.imshow(dog)
<matplotlib.image.AxesImage at 0x7fa047fc3278>
dog_r = dog.copy() # Red Channel
dog_r[:,:,1] = dog_r[:,:,2] = 0 # set G,B pixels = 0
dog_g = dog.copy() # Green Channel
dog_g[:,:,0] = dog_g[:,:,2] = 0 # set R,B pixels = 0
dog_b = dog.copy() # Blue Channel
dog_b[:,:,0] = dog_b[:,:,1] = 0 # set R,G pixels = 0
plot_image = np.concatenate((dog_r, dog_g, dog_b), axis=1)
plt.figure(figsize = (12,2.5))
plt.imshow(plot_image)
<matplotlib.image.AxesImage at 0x7fa047ef4208>
dog_r[1,1]
array([160, 0, 0], dtype=uint8)
Grayscale
fig = plt.figure(figsize = (8,4))
ax1 = fig.add_subplot(2,2, 1)
ax1.imshow(color.rgb2gray(cat), cmap="gray" )
ax2 = fig.add_subplot(2,2, 2)
ax2.imshow(color.rgb2gray(dog), cmap='gray')
<matplotlib.image.AxesImage at 0x7fa0469ff2b0>
References:
- 梁劲, Machine Learning Study Notes (机器学习学习笔记)
- 唐宇坤, machine learning course