Machine Learning Data Preprocessing



The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools…

Preparing data is time-consuming

(figure: What data scientists spend the most time doing)

Messy data is by far the most time-consuming aspect of the typical data scientist's workflow

Data in the real world is dirty

Data is rarely clean, and you will often run into data quality issues

The typical data quality issues that arise are:

  • Incomplete: Data lacks attributes or contains missing values.
  • Noisy: Data contains erroneous records or outliers.
  • Inconsistent: Data contains conflicting records or discrepancies.

Preprocessing data to avoid “garbage in, garbage out”


Preprocessing data - Clean your data

Why deal with missing values?

  • Missing values in a dataset can be due to errors or to observations that were simply not recorded

  • When missing values are present, certain algorithms may not work or may not give the desired result.

  • Missing data affects some models more than others1

  • Even models that handle missing data can be sensitive to it (missing data for certain variables can result in poor predictions)2

1&2 Source: https://channel9.msdn.com/Events/OpenSourceTW/DevDays-Asia-2017/AI12

How to deal with missing values?

Typical missing-value handling methods are listed below (a short code sketch follows the list):

  • Deletion
    Remove records with missing values
  • Dummy substitution
    Replace missing values with a dummy value: e.g., unknown for categorical or 0 for numerical values.
  • Mean substitution
    If the missing data is numerical, replace the missing values with the mean.
  • Frequent substitution
    If the missing data is categorical, replace the missing values with the most frequent item.
  • Regression substitution
    Use a regression method to replace missing values with regressed values.
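
A minimal sketch of these strategies with pandas and scikit-learn; the small DataFrame df below is hypothetical and only illustrates the ideas:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data with missing entries
df = pd.DataFrame({'age': [25, np.nan, 40, 31],
                   'city': ['NY', 'SF', np.nan, 'NY']})

# Deletion: drop records with any missing value
dropped = df.dropna()

# Dummy substitution: 'unknown' for categorical, 0 for numerical
dummy = df.fillna({'age': 0, 'city': 'unknown'})

# Mean substitution for a numerical column
df['age_filled'] = SimpleImputer(strategy='mean').fit_transform(df[['age']]).ravel()

# Most-frequent substitution for a categorical column
df['city_filled'] = df['city'].fillna(df['city'].mode()[0])

# Regression substitution would fit a model on the complete rows and predict the
# missing entries (e.g. sklearn.impute.IterativeImputer, which is experimental).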


What you should know about the outliers/anomalies

  • Outliers may bring about problems by distorting the predictive model.
  • What counts as an outlier is somewhat subjective.1
  • Outliers can be very common in multidimensional data.2
  • Some models are less sensitive (more robust) to outliers than others.3
  • Outliers can be the result of bad data collection, or they can be legitimate extreme (or unusual) values.4
  • Sometimes outliers are the interesting data points we want to model, and other times they just get in the way.5


1,2,3,4 &5 Source: https://channel9.msdn.com/Events/OpenSourceTW/DevDays-Asia-2017/AI12

Causes of outliers

(figure omitted)
How to deal with outliers?

The choice of how to deal with an outlier should depend on the cause.

  • Keep outliers
    Outliers should not necessarily be omitted from the analysis as they may be genuine observations in the data.

     In many applications, outliers provide crucial information. For example, in a credit card fraud detection application, they indicate purchases that fall outside a customer's usual buying patterns.1
    
  • Exclude outliers

    • There are two common approaches to excluding outliers2

      • Trimming/Truncation: Trimming discards the outliers

      • Winsorising: Winsorising replaces the outliers with the nearest “non-suspect” data.

1 Source: MathWorks; box plot figure source: http://www.physics.csbsju.edu/stats/box2.html

2 source: https://en.wikipedia.org/wiki/Outlier#Working_with_outliers

Examples of how to deal with outliers
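
The original example figures did not survive, so here is a minimal, hypothetical sketch of both approaches using the common 1.5 × IQR rule; the numbers are made up, and scipy's winsorize is just one of several possible tools:

import numpy as np
from scipy.stats.mstats import winsorize

income = np.array([3200., 1500., 4800., 4300., 600., 200000.])  # one extreme value

# flag outliers with the 1.5 * IQR rule
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (income < lower) | (income > upper)

# Trimming/Truncation: discard the outliers
trimmed = income[~is_outlier]

# Winsorising: pull the most extreme 10% on each tail back to the nearest kept value,
# or clip values to the IQR fences computed above
winsorized = winsorize(income, limits=[0.1, 0.1])
clipped = np.clip(income, lower, upper)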

Preprocessing data - Data normalization

How to normalize data?

Data normalization re-scales numerical values to a specified range. Popular methods include min-max scaling, z-score standardization, and robust scaling; see the scikit-learn code examples later in this post.

Preprocessing data - Data discretization

How to discretize data?

A numeric variable may have many different values, and for some algorithms this may lead to very complex models. You can convert continuous attributes to categorical attributes by "binning" for ease of use with certain machine learning methods.

Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns1.

Binning

  • Binning helps to improve model performance. It captures non-linear behavior of continuous variables.

  • It minimizes the impact of outliers and removes "noise" from large numbers of distinct values.

  • It makes the models more explainable – grouped values are easier to display and understand. It improves model build speed – predictive algorithms build much faster as the number of distinct values decreases.

Preprocessing data - Data reduction

How to reduce data?

There are various methods to reduce data size for easier data handling. Depending on data size and the domain, the following methods can be applied:

  • Record Sampling
    Sample the data records and only choose the representative subset from the data.
  • Attribute Sampling
    Select only a subset of the most important attributes from the data.
  • Aggregation
    Divide the data into groups and store the numbers for each group. For example, the daily revenue numbers of a restaurant chain over the past 20 years can be aggregated to monthly revenue to reduce the size of the data.

Sample Data

If the dataset you plan to analyze is large, it’s usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. This facilitates data understanding, exploration, and feature engineering.

More data can result in much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data that may be much faster for exploring and prototyping solutions before considering the whole dataset.
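
A minimal sketch of the three reduction strategies with pandas; the daily_revenue frame below is synthetic and only stands in for the restaurant example above:

import numpy as np
import pandas as pd

# synthetic transaction-level data standing in for ~20 years of daily revenue
rng = np.random.default_rng(42)
daily_revenue = pd.DataFrame({
    'date': pd.date_range('2000-01-01', periods=7300, freq='D'),
    'store': rng.integers(1, 50, size=7300),
    'revenue': rng.gamma(shape=2.0, scale=500.0, size=7300),
})

# Record sampling: keep a representative 10% of the rows
sample = daily_revenue.sample(frac=0.1, random_state=42)

# Attribute sampling: keep only the columns you actually need
subset = daily_revenue[['date', 'revenue']]

# Aggregation: roll daily revenue up to monthly totals
monthly = daily_revenue.set_index('date')['revenue'].resample('M').sum()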

Preprocessing data - Text cleaning

How to clean text data?

Improper handling of text encoding while writing or reading text leads to information loss, inadvertent introduction of unreadable characters (e.g., nulls), and may also affect text parsing.

Unstructured text such as tweets, product reviews, or search queries usually requires some preprocessing before it can be analyzed.

For example (a short code sketch follows the list):

  • replacing special characters and punctuation marks with spaces

  • normalizing case

  • removing duplicate characters

  • removing user-defined or built-in stop-words

  • word stemming
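
A rough sketch of these steps with nltk; the helper clean_text below is illustrative and not part of the original post (a fuller normalization function appears later in the notebook section):

import re
import nltk

nltk.download('stopwords', quiet=True)
stop_words = set(nltk.corpus.stopwords.words('english'))
stemmer = nltk.stem.PorterStemmer()

def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)          # special characters/punctuation -> spaces
    text = re.sub(r'(\w)\1{2,}', r'\1', text)            # collapse runs of duplicate characters
    tokens = text.lower().split()                        # normalize case and tokenize
    tokens = [t for t in tokens if t not in stop_words]  # drop built-in stop-words
    return ' '.join(stemmer.stem(t) for t in tokens)     # word stemming

clean_text('Love this bluuue and beautiful sky!!!')  # roughly: 'love blue beauti sky'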

The remainder of this post covers three topics:

  • Data Preprocessing

  • Feature Engineering

  • Feature Selection

Data Preprocessing

Standardization

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
import pandas as pd

views = pd.DataFrame([1295., 25., 19000., 5., 1., 300.], columns=['views'])
views
     views
0   1295.0
1     25.0
2  19000.0
3      5.0
4      1.0
5    300.0

Standard Scaler: $\frac{x_i - \mu}{\sigma}$

ss = StandardScaler()
views['zscore'] = ss.fit_transform(views[['views']])
views
     views    zscore
0   1295.0 -0.307214
1     25.0 -0.489306
2  19000.0  2.231317
3      5.0 -0.492173
4      1.0 -0.492747
5    300.0 -0.449877
vw = np.array(views['views'])
(vw[0] - np.mean(vw)) / np.std(vw)
-0.30721413311687235

Min-Max Scaler: $\frac{x_i - \min(x)}{\max(x) - \min(x)}$

mms = MinMaxScaler()
views['minmax'] = mms.fit_transform(views[['views']])
views
     views    zscore    minmax
0   1295.0 -0.307214  0.068109
1     25.0 -0.489306  0.001263
2  19000.0  2.231317  1.000000
3      5.0 -0.492173  0.000211
4      1.0 -0.492747  0.000000
5    300.0 -0.449877  0.015738
(vw[0] - np.min(vw)) / (np.max(vw) - np.min(vw))
0.06810884783409653

Robust Scaler: $\frac{x_i - \mathrm{median}(x)}{IQR_{(1,3)}(x)}$

rs = RobustScaler()
views['robust'] = rs.fit_transform(views[['views']])
views
     views    zscore    minmax     robust
0   1295.0 -0.307214  0.068109   1.092883
1     25.0 -0.489306  0.001263  -0.132690
2  19000.0  2.231317  1.000000  18.178528
3      5.0 -0.492173  0.000211  -0.151990
4      1.0 -0.492747  0.000000  -0.155850
5    300.0 -0.449877  0.015738   0.132690
quartiles = np.percentile(vw, (25., 75.))
iqr = quartiles[1] - quartiles[0]
(vw[0] - np.median(vw)) / iqr
1.0928829915560916

Feature Engineering

Feature Engineering is the key task in machine learning

There is no formal definition of feature engineering; it means different things to different people. In Google's definition, the process of extracting features from raw data is called feature engineering. In Microsoft's documentation, feature engineering is more about feature construction.

Feature Engineering is something of an art

Feature engineering requires a creative combination of domain expertise and insights obtained from the data exploration step.

This is a balancing act of finding and including informative variables while avoiding too many unrelated variables.

Informative variables improve our results; unrelated variables introduce unnecessary noise into the model.

What is a feature?

(figure omitted)

Feature Engineering can augment your data

(figures omitted)

Why should you perform Feature Selection?

Curse of dimensionality

The curse of dimensionality refers to how certain learning algorithms may perform poorly in high-dimensional data.

For example, after a certain point, increasing the dimensionality of the problem by adding new features can actually degrade the performance of a classifier. This is illustrated by the figure below and is often referred to as 'the curse of dimensionality'.2

(figure omitted)

Summary: the modern approaches for Feature Selection


  • Filter methods use statistical measures to evaluate features, while wrapper methods use cross-validation of an actual model.

  • Filter methods are much faster than wrapper methods because they do not involve training models; wrapper methods, by contrast, are computationally very expensive.

  • Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods search the feature space more thoroughly and usually find a better subset.

  • Using the subset of features from a wrapper method makes the model more prone to overfitting than using the subset from a filter method.

  • Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection (a short sketch of all three approaches follows this list).
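
A minimal scikit-learn sketch of the three families on a synthetic dataset; the particular estimators chosen here are illustrative, not the only options:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature independently with a statistical test (ANOVA F-test)
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: repeatedly fit a model and drop the weakest features (recursive feature elimination)
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: let an L1-penalised model zero out unhelpful coefficients
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
X_embedded = SelectFromModel(l1_model).fit_transform(X, y)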

Numerical Features

Handling Discrete Values

import re
import nltk
import pytz
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import scipy.stats as spstats
import datetime
from dateutil.parser import parse
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

%matplotlib inline
mpl.style.reload_library()
mpl.style.use('classic')
mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
# mpl.rcParams['figure.figsize'] = [8.0, 5.0]
mpl.rcParams['figure.dpi'] = 100
%config InlineBackend.figure_format = 'retina'

vg_df = pd.read_csv('datasets/vgsales.csv', encoding = "ISO-8859-1")
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]
   Name                      Platform    Year  Genre         Publisher
1  Super Mario Bros.         NES       1985.0  Platform      Nintendo
2  Mario Kart Wii            Wii       2008.0  Racing        Nintendo
3  Wii Sports Resort         Wii       2009.0  Sports        Nintendo
4  Pokemon Red/Pokemon Blue  GB        1996.0  Role-Playing  Nintendo
5  Tetris                    GB        1989.0  Puzzle        Nintendo
6  New Super Mario Bros.     DS        2006.0  Platform      Nintendo
genres = np.unique(vg_df['Genre'])
genres
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

LabelEncoder

gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings
{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}
genre_labels
array([10,  4,  6, ...,  6,  5,  4])
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]
   Name                      Platform    Year  Genre         GenreLabel
1  Super Mario Bros.         NES       1985.0  Platform               4
2  Mario Kart Wii            Wii       2008.0  Racing                 6
3  Wii Sports Resort         Wii       2009.0  Sports                10
4  Pokemon Red/Pokemon Blue  GB        1996.0  Role-Playing           7
5  Tetris                    GB        1989.0  Puzzle                 5
6  New Super Mario Bros.     DS        2006.0  Platform               4

Map

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)

np.unique(poke_df['Generation'])
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]
   Name                 Generation  GenerationLabel
4  Octillery            Gen 2                     2
5  Helioptile           Gen 6                     6
6  Dialga               Gen 4                     4
7  DeoxysDefense Forme  Gen 3                     3
8  Rapidash             Gen 1                     1
9  Swanna               Gen 5                     5

We can also map the encoded labels back to the original values:

ord_gen_map = {v:k for k,v in gen_ord_map.items()}
poke_df['InvGeneration'] = poke_df['GenerationLabel'].map(ord_gen_map)
poke_df[['Name', 'Generation', 'InvGeneration']].head()
   Name                       Generation  InvGeneration
0  CharizardMega Charizard Y  Gen 1       Gen 1
1  Abomasnow                  Gen 4       Gen 4
2  Sentret                    Gen 2       Gen 2
3  Litleo                     Gen 6       Gen 6
4  Octillery                  Gen 2       Gen 2
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]
   Name                 Generation  Legendary
4  Octillery            Gen 2       False
5  Helioptile           Gen 6       False
6  Dialga               Gen 4       True
7  DeoxysDefense Forme  Gen 3       True
8  Rapidash             Gen 1       False
9  Swanna               Gen 5       False
# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels

# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]
   Name                 Generation  Gen_Label  Legendary  Lgnd_Label
4  Octillery            Gen 2               1  False               0
5  Helioptile           Gen 6               5  False               0
6  Dialga               Gen 4               3  True                1
7  DeoxysDefense Forme  Gen 3               2  True                1
8  Rapidash             Gen 1               0  False               0
9  Swanna               Gen 5               4  False               0
One-hot Encoding
# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
print (gen_feature_labels)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label) for cls_label in leg_le.classes_]
print (leg_feature_labels)
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)
['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']
['Legendary_False', 'Legendary_True']
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],gen_feature_labels,
              ['Legendary', 'Lgnd_Label'],leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]
   Name                 Generation  Gen_Label  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6  Legendary  Lgnd_Label  Legendary_False  Legendary_True
4  Octillery            Gen 2               1    0.0    1.0    0.0    0.0    0.0    0.0  False               0              1.0             0.0
5  Helioptile           Gen 6               5    0.0    0.0    0.0    0.0    0.0    1.0  False               0              1.0             0.0
6  Dialga               Gen 4               3    0.0    0.0    0.0    1.0    0.0    0.0  True                1              0.0             1.0
7  DeoxysDefense Forme  Gen 3               2    0.0    0.0    1.0    0.0    0.0    0.0  True                1              0.0             1.0
8  Rapidash             Gen 1               0    1.0    0.0    0.0    0.0    0.0    0.0  False               0              1.0             0.0
9  Swanna               Gen 5               4    0.0    0.0    0.0    0.0    1.0    0.0  False               0              1.0             0.0
Get Dummies

pandas offers a similar operation; get_dummies produces the corresponding features as well:

gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]
   Name                 Generation  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6
4  Octillery            Gen 2           1      0      0      0      0
5  Helioptile           Gen 6           0      0      0      0      1
6  Dialga               Gen 4           0      0      1      0      0
7  DeoxysDefense Forme  Gen 3           0      1      0      0      0
8  Rapidash             Gen 1           0      0      0      0      0
9  Swanna               Gen 5           0      0      0      1      0
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]
   Name                 Generation  Gen 1  Gen 2  Gen 3  Gen 4  Gen 5  Gen 6
4  Octillery            Gen 2           0      1      0      0      0      0
5  Helioptile           Gen 6           0      0      0      0      0      1
6  Dialga               Gen 4           0      0      0      1      0      0
7  DeoxysDefense Forme  Gen 3           0      0      1      0      0      0
8  Rapidash             Gen 1           1      0      0      0      0      0
9  Swanna               Gen 5           0      0      0      0      1      0
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()
   #  Name                   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1  Bulbasaur              Grass   Poison    318  45      49       49       65       65     45  Gen 1       False
1  2  Ivysaur                Grass   Poison    405  60      62       63       80       80     60  Gen 1       False
2  3  Venusaur               Grass   Poison    525  80      82       83      100      100     80  Gen 1       False
3  3  VenusaurMega Venusaur  Grass   Poison    625  80     100      123      122      120     80  Gen 1       False
4  4  Charmander             Fire    NaN       309  39      52       43       60       50     65  Gen 1       False
poke_df[['HP', 'Attack', 'Defense']].head()
   HP  Attack  Defense
0  45      49       49
1  60      62       63
2  80      82       83
3  80     100      123
4  39      52       43
poke_df[['HP', 'Attack', 'Defense']].describe()
               HP      Attack     Defense
count  800.000000  800.000000  800.000000
mean    69.258750   79.001250   73.842500
std     25.534669   32.457366   31.183501
min      1.000000    5.000000    5.000000
25%     50.000000   55.000000   50.000000
50%     65.000000   75.000000   70.000000
75%     80.000000  100.000000   90.000000
max    255.000000  190.000000  230.000000
popsong_df = pd.read_csv('datasets/song_views.csv', encoding='utf-8')
popsong_df.head(10)
   user_id                                   song_id             title           listen_count
0  b6b799f34a204bd928ea014c243ddad6d0be4f8f  SOBONKR12A58A7A7E0  You're The One             2
1  b41ead730ac14f6b6717b9cf8859d5579f3f8d4d  SOBONKR12A58A7A7E0  You're The One             0
2  4c84359a164b161496d05282707cecbd50adbfc4  SOBONKR12A58A7A7E0  You're The One             0
3  779b5908593756abb6ff7586177c966022668b06  SOBONKR12A58A7A7E0  You're The One             0
4  dd88ea94f605a63d9fc37a214127e3f00e85e42d  SOBONKR12A58A7A7E0  You're The One             0
5  68f0359a2f1cedb0d15c98d88017281db79f9bc6  SOBONKR12A58A7A7E0  You're The One             0
6  116a4c95d63623a967edf2f3456c90ebbf964e6f  SOBONKR12A58A7A7E0  You're The One            17
7  45544491ccfcdc0b0803c34f201a6287ed4e30f8  SOBONKR12A58A7A7E0  You're The One             0
8  e701a24d9b6c59f5ac37ab28462ca82470e27cfb  SOBONKR12A58A7A7E0  You're The One            68
9  edc8b7b1fd592a3b69c3d823a742e1a064abec95  SOBONKR12A58A7A7E0  You're The One             0

Binary Features

watched = np.array(popsong_df['listen_count']) 
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(10)
   user_id                                   song_id             title           listen_count  watched
0  b6b799f34a204bd928ea014c243ddad6d0be4f8f  SOBONKR12A58A7A7E0  You're The One             2        1
1  b41ead730ac14f6b6717b9cf8859d5579f3f8d4d  SOBONKR12A58A7A7E0  You're The One             0        0
2  4c84359a164b161496d05282707cecbd50adbfc4  SOBONKR12A58A7A7E0  You're The One             0        0
3  779b5908593756abb6ff7586177c966022668b06  SOBONKR12A58A7A7E0  You're The One             0        0
4  dd88ea94f605a63d9fc37a214127e3f00e85e42d  SOBONKR12A58A7A7E0  You're The One             0        0
5  68f0359a2f1cedb0d15c98d88017281db79f9bc6  SOBONKR12A58A7A7E0  You're The One             0        0
6  116a4c95d63623a967edf2f3456c90ebbf964e6f  SOBONKR12A58A7A7E0  You're The One            17        1
7  45544491ccfcdc0b0803c34f201a6287ed4e30f8  SOBONKR12A58A7A7E0  You're The One             0        0
8  e701a24d9b6c59f5ac37ab28462ca82470e27cfb  SOBONKR12A58A7A7E0  You're The One            68        1
9  edc8b7b1fd592a3b69c3d823a742e1a064abec95  SOBONKR12A58A7A7E0  You're The One             0        0
# values greater than 0.9 become 1, all others become 0
bn = Binarizer(threshold=0.9) 
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)
    user_id                                   song_id             title           listen_count  watched  pd_watched
0   b6b799f34a204bd928ea014c243ddad6d0be4f8f  SOBONKR12A58A7A7E0  You're The One             2        1           1
1   b41ead730ac14f6b6717b9cf8859d5579f3f8d4d  SOBONKR12A58A7A7E0  You're The One             0        0           0
2   4c84359a164b161496d05282707cecbd50adbfc4  SOBONKR12A58A7A7E0  You're The One             0        0           0
3   779b5908593756abb6ff7586177c966022668b06  SOBONKR12A58A7A7E0  You're The One             0        0           0
4   dd88ea94f605a63d9fc37a214127e3f00e85e42d  SOBONKR12A58A7A7E0  You're The One             0        0           0
5   68f0359a2f1cedb0d15c98d88017281db79f9bc6  SOBONKR12A58A7A7E0  You're The One             0        0           0
6   116a4c95d63623a967edf2f3456c90ebbf964e6f  SOBONKR12A58A7A7E0  You're The One            17        1           1
7   45544491ccfcdc0b0803c34f201a6287ed4e30f8  SOBONKR12A58A7A7E0  You're The One             0        0           0
8   e701a24d9b6c59f5ac37ab28462ca82470e27cfb  SOBONKR12A58A7A7E0  You're The One            68        1           1
9   edc8b7b1fd592a3b69c3d823a742e1a064abec95  SOBONKR12A58A7A7E0  You're The One             0        0           0
10  fb41d1c374d093ab643ef3bcd70eeb258d479076  SOBONKR12A58A7A7E0  You're The One             1        1           1

Polynomial Features

atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()
   Attack  Defense
0      49       49
1      62       63
2      82       83
3     100      123
4      52       43
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res
array([[   49.,    49.,  2401.,  2401.,  2401.],
       [   62.,    63.,  3844.,  3906.,  3969.],
       [   82.,    83.,  6724.,  6806.,  6889.],
       ...,
       [  110.,    60., 12100.,  6600.,  3600.],
       [  160.,    60., 25600.,  9600.,  3600.],
       [  110.,   120., 12100., 13200., 14400.]])
intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)
   Attack  Defense  Attack^2  Attack x Defense  Defense^2
0    49.0     49.0    2401.0            2401.0     2401.0
1    62.0     63.0    3844.0            3906.0     3969.0
2    82.0     83.0    6724.0            6806.0     6889.0
3   100.0    123.0   10000.0           12300.0    15129.0
4    52.0     43.0    2704.0            2236.0     1849.0

Binning Features

Discretization of continuous values

fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()
   ID.x                              EmploymentField                         Age   Income
0  cef35615d61b202f1dc794ef2746df14  office and administrative support      28.0  32000.0
1  323e5a113644d18185c743c241407754  food and beverage                      22.0  15000.0
2  b29a1027e5cd062e654a63764157461d  finance                                19.0  48000.0
3  04a11e4bcb573a1261eb0d9948d32637  arts, entertainment, sports, or media  26.0  43000.0
4  9368291c93d5d5f5c8cdb1a575e18bec  education                              20.0   6000.0
fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')

(figure: Developer Age Histogram)

Binning based on rounding
Age Range: Bin
---------------
 0 -  9  : 0
10 - 19  : 1
20 - 29  : 2
30 - 39  : 3
40 - 49  : 4
50 - 59  : 5
60 - 69  : 6
  ... and so on
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
      ID.x                               Age  Age_bin_round
1071  6a02aa4618c99fdb3e24de522a099431  17.0            1.0
1072  f0e5e47278c5f248fe861c5f7214c07a  38.0            3.0
1073  6e14f6d0779b7e424fa3fdd9e4bd3bf9  21.0            2.0
1074  c2654c07dc929cdf3dad4d1aec4ffbb3  53.0            5.0
1075  f07449fc9339b2e57703ec7886232523  35.0            3.0

Quantile-based binning

fcc_survey_df[['ID.x', 'Age', 'Income']].iloc[4:9]
   ID.x                               Age   Income
4  9368291c93d5d5f5c8cdb1a575e18bec  20.0   6000.0
5  dd0e77eab9270e4b67c19b0d6bbf621b  34.0  40000.0
6  7599c0aa0419b59fd11ffede98a3665d  23.0  32000.0
7  6dff182db452487f07a47596f314bddc  35.0  40000.0
8  9dc233f8ed1c6eb2432672ab4bb39249  33.0  80000.0
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')

(figure: Developer Income Histogram)

quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles
0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0
Name: Income, dtype: float64
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')

for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)

ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')

(figure: Developer Income Histogram with Quantiles)

quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'], 
                                                 q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'], 
                                                 q=quantile_list, labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income', 
               'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]
   ID.x                               Age   Income  Income_quantile_range  Income_quantile_label
4  9368291c93d5d5f5c8cdb1a575e18bec  20.0   6000.0  (5999.999, 20000.0]    0-25Q
5  dd0e77eab9270e4b67c19b0d6bbf621b  34.0  40000.0  (37000.0, 60000.0]     50-75Q
6  7599c0aa0419b59fd11ffede98a3665d  23.0  32000.0  (20000.0, 37000.0]     25-50Q
7  6dff182db452487f07a47596f314bddc  35.0  40000.0  (37000.0, 60000.0]     50-75Q
8  9dc233f8ed1c6eb2432672ab4bb39249  33.0  80000.0  (60000.0, 200000.0]    75-100Q

Log transform and Box-Cox

np.log with no base specified defaults to the natural logarithm.

The transformed values have lower skewness and come closer to a normal distribution.

fcc_survey_df['Income_log'] = np.log((1+ fcc_survey_df['Income']))
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]
   ID.x                               Age   Income  Income_log
4  9368291c93d5d5f5c8cdb1a575e18bec  20.0   6000.0    8.699681
5  dd0e77eab9270e4b67c19b0d6bbf621b  34.0  40000.0   10.596660
6  7599c0aa0419b59fd11ffede98a3665d  23.0  32000.0   10.373522
7  6dff182db452487f07a47596f314bddc  35.0  40000.0   10.596660
8  9dc233f8ed1c6eb2432672ab4bb39249  33.0  80000.0   11.289794
income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)

fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3')
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)
Text(11.5, 450, '$\\mu$=10.43')

(figure: Developer Income Histogram after Log Transform)
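
The heading above also mentions Box-Cox, but the code only applies the log transform. A minimal sketch with scipy (already imported as spstats at the top of this notebook) on the same Income column; the variable names below are mine:

# Box-Cox requires strictly positive input; Income in this dataset is positive
income = fcc_survey_df['Income'].dropna()

# let scipy estimate the optimal lambda by maximum likelihood
income_boxcox, opt_lambda = spstats.boxcox(income)
print('optimal lambda:', opt_lambda)

# lmbda=0 reduces Box-Cox to the plain log transform used above
income_boxcox_log = spstats.boxcox(income, lmbda=0)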

Date-related Features

time_stamps = ['2015-03-08 10:30:00.360000+00:00', '2017-07-13 15:45:05.755000-07:00',
               '2012-01-20 22:30:00.254000+05:30', '2016-12-25 00:30:00.000000+10:00']
df = pd.DataFrame(time_stamps, columns=['Time'])
df
   Time
0  2015-03-08 10:30:00.360000+00:00
1  2017-07-13 15:45:05.755000-07:00
2  2012-01-20 22:30:00.254000+05:30
3  2016-12-25 00:30:00.000000+10:00
ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs
ts_objs
array([Timestamp('2015-03-08 10:30:00.360000+0000', tz='UTC'),
       Timestamp('2017-07-13 15:45:05.755000-0700', tz='pytz.FixedOffset(-420)'),
       Timestamp('2012-01-20 22:30:00.254000+0530', tz='pytz.FixedOffset(330)'),
       Timestamp('2016-12-25 00:30:00+1000', tz='pytz.FixedOffset(600)')],
      dtype=object)
df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)

df[['Time', 'Year', 'Month', 'Day', 'Quarter', 
    'DayOfWeek', 'DayOfYear', 'WeekOfYear']]
   Time                              Year  Month  Day  Quarter  DayOfWeek  DayOfYear  WeekOfYear
0  2015-03-08 10:30:00.360000+00:00  2015      3    8        1          6         67          10
1  2017-07-13 15:45:05.755000-07:00  2017      7   13        3          3        194          28
2  2012-01-20 22:30:00.254000+05:30  2012      1   20        1          4         20           3
3  2016-12-25 00:30:00.000000+10:00  2016     12   25        4          6        360          51

Time-related Features

df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
df['Minute'] = df['TS_obj'].apply(lambda d: d.minute)
df['Second'] = df['TS_obj'].apply(lambda d: d.second)
df['MUsecond'] = df['TS_obj'].apply(lambda d: d.microsecond)   # microsecond
df['UTC_offset'] = df['TS_obj'].apply(lambda d: d.utcoffset()) # UTC offset

df[['Time', 'Hour', 'Minute', 'Second', 'MUsecond', 'UTC_offset']]
   Time                              Hour  Minute  Second  MUsecond  UTC_offset
0  2015-03-08 10:30:00.360000+00:00    10      30       0    360000  00:00:00
1  2017-07-13 15:45:05.755000-07:00    15      45       5    755000  -1 days +17:00:00
2  2012-01-20 22:30:00.254000+05:30    22      30       0    254000  05:30:00
3  2016-12-25 00:30:00.000000+10:00     0      30       0         0  10:00:00

Binning hours into times of day

hour_bins = [-1, 5, 11, 16, 21, 23]
bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
df['TimeOfDayBin'] = pd.cut(df['Hour'], 
                            bins=hour_bins, labels=bin_names)
df[['Time', 'Hour', 'TimeOfDayBin']]
   Time                              Hour  TimeOfDayBin
0  2015-03-08 10:30:00.360000+00:00    10  Morning
1  2017-07-13 15:45:05.755000-07:00    15  Afternoon
2  2012-01-20 22:30:00.254000+05:30    22  Night
3  2016-12-25 00:30:00.000000+10:00     0  Late Night

Text Features

Build a small text corpus

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
   Document                                            Category
0  The sky is blue and beautiful.                      weather
1  Love this blue and beautiful sky!                   weather
2  The quick brown fox jumps over the lazy dog.        animals
3  The brown fox is quick and the blue dog is lazy!    animals
4  The sky is very blue and the sky is very beaut...   weather
5  The dog is lazy but the brown fox is quick!         animals

Basic preprocessing

nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sunchengquan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.





True
# tokenizer and stop words
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
print (stop_words)
def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
norm_corpus = normalize_corpus(corpus)
norm_corpus
#The sky is blue and beautiful.
array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
       'sky blue sky beautiful today', 'dog lazy brown fox quick'],
      dtype='<U30')

Bag-of-Words Model

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
print (cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']





array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]])
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)
   beautiful  blue  brown  dog  fox  jumps  lazy  love  quick  sky  today
0          1     1      0    0    0      0     0     0      0    1      0
1          1     1      0    0    0      0     0     1      0    1      0
2          0     0      1    1    1      1     1     0      1    0      0
3          0     1      1    1    1      0     1     0      1    0      0
4          1     1      0    0    0      0     0     0      0    2      1
5          0     0      1    1    1      0     1     0      1    0      0

N-Gram Model

bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)
   beautiful sky  beautiful today  blue beautiful  blue dog  blue sky  brown fox  dog lazy  fox jumps  fox quick  jumps lazy  lazy brown  lazy dog  love blue  quick blue  quick brown  sky beautiful  sky blue
0              0                0               1         0         0          0         0          0          0           0           0         0          0           0            0              0         1
1              1                0               1         0         0          0         0          0          0           0           0         0          1           0            0              0         0
2              0                0               0         0         0          1         0          1          0           1           0         1          0           0            1              0         0
3              0                0               0         1         0          1         1          0          1           0           0         0          0           1            0              0         0
4              0                1               0         0         1          0         0          0          0           0           0         0          0           0            0              1         1
5              0                0               0         0         0          1         1          0          1           0           1         0          0           0            0              0         0

TF-IDF Model

TF-IDF (term frequency–inverse document frequency) is a common weighting technique in information retrieval and data mining. TF stands for term frequency; IDF stands for inverse document frequency, which captures how rare a term is across the whole corpus even though it may appear frequently in a particular document.

from sklearn.feature_extraction.text import TfidfVectorizer  # classic example: terms such as '中国', '蜜蜂', '养殖', each with a term frequency of 20
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
   beautiful  blue  brown   dog   fox  jumps  lazy  love  quick   sky  today
0       0.60  0.52   0.00  0.00  0.00   0.00  0.00  0.00   0.00  0.60   0.00
1       0.46  0.39   0.00  0.00  0.00   0.00  0.00  0.66   0.00  0.46   0.00
2       0.00  0.00   0.38  0.38  0.38   0.54  0.38  0.00   0.38  0.00   0.00
3       0.00  0.36   0.42  0.42  0.42   0.00  0.42  0.00   0.42  0.00   0.00
4       0.36  0.31   0.00  0.00  0.00   0.00  0.00  0.00   0.00  0.72   0.52
5       0.00  0.00   0.45  0.45  0.45   0.00  0.45  0.00   0.45  0.00   0.00

Similarity Features

from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
          0         1         2         3         4         5
0  1.000000  0.753128  0.000000  0.185447  0.807539  0.000000
1  0.753128  1.000000  0.000000  0.139665  0.608181  0.000000
2  0.000000  0.000000  1.000000  0.784362  0.000000  0.839987
3  0.185447  0.139665  0.784362  1.000000  0.109653  0.933779
4  0.807539  0.608181  0.000000  0.109653  1.000000  0.000000
5  0.000000  0.000000  0.839987  0.933779  0.000000  1.000000
Clustering Features
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
   Document                                            Category  ClusterLabel
0  The sky is blue and beautiful.                      weather              0
1  Love this blue and beautiful sky!                   weather              0
2  The quick brown fox jumps over the lazy dog.        animals              1
3  The brown fox is quick and the blue dog is lazy!    animals              1
4  The sky is very blue and the sky is very beaut...   weather              0
5  The dog is lazy but the brown fox is quick!         animals              1

Topic Model: LDA

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features
         T1        T2
0  0.190548  0.809452
1  0.176804  0.823196
2  0.846184  0.153816
3  0.814863  0.185137
4  0.180516  0.819484
5  0.839172  0.160828
Topic-word weights
tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()
[('brown', 1.7273638692668467), ('dog', 1.7273638692668467), ('fox', 1.7273638692668467), ('lazy', 1.7273638692668467), ('quick', 1.7273638692668467), ('jumps', 1.0328325272484777), ('blue', 0.7731573162915626)]

[('sky', 2.264386643135622), ('beautiful', 1.9068269319456903), ('blue', 1.7996282104933266), ('love', 1.148127242397004), ('today', 1.0068251160429935)]

Word embedding (word vector) models, often the preferred choice

Each vector carries real semantic meaning.

For example, the word vector for "喜欢" (like) and the word vector for "爱" (love) are similar.

from gensim.models import word2vec

wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

# Set values for various parameters
feature_size = 10    # Word vector dimensionality  
window_context = 10          # Context window size                                                                                    
min_word_count = 1   # Minimum word count                        
sample = 1e-3   # Downsample setting for frequent words

w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size, 
                          window=window_context, min_count = min_word_count,
                          sample=sample)
w2v_model.wv['sky']
array([ 0.0051514 ,  0.03431731,  0.02772758,  0.03358332, -0.00501862,
        0.02063181,  0.00937138,  0.04554451, -0.02628018,  0.0292932 ],
      dtype=float32)

Averaging word vectors is not optimal; are there better approaches? For example, feeding the per-word vectors to an LSTM.

def average_word_vectors(words, model, vocabulary, num_features):
    
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
    
    for word in words:
        if word in vocabulary: 
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model.wv[word])
    
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
        
    return feature_vector
    
def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)
w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array) #lstm
          0         1         2         3         4         5         6         7         8         9
0  0.024004  0.009261 -0.000253  0.004115 -0.019091 -0.005033 -0.002342 -0.014454  0.002763  0.026265
1  0.010011  0.019334 -0.010439  0.010748 -0.012463 -0.011657 -0.007812 -0.007604  0.001624  0.012943
2 -0.011372 -0.000493  0.009532  0.002820 -0.008391  0.001193  0.003332 -0.010040  0.007316 -0.010241
3 -0.012263 -0.002740 -0.000849 -0.003037 -0.005497 -0.002302 -0.011215 -0.009798  0.013508 -0.002283
4  0.021322  0.015926 -0.000250  0.011049 -0.002766  0.007039 -0.000213 -0.006395  0.003781  0.017381
5 -0.021207  0.006429  0.004595  0.001258 -0.001805  0.006114 -0.004712 -0.002169  0.014164 -0.003220

Image Features

import skimage
from skimage import io
from skimage import color
#opencv tensorflow

Image shape

cat = io.imread('./images/cat.png')
dog = io.imread('./images/dog.png')
df = pd.DataFrame(['Cat', 'Dog'], columns=['Image'])


print(cat.shape, dog.shape)
(168, 300, 3) (168, 300, 3)
# pixel values range from 0 to 255: smaller values are darker, larger values are brighter
cat
array([[[114, 105,  90],
        [113, 104,  89],
        [112, 103,  88],
        ...,
        [127, 130, 121],
        [130, 133, 124],
        [133, 136, 127]],

       [[113, 104,  89],
        [112, 103,  88],
        [111, 102,  87],
        ...,
        [129, 132, 125],
        [132, 135, 128],
        [135, 138, 131]],

       [[111, 102,  87],
        [111, 102,  87],
        [110, 101,  86],
        ...,
        [132, 134, 133],
        [136, 138, 137],
        [139, 141, 140]],

       ...,

       [[ 32,  26,  28],
        [ 32,  26,  28],
        [ 30,  24,  26],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]],

       [[ 33,  27,  29],
        [ 32,  26,  28],
        [ 31,  25,  27],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]],

       [[ 33,  27,  29],
        [ 32,  26,  28],
        [ 31,  25,  27],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]]], dtype=uint8)
#coffee = skimage.transform.resize(coffee, (300, 451), mode='reflect')
fig = plt.figure(figsize = (8, 2.5))
ax1 = fig.add_subplot(1,2, 1)
ax1.imshow(cat)
ax2 = fig.add_subplot(1,2, 2)
ax2.imshow(dog)
<matplotlib.image.AxesImage at 0x7fa047fc3278>

(figure: output_104_1.png, the cat and dog images)

dog_r = dog.copy() # Red Channel
dog_r[:,:,1] = dog_r[:,:,2] = 0 # set G,B pixels = 0
dog_g = dog.copy() # Green Channel
dog_g[:,:,0] = dog_g[:,:,2] = 0 # set R,B pixels = 0
dog_b = dog.copy() # Blue Channel
dog_b[:,:,0] = dog_b[:,:,1] = 0 # set R,G pixels = 0

plot_image = np.concatenate((dog_r, dog_g, dog_b), axis=1)
plt.figure(figsize = (12,2.5))
plt.imshow(plot_image)
<matplotlib.image.AxesImage at 0x7fa047ef4208>

(figure: output_105_1.png, the red, green, and blue channels of the dog image)

dog_r[1,1]
array([160,   0,   0], dtype=uint8)

Grayscale images

fig = plt.figure(figsize = (8,4))
ax1 = fig.add_subplot(2,2, 1)
ax1.imshow(color.rgb2gray(cat), cmap="gray" )
ax2 = fig.add_subplot(2,2, 2)
ax2.imshow(color.rgb2gray(dog), cmap='gray')
<matplotlib.image.AxesImage at 0x7fa0469ff2b0>

(figure: output_108_1.png, grayscale versions of the cat and dog images)

References:

  1. 梁劲, machine learning study notes
  2. 唐宇坤, machine learning course