Week2-7Preprocessing

zypandora

Category: NLP(Michigan)

Article link: https://blog.csdn.net/zypandora/article/details/49702035

Preprocessing converts raw text into a format that is easier to process.

Text preprocessing

Types and tokens

A type is any sequence of characters that represents a specific word; a token is any occurrence of a type. Each type is therefore counted only once per document, no matter how often it occurs, whereas every occurrence counts as a separate token.

To be or not to be

4 types (to, be, or, not, ignoring case), 6 tokens
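
The count can be reproduced with a minimal sketch. It assumes tokens are separated by whitespace and that case is ignored, so "To" and "to" are the same type; the function name `count_types_and_tokens` is only illustrative.

```python
def count_types_and_tokens(text):
    """Return (number of types, number of tokens) for a piece of text."""
    # Assumption: tokens are whitespace-separated and case is ignored,
    # so "To" and "to" count as the same type.
    tokens = text.lower().split()
    types = set(tokens)  # each distinct word form is kept once
    return len(types), len(tokens)


n_types, n_tokens = count_types_and_tokens("To be or not to be")
print(n_types, n_tokens)  # 4 6
```

Without the `lower()` call, "To" and "to" would be distinct types and the same sentence would yield 5 types.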

Tokenization

Word segmentation

Arabic
Japanese
German
Chinese
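
The notes do not name a segmentation algorithm. As one possible illustration, greedy maximum matching is a simple baseline for text written without spaces, such as Chinese: at each position it takes the longest word found in a vocabulary. The sketch below assumes a tiny hand-made vocabulary (`toy_vocabulary`) and a hypothetical function name `max_match`; it is not a production segmenter.

```python
def max_match(text, vocabulary):
    """Greedy left-to-right longest-match segmentation (a simple baseline)."""
    longest = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no vocabulary word starts at position i.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Toy vocabulary, assumed for illustration only.
toy_vocabulary = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
print(max_match("我喜欢自然语言处理", toy_vocabulary))
```

Because the match is greedy, the result depends entirely on the vocabulary: here "自然语言" is preferred over "自然" followed by "语言" simply because it is longer.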

Sentence boundary recognition

Decision tree
Features
- punctuation
- formatting
- fonts
- spacing
- capitalization
- case
- use of abbreviations (e.g. Dr.)
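
A few of the features above can be sketched in code. This example is only an illustration under stated assumptions: the abbreviation list is made up, tokens are split on whitespace, and the tree in `is_boundary` is written by hand, whereas a real decision tree would learn its splits from labeled data.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}


def boundary_features(tokens, i):
    """Features describing the candidate boundary after tokens[i]."""
    token = tokens[i]
    next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "ends_with_punct": token[-1] in ".!?",              # punctuation
        "is_abbreviation": token.lower() in ABBREVIATIONS,  # abbreviations
        "next_capitalized": next_token[:1].isupper(),       # capitalization
        "is_last_token": i == len(tokens) - 1,
    }


def is_boundary(features):
    """Hand-written decision tree standing in for a learned one."""
    if not features["ends_with_punct"]:
        return False
    if features["is_last_token"]:
        return True
    if features["is_abbreviation"]:
        return False
    return features["next_capitalized"]


def split_sentences(text):
    """Split whitespace-tokenized text at the predicted boundaries."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if is_boundary(boundary_features(tokens, i)):
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing tokens that had no closing boundary
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith arrived at 3 p.m. yesterday. He left early."))
```

The period after "Dr." is rejected by the abbreviation test and the one after "p.m." by the capitalization test, so only the periods after "yesterday" and "early" are treated as sentence boundaries.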
