Class 3: Web scraping
The goal is to extract data from websites
Scraped data is noisy, weakly labeled, and can be spammy
Available at scale
Many ML datasets are obtained by web scraping
Web crawling vs. scraping
Crawling: indexing whole pages across the internet
Scraping: extracting particular data from the web pages of a website
Web scraping tools
“curl” often doesn’t work
Website owners use various ways to stop bots
Use a headless browser: a web browser without a GUI
You need a lot of fresh IPs; these are easy to get through public clouds
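A minimal headless-browser fetch, sketched with Selenium (assuming Chrome and its driver are installed; the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Run Chrome without a GUI (headless mode)
    options = Options()
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")   # placeholder target page
    html = driver.page_source           # fully rendered HTML, after JavaScript runs
    driver.quit()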
Legal considerations
Web scraping isn’t illegal by itself
But you should
NOT scrape data that contains sensitive information (e.g., private data involving usernames/passwords, personal health/medical records)
NOT scrape copyrighted data (e.g., YouTube videos)
Follow the terms of service when they explicitly prohibit web scraping
Consult a lawyer if you are doing it for profit
Summary
Web scraping is a powerful way to collect data at scale when the website doesn’t offer a data API.
Low cost if using public clouds
Use the browser’s inspection tool to locate the information in the HTML (see the sketch after this summary)
Be cautious and use it properly
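For instance, once the inspector shows which tags and classes hold the data, a parser such as BeautifulSoup can extract it (a sketch; the HTML and class names are made up):

    from bs4 import BeautifulSoup

    # Stand-in for real page source, e.g., the html from the headless browser
    html = """
    <div class="listing-card">
      <span class="price">$2,000</span>
      <a class="address">123 Main St</a>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    for card in soup.select("div.listing-card"):   # selector found via the inspector
        price = card.select_one("span.price").get_text(strip=True)
        address = card.select_one("a.address").get_text(strip=True)
        print(price, address)                      # -> $2,000 123 Main St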
Class 4: Data labeling
Decision flow: if we have enough data, improve the labels, the data, or the model. If we don’t have enough labels, is there budget to label more (e.g., through crowdsourcing)? If not, use weak labels.
Semi-supervised learning (SSL)
Focuses on the scenario where there is a small amount of labeled data along with a large amount of unlabeled data; the goal is to leverage both together.
Makes assumptions about the data distribution in order to use the unlabeled data:
Continuity assumption: examples with similar features are more likely to have the same label.
Cluster assumption: the data have an inherent cluster structure; examples in the same cluster tend to have the same label.
Manifold assumption: the data lie on a manifold of much lower dimension than the input space (the intrinsic complexity of the data is far lower than it appears).
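A minimal sketch of these assumptions in action, using scikit-learn’s LabelSpreading, which propagates the few known labels to nearby points (the toy blob data is illustrative):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    # Two clusters, but only 5 points keep their true labels
    X, y = make_blobs(n_samples=200, centers=2, random_state=0)
    y_partial = np.full_like(y, -1)   # -1 marks "unlabeled" for sklearn
    y_partial[:5] = y[:5]

    # Labels spread along a graph of nearby points (continuity/cluster assumptions)
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X, y_partial)
    print((model.transduction_ == y).mean())   # accuracy over all points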
Self-training
Self-training is an SSL method:
- Train a model on the small labeled dataset.
- Use the model to predict labels for the unlabeled data.
- Keep only the highly confident predictions as pseudo-labeled data.
- Merge the pseudo-labeled data with the original labeled data and retrain.
We can use expensive models, e.g., deep neural networks or model ensembles/bagging.
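A minimal self-training loop, sketched with scikit-learn (the logistic regression base model and the 0.9 confidence threshold are arbitrary choices, not from the lecture):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)
    labeled = np.zeros(len(y), dtype=bool)
    labeled[:20] = True                       # start from 20 labeled examples
    y_train = y.copy()

    for _ in range(5):                        # a few self-training rounds
        model = LogisticRegression().fit(X[labeled], y_train[labeled])
        if labeled.all():                     # nothing left to pseudo-label
            break
        proba = model.predict_proba(X[~labeled])
        confident = proba.max(axis=1) > 0.9   # keep only confident predictions
        idx = np.where(~labeled)[0][confident]
        y_train[idx] = model.classes_[proba[confident].argmax(axis=1)]  # pseudo-labels
        labeled[idx] = True                   # merge into the labeled set

scikit-learn also ships a ready-made wrapper for this loop, sklearn.semi_supervised.SelfTrainingClassifier.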
Label through crowdsourcing
ImageNet labeled millions of images through Amazon Mechanical Turk; it took several years and millions of dollars to build.
Challenges
Simplify user interaction: design easy tasks, clear instructions, and a simple-to-use interface
Need to find qualified workers for complex jobs
Quality control: label quality varies across labelers
Reduce #tasks: Active Learning
Focuses on the same scenario as SSL, but with humans in the loop
Select the most “interesting” unlabeled examples and send them to labelers
Uncertainty sampling chooses the example whose prediction is most uncertain
e.g., the highest class prediction score is close to random (1/n for n classes)
Similar to self-training, we can use expensive models
Query-by-committee trains multiple models and performs majority voting; examples the committee disagrees on are sent for labeling (see the sketch below)
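A sketch of uncertainty sampling (assuming a scikit-learn-style classifier with predict_proba; the batch size is arbitrary):

    import numpy as np

    def most_uncertain(model, X_pool, batch_size=10):
        """Pick the pool examples whose top class probability is lowest."""
        proba = model.predict_proba(X_pool)
        top = proba.max(axis=1)               # confidence in the best class
        return np.argsort(top)[:batch_size]   # least confident first

    # idx = most_uncertain(model, X_pool); send X_pool[idx] to human labelers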
Active Learning + self-training
These two methods are often used together: confident predictions become pseudo-labels, while the most uncertain examples go to human labelers
Quality control
Labelers make mistakes (honest or not) and may fail to understand the instructions
Simplest but most expensive: send the same task to multiple labelers, then determine the label by majority voting (see the sketch below)
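A minimal majority-voting sketch (the task IDs and annotation format are made up):

    from collections import Counter

    # Each task was sent to three labelers
    annotations = {
        "img_001": ["cat", "cat", "dog"],
        "img_002": ["dog", "dog", "dog"],
    }

    labels = {task: Counter(votes).most_common(1)[0][0]
              for task, votes in annotations.items()}
    print(labels)   # {'img_001': 'cat', 'img_002': 'dog'}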
Weak supervision
Semi-automatically generate labels
Less accurate than manual ones, but good enough for training
Data programming: heuristic programs to assign labels
Keyword search, pattern matching, third-party models (see the sketch below)
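A data programming sketch in the style of labeling functions; the spam task and heuristics are illustrative, not from the lecture:

    import re

    ABSTAIN, HAM, SPAM = -1, 0, 1

    # Heuristic labeling functions: each votes for a label or abstains
    def lf_keyword(text):
        return SPAM if "free money" in text.lower() else ABSTAIN

    def lf_link(text):
        return SPAM if re.search(r"http\S+", text) else ABSTAIN

    def lf_long(text):
        return HAM if len(text.split()) > 20 else ABSTAIN

    def weak_label(text):
        votes = [lf(text) for lf in (lf_keyword, lf_link, lf_long)]
        votes = [v for v in votes if v != ABSTAIN]
        return max(set(votes), key=votes.count) if votes else ABSTAIN

    print(weak_label("Click http://x.co for free money"))   # -> 1 (SPAM)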
Summary
Ways to get labels
Self-training: iteratively train models to label unlabeled data
Crowdsourcing: leverage global labelers to manually label data
Data-programming: heuristic programs to assign noisy labels