Mu Li's Practical Machine Learning (class3, class4)

Class3: Web scraping

The goal is to extract data from websites

       Noisy, weak labels, can be spammy (a lot of noise, weak labels, possibly junk content)

       Available at scale (large volumes of data)

Many ML datasets are obtained by web scraping

Web crawling vs. scraping

       Crawling: indexing whole pages on the internet

       Scraping: extracting particular data from the web pages of a website

Web scraping tools

“curl” often doesn’t work

Website owners use various ways to stop bots

Use headless browser: a web browser without a GUI

You need a lot of new IPs, easy to get through public clouds

Legal considerations

Web scraping isn’t illegal by itself

But you should

       NOT scrape data that contains sensitive information (e.g., private data involving usernames/passwords, personal health/medical information)

       NOT scrape copyrighted data (e.g., YouTube videos)

       Follow Terms of Service that explicitly prohibit web scraping

Consult a lawyer if you are doing it for profit

Summary

Web scraping is a powerful way to collect data at scale when the website doesn’t offer a data API.

Low cost if using public clouds

Use the browser’s inspection tool to locate the information in the HTML

Be cautious and use it properly
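Once the inspection tool has shown which tag and class hold the information, pulling it out of the HTML can be sketched roughly as below. This is only an illustration using Python's standard `html.parser`. The page fragment, the `price` class name, and the `PriceParser` class are made up for the example and do not come from the lecture.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; in practice this would be the HTML returned
# by the site (or rendered by a headless browser) for the page being scraped.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="price">$350,000</span><span class="addr">12 Oak St</span></li>
  <li class="listing"><span class="price">$499,000</span><span class="addr">7 Elm Ave</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.prices)  # ['$350,000', '$499,000']
```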

Class4: Data labeling

Have enough data -> improve label, data, or model -> enough labels? -> enough budget? -> use weak labels?

Semi-supervised learning (SSL)

Focus on the scenario where there is a small amount of labeled data, along with a large amount of unlabeled data. (The question is how to use the small labeled portion together with the much larger unlabeled portion.)

Make assumptions about the data distribution in order to use unlabeled data.

       Continuity assumption: examples with similar features are more likely to have the same label.

       Cluster assumption: data have inherent cluster structure, examples in the same cluster tend to have the same label.

       Manifold assumption: the data lie on a manifold of much lower dimension than the input space. (The intrinsic complexity of the data is much lower than it appears.)
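As a toy illustration of the continuity assumption, an unlabeled point can borrow the label of its nearest labeled neighbor. The sketch below is not a method from the lecture. The 2-D points, the labels "A" and "B", and the helper `nearest_label` are all invented for the example.

```python
import math

# Toy 2-D data (hypothetical): two labeled points and three unlabeled ones.
labeled = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]
unlabeled = [(0.5, 0.2), (4.6, 5.3), (1.0, 0.8)]


def nearest_label(x, labeled_points):
    """Return the label of the labeled point closest to x (Euclidean distance)."""
    _, label = min(labeled_points, key=lambda item: math.dist(x, item[0]))
    return label


guessed = [nearest_label(x, labeled) for x in unlabeled]
print(guessed)  # ['A', 'B', 'A']
```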

Self-training

Self-training is an SSL method

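  1. Labeled data -> train -> model (train a model on the small set of labeled data)
  2. Unlabeled data -> predict -> model (use the trained model to predict on the unlabeled data)
  3. Model -> pseudo-labeled data (the model's predictions give pseudo-labels for the unlabeled data; only keep highly confident predictions)
  4. Pseudo-labeled data -> merge -> labeled data (merge the pseudo-labeled data with the original labeled data to obtain a new labeled set)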

We can use expensive models

       Deep neural networks, model ensembles/bagging.
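The self-training loop described above can be sketched in a few lines. To keep the example self-contained, a 1-D nearest-centroid classifier stands in for the expensive model. The function names (`fit_centroids`, `predict_proba`, `self_train`), the 0.9 confidence threshold, and the toy data are assumptions for illustration, not part of the lecture.

```python
import math


def fit_centroids(xs, ys):
    """Train step: compute one mean (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_proba(centroids, x):
    """Predict step: softmax over negative distances to each centroid."""
    scores = {y: math.exp(-abs(x - c)) for y, c in centroids.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


def self_train(xs, ys, unlabeled, threshold=0.9, rounds=5):
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    for _ in range(rounds):
        model = fit_centroids(xs, ys)            # 1. train on labeled data
        keep = []
        for x in pool:
            proba = predict_proba(model, x)      # 2. predict on unlabeled data
            label, conf = max(proba.items(), key=lambda kv: kv[1])
            if conf >= threshold:                # 3. keep only confident pseudo-labels
                xs.append(x)                     # 4. merge into the labeled set
                ys.append(label)
            else:
                keep.append(x)
        if len(keep) == len(pool):               # nothing new was labeled; stop
            break
        pool = keep
    return fit_centroids(xs, ys), pool


model, remaining = self_train([0.0, 10.0], ["neg", "pos"], [3.0, 4.2, 9.0])
print(model, remaining)
```

In this toy run the point 4.2 is not confident enough in the first round (about 0.83) and is only pseudo-labeled in the second round, after the "neg" centroid has moved toward it.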

Label through crowdsourcing

ImageNet labeled millions of images through Amazon Mechanical Turk. It took several years and millions of dollars to build.

Challenges

Simplify user interaction: design easy tasks, clear instructions, and a simple-to-use interface

       Need to find qualified workers for complex jobs.

Quality control: the quality of labels produced by different labelers varies.

Reduce #tasks: Active learning

Focus on the same scenario as SSL but with human intervention

       Select the most “interesting” unlabeled data to send to labelers

Uncertainty sampling chooses an example whose prediction is most uncertain

       The highest class prediction score is close to random (1/n for n classes)
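A minimal sketch of uncertainty sampling under this "highest score close to 1/n" criterion: pick the example whose top class score is lowest. The probability table and the function name `least_confident` are hypothetical.

```python
def least_confident(probas):
    """Return the index of the example whose top class score is lowest.

    probas: list of per-example class-probability lists (each sums to 1).
    With n classes, the lowest possible top score is 1/n (a uniform guess).
    """
    return min(range(len(probas)), key=lambda i: max(probas[i]))


# Hypothetical model outputs for four unlabeled examples, three classes.
probas = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
query = least_confident(probas)
print(query, max(probas[query]))  # 3 0.34
```

The last example is sent to the labelers because its top score, 0.34, is closest to 1/3.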

Similar to self-training, we can use expensive models

       Query-by-committee trains multiple models and performs majority voting
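One way to turn committee votes into a query is to ask labelers about the example on which the models agree least. The sketch below assumes the committee's predictions are already available. Measuring disagreement by the share of models that back the majority label is an assumption of this example, and the names are hypothetical.

```python
from collections import Counter


def committee_vote(votes):
    """Majority label and the fraction of committee members that agree with it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical predictions: one row per unlabeled example,
# one entry per committee model.
committee_predictions = [
    ["cat", "cat", "cat"],
    ["cat", "dog", "cat"],
    ["dog", "cat", "bird"],
]

results = [committee_vote(v) for v in committee_predictions]
# Query the example with the lowest agreement among the models.
query = min(range(len(results)), key=lambda i: results[i][1])
print(query)  # 2
```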

Active Learning + self-training

These two methods are often used together

 

Quality control

Labelers make mistakes (honest or not) and may fail to understand the instructions

Simplest but most expensive: send the same task to multiple labelers, then determine the label by majority voting.
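Majority voting over redundant labels can be sketched as follows. The task ids, the answers, and the choice to return `None` on a tie (so the task can be sent to more labelers) are assumptions made for this example.

```python
from collections import Counter


def aggregate(labels):
    """Return the majority label, or None when the top two labels tie."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear winner; the task could be sent to more labelers
    return ranked[0][0]


# Hypothetical answers: task id -> labels from three different labelers.
answers = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}
final = {task: aggregate(labels) for task, labels in answers.items()}
print(final)  # {'img_001': 'cat', 'img_002': 'dog', 'img_003': None}
```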

Weak supervision

Semi-automatically generate labels

       Less accurate than manual ones, but good enough for training

Data programming: heuristic programs to assign labels

       Keyword search, pattern matching, third-party models
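A rough sketch of data programming for a made-up spam/ham task: each heuristic is a small labeling function that either votes for a label or abstains, and the votes are combined. The three labeling functions, the label constants, and the simple majority combination are assumptions for illustration. No third-party model is included here.

```python
import re
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_keyword(text):
    """Keyword search: promotional words suggest spam."""
    return SPAM if any(w in text.lower() for w in ("free", "winner")) else ABSTAIN


def lf_url(text):
    """Pattern matching: a link in the message suggests spam."""
    return SPAM if re.search(r"https?://", text) else ABSTAIN


def lf_greeting(text):
    """Pattern matching: messages that open with a greeting are likely ham."""
    return HAM if re.match(r"(hi|hello|dear)\b", text.lower()) else ABSTAIN


LABELING_FUNCTIONS = [lf_keyword, lf_url, lf_greeting]


def weak_label(text):
    """Combine the non-abstaining votes by simple majority (ties -> ABSTAIN)."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


texts = [
    "You are a WINNER, claim your free prize at http://example.com",
    "Hi Sam, are we still meeting at noon?",
    "Meeting notes attached.",
]
weak_labels = [weak_label(t) for t in texts]
print(weak_labels)  # [1, 0, -1]
```

The resulting labels are noisy, and examples where every function abstains stay unlabeled.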

Summary

Ways to get labels

       Self-training: iteratively train models to label unlabeled data

       Crowdsourcing: leverage global labelers to manually label data

       Data programming: heuristic programs to assign noisy labels
