Class 3: Web scraping
The goal is to extract data from websites
Scraped data is noisy, weakly labeled, and can be spammy
Available at scale
Many ML datasets are obtained by web scraping
Web crawling vs. scraping
Crawling: indexing whole pages across the internet
Scraping: extracting particular data from the web pages of a website
Web scraping tools
“curl” often doesn’t work
Website owners use various ways to stop bots
Use a headless browser: a web browser without a GUI
You need a lot of fresh IPs; these are easy to get through public clouds
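A minimal headless-browser fetch, sketched with Selenium (assuming Chrome and its driver are installed; the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Run Chrome without a GUI (headless mode)
    options = Options()
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")   # placeholder target page
    html = driver.page_source           # fully rendered HTML, after JavaScript runs
    driver.quit()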
Legal considerations
Web scraping isn’t illegal by itself
But you should
NOT scrape data that contains sensitive information (e.g., private data involving usernames/passwords, personal health/medical records)
NOT scrape copyrighted data (e.g., YouTube videos)
Follow the terms of service when they explicitly prohibit web scraping
Consult a lawyer if you are doing it for profit
Summary
Web scraping is a powerful way to collect data at scale when the website doesn’t offer a data API.
Low cost if using public clouds
Use the browser’s inspection tool to locate the information in the HTML (see the sketch after this summary)
Be cautious and use it properly
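For instance, once the inspector shows which tags and classes hold the data, a parser such as BeautifulSoup can extract it (a sketch; the HTML and class names are made up):

    from bs4 import BeautifulSoup

    # Stand-in for real page source, e.g., the html from the headless browser
    html = """
    <div class="listing-card">
      <span class="price">$2,000</span>
      <a class="address">123 Main St</a>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    for card in soup.select("div.listing-card"):   # selector found via the inspector
        price = card.select_one("span.price").get_text(strip=True)
        address = card.select_one("a.address").get_text(strip=True)
        print(price, address)                      # -> $2,000 123 Main St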
Class 4: Data labeling
Decision flow: if we have enough data, improve the labels, the data, or the model. If we don’t have enough labels, is there budget to label more (e.g., through crowdsourcing)? If not, use weak labels.
Semi-supervised learning (SSL)
Focuses on the scenario where there is a small amount of labeled data along with a large amount of unlabeled data; the goal is to leverage both together.
Makes assumptions about the data distribution in order to use the unlabeled data:
Continuity assumption: examples with similar features are more likely to have the same label.
Cluster assumption: the data have an inherent cluster structure; examples in the same cluster tend to have the same label.
Manifold assumption: the data lie on a manifold of much lower dimension than the input space (the intrinsic complexity of the data is far lower than it appears).
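A minimal sketch of these assumptions in action, using scikit-learn’s LabelSpreading, which propagates the few known labels to nearby points (the toy blob data is illustrative):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    # Two clusters, but only 5 points keep their true labels
    X, y = make_blobs(n_samples=200, centers=2, random_state=0)
    y_partial = np.full_like(y, -1)   # -1 marks "unlabeled" for sklearn
    y_partial[:5] = y[:5]

    # Labels spread along a graph of nearby points (continuity/cluster assumptions)
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X, y_partial)
    print((model.transduction_ == y).mean())   # accuracy over all points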
Self-training
Self-training is an SSL method:
- Train a model on the small labeled dataset.
- Use the model to predict labels for the unlabeled data.
- Keep only the highly confident predictions as pseudo-labeled data.
- Merge the pseudo-labeled data with the original labeled data and retrain.
We can use expensive models, e.g., deep neural networks or model ensembles/bagging.
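A minimal self-training loop, sketched with scikit-learn (the logistic regression base model and the 0.9 confidence threshold are arbitrary choices, not from the lecture):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)
    labeled = np.zeros(len(y), dtype=bool)
    labeled[:20] = True                       # start from 20 labeled examples
    y_train = y.copy()

    for _ in range(5):                        # a few self-training rounds
        model = LogisticRegression().fit(X[labeled], y_train[labeled])
        if labeled.all():                     # nothing left to pseudo-label
            break
        proba = model.predict_proba(X[~labeled])
        confident = proba.max(axis=1) > 0.9   # keep only confident predictions
        idx = np.where(~labeled)[0][confident]
        y_train[idx] = model.classes_[proba[confident].argmax(axis=1)]  # pseudo-labels
        labeled[idx] = True                   # merge into the labeled set

scikit-learn also ships a ready-made wrapper for this loop, sklearn.semi_supervised.SelfTrainingClassifier.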
Label through crowdsourcing
ImageNet labeled millions of images through Amazon Mechanical Turk; it took several years and millions of dollars to build.
Challenges
Simplify user interaction: design easy tasks, clear instructions, and a simple-to-use interface
Need to find qualified workers for complex jobs
Quality control: label quality varies across labelers
Reduce #tasks: Active Learning
Focuses on the same scenario as SSL, but with humans in the loop
Select the most “interesting” unlabeled examples and send them to labelers
Uncertainty sampling chooses the example whose prediction is most uncertain
e.g., the highest class prediction score is close to random (1/n for n classes)
Similar to self-training, we can use expensive models
Query-by-committee trains multiple models and performs majority voting; examples the committee disagrees on are sent for labeling (see the sketch below)
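A sketch of uncertainty sampling (assuming a scikit-learn-style classifier with predict_proba; the batch size is arbitrary):

    import numpy as np

    def most_uncertain(model, X_pool, batch_size=10):
        """Pick the pool examples whose top class probability is lowest."""
        proba = model.predict_proba(X_pool)
        top = proba.max(axis=1)               # confidence in the best class
        return np.argsort(top)[:batch_size]   # least confident first

    # idx = most_uncertain(model, X_pool); send X_pool[idx] to human labelers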
Active Learning + self-training
These two methods are often used together: confident predictions become pseudo-labels, while the most uncertain examples go to human labelers
Quality control
Labelers make mistakes (honest or not) and may fail to understand the instructions
Simplest but most expensive: send the same task to multiple labelers, then determine the label by majority voting (see the sketch below)
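A minimal majority-voting sketch (the task IDs and annotation format are made up):

    from collections import Counter

    # Each task was sent to three labelers
    annotations = {
        "img_001": ["cat", "cat", "dog"],
        "img_002": ["dog", "dog", "dog"],
    }

    labels = {task: Counter(votes).most_common(1)[0][0]
              for task, votes in annotations.items()}
    print(labels)   # {'img_001': 'cat', 'img_002': 'dog'}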
Weak supervision
Semi-automatically generate labels
Less accurate than manual ones, but good enough for training
Data programming: heuristic programs to assign labels
Keyword search, pattern matching, third-party models (see the sketch below)
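A data programming sketch in the style of labeling functions; the spam task and heuristics are illustrative, not from the lecture:

    import re

    ABSTAIN, HAM, SPAM = -1, 0, 1

    # Heuristic labeling functions: each votes for a label or abstains
    def lf_keyword(text):
        return SPAM if "free money" in text.lower() else ABSTAIN

    def lf_link(text):
        return SPAM if re.search(r"http\S+", text) else ABSTAIN

    def lf_long(text):
        return HAM if len(text.split()) > 20 else ABSTAIN

    def weak_label(text):
        votes = [lf(text) for lf in (lf_keyword, lf_link, lf_long)]
        votes = [v for v in votes if v != ABSTAIN]
        return max(set(votes), key=votes.count) if votes else ABSTAIN

    print(weak_label("Click http://x.co for free money"))   # -> 1 (SPAM)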
Summary
Ways to get labels
Self-training: iteratively train models to label unlabeled data
Crowdsourcing: leverage global labelers to manually label data
Data-programming: heuristic programs to assign noisy labels