Cornell Scraping Dataset
If you want to skip the HTML tag digging and get straight to scraping, here’s the gist. Note that the scraper tries to exactly match each item in your wanted list. Otherwise, read on for a short background on web scraping, when it’s useful to scrape websites, and some challenges you may experience while scraping.
from autoscraper import AutoScraper
# replace with desired url
url = 'https://www.yelp.com/biz/chun-yang-tea-flushing-new-york-flushing'
# make sure that autoscraper can exactly match the items in your wanted_list
wanted_list = ['A review'] # replace with item(s) of interest
# build the scraper
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
# get similar results, and check which rules to keep
groups = scraper.get_result_similar(url, grouped=True)
groups.keys()
groups['rule_io6e'] # replace with rule(s) of interest
# keep rules and save the model to disk
scraper.keep_rules(['rule_io6e']) # replace with rule(s) of interest
scraper.save('yelp-reviews') # replace with desired model name
#-------------------------------------------------------------------------
# using the model later (e.g. in a fresh session)
scraper = AutoScraper()
scraper.load('yelp-reviews')
new_url = "" # replace with desired url
scraper.get_result_similar(new_url)
Background
I’ve recently been doing some research into bubble tea trends in the United States. I wanted to look at changes in drink orders, when boutique and franchise bubble tea shops opened, and customer reviews of these establishments. Naturally, I turned to Yelp. But a few limitations very quickly set me back: I was limited to the first 1,000 businesses on the Yelp API, and I could only get three Yelp-selected reviews per business.
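That 1,000-business ceiling comes from pagination in Yelp’s Fusion search endpoint, where offset + limit may not exceed 1,000 and limit is capped at 50 per request. A minimal sketch of the paging arithmetic (the search_offsets helper is hypothetical; the 50-per-page and 1,000-total caps are Yelp’s documented limits, but verify against the current API docs):

```python
API_MAX_RESULTS = 1000  # Yelp Fusion rejects requests where offset + limit > 1000
PAGE_SIZE = 50          # maximum `limit` per search request

def search_offsets(total_available: int) -> list[int]:
    """Offsets you can actually request, however many businesses match."""
    reachable = min(total_available, API_MAX_RESULTS)
    return list(range(0, reachable, PAGE_SIZE))

# Even if 5,000 bubble tea shops match the query, only 20 pages
# (50 results each) are reachable -- the first 1,000 businesses.
print(len(search_offsets(5000)))  # → 20
```

Everything past that hard cap is simply unreachable through the API, which is what pushed me toward scraping in the first place.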
This makes sense from a business perspective — you wouldn’t want other businesses easily snooping on your successes and failures and iterating off that. But it also illustrates the broader frustrations of web scraping. On one hand, it’s a great way to obtain data for a side project on a topic