Abstract: Use a Python crawler to scrape event data from Douban Events (豆瓣同城) for Beijing, including the city name, topic, time, cost, organizer, and more, and save the results as a CSV file.
The code is as follows:
import csv

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'}

# 'a' appends on every run, so re-running adds another header row; use 'w' to overwrite instead
f = open('activity.csv', 'a', newline='', encoding='utf-8')
writer = csv.writer(f)
# Columns: city, topic, time, venue, cost, event type, organizer, participants, interested
writer.writerow(['城市名称', '主题', '时间', '地点', '费用', '事件标签', '发起人', '参与人数', '感兴趣人数'])

for i in range(192):  # each listing page shows 10 events; adjust the range to the current page count
    url = f'https://beijing.douban.com/events/future-all?start={i*10}'
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    datas = soup.find('div', class_='article').find_all('div', class_='info')
    for data in datas:
        # Follow the link to each event's detail page and parse the fields there
        real_url = data.find('a')['href']
        real_res = requests.get(real_url, headers=headers)
        real_soup = BeautifulSoup(real_res.text, 'html.parser')
        city = real_soup.find('div', class_='local-label').find('a').text
        events = real_soup.find('div', class_='event-info')
        topic = events.find('h1').text.strip()
        event_label = events.find('a', itemprop='eventType').text
        cost = events.find('span', itemprop='ticketAggregate').text.split('\n')[1].strip()
        address = events.find('span', itemprop='street-address').text
        sponsor = events.find('a', itemprop='name').text.strip()
        join_number = events.find_all('span', class_='num')[1].text
        interest_number = events.find_all('span', class_='num')[0].text
        # Single-date events use a calendar-str-item element; multi-date events use a calendar-strs list
        try:
            activity_time = events.find('li', class_='calendar-str-item').text
        except AttributeError:
            activity_time = events.find('ul', class_='calendar-strs').text.split('\n')[1]
        writer.writerow([city, topic, activity_time, address, cost, event_label, sponsor, join_number, interest_number])

f.close()
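The script issues one request per event on top of one per listing page, so it is worth being gentle with the server. As an optional addition (not part of the original code), the direct requests.get calls in the loop above could be routed through a small helper that adds a timeout, surfaces HTTP errors, and pauses between requests; the one-second delay and ten-second timeout below are assumed defaults, not values from the original:

import time

import requests
from bs4 import BeautifulSoup

def fetch(url, headers, delay=1.0, timeout=10):
    """GET a page politely: bounded wait, explicit error check, short pause."""
    res = requests.get(url, headers=headers, timeout=timeout)
    res.raise_for_status()  # stop instead of silently parsing an error page
    time.sleep(delay)       # space out requests to avoid being rate-limited
    return BeautifulSoup(res.text, 'html.parser')

With this in place, soup = fetch(url, headers) replaces each requests.get/BeautifulSoup pair in both loops.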
Friendly reminder: the actual number of pages changes over time, since new events are posted every day, so choose the loop count based on how many pages the listing shows when you run the script.
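If you would rather not hard-code the loop count, one option is to read the pagination links on the first listing page and take the largest page number. This is a minimal sketch, assuming the listing page carries a pagination block matching div.paginator whose numeric links end at the last page; verify the selector against the live markup before relying on it:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'}
res = requests.get('https://beijing.douban.com/events/future-all?start=0', headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
paginator = soup.find('div', class_='paginator')
# Collect the numeric page links and take the largest as the total page count
pages = [int(a.text.strip()) for a in paginator.find_all('a') if a.text.strip().isdigit()] if paginator else []
total_pages = max(pages, default=1)
for i in range(total_pages):
    ...  # same per-page scraping loop as above, with start={i*10}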