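Task: scrape the data for every post on the Hupu Buxingjie (步行街) forum. The fields to collect are post title, post link, author, author link, creation time, reply count, view count, last-reply user, and last-reply time. The URL is: https://bbs.hupu.com/bxj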
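Before the scraper itself, it may help to pin down what one record looks like. The following is only a minimal sketch of the nine fields listed above: the class name `Post`, the field names, the table name `hupu_posts`, and the helper `build_insert` are all hypothetical names chosen for illustration, not taken from the author's code. It builds a parameterized `INSERT` statement using the `%s` placeholder style that pymysql accepts, without opening a database connection.

```python
from dataclasses import dataclass, astuple, fields


@dataclass
class Post:
    # Hypothetical field names, one per item in the task description.
    title: str
    link: str
    author: str
    author_link: str
    created: str
    replies: int
    views: int
    last_reply_user: str
    last_reply_time: str


def build_insert(table: str = "hupu_posts") -> str:
    """Build a parameterized INSERT for the Post fields.

    The table name is an assumption; it is interpolated directly,
    so it must be a trusted constant, never user input.
    """
    cols = [f.name for f in fields(Post)]
    placeholders = ", ".join(["%s"] * len(cols))
    return f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"


# Example record with made-up values, shown only to illustrate the shape.
example = Post("title", "link", "author", "author_link",
               "2020-04-01", 3, 100, "user", "12:00")
sql = build_insert()
params = astuple(example)  # values in the same order as the columns
```

With a pymysql cursor, the pair would be passed as `cursor.execute(sql, params)`, which lets the driver handle quoting of the values.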
Using MySQL as the data store, the complete code is as follows:
import requests
from bs4 import BeautifulSoup
import pymysql
import time
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
data_list = []


def get_info(url):
    # Fetch one listing page and parse it with the lxml parser
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    # Each selector returns a list with one element per post row
    names = soup.select('div > div.post-title > a')
    authors = soup.select('div > div.post-auth > a')
    times = soup.select('div > div.post-time')
    replys = soup.select('div > div.post-datum')
    # Walk the four lists in parallel, one tuple per post
    for name, author, posttime, reply in zip(names, authors, times, replys):
        data = {
'nameik':'https://bbs.h