Life is short, I use Python.
Today we'll scrape post details from the Hupu BXJ (步行街, "Walking Street") forum.
The end result looks like this:
First, build the links for the first ten pages from the Hupu BXJ base URL:
for i in range(1, 11):
    link = "https://bbs.hupu.com/bxj-" + str(i)
Next, fetch the HTML of each page from its link:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
html = r.content.decode('utf-8')   # decode the raw response bytes as UTF-8
soup = BeautifulSoup(html, 'lxml')
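Putting the page loop and the fetch together, here is a minimal sketch of how the two snippets can be combined; the helper name get_page_soup is my own, not from the original post, and it assumes requests, BeautifulSoup, and the lxml parser are installed:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

def get_page_soup(link):
    # fetch one forum page and return its parsed HTML; fail loudly on HTTP errors
    r = requests.get(link, headers=headers)
    r.raise_for_status()
    return BeautifulSoup(r.content.decode('utf-8'), 'lxml')

for i in range(1, 11):
    soup = get_page_soup("https://bbs.hupu.com/bxj-" + str(i))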
Once we have the full page HTML, we can extract the information we need. First, grab the list that holds every post on the page:
post_list = soup.find('ul', class_="for-list")
# print(post_list)
post_all = post_list.find_all('li')   # each <li> is one post
From each post we can then pull the detailed fields: title, post link, author, author page, post date, and the reply/view count:
data_list = []
for post in post_all:   # iterate over the <li> elements, not the <ul> itself
    title = post.find('div', class_='titlelink box').text.strip()
    post_link = post.find('div', class_='titlelink box').a['href']
    post_link = "https://bbs.hupu.com" + post_link
    author = post.find('div', class_='author box').a.text.strip()
    author_page = post.find('div', class_='author box').a['href']
    start_date = post.find('div', class_='author box').contents[5].text.strip()
    reply_view = post.find('span', class_='ansour box').text.strip()
    data_list.append([title, post_link, author, author_page, start_date, reply_view])
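To turn the scraped fields into the kind of table shown in the result above, one option (an assumption on my part, not necessarily what the original code did) is to load data_list into a pandas DataFrame and write it to CSV:
import pandas as pd

# the column names below are my own labels for the fields collected above
columns = ['title', 'post_link', 'author', 'author_page', 'start_date', 'reply_view']
df = pd.DataFrame(data_list, columns=columns)
df.to_csv('hupu_bxj.csv', index=False, encoding='utf-8-sig')
print(df.head())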