1. Data crawling and cleaning
(1) Fetching titles and authors, and organizing the data
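import requests
from bs4 import BeautifulSoup

data_all = []
# Note: the URL does not depend on i, so every iteration fetches the same list page.
for i in range(0, 10):
    url = 'http://bbs.tianya.cn/list-no02-1.shtml'
    douban_data = requests.get(url)
    soup = BeautifulSoup(douban_data.text, 'lxml')
    # Each thread row holds a title link and an author link.
    titles = soup.select('tr.bg td.td-title a')
    authors = soup.select('tr.bg td a.author')
    for title, author in zip(titles, authors):
        # Keep only the first whitespace-separated token of the title text.
        data = {'title': title.get_text().strip().split()[0],
                'author': author.get_text().strip()}
        # print(data)
        data_all.append(data)
len(data_all)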
(2) Fetching view counts and reply counts (this should be done in a loop, because every list page has a different URL)
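Because each list page has its own URL, the page-by-page loop mentioned above could be sketched roughly as follows. This is a minimal illustration, not the original author's code: it assumes the next page is reachable through a link whose text is `下一页`, which is a guess about the page layout. The names `NextLinkParser`, `find_next_url`, and `crawl_pages` are hypothetical helpers introduced here. Only the standard library is used so the sketch runs without network access; the same lookup could equally be done with BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first <a> whose text equals a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.next_href = None
        self._current_href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current_href = dict(attrs).get('href')

    def handle_data(self, data):
        if (self.next_href is None and self._current_href
                and data.strip() == self.label):
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current_href = None


def find_next_url(html, base_url, label='下一页'):
    """Return the absolute URL of the next list page, or None if not found.

    The default label is an assumption about the link text on the site.
    """
    parser = NextLinkParser(label)
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(base_url, parser.next_href)


def crawl_pages(start_url, fetch, max_pages=10):
    """Visit up to max_pages list pages, following the next-page link each time.

    fetch is any callable that maps a URL to its HTML text.
    """
    pages = []
    url = start_url
    while url is not None and len(pages) < max_pages:
        html = fetch(url)
        pages.append((url, html))
        url = find_next_url(html, url)
    return pages
```

With `requests` installed, a call such as `crawl_pages('http://bbs.tianya.cn/list-no02-1.shtml', lambda u: requests.get(u).text)` would return a list of `(url, html)` pairs, and each `html` could then be passed to `BeautifulSoup` to extract the view and reply counts. Passing the fetch function as a parameter keeps the loop testable without hitting the site.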
import requests
from bs4 import BeautifulSoup
url = 'http://bbs.tianya.cn/list.jsp?item=no02&nextid=1556923587000'
douban_data = request