Task description
Hands-on project: simulate logging in to DXY (丁香园), then scrape every poster's basic profile information and the content of their replies from a forum thread.
DXY forum thread: http://www.dxy.cn/bbs/thread/626626#626626
Without further ado, here is the code:
```python
import traceback

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    try:
        user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
        cookie = '*****'  # paste your own cookie here
        cookie = cookie.encode('utf-8')  # avoids encoding errors if the cookie contains Chinese characters
        headers = {'User-Agent': user_agent, 'Cookie': cookie}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        traceback.print_exc()
        return ''


def parsePage(text):
    htmlInfo = {}
    soup = BeautifulSoup(text, 'html.parser')
    # Each poster's name sits in a <div class="auth">
    auths = soup.find_all('div', attrs={'class': 'auth'})
    i = 0
    for auth in auths:
        htmlInfo[i] = {}
        htmlInfo[i]['name'] = auth.text
        i += 1
    # The user level is the last <div> (or, failing that, a <p>) inside <div class="info clearfix">
    i = 0
    levels = soup.find_all('div', attrs={'class': 'info clearfix'})
    for level in levels:
        level1 = level.find_all('div')
        if level1:
            htmlInfo[i]['level'] = level1[-1].text.strip()
        else:
            htmlInfo[i]['level'] = level.find('p').text.strip()
        i += 1
    # Score / votes / DingDang counts live in the <li> items of <div class="user_atten">
    i = 0
    user_attens = soup.find_all('div', attrs={'class': 'user_atten'})
    for user_atten in user_attens:
        for user_attr in user_atten.select('li'):
            user_attr_str = user_attr.text
            # the last two characters are the label (积分/得票/丁当), the rest is the value
            htmlInfo[i][user_attr_str[-2:]] = user_attr_str[:-2]
        i += 1
    # The post body is in <td class="postbody">
    tds = soup.find_all('td', attrs={'class': 'postbody'})
    i = 0
    for td in tds:
        content = ''
        for string in td.stripped_strings:
            content += string + ' '
        htmlInfo[i]['content'] = content
        i += 1
    return htmlInfo


def printHTMLInfo(htmlInfo):
    print('name\t\tlevel\t\t\tscore\tvote\tdingdang\tcontent')
    htmlInfo = list(htmlInfo.values())[:-1]  # the last entry is your own account info when logged in, drop it
    for value in htmlInfo:
        print(
            f"{value['name']:10}\t{value['level']:14}\t{value['积分']}\t{value['得票']}\t{value['丁当']}\t\t{value['content']}",
            end='\n\n')


def main():
    url = "http://www.dxy.cn/bbs/thread/626626"
    text = getHTMLText(url)
    htmlInfo = parsePage(text)
    printHTMLInfo(htmlInfo)


main()
```
Results:
Approach:
- Simulated login:
Before logging in, the page shows only four replies, while after logging in there are twenty-seven. Note that if you end up scraping twenty-eight entries, the last one is your own account information and needs to be dropped.
The login here is simulated by adding a cookie to the request headers; copy the Cookie value from the request headers shown in your browser's developer tools.
Paste it into the code and you're done. If the cookie contains Chinese characters it will cause an encoding problem, so remember to add cookie = cookie.encode('utf-8'). A small alternative sketch of this step appears right after this list.
- Extracting the remaining fields follows the same old routine as before; only each poster's level field is a bit fiddly to extract (a toy example of that extraction also follows below).
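As an aside, here is a minimal sketch of the cookie step under my own naming (raw_cookie and the example cookie keys are placeholders, not taken from the original code): instead of putting the raw string into the headers, you can split it into a dict and pass it through the cookies= parameter of requests, which sidesteps the encoding issue entirely.

```python
# A sketch, assuming raw_cookie is the Cookie value copied from the browser;
# splitting it into a dict and passing cookies= avoids embedding raw bytes in headers.
import requests

raw_cookie = '*****'  # your own cookie string, e.g. 'JSESSIONID=abc; DXY_USER=xyz'
cookies = dict(
    pair.split('=', 1) for pair in raw_cookie.split('; ') if '=' in pair
)
r = requests.get(
    'http://www.dxy.cn/bbs/thread/626626',
    headers={'User-Agent': 'Mozilla/5.0'},
    cookies=cookies,
)
print(r.status_code)
```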
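To show why the level extraction needs two branches, here is a toy example; the HTML snippet is hand-written by me to mimic the two layouts, not copied from the real page. Some posters carry their level in the last inner <div> of div.info.clearfix, while others only have a <p>.

```python
# Toy HTML, hand-written to mimic the two layouts handled by parsePage above.
from bs4 import BeautifulSoup

html = '''
<div class="info clearfix"><div>some other info</div><div>常驻站友</div></div>
<div class="info clearfix"><p>入门站友</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')
for info in soup.find_all('div', attrs={'class': 'info clearfix'}):
    divs = info.find_all('div')
    # take the last inner <div> if present, otherwise fall back to the <p>
    level = divs[-1].text.strip() if divs else info.find('p').text.strip()
    print(level)
# prints 常驻站友, then 入门站友
```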
A week of web-scraping study has flown by, and it turned out a bit easier than I expected. Keep it up in the days ahead!