Scraping reply information from the DXY forum (丁香园) with the BeautifulSoup and lxml modules.
BeautifulSoup implementation:
from bs4 import BeautifulSoup
import requests

# Target URL
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
# Proxy IP (note: requests normally expects the form 'scheme://host:port' here)
proxies = {
    'https': '125.126.221.22',
}
response = requests.get(url, headers=headers, proxies=proxies)
soup = BeautifulSoup(response.text, 'lxml')
info = soup.find_all('tbody')
all_info = {}
for data in info:
    try:
        other_info = []
        auth = data.find('div', class_='auth').get_text(strip=True)
        content = data.find('td', class_='postbody').get_text(strip=True)
        date = data.find('div', class_='post-info').get_text(strip=True)
        other_info.append(date[:16])  # keep only the 'YYYY-MM-DD HH:MM' prefix
        other_info.append(content)
        all_info[auth] = other_info
    except AttributeError:
        # Skip <tbody> blocks that do not contain a reply (find() returned None)
        pass
print(all_info)
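The extraction step above can be tried on a small, self-contained fragment. The markup below is invented for illustration (only the class names `auth`, `post-info`, and `postbody` come from the code above), but it shows how `find` plus `get_text(strip=True)` pulls the author, date prefix, and body out of one `tbody` block:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the same class names as the thread pages;
# 'html.parser' is used here so this snippet needs no lxml install.
snippet = """
<tbody>
  <div class="auth"><a>alice</a></div>
  <div class="post-info">2018-01-02 10:30 from mobile</div>
  <td class="postbody">  First reply text  </td>
</tbody>
"""
soup = BeautifulSoup(snippet, 'html.parser')
block = soup.find('tbody')
auth = block.find('div', class_='auth').get_text(strip=True)
date = block.find('div', class_='post-info').get_text(strip=True)[:16]
body = block.find('td', class_='postbody').get_text(strip=True)
print(auth, date, body)
```

The `[:16]` slice works because `'YYYY-MM-DD HH:MM'` is exactly 16 characters, so everything after the timestamp (e.g. "from mobile") is dropped.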
lxml implementation:
import requests
from lxml import etree

# Target URL
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
# Proxy IP (note: requests normally expects the form 'scheme://host:port' here)
proxies = {
    'https': '125.126.221.22',
}
response = requests.get(url, headers=headers, proxies=proxies)
tree = etree.HTML(response.text)
# XPath expressions for the author names and the reply bodies
xpath_auth = "//div[@class='auth']/a/text()"
xpath_content = "//td[@class='postbody']"
re_auth = tree.xpath(xpath_auth)
re_content = tree.xpath(xpath_content)
for auth, content in zip(re_auth, re_content):
    # string(.) flattens all descendant text of the node into one string
    print('Author: ' + auth, ' Content: ' + content.xpath('string(.)').strip() + '\n')
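The reason the loop evaluates `string(.)` instead of selecting `text()` is that a reply body may contain nested tags, and `string(.)` concatenates every piece of descendant text into one string. A minimal sketch on an invented fragment (the `postbody` class name comes from the code above; the content is hypothetical):

```python
from lxml import etree

# Hypothetical reply body containing a nested <b> tag
snippet = "<table><tr><td class='postbody'>Dose is <b>0.5 mg</b> daily</td></tr></table>"
tree = etree.HTML(snippet)
node = tree.xpath("//td[@class='postbody']")[0]
# string(.) returns the full flattened text, including the <b> content
print(node.xpath('string(.)').strip())
```

A plain `//td[@class='postbody']/text()` on the same fragment would return only the text fragments that are direct children of the `<td>`, splitting the sentence around the `<b>` tag.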