BeautifulSoup
I once took the time to read the official documentation from start to finish, only to forget nearly all of it within a few days. The lesson I eventually drew: official docs are meant to be consulted, not memorized. Look things up as problems come up and they gradually stick.
Scraping DXY (丁香园) forum replies
from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
r = requests.get(url, headers=headers)
html = r.text
soup = BeautifulSoup(html, "html.parser")
for data in soup.find_all("tbody"):
    try:
        # Poster's user ID
        userid = data.find("div", class_="auth").get_text(strip=True)
        print(userid)
        # Body of the reply
        content = data.find("td", class_="postbody").get_text(strip=True)
        print(content)
    except AttributeError:
        # Skip <tbody> blocks that are not reply posts
        pass
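To check the same parsing calls without a network request, here is a minimal offline sketch; the HTML snippet is invented to mimic the forum's markup (the real page structure may differ):

```python
from bs4 import BeautifulSoup

# Invented snippet imitating the forum markup, for offline testing only
html = """
<table><tbody>
  <tr><td><div class="auth"><a>drwho</a></div></td></tr>
  <tr><td class="postbody"> Get plenty of rest. </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody")
# get_text(strip=True) joins the tag's text with surrounding whitespace removed
userid = tbody.find("div", class_="auth").get_text(strip=True)
content = tbody.find("td", class_="postbody").get_text(strip=True)
print(userid)   # drwho
print(content)  # Get plenty of rest.
```

Working on a fixed snippet like this makes it easy to verify the selectors before pointing the script at the live site.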
Scraping DXY forum replies with XPath
I read through the XPath docs once as well, but after a long stretch of not using XPath I forgot most of it; only regular use makes it stick.
import requests
from lxml import etree

url = 'http://www.dxy.cn/bbs/thread/626626'
r = requests.get(url)
html = r.text
tree = etree.HTML(html)
users = tree.xpath('//div[@class="auth"]/a/text()')  # returns a list
content = tree.xpath('//td[@class="postbody"]')
results = {}
for i in range(len(users)):
    # string() flattens all the text inside the node into one string
    results[users[i]] = content[i].xpath('string()').strip()
for user in users:
    print(results[user])
    print("*" * 80)
print(results)
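The same XPath expressions can be exercised offline. This is a minimal sketch on an invented snippet that echoes the forum's structure (not the real page), showing how string() also picks up text nested inside child tags:

```python
from lxml import etree

# Invented snippet echoing the forum structure, for offline testing only
html = """
<table>
  <tr><td><div class="auth"><a>drwho</a></div></td></tr>
  <tr><td class="postbody">Get <b>plenty</b> of rest.</td></tr>
</table>
"""

tree = etree.HTML(html)
users = tree.xpath('//div[@class="auth"]/a/text()')
posts = tree.xpath('//td[@class="postbody"]')
# string() concatenates the node's text, including text inside <b>
results = {u: p.xpath('string()').strip() for u, p in zip(users, posts)}
print(results)  # {'drwho': 'Get plenty of rest.'}
```

Note that a plain text() step would have split the reply around the <b> tag, while string() returns it as one piece.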