Scraping reply information from the DXY forum (丁香园) with the BeautifulSoup and lxml modules.
BeautifulSoup implementation:
from bs4 import BeautifulSoup
import requests

# Target URL
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
# Proxy IP (note: requests normally expects the form 'scheme://host:port' here)
proxies = {
    'https': '125.126.221.22',
}
response = requests.get(url, headers=headers, proxies=proxies)
soup = BeautifulSoup(response.text, 'lxml')
info = soup.find_all('tbody')
all_info = {}
for data in info:
    try:
        other_info = []
        auth = data.find('div', class_='auth').get_text(strip=True)
        content = data.find('td', class_='postbody').get_text(strip=True)
        date = data.find('div', class_='post-info').get_text(strip=True)
        other_info.append(date[:16])  # keep only the 'YYYY-MM-DD HH:MM' prefix
        other_info.append(content)
        all_info[auth] = other_info
    except AttributeError:
        # Skip <tbody> blocks that do not contain a reply (find() returned None)
        pass
print(all_info)
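The extraction step above can be tried on a small, self-contained fragment. The markup below is invented for illustration (only the class names `auth`, `post-info`, and `postbody` come from the code above), but it shows how `find` plus `get_text(strip=True)` pulls the author, date prefix, and body out of one `tbody` block:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the same class names as the thread pages;
# 'html.parser' is used here so this snippet needs no lxml install.
snippet = """
<tbody>
  <div class="auth"><a>alice</a></div>
  <div class="post-info">2018-01-02 10:30 from mobile</div>
  <td class="postbody">  First reply text  </td>
</tbody>
"""
soup = BeautifulSoup(snippet, 'html.parser')
block = soup.find('tbody')
auth = block.find('div', class_='auth').get_text(strip=True)
date = block.find('div', class_='post-info').get_text(strip=True)[:16]
body = block.find('td', class_='postbody').get_text(strip=True)
print(auth, date, body)
```

The `[:16]` slice works because `'YYYY-MM-DD HH:MM'` is exactly 16 characters, so everything after the timestamp (e.g. "from mobile") is dropped.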
lxml implementation:
import requests
from lxml import etree

# Target URL
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
# Proxy IP (note: requests normally expects the form 'scheme://host:port' here)
proxies = {
    'https': '125.126.221.22',
}
response = requests.get(url, headers=headers, proxies=proxies)
tree = etree.HTML(response.text)
# XPath expressions for the author names and the reply bodies
xpath_auth = "//div[@class='auth']/a/text()"
xpath_content = "//td[@class='postbody']"
re_auth = tree.xpath(xpath_auth)
re_content = tree.xpath(xpath_content)
for auth, content in zip(re_auth, re_content):
    # string(.) flattens all descendant text of the node into one string
    print('Author: ' + auth, ' Content: ' + content.xpath('string(.)').strip() + '\n')
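The reason the loop evaluates `string(.)` instead of selecting `text()` is that a reply body may contain nested tags, and `string(.)` concatenates every piece of descendant text into one string. A minimal sketch on an invented fragment (the `postbody` class name comes from the code above; the content is hypothetical):

```python
from lxml import etree

# Hypothetical reply body containing a nested <b> tag
snippet = "<table><tr><td class='postbody'>Dose is <b>0.5 mg</b> daily</td></tr></table>"
tree = etree.HTML(snippet)
node = tree.xpath("//td[@class='postbody']")[0]
# string(.) returns the full flattened text, including the <b> content
print(node.xpath('string(.)').strip())
```

A plain `//td[@class='postbody']/text()` on the same fragment would return only the text fragments that are direct children of the `<td>`, splitting the sentence around the `<b>` tag.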