Python 3 Crawler: Scraping Reply Data from the DXY (丁香园) Forum

Use the BeautifulSoup and lxml modules to scrape the reply information from a DXY forum thread. Both scripts depend on the requests, beautifulsoup4, and lxml packages (pip install requests beautifulsoup4 lxml).

BeautifulSoup implementation:

from bs4 import BeautifulSoup
import requests

# Target URL: a DXY forum thread
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
# Request headers: a browser User-Agent so the site serves the page normally
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
# Proxy IP; proxy URLs normally need a scheme and port ('http://host:port'),
# and free proxies like this one go stale quickly, so drop the proxies
# argument if you do not need it
proxies = {
    'https': '125.126.221.22',
}
response = requests.get(url, headers=headers, proxies=proxies)

soup = BeautifulSoup(response.text, 'lxml')
# Each reply on the page sits in its own <tbody>
info = soup.find_all('tbody')

all_info = {}
for data in info:
    try:
        other_info = []
        # Author name, reply body, and post info, with whitespace collapsed
        auth = data.find('div', class_='auth').get_text(strip=True)
        content = data.find('td', class_='postbody').get_text(strip=True)
        date = data.find('div', class_='post-info').get_text(strip=True)
        other_info.append(date[:16])  # keep only the 'YYYY-MM-DD hh:mm' prefix
        other_info.append(content)
        # Keyed by author, so a later reply from the same author overwrites the earlier one
        all_info[auth] = other_info
    except AttributeError:
        # <tbody> blocks that are not replies lack these tags; skip them
        continue
print(all_info)
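
Since all_info maps each author to a [date, content] pair, you can print the results in a more readable form than the raw dict; a minimal sketch:

for auth, (date, content) in all_info.items():
    print('Author:', auth)
    print('Date:', date)
    print('Content:', content)
    print()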

lxml implementation:

import requests
from lxml import etree

# Target URL: the same DXY thread
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
# Request headers: a browser User-Agent
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
# Proxy IP (same caveats as in the BeautifulSoup version)
proxies = {
    'https': '125.126.221.22',
}
response = requests.get(url, headers=headers, proxies=proxies)

# Parse the HTML into an element tree for XPath queries
tree = etree.HTML(response.text)

# Author names come back as plain strings; reply bodies as elements,
# because their text is spread across nested tags
xpath_auth = "//div[@class='auth']/a/text()"
xpath_content = "//td[@class='postbody']"

re_auth = tree.xpath(xpath_auth)
re_content = tree.xpath(xpath_content)

# string(.) concatenates all text nodes inside the element
for auth, content in zip(re_auth, re_content):
    print('Author: ' + auth, 'Content: ' + content.xpath('string(.)').strip() + '\n')
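
The lxml version skips the post date that the BeautifulSoup version extracted. Assuming the same markup (one div of class post-info per reply), a hedged sketch adding it:

xpath_date = "//div[@class='post-info']"
re_date = tree.xpath(xpath_date)

for auth, content, date in zip(re_auth, re_content, re_date):
    print('Author: ' + auth,
          'Date: ' + date.xpath('string(.)').strip()[:16],  # same 'YYYY-MM-DD hh:mm' prefix as before
          'Content: ' + content.xpath('string(.)').strip() + '\n')

Note that zip pairs the three lists positionally, so the output only lines up if every reply contributes exactly one match to each XPath.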
