python_Scraping blog content

杨MAX洁

Category: Python

Article link: https://blog.csdn.net/Hpu_A/article/details/51518990

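# -*- coding: utf-8 -*-
# Ported to Python 3: urllib.urlopen and the print statement exist only in Python 2.
__author__ = 'YangShengjie'
import urllib.request

url = "http://blog.sina.com.cn/s/blog_4701280b0100h3c8.html"  # article to download: 前扑后继
# This article was chosen because its page structure is unusual: a video is embedded in the middle.
# Read the page. Decoding as UTF-8 is an assumption about the page's encoding.
conn = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
# print(conn)
s1 = conn.find(r'<div id="sina_keyword_ad_area2')  # start of the block that holds the article body
t1 = conn.find(r'</div>', s1)  # first closing tag after s1 (search relative to s1)
conn1 = conn[s1:t1]
# print(conn1)
s = conn1.find(r'<p STYLE')  # narrow the region again, inside the block extracted above
t = conn1.find(r'</P>', s)
conn2 = conn1[s + 29:t]  # first paragraph; its structure is special, so it is read outside the loop
# print(conn2)
content = " "
while s != -1 and t != -1:  # stop once no further paragraph is found
    content = content + '\n' + conn2  # join the paragraphs, one per line
    s = conn1.find(r'<p STYLE', t)
    t = conn1.find(r'</P>', s)
    conn2 = conn1[s + 45:t]  # offset 45 here, unlike the 29 used for the first paragraph
print(content)  # print the article text
# Save the article
filename = url[26:]  # drop "http://blog.sina.com.cn/s/", leaving blog_4701280b0100h3c8.html
with open(filename, 'w', encoding='utf-8') as f:
    f.write(content)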
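
The script above depends on two fixed offsets (29 and 45) that match this particular page only. As a rough sketch of the same find-and-slice idea without hard-coded offsets, the function below looks for the `>` that ends each opening tag instead. The name `extract_paragraphs` and the sample HTML are made up for illustration. They are not part of the original script, and the sample is not the real Sina page.

```python
def extract_paragraphs(html, open_tag='<p STYLE', close_tag='</P>'):
    """Collect the text between each opening tag and its closing tag."""
    paragraphs = []
    start = html.find(open_tag)
    while start != -1:
        tag_end = html.find('>', start)  # end of the opening tag, whatever its length
        if tag_end == -1:
            break
        end = html.find(close_tag, tag_end)
        if end == -1:
            break
        paragraphs.append(html[tag_end + 1:end])
        start = html.find(open_tag, end)  # continue after this paragraph
    return paragraphs


# Hypothetical sample with opening tags of different lengths
sample = ('<div id="sina_keyword_ad_area2">'
          '<p STYLE="a">first</P>'
          '<p STYLE="margin: 0; color: red">second</P>'
          '</div>')

print('\n'.join(extract_paragraphs(sample)))
```

This still treats HTML as plain text, so it assumes that no `>` appears inside an attribute value and that the tags are written exactly as `<p STYLE` and `</P>`. For pages that do not follow one fixed pattern, an HTML parser is probably the safer choice.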
