[Downloading CSDN Blogs with Python] 3. V2: Polishing the Content Display Format and Image Downloads

1. Goal


Building on V1, save each extracted article as an HTML file, preserving the formatting of the article content (font colors and so on) as well as the title.


2. Extracting the article content


2.1 Analyzing the HTML of one article


Open http://blog.csdn.net/bagboy_taobao_com/article/details/5582868 in a browser, view its HTML, and save it as article.html (it must be saved as UTF-8, otherwise the text will be garbled). Double-clicking article.html displays it correctly, so it can be opened in a text editor for analysis.
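The UTF-8 requirement above matters again later when we write the article back out. A minimal Python 3 sketch (the article's own code is Python 2) of saving HTML with an explicit encoding; the file name and title string here are just placeholders:

```python
# Write an HTML file with an explicit UTF-8 encoding so that Chinese
# text is not garbled when the file is reopened.
title = "递归目录的所有文件"
with open("article.html", "w", encoding="utf-8") as f:
    f.write("<html><head><meta charset='utf-8'>")
    f.write("<title>{}</title></head>".format(title))
    f.write("<body><h1>{}</h1></body></html>".format(title))

# Read it back with the same encoding and the title survives intact.
with open("article.html", encoding="utf-8") as f:
    data = f.read()
print(title in data)   # True
```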


2.1.1 HTML of the article title and content


<div id="article_details" class="details">
    <div class="article_title">
		<span class="ico ico_type_Original"></span>
		<h3>
			<span class="link_title"><a href="/bagboy_taobao_com/article/details/5582868">
        Recursively list all files in a directory (the article title)
			</a></span>
		</h3>
	</div>
	......
    
	<div id="article_content" class="article_content">
		The article content, including all of its tags
	</div>
</div>


With BeautifulSoup, simply search for <div id="article_details" class="details"> and then <div id="article_content" class="article_content"> inside it.
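A minimal sketch of that lookup; the HTML snippet is a trimmed-down stand-in for a real CSDN article page:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the article page structure shown above.
html = """
<div id="article_details" class="details">
  <div class="article_title">
    <h3><span class="link_title"><a href="/demo/article/details/1">
      Demo title
    </a></span></h3>
  </div>
  <div id="article_content" class="article_content"><p>Body text</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# The title needs cleanup (strip whitespace); the content div is kept whole.
title = soup.find("div", class_="article_title").h3.span.text.strip()
content = soup.find("div", class_="article_content")
print(title)            # Demo title
print(content.p.text)   # Body text
```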


2.1.2 Handling images in the article content


Some articles include uploaded images. These should be downloaded locally, and the image links in the article content rewritten to point to the local copies.

The HTML inside such an article looks like this:

<img alt="" src="https://img-blog.csdn.net/20131026083038921?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvYmFnYm95X3Rhb2Jhb19jb20=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast">
<br>
</br>
</img>

<img alt="" src="https://img-blog.csdn.net/20131026083043656?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvYmFnYm95X3Rhb2Jhb19jb20=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast"><br>
</br></img>


After locating the <div id="article_content" class="article_content"> tag, find all img tags inside it and extract each image's URL. Since this div will also be saved locally, the URL in each img tag must be rewritten to the corresponding local path.
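That rewrite step can be sketched as follows; the image URLs here are placeholders shaped like CSDN links:

```python
from bs4 import BeautifulSoup

# Collect (remote URL, local name) pairs and point each img tag at its
# local file, as described above.
html = ('<div id="article_content" class="article_content">'
        '<img src="https://img-blog.csdn.net/a.png">'
        '<img src="https://img-blog.csdn.net/b.png"></div>')
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="article_content")
images = []
for i, img in enumerate(content.find_all("img"), start=1):
    local = "{}.jpg".format(i)
    images.append((img["src"], local))   # remember what to download
    img["src"] = local                   # rewrite the link in place
print(images[0][1])                       # 1.jpg
print('src="2.jpg"' in str(content))      # True
```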


2.1.3 How to output


From the HTML we can see that the title needs some extraction and cleanup, while preserving the content's formatting just means writing out the div the article lives in directly (the div as parsed by BeautifulSoup).

So the code is:

#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
# Fetch a blog article
# File: GetArticle.py
import urllib2
import HTMLParser
import httplib
from bs4 import BeautifulSoup

class CHYGetArticle:
	def Parser(self, htmlStr, article):
		soup2 = BeautifulSoup(htmlStr, "html.parser")
		divTitle = soup2.find("div", class_ = "article_title")
		article[0] = divTitle.h3.span.text
		article[0] = article[0].replace("\n\r", "")		# must reassign: replace() returns a new string
		article[0] = article[0].strip()					# must reassign: strip() returns a new string
		divComment = soup2.find("div", class_ = "article_content")
		article[1] = divComment							# save the content div directly
		# Collect the image list and replace each img tag's URL with a local path
		imgList = divComment.find_all("img")
		i = 1
		for imgItem in imgList:
			img = str(i) + ".jpg"
			article[2].append([imgItem["src"], img])
			imgItem["src"] = img
			i = i + 1
'''
# http://blog.csdn.net/bagboy_taobao_com/article/details/13090313
# Test code
if __name__ == '__main__':
	conn = httplib.HTTPConnection("blog.csdn.net")
	# Pretend to be IE; otherwise CSDN rejects requests from Python
	user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
	headersP = { 'User-Agent' : user_agent }
	conn.request(method = "GET", url = "/bagboy_taobao_com/article/details/13090313", headers = headersP)
	r1 = conn.getresponse()				# get the response
	htmlByte = r1.read()				# get the HTML bytes
	htmlStr = htmlByte.decode("utf8")	# decode as UTF-8, otherwise parsing fails
	my = CHYGetArticle()
	article = [None, None, []]
	my.Parser(htmlStr, article)
	f = open("data.html", "w")
	print >> f, '<html xmlns="http://www.w3.org/1999/xhtml">'
	print >> f, '<head><title>',
	print >> f, article[0].encode("utf8"),
	print >> f, '</title>'
	print >> f, '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
	print >> f, '</head>'
	print >> f, '<body>'
	print >> f, article[0].encode("utf8"), 			# trailing comma so print does not append a newline
	print >> f, article[1]
	print >> f, '</body>'
	print >> f, '</html>'

	# Save the images
	for img in article[2]:
		# download each image here
		pass
'''


2.2 Main program


The main program just adds image downloading on top of the previous version.


#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import os
import GetCategoryAndMonth
import GetArticleList
import GetArticle

import urllib2
import httplib

def GetTypeList(host, blogName, list, type):
	'''
	Fetch the category list
	'''
	conn = httplib.HTTPConnection(host)
	# Pretend to be IE; otherwise CSDN rejects requests from Python
	user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
	headersP = { 'User-Agent' : user_agent }
	conn.request(method = "GET", url = "/" + blogName, headers = headersP)
	r1 = conn.getresponse()				# get the response
	htmlByte = r1.read()				# get the HTML bytes
	htmlStr = htmlByte.decode("utf8")	# decode as UTF-8, otherwise parsing fails
	my = GetCategoryAndMonth.CHYGetCategoryAndMonth()
	my.Parser(htmlStr, type, list)

def GetTypeArticleList(host, articleListUrl, list):
	'''
	Fetch the article list of one category
	'''
	conn = httplib.HTTPConnection(host)
	# Pretend to be IE; otherwise CSDN rejects requests from Python
	user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
	headersP = { 'User-Agent' : user_agent }
	conn.request(method = "GET", url = articleListUrl, headers = headersP)
	r1 = conn.getresponse()				# get the response
	htmlByte = r1.read()				# get the HTML bytes
	htmlStr = htmlByte.decode("utf8")	# decode as UTF-8, otherwise parsing fails
	my = GetArticleList.CHYGetArticleList()
	my.Parser(htmlStr, list)

def GetArticleFun(host, articleUrl, article):
	'''
	Fetch the article content
	'''
	conn = httplib.HTTPConnection(host)
	# Pretend to be IE; otherwise CSDN rejects requests from Python
	user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
	headersP = { 'User-Agent' : user_agent }
	conn.request(method = "GET", url = articleUrl, headers = headersP)
	r1 = conn.getresponse()				# get the response
	htmlByte = r1.read()				# get the HTML bytes
	htmlStr = htmlByte.decode("utf8")	# decode as UTF-8, otherwise parsing fails
	my = GetArticle.CHYGetArticle()
	my.Parser(htmlStr, article)

def ValidFileName(fileName):
	# Strip characters that are not allowed in Windows file names
	validFileName = fileName.replace("/", "")
	validFileName = validFileName.replace("?", "")
	validFileName = validFileName.replace(":", "")
	validFileName = validFileName.replace('"', "")
	validFileName = validFileName.replace("'", "")
	return validFileName
	
def DownImg(imgUrl, name):
	# The image URLs point at img-blog.csdn.net, so connect to that host
	conn = httplib.HTTPConnection("img-blog.csdn.net")
	# Pretend to be IE; otherwise CSDN rejects requests from Python
	user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
	headersP = { 'User-Agent' : user_agent }
	conn.request(method = "GET", url = imgUrl.replace("https://img-blog.csdn.net", ""), headers = headersP)
	r1 = conn.getresponse()		# get the response
	data = r1.read()			# the raw image bytes
	f = open(name, "wb")
	f.write(data)
	f.close()
	
if __name__ == '__main__':
	# Create the root directory
	host = "blog.csdn.net"
	blogName = "bagboy_taobao_com"
	blogDir = "F:" + os.sep + blogName     # under F:\<blogName>
	os.mkdir(blogDir)
	# Fetch the category list
	listType = []
	GetTypeList(host, blogName, listType, 1)
	# Create a directory per category
	for listTypeItem in listType:
		typeDir = blogDir + os.sep + listTypeItem[1]
		os.mkdir(typeDir)
		listArticle = []
		GetTypeArticleList(host, listTypeItem[0], listArticle)
		for listArticleItem in listArticle:
			article = [None, None, []]	# title, content, image list
			GetArticleFun(host, listArticleItem, article)
			articleDir = typeDir + os.sep + listArticleItem.replace("/" + blogName + "/article/details/", "") + "_" + ValidFileName(article[0])
			print(articleDir)
			# Save under a directory named after the article title
			os.mkdir(articleDir)
			title = articleDir + os.sep + "article.html"
			# print(title)
			f = open(title, 'w')
			print >> f, '<html xmlns="http://www.w3.org/1999/xhtml">'
			print >> f, '<head><title>',
			print >> f, article[0].encode("utf8"),
			print >> f, '</title>'
			print >> f, '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
			print >> f, '</head>'
			print >> f, '<body>'
			print >> f, article[0].encode("utf8"),
			print >> f, article[1]
			print >> f, '</body>'
			print >> f, '</html>'
			f.close()

			# Download the images
			for imgItem in article[2]:
				name = articleDir + os.sep + imgItem[1]
				DownImg(imgItem[0], name)
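Two helpers in the listing above can be sketched more compactly in modern Python: ValidFileName as a single regex substitution, and the URL splitting inside DownImg via urllib.parse. These are hedged Python 3 sketches, not drop-in replacements for the Python 2 code; the example URL is a placeholder shaped like a CSDN image link.

```python
import re
from urllib.parse import urlparse

# ValidFileName in one pass: remove every character Windows forbids in
# file names (plus the apostrophe handled by the original).
def valid_file_name(file_name):
    return re.sub(r'[\\/:*?"\'<>|]', "", file_name)

print(valid_file_name('a/b:c?"d\'e'))   # abcde

# DownImg's URL handling: recover the host and request path from a full
# image URL instead of stripping a hard-coded prefix.
url = "https://img-blog.csdn.net/20131026083038921?watermark/2/text/x"
parts = urlparse(url)
path = parts.path + ("?" + parts.query if parts.query else "")
print(parts.netloc)   # img-blog.csdn.net
print(path)           # /20131026083038921?watermark/2/text/x
```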



Summary

1. V2 adds preservation of the article's formatting and downloads each article's images locally so they display correctly.

2. Preserving the formatting is simply a matter of writing out the div that holds the article content, thanks to BeautifulSoup's support for serializing tags.

3. For the image download, BeautifulSoup finds every img tag inside the content div, and each tag's src attribute is rewritten to a local link; here too BeautifulSoup makes modifying tag attributes very convenient.

4. Overall, BeautifulSoup makes reading and modifying HTML tags easy.

5. The downloaded articles can now be compiled into a CHM file, though there is still room for polish.

6. The downloaded articles still do not look as good as the originals on CSDN; some CSS or JavaScript is probably not being captured (I am not familiar with those).
