[Downloading CSDN Blogs with Python] 1. A Simple Implementation (Part 3)

Article link: https://blog.csdn.net/weixin_36166558/article/details/112941437

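3.4 Main program

Extracting the category list, extracting the article list of one category, and extracting the article content have all been implemented. All that remains is to put them together.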

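3.4.1 Extraction strategy

1. Extract the category list (or the archive list) and create one directory per entry, named after the category or the archive date.

2. Extract the articles in each category.

3. Give each article its own directory, with the article content stored in the article.txt file inside it.

3.4.2 Main program code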
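The directory layout described in the three steps above can be sketched as a small path-building helper. This is only an illustration: the names `safe_name` and `article_paths` are hypothetical and do not appear in the original program, and the set of stripped characters (those Windows forbids in file names) is an assumption that goes beyond the three characters the original handles.

```python
import os
import re


def safe_name(name):
    """Drop characters that Windows does not allow in file names (assumed set)."""
    return re.sub(r'[\\/:*?"<>|]', "", name).strip()


def article_paths(blog_dir, category, article_id, title):
    """Return (article_dir, article_file) for one article.

    Layout: <blog_dir>/<category>/<article_id>_<title>/article.txt
    """
    category_dir = os.path.join(blog_dir, safe_name(category))
    article_dir = os.path.join(category_dir, article_id + "_" + safe_name(title))
    return article_dir, os.path.join(article_dir, "article.txt")
```

Keeping the path logic in a pure function like this makes it possible to check the layout without touching the disk or the network.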

#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
import os
import GetCategoryAndMonth
import GetArticleList
import GetArticle
import urllib2
import httplib

def GetTypeList(host, blogName, list, type):
    '''
    Get the type list.
    '''
    conn = httplib.HTTPConnection(host)
    # Pretend to be IE; otherwise CSDN rejects requests from Python
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headersP = { 'User-Agent' : user_agent }
    conn.request(method = "GET", url = "/" + blogName, headers = headersP)
    r1 = conn.getresponse()  # get the response
    htmlByte = r1.read()  # get the HTML
    conn.close()
    htmlStr = htmlByte.decode("utf8")  # decode as UTF-8, otherwise parsing fails
    # The parser appends its results to the list passed in
    my = GetCategoryAndMonth.CHYGetCategoryAndMonth(type, list)
    my.feed(htmlStr)

def GetTypeArticleList(host, articleListUrl, list):
    '''
    Get the article list of one type.
    '''
    conn = httplib.HTTPConnection(host)
    # Pretend to be IE; otherwise CSDN rejects requests from Python
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headersP = { 'User-Agent' : user_agent }
    conn.request(method = "GET", url = articleListUrl, headers = headersP)
    r1 = conn.getresponse()  # get the response
    htmlByte = r1.read()  # get the HTML
    conn.close()
    htmlStr = htmlByte.decode("utf8")  # decode as UTF-8, otherwise parsing fails
    # The parser appends the article URLs to the list passed in
    my = GetArticleList.CHYGetArticleList(list)
    my.feed(htmlStr)

def GetArticleFun(host, articleUrl, article):
    '''
    Get the content of one article.
    '''
    conn = httplib.HTTPConnection(host)
    # Pretend to be IE; otherwise CSDN rejects requests from Python
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headersP = { 'User-Agent' : user_agent }
    conn.request(method = "GET", url = articleUrl, headers = headersP)
    r1 = conn.getresponse()  # get the response
    htmlByte = r1.read()  # get the HTML
    conn.close()
    htmlStr = htmlByte.decode("utf8")  # decode as UTF-8, otherwise parsing fails
    my = GetArticle.CHYGetArticle()
    my.feed(htmlStr)
    # Return the title and the body through the two-element list
    article[0] = my.title
    article[1] = my.comment

def ValidFileName(fileName):
    # Chain the replacements so that each one builds on the previous result
    validFileName = fileName.replace("/", "")
    validFileName = validFileName.replace("?", "")
    validFileName = validFileName.replace(":", "")
    return validFileName

if __name__ == '__main__':
    # Create the root directory for the blog
    host = "blog.csdn.net"
    blogName = "bagboy_taobao_com"
    blogDir = "F:" + os.sep + blogName  # under F:\
    os.mkdir(blogDir)
    # Get the category list
    listType = []
    GetTypeList(host, blogName, listType, 1)
    # print(listType)
    # Create one directory per type
    for listTypeItem in listType:
        typeDir = blogDir + os.sep + listTypeItem[1]
        os.mkdir(typeDir)
        listArticle = []
        GetTypeArticleList(host, listTypeItem[0], listArticle)
        for listArticleItem in listArticle:
            article = ["", ""]
            GetArticleFun(host, listArticleItem, article)
            articleDir = typeDir + os.sep + listArticleItem.replace("/" + blogName + "/article/details/", "") + "_" + ValidFileName(article[0])
            # print(articleDir)
            # Name the directory after the article ID and title
            os.mkdir(articleDir)
            title = articleDir + os.sep + "article.txt"
            # print(title)
            f = open(title, 'w')
            print >> f, article[0].encode("utf8")
            print >> f, article[1].encode("utf8")
            f.close()  # close the file so the content is flushed to disk
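GetTypeList, GetTypeArticleList, and GetArticleFun in the listing above all repeat the same request code. One possible way to factor it out is sketched below. The helper name `fetch_html` and its `conn_factory` parameter are hypothetical and not part of the original program; the import fallback is there only so that the sketch runs on both Python 2 and Python 3.

```python
try:
    from http.client import HTTPConnection  # Python 3
except ImportError:
    from httplib import HTTPConnection  # Python 2, as used in the article

# Same IE-like User-Agent string that the original code sends
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def fetch_html(host, path, conn_factory=HTTPConnection):
    """Hypothetical helper: GET `path` from `host` and return the HTML as text.

    `conn_factory` exists only so that a fake connection can be injected
    when trying the helper out without network access.
    """
    conn = conn_factory(host)
    try:
        conn.request("GET", path, headers={'User-Agent': USER_AGENT})
        response = conn.getresponse()
        return response.read().decode("utf8")
    finally:
        conn.close()  # always release the connection, even on errors
```

With such a helper, each of the three functions would reduce to fetching the page and feeding the result to its parser, for example `my.feed(fetch_html(host, articleUrl))`.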

4. Summary

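1. Extracting web page content with Python is clear and simple.

2. I am still not very familiar with Python, so the techniques used here were pieced together from various places, without much planning of classes, functions, or any object-oriented design.

3. Building this tool was a way to get familiar with Python.

4. The page fetching and HTML parsing here are probably fairly slow, especially the parsing: I did not use regular expressions, and HTMLParser processes HTML sequentially through callbacks, so "rolling back" is awkward, particularly when tags depend on what came before or after them.

5. The extracted content is currently plain text only, with no images or formatting, so the layout looks poor. Later versions could preserve the layout and the images, save the result as HTML, and support packing it into a CHM file.

6. Technically, a page-scraping program like this is quite simple. The key is analyzing the HTML structure of the pages you want to scrape.

7. Take a look at these two blogs: their tools are very powerful and can save to PDF, DOC, TXT, and other formats.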
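Point 4 above mentions that HTMLParser works sequentially through callbacks, which makes related tags awkward to handle. A minimal sketch of what that looks like is shown below. The class name `TitleLinkParser` and the `link_title` CSS class are made up for illustration and are not taken from the parsers in this series or from CSDN's actual markup.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the article


class TitleLinkParser(HTMLParser):
    """Collect the text of <a> tags nested inside <span class="link_title">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False  # currently inside the marked <span>?
        self.in_link = False   # currently inside an <a> within that <span>?
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "link_title") in attrs:
            self.in_title = True
        elif tag == "a" and self.in_title:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "span":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title
        if self.in_link:
            self.titles[-1] += data
```

Because the parser never looks back, every relationship between tags has to be remembered in flags such as `in_title`. A nested `<span>` inside the marked one would already reset the state too early, which is the kind of bookkeeping that makes "rolling back" awkward.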