This article walks through the code that implements the Baidu Wenku article extractor design. If you are not yet familiar with the approach for scraping Baidu Wenku articles, see the previous post: 百度文库文章提取器(上) (Baidu Wenku Article Extractor, Part 1).
URL access
import json
import re

import requests

# version[0] = page layout (0: old, 1: new); version[1] = document type ("doc", "txt", ...)
version = [0, ""]

# Request the URL and return the page source
def fetch_url(url):
    global version
    try:
        content = requests.get(url).content.decode("gbk")
    except UnicodeDecodeError:  # a decoding error means this is a new-layout page, so decode as UTF-8
        content = requests.get(url).content.decode("utf-8")
        version[0] = 1
    return content
The version variable is a two-element list whose entries record the page layout (old or new) and the document type (doc, txt, ...). fetch_url requests the given URL and decodes the response as GBK; if that raises a UnicodeDecodeError, the page has to be decoded as UTF-8 instead (in testing, old-layout pages decode only as GBK and new-layout pages only as UTF-8), and version[0] is set to 1.
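A minimal sketch of calling it, reusing one of the sample links from the "main" function later in this article:

content = fetch_url("https://wenku.baidu.com/view/62906818227916888486d74f.html?rec_flag=default")
print(version[0])  # 1 if the page was decoded as UTF-8 (new layout), otherwise 0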
Determining the document type
# Determine the document type
def verjud(content):
    if version[0] == 0:
        # old layout: the type appears after docType in the page source
        version[1] = re.findall(r"docType.*?\:.*?\'(.*?)\'\,", content)[0]
    else:
        # new layout: the page source contains exactly one of these two markers
        if "indexXreader" in content:
            version[1] = "doc"
        elif "indexTxt" in content:
            version[1] = "txt"
For the old layout, the document type can be pulled out of the page source with a regular expression that looks behind docType. For the new layout, the bottom of the page source contains exactly one of indexXreader or indexTxt, indicating doc and txt respectively.
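To illustrate the old-layout branch, here is the docType regex run against a made-up fragment of page source (the snippet is an assumption about the format, not copied from a real page):

import re

snippet = "'docType': 'doc',"  # hypothetical fragment of old-layout page source
print(re.findall(r"docType.*?\:.*?\'(.*?)\'\,", snippet))  # ['doc']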
Scraping old-layout doc documents
# Old-layout doc document
def old_doc(content):
    url_list = re.findall(r'(https.*?0.json.*?)\\x22}', content)  # extract the links for every page
    url_list = [addr.replace("\\\\\\/", "/") for addr in url_list]  # strip the escape characters
    result = []
    for url in url_list[:len(url_list) // 2]:  # only the first half of the extracted links is needed
        content = fetch_url(url)
        temp = ""
        y = 0
        txtlists = re.findall('"c":"(.*?)".*?"y":(.*?),', content)
        for item in txtlists:
            if not y == item[1]:  # a new y coordinate means a new line on the page
                y = item[1]
                n = '\n'
            else:
                n = ''
            temp += n
            temp += item[0].encode('utf-8').decode('unicode_escape', 'ignore')
        result.append(temp)
    return result
A regular expression extracts the URL of every page of the document, and replace then strips the escape characters from each link; only after that are the URLs actually reachable. The code iterates over url_list, fetches each page, and extracts its text.
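In the raw page source each slash in these links is preceded by backslash escapes; the string below is a made-up illustration of the format, showing how the replace call makes the URL usable:

raw = "https:\\\\\\/\\\\\\/example.com\\\\\\/0.json"  # hypothetical escaped link (three backslashes per slash)
print(raw.replace("\\\\\\/", "/"))  # https://example.com/0.json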
Scraping new-layout doc documents
# New-layout doc document
def new_doc(content):
    url_list = re.findall(r'(https:\\\\/\\\\/wkbjcloudbos.*?0.json.*?)\\', content)  # extract the links for every page
    url_list = [addr.replace(r"\\", "") for addr in url_list]  # strip the escape characters
    result = []
    for url in url_list:  # unlike the old layout, there are no redundant links here
        content = fetch_url(url)
        temp = ""
        txtlists = re.findall('"c":"(.*?)".*?"y":(.*?),', content)
        y = 0
        for item in txtlists:
            if not y == item[1]:  # a new y coordinate means a new line on the page
                y = item[1]
                n = '\n'
            else:
                n = ''
            temp += n
            temp += item[0].encode('utf-8').decode('unicode_escape', 'ignore')
        result.append(temp)
    return result
Apart from the regular expressions, this is essentially the same as the old-layout doc scraper.
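The main difference is the escape format: in the new layout each slash is preceded by two backslashes, so removing every backslash pair is enough (again a made-up sample string):

raw = "https:\\\\/\\\\/wkbjcloudbos.bdimg.com\\\\/0.json"  # hypothetical escaped link (two backslashes per slash)
print(raw.replace(r"\\", ""))  # https://wkbjcloudbos.bdimg.com/0.json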
Scraping new-layout txt documents
# New-layout txt document
def new_txt(content):
    txtId = re.findall("show_doc_id\":\"(.*?)\"", content)[0]  # document id
    md5 = re.findall("md5sum=(.*?)&", content)[0]  # md5sum value
    sign = re.findall("sign=(.*?)\"", content)[0]  # sign value
    pageNum = re.findall("\"page\":\"(.*?)\"", content)[0]  # total page count
    resign = re.findall("\"rsign\":\"(.*?)\"", content)[0]  # rsign value
    url = "https://wkretype.bdimg.com/retype/text/" + txtId + "?md5sum=" + md5 + "&sign=" + sign + \
          "&callback=cb&pn=1&rn=" + pageNum + "&type=txt&rsign=" + resign  # assemble the document URL
    txtcontent = json.loads(fetch_url(url)[3:-1])  # strip the JSONP wrapper and parse the JSON
    result = []
    for item in txtcontent:
        temp = ""
        for i in item['parags']:
            temp += i['c'].replace('\\r', '\r').replace('\\n', '\n')  # turn literal escapes into real line breaks
        result.append(temp)
    return result
The page source yields txtId, md5, sign, pageNum, and resign; these are concatenated into a request URL, which is fetched, and the article content is extracted from the returned data.
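The callback=cb parameter in the URL makes the server wrap the JSON in a JSONP call, which is why the code slices off the first three characters ("cb(") and the trailing ")" before parsing. A minimal sketch with a made-up response body:

import json

response = 'cb([{"parags": [{"c": "first page text"}]}])'  # hypothetical JSONP-wrapped response
data = json.loads(response[3:-1])  # strip "cb(" and ")" to get plain JSON
print(data[0]['parags'][0]['c'])  # first page text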
The "main" function
def getarticle(url):
    # sample URLs for the three supported cases:
    # url = "https://wenku.baidu.com/view/dbc53006302b3169a45177232f60ddccda38e63d.html?rec_flag=default&sxts=1584344873233"  # old-layout doc
    # url = "https://wenku.baidu.com/view/62906818227916888486d74f.html?rec_flag=default"  # new-layout doc
    # url = "https://wenku.baidu.com/view/ccfb5a96ba68a98271fe910ef12d2af90242a8f5.html?from=search"  # new-layout txt
    content = fetch_url(url)
    verjud(content)
    message = []
    if version[0] == 0:
        if version[1] == "doc":
            message = old_doc(content)
        elif version[1] == "txt":
            message = old_txt(content)  # old_txt is omitted here; see the full source on GitHub
        else:
            pass
    else:
        if version[1] == "doc":
            message = new_doc(content)
        elif version[1] == "txt":
            message = new_txt(content)
        else:
            pass
    return message
In the end, message is a list whose elements each hold the content of one page of the scraped document.
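A minimal way to drive it, prompting for a URL as the original script does and saving the pages to an arbitrarily named file:

if __name__ == "__main__":
    url = input("Enter the Wenku URL to download: ")
    pages = getarticle(url)
    with open("article.txt", "w", encoding="utf-8") as f:  # output filename is arbitrary
        f.write("\n".join(pages))
    print("Saved %d page(s)." % len(pages))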
Only part of the code is shown in this article; the full source can be downloaded from GitHub.