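Task: scrape the article titles and URLs from the blog site https://www.cnblogs.com/ and print each title and URL to the console.
The code is shown below; the approach and notes are in the comments.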
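#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__author__ = 'km'

import re
import urllib.request


def download(url):
    """Fetch the page at url and return its HTML decoded as UTF-8."""
    with urllib.request.urlopen(url=url) as result:
        content = result.read()
    htmlStr = content.decode("utf-8")
    return htmlStr


def analyes(htmlStr):
    """Return a list of {'url': ..., 'title': ...} dicts, one per post link."""
    # Each post title is an <a> tag whose attributes contain "post-item-title".
    aList = re.findall(r'<a[^>]*post-item-title[^>]*>[^<]*</a>', htmlStr)
    result = []
    for a in aList:
        # Earlier attempt, kept for reference: search() returned None with this pattern.
        # g = search('herf[\s]*=[\s]*[\'"]([^>\'""]*)[\'"]', a)
        # Match the href attribute and capture its value, so the URL keeps its path.
        # (The earlier findall pattern stopped at the first "/" after the host name.)
        g = re.search(r'href\s*=\s*[\'"]([^>\'"]*)[\'"]', a)
        url = g.group(1) if g is not None else ''
        # The title is the text between the opening tag and the closing tag.
        index1 = a.find(">")
        index2 = a.rfind("<")
        title = a[index1 + 1:index2]
        d = {}
        d['url'] = url
        d['title'] = title
        result.append(d)
    return result


def crawler(url):
    """Download the page, extract the posts, and print each title and URL."""
    html = download(url)
    blogList = analyes(html)
    for blog in blogList:
        print("title:", blog["title"])
        print("url:", blog["url"])


if __name__ == '__main__':
    crawler('https://www.cnblogs.com/')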
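Output:
Question:
In the regex step, using search gave g = None, so no URL was extracted. I have not found the cause yet. If anyone knows, an explanation would be appreciated.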
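One likely cause, judging only from the commented-out line: its pattern spells the attribute name `herf` rather than `href`, so it can never match an `<a>` tag. The sketch below checks this offline on a single sample tag. The tag itself is a hypothetical example shaped like the links that the `post-item-title` pattern matches, not real data from the site. It also runs the `findall` pattern from the code on the same tag, because `/` is not in that pattern's character class and each match therefore ends at the host name.

```python
import re

# Hypothetical sample tag (an assumption for illustration, not scraped data).
a = ('<a class="post-item-title" href="https://www.cnblogs.com/km/p/1.html" '
     'target="_blank">Sample title</a>')

# Pattern from the commented-out line: the attribute name is spelled "herf".
misspelled = re.search(r'herf[\s]*=[\s]*[\'"]([^>\'""]*)[\'"]', a)

# The same pattern with the attribute name spelled "href".
corrected = re.search(r'href[\s]*=[\s]*[\'"]([^>\'""]*)[\'"]', a)

# The findall pattern from the code: "/" is not in its character class,
# so each match stops at the end of the host name.
hosts = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', a)

print(misspelled)          # None
print(corrected.group(1))  # https://www.cnblogs.com/km/p/1.html
print(hosts)               # ['https://www.cnblogs.com']
```

If the real page behaves like this sample, correcting the spelling should be enough for `search` to return a match, and `group(1)` then gives the full URL as a string instead of a list.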