Web Scraping with Python (《Python网络数据采集》), Section 3.2: 'gbk' codec can't

snk_090623

Published 2017-06-12 14:30:52


Article link: https://blog.csdn.net/snk_090623/article/details/73105735


from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # /wiki/ paths that have already been visited

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        # print() accepts no encoding/decoding keyword; the output encoding belongs to sys.stdout
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

import io, sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')  # change the default encoding of standard output

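Inserting these two lines at the top of the script took care of it. They re-wrap standard output as UTF-8, so print() no longer tries to encode the page text with the console's default GBK codec.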
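For anyone who wants to see the failure without a Windows console, here is a minimal sketch. It assumes the offending character is U+00A0 (no-break space), which has no GBK encoding; the post does not say which character actually triggered the error. The helper name `write_with` and the sample string are made up for illustration. The same text is written through a GBK text stream and a UTF-8 text stream, both backed by in-memory buffers instead of the real console.

```python
import io

# Assumed sample: U+00A0 (no-break space) cannot be encoded as GBK.
sample = "Main\xa0Page"

def write_with(encoding):
    """Write `sample` through a text stream with the given encoding; return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(sample)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # leave the underlying buffer open
    return data

try:
    write_with("gbk")
    gbk_failed = False
except UnicodeEncodeError as err:
    gbk_failed = True
    print("gbk stream failed on:", repr(err.object[err.start:err.end]))

utf8_bytes = write_with("utf8")
print("utf8 stream wrote:", utf8_bytes)
```

The UTF-8 wrapper is the same kind of object that the two-line fix assigns to `sys.stdout`, only pointed at a `BytesIO` here so the result can be inspected.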

That said, as a complete beginner, I honestly have no idea whether it really worked...
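One way to check whether the change took effect is to look at `sys.stdout.encoding` after the two lines have run, and to test whether that codec can encode a character that GBK rejects. A rough sketch follows; `codec_can_encode` is a hypothetical helper name, the probe character U+00A0 is an assumed example, and falling back to UTF-8 when the stream reports no encoding is also an assumption.

```python
import sys

def codec_can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Encoding currently used by print(); assumed to be UTF-8 if the stream reports none.
current = getattr(sys.stdout, "encoding", None) or "utf-8"
print("stdout encoding:", current)

probe = "\xa0"  # assumed example of a character GBK cannot encode
print("gbk can encode probe:", codec_can_encode(probe, "gbk"))
print("stdout can encode probe:", codec_can_encode(probe, current))
```

If the reported encoding is `utf8` or `utf-8` and the last line prints `True`, the wrapper is in place. Whether the console then displays the characters correctly is a separate question that depends on the terminal's own code page.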
