Python爬取百度百科简介以及内链词条简介

最新推荐文章于 2024-03-01 14:18:40 发布

bxri

最新推荐文章于 2024-03-01 14:18:40 发布

阅读量1k

点赞数 4

分类专栏： Python 文章标签： python 百度百科 python爬虫

本文链接：https://blog.csdn.net/xurQQ/article/details/110099093

版权

Python 专栏收录该内容

1 篇文章 1 订阅

订阅专栏

注意初始中文词条需要转换(不转换会报错UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128))：

如：爱奇艺--->urllib.parse.quote("爱奇艺")-->%E7%88%B1%E5%A5%87%E8%89%BA

程序源码

from urllib.request import urlopen
import urllib
from bs4 import BeautifulSoup


pages = set()
def getLink(pageurl):
   global pages
   headers = {

   "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"

   }
   print('https://baike.baidu.com{}'.format(pageurl))
   html = urlopen(urllib.request.Request('https://baike.baidu.com{}'.format(pageurl), headers=headers))
   bs = BeautifulSoup(html, 'html.parser')
   try:
      print(bs.h1.get_text())
      for text in bs.find('div', {'class': 'lemma-summary'}).find_all('div', {'class': 'para'}):
         print(text.get_text())
      for link in bs.find('div', {'class': 'para'}).find_all('a'):
         if 'href' in link.attrs:
            print(link.attrs['href'])
         else:
              print('-----非链接-----')
              print(link)
              print('-----非链接-----')
   except AttributeError:
      print("yemianqieshaoshuxing")

   for link in bs.find('div', {'class': 'para'}).find_all('a', href=re.compile('^(/item/)')):
      if 'href' in link.attrs:
         if link.attrs['href'] not in pages:
            newpage = link.attrs['href']
            print('-'*20)

            print(newpage)
            pages.add(newpage)
            getLink(newpage)


get_ = '/item/爱奇艺'
print(get_)
getLink(urllib.parse.quote(get_))

bxri

关注

4
点赞
踩
6

收藏

觉得还不错? 一键收藏
1
评论
Python爬取百度百科简介以及内链词条简介

注意初始中文词条需要转换(不转换会报错UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128))：如：爱奇艺--->urllib.parse.quote("爱奇艺")-->%E7%88%B1%E5%A5%87%E8%89%BA程序源码from urllib.request import urlopenimpo...
复制链接

扫一扫

专栏目录