A Small Python Web Scraping Exercise: Word Lookup Tool

Latest recommended article published on 2024-08-11 03:38:42

JK12

Tags: python, web scraping, regular expressions, web server

Article link: https://blog.csdn.net/Yunlog/article/details/115018312

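Environment used in this exercise:

Python3
Libraries such as bs4 and lxml (if one is missing, install it with pip install xxx)
The exercise was run in the Jupyter Notebook that ships with anaconda3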

Importing the required libraries

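# -*- coding: utf-8 -*-
"""
Created on Fri Mar 19 22:22:10 2021

@author: yunlo
"""
import urllib.request          # request module of the urllib library
import urllib.error            # URLError, raised when a request fails
from bs4 import BeautifulSoup  # HTML parsing
import lxml                    # document parser used by BeautifulSoup
import re                      # regular expressions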

Key part: the URL fetcher (returns the page's HTML as text)

# Just pass in the URL to fetch; calling this inside a for loop also lets you page through results
def askURL(url):
    head = {    # Mimic a browser's request headers when talking to the server (disguise)
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
        # User agent: tells the server what kind of browser we are (and so what content we can accept)
    }
    req = urllib.request.Request(url, headers=head)
    html = ''
    try:
        response = urllib.request.urlopen(req, timeout=5) # timeout limits how long we wait
        html = response.read().decode("utf-8")
        #print(html)
    except urllib.error.URLError as e:
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
    except Exception:
        # Anything else, most likely a timeout while reading the response
        print("The request may have timed out!")
    return html

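Note: your User-Agent will most likely differ from the one above. To find your own, press F12, then click 1, refresh the page, click 2, click 3, and scroll down to 4 (the numbers refer to the markers in the screenshot).
[Screenshot]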
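
To check that the header really is attached, the request can be inspected before it is sent. This is only a small sketch: `build_request` and the shortened `MY_USER_AGENT` string are placeholders I made up for illustration, and nothing here touches the network.

```python
import urllib.request

# Placeholder value: substitute the User-Agent string copied from your own browser.
MY_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"

def build_request(url, user_agent=MY_USER_AGENT):
    """Build (but do not send) a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = build_request("http://dict.cn/like")

# urllib stores header names in capitalized form, hence "User-agent" here.
print(req.get_header("User-agent"))
print(req.full_url)
```

If the printed value matches what the browser shows, the server should see the same string once the request is actually sent.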

Base URL and regular expressions

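baseurl = 'http://dict.cn/'    # site used to look words up (the lookup is based on dict.cn)

findZS = re.compile(r'<strong>(.*?)</strong>') # matches the meaning
findCX = re.compile(r'<span>(.*?)</span>')     # matches the part of speech
findLJ = re.compile(r'<li>(.*?)</li>')         # matches the example sentences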
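
To see what these patterns capture, they can be tried on a small hand-written fragment. The HTML below is made up for illustration and is only assumed to resemble a dict.cn entry; the real markup may differ.

```python
import re

findZS = re.compile(r'<strong>(.*?)</strong>')
findCX = re.compile(r'<span>(.*?)</span>')

# Hand-written sample, NOT copied from dict.cn
sample = (
    '<ul class="dict-basic-ul">'
    '<li><span>v.</span><strong>to enjoy</strong></li>'
    '<li><span>prep.</span><strong>similar to</strong></li>'
    '</ul>'
)

span_texts = findCX.findall(sample)    # text inside each <span>
strong_texts = findZS.findall(sample)  # text inside each <strong>
print(span_texts)
print(strong_texts)

# A greedy (.*) runs on to the LAST closing tag instead of the nearest one,
# so it returns one long match that swallows the markup in between.
greedy = re.findall(r'<span>(.*)</span>', sample)
print(len(greedy))
```

The non-greedy `(.*?)` is what keeps each match confined to a single tag pair.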

Reading the word to look up

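def pdstr():
    # Keep asking until the input consists of letters only
    while True:
        s = input('Enter the word you want to look up: ')
        if s.isalpha():
            return s
        print('Please check your input for mistakes!')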
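
`str.isalpha()` is stricter in some ways and looser in others than one might expect, as the sketch below shows. The `is_ascii_word` helper is a hypothetical variant of my own, not part of the original code, for the case where only plain English letters should pass.

```python
def is_ascii_word(s):
    """Hypothetical stricter check: ASCII letters only (str.isascii needs Python 3.7+)."""
    return s.isascii() and s.isalpha()

# Print each candidate with the result of both checks
for candidate in ["like", "ice cream", "don't", "", "词"]:
    print(repr(candidate), candidate.isalpha(), is_ascii_word(candidate))
```

Spaces, apostrophes, and the empty string are rejected by `isalpha()`, while non-ASCII letters such as Chinese characters are accepted.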

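Main body of the lookup tool (in essence it pulls the parts we want out of the page source, so some familiarity with the HTML structure is needed to locate them)

1. soup.find_all(): coarse search (locates elements by the HTML structure)
2. re.findall(): precise search (matches with a regular expression)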
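
The two steps can be combined as in the sketch below, which runs on a hand-written fragment rather than a live page. The markup is an assumption made for illustration, and Python's built-in `html.parser` stands in for `lxml` only so that the example needs nothing beyond bs4.

```python
import re
from bs4 import BeautifulSoup

# Hand-written sample, NOT copied from dict.cn
html = """
<div><span>unrelated</span></div>
<ul class="dict-basic-ul">
  <li><span>v.</span><strong>to enjoy</strong></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: coarse search by tag name and class
blocks = soup.find_all("ul", class_="dict-basic-ul")

# Step 2: precise search with a regex, limited to the block found above
spans = re.findall(r"<span>(.*?)</span>", str(blocks))
print(spans)

# Alternative for step 2 that stays inside BeautifulSoup
spans_bs = [tag.get_text() for tag in blocks[0].find_all("span")]
print(spans_bs)
```

Because the regex only sees the text of the matched `ul`, the `<span>` in the unrelated `div` is never picked up.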

def chaciqi():

    word = pdstr()  # read the word and validate it

    # Fetch the HTML text
    html = askURL(baseurl + word)

    # Parse the HTML
    soup = BeautifulSoup(html, 'lxml')

    items = str(soup.find_all('ul', class_="dict-basic-ul"))
    cx = re.findall(findCX, items)
    means = re.findall(findZS, items)

    item = str(soup.find_all('div', class_="layout sort"))
    ljs = re.findall(findLJ, item)

    print(word)
    # One line per part of speech; "like", for example, has as many as 6 >_<
    # zip() stops at the shorter list, so a length mismatch cannot raise IndexError
    for pos, meaning in zip(cx, means):
        print(pos + ' ' + meaning)

    print('Example sentences:')
    for i in ljs:
        print(i.replace('<br/>', '---'))
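
One detail in `askURL(baseurl + word)` may be worth a sketch. `isalpha()` also accepts non-ASCII letters, and `urlopen` generally fails when the URL contains non-ASCII characters, so percent-encoding the word first is a reasonable precaution. `build_lookup_url` is a hypothetical helper of my own, not part of the original code.

```python
from urllib.parse import quote

baseurl = 'http://dict.cn/'

def build_lookup_url(word, base=baseurl):
    """Hypothetical helper: percent-encode the word before appending it to the base URL."""
    return base + quote(word)

print(build_lookup_url('like'))   # plain ASCII letters pass through unchanged
print(build_lookup_url('café'))   # the non-ASCII letter is percent-encoded as UTF-8
```

For ordinary English words the result is identical to `baseurl + word`, so the change would only matter for unusual input.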