Using Python (1): Scraping and Cleaning Beautiful Chinese-English Bilingual Short Sentences from a Web Page

Latest recommended article published on 2023-05-29 12:53:59

zhudfly2013

Category: Python. Tags: Python, web scraping, data cleaning

Article link: https://blog.csdn.net/zhudfly2013/article/details/81667047

Crawl Short Sentence

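Scraping some beautiful Chinese-English bilingual short sentences

First, find a website:

http://www.siandian.com/haojuzi/1574.html

The link above serves as the example throughout.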

# Fetch a web page by URL
import urllib.request

def get_html(url):
    # Set a browser-like User-Agent header so the server does not treat the request as coming from a script
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}

    req = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(req)
    html_content = response.read().decode("gbk")
    return html_content

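The code imports urllib and targets Python 3, where the old urllib and urllib2 modules were merged into the urllib package (urllib.request, urllib.error, and so on).
The headers make the request look like one from an ordinary browser, so the server is less likely to block it.
Finally, the function returns the page content, decoded as GBK.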
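
To make the two ideas above (the request header and the GBK decoding) easier to try without touching the network, here is a minimal sketch. The helper names build_request and decode_body are hypothetical and not part of the original code; the choice of errors='replace' is also an assumption, made so that a stray invalid byte does not abort the whole page.

```python
import urllib.request


def build_request(url, user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'):
    """Return a Request carrying a browser-like User-Agent header (no network I/O)."""
    return urllib.request.Request(url=url, headers={'User-Agent': user_agent})


def decode_body(raw_bytes, encoding='gbk'):
    """Decode response bytes; undecodable bytes are replaced instead of raising."""
    return raw_bytes.decode(encoding, errors='replace')


req = build_request('http://www.siandian.com/haojuzi/1574.html')
# urllib stores header names via str.capitalize(), hence the key 'User-agent'.
print(req.get_header('User-agent'))
print(decode_body('爱情'.encode('gbk')))
```

Passing the resulting request to urllib.request.urlopen and the response bytes to decode_body would reproduce what get_html does, with the decoding step made tolerant of bad bytes.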

Analyzing the page

Identify the features of the content to scrape. On the site above, the markup looks like this:

<p>
    <br />
    1、我的世界不允许你的消失,不管结局是否完美。<br />
    No matter the ending is perfect or not, you cannot disappear from my world.</p>
<p>
    2、爱情是一个精心设计的谎言。<br />
    Love is a carefully designed lie.</p>

Each entry starts with "<p>" and ends with "</p>".

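Extracting the sentences

import re
def get_sentence(content):
    # re.S lets '.' match newlines, so a paragraph spanning several lines is captured
    content_list = re.findall('<p>.*?</p>', content, re.S)
    return content_list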

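The code imports the re module, which provides Python's regular expression support.
It calls re.findall: the first argument is the regular expression, the second is the text to search, and the third is the flags. Here re.S lets '.' match newline characters as well.
The regular expression is simple: it matches content that starts with "<p>" and ends with "</p>". The ".*?" part means:
1. '.' matches any character
2. '*' matches the preceding element zero or more times
3. '?' makes the match non-greedy, so it consumes as few characters as possible while still satisfying the pattern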
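
The effect of the non-greedy '?' and of the re.S flag is easiest to see side by side. The following is a small sketch on a hand-written sample string (an assumption modeled on the page markup, not fetched from the site):

```python
import re

# Two paragraphs in the same shape as the page markup, including line breaks.
sample = (
    "<p>\r\n\t1、我的世界不允许你的消失,不管结局是否完美。<br />\r\n\t"
    "No matter the ending is perfect or not, you cannot disappear from my world.</p>\r\n"
    "<p>\r\n\t2、爱情是一个精心设计的谎言。<br />\r\n\t"
    "Love is a carefully designed lie.</p>"
)

lazy = re.findall('<p>.*?</p>', sample, re.S)    # stops at the first </p>: one match per paragraph
greedy = re.findall('<p>.*</p>', sample, re.S)   # runs to the last </p>: one match spanning both
no_flag = re.findall('<p>.*?</p>', sample)       # without re.S, '.' cannot cross '\n'

print(len(lazy), len(greedy), len(no_flag))  # prints: 2 1 0
```

Without the non-greedy modifier the two paragraphs collapse into a single match, and without re.S nothing matches at all because every paragraph contains a newline.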

Analyzing the extracted sentences

type1:

'<p>\r\n\t<br />\r\n\t1、我的世界不允许你的消失,不管结局是否完美。<br />\r\n\tNo matter the ending is perfect or not, you cannot disappear from my world.</p>'

type2:

"<p>\r\n\t66、<a href='http://www.siandian.com/lianaijiqiao/' target='_blank'><u>恋爱</u></a>中，干傻事总是让人感到十分美妙。<br />\r\n\tIn love folly is always sweet.</p>"
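
The second type differs from the first only by the inline link and underline tags wrapped around one word. The article's cleaning step discards such entries; as an alternative, and purely as a sketch rather than the author's method, the inline tags could be stripped so the sentence survives. The helper name strip_inline_tags is hypothetical.

```python
import re


def strip_inline_tags(fragment):
    """Remove <a ...>, </a>, <u> and </u> tags while keeping the text between them."""
    return re.sub(r'</?(?:a|u)\b[^>]*>', '', fragment)


type2 = (
    "<p>\r\n\t66、<a href='http://www.siandian.com/lianaijiqiao/' target='_blank'>"
    "<u>恋爱</u></a>中，干傻事总是让人感到十分美妙。<br />\r\n\tIn love folly is always sweet.</p>"
)

# The result has the same shape as a type1 entry: only <p> and <br /> remain.
print(repr(strip_inline_tags(type2)))
```

The pattern only targets a and u tags, so the p and br tags that the later cleaning relies on are left untouched.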

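Cleaning the data

def clean_sentence(item_temp):
    # Drop the paragraph tags and line breaks; mark the Chinese/English boundary with "&&"
    item_temp = item_temp.replace("<p>\r\n\t<br />", "").replace("<br />\r\n\t", "&&").replace("</p>", "").replace("<p>", "").replace("\r\n\t", "")
    # Split off the leading item number ("1、"); anything that does not split into exactly two parts is skipped
    item_temp = item_temp.split('、')
    if len(item_temp) == 2:
        item_temp = item_temp[1]
    else:
        # print(item_temp)
        return ''
    # Keep only entries without an embedded link (type1); "&$" marks the end of the entry
    if "<a href=" not in item_temp:
        return item_temp + " &$\n"

    return ''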

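After cleaning, an entry looks like this (the "&&" and "&$" markers are added so the Chinese and English sentences can be split apart later):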

我的世界不允许你的消失,不管结局是否完美。&&No matter the ending is perfect or not, you cannot disappear from my world. &$
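
The article leaves the later splitting step unspecified. One possible way to use the two markers, offered only as a sketch with a hypothetical helper name split_pair, is to strip the trailing "&$" and partition on "&&":

```python
def split_pair(cleaned):
    """Split 'chinese&&english &$\\n' into a (chinese, english) tuple, or None if malformed."""
    body = cleaned.strip()
    if body.endswith('&$'):
        body = body[:-2].rstrip()          # drop the end-of-entry marker
    chinese, sep, english = body.partition('&&')
    if not sep:                            # no "&&" marker: not a bilingual entry
        return None
    return chinese.strip(), english.strip()


line = ("我的世界不允许你的消失,不管结局是否完美。&&"
        "No matter the ending is perfect or not, you cannot disappear from my world. &$\n")
print(split_pair(line))
```

Using partition rather than split keeps the English half intact even if it happened to contain "&&" again.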

Complete code

# -*- coding: UTF-8 -*-
import re
import urllib.request

websites = ["http://www.siandian.com/haojuzi/1574.html"]


# Fetch a web page by URL
def get_html(url):
    # Set a browser-like User-Agent header so the server does not treat the request as coming from a bot
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}

    req = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(req)
    html_content = response.read().decode("gbk")
    return html_content


# Extract the sentences with a regular expression
def get_sentence(content):
    content_list = re.findall('<p>.*?</p>', content, re.S)
    sentence_list = []
    for item_loop in content_list:
        item_loop = clean_sentence(item_loop)
        if len(item_loop) > 0:
            sentence_list.append(item_loop)

    for show in sentence_list:
        print(show)

    return sentence_list


# Clean one extracted sentence
def clean_sentence(item_temp):
    item_temp = item_temp.replace("<p>\r\n\t<br />", "").replace("<br />\r\n\t", "&&")\
        .replace("</p>", "").replace("<p>", "").replace("\r\n\t", "")
    item_temp = item_temp.split('、')
    if len(item_temp) == 2:
        item_temp = item_temp[1]
    else:
        # print(item_temp)
        return ''
    if "<a href=" not in item_temp:
        return item_temp + " &$\n"

    return ''


if __name__ == '__main__':
    html = get_html(websites[0])
    get_sentence(html)
