An Application of Python Web Scraping: Counting Word Frequencies

Author: 不是汤圆

Category: python. Tags: python, web scraping

Article link: https://blog.csdn.net/qq_40771317/article/details/120395117

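With some free time on my hands recently, and inspired by what I have been learning about Python web scraping, I became very curious about word frequencies in English, a language used by so many people around the world. So I decided to put my newly learned scraping skills to work on a small English word-frequency project.

The overall plan: use Python to scrape English classics that can be read online, extract their paragraphs, and save them to local txt files; split the text in those files into individual words and add each word to a Counter object from Python's collections module to count occurrences; store each word and its count from the Counter in MySQL for later visualization; finally, sort by frequency in descending order, take the top 100 words, save them to a csv file, and visualize them with the WordCloud object from pyecharts.

With the plan settled, let's get to work.

First we need a non-profit site where classics can be read online. Well-known options include Zlibrary and Project Gutenberg. I chose the latter because it requires no login and is friendly to crawlers. I wanted a few dozen books to scrape, so I targeted this URL: https://www.gutenberg.org/ebooks/search/?sort_order=downloads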

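This is the target page:

[Screenshot: the target search results page]

Use the browser's developer tools to find where the urls of the books we need are located.

Inspection shows that each target is the href attribute of an a tag inside a list item (li). The next step is to collect these urls.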
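
To make the structure just described concrete, here is a minimal sketch that pulls the href values out of such list items. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The HTML fragment, the class name BookLinkParser, and the value of base_url are assumptions made for illustration, not taken from the real page.

```python
from html.parser import HTMLParser


class BookLinkParser(HTMLParser):
    """Collect href values of <a class="link"> tags inside <li class="booklink">."""

    def __init__(self):
        super().__init__()
        self.in_booklink = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "booklink" in classes:
            self.in_booklink = True
        elif tag == "a" and self.in_booklink and "link" in classes:
            self.hrefs.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_booklink = False


# Hypothetical fragment that imitates the structure seen in the developer tools.
sample_html = """
<ul>
  <li class="booklink"><a class="link" href="/ebooks/84">Frankenstein</a></li>
  <li class="booklink"><a class="link" href="/ebooks/1342">Pride and Prejudice</a></li>
  <li class="navlink"><a href="/ebooks/search/?start_index=26">Next</a></li>
</ul>
"""

parser = BookLinkParser()
parser.feed(sample_html)

base_url = "https://www.gutenberg.org"  # assumed site root
full_urls = [base_url + href for href in parser.hrefs]
print(full_urls)
```

The navigation link is ignored because its li does not carry the booklink class, which is the same filtering that a class-based find_all performs.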

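First, create a function, open_page(), that takes a url and returns the page's html text:

import requests

# Assumed request headers; the original post does not show the value of `headers`.
headers = {'User-Agent': 'Mozilla/5.0'}

def open_page(url):
    """Fetch a url and return the response body as text."""
    req = requests.get(url, headers=headers)
    req.encoding = 'utf-8'
    html = req.text
    return html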

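Next, create a function that collects the urls of the target books:

from bs4 import BeautifulSoup

# Shared state. The original post does not show these definitions, so the values are assumed.
base_url = 'https://www.gutenberg.org'
main_url = 'https://www.gutenberg.org/ebooks/search/?sort_order=downloads'
start_index = 0
link_list = []

def get_full_url(html):
    global start_index
    global link_list
    bsObj = BeautifulSoup(html, 'html.parser')
    # Every book on a result page is an <li class="booklink"> element.
    links = bsObj.find_all('li', {'class': 'booklink'})
    start_index += len(links)
    for link in links:
        # The href is site-relative, so prepend the site root.
        short_url = link.find('a', {'class': 'link'})['href']
        full_url = base_url + short_url
        link_list.append(full_url)

def get_inform():
    global start_index
    global main_url
    # Fetch result pages one after another until 100 books have been collected.
    while start_index < 100:
        html = open_page(main_url)
        get_full_url(html)
        main_url = "https://www.gutenberg.org/ebooks/search/?sort_order=downloads" + "&start_index=" + str(start_index)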

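A note on this: the site lists the 100 most popular books, but each page shows only 25 of them. I therefore use a global variable, start_index, to record how many books have been collected so far, while main_url holds the url of the current page. The page urls follow a regular pattern: only the start_index parameter changes from page to page.

Each book's url is stored in the list link_list, created in advance, for later use.

But it is not that simple: the urls collected this way do not lead directly to the online reading page. Opening one gives a page like this:

[Screenshot: a book's landing page]

So we first need to visit that page and extract two key pieces of information:

1. the url of the book's online text

2. the book's title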
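
The url rule described above can be sketched as a small helper. The function name build_page_url and the constant SEARCH_URL are made up for this illustration. The page size of 25 and the total of 100 come from the text, and the start_index values simply follow the author's counting, so they may need adjusting if the site numbers its results differently.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.gutenberg.org/ebooks/search/"


def build_page_url(start_index):
    """Return the search url for the result page that begins at start_index."""
    query = {"sort_order": "downloads"}
    if start_index > 0:
        query["start_index"] = start_index
    return SEARCH_URL + "?" + urlencode(query)


# 25 books per page, 100 books in total, mirroring the numbers quoted above.
page_urls = [build_page_url(i) for i in range(0, 100, 25)]
for url in page_urls:
    print(url)
```

Building the query with urlencode keeps the parameter handling in one place, instead of concatenating strings at every call site.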

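import re

link_dict = {}  # title -> full url of the book text (definition assumed, not shown in the original)

def goto_page():
    global link_list
    # Visit each book's landing page and record its title and reading url.
    while len(link_list) > 0:
        link = link_list.pop()
        html = open_page(link)
        bsObj = BeautifulSoup(html, 'html.parser')
        title = bsObj.find('td', {'itemprop': 'headline'}).get_text()
        # Replace whitespace control characters so the title can serve as a file name.
        title = re.sub(r'[\n\t\r\v\f]', ' ', title)
        title = title.strip()
        # Take the first link in the files table, which led to the online text in the author's case.
        page_url = bsObj.find('table', {'class': 'files'}).find('td', {'class': 'unpadded icon_save'}).find('a')['href']
        page_url = base_url + page_url
        link_dict[title] = page_url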

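As before, start_index tracks how many of the top 100 books have been collected (25 per page), and crawling stops once it reaches 100. The elements to scrape on each book page are located the same way as before and then cleaned up: whitespace control characters are stripped from the title so that they do not interfere with file naming later, and the url is expanded to a full address. Both are stored in link_dict, created in advance, with the title as key and the url as value.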
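
Since the title later becomes a file name, a slightly stricter cleanup may be worth considering. The sketch below is an assumption on my part rather than the code used in this post: the helper name safe_file_name is made up, and the character set is the one Windows rejects in file names. On Windows a colon in a title is a plausible source of oddly named output files, though I have not verified that this is what happened here.

```python
import re


def safe_file_name(title, replacement="_"):
    """Turn a book title into a string that is safe to use as a file name."""
    # Collapse every run of whitespace (newlines, tabs, ...) into one space.
    name = re.sub(r"\s+", " ", title).strip()
    # Replace the characters that Windows forbids in file names.
    name = re.sub(r'[\\/:*?"<>|]', replacement, name)
    # Windows also rejects names that end in a dot or a space.
    return name.rstrip(". ")


print(safe_file_name("Moby Dick;\r\nOr, The Whale"))
print(safe_file_name("A Tale: Part 1/2"))
```

Replacing rather than deleting the forbidden characters keeps two different titles from collapsing into the same file name quite as easily.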

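Next, visit each book's online reading page, extract the paragraphs of every chapter, and save them to a txt file.

import os

def store_page(html, title):
    """Save a book's chapter paragraphs to book/<title>.txt; return False if the layout is unsupported."""
    bsObj = BeautifulSoup(html, "html.parser")
    chapters = bsObj.find_all("div", {"class": "chapter"})
    if len(chapters) == 0:
        return False
    os.makedirs('book', exist_ok=True)  # make sure the output folder exists
    file_path = 'book/' + title + '.txt'
    with open(file_path, 'w+', encoding='utf-8') as fd:
        for chapter in chapters:
            ps = chapter.find_all('p')
            if len(ps) == 0:
                return False
            for p in ps:
                content = p.get_text()
                # End each paragraph with a newline so adjacent paragraphs do not run together.
                fd.write(content + '\n')
    return True


def is_to_store():
    # Try to save every book and report the ones whose layout store_page cannot handle.
    for t, u in link_dict.items():
        html = open_page(u)
        state = store_page(html, t)
        if not state:
            print('skipped: ' + t)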

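Because the DOM structure of each book's html differs, I settled for a compromise. Most books store each chapter as a div with class chapter containing several paragraphs (p), so the code above only handles books with that structure and skips the rest.

Because the code is imperfect, the process above leaves behind some invalid files, such as files with no extension and txt files of size 0. These are meaningless and need to be deleted.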

def delete_invalid_file():
    print('start to delete invalid file:')
    root_path = 'book'  # the same relative folder that store_page writes to
    for file in os.listdir(root_path):
        new_path = os.path.join(root_path, file)
        # Delete anything that is not a .txt file, and any .txt file that is empty.
        if not file.endswith('.txt') or os.path.getsize(new_path) == 0:
            os.remove(new_path)
            print(file, end=' ')
    print('delete over')

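After the invalid files are deleted, the remaining txt files look like this:

[Screenshot: the book folder with the remaining txt files]

Next, store the valid file names (the book titles) in MySQL, so that the data source is recorded as well.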

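def get_filename():
    global conn
    global cur
    print('start to store book_name to mysql')
    file_list = os.listdir('book')
    for i in range(len(file_list)):
        # Pass the parameter as a tuple so the driver escapes the value.
        cur.execute('INSERT INTO recorder (book_name) VALUES (%s)', (file_list[i],))
        # Replace the bare file name with its relative path for read_words().
        file_list[i] = os.path.join('book', file_list[i])
    # Not committed here; update_to_mysql() commits later on the same connection.
    # conn.commit()
    return file_list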

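The code above stores the book names in MySQL and returns file_list, which contains the relative path of each txt file for reading later.

from collections import Counter

words_counter = Counter()  # shared word tally across all books

def read_words(file_name):
    print('start to count from: ' + file_name)
    global words_counter
    with open(file_name, 'r', encoding='utf-8') as fd:
        for line in fd:
            line_str = line.strip()
            if line_str != '':
                # Lowercase the line, then treat every run of the letters a-z as one word.
                words = re.findall(r'[a-z]+', line_str.lower())
                words_counter.update(words)
    print('count over')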

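The code above is called once for each txt file: it reads the file's contents, splits out every word, and adds the words to the Counter created in advance, which records how many times each word occurs.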
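
To see what this tokenize-and-count step does, here is a tiny self-contained run on two sample lines. The sample text is invented for the illustration; the regular expression and the Counter usage are the same as in read_words.

```python
import re
from collections import Counter

words_counter = Counter()

# Two made-up lines standing in for the contents of a txt file.
sample_lines = [
    "It was the best of times,",
    "it was the worst of times.",
]

for line in sample_lines:
    # Lowercase first so that "It" and "it" are counted as the same word.
    words = re.findall(r"[a-z]+", line.lower())
    words_counter.update(words)

print(words_counter)

# An apostrophe is not in a-z, so a contraction is split in two.
contraction_parts = re.findall(r"[a-z]+", "don't")
print(contraction_parts)
```

The second print shows a side effect of this simple pattern: contractions produce stray single letters such as "t", which is one reason to filter one-letter tokens before storing the results.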

Next comes persistence: store each word in the Counter, together with its count, in MySQL.

def update_to_mysql():
    global conn
    global words_counter
    global cur
    print('start to store data to mysql:')
    for w, t in words_counter.items():
        # Keep multi-letter words plus the two valid one-letter words, 'a' and 'i'.
        if len(w) > 1 or w in ('a', 'i'):
            cur.execute('SELECT id, total FROM counter WHERE word = %s', (w,))
            result = cur.fetchone()
            if result is None:
                # New word: insert a fresh record.
                cur.execute('INSERT INTO counter (word, total) VALUES(%s, %s)', (w, t))
            else:
                # Known word: add this run's count to the stored total.
                row_id = result[0]
                total = result[1] + t
                cur.execute('UPDATE counter SET total = %s WHERE id = %s', (total, row_id))
    conn.commit()
    # This closes the shared connection; reconnect before running another database step.
    cur.close()
    conn.close()
    print('store over')

To allow more data to be added later, two cases must be handled. If the word is not yet in the table, simply insert a new record; if it is already there, update its count.
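
The insert-or-update logic can also be pushed into a single SQL statement. The sketch below shows the idea with Python's built-in sqlite3 module as a stand-in, so that it runs without a MySQL server. It is not the code used in this post. The table layout is assumed, and the ON CONFLICT clause needs SQLite 3.24 or newer. MySQL offers a comparable INSERT ... ON DUPLICATE KEY UPDATE form, which likewise requires a unique index on word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the UNIQUE constraint on word is what makes the upsert possible.
conn.execute(
    "CREATE TABLE counter (id INTEGER PRIMARY KEY, word TEXT UNIQUE, total INTEGER)"
)


def add_counts(conn, counts):
    """Insert new words and add to the totals of words that already exist."""
    conn.executemany(
        "INSERT INTO counter (word, total) VALUES (?, ?) "
        "ON CONFLICT(word) DO UPDATE SET total = total + excluded.total",
        list(counts.items()),
    )
    conn.commit()


# First run inserts both words; second run updates "the" and inserts "sea".
add_counts(conn, {"the": 3, "whale": 1})
add_counts(conn, {"the": 2, "sea": 4})

rows = conn.execute(
    "SELECT word, total FROM counter ORDER BY total DESC, word"
).fetchall()
print(rows)
```

Compared with a SELECT followed by an INSERT or UPDATE, this needs one statement per word instead of two, which matters when the Counter holds tens of thousands of entries.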

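Once everything is stored, select the top 100 words by descending frequency and turn them into a csv file and a more intuitive word cloud:

import csv
from pyecharts.charts import WordCloud  # import path for pyecharts 1.x and later

def convert_to_csv():
    global conn
    global cur
    header = ['word', 'count']
    cur.execute("SELECT word, total FROM `counter` ORDER BY total DESC LIMIT 100")
    result = cur.fetchall()
    # newline='' stops the csv module from writing blank rows on Windows.
    with open('book/result.csv', 'w', encoding='utf-8', newline='') as fd:
        f_csv = csv.writer(fd)
        f_csv.writerow(header)
        f_csv.writerows(result)
    # This closes the shared connection; reconnect before running another database step.
    cur.close()
    conn.close()


def data_to_WordCloud():
    global conn
    global cur
    cur.execute("SELECT word, total FROM `counter` ORDER BY total DESC LIMIT 100")
    # Each row is a (word, total) pair, the format WordCloud expects.
    data = list(cur.fetchall())
    word_map = WordCloud()
    word_map.add("Word frequency", data, word_size_range=[20, 100])
    word_map.render(path='book/wordCloud.html')
    cur.close()
    conn.close()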
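
For a quick check without the database, the same top-N export can be produced straight from a Counter. This is a sketch under that assumption, not the route taken in this post: the function name top_words_to_csv and the sample counts are made up, and an in-memory buffer stands in for the csv file.

```python
import csv
import io
from collections import Counter


def top_words_to_csv(counter, fd, limit=100):
    """Write the `limit` most frequent words and their counts as csv rows."""
    writer = csv.writer(fd)
    writer.writerow(["word", "count"])
    # most_common() returns (word, count) pairs sorted by descending count.
    writer.writerows(counter.most_common(limit))


# Made-up counts standing in for the real tally.
sample = Counter({"the": 5, "sea": 4, "whale": 1})

buffer = io.StringIO()
top_words_to_csv(sample, buffer, limit=2)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

To write a real file, pass a handle opened with open(path, 'w', encoding='utf-8', newline='') in place of the buffer.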

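Finally, the csv file opened in Excel looks like this:

[Screenshot: result.csv opened in Excel]

And the word cloud looks like this:

[Screenshot: the generated word cloud]

A closing note: the task is more or less done, but its shortcomings are obvious. The main problem is that the words are counted in their inflected forms rather than their base forms. Reducing each word to its base form turned out to be very time-consuming, since both calling an API and querying online dictionary pages were very slow, so in the end I compromised.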
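
One way to limit that cost, sketched here under assumptions, is to look up each distinct word only once and merge the counts afterwards, since the Counter already holds every word exactly once as a key. The function merge_by_lemma and the toy lookup table are invented for this illustration. A real lemmatizer, whether a local library or a remote service, would take the place of toy_lemmatize.

```python
from collections import Counter


def merge_by_lemma(counter, lemmatize):
    """Re-key a word Counter by base form, calling lemmatize once per distinct word."""
    merged = Counter()
    for word, total in counter.items():
        merged[lemmatize(word)] += total
    return merged


# Toy lookup table standing in for a real lemmatizer (assumption for the demo).
toy_lemmas = {"was": "be", "is": "be", "whales": "whale"}


def toy_lemmatize(word):
    # Words missing from the table are treated as already being in base form.
    return toy_lemmas.get(word, word)


raw = Counter({"was": 4, "is": 3, "whales": 2, "whale": 1, "sea": 5})
merged = merge_by_lemma(raw, toy_lemmatize)
print(merged)
```

With this shape the number of lookups equals the number of distinct words rather than the total number of words read, which is usually far smaller for a corpus of novels.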

Finally, let's all keep at it!
