Automatically downloading the text from multiple HTML pages

At work I often have to talk with overseas colleagues, but my spoken English is only so-so, so I sometimes look for listening and speaking materials to study. On a listening app I came across a decent resource called 老外最想聊的100个口语话题 (roughly, "The 100 Conversation Topics Foreigners Most Want to Talk About") and wanted an e-book version to put into my phone's reading app for easier access. After searching online for ages I couldn't find a downloadable e-book, but I did find a web page containing 100 links, each pointing to the dialogue for one chapter. So I figured I'd use a script to save the articles behind those links into a single document, and just like that I'd have my e-book~
Feeling pretty clever~ A quick search online, a few small tweaks, and the script was done in no time~

1. Getting all the links on the page

I referenced this article:
https://blog.csdn.net/cw632386583/article/details/88270250

from pprint import pprint
import requests
from bs4 import BeautifulSoup

def get_html(url):
    """Fetch a page and return its HTML, or None on failure."""
    try:
        return requests.get(url).text
    except Exception as e:
        print('web requests url error: {}\nlink: {}'.format(e, url))
        return None

def parse_page(soup):
    # The dialogue text on each chapter page sits in the div with class "tab_rim"
    article = soup.find_all('div', class_='tab_rim')
    lines = article[0].find_all('div')
    for line in lines:
        print(line.text)

class WebDownloader(object):
    def __init__(self, base_url):
        self.url = base_url
        self.links = set()

    def parse_html(self, verbose=False):
        html = get_html(self.url)
        soup = BeautifulSoup(html, 'lxml')

        # Collect every absolute link on the page
        for link in soup.find_all('a'):
            if link.has_attr('href'):
                href = str(link.get('href'))
                if href.startswith('http'):
                    self.links.add(href)
                    if verbose:
                        print(href)

    def download(self):
        for link in self.links:
            link = str(link)
            if link.endswith('.pdf'):  # handle direct pdf url links
                file_name = link.split('/')[-1]
                try:
                    r = requests.get(link)
                    with open(file_name, 'wb') as f:
                        f.write(r.content)
                except Exception as e:
                    print('Downloading error: {}\nlink: {}'.format(e, link))

url = 'https://www.24en.com/p/160258.html'  # change to your own url
wd = WebDownloader(url)
wd.parse_html()
pprint(wd.links)
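In practice the index page also carries plenty of unrelated links (navigation, ads and so on), so I'd filter the collected set down to the chapter pages before doing anything with it. A minimal sketch; the `/study/speaking/` pattern is an assumption based on the chapter URLs shown in step 2:

import re

# Assumed pattern: chapter pages live under /study/speaking/ (see step 2)
chapter_pat = re.compile(r'/study/speaking/.*\.html$')
chapter_links = sorted(link for link in wd.links if chapter_pat.search(link))
pprint(chapter_links)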

2. Extracting the text you need from each page

The text I want to scrape all sits inside the tag with class="tab_rim", so I used:

soup.find_all('div', class_='tab_rim') 

This is just an example program; every page is structured differently, so you'll need to adapt the code to your own needs rather than use it as-is.
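As an aside, the same element can also be located with bs4's CSS-selector interface, which some people find easier to tweak while experimenting with a new page layout; the two calls below are equivalent:

article = soup.find_all('div', class_='tab_rim')
article = soup.select('div.tab_rim')  # CSS-selector equivalent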

import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch a page and return its HTML, or None on failure."""
    try:
        return requests.get(url).text
    except Exception as e:
        print('web requests url error: {}\nlink: {}'.format(e, url))
        return None


def parse_page(soup):
    # The dialogue text sits in the div with class "tab_rim"
    article = soup.find_all('div', class_='tab_rim')
    lines = article[0].find_all('div')
    for line in lines:
        print(line.text)


class WebDownloader(object):

    def __init__(self, base_url):
        self.url = base_url
        self.links = set()

    def parse_html(self, verbose=False):
        html = get_html(self.url)
        soup = BeautifulSoup(html, features="lxml")
        parse_page(soup)

    # download() is identical to the version in step 1 and omitted here


urls = [
    'https://www.24en.com/study/speaking/2013-07-24/158488.html',
    'https://www.24en.com/study/speaking/2013-07-24/158489.html',
    'https://www.24en.com/study/speaking/2013-07-24/158490.html',
]  # only 3 links shown as an example

for url in urls:
    wd = WebDownloader(url)
    wd.parse_html()
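Printing to stdout means redirecting the output by hand to collect it. To actually end up with the "e-book", a small variation writes every chapter into one text file instead. A minimal sketch reusing get_html and urls from above; extract_text and the file name book.txt are just illustrative names I'm introducing here:

def extract_text(soup):
    # Same extraction as parse_page, but returns the text instead of printing
    article = soup.find_all('div', class_='tab_rim')
    return '\n'.join(div.text for div in article[0].find_all('div'))

# Append every chapter to a single file (hypothetical name 'book.txt')
with open('book.txt', 'w', encoding='utf-8') as f:
    for url in urls:
        html = get_html(url)
        if html:
            f.write(extract_text(BeautifulSoup(html, features='lxml')))
            f.write('\n\n')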

Here's the text of the first chapter I scraped~

First Impressions 初遇相识
 
dialogue 1
 
Steve: Mary, how was your date with john?
玛丽,你和约翰的约会怎么样?
 
Mary: it's ok. It seems we have a lot in common.
还可以.我们好像很投缘.
 
S: oh, really. That is great news. What does he look like?
真的吗?太好了.他长的怎么样?
 
M: he is tall and slim, fair-haired.
他身材高挑,金黄色的头发.
 
S: sounds like he is pretty cute. What do you think of him?
听起来长的很帅,那你对她印象怎么样啊
 
M: he is a nice guy and very considerate. I was impressed with how smart he was and he has a great sense of humor.
人不错,非常体贴,知识渊博,又很幽默.
 
S: oh, it's love at first sight. When will you see each other again?
啊,看来你是一见钟情了,下次什么时候见面?
 
M: he didn't say, but he said he would call me.
他没说,不过他说会给我打电话.
 
S: maybe he is busy with his work. Just wait and try not to think about it too much!
也许他现在工作比较忙,等等吧,别太想人家啊!
 
M: oh, steve. Stop it! I am a bit nervous! What if he doesn't call?
哎,苏珊,人家本来就很紧张了.如果他不大电话怎么办呢?
 
S: come on, Mary, you're a total catch. I bet he will call you. Don't worry.
别担心,玛丽,你这么讨人喜欢,我保证他会打电话的,别担心.
 
M: thank you, Steve. You're always so encouraging.
谢谢你,史蒂夫,你总是不断鼓励我.
 
S: that's what friends are for.
别说这些,我们是朋友嘛
 
dialogue 2
 
S: you know, Mary, I feel we meet somewhere before. Where were you born?
玛丽,你知道吗,我觉得好像在哪儿见过你,你是在哪里出生的?
 
M: I was born in Beijing, but I spent most of my childhood in London.
我在北京出生,后来去了伦敦。
 
S: what was your childhood like?
你的童年生活怎么样?
 
M: I had a pretty strict upbringing, and my parents taught at an university so they have extremely high expectations for me.
我小时候家教特别严,我父母以前是大学老师,对我期望特别高。
 
S: where did you go to university?
你在哪里上的大学?
 
M: my parents wanted me to stay in Beijing, but I decided to go back to England. I graduated from University of Newcastle upon Tyne with a degree in Cross Culture Communication.
我父母想让我留在北京,可是我决定回英国。后来我从纽卡斯尔大学毕业并且拿到了跨文化交流专业的学位。
 
S: what is your current occupation?
那你现在做什么工作呢?
 
M: I am a journalist. I write for China Daily.
我做记者,在《中国日报》工作。
 
S: did you know that you wanted to be a journalist right after your graduation?
那你一毕业就知道自己相当记者吗?
 
M: no, I didn’t. I started working at a university in London but as time went by, I found I did not really like my job. I decided to explore other fields. Journalism is great fit for me as well as a challenge. 
没有。我最初在伦敦一所大学教书,颗后来我觉得自己并不喜欢当老师。于是就决定尝试一下其它领域。新闻工作对我来说不失为一种新的尝试和挑战。
 
S: do you like your current job?
那你喜欢你现在的工作吗?
 
M: yes, I came to Beijing two years ago looking for new opportunities. I was lucky because my friend introduced me to my current company.
是的,两年前我来到北京,希望找到新的机会。当时很幸运,一个朋友介绍我进了现在的公司。 

This code successfully scraped the text of the first 66 of the 100 chapters. The later chapters use a slightly different format, so the parsing needs a small change:

# Later chapters wrap each paragraph in <p> instead of <div>,
# and the text carries \r\n and non-breaking-space noise to clean up
articles = soup.find_all('div', class_='tab_rim')
pages = articles[0].find_all('p')
for pi in pages:
    page = str(pi.text)
    page = page.replace('\r\n\xa0\r\n', '\n\n')
    page = page.replace('\r\n\xa0', '\n')
    print(page)
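To process all 100 chapters in one pass, the two layouts can be merged: look for <div> blocks first and fall back to <p>. A sketch, assuming the later pages have no nested <div> inside tab_rim (which is what lets the fallback trigger):

def parse_page(soup):
    # Chapters 1-66 wrap each line in <div>; later chapters use <p>.
    # Assumption: the <p>-style pages have no nested <div> in tab_rim.
    article = soup.find_all('div', class_='tab_rim')[0]
    blocks = article.find_all('div') or article.find_all('p')
    for block in blocks:
        text = str(block.text)
        # Strip the \r\n / non-breaking-space noise seen in later chapters
        text = text.replace('\r\n\xa0\r\n', '\n\n')
        text = text.replace('\r\n\xa0', '\n')
        print(text)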