My job involves frequent communication with colleagues abroad, but my spoken English is fairly average, so I sometimes look for listening and speaking materials to study. On a listening app I found a decent resource called 老外最想聊的100个口语话题 (roughly, "The 100 Speaking Topics Foreigners Most Want to Chat About"), and I wanted an electronic copy in my phone's reading app for easier review. After a long search I could not find a downloadable e-book, but I did find a web page with 100 links, each leading to the dialogue for one chapter. So I figured I would write a script to save the articles behind those links into a single document, and just like that I would have my e-book.
Feeling pretty clever! A quick web search, a few tweaks, and the script was ready in no time.
1. Get all the links on the page
I referenced this article:
https://blog.csdn.net/cw632386583/article/details/88270250
from pprint import pprint

import requests
from bs4 import BeautifulSoup


def get_html(url):
    try:
        return requests.get(url).text
    except Exception as e:
        print('web requests url error: {}\nlink: {}'.format(e, url))
        return ''


def parse_page(soup):
    article = soup.find_all('div', class_='tab_rim')
    lines = article[0].find_all('div')
    for line in lines:
        print(line.text)


class WebDownloader(object):
    def __init__(self, base_url):
        self.url = base_url
        self.links = set()

    def parse_html(self, verbose=False):
        html = get_html(self.url)
        soup = BeautifulSoup(html, features='lxml')
        for link in soup.find_all('a'):
            if link.has_attr('href'):
                href = str(link.get('href'))
                if href.startswith('http'):
                    self.links.add(href)
                    if verbose:
                        print(href)

    def download(self):
        for link in self.links:
            link = str(link)
            if link.endswith('.pdf'):  # handle a direct pdf url link
                file_name = link.split('/')[-1]
                try:
                    r = requests.get(link)
                    with open(file_name, 'wb+') as f:
                        f.write(r.content)
                except Exception as e:
                    print('Downloading error: {}\nlink: {}'.format(e, link))


url = 'https://www.24en.com/p/160258.html'  # replace with your own url
wd = WebDownloader(url)
wd.parse_html()
pprint(wd.links)
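The href filter above can be sanity-checked offline against an inline HTML snippet instead of the live page. The anchor targets below are made up for illustration:

```python
from bs4 import BeautifulSoup

# Same filtering logic as WebDownloader.parse_html, run on a small
# hand-written snippet: only absolute http(s) links should survive.
html = '''
<a href="https://www.24en.com/study/speaking/2013-07-24/158488.html">ch1</a>
<a href="/relative/path.html">relative, skipped</a>
<a>no href, skipped</a>
'''

links = set()
soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, no lxml needed
for link in soup.find_all('a'):
    if link.has_attr('href'):
        href = str(link.get('href'))
        if href.startswith('http'):
            links.add(href)

print(links)  # only the absolute link remains
```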
2. Get the text you need from the page
Because all the text I wanted to scrape sits inside a tag with class="tab_rim", I used:
soup.find_all('div', class_='tab_rim')
This is only an example program; every page is structured differently, so adapt the code to your own needs rather than using it verbatim.
import requests
from bs4 import BeautifulSoup


def get_html(url):
    try:
        return requests.get(url).text
    except Exception as e:
        print('web requests url error: {}\nlink: {}'.format(e, url))
        return ''


def parse_page(soup):
    article = soup.find_all('div', class_='tab_rim')
    lines = article[0].find_all('div')
    for line in lines:
        print(line.text)


class WebDownloader(object):
    def __init__(self, base_url):
        self.url = base_url
        self.links = set()

    def parse_html(self):
        html = get_html(self.url)
        soup = BeautifulSoup(html, features='lxml')
        parse_page(soup)

    def download(self):
        for link in self.links:
            link = str(link)
            if link.endswith('.pdf'):  # handle a direct pdf url link
                file_name = link.split('/')[-1]
                try:
                    r = requests.get(link)
                    with open(file_name, 'wb+') as f:
                        f.write(r.content)
                except Exception as e:
                    print('Downloading error: {}\nlink: {}'.format(e, link))


urls = [
    'https://www.24en.com/study/speaking/2013-07-24/158488.html',
    'https://www.24en.com/study/speaking/2013-07-24/158489.html',
    'https://www.24en.com/study/speaking/2013-07-24/158490.html',
]  # only three links here as an example

for url in urls:
    wd = WebDownloader(url)
    wd.parse_html()
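Since the goal from the start was a single document rather than console output, here is a sketch of collecting each chapter's text into one file. The helper extract_text and the file name oral_topics.txt are my own additions, not part of the original script, and the inline HTML snippets stand in for pages fetched with get_html:

```python
from bs4 import BeautifulSoup


def extract_text(soup):
    # Same selection logic as parse_page, but the text is returned
    # instead of printed so it can be written to a file.
    article = soup.find_all('div', class_='tab_rim')
    return '\n'.join(div.text for div in article[0].find_all('div'))


# Inline snippets standing in for get_html(url) results.
chapters = [
    '<div class="tab_rim"><div>First Impressions</div><div>dialogue 1</div></div>',
    '<div class="tab_rim"><div>Meeting People</div><div>dialogue 1</div></div>',
]

with open('oral_topics.txt', 'w', encoding='utf-8') as f:
    for html in chapters:
        soup = BeautifulSoup(html, 'html.parser')
        f.write(extract_text(soup) + '\n\n')
```

In the real script you would loop over the 100 scraped urls and pass each get_html(url) result through the same function.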
Here is the text scraped for the first chapter:
First Impressions 初遇相识
dialogue 1
Steve: Mary, how was your date with john?
玛丽,你和约翰的约会怎么样?
Mary: it's ok. It seems we have a lot in common.
还可以.我们好像很投缘.
S: oh, really. That is great news. What does he look like?
真的吗?太好了.他长的怎么样?
M: he is tall and slim, fair-haired.
他身材高挑,金黄色的头发.
S: sounds like he is pretty cute. What do you think of him?
听起来长的很帅,那你对她印象怎么样啊
M: he is a nice guy and very considerate. I was impressed with how smart he was and he has a great sense of humor.
人不错,非常体贴,知识渊博,又很幽默.
S: oh, it's love at first sight. When will you see each other again?
啊,看来你是一见钟情了,下次什么时候见面?
M: he didn't say, but he said he would call me.
他没说,不过他说会给我打电话.
S: maybe he is busy with his work. Just wait and try not to think about it too much!
也许他现在工作比较忙,等等吧,别太想人家啊!
M: oh, steve. Stop it! I am a bit nervous! What if he doesn't call?
哎,苏珊,人家本来就很紧张了.如果他不大电话怎么办呢?
S: come on, Mary, you're a total catch. I bet he will call you. Don't worry.
别担心,玛丽,你这么讨人喜欢,我保证他会打电话的,别担心.
M: thank you, Steve. You're always so encouraging.
谢谢你,史蒂夫,你总是不断鼓励我.
S: that's what friends are for.
别说这些,我们是朋友嘛
dialogue 2
S: you know, Mary, I feel we meet somewhere before. Where were you born?
玛丽,你知道吗,我觉得好像在哪儿见过你,你是在哪里出生的?
M: I was born in Beijing, but I spent most of my childhood in London.
我在北京出生,后来去了伦敦。
S: what was your childhood like?
你的童年生活怎么样?
M: I had a pretty strict upbringing, and my parents taught at an university so they have extremely high expectations for me.
我小时候家教特别严,我父母以前是大学老师,对我期望特别高。
S: where did you go to university?
你在哪里上的大学?
M: my parents wanted me to stay in Beijing, but I decided to go back to England. I graduated from University of Newcastle upon Tyne with a degree in Cross Culture Communication.
我父母想让我留在北京,可是我决定回英国。后来我从纽卡斯尔大学毕业并且拿到了跨文化交流专业的学位。
S: what is your current occupation?
那你现在做什么工作呢?
M: I am a journalist. I write for China Daily.
我做记者,在《中国日报》工作。
S: did you know that you wanted to be a journalist right after your graduation?
那你一毕业就知道自己相当记者吗?
M: no, I didn’t. I started working at a university in London but as time went by, I found I did not really like my job. I decided to explore other fields. Journalism is great fit for me as well as a challenge.
没有。我最初在伦敦一所大学教书,颗后来我觉得自己并不喜欢当老师。于是就决定尝试一下其它领域。新闻工作对我来说不失为一种新的尝试和挑战。
S: do you like your current job?
那你喜欢你现在的工作吗?
M: yes, I came to Beijing two years ago looking for new opportunities. I was lucky because my friend introduced me to my current company.
是的,两年前我来到北京,希望找到新的机会。当时很幸运,一个朋友介绍我进了现在的公司。
This code successfully scraped the text of the first 66 of the 100 chapters. The later chapters use a slightly different format, so the parsing needs a small change:
articles = soup.find_all('div', class_='tab_rim')
pages = articles[0].find_all('p')
for pi in pages:
    page = str(pi.text)
    page = page.replace('\r\n\xa0\r\n', '\n\n')
    page = page.replace('\r\n\xa0', '\n')
    print(page)
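The two replace() calls strip the '\r\n\xa0' padding these later pages insert between lines and paragraphs. A quick check on a made-up literal string:

```python
# '\r\n\xa0\r\n' marks a paragraph break, a bare '\r\n\xa0' a line break.
raw = 'line one\r\n\xa0\r\nline two\r\n\xa0line three'
page = raw.replace('\r\n\xa0\r\n', '\n\n')
page = page.replace('\r\n\xa0', '\n')
print(page)
# line one
#
# line two
# line three
```

Note that the order matters: the longer paragraph-break sequence must be replaced first, otherwise the shorter pattern would eat its prefix and every break would collapse to a single newline.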