Automatically scrape the daily weather, the daily Weibo hot-search list, and daily foreign-site data (with automatic translation), then send everything as a plain-text email.
Project features:
- Scrapes the one-week forecast for specified cities from the China Weather site (www.weather.com.cn);
- Scrapes the titles and links of the daily Weibo hot-search list;
- Scrapes the daily featured titles and links from http://conflictoflaws.net/ and renders each title as a Chinese-English pair, using Youdao online translation as the engine;
- Merges all scraped data into plain text, saves it locally, and emails it to the target mailboxes.
The implementation is as follows:
1. Main logic:
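The bilingual titles in the third feature depend on a translation call. Below is a minimal sketch of that step, assuming Youdao's free web endpoint and its historical JSON response shape (both are assumptions that may change without notice); the pure formatting helper accepts any translator, so a stub can be swapped in for testing:

```python
import requests


def youdao_translate(text):
    """Translate text via Youdao's free web endpoint.

    The URL, parameters, and JSON shape below are assumptions based on
    the historically public endpoint; Youdao may change or block it.
    """
    resp = requests.get(
        'http://fanyi.youdao.com/translate',
        params={'doctype': 'json', 'type': 'AUTO', 'i': text},
        timeout=10,
    )
    segments = resp.json()['translateResult'][0]
    return ''.join(seg['tgt'] for seg in segments)


def bilingual(title, translate=youdao_translate):
    """Format a title as an original/translation pair for the report."""
    return '%s | %s' % (title, translate(title))
```

Keeping the formatting separate from the network call makes the bilingual output easy to verify with a fake translator.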
import os
import time
import smtplib
from smtplib import SMTPDataError
from email.mime.text import MIMEText
from email.header import Header

import requests
from bs4 import BeautifulSoup


def get_text():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3722.400 QQBrowser/10.5.3738.400'
    }
    url = {
        'weibo': 'https://s.weibo.com/top/summary?cate=realtimehot',
        'tianqi': 'http://www.weather.com.cn',
        'law': 'http://conflictoflaws.net',
    }
    weibo = get_weibo(url, headers)
    tianqi = get_tianqi(url, headers)
    laws = get_law(url, headers)
    law = None
    for item in laws:
        law = item  # keeps only the last scraped entry
    text = oprate(weibo, tianqi, law)
    return text
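The `oprate()` helper called above is not shown in this section. A minimal sketch of what it plausibly does, assuming it simply stitches the three datasets into one list of report lines (the section headers here are invented for illustration):

```python
def oprate(weibo, tianqi, law):
    """Merge the scraped datasets into one list of report lines.

    Assumed inputs: weibo and tianqi are lists of strings, law is a
    single "title | link" string (get_text keeps only one law entry).
    """
    text = ['==== Weather ====']
    text.extend(tianqi)
    text.append('==== Weibo Hot Search ====')
    text.extend(weibo)
    text.append('==== Conflict of Laws ====')
    text.append(law)
    return text
```

Returning a flat list of lines matches how `main()` later writes the report: one `f.write(each + '\n')` per entry.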
def main():
    print('|============ Collecting data ============|')
    text = get_text()
    print('|====== Done; refreshing old data ======|')
    if os.path.exists('text.txt'):  # avoid FileNotFoundError on first run
        os.remove('text.txt')
    time.sleep(3)
    with open('text.txt', 'a+', encoding='utf-8') as f:
        for each in text:
            f.write(each + '\n')
    print('|============ Preparing to send ============|')
    with open('text.txt', 'r', encoding='utf-8') as f:
        string = f.read()
    time.sleep(5)
    try_max = 1
    while try_max < 6:
        try:
            from_addr = 'xxxx@126.com'
            password = 'xxxx'
            to_addr = ['xxxx@qq.com', 'xxxx@126.com', 'xxxx@qq.com']
            smtp_server = 'smtp.126.com'
            message = MIMEText(string, 'plain', 'utf-8')
            message['From'] = 'xxxx <xxxx@126.com>'
            message['To'] = 'Little Pig <SuperUser@qq.com>'
            message['Subject'] = Header(u'阿光每日小报', 'utf-8').encode()
            server = smtplib.SMTP(smtp_server, 25)
            server.set_debuglevel(1)
            server.login(from_addr, password)
            server.sendmail(from_addr, to_addr, message.as_string())
            server.quit()
        except SMTPDataError:
            print('|==== Send failed, retrying (attempt %d) ====|' % try_max)
            try_max += 1
            time.sleep(3)
        else:
            print('|=========== Email sent ===========|')
            time.sleep(5)
            break


if __name__ == '__main__':
    main()
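The `get_weibo()` helper called in `get_text()` is likewise not shown in this section. Below is a hedged sketch, assuming the hot-search page lays each entry out as a link inside a `td.td-02` table cell (a guess about the page markup, which Weibo may change at any time); parsing is split out from fetching so it can be checked against a static HTML snippet:

```python
import requests
from bs4 import BeautifulSoup


def parse_hot_search(html, base='https://s.weibo.com'):
    """Extract 'title link' lines from hot-search page HTML.

    The td.td-02 cell structure is an assumption about the page layout.
    """
    soup = BeautifulSoup(html, 'html.parser')
    entries = []
    for cell in soup.find_all('td', 'td-02'):
        a = cell.find('a')
        if a and a.get('href'):
            entries.append(a.get_text() + ' ' + base + a['href'])
    return entries


def get_weibo(url, headers):
    """Fetch and parse the hot-search summary page."""
    html = requests.get(url['weibo'], headers=headers).text
    return parse_hot_search(html)
```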
2. Scrape the one-week forecast for the specified cities (Lanzhou, Changsha, Nanjing, Hainan) from the China Weather site:
def get_tianqi(url, headers):
    lanzhou_url = url.get('tianqi') + '/weather/101160101.shtml'
    changsha_url = url.get('tianqi') + '/weather/101250101.shtml'
    nanjing_url = url.get('tianqi') + '/weather/101190101.shtml'
    hainan_url = url.get('tianqi') + '/weather/101310101.shtml'
    url_pool = [lanzhou_url, changsha_url, nanjing_url, hainan_url]
    weathers = []
    for item in url_pool:
        weather = []
        # headers must be passed as a keyword argument; a positional dict
        # would be taken as the params argument of requests.get
        html = requests.get(item, headers=headers).content.decode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        day_list = soup.find('ul', 't clearfix').find_all('li')
        for day in day_list:
            date = day.find('h1').get_text()
            wea = day.find('p', 'wea').get_text()
            if day.find('p', 'tem').find('span'):
                hightem = day.find('p', 'tem').find('span').get_text()
            else:
                hightem = ''
            lowtem = day.find('p', 'tem').find('i').get_text()