Overview

I've been studying English lately and wanted to work with authentic English-language material. After browsing The Economist, The New York Times, and The Atlantic, I found The Atlantic suited my taste best, so I wrote a scraper that fetches its daily news and saves each article as a Markdown file, which makes it easy to push to my blog.
Article collection:
Problems:
- I've mostly forgotten regular expressions (a short refresher sketch follows this list)
- Same with Scrapy; tonight I reviewed how to crawl a page, but I haven't looked at saving data or configuration yet
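Since the scraper below imports `re`, here is a quick regex refresher in that spirit. The slug-extraction pattern and the filename idea are my own illustration, not part of the original code:

```python
import re

# Hypothetical example: pull the article slug out of an Atlantic URL
# so it could be reused as a Markdown filename.
url = 'https://www.theatlantic.com/science/archive/2018/10/horsepox-smallpox-virus-science-ethics-debate/572200/'
match = re.search(r'/([\w-]+)/\d+/?$', url)
if match:
    print(match.group(1) + '.md')  # horsepox-smallpox-virus-science-ethics-debate.md
```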
```python
import requests
from lxml import etree
import re

# url = 'https://www.theatlantic.com/science/archive/2018/10/horsepox-smallpox-virus-science-ethics-debate/572200/'
url_root = 'https://www.theatlantic.com/latest/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}

def get_urlLists(html):
    """Extract article links from the /latest/ listing page."""
    selector = etree.HTML(html)
    # Each article link sits in an <li> under <ul class="river">
    url_lists = selector.xpath('//ul[@class="river"]/li/a/@href')
    # The hrefs are relative, so prepend the site root
    url_lists = ['https://www.theatlantic.com{}'.format(url) for url in url_lists]
    return url_lists

root_html = requests.get(url_root, headers=headers).text
url_lists = get_urlLists(root_html)
len(url_lists)  # 30
```
How lxml parses a page (a combined sketch follows this list):
- Parse a string
  - `etree.HTML(html_string)`
- Parse an HTML file
  - `etree.parse('path/to/file.html', etree.HTMLParser())`
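A minimal sketch of both entry points; the file path and the sample markup are placeholders of my own:

```python
from lxml import etree

# 1) Parse an HTML string already in memory (e.g. the body of a requests response)
selector = etree.HTML('<html><body><h1>Hello</h1></body></html>')
print(selector.xpath('string(//h1)'))  # Hello

# 2) Parse an HTML file on disk (path is a placeholder)
tree = etree.parse('saved_page.html', etree.HTMLParser())
print(tree.xpath('string(//h1)'))
```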
```python
def get_MarkDown_by_url(url):
    html = requests.get(url, headers=headers)
    if html.status_code == 200:  # only proceed on a successful response
        # Assumed continuation -- the original post is truncated here:
        return html.text
    return None
```
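To round out the post's stated goal of saving each article as Markdown, here is a hedged sketch of how the fetched HTML might be converted and written to disk. `html_to_markdown`, its XPath expressions, and the filename logic are all my own assumptions, not The Atlantic's actual markup or the original implementation:

```python
# Hypothetical sketch: convert one article page to Markdown and save it.
# The XPath expressions are guesses at The Atlantic's page structure.
def html_to_markdown(html):
    selector = etree.HTML(html)
    title = selector.xpath('string(//h1)').strip()
    # Collect the text nodes of the article body paragraphs
    paragraphs = selector.xpath('//article//p//text()')
    body = '\n\n'.join(p.strip() for p in paragraphs if p.strip())
    return '# {}\n\n{}'.format(title, body)

for url in url_lists:
    page = get_MarkDown_by_url(url)
    if page:
        # Reuse the URL slug (second-to-last path segment) as the filename
        slug = url.rstrip('/').split('/')[-2]
        with open(slug + '.md', 'w', encoding='utf-8') as f:
            f.write(html_to_markdown(page))
```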