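Scraping Target
The page to scrape is "4900+ Fate Series HD Wallpapers | Desktop Backgrounds" on wall.alphacoders.com. Please use this site for learning purposes only.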
Analysis
Open the browser's developer tools (Inspect) and find the tag that contains the images.
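Ajax
By exchanging small amounts of data with the server in the background, Ajax lets a web page update asynchronously. This means one part of a page can be updated without reloading the whole page.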
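More images only appear as you scroll down, so the page fetches them with additional requests. Comparing two of those request URLs:
https://wall.alphacoders.com/by_collection.php?id=615&lang=Chinese&quickload=4969&page=2
https://wall.alphacoders.com/by_collection.php?id=615&lang=Chinese&quickload=4969&page=3
shows that only page changes: it is the page number.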
import requests
from urllib.parse import urlencode

# Request headers that mimic the browser's Ajax request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Host': 'wall.alphacoders.com'
}

def get_image(page):
    # Query parameters taken from the observed request URL; only page changes.
    parms = {
        'id': '615',
        'lang': 'Chinese',
        'quickload': '4969',
        'page': page
    }
    bask_url = 'https://wall.alphacoders.com/by_collection.php?'
    Z_url = bask_url + urlencode(parms)
    res = requests.get(url=Z_url, headers=headers).text
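The URL analysis above can be checked offline with the standard library. The sketch below rebuilds the request URL the same way `get_image` does and compares the query strings of the two observed URLs. The helper names `build_url` and `query_params` are hypothetical and not part of the original code.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

BASE_URL = 'https://wall.alphacoders.com/by_collection.php?'

def build_url(page):
    """Rebuild the request URL for one page (hypothetical helper)."""
    params = {'id': '615', 'lang': 'Chinese', 'quickload': '4969', 'page': page}
    return BASE_URL + urlencode(params)

def query_params(url):
    """Return the query string of a URL as a plain dict (hypothetical helper)."""
    return dict(parse_qsl(urlsplit(url).query))

observed_page2 = 'https://wall.alphacoders.com/by_collection.php?id=615&lang=Chinese&quickload=4969&page=2'
observed_page3 = 'https://wall.alphacoders.com/by_collection.php?id=615&lang=Chinese&quickload=4969&page=3'

# The rebuilt URL should match what the browser requested.
print(build_url(2) == observed_page2)

# Keep only the parameters whose values differ between the two requests.
p2, p3 = query_params(observed_page2), query_params(observed_page3)
changed = {key: (p2[key], p3[key]) for key in p2 if p2[key] != p3.get(key)}
print(changed)
```

If the two observed URLs differed in more than `page`, `changed` would list every differing parameter, which is a quick way to spot which values need to vary per request.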
Next, get the src and alt attribute values of each image using XPath.
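// | Selects matching nodes anywhere in the document below the current node, regardless of their position |
@ | Selects an attribute value |
/ | Selects from the root node; between steps, selects a direct child |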
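To make the three expressions concrete, here is a small sketch that runs them against a hand-written HTML snippet with `lxml`. The snippet and its `example.com` URLs are made up for illustration and do not reflect the real page's markup.

```python
from lxml import etree

# A tiny hand-written document standing in for the real page (assumption).
html = """
<html><body>
  <div id="page">
    <div class="thumb"><picture><img src="https://example.com/a.jpg" alt="Saber"></picture></div>
    <div class="thumb"><picture><img src="https://example.com/b.jpg" alt="Archer"></picture></div>
    <img src="https://example.com/logo.png" alt="logo">
  </div>
</body></html>
"""
tree = etree.HTML(html)

# // : every <img> anywhere in the document, at any depth.
all_srcs = tree.xpath('//img/@src')

# // twice: any <picture>, then any <img> below it, at any depth.
picture_srcs = tree.xpath('//picture//img/@src')
picture_alts = tree.xpath('//picture//img/@alt')

# / : an absolute path from the root, following direct children only.
direct_alts = tree.xpath('/html/body/div/img/@alt')

print(all_srcs)
print(picture_srcs)
print(picture_alts)
print(direct_alts)
```

Only the logo image is a direct child of the outer `<div>`, so the `/` path returns it alone, while `//picture//img` skips it because it is not inside a `<picture>`.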
    # Continuing inside get_image(page); requires: from lxml import etree
    res = requests.get(url=Z_url, headers=headers).text
    tree = etree.HTML(res)
    urls = tree.xpath('//picture//img/@src')
    titles = tree.xpath('//picture//img/@alt')
    for url, title in zip(urls, titles):
        print(f"{title} - {url}")
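One caveat with collecting `src` and `alt` in two separate XPath queries is that `zip` pairs them purely by position, so a single `<img>` without an `alt` would shift every later title onto the wrong URL. A possible alternative is to select the `<img>` elements themselves and read both attributes from each one. The function name `extract_images` and the sample HTML are illustrative assumptions, not part of the original code.

```python
from lxml import etree

def extract_images(html):
    """Return (title, url) pairs, one per <img> found inside a <picture>."""
    tree = etree.HTML(html)
    pairs = []
    for img in tree.xpath('//picture//img'):
        url = img.get('src')
        if not url:
            continue  # skip images without a usable source
        # A missing alt becomes an empty title instead of shifting later pairs.
        pairs.append((img.get('alt', ''), url))
    return pairs

# Hand-written sample (assumption): the second image has no alt attribute.
sample = """
<picture><img src="https://example.com/a.jpg" alt="Saber"></picture>
<picture><img src="https://example.com/b.jpg"></picture>
<picture><img src="https://example.com/c.jpg" alt="Rin"></picture>
"""

for title, url in extract_images(sample):
    print(f"{title} - {url}")
```

With the two-query approach the same sample would yield three URLs but only two titles, and `zip` would pair "Rin" with `b.jpg` and drop `c.jpg`.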
This chapter only covers how to locate the images with XPath.
Complete Code
from lxml import etree
import requests
from urllib.parse import urlencode

# Request headers that mimic the browser's Ajax request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Host': 'wall.alphacoders.com'
}


def get_image(page):
    # Query parameters taken from the observed request URL; only page changes.
    parms = {
        'id': '615',
        'lang': 'Chinese',
        'quickload': '4969',
        'page': page
    }
    bask_url = 'https://wall.alphacoders.com/by_collection.php?'
    Z_url = bask_url + urlencode(parms)
    res = requests.get(url=Z_url, headers=headers).text
    # Parse the response and read src and alt from every <img> inside <picture>.
    tree = etree.HTML(res)
    urls = tree.xpath('//picture//img/@src')
    titles = tree.xpath('//picture//img/@alt')
    for url, title in zip(urls, titles):
        print(f"{title} - {url}")


if __name__ == '__main__':
    # Fetch pages 1 through 10.
    for page in range(1, 11):
        get_image(page)