Press F12 to open the developer tools.
Press Ctrl+U to view the page source.
XPath cannot extract anything that is not present in the page source. For example, run the following code:
import time
from lxml import etree
import requests

def open_url(s):
    url = "https://music.163.com/#/artist?id=2116"
    res = requests.get(url)
    time.sleep(1)
    r = res.text
    selector = etree.HTML(r)
    # /text() extracts the text content
    # x = "(//span[@class='txt']//b)[{}]/text()".format(str(s))  # output: ['${soil(x.name)}']
    # @ extracts an attribute
    x = "(//span[@class='txt']//a)[{}]/@href".format(str(s))  # output: ['/song?id=${x.id}']
    a = selector.xpath(x)
    print(a)

open_url(1)
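The output above is a template placeholder rather than a real link, which suggests the downloaded source still contains unrendered template text that the browser fills in later with JavaScript. Below is a minimal sketch of that effect using only the standard library. The `STATIC_SOURCE` markup and the `extract_hrefs` helper are hypothetical stand-ins for illustration, not the actual page or the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the static page source: the link is still
# an unrendered template, because no JavaScript has run on it.
STATIC_SOURCE = """
<html><body>
  <span class="txt"><a href="/song?id=${x.id}"><b>${soil(x.name)}</b></a></span>
</body></html>
""".strip()

def extract_hrefs(markup):
    """Return the href of every <a> found inside <span class="txt">."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//span[@class='txt']//a")]

hrefs = extract_hrefs(STATIC_SOURCE)
print(hrefs)
```

The query can only return what the parser was given, so the placeholder comes back verbatim.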
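Note the "#" in the URL: the browser address bar shows it, but the Request URL in the developer tools does not. Everything after "#" is a URL fragment, which is never sent to the server, so requesting the "#" address does not fetch the artist page. I suspect the developers did this on purpose.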
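As a rough illustration of the "#" issue, the sketch below splits the address-bar URL with `urllib.parse` and rebuilds the URL without the fragment marker. The helper name `fragment_to_request_url` is hypothetical, and the assumption that the fragment maps directly onto the request path is based only on the two URLs used in these notes.

```python
from urllib.parse import urlsplit, urlunsplit

def fragment_to_request_url(browser_url):
    """Turn a 'host/#/path?query' address into 'host/path?query'."""
    parts = urlsplit(browser_url)
    # The fragment stays in the browser and is not part of the HTTP request.
    inner = urlsplit(parts.fragment)
    return urlunsplit((parts.scheme, parts.netloc, inner.path, inner.query, ""))

browser_url = "https://music.163.com/#/artist?id=2116"
fragment = urlsplit(browser_url).fragment
request_url = fragment_to_request_url(browser_url)
print(fragment)
print(request_url)
```

Passing the rebuilt URL to `requests.get` targets the artist page itself instead of the site root.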
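Comment out 'accept-encoding': 'gzip, deflate, br'; otherwise the response text comes back garbled. This is most likely because requests cannot decode Brotli ("br") responses unless a Brotli package is installed.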
import requests
from bs4 import BeautifulSoup

url = "https://music.163.com/artist?id=2116"
headers = {
    # Copied from the developer tools. The first four entries look like HTTP/2
    # pseudo-headers (:authority, :method, :path, :scheme) and are probably not needed.
    'authority': 'music.163.com',
    'method': 'GET',
    'path': '/artist?id=2116',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    # 'accept-encoding': 'gzip,deflate,br',
    'accept-language': 'zh-CN,zh;q=0.9',