Scraping target
Tools used
Language: Python
Parsing library: lxml
Query language: XPath
XPath background
Scraping the data
The JSON format we ultimately want:
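This post leans on three XPath features: predicates with contains(), selecting an attribute with @href, and stepping back up the tree with the parent:: axis. Below is a minimal, self-contained sketch of all three; the HTML fragment is invented purely for illustration and only mimics the shape of the target pages.

from lxml import etree

# Toy HTML, invented to demonstrate the XPath constructs used later.
doc = etree.HTML("""
<div class="column col-half extra">
  <ul>
    <li class="format-standard"><a href="/cmd/ls">ls</a></li>
    <li class="format-standard"><a href="/cmd/cd">cd</a></li>
  </ul>
</div>
<p><strong>label:</strong> value after the label</p>
""")

# contains() still matches when the class attribute has extra classes appended.
print(doc.xpath("//div[contains(@class,'column col-half')]//a/@href"))
# -> ['/cmd/ls', '/cmd/cd']

# parent::node() steps from the <strong> label up to the enclosing <p>,
# whose text() nodes hold the value that follows the label.
print(doc.xpath('//p/strong[contains(text(),"label")]/parent::node()/text()'))
# -> [' value after the label']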
{
    "name":,
    "usage":,
    "params": [
        {"param":, "content":}
    ]
}
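For instance, a record for an ls sub-page might look like this; the values below are invented for illustration, not actual scraped output:

{
    "name": "ls",
    "usage": "语法格式:ls [参数] [文件]",
    "params": [
        {"param": "-a", "content": "显示所有文件及目录"}
    ]
}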
The homepage of the target site lists all the commands, so the first step is to collect the URL of each command's sub-page.
Pasting the page source into xpather makes it easy to test whether an XPath expression is correct.
The XPath that extracts the sub-page URLs:
//div[contains(@class,'column col-half')]/ul/li[@class='format-standard']/a/@href
Extracting the data from the sub-pages
usage
//p/strong[contains(text(),"语法格式")]/parent::node()/text()
params
//article//table//td
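The table cells come back from xpath() as one flat list that alternates parameter, description, parameter, description. One way to pair them up is to zip the even-indexed slice with the odd-indexed slice; this sketch uses an invented list and is equivalent to the index arithmetic in the full script below.

# Flat td text as xpath() would return it (values invented for illustration).
params = ["-a", "show all entries", "-l", "long listing"]

# Zipping the even- and odd-indexed slices pairs each param with its description.
pairs = [{"param": p.strip(), "content": c.strip()}
         for p, c in zip(params[::2], params[1::2])]
print(pairs)
# -> [{'param': '-a', 'content': 'show all entries'}, {'param': '-l', 'content': 'long listing'}]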
Python code
import requests
from lxml import etree

data = []
html = requests.get("https://www.linuxcool.com/").text
content = etree.HTML(html)
# Grab every command sub-page URL from the homepage list;
# the last two links are not command pages, so drop them.
urls = content.xpath("//div[contains(@class,'column col-half')]"
                     "/ul/li[@class='format-standard']/a/@href")[:-2]

def deal_suburl(it, url):
    it['params'] = []
    html = requests.get(url).text
    content = etree.HTML(html)
    # The usage line sits next to a <strong>语法格式</strong> label;
    # join the text nodes into a single string.
    usage = content.xpath('//p/strong[contains(text(),"语法格式")]/parent::node()/text()')
    it['usage'] = ''.join(usage).strip()
    # Table cells alternate: parameter, description, parameter, ...
    params = content.xpath('//article//table//td/text()')
    for index in range(len(params) // 2):
        tmp = {}
        tmp['param'] = params[2 * index].strip()
        tmp['content'] = params[2 * index + 1].strip()
        it['params'].append(tmp)

for url in urls:
    it = {}
    # The last path segment of the sub-page URL is the command name;
    # rstrip guards against a trailing slash.
    it['name'] = url.rstrip('/').split('/')[-1]
    deal_suburl(it, url)
    data.append(it)
Writing to a file
import json

file_name = 'data.json'
# ensure_ascii=False keeps the Chinese text readable in the output file.
with open(file_name, 'w', encoding='UTF-8') as f:
    f.write(json.dumps(data, ensure_ascii=False))
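A quick way to sanity-check the result is to load the file back and inspect one record, assuming data.json was just written by the script above:

import json

with open('data.json', encoding='UTF-8') as f:
    loaded = json.load(f)

print(len(loaded), "commands scraped")
print(loaded[0])  # spot-check the first record against the target schema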