Writing a web crawler in Python - using XPath instead of regular expressions

To use XPath in place of regular expressions, you first have to learn XPath syntax.
XPath syntax reference
A couple of problems I ran into with XPath (both illustrated in the sketch below):
1. At first I couldn't extract the content of an a tag; after looking it up, it turns out you need to append /text() to the expression.
2. To select elements counting from the end, use a [last()-n] predicate.
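
For illustration, here is a minimal sketch of both points with lxml (the HTML fragment is made up just for this demo):

from lxml import etree

# a made-up HTML fragment just for demonstration
doc = etree.HTML("""
<div id="links">
  <a href="/a">first</a>
  <a href="/b">second</a>
  <a href="/c">third</a>
</div>
""")

# without /text() you get Element objects back...
print(doc.xpath('//div[@id="links"]/a'))
# ...with /text() you get the link text itself
print(doc.xpath('//div[@id="links"]/a/text()'))            # ['first', 'second', 'third']

# [last()] is the last <a>, [last()-1] the second to last, and so on
print(doc.xpath('//div[@id="links"]/a[last()-1]/text()'))  # ['second']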

Today I also tried out the requests module and found it quite handy.
If Chinese comments cause an encoding error, adding # -*- coding: gbk -*- at the top of the file fixes it.
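
As a quick illustration of why requests feels handier than urllib, here is a minimal GET sketch (example.com is just a placeholder):

import requests

# one call gives you the status code, headers, and decoded body
response = requests.get("http://example.com", timeout=10)
print(response.status_code)   # e.g. 200
print(response.text[:100])    # first 100 characters of the decoded HTML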

Here is the project source, reworked today to use XPath:
test2.py

import re
import requests

def download(url):
    print("downloading: " + url)
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    try:
        content = requests.get(url, headers=headers)
    except requests.exceptions.RequestException as e:
        # requests raises its own exception types, not urllib's URLError
        print("download error: " + str(e))
        content = None
    return content

def save(file_name, file_content):
    print("saving.......")
    # the with-statement closes the file automatically; no explicit close needed
    with open(file_name + ".html", "wb") as f:
        f.write(file_content.content)

murl = "http://blog.csdn.net/Joliph/article/list/1"
html = download(murl)
if html is not None:  # skip saving if the download failed
    save(re.split('/', murl)[-1], html)
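
One caveat on the error handling above: requests does not raise an exception for HTTP error responses such as 404, so the try/except in download() only catches network-level failures. If bad status codes should count as errors too, raise_for_status() can be added inside the try block (a sketch, not part of the original script):

try:
    content = requests.get(url, headers=headers)
    content.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
except requests.exceptions.RequestException as e:
    print("download error: " + str(e))
    content = None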

budejie2.py

# -*- coding: gbk -*-
import re
import time
import urllib.request
import requests
from lxml import etree
from multiprocessing.dummy import Pool as threadpool  # not used below; see the thread-pool sketch after this script

def wp():
    print("plz input 1~50")
    page = input("which budejie page you want to download?(1~50):")
    page = int(page)
    # re-prompt until the page number is in range (the original only checked the upper bound)
    while page > 50 or page < 1:
        print("plz input 1~50")
        page = input("which budejie page you want to download?(1~50):")
        page = int(page)
    return page

pg = wp()
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "http://www.budejie.com/" + str(pg)

html = requests.get(url, headers=headers).text
selector = etree.HTML(html)

mp4url = []
start_time = time.time()
# match every video-download link and collect its href attribute
mp4list = selector.xpath('//*[@class="j-r-list-tool-l-down f-tar j-down-video j-down-hide ipad-hide"]/a')
for i in mp4list:
    mp4url.append(str(i.xpath('@href')[0]))
end_time = time.time()
print("matching took: " + str(end_time - start_time) + " seconds")

for i in mp4url:
    print("downloading: " + i)
    filename = re.split('/', i)[-1]
    urllib.request.urlretrieve(i, filename)
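
The Pool import at the top of budejie2.py goes unused as written; here is a minimal sketch of how it could drive the same download loop with worker threads (the function name fetch and the pool size of 4 are my own choices, not from the original):

def fetch(i):
    print("downloading: " + i)
    filename = re.split('/', i)[-1]
    urllib.request.urlretrieve(i, filename)

pool = threadpool(4)     # multiprocessing.dummy.Pool is a pool of threads
pool.map(fetch, mp4url)  # download every URL concurrently
pool.close()
pool.join()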

csdn2.py

# -*- coding: gbk -*-
import requests
from lxml import etree

def download(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"}
    html = requests.get(url, headers=headers).text
    return html

html2 = download("http://blog.csdn.net/Joliph")
selector2 = etree.HTML(html2)
# read the page count from the third-from-last link in the pager
pagelist = selector2.xpath('//*[@id="papelist"]/a[last()-2]/text()')[0]
# potential problem: once the blog passes 5 pages the pager shows "..." and
# the page count can no longer be read off this way
pagelist = int(pagelist)
for page in range(1, pagelist + 1):
    url = "http://blog.csdn.net/Joliph/article/list/" + str(page)
    html = download(url)
    selector = etree.HTML(html)
    titlelist = selector.xpath('//*[@class="list_c_t"]/a/text()')
    yearlist = selector.xpath('//*[@class="date_t"]/span/text()')
    monthlist = selector.xpath('//*[@class="date_t"]/em/text()')
    daylist = selector.xpath('//*[@class="date_b"]/text()')
    # the trailing /text() is the trick!!!
    number = len(titlelist)
    for i in range(number):
        print(yearlist[i] + "." + monthlist[i] + "." + daylist[i] + "----" + titlelist[i])
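
One way around the "..." pager problem noted above: instead of parsing the page count out of the pager, keep fetching pages until one comes back empty. A sketch, under the assumption that an out-of-range list page simply yields no titles:

page = 0
while True:
    page += 1
    html = download("http://blog.csdn.net/Joliph/article/list/" + str(page))
    selector = etree.HTML(html)
    titlelist = selector.xpath('//*[@class="list_c_t"]/a/text()')
    if not titlelist:  # no posts on this page: we ran past the last page
        break
    for title in titlelist:
        print(title)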