A Small XPath Crawler Example

Tags: python, crawler, xpath

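I have only just started with web crawlers, so I immediately used XPath to scrape two things I have been following closely: Android 6.0 ROM news for my phone, and the day's new threads on a forum. This is probably my first real crawler, and it is very short.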

  • OS: Windows 10 Ultimate
  • Runtime: Python 2.7.10 + PyCharm 5.0.1
  • Functionality: a targeted crawler
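
A targeted crawler boils down to parsing a page and pulling out a few known fields with XPath. As a quick offline check of that idea, here is a minimal sketch that runs without any network access. The markup in `SAMPLE_HTML` and the helper `extract_files` are hypothetical, made up for illustration and only loosely modeled on the class names used in the script below; the real page may look different. It is written so that it also runs on Python 3.

```python
from lxml import etree

# Hypothetical markup for illustration only; not real site data.
SAMPLE_HTML = """
<html><body>
<ul class="list-group files">
  <li><h4><a href="/a">rom-a.zip</a></h4>
      <p><span class="date">Jan 01</span><span class="info">700MB</span></p></li>
  <li><h4><a href="/b">rom-b.zip</a></h4>
      <p><span class="date">Jan 02</span><span class="info">710MB</span></p></li>
</ul>
</body></html>
"""

def extract_files(html_text):
    """Return (name, date, size) tuples, one per <li> in the file list."""
    selector = etree.HTML(html_text)
    rows = []
    # Select each list item first, then query fields relative to that item,
    # so the three values always come from the same entry.
    for li in selector.xpath('//*[@class="list-group files"]/li'):
        name = li.xpath('h4/a/text()')
        date = li.xpath('p/span[@class="date"]/text()')
        size = li.xpath('p/span[@class="info"]/text()')
        # Skip items missing any field instead of raising IndexError.
        if name and date and size:
            rows.append((name[0].strip(), date[0].strip(), size[0].strip()))
    return rows

if __name__ == '__main__':
    for row in extract_files(SAMPLE_HTML):
        print(' | '.join(row))
```

Querying relative to each `li` is a design choice of this sketch: it avoids the misalignment that can happen when names, dates and sizes are collected as three separate page-wide lists.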
#-*-coding:utf-8-*-
from lxml import etree
import requests

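def spider_ROM(url):
    # Fetch the search results page and parse it into an element tree.
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Absolute-path alternative, kept for reference:
    # name = selector.xpath('//body/div[2]/div[3]/div/div[2]/div/ul/li/h4/a/text()')
    name = selector.xpath('//*[@class="list-group files"]/li/h4/a/text()')
    size = selector.xpath('//*[@class="list-group files"]/li/p/span[@class="info"]/text()')
    date = selector.xpath('//*[@class="list-group files"]/li/p/span[@class="date"]/text()')
    # zip() pairs the three parallel lists and stops at the shortest one,
    # so a missing field cannot raise IndexError.
    rows = zip(name, date, size)
    for i, (each, day, file_size) in enumerate(rows):
        print each
        print day + ' ',
        if i == len(rows) - 1:
            # No extra blank line after the last entry.
            print file_size
        else:
            print file_size + '\n'
    print url + '\n'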

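def spider_Jifeng(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Each thread row sits under an element whose id starts with "normalthread".
    content_field = selector.xpath('//*[starts-with(@id,"normalthread")]/tr')
    for each in content_field:
        # The category tag is sometimes wrapped in an extra <font> element.
        pre_title = each.xpath('th/em/a/text()')
        if pre_title:
            pre_title = pre_title[0]
        else:
            pre_title = each.xpath('th/em/a/font/text()')[0]
        title = each.xpath('th/a/text()')[0]
        # The post time may or may not be wrapped in a <font> element.
        time = each.xpath('td[2]/em/span/font/span/text()')
        if time:
            time = time[0]
        else:
            time = each.xpath('td[2]/em/span/span/text()')[0]
        # Show only today's posts: stop at the first row whose date is colored #0000FF.
        # Rows without a <font> element yield an empty list, so check before indexing.
        color = each.xpath('td[2]/em/span/font/@color')
        if color and color[0] == '#0000FF':
            break
        print time.replace(u'\xa0', u'') +'   ',u'【'+pre_title+u'】',title.replace(' ','')
    print url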

if __name__ == '__main__':
    url_rom = 'https://www.androidfilehost.com/?w=search&s=d802'
    url_jifeng = 'http://bbs.gfan.com/forum.php?mod=forumdisplay&fid=1345&filter=author&orderby=dateline'
    print ''
    print ''
    print u'                   |【LG G2 Android 6.0 ROM updates】|'
    spider_ROM(url_rom)
    print ''
    print u"                 |【Gfan forum LG G2: today's new threads】|"
    spider_Jifeng(url_jifeng)
    print ''
    print ''
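
The forum function above leans on two ideas: `starts-with()` to pick thread rows by id prefix, and falling back to a second XPath when the first one returns nothing. Below is a minimal sketch of both on inline markup, so it can be tried offline. The HTML in `SAMPLE_FORUM` and the helpers `first_text` and `parse_threads` are hypothetical names of my own, and the sample layout is simplified rather than copied from the real forum page. It is written so that it also runs on Python 3.

```python
from lxml import etree

# Hypothetical forum markup for illustration only; the real layout may differ.
SAMPLE_FORUM = u"""
<table>
  <tbody id="normalthread_101"><tr>
    <th><em><a>ROM</a></em><a class="title">First thread</a></th>
    <td><em><span>2015-12-01\u00a0</span></em></td>
  </tr></tbody>
  <tbody id="normalthread_102"><tr>
    <th><em><a><font>Help</font></a></em><a class="title">Second thread</a></th>
    <td><em><span><font>2015-12-02</font></span></em></td>
  </tr></tbody>
  <tbody id="separatorline"><tr><th><a class="title">ignored</a></th></tr></tbody>
</table>
"""

def first_text(node, *paths):
    """Return the first text hit among several candidate XPaths, or None."""
    for path in paths:
        hits = node.xpath(path)
        if hits:
            # Drop non-breaking spaces and surrounding whitespace.
            return hits[0].replace(u'\xa0', u'').strip()
    return None

def parse_threads(html_text):
    """Return (tag, title, posted) tuples for rows whose id starts with 'normalthread'."""
    selector = etree.HTML(html_text)
    threads = []
    for row in selector.xpath('//*[starts-with(@id, "normalthread")]/tr'):
        # Each field lists its candidate paths in order of preference.
        tag = first_text(row, 'th/em/a/text()', 'th/em/a/font/text()')
        title = first_text(row, 'th/a/text()')
        posted = first_text(row, 'td/em/span/font/text()', 'td/em/span/text()')
        threads.append((tag, title, posted))
    return threads

if __name__ == '__main__':
    for tag, title, posted in parse_threads(SAMPLE_FORUM):
        print(u'{0}   [{1}] {2}'.format(posted, tag, title))
```

Returning `None` instead of indexing `[0]` directly means a row with an unexpected layout produces a visible gap in the output rather than an `IndexError` that stops the whole run.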