A Simple Web Crawler in Python

Scrapes a static page on Mtime (时光网) and follows the "next page" link to crawl multiple pages.

'''
Created on 2015-9-28

'''

from lxml import html
from time import sleep

# names of the male stars
names_xpath = "//strong[@class='px14']/a/text()"
# introduction text for each star
introductions_xpath = "//dd[@class='iinfo']/p[@class='mt6 c_666']/text()"
# href of the "next page" button
next_button_xpath = "//a[@id='key_nextpage']/@href"

names = []
introductions = []

base_url = 'http://movie.mtime.com/list/{}'
next_page = "http://movie.mtime.com/list/250.html"

while len(names) < 50 and next_page:
    print("Retrieving names from url: {}".format(next_page))

    dom = html.parse(next_page)
    names += dom.xpath(names_xpath)
    introductions += dom.xpath(introductions_xpath)

    # base_url.format() assumes the href is relative to base_url
    next_pages = dom.xpath(next_button_xpath)
    if next_pages:
        next_page = base_url.format(next_pages[0])
    else:
        print("No next button found")
        next_page = None
    sleep(3)  # pause between requests to go easy on the server
    
# write one line per entry: the name followed by its introduction
with open('information.txt', 'wb') as out:
    # zip stops at the shorter of the two lists
    for name, introduction in zip(names, introductions):
        out.write(name.encode('utf-8'))
        out.write(introduction.encode('utf-8'))
        out.write('\n'.encode('utf-8'))
print("WRITE DONE")

with open('information.txt') as f:
    informations = f.readlines()

print("Well, we got {} male stars!".format(len(informations)))
for information in informations:
    print(information)
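
The paging loop above can also be sketched without touching the network, which makes it easier to test. This is only a rough sketch under a few assumptions: `crawl`, `fetch`, `fake_fetch`, `FAKE_SITE` and the example.com URLs are all made up for illustration and are not part of the script above. It also swaps `base_url.format(...)` for the standard library's `urljoin`, which resolves the next-page href whether it is relative or absolute, and it trims the result to `limit`, which the script above does not do.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def crawl(start_url, fetch, limit=50):
    """Follow next-page links until `limit` items are collected.

    `fetch` is a hypothetical callable: fetch(url) -> (items, next_href),
    where next_href is None when the page has no next button.
    """
    items = []
    seen = set()  # guards against a next link that loops back
    url = start_url
    while url and url not in seen and len(items) < limit:
        seen.add(url)
        page_items, next_href = fetch(url)
        items.extend(page_items)
        # urljoin resolves the href against the current page URL
        url = urljoin(url, next_href) if next_href else None
    return items[:limit]


# A fake three-page site standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.com/list/250.html": (["a", "b"], "250-2.html"),
    "http://example.com/list/250-2.html": (["c"], "http://example.com/list/250-3.html"),
    "http://example.com/list/250-3.html": (["d"], None),
}


def fake_fetch(url):
    return FAKE_SITE[url]


print(crawl("http://example.com/list/250.html", fake_fetch))
```

In a real run, `fetch` would wrap the `html.parse` and `xpath` calls from the script and return the names together with the first next-button href, if any.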

