Python: getting the sub-links of a given URL

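Following the approach in http://blog.csdn.net/lming_08/article/details/44710779, this script fetches a given URL and extracts the sub-links of interest together with their descriptions.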

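#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import urllib2
import re

# Expect exactly one argument: the URL of the listing page.
if len(sys.argv) != 2:
    print "usage: %s url" % __file__
    sys.exit(1)

url = sys.argv[1]

# Send a browser-like User-Agent header with the request.
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
headers = {'User-Agent': user_agent}
# Sample of the markup being matched: an <a> tag followed on the next line by its <img> tag.
'''
<a href="http://faxian.smzdm.com/p/488573" target="_blank" onclick="ga('send', 'event','发现频道','列表_文章图片','488573_HITACHI 日立 CM-N1000 冷冻收缩毛孔多功能美容仪');" class="picBox">
<img src="http://ym.zdmimg.com/201503/29/5517b316c0c752738.jpg_d200.jpg" alt="HITACHI 日立 CM-N1000 冷冻收缩毛孔多功能美容仪" title=""  height=
'''
req = urllib2.Request(url, headers=headers)
try:
    html = urllib2.urlopen(req).read()

    # Match each <a ...> line together with the <img ...> line that follows it.
    pattern = re.compile(r"<a href=.* target=\"_blank\" onclick=.*\s?.*<img src=.*\.jpg\" alt=.*title=\"\".*height=")
    res_list = pattern.findall(html)

    # Compile the per-item patterns once, outside the loop.
    url_pat = re.compile(r"http://.*p/\d{6}")
    desc_pat = re.compile(r"alt=\"(.*)\" title")

    for content in res_list:
        link = url_pat.search(content).group()
        # Group 1 is the alt text alone; slicing the whole match with [5:-8]
        # would cut off the last character of the description.
        desc = desc_pat.search(content).group(1)

        # Remove all whitespace from the description before printing.
        print link, re.sub(r"\s+", "", desc)
except urllib2.URLError:
    # URLError also covers HTTPError, which is its subclass.
    print "failed to fetch url"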
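The extraction step can be tried offline against the sample markup quoted in the script. The following is a minimal sketch written for Python 3 (the script itself targets Python 2 and `urllib2`). The names `SAMPLE_HTML`, `ITEM_RE`, and `extract_items` are hypothetical, and the pattern is a tighter variant of the script's regular expressions that assumes every attribute value is double-quoted and that the `<img>` tag directly follows its `<a>` tag.

```python
import re

# Sample markup copied from the script; the <img> tag is truncated in the
# original post, so it is left truncated here.
SAMPLE_HTML = '''
<a href="http://faxian.smzdm.com/p/488573" target="_blank" onclick="ga('send', 'event','发现频道','列表_文章图片','488573_HITACHI 日立 CM-N1000 冷冻收缩毛孔多功能美容仪');" class="picBox">
<img src="http://ym.zdmimg.com/201503/29/5517b316c0c752738.jpg_d200.jpg" alt="HITACHI 日立 CM-N1000 冷冻收缩毛孔多功能美容仪" title=""  height=
'''

# Group 1: the product link. Group 2: the alt text of the image that follows.
# [^"]* stops at the closing quote, so no greedy backtracking or slicing is needed.
ITEM_RE = re.compile(
    r'<a href="(http://[^"]*/p/\d+)"[^>]*>\s*'
    r'<img src="[^"]*" alt="([^"]*)" title'
)


def extract_items(html):
    """Return (link, description) pairs, with whitespace removed from descriptions."""
    return [(link, re.sub(r"\s+", "", desc)) for link, desc in ITEM_RE.findall(html)]


if __name__ == "__main__":
    for link, desc in extract_items(SAMPLE_HTML):
        print(link, desc)
```

Using capture groups keeps the link and the description in one pass over the page, so the second pair of searches inside the loop is not needed. Whether this pattern still matches the live site is not something the sample alone can confirm.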

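Output of a run against a category page. In this run the last character of each description was cut off (for example "30m" rather than "30ml"), because the matched alt text was sliced one character too short: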

lming_08@ubuntu:~/MyWorkSpace/Pycode/htmlparse$ python get_smzdm_productinfo.py http://faxian.smzdm.com/fenlei/nvshixiangshui
http://faxian.smzdm.com/p/487641 TOMMYHILFIGER都市新贵女士EDT淡香水30m
http://faxian.smzdm.com/p/487231 GUERLAIN娇兰AquaAllegoria花草水语系列橙花伊甸园女士淡香
http://faxian.smzdm.com/p/482913 山东福利:LANCOME兰蔻珍爱爱恋女士香水30m
http://faxian.smzdm.com/p/479941 SalvatoreFerragamo菲拉格慕仲夏之梦淡香水喷雾100ml/3.4o
http://faxian.smzdm.com/p/478681 VIVIENNEWESTWOODBoudoir密室女士香水(50ml
http://faxian.smzdm.com/p/478055 SwissArmyMountainWater香
http://faxian.smzdm.com/p/475269 BURBERRY博柏利周末香水DEP50m
http://faxian.smzdm.com/p/473353 MOSCHINO雾仙浓奥莉芙娃娃淡香水4.9m
http://faxian.smzdm.com/p/472327 GALIMARD加利马尔蓝色妖姬绽放夏日限量版30m
http://faxian.smzdm.com/p/471217 Dior迪奥真我淡香水50m
http://faxian.smzdm.com/p/470015 BVLGARI宝格丽淡香水喷雾100m
http://faxian.smzdm.com/p/469435 ANNASUI安娜苏幻境绮缘女士持久淡香水50m
http://faxian.smzdm.com/p/468123 CalvinKlein卡文克莱因为你女用淡香水100ml(简装
http://faxian.smzdm.com/p/467927 BURBERRY博柏利body肌体香水喷雾35M
http://faxian.smzdm.com/p/467535 SalvatoreFerragamo菲拉格慕闪耀光采淡香水喷雾100m
http://faxian.smzdm.com/p/467391 SalvatoreFerragamo菲拉格慕花水时刻淡香水喷雾100m
http://faxian.smzdm.com/p/464821 BURBERRY博柏利周末香水喷雾50m
http://faxian.smzdm.com/p/462473 Annasui安娜苏摇滚心情淡香水喷雾50m
http://faxian.smzdm.com/p/461755 LANVIN浪凡我愿意女士香水4.5m
http://faxian.smzdm.com/p/461189 Lanvin浪凡光韵女士香水5m
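
The descriptions above contain no spaces because the script deletes every whitespace character, so an alt text such as "HITACHI 日立 CM-N1000 ..." is printed as one unbroken string. If readable spacing is preferred, one option is to collapse each run of whitespace to a single space instead. A minimal Python 3 sketch of that variant follows; `squeeze_spaces` is a hypothetical helper name, and the input string is the sample alt text from the script with extra whitespace added for illustration.

```python
import re


def squeeze_spaces(text):
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Sample alt text with a doubled space, a newline, and a trailing space added.
    desc = "HITACHI 日立  CM-N1000\n冷冻收缩毛孔多功能美容仪 "
    print(re.sub(r"\s+", "", desc))   # what the script does: remove all whitespace
    print(squeeze_spaces(desc))       # alternative: keep single spaces between words
```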

 
