Subdomain Enumeration (Python)

Tutorial

#python 2
#-*-coding:utf-8-*-
import requests
import re

key='qq.com'

sites=[]

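# Capture the text between the result link's style attribute and the first "/"
match='style="text-decoration:none;">(.*?)/'

for page in range(48):
  pn=page*10   # pn is the result offset, advanced in steps of 10
  url="http://www.baidu.com.cn/s?wd=site:"+key+"&cl=3&pn=%s" % pn
  response=requests.get(url).content
  subdomains=re.findall(match,response)
  sites += subdomains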

site=list(set(sites))   # set() removes duplicates
#print site
print "The number of sites is %d" % len(site)

for i in site:
  print i
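
To see what the pattern actually captures, here is a minimal offline sketch in Python 3. The HTML fragment is made up for illustration and only imitates the markup the pattern targets; real Baidu result pages may look different.

```python
import re

# Hypothetical fragment imitating the markup the pattern targets;
# real Baidu result pages may differ.
sample_html = (
    '<a href="#" style="text-decoration:none;">mail.qq.com/ </a>'
    '<a href="#" style="text-decoration:none;">v.qq.com/x/page </a>'
    '<a href="#" style="text-decoration:none;">mail.qq.com/cgi-bin </a>'
)

match = 'style="text-decoration:none;">(.*?)/'

hosts = re.findall(match, sample_html)   # one capture per result link
unique_hosts = sorted(set(hosts))        # set() drops duplicates, sorted() fixes the order

print(hosts)
print(unique_hosts)
```

The lazy group `(.*?)` stops at the first "/", so only the host part of each displayed URL is kept.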

In Python 3, the data returned by get is bytes, so it has to be decoded to a string before the regex is applied.

#python 3
import requests
import re

key='qq.com'
sites=[]

match='style="text-decoration:none;">(.*?)/'

for page in range(48):
  pn=page*10   # pn is the result offset, advanced in steps of 10
  url="http://www.baidu.com.cn/s?wd=site:"+key+"&cl=3&pn=%s" % pn
  response=requests.get(url).content   # bytes in Python 3
  subdomains=re.findall(match,response.decode('utf8'))
  sites += subdomains

site=list(set(sites))   # set() removes duplicates
#print site
print("The number of sites is %d" % len(site))

for i in site:          
  print(i)
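
The reason for the decode step can be checked without any network access. The sketch below uses a made-up bytes payload in place of `requests.get(url).content`; the payload and the variable names are assumptions for illustration only.

```python
import re

match = 'style="text-decoration:none;">(.*?)/'

# Hypothetical bytes payload standing in for requests.get(url).content
raw = '<a style="text-decoration:none;">mail.qq.com/ 邮箱</a>'.encode('utf8')

error = None
try:
    re.findall(match, raw)            # str pattern applied to bytes
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# Decoding first gives the regex a str to work on
subdomains = re.findall(match, raw.decode('utf8'))
print(subdomains)
```

Python 3 refuses to mix a str pattern with a bytes subject, which is why the Python 2 script fails unchanged.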

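Lately Baidu has become stricter: a request needs a disguise (a browser User-Agent header) before any domains can be scraped. Here is the workaround.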

#python 3
import requests
import re

key='jd.com'
sites=[]
head = {'User-Agent': \
            'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'}

match = 'style="text-decoration:none;">(.*?)</b>'

for i in range(20):
  # pn is the result offset; without it every iteration fetches the same first page
  url="https://www.baidu.com/s?ie=UTF-8&wd=inurl%3A"+key+"&pn=%d" % (i*10)
  response=requests.get(url,headers=head).content
  subdomains=re.findall(match,response.decode('utf8'))
  print(subdomains)
  sites += list(subdomains)

site=list(set(sites))   # set() removes duplicates
#print site
print("The number of sites is %d" % len(site))

for i in site:          
  print(i)
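
The query string can also be assembled with the standard library instead of by hand. This is only a sketch: `build_search_url` and `page_size` are hypothetical names, and it assumes that `pn` is a result offset advancing by 10 per page, as in the first two scripts.

```python
from urllib.parse import urlencode

def build_search_url(key, page, page_size=10):
    """Return a Baidu search URL for one result page (hypothetical helper)."""
    params = {
        'ie': 'UTF-8',
        'wd': 'inurl:' + key,      # urlencode escapes ':' as %3A
        'pn': page * page_size,    # assumed result offset: 0, 10, 20, ...
    }
    return 'https://www.baidu.com/s?' + urlencode(params)

urls = [build_search_url('jd.com', page) for page in range(3)]
for u in urls:
    print(u)
```

Each URL can then be passed to `requests.get(url, headers=head)` exactly as in the script above.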
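
A revised version: it strips the <b> tags from each match and keeps only the entries that contain the main domain.

#python 3
import requests
import re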
 
head = {'User-Agent': \
            'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'}
key = 'jd.com'  # put the main domain here
lst = []
 
match = 'style="text-decoration:none;">(.*?)</b>'
 
for i in range(20):  # result pages 1-20
    # pn is the result offset, so it advances in steps of 10 as in the scripts above
    url = "https://www.baidu.com/s?wd=inurl:{}&pn={}&oq={}&ie=utf-8".format(key, i * 10, key)
    print(url)
    # response = requests.get(url,headers=head,cookies = cook).content
    response = requests.get(url, headers=head).content
    subdomains = re.findall(match, response.decode())
    for j in subdomains:
        j = j.replace('<b>', '')
        if key in j:
            if j not in lst:
                lst.append(j)
                # print(lst)
print(lst)
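
The clean-up logic in the inner loop can be pulled into a small function and tried on sample data. This is a sketch rather than the author's code: `collect_hosts` is a hypothetical name, the sample captures are made up, and stripping `</b>` and surrounding whitespace is an addition beyond what the script does.

```python
def collect_hosts(raw_matches, key):
    """Strip <b> tags, keep entries containing key, drop duplicates in order."""
    hosts = []
    for item in raw_matches:
        host = item.replace('<b>', '').replace('</b>', '').strip()
        # Substring check, as in the script above; it would also accept
        # unrelated names such as "notjd.com".
        if key in host and host not in hosts:
            hosts.append(host)
    return hosts

# Hypothetical regex captures
sample = ['<b>item.jd.com</b>', 'www.<b>jd.com', 'example.org', 'item.jd.com ']
result = collect_hosts(sample, 'jd.com')
print(result)
```

Unlike `list(set(...))`, appending to a list only when the entry is new keeps the hosts in the order they were found.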