A data-crawling solution: open every page + look up items one by one

guimaster

Category: Crawlers. Tags: python-crawler

Article link: https://blog.csdn.net/u011521609/article/details/68062676

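When crawling a certain website, I first collected the ids of the targets and then built the detail-page URLs from them. On a second run, however, the URLs built from those ids had already changed. Ways to deal with this:

First, open every detail page directly from each listing page, so the site has no time to change the ids.

Second, after opening a listing page, store each target page URL in a list. Later steps are more efficient this way, and in essence it is the same as the first approach.

Third, crawl all the community names first, then search for them one at a time with selenium. This is probably the most powerful method, but it is slow. It works better to crawl the communities with the first two approaches and use this one only for whatever is left over; in other words, combine them.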
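
The first two approaches can be sketched roughly as follows. This is only a minimal sketch, not the code used for the real site: `collect_detail_urls`, `crawl_listing`, the `fetch` parameter, and the sample HTML are all hypothetical, and a fake fetcher stands in for the HTTP call so the example runs offline. In a real run, `fetch` would presumably wrap something like `requests.get(url).text`.

```python
import re

def collect_detail_urls(listing_html):
    """Return every href found inside a <div class="title"> block, in page order."""
    urls = []
    for block in re.findall(r'<div class="title">(.*?)</div>', listing_html, re.S):
        urls.extend(re.findall(r'href="(.*?)"', block))
    return urls

def crawl_listing(listing_html, fetch):
    """Store the detail URLs of one listing page in a list, then fetch each one
    right away, before the site has a chance to change them."""
    detail_urls = collect_detail_urls(listing_html)
    pages = {}
    failed = []
    for url in detail_urls:
        try:
            pages[url] = fetch(url)
        except Exception:
            failed.append(url)  # leftovers for the slower search-based approach
    return pages, failed

# Made-up listing page and fetcher, for illustration only
sample = '''
<div class="title"><a href="http://example.com/xq/1/">A</a></div>
<div class="title"><a href="http://example.com/xq/2/">B</a></div>
'''

def fake_fetch(url):
    if url.endswith('/2/'):
        raise IOError('simulated failure')
    return '<html>detail for %s</html>' % url

pages, failed = crawl_listing(sample, fake_fetch)
print(sorted(pages), failed)
```

Whatever ends up in `failed` is the natural input for the third, search-based approach.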

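By the time I wrote the code below I had learned bs4; a few experiments on a plane made it click. I later read that the lxml parser is faster and plan to switch, but bs4 is written in pure Python after all, so I will keep using it.

Here is the code: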

# -*- coding: utf-8 -*-
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
import random

# Set up the request headers and the proxy below

hds=[{'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'},\
    {'User-Agent':'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'},\
    {'User-Agent':'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'},\
    {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'},\
    {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36'},\
    {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'},\
    {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'},\
    {'User-Agent':'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'},\
    {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'},\
    {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'},\
    {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'},\
    {'User-Agent':'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11'},\
    {'User-Agent':'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11'}]
proxie = {'http' : 'http://221.195.53.117:8998'}  # reliable proxy sources: http://www.gatherproxy.com/zh/  https://www.zhihu.com/question/23825711 (learning this might also be a way around the firewall for good)

# Instead of crawling the ids first, open each listing page, open every detail link on it right away, and scrape inside the same loop

for i in range(1, 2):
    sleep(1)
    url = 'FILL_IN_YOURSELF' + str(i) + '/'  # listing-page URL prefix; you need to fill this in yourself
    # html = requests.get(url, headers=random.choice(hds), proxies=proxie, allow_redirects=False)
    html = requests.get(url, headers=random.choice(hds), allow_redirects=False)
    html.encoding = 'utf-8'
    # re.findall returns a list, while BeautifulSoup expects a string
    # ("expected string or bytes-like object" otherwise), so the matches are joined first
    bigger = ''.join(re.findall(r'''class="listContent"(.*?)class="contentBottom clear"''', html.text, re.S))
    soup = BeautifulSoup(bigger, 'html.parser')  # from_encoding is ignored for str input, so it is dropped
    newurls = soup.select('div[class="title"]')  # a list comprehension also works here: list_a = [tag.get('href') for tag in soup.select('a[href]')]
    for each in newurls:
        xqurl = re.findall(r'''href="(.*?)"''', str(each), re.S)
        if not xqurl:
            continue  # no link inside this title block
        sleep(random.randint(6, 12))
        xqhtml = requests.get(xqurl[0], headers=random.choice(hds), allow_redirects=False)
        xqhtml.encoding = 'utf-8'
        xy = re.findall(r'''resblockPosition:'(.*?)',''', xqhtml.text, re.S)
        soup1 = BeautifulSoup(xqhtml.text, 'html.parser')
        name = soup1.find(class_=["FILL_IN_YOURSELF"]).string  # class names: fill these in yourself
        price = soup1.find(class_=["FILL_IN_YOURSELF"]).string
        other = soup1.find_all(class_=["FILL_IN_YOURSELF"])
        # The xy check and the slice avoid an IndexError when a page has fewer fields than expected
        print(name, price, xy[0] if xy else '', *other[:8], xqurl[0])

Once the data is ready, it can be stored in a CSV file or in a database.
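
A minimal sketch of the CSV step, using Python's standard `csv` module. The column names, the sample rows, and the `save_rows` helper are assumptions made for illustration; the real fields are whatever the crawler above prints.

```python
import csv
import io

FIELDS = ['name', 'price', 'position', 'url']  # assumed column names

def save_rows(rows, fileobj):
    """Write a header line followed by one CSV line per crawled record."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Made-up records standing in for crawled data
rows = [
    {'name': 'Community A', 'price': '52000', 'position': '116.40,39.90',
     'url': 'http://example.com/xq/1/'},
    {'name': 'Community B', 'price': '48000', 'position': '116.50,39.80',
     'url': 'http://example.com/xq/2/'},
]

# An in-memory buffer keeps the sketch free of side effects; a real run would
# use open('xiaoqu.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
save_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes the position field automatically because it contains a comma, which is one reason to prefer it over joining strings by hand.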

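The page-by-page scan above has a bug: different pages contain duplicates, and some entries in the community list are never crawled. The results therefore have to be compared against the list, and the entries that were missed are looked up one by one. An unexpected bonus is that the same method can also be used to update the data (both adding new communities and refreshing the details of existing ones).

I originally wanted to do everything with selenium, but switching windows turned out to be fairly complicated, so it is easier to hand the rest over to requests. What matters is that the result is good.

The code is as follows: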
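
The comparison step could look something like the sketch below. The `find_missing` helper, the `name` key, and the sample data are hypothetical; the idea is simply to drop duplicate rows and list the names that the page scan never reached.

```python
def find_missing(all_names, crawled_rows):
    """Deduplicate crawled rows by name and list the names that were never crawled."""
    unique = {}
    for row in crawled_rows:
        unique.setdefault(row['name'], row)  # keep the first copy of a duplicate
    missing = [name for name in all_names if name not in unique]
    return list(unique.values()), missing

# Made-up data: the full name list and the rows produced by the page scan
all_names = ['Community A', 'Community B', 'Community C']
crawled_rows = [
    {'name': 'Community A', 'price': '52000'},
    {'name': 'Community A', 'price': '52000'},  # same entry seen on two pages
    {'name': 'Community C', 'price': '61000'},
]

unique_rows, missing = find_missing(all_names, crawled_rows)
print(len(unique_rows), missing)
```

Each name in `missing` is then searched for individually. Running the same comparison against a freshly crawled name list is what makes the method usable for updates.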

# -*- coding: utf-8 -*-

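from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # simulated key presses
import re
import codecs
import requests
from bs4 import BeautifulSoup
from time import sleep  # waiting is extremely important here

x = 'SEARCH_NAME'  # the community name to search for
browser = webdriver.Chrome()  # on macOS, putting chromedriver in /usr/local/bin/ is enough; there is no need to disable SIP
browser.get('URL')  # the site's search page; fill this in yourself
elem = browser.find_element(By.ID, "searchInput")  # find_element_by_id was removed in Selenium 4
sleep(5)
elem.send_keys(x)
elem.send_keys(Keys.RETURN)
sleep(5)
# Hand the results page over to requests instead of switching browser windows
html = requests.get(browser.current_url)
html.encoding = 'utf-8'
targeturl = re.findall(r'''<a class="img" href="(.*?)" target="_blank" rel="nofollow">''',html.text,re.S)
xqhtml = requests.get(targeturl[0])
xqhtml.encoding = 'utf-8'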
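
When a search returns nothing, there is no first match to index, so it may help to wrap the link extraction in a small helper. The sketch below reuses the same regular expression as the script; the `first_result_url` name and the two sample HTML strings are made up for illustration.

```python
import re

# Same pattern as in the script: the image link of a search result
RESULT_LINK = re.compile(
    r'<a class="img" href="(.*?)" target="_blank" rel="nofollow">', re.S)

def first_result_url(search_html):
    """Return the first detail-page URL on a results page, or None if there is none."""
    match = RESULT_LINK.search(search_html)
    return match.group(1) if match else None

# Made-up results pages: one with a hit, one without
hit = ('<li><a class="img" href="http://example.com/xq/42/" '
       'target="_blank" rel="nofollow"><img></a></li>')
miss = '<div class="empty">no results</div>'

print(first_result_url(hit), first_result_url(miss))
```

A caller can then skip names that come back as `None` and retry them later, instead of stopping the whole run on an `IndexError`.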