[Travel Trilogy] Scraping Local Scenic Spots (2)

Latest recommended article published on 2024-07-29 22:17:27

^三年梦^

Tags: web scraping, python

Article link: https://blog.csdn.net/2301_77455840/article/details/140742764

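Preface

In [Travel Trilogy] Scraping Merchant Reviews (1), we scraped the reviews and ratings of a scenic spot, which gave us a rough picture of each one. But if we want to compare it with other scenic spots, do we really have to scrape them one at a time? In this article we implement batch scraping instead.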

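Overall approach

For large amounts of data we stick with the browser automation tool selenium. We scrape the name and URL of every scenic spot on a page, click through to the next page, and repeat until the last page has been scraped. Finally, the collected data gets a rough preprocessing pass.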

Initialization

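from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

    def __init__(self):
        # Make the automated browser harder for the site's anti-bot checks to detect
        opt = Options()
        opt.add_argument('--disable-blink-features=AutomationControlled')
        # Selenium 4 style: the driver path goes through Service, the options through `options=`
        self.driver = webdriver.Chrome(service=Service(executable_path='chromedriver-win64/chromedriver.exe'), options=opt)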

If your driver is for Microsoft Edge, use the following code to keep the automation tool from being detected:

self.driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

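Page clicks + scraping

Open the site and log in with your account name and password. Inspecting the page shows that the pager looks like this in the HTML:
[Screenshot: the pager in the page's HTML]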

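Since we know both the last page number and the class name of the next-page button, the rest is easy: just click in a loop.

[Screenshot from the original post]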
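
The click-through loop can be sketched without a browser. The snippet below is only an illustration under stated assumptions: `FakePager` is a hypothetical stand-in for the Selenium driver, and the pager labels are made-up sample data, not what the site actually returns. It shows how a last page number can be read from the pager's link texts and why the next-page click is skipped on the final page. The article's code takes the second-to-last label (`ls[-2]`); picking the largest numeric label, as done here, is a slightly more defensive variant.

```python
class FakePager:
    """Hypothetical stand-in for the browser: it only tracks the current page."""

    def __init__(self, last_page):
        self.current = 1
        self.last_page = last_page

    def click_next(self):
        # A real driver would click the 'page-next' element here
        if self.current >= self.last_page:
            raise RuntimeError("no next page to click")
        self.current += 1


def last_page_from_labels(labels):
    """Return the largest numeric label among the pager link texts."""
    numbers = [int(text) for text in labels if text.strip().isdigit()]
    if not numbers:
        raise ValueError("no page numbers found in pager labels")
    return max(numbers)


def visit_all_pages(pager, last_page):
    """Visit pages 1..last_page, clicking 'next' after every page but the last."""
    visited = []
    for i in range(last_page):
        visited.append(pager.current)  # a real scraper would parse the page here
        if i < last_page - 1:
            pager.click_next()
    return visited


# Made-up pager labels: page links, an ellipsis, the last page, and a 'next' link
sample_labels = ["1", "2", "3", "...", "12", "next"]
last_page = last_page_from_labels(sample_labels)
pages = visit_all_pages(FakePager(last_page), last_page)
print(last_page, pages[:3], pages[-1])
```

Skipping the click on the last iteration matters because there is no next page left to click once the final page is reached.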

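Looking at the page's HTML, all the scenic spot information sits inside one large container. We can select that div with xpath and then iterate over it to pull out the fields we want.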

import time
from queue import Queue

from lxml import etree
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

        # Body of the scraping method; `place`, `user` and `passwd` are its arguments
        self.driver.get(
            'https://travelsearch.fliggy.com/index.htm?spm=181.61408.a1z7d.5.650e5e9elrqZHf&searchType=product&keyword=' + place + '&category=SCENIC')
        ActionChains(self.driver).move_to_element(self.driver.find_element(By.ID, 'fm-login-id')).click().send_keys(
            user).perform()  # enter the account name
        ActionChains(self.driver).move_to_element(self.driver.find_element(By.ID, 'fm-login-password')).click().send_keys(
            passwd).perform()  # enter the password
        # CAPTCHA handling (still has some problems; suggestions are welcome)
        # action = ActionChains(driver)
        # action.click_and_hold(driver.find_element(By.XPATH, "//div[@id='nc_1__scale_text']/span")).perform()
        # time.sleep(2)
        # action.move_by_offset(379,0)
        # time.sleep(1)
        # action.release().perform()
        # time.sleep(10)
        # ActionChains(driver).click_and_hold(driver.find_element(By.XPATH, '//*[@id="nc_1_n1z"]')).move_by_offset(xoffset=360,yoffset=0).perform()
        # time.sleep(1)
        self.driver.find_element(By.XPATH, '//*[@id="login-form"]/div[6]/button').click()  # click the confirm button
        time.sleep(10)
        WebDriverWait(self.driver, 100).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'active-icon'))
        )  # wait for the redirect after login
        self.driver.switch_to.window(self.driver.window_handles[0])
        self.driver.refresh()
        res = etree.HTML(self.driver.page_source)
        div1 = res.xpath('//*[@id="content"]/div[5]//div[@class="page-num-content"]')
        ls = []
        q = Queue()
        for i in div1:
            ls = i.xpath('./a/text()')  # page labels shown in the pager
        a = int(ls[-2])  # last page number
        for i in range(a):
            time.sleep(0.5)
            self.driver.switch_to.window(self.driver.window_handles[0])
            res = etree.HTML(self.driver.page_source)
            div = res.xpath('//*[@id="content"]/div[5]/div[1]/div[1]/div')
            place_url, place_name = [], []
            for div_li in div:
                # extend rather than assign, so results from earlier boxes are not overwritten
                place_url.extend(div_li.xpath('.//div[@class="product-left"]/a/@href'))
                place_name.extend(div_li.xpath('.//h3[@class="main-title"]/div/text()'))
            q.put((place_name, place_url))  # scenic spot names and their URLs for this page
            if i < a - 1:
                time.sleep(0.5)
                ActionChains(self.driver).move_to_element(self.driver.find_element(By.CLASS_NAME, 'page-next')).click().perform()  # click "next page"
        self.driver.quit()

Remember to close selenium once scraping is finished, so leftover browser processes do not overload the machine.
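
To make the extraction step concrete, here is a minimal sketch that runs on a static snippet instead of a live page. Everything in it is an assumption for illustration: the markup is invented and only mirrors the class names used in the XPath expressions above (`product-left`, `main-title`), the real page may be structured differently, and the standard library's `xml.etree.ElementTree` is used as a stand-in for lxml so that the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup that only mirrors the class names used in the article
SAMPLE = """
<div id="box">
  <div class="item">
    <div class="product-left"><a href="//example.com/spot/1">img</a></div>
    <h3 class="main-title"><div>Sample Lake</div></h3>
  </div>
  <div class="item">
    <div class="product-left"><a href="//example.com/spot/2">img</a></div>
    <h3 class="main-title"><div>[Boat ticket] Sample Lake</div></h3>
  </div>
</div>
"""


def extract_places(markup):
    """Collect names and URLs from every item inside the outer box."""
    root = ET.fromstring(markup)
    names, urls = [], []
    for item in root.findall("./div"):
        link = item.find(".//div[@class='product-left']/a")
        title = item.find(".//h3[@class='main-title']/div")
        if link is None or title is None:
            continue  # skip items that lack a link or a title
        # Append inside the loop so earlier items are not overwritten
        urls.append(link.get("href"))
        names.append((title.text or "").strip())
    return names, urls


names, urls = extract_places(SAMPLE)
print(list(zip(names, urls)))
```

Skipping items that lack either field keeps the two lists the same length, which matters later when names and URLs are paired row by row.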

Writing to a file

Because there may be a lot of data, a thread pool is used here to speed up writing, and the results are saved to place.csv.

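import csv
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()  # defensive addition: keeps rows from different threads from interleaving

    def write(self, message):
        place_name, place_url = message
        # All tasks append to the same file, so hold the lock while writing
        with write_lock, open('./place.csv', 'a', encoding='utf-8', newline='') as f:
            writer_ = csv.writer(f)
            # zip stops at the shorter list, so a missing name or URL cannot raise IndexError
            for name, url in zip(place_name, place_url):
                writer_.writerow((name, url))

        # Back in the scraping method, after the driver has quit: drain the queue through the pool
        with ThreadPoolExecutor(10) as task:
            while not q.empty():
                task.submit(self.write, q.get())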
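
The queue-plus-thread-pool pattern can be tried on its own with the standalone sketch below. It rests on a few assumptions: the sample tuples are invented, the file goes to a temporary directory rather than `./place.csv`, and `write_rows`, `drain_queue` and the lock are illustrative names and additions, not part of the original code.

```python
import csv
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

write_lock = threading.Lock()  # assumption: added so appends never interleave


def write_rows(path, message):
    """Append one page's names and URLs to the CSV file."""
    names, urls = message
    with write_lock, open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for name, url in zip(names, urls):
            writer.writerow((name, url))


def drain_queue(q, path, workers=10):
    """Submit one write task per queued page and wait for all of them."""
    with ThreadPoolExecutor(workers) as pool:
        futures = []
        while not q.empty():
            futures.append(pool.submit(write_rows, path, q.get()))
        for future in futures:
            future.result()  # re-raises any exception from a worker


# Invented sample data: two "pages" of results
q = Queue()
q.put((["Spot A", "Spot B"], ["//example.com/a", "//example.com/b"]))
q.put((["Spot C"], ["//example.com/c"]))

csv_path = os.path.join(tempfile.mkdtemp(), "place.csv")
drain_queue(q, csv_path)

with open(csv_path, encoding="utf-8", newline="") as f:
    rows = sorted(csv.reader(f))
print(rows)
```

Calling `future.result()` is worth the extra lines: an exception raised inside a submitted task is stored on its future and stays silent unless the result is requested.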

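Processing the file

Much of the scraped data is duplicated. Let's go back to the page and take another look:
[Screenshot: the search results page]
You will notice that every entry with [ ] is a ticket for an individual attraction inside a scenic spot, so the entries with [ ] need to be removed.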

import pandas as pd

    def deal_detail(self):
        """
        Remove the duplicate scenic spot entries (the per-attraction ticket rows).
        """
        text = pd.read_csv('./place.csv', names=['name', 'url'], encoding='utf-8')
        # Flag rows whose name starts with '['; astype(str) guards against empty names
        is_ticket = text['name'].astype(str).str.startswith('[')
        text = text[~is_ticket].reset_index(drop=True)
        text.to_csv('./place.csv', index=False)
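
For readers who want to check the filtering rule without pandas, the sketch below applies the same idea with only the standard library. The sample rows are invented, and `drop_ticket_rows` is a hypothetical helper name, not something from the original code.

```python
import csv
import io


def drop_ticket_rows(rows):
    """Keep rows whose name (first column) does not start with '['."""
    return [row for row in rows if row and not row[0].startswith("[")]


# Invented sample contents of place.csv (name, url)
SAMPLE_CSV = (
    "Spot A,//example.com/a\n"
    "[Cable car] Spot A,//example.com/a-cable\n"
    "Spot B,//example.com/b\n"
)

rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))
kept = drop_ticket_rows(rows)
print(kept)
```

Building a new list instead of deleting rows while iterating avoids having to track an index counter by hand.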