Traversing web page nodes in Python and recording their XPath

Automating web page link and event detection

Latest recommended article published on 2024-12-19 11:12:34

License: CC 4.0 BY-SA

Tags:

#python #lxml #xpath

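This post describes a way to extract every link on a web page, both static and dynamic, and to handle elements that carry an onclick event. lxml and XPath do the initial extraction, selenium checks for dynamically loaded links, and the DOM tree is traversed to look for onclick attributes. For elements without an id, the post gives a strategy for recording their XPath. It also shows how to simulate a click with selenium and capture the resulting page path.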

Requirement:

Extract all links on a page, both static and dynamic.

Approach:

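1) After the page loads, use lxml and XPath to extract the links in every <a href> attribute;

2) selenium can check the addresses that scripts access automatically while the page loads;

3) Traverse every node in the HTML with etree. If a node has an onclick attribute and its href has not been seen before, record its id or XPath, then use selenium to click it and read the current path after the click.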
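Step 1 can be tried in isolation with a few lines of lxml. The following is only a minimal sketch: `extract_links`, the sample markup and the `http://example.com/home/` base URL are hypothetical and stand in for the post's own `parsehtml` helper, whose body is not shown.

```python
from urllib.parse import urljoin

from lxml import etree


def extract_links(content, base_url):
    """Collect absolute URLs from <a href> and <script src> attributes."""
    html = etree.HTML(content)
    links = set()
    for expr in ("//a/@href", "//script/@src"):
        for value in html.xpath(expr):
            value = value.strip()
            # Skip empty values, in-page anchors and javascript: pseudo-links.
            if not value or value.startswith(("#", "javascript:")):
                continue
            # Resolve relative paths against the page URL.
            links.add(urljoin(base_url, value))
    return links


# Hypothetical sample page, used only for illustration.
sample = """
<html><body>
  <a href="/docs/index.html">Docs</a>
  <a href="javascript:void(0)" onclick="go()">Dynamic</a>
  <script src="static/app.js"></script>
</body></html>
"""
links = extract_links(sample, "http://example.com/home/")
print(sorted(links))
```

The second anchor has no usable href, which is exactly the kind of element that step 3 has to pick up through its onclick attribute.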

In testing, traversal with etree was tens of times faster than with selenium.

This post mainly records how to capture the XPath of elements that have no id.
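The core of the idea is lxml's `ElementTree.getpath()`, which returns an absolute XPath for a given element. Below is a small sketch on a hypothetical snippet (the markup and the `openMenu()` / `loadMore()` handlers are made up for illustration); it also checks that each recorded path resolves back to the same element.

```python
from lxml import etree

# Hypothetical markup: two clickable <span> elements, neither has an id.
sample = """
<html><body>
  <div id="menu"><span onclick="openMenu()">Menu</span></div>
  <div><span>plain</span><span onclick="loadMore()">More</span></div>
</body></html>
"""

root = etree.HTML(sample)
tree = etree.ElementTree(root)  # getpath() is a method of ElementTree

paths = []
for element in root.iter():
    if element.get("onclick") and not element.get("id"):
        path = tree.getpath(element)
        # The recorded path should select exactly the element it came from.
        matches = tree.xpath(path)
        assert len(matches) == 1 and matches[0] is element
        paths.append(path)
        print(path, "->", element.get("onclick"))
```

Positional indexes such as `div[2]` appear only where siblings share a tag name, so the recorded paths are tied to the structure of the page as parsed and may stop matching if scripts rearrange the DOM in the browser.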

    import traceback
    from lxml import etree

    # Collect only one level of links from the URL
    def getLinks(self, url, prefix, excludes):
        try:
            # Load the page with selenium and return its content
            content = self.getUrlContent(url, excludes)

            # Look up href values with etree
            html = etree.HTML(content)

            self.parsehtml(html, "//a/@ssohref", prefix, excludes)
            self.parsehtml(html, "//a/@href", prefix, excludes)
            self.parsehtml(html, "//script/@src", prefix, excludes)

            # Detect nodes that carry an onclick event
            #self.checkAllElement()
            self.checkAllElementHtml(html)

        except Exception:
            print(traceback.format_exc())
        return

The detection functions are as follows:

    # Breadth-first walk over the queued nodes.
    # self.elements is a collections.deque, self.urls holds the links already seen.
    def deepSortElementHtml(self, tree):
        while len(self.elements) > 0:
            parent = self.elements.popleft()
            #print("------------->{}, id={}".format(parent.tag, parent.get('id')) )
            #children = parent.xpath("./*")
            children = list(parent)  # getchildren() is deprecated in lxml
            if not children:
                continue

            for item in children:
                tag_name = item.tag
                elem_id = item.get('id')

                #print("{} id={}".format(tag_name, elem_id))

                if tag_name == 'script':
                    continue

                click = item.get("onclick")

                if click:
                    if tag_name == 'a':
                        href = item.get("href")
                        # Skip anchors whose href was already collected
                        if href and (href in self.urls):
                            #print("~{} id={}, onclick={}".format(tag_name, elem_id, href))
                            continue

                    if elem_id:
                        self.selected.append(elem_id)
                        print("{} id={}, onclick={}".format(tag_name, elem_id, click))
                    else:
                        # No id: record the absolute XPath instead
                        path = tree.getpath(item)
                        print("{} xpath={}, onclick={}".format(tag_name, path, click))
                        self.selected.append(path)
                    continue

                # Do not descend into <a> or <input> nodes
                if tag_name == 'a':
                    continue
                if tag_name == 'input':
                    continue

                self.elements.append(item)

        return

    

    def checkAllElementHtml(self, html):
        # Seed the queue with every <body> node
        for item in html.xpath('//body'):
            self.elements.append(item)

        # getpath() is a method of ElementTree, so wrap the root element
        tree = etree.ElementTree(html)
        self.deepSortElementHtml(tree)

        return True
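Because the two methods above depend on class state (`self.elements`, `self.urls`, `self.selected`), they cannot be run on their own. As a rough standalone sketch of the same breadth-first walk, the function below takes the HTML text and the set of known links as arguments; the name `collect_onclick_selectors` and the sample page are hypothetical.

```python
from collections import deque

from lxml import etree


def collect_onclick_selectors(content, known_urls=()):
    """Return the id, or else the XPath, of every element with an onclick."""
    root = etree.HTML(content)
    tree = etree.ElementTree(root)
    selected = []
    queue = deque(root.xpath("//body"))
    while queue:
        parent = queue.popleft()
        for item in parent:
            # Comments have a non-string tag; skip them and <script> nodes.
            if not isinstance(item.tag, str) or item.tag == "script":
                continue
            if item.get("onclick"):
                href = item.get("href")
                if item.tag == "a" and href and href in known_urls:
                    continue  # this link was already collected
                selected.append(item.get("id") or tree.getpath(item))
                continue
            if item.tag in ("a", "input"):
                continue  # nothing to descend into
            queue.append(item)
    return selected


# Hypothetical sample page, used only for illustration.
sample = """
<html><body>
<div id="nav" onclick="toggle()">Nav</div>
<ul><li onclick="pick(1)">One</li><li onclick="pick(2)">Two</li></ul>
<a href="/seen" onclick="track()">Seen</a>
<script>var x = 1;</script>
</body></html>
"""
selectors = collect_onclick_selectors(sample, known_urls={"/seen"})
print(selectors)
```

As in the original, the walk does not descend into an element once it has been recorded, so nested onclick handlers inside a clickable container are not reported separately.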

The final click step is as follows:

    from selenium.webdriver.common.by import By

    # Each entry in ids is either an element id or an absolute XPath
    def clickIds(self, url, ids):
        for id in ids:
            if not id:
                continue
            try:
                # Reload the page before every click so each click starts from the same state
                self.driver.set_page_load_timeout(self.time_wait)
                self.driver.get(url)

                print("click id=" + id)
                if id[0] != '/':
                    ele = self.driver.find_element(By.ID, id)
                    ele.click()
                else:
                    # A leading '/' marks a recorded XPath
                    ele = self.driver.find_element(By.XPATH, id)
                    onclick = ele.get_attribute('onclick')
                    print(onclick)

                    href = ele.get_attribute('href')
                    print(href)

                    ele.click()
                #time.sleep(3)
                if self.driver.current_url not in self.urls:
                    print(self.driver.current_url)
                self.urls.add(self.driver.current_url)
            except Exception as ex:
                print(ex)

        return
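The id-versus-XPath decision in `clickIds` rests on one convention: a recorded selector that starts with `/` is an XPath, anything else is an id. One way to isolate that rule is a small helper such as the hypothetical `to_locator` below. It returns the plain strings `"id"` and `"xpath"`, which are the values of `By.ID` and `By.XPATH` in Selenium's Python bindings, so the sketch runs without a browser.

```python
def to_locator(selector):
    """Map a recorded selector to a (strategy, value) pair.

    Assumption: ids never start with '/', so a leading slash always means
    the selector is an absolute XPath produced by getpath().
    """
    if selector.startswith("/"):
        return ("xpath", selector)
    return ("id", selector)


for recorded in ("nav", "/html/body/ul/li[2]"):
    print(to_locator(recorded))

# With a live driver (not created here) the pair can be unpacked directly:
# element = driver.find_element(*to_locator(recorded))
```

Keeping the rule in one place makes it easier to change later, for example if a page turns out to use ids that begin with a slash.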