Web Scraping with XPath

Installation

  • pip install lxml
  • On Windows this usually fails to build; install a prebuilt package manually instead (see below)

Error: ImportError: DLL load failed: %1 is not a valid Win32 application.

  • This happens when a 64-bit Python is paired with a 32-bit module, or a 32-bit Python with a 64-bit module
  • First check whether your Python is 32-bit or 64-bit
  • Then download the module built for that architecture
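
A common source of prebuilt Windows packages at the time was Christoph Gohlke's page (https://www.lfd.uci.edu/~gohlke/pythonlibs/): download the lxml wheel matching your Python version and bitness, then install it with pip. The filename below is only a hypothetical example:

    pip install lxml-3.4.4-cp27-none-win_amd64.whl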

Checking whether your Python is 32-bit or 64-bit

  • Open IDLE
  • The first line of the startup banner shows the bitness
  • Or run python directly in cmd and read its banner (or use the snippet below)
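
The bitness can also be checked from code; a minimal sketch using only the standard library (an addition to the original notes):

    # coding=utf-8
    # Pointer size in bits: prints 64 on a 64-bit interpreter, 32 on a 32-bit one.
    import struct
    print struct.calcsize("P") * 8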

Key concepts

  • In the browser's DevTools view of the page source, right-click a node and choose Copy XPath
  • // selects matching nodes anywhere in the document
  • / steps down one level, to a direct child
  • Extract text content: /text()
  • Extract an attribute value: /@xxxx
  • Match elements whose attribute starts with a common prefix: starts-with(@attr, "prefix")
  • Text spread across nested tags: string(.)

Test cases

Basic XPath usage

    # coding=utf-8
    from lxml import etree
    import sys
    reload(sys)                      # Python 2 only: re-expose setdefaultencoding
    sys.setdefaultencoding("utf-8")  # make implicit str/unicode conversions use UTF-8
    html = '''
    <!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title>测试-常规用法</title>
    </head>
    <body>
    <div id="content">
        <ul id="useful">
            <li>这是第一条信息</li>
            <li>这是第二条信息</li>
            <li>这是第三条信息</li>
        </ul>
        <ul id="useless">
            <li>不需要的信息1</li>
            <li>不需要的信息2</li>
            <li>不需要的信息3</li>
        </ul>
        <div id="url">
            <a href="http://jikexueyuan.com">极客学院</a>
            <a href="http://jikexueyuan.com/course/" title="极客学院课程库">点我打开课程库</a>
        </div>
    </div>
    </body>
    </html>
    '''
    selector = etree.HTML(html)
    # Extract text: each <li> under the ul with id="useful"
    content = selector.xpath('//ul[@id="useful"]/li/text()')
    for each in content:
        print each
    # Extract attributes: the href of every <a>
    link = selector.xpath('//a/@href')
    for each in link:
        print each
    # Extract the title attribute (only the second <a> has one)
    title = selector.xpath('//a/@title')
    print title[0]
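
For reference, running this under Python 2 should print:

    这是第一条信息
    这是第二条信息
    这是第三条信息
    http://jikexueyuan.com
    http://jikexueyuan.com/course/
    极客学院课程库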

Special XPath usage

    # -*- coding: utf-8 -*-
    from lxml import etree
    import sys
    reload(sys)                      # Python 2 workaround, as above
    sys.setdefaultencoding("utf-8")
    html1 = '''
    <!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <div id="test-1">需要的内容1</div>
        <div id="test-2">需要的内容2</div>
        <div id="testfault">需要的内容3</div>
    </body>
    </html>
    '''
    html2 = '''
    <!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <div id="test3">
            我左青龙,
            <span id="tiger">
                右白虎,
                <ul>上朱雀,
                    <li>下玄武。</li>
                </ul>
                老牛在当中,
            </span>
            龙头在胸口。
        </div>
    </body>
    </html>
    '''
    # Test starts-with: matches every div whose id begins with "test"
    selector = etree.HTML(html1)
    content = selector.xpath('//div[starts-with(@id,"test")]/text()')
    for each in content:
        print each
    # Test string(.) on nested tags
    selector = etree.HTML(html2)
    content_1 = selector.xpath('//div[@id="test3"]/text()')
    for each in content_1:
        print each  # text() returns only the direct text children; text inside nested tags is skipped
    data = selector.xpath('//div[@id="test3"]')[0]
    info = data.xpath('string(.)')
    content_2 = info.replace('\n', '').replace(' ', '')
    print content_2
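
For reference: the starts-with query matches all three divs, the plain text() query on test3 returns only the two direct text nodes (我左青龙, and 龙头在胸口。), and string(.) flattens the whole subtree, so the last line prints:

    我左青龙,右白虎,上朱雀,下玄武。老牛在当中,龙头在胸口。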

Timing single-threaded vs. multi-threaded fetching

    # -*- coding: utf-8 -*-
    from multiprocessing.dummy import Pool as ThreadPool  # thread pool; effective for I/O-bound downloads despite the GIL
    import requests
    import time
    def getsource(url):
        requests.get(url)  # fetch the page; only the elapsed time matters here
    urls = []
    # Build the 20 page URLs, then time the single-threaded version
    for i in range(1, 21):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        urls.append(newpage)
    time1 = time.time()
    for i in urls:
        print i
        getsource(i)
    time2 = time.time()
    print 'Single thread took: ' + str(time2 - time1)
    # Time the thread-pool version (4 worker threads)
    pool = ThreadPool(4)
    time3 = time.time()
    results = pool.map(getsource, urls)
    pool.close()
    pool.join()
    time4 = time.time()
    print 'Thread pool took: ' + str(time4 - time3)

Scraping replies from a Baidu Tieba thread

    # -*- coding: utf-8 -*-
    from lxml import etree
    from multiprocessing.dummy import Pool as ThreadPool
    import requests
    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    '''Delete content.txt before re-running: it is opened in append mode, so old output piles up.'''
    def towrite(contentdict):
        f.writelines(u'Reply time: ' + str(contentdict['topic_reply_time']) + '\n')
        f.writelines(u'Reply content: ' + unicode(contentdict['topic_reply_content']) + '\n')
        f.writelines(u'Author: ' + contentdict['user_name'] + '\n\n')
    def spider(url):
        html = requests.get(url)
        selector = etree.HTML(html.text)
        content_field = selector.xpath('//div[@class="l_post l_post_bright "]')  # NOTE: broken today; Tieba has since changed this markup (see the sketch after this listing)
        item = {}
        for each in content_field:
            reply_info = json.loads(each.xpath(
                '@data-field')[0].replace('&quot', ''))
            author = reply_info['author']['user_name']
            content = each.xpath(
                'div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content "]/text()')[0]
            reply_time = reply_info['content']['date']
            print content
            print reply_time
            print author
            item['user_name'] = author
            item['topic_reply_content'] = content
            item['topic_reply_time'] = reply_time
            towrite(item)
    if __name__ == '__main__':
        pool = ThreadPool(4)
        f = open('content.txt', 'a')
        page = []
        for i in range(1, 21):
            newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
            page.append(newpage)
        results = pool.map(spider, page)
        pool.close()
        pool.join()
        f.close()
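
The exact class strings above are brittle: Tieba has already changed them once. A looser variant matches with contains() instead of the full class attribute; this is only a sketch, assuming the class names still contain the substrings "l_post" and "d_post_content" (not verified against the current page):

    # Hypothetical, looser matching: tolerates extra class tokens.
    content_field = selector.xpath('//div[contains(@class, "l_post")]')
    for each in content_field:
        texts = each.xpath('.//div[contains(@class, "d_post_content")]//text()')
        content = ''.join(texts).strip()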

XPath Helper (Chrome extension)

XPath Helper makes it easy to extract, edit, and evaluate XPath queries on any webpage. Important: after installing this extension, you must reload any existing tabs or restart Chrome for it to work.

  1. Open a new tab and navigate to any webpage.
  2. Hit Ctrl-Shift-X (or Command-Shift-X on OS X), or click the XPath Helper button in the toolbar, to open the XPath Helper console.
  3. Hold down Shift as you mouse over elements on the page. The query box will continuously update to show the XPath query for the element below the mouse pointer, and the results box will show the results for the current query.
  4. If desired, edit the XPath query directly in the console. The results box will immediately reflect your changes.
  5. Repeat step 2 to close the console.

If the console gets in your way, hold down Shift and then move your mouse over it; it will move to the opposite side of the page. One word of caution: when rendering HTML tables, Chrome inserts artificial <tbody> tags into the DOM, and these will show up in queries extracted by this extension.