Scraping Static Web Pages with Python - CSDN Blog

Article link: https://blog.csdn.net/qq_34761385/article/details/122925711

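This article describes how the author wrote a crawler in Python 2.7 that uses the article search URL of a Lofter author's page to fetch that author's sequentially numbered articles in a loop and save each one as a local TXT file. Along the way it touches on encoding issues, request header settings, and the BeautifulSoup library.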


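Fan fiction on Lofter is posted one chapter at a time, and I couldn't be bothered to hunt down each one, so I spent a little time writing a crawler that fetches the text and saves it as local text files. The data is fetched through the article search URL of the author's Lofter page.

Example: for the author 我是走高冷路线的, the article search URL is: http://sanliubixian.lofter.com/search?q=

Append an article title to this URL and the search returns the matching article by that author. There is also a convenient pattern: her articles are titled with sequence numbers, such as 征服欲1, 征服欲2, and so on, so we can fetch them in a loop.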
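As a rough illustration of the numbered-title idea, the sketch below builds the search URL for each chapter. The helper name build_search_urls and its first/last parameters are made up for this example and are not part of the original script; the import fallback is only there so the snippet runs on both Python 2.7 and 3.x.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2.7

SEARCH_URL = 'http://sanliubixian.lofter.com/search?q='

def build_search_urls(series_name, first=1, last=3):
    """Hypothetical helper: one search URL per numbered title (name1, name2, ...)."""
    urls = []
    for num in range(first, last + 1):
        title = u'%s%d' % (series_name, num)
        # Percent-encode the UTF-8 bytes of the title so it is safe in a URL
        urls.append(SEARCH_URL + quote(title.encode('utf-8')))
    return urls

for search_url in build_search_urls(u'征服欲'):
    print(search_url)
```

Each URL can then be requested in turn, stopping at the first number for which the search returns no article.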

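1. Preparation

I ran into a lot of pitfalls along the way, which I won't go through one by one. My local Python version is 2.7. Keep this in mind, because 2.7 and 3.x differ in several ways. The difference that matters most here is the urllib module: the code below uses urllib and urllib2 from 2.7, whose functionality lives in urllib.request and urllib.parse in 3.x. For details, see this blogger's post:

python 2.xx: import urllib.request fails with "no module named request" (典笛安's blog, CSDN)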
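One common way to cope with the module difference is a fallback import, sketched below. This is not how the original script is written (it targets 2.7 only); make_request is a hypothetical helper, and the request is only constructed, never sent.

```python
try:
    # Python 3: the functionality lives in urllib.request and urllib.parse
    from urllib.request import Request, urlopen
    from urllib.parse import quote
except ImportError:
    # Python 2.7: the same names come from urllib2 and urllib
    from urllib2 import Request, urlopen
    from urllib import quote

def make_request(url, user_agent='Mozilla/5.0'):
    """Hypothetical helper: build (but do not send) a request with a User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})

req = make_request('http://sanliubixian.lofter.com/search?q=' + quote('test'))
print(req.get_full_url())
# Sending it would be: urlopen(req).read()
```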

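Second, install the web module: pip install web.py does it.

Third, there is the encoding issue. I recommend writing the code in a proper editor or development tool for Python; I use Sublime Text.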
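The post does not spell out which encoding problems came up, so the following is only a guess at the typical ones: converting a Unicode keyword to UTF-8 bytes before sending it, and writing Unicode text to a file with an explicit encoding. The helper names save_text and load_text are invented for this sketch; io.open is used because it accepts an encoding argument on both Python 2.7 and 3.x.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def save_text(path, lines):
    """Write Unicode lines to path as UTF-8, ending each with \\r\\n."""
    # newline='' disables newline translation, so \r\n is written as given
    with io.open(path, 'w', encoding='utf-8', newline='') as fh:
        for line in lines:
            fh.write(line + u'\r\n')

def load_text(path):
    """Read the whole file back as Unicode text, without newline translation."""
    with io.open(path, 'r', encoding='utf-8', newline='') as fh:
        return fh.read()

keyword = u'征服欲1'
encoded = keyword.encode('utf-8')   # bytes, suitable for URL quoting or a binary file

demo_path = os.path.join(tempfile.mkdtemp(), 'encoding_demo.txt')
save_text(demo_path, [keyword, u'second line'])
print(load_text(demo_path) == keyword + u'\r\nsecond line\r\n')
```

Keeping text as Unicode inside the program and encoding only at the boundaries (network, disk) avoids most of the implicit ASCII conversions that Python 2.7 would otherwise attempt.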

That's all there is to it. It's just a single .py file anyway, so let's go straight to the code.

index.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re
import urllib
import urllib2
import web
import json
urls = (
     '/', 'hello'
)
app = web.application(urls, globals())

# Search for the article whose title matches i and save it as a local .txt file.
# Returns "true" if an article was saved, "false" if the search found none.
def gettext(i):
    from bs4 import BeautifulSoup

    url = 'http://sanliubixian.lofter.com/search?q='
    keyword = i.encode('utf-8')
    key_code = urllib.quote(keyword)  # URL-encode the search keyword
    url_all = url + key_code
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }   # request headers
    request = urllib2.Request(url_all, headers=header)
    response = urllib2.urlopen(request).read()

    # Create a BeautifulSoup parser object
    soup = BeautifulSoup(response.replace('&nbsp;', ' '), "html.parser", from_encoding="utf-8")
    # Extract the text
    title = soup.find('h2')
    print title
    if title is None:
        print "All articles have been scraped!"
        return "false"
    else:
        p_nodes = soup.find_all('p')
        title_text = title.get_text()
        # Write the file into the current directory
        with open("./" + title_text + ".txt", "wb") as fh:
            fh.write(title_text.encode('utf-8'))
            fh.write('\r\n')
            for p_node in p_nodes:
                fh.write(p_node.get_text().encode('utf-8'))
                fh.write('\r\n')
        print "Saved: " + title_text.encode('utf-8')
        return "true"


class hello:
    def __init__(self):
        web.header('content-type', 'text/json')
        web.header('Access-Control-Allow-Origin', '*')
        web.header('Access-Control-Allow-Methods', 'GET, POST')

    def GET(self):
        i = web.input(name=None)
        if i.name is None:
            # Without ?name=... there is nothing to search for
            return json.dumps({'msg': 'missing "name" query parameter'})
        count = 0
        # Try name1, name2, ... up to name29; stop at the first miss
        for num in range(1, 30):
            s = i.name + str(num)
            result = gettext(s)
            if result == "false":
                break
            count += 1
        return json.dumps({'data': {'msg': 'scraping finished', 'count': count}})

    def POST(self):
        a = int(web.input().a)
        b = int(web.input().b)
        return str(a + b)  # web.py expects a string response body

if __name__ == "__main__":
    app.run()
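The parsing step in gettext (first h2 as the title, every p as body text) can be tried offline on a hand-written page. The sketch below deliberately swaps BeautifulSoup for the standard library's HTMLParser so that it has no third-party dependency; ArticleParser, parse_article, and the SAMPLE markup are all invented for illustration, and real Lofter pages may be structured differently.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7

class ArticleParser(HTMLParser):
    """Collect the text of the first <h2> and of every <p>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.title = None
        self.paragraphs = []
        self._tag = None   # tag whose text is currently being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p' or (tag == 'h2' and self.title is None):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = ''.join(self._buf)
            if tag == 'h2':
                self.title = text
            else:
                self.paragraphs.append(text)
            self._tag = None

def parse_article(html):
    """Return (title, paragraphs); title is None when the page has no <h2>."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.paragraphs

SAMPLE = ('<html><body><h2>Title 1</h2>'
          '<p>First <b>bold</b> line.</p><p>Second line.</p></body></html>')
title, paragraphs = parse_article(SAMPLE)
print(title)
print(paragraphs)
```

A page without an h2 yields a title of None, which corresponds to the condition the script uses to decide that there are no more chapters. This simplified parser does not handle a p nested inside an h2 or unclosed tags.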

Result of running index.py: