1. Introduction
This article shows how to use Python to scrape Xiaohongshu data by keyword: notes, a user's note list, and post comments.
For study and reference only.
2. Code
Entry point: starting the app and setting the port
if __name__ == '__main__':
    import uvicorn
    # Note: in production you should start uvicorn from the command line
    # (for example: uvicorn main:app --host 127.0.0.1 --port 6006),
    # not from inside the script, but for demonstration purposes this is fine
    uvicorn.run(app, host="127.0.0.1", port=6006)
Defining the POST route
from fastapi import FastAPI, HTTPException

# config, db, proxy_account_pool and CrawlerFactory are modules from the
# crawler project itself
app = FastAPI()

# Define an async function that starts the crawler; note that it does no
# command-line argument parsing
async def start_crawler(crawler_type: str = "search"):
    account_pool = proxy_account_pool.create_account_pool()
    if config.IS_SAVED_DATABASED:
        await db.init_db()
    crawler = CrawlerFactory.create_crawler(platform='xhs')
    crawler.init_config(
        account_pool=account_pool,
        crawler_type=crawler_type
    )
    await crawler.start()
    # Note: start() normally should not block or return a value to the
    # frontend; this is only an example. In practice you may want an async
    # callback or some other mechanism to deliver the crawler's results
    # (see the BackgroundTasks sketch after the route below)
# Create a FastAPI route that calls start_crawler
@app.post("/fetch")
async def start_crawler_route(data: dict):
    crawler_type = data.get("crawler_type")
    config.KEYWORDS = data.get("keywords")
    if crawler_type is None:
        crawler_type = "search"  # fall back to the default crawler type
    try:
        await start_crawler(crawler_type)
        return {"message": "Crawler started successfully"}
    except Exception as e:
        # All exceptions are caught here for simplicity, but normally you
        # would not catch everything; for safety, handle only the exceptions
        # you actually expect to occur
        raise HTTPException(status_code=500, detail=str(e))
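As the comments above note, start() blocks until the crawl finishes, so the client ends up waiting for the entire run. A minimal sketch of one way around this, using FastAPI's built-in BackgroundTasks (the /fetch_async path is a hypothetical name; start_crawler is the function defined above):

from fastapi import BackgroundTasks

@app.post("/fetch_async")  # hypothetical route, not part of the original service
async def start_crawler_async(data: dict, background_tasks: BackgroundTasks):
    crawler_type = data.get("crawler_type") or "search"
    config.KEYWORDS = data.get("keywords")
    # Schedule the crawl to run after the response is sent, so the client
    # gets an immediate acknowledgement instead of waiting for the crawl
    background_tasks.add_task(start_crawler, crawler_type)
    return {"message": "Crawler scheduled"}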
3. Reverse engineering
Here is one implementation idea, offered for reference. It is considerably more efficient than pure browser automation: use Playwright and inject JavaScript into the page to obtain the encrypted signature parameters. A demo of the approach follows:
- Playwright example
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        page = await browser.new_page()
        # Inject the stealth.min.js script to reduce the chance of detection
        await page.add_init_script(path="stealth.min.js")
        # The signing function only exists once the site's own JS has loaded
        await page.goto("https://www.xiaohongshu.com")
        url = ""
        data = ""
        # Call the site's signing function in the page context to obtain the
        # encrypted request parameters
        encrypt_params = await page.evaluate('([url, data]) => window._webmsxyw(url, data)', [url, data])
        local_storage = await page.evaluate('() => window.localStorage')
        print(encrypt_params)
        print(local_storage)
        await browser.close()

asyncio.run(main())
The stealth.min.js script above was open-sourced by a community developer. Injecting it helps keep the browser from being detected as automated; in addition, the cookie parameters need certain attributes set so that the web side does not throw up a slider CAPTCHA. Those are concerns for the final productionized version; the point here is to solve the encrypted-parameter problem without fully reverse-engineering the JavaScript.
Running the JS-injection script prints the encrypted parameters returned by the signing function together with the page's localStorage.
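Once the signature has been obtained, it can be attached to the actual API request. Below is a minimal sketch, assuming the signing call returns the X-s and X-t values that the web API expects as request headers; this matches common open-source implementations, but verify it against your own captured traffic. The url, data, and cookies arguments are placeholders:

import httpx

async def call_signed_api(url: str, data: dict, encrypt_params: dict, cookies: dict):
    # Assumed shape: the signing call returned X-s / X-t header values
    headers = {
        "X-s": str(encrypt_params.get("X-s", "")),
        "X-t": str(encrypt_params.get("X-t", "")),
        "Content-Type": "application/json",
    }
    async with httpx.AsyncClient(cookies=cookies) as client:
        resp = await client.post(url, headers=headers, json=data)
        resp.raise_for_status()
        return resp.json()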
4. Calling the service
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.util.HashMap;
import java.util.Map;

public class CrawlerClient {
    public static void main(String[] args) {
        try {
            CloseableHttpClient httpClient = HttpClients.createDefault();
            Map<String, String> map = new HashMap<>();
            map.put("keywords", "python,旅游");
            ObjectMapper objectMapper = new ObjectMapper();
            String jsonString = objectMapper.writeValueAsString(map);
            HttpPost httpPost = new HttpPost("http://127.0.0.1:6006/fetch");
            StringEntity entity = new StringEntity(jsonString, "UTF-8");
            httpPost.setEntity(entity);
            httpPost.setHeader("Accept", "application/json");
            httpPost.setHeader("Content-type", "application/json");
            try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
                // Check the response status code
                int statusCode = response.getStatusLine().getStatusCode();
                if (statusCode >= 200 && statusCode < 300) {
                    String responseString = EntityUtils.toString(response.getEntity(), "UTF-8");
                    ObjectMapper mapper = new ObjectMapper();
                    JsonNode rootNode = mapper.readTree(responseString);
                    System.out.println(rootNode.toString()); // or process it further
                } else {
                    throw new RuntimeException("Failed : HTTP error code : " + statusCode);
                }
            }
        } catch (Exception ex) {
            System.out.println("Fetch failed!");
            ex.printStackTrace();
        }
    }
}
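For readers who would rather not set up a Java toolchain, the same call can be made from Python with the requests library; the endpoint and payload match the FastAPI route defined in section 2:

import requests

payload = {"crawler_type": "search", "keywords": "python,旅游"}
# The crawl runs synchronously on the server, so allow a generous timeout
resp = requests.post("http://127.0.0.1:6006/fetch", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())  # {"message": "Crawler started successfully"}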
5. Results
The scraped results can be exported as a spreadsheet.
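For example, a minimal export sketch using pandas (the records and field names below are illustrative only, not the project's actual schema):

import pandas as pd

# Hypothetical note records; the field names are illustrative only
notes = [
    {"note_id": "abc123", "title": "example", "likes": 10, "comments": 2},
]
# utf-8-sig keeps Chinese text readable when the CSV is opened in Excel
pd.DataFrame(notes).to_csv("xhs_notes.csv", index=False, encoding="utf-8-sig")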
6. Getting ready-to-use source code
Readers without programming or crawling experience who want to study this can reach the author at 地球mrguo0114 to get the complete, ready-to-use project source code.
7. A word of caution
Tools and techniques like these can legitimately help your work when used in moderation, but never abuse them.