1. Introduction
This article shows how to use Python to scrape Xiaohongshu data by keyword: notes, a user's note list, and post comments.
For study and reference only.
2. Code
Entry point: starting the app and setting the port
if __name__ == '__main__':
    import uvicorn
    # Note: in production you should start uvicorn from the command line
    # (for example: uvicorn main:app --host 127.0.0.1 --port 6006),
    # not from inside the script, but for demonstration purposes this is fine
    uvicorn.run(app, host="127.0.0.1", port=6006)
Defining the POST route
from fastapi import FastAPI, HTTPException

# config, db, proxy_account_pool and CrawlerFactory are modules from the
# crawler project itself
app = FastAPI()

# Define an async function that starts the crawler; note that it does no
# command-line argument parsing
async def start_crawler(crawler_type: str = "search"):
    account_pool = proxy_account_pool.create_account_pool()
    if config.IS_SAVED_DATABASED:
        await db.init_db()
    crawler = CrawlerFactory.create_crawler(platform='xhs')
    crawler.init_config(
        account_pool=account_pool,
        crawler_type=crawler_type
    )
    await crawler.start()
    # Note: start() normally should not block or return a value to the
    # frontend; this is only an example. In practice you may want an async
    # callback or some other mechanism to deliver the crawler's results
    # (see the BackgroundTasks sketch after the route below)
# Create a FastAPI route that calls start_crawler
@app.post("/fetch")
async def start_crawler_route(data: dict):
    crawler_type = data.get("crawler_type")
    config.KEYWORDS = data.get("keywords")
    if crawler_type is None:
        crawler_type = "search"  # fall back to the default crawler type
    try:
        await start_crawler(crawler_type)
        return {"message": "Crawler started successfully"}
    except Exception as e:
        # All exceptions are caught here for simplicity, but normally you
        # would not catch everything; for safety, handle only the exceptions
        # you actually expect to occur
        raise HTTPException(status_code=500, detail=str(e))
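As the comments above note, start() blocks until the crawl finishes, so the client ends up waiting for the entire run. A minimal sketch of one way around this, using FastAPI's built-in BackgroundTasks (the /fetch_async path is a hypothetical name; start_crawler is the function defined above):

from fastapi import BackgroundTasks

@app.post("/fetch_async")  # hypothetical route, not part of the original service
async def start_crawler_async(data: dict, background_tasks: BackgroundTasks):
    crawler_type = data.get("crawler_type") or "search"
    config.KEYWORDS = data.get("keywords")
    # Schedule the crawl to run after the response is sent, so the client
    # gets an immediate acknowledgement instead of waiting for the crawl
    background_tasks.add_task(start_crawler, crawler_type)
    return {"message": "Crawler scheduled"}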
3. Reverse engineering
Here is one implementation idea, offered for reference. It is considerably more efficient than pure browser automation: use Playwright and inject JavaScript into the page to obtain the encrypted signature parameters. A demo of the approach follows:
- Playwright example
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        page = await browser.new_page()
        # Inject the stealth.min.js script to reduce the chance of detection
        await page.add_init_script(path="stealth.min.js")
        # The signing function only exists once the site's own JS has loaded
        await page.goto("https://www.xiaohongshu.com")
        url = ""
        data = ""
        # Call the site's signing function in the page context to obtain the
        # encrypted request parameters
        encrypt_params = await page.evaluate('([url, data]) => window._webmsxyw(url, data)', [url, data])
        local_storage = await page.evaluate('() => window.localStorage')
        print(encrypt_params)
        print(local_storage)
        await browser.close()

asyncio.run(main())
The stealth.min.js script above was open-sourced by a community developer. Injecting it helps keep the browser from being detected as automated; in addition, the cookie parameters need certain attributes set so that the web side does not throw up a slider CAPTCHA. Those are concerns for the final productionized version; the point here is to solve the encrypted-parameter problem without fully reverse-engineering the JavaScript.
Running the JS-injection script prints the encrypted parameters returned by the signing function together with the page's localStorage.
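Once the signature has been obtained, it can be attached to the actual API request. Below is a minimal sketch, assuming the signing call returns the X-s and X-t values that the web API expects as request headers; this matches common open-source implementations, but verify it against your own captured traffic. The url, data, and cookies arguments are placeholders:

import httpx

async def call_signed_api(url: str, data: dict, encrypt_params: dict, cookies: dict):
    # Assumed shape: the signing call returned X-s / X-t header values
    headers = {
        "X-s": str(encrypt_params.get("X-s", "")),
        "X-t": str(encrypt_params.get("X-t", "")),
        "Content-Type": "application/json",
    }
    async with httpx.AsyncClient(cookies=cookies) as client:
        resp = await client.post(url, headers=headers, json=data)
        resp.raise_for_status()
        return resp.json()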
4. Calling the service
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.util.HashMap;
import java.util.Map;

public class CrawlerClient {
    public static void main(String[] args) {
        try {
            CloseableHttpClient httpClient = HttpClients.createDefault();
            Map<String, String> map = new HashMap<>();
            map.put("keywords", "python,旅游");
            ObjectMapper objectMapper = new ObjectMapper();
            String jsonString = objectMapper.writeValueAsString(map);
            HttpPost httpPost = new HttpPost("http://127.0.0.1:6006/fetch");
            StringEntity entity = new StringEntity(jsonString, "UTF-8");
            httpPost.setEntity(entity);
            httpPost.setHeader("Accept", "application/json");
            httpPost.setHeader("Content-type", "application/json");
            try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
                // Check the response status code
                int statusCode = response.getStatusLine().getStatusCode();
                if (statusCode >= 200 && statusCode < 300) {
                    String responseString = EntityUtils.toString(response.getEntity(), "UTF-8");
                    ObjectMapper mapper = new ObjectMapper();
                    JsonNode rootNode = mapper.readTree(responseString);
                    System.out.println(rootNode.toString()); // or process it further
                } else {
                    throw new RuntimeException("Failed : HTTP error code : " + statusCode);
                }
            }
        } catch (Exception ex) {
            System.out.println("Fetch failed!");
            ex.printStackTrace();
        }
    }
}
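For readers who would rather not set up a Java toolchain, the same call can be made from Python with the requests library; the endpoint and payload match the FastAPI route defined in section 2:

import requests

payload = {"crawler_type": "search", "keywords": "python,旅游"}
# The crawl runs synchronously on the server, so allow a generous timeout
resp = requests.post("http://127.0.0.1:6006/fetch", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())  # {"message": "Crawler started successfully"}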
5. Results
The scraped results can be exported as a spreadsheet.
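For example, a minimal export sketch using pandas (the records and field names below are illustrative only, not the project's actual schema):

import pandas as pd

# Hypothetical note records; the field names are illustrative only
notes = [
    {"note_id": "abc123", "title": "example", "likes": 10, "comments": 2},
]
# utf-8-sig keeps Chinese text readable when the CSV is opened in Excel
pd.DataFrame(notes).to_csv("xhs_notes.csv", index=False, encoding="utf-8-sig")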
6. Getting ready-to-use source code
Readers without programming or crawling experience who want to study this can reach the author at 地球mrguo0114 to get the complete, ready-to-use project source code.
7. A word of caution
Tools and techniques like these can legitimately help your work when used in moderation, but never abuse them.