Scraping Dianping Shop Information and Review Content

Latest recommended article published on 2025-04-04 18:30:14

江饮溪

Article tags: python, web scraping

Article link: https://blog.csdn.net/m0_46700310/article/details/146376245

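Target data

(1) Shop information
• Shop ID (shopId)
• Shop name
• Address
• Phone number
• Rating
• Average spend per person
• Signature dishes
• Opening hours

(2) Review content
• User nickname
• Rating
• Review text
• Review time
• Number of likes
• Images/videos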

⸻

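Scraping approach

(1) Getting the shop list

Entry points:
• Visit the search page directly to get a list of shops, for example (火锅 means hotpot):

https://www.dianping.com/search/keyword/2/0_火锅

• Alternatively, search Baidu or Google for site:dianping.com 火锅 to collect shop URLs.

Method:
• Use Selenium to scrape the dynamically loaded shop information: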

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.dianping.com/search/keyword/2/0_火锅")
    time.sleep(3)  # crude wait for the dynamically loaded list to render

    # Each shop title link carries the shop name and its detail-page URL
    shop_elements = driver.find_elements(By.CSS_SELECTOR, ".tit a")
    for shop in shop_elements:
        print(shop.text, shop.get_attribute("href"))
finally:
    driver.quit()  # always release the browser, even if an error occurs
```

• This yields each shop's name and link. Links usually look like:

https://www.dianping.com/shop/12345678

Here 12345678 is the shopId, which is used later to fetch the shop details and reviews.
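
As a small illustration of pulling the shopId out of a collected link, here is a minimal sketch. It assumes URLs follow the /shop/<id> pattern shown above; the helper name `extract_shop_id` is hypothetical, and the pattern accepts letters as well as digits on the assumption that real IDs may not be purely numeric.

```python
import re
from typing import Optional

# Assumed pattern: ".../shop/<id>", optionally followed by a query string or fragment.
SHOP_URL_RE = re.compile(r"/shop/([A-Za-z0-9]+)")


def extract_shop_id(url: str) -> Optional[str]:
    """Return the shopId embedded in a shop URL, or None if the URL does not match."""
    match = SHOP_URL_RE.search(url)
    return match.group(1) if match else None


print(extract_shop_id("https://www.dianping.com/shop/12345678"))  # 12345678
```

Returning None for non-matching links makes it easy to filter out search or category URLs that end up in the list.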

⸻

(2) Getting shop details

Use requests to fetch the shop page directly:

```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.dianping.com/"
}

shop_id = "12345678"
url = f"https://www.dianping.com/shop/{shop_id}"

response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the shop name (guard against a missing <h1>, e.g. on a CAPTCHA page)
name_tag = soup.find("h1")
shop_name = name_tag.text.strip() if name_tag else "no name"

# Extract the rating (the same selector is used for the check and the lookup)
score_tag = soup.select_one(".brief-info .mid-score")
score = score_tag.text.strip() if score_tag else "no rating"

# Extract the address
address_tag = soup.select_one(".address-info")
address = address_tag.text.strip() if address_tag else "no address"

print(f"Shop name: {shop_name}, rating: {score}, address: {address}")
```

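Possible improvements:
• Some of the data is loaded by JavaScript, so you need Selenium or have to parse the Ajax requests
• JSON data embedded in the page can be extracted with regular expressions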
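
The regex idea in the last point can be sketched as follows. This is only an illustration under stated assumptions: the variable name `window.__INITIAL_STATE__` and the fields in `sample_html` are hypothetical, so inspect the real page source to find out where (and whether) JSON is embedded.

```python
import json
import re
from typing import Any, Optional

# Hypothetical variable name; the real page may use a different one or none at all.
STATE_RE = re.compile(
    r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;\s*</script>",
    re.DOTALL,
)


def extract_embedded_json(html: str) -> Optional[Any]:
    """Return the JSON object assigned in an inline script, or None if absent or invalid."""
    match = STATE_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Made-up HTML used only to demonstrate the function.
sample_html = """
<html><body>
<script>window.__INITIAL_STATE__ = {"shopName": "Example Hotpot", "avgPrice": 120};</script>
</body></html>
"""
print(extract_embedded_json(sample_html))
```

Anchoring the pattern on the closing `;</script>` lets the non-greedy group span nested braces, and `json.loads` rejects anything that is not valid JSON.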

⸻

(3) Scraping review data

Review data is loaded through Ajax requests, from an endpoint similar to:

https://www.dianping.com/ajax/json/shopDynamic/reviewAndStar?shopId=12345678

Code example:


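```python
import requests

# Same request headers as in the shop-detail example; a valid Cookie is usually needed too
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.dianping.com/"
}

shop_id = "12345678"
url = f"https://www.dianping.com/ajax/json/shopDynamic/reviewAndStar?shopId={shop_id}"

response = requests.get(url, headers=headers, timeout=10)
data = response.json()  # parse the JSON response body

for review in data["review"]["list"]:
    user = review["user"]["nickName"]
    rating = review["reviewStar"]
    content = review["reviewData"]
    print(f"User: {user}, rating: {rating}, review: {content}")
```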

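Possible improvements:
• A Cookie must be sent with the request to get any data
• The review data may be encrypted and require reverse engineering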
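
The loop above only prints each review and raises a KeyError as soon as a field is missing. A more defensive variant might collect the rows into a `comments` list for later storage. This is a sketch that assumes the response has the same shape as in the example (`review.list[].user.nickName`, `reviewStar`, `reviewData`); the helper name `extract_reviews` and the sample data are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def extract_reviews(data: Dict[str, Any]) -> List[Tuple[str, Any, str]]:
    """Flatten the review JSON into (user, rating, content) tuples, tolerating missing keys."""
    rows = []
    reviews = (data.get("review") or {}).get("list") or []
    for review in reviews:
        user = (review.get("user") or {}).get("nickName", "")
        rating = review.get("reviewStar")  # None when the field is absent
        content = review.get("reviewData", "")
        rows.append((user, rating, content))
    return rows


# Made-up sample mirroring the assumed response shape.
sample = {"review": {"list": [
    {"user": {"nickName": "alice"}, "reviewStar": 50, "reviewData": "Great broth"},
    {"user": None, "reviewData": "No rating given"},
]}}
comments = extract_reviews(sample)
print(comments)
```

Tuples in (user, rating, content) order can be passed directly to a parameterized SQL insert or wrapped in a DataFrame.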

⸻

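Dealing with anti-scraping measures

(1) Handling Cookies and headers
• Log in manually to obtain a Cookie
• Log in with Selenium and store the Cookie
• Update the User-Agent before every request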
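
The last two points can be combined roughly as shown below. This is a sketch, not a definitive implementation: it assumes the cookies come from Selenium's `driver.get_cookies()` (a list of dicts with `name` and `value` keys); the helper names, the cookie names, and the second User-Agent string are placeholders.

```python
import random
from typing import Dict, List

# Example User-Agent strings; extend the list with values you have checked yourself.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def cookies_to_header(selenium_cookies: List[Dict]) -> str:
    """Join cookies in the driver.get_cookies() format into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def build_headers(selenium_cookies: List[Dict]) -> Dict[str, str]:
    """Build request headers with a randomly chosen User-Agent and the stored cookies."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.dianping.com/",
        "Cookie": cookies_to_header(selenium_cookies),
    }


# In real use: cookies = driver.get_cookies() after logging in with Selenium.
cookies = [{"name": "session_id", "value": "abc"}, {"name": "token", "value": "xyz"}]
headers = build_headers(cookies)
print(headers["Cookie"])  # session_id=abc; token=xyz
```

Calling `build_headers` before each request gives every request a freshly chosen User-Agent while reusing the logged-in cookies.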

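(2) Proxy pool
• Use scrapy-rotating-proxies, or requests with a proxy:

```python
# Placeholder addresses: replace your_proxy:port with a real proxy host and port,
# and use the scheme your proxy actually speaks
proxies = {
    "http": "http://your_proxy:port",
    "https": "https://your_proxy:port"
}
response = requests.get(url, headers=headers, proxies=proxies)
```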
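
When several proxies are available, one simple way to rotate them is a round-robin generator that yields a requests-style `proxies` dict per request. The sketch below assumes plain HTTP proxies; the function name `proxy_cycle` is hypothetical and the addresses are placeholders from the reserved documentation range.

```python
import itertools
from typing import Dict, Iterator, List


def proxy_cycle(proxy_urls: List[str]) -> Iterator[Dict[str, str]]:
    """Yield a requests-style proxies dict for each proxy URL, looping forever."""
    for proxy_url in itertools.cycle(proxy_urls):
        yield {"http": proxy_url, "https": proxy_url}


# Placeholder addresses (192.0.2.0/24 is reserved for documentation); use real proxies.
pool = proxy_cycle(["http://192.0.2.10:8080", "http://192.0.2.11:8080"])
first = next(pool)
second = next(pool)
third = next(pool)  # wraps around to the first proxy again
print(first["http"], second["http"], third["http"])

# Usage with requests (not executed here):
#   response = requests.get(url, headers=headers, proxies=next(pool), timeout=10)
```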

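(3) Handling CAPTCHAs
• Solve them manually (suitable for small-scale scraping)
• Use chaojiying or another CAPTCHA-solving platform for automatic recognition
• Simulate mouse dragging (Selenium + ActionChains)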
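
For the mouse-dragging option, a common trick is to split the drag into many small moves that start fast and slow down, rather than jumping the full distance at once. The sketch below only generates such a list of offsets; the function name `build_slide_track`, the ease-out curve, and the step count are illustrative assumptions, and `slider` in the usage comment stands for whatever element the real page exposes.

```python
from typing import List


def build_slide_track(distance: int, steps: int = 20) -> List[int]:
    """Split `distance` pixels into `steps` integer offsets that start fast and slow down."""
    track = []
    covered = 0
    for i in range(1, steps + 1):
        t = i / steps
        # Ease-out curve: the position rises quickly at first, then flattens.
        target = round(distance * (1 - (1 - t) ** 2))
        track.append(target - covered)
        covered = target
    return track


track = build_slide_track(200)
print(sum(track))  # 200

# Usage with Selenium (not executed here; `slider` is a hypothetical WebElement):
#   ActionChains(driver).click_and_hold(slider).perform()
#   for dx in track:
#       ActionChains(driver).move_by_offset(dx, 0).perform()
#   ActionChains(driver).release().perform()
```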

⸻

Data storage

The scraped data can be stored in a database or a CSV file:


```python
import pandas as pd

df = pd.DataFrame(comments)  # `comments` is the scraped data
df.to_csv("dianping_comments.csv", index=False, encoding="utf-8")
```

Or store it in MySQL:


```python
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="123456", database="dianping", charset="utf8mb4")
cursor = conn.cursor()

# `comments` must be a sequence of (user, rating, content) tuples
sql = "INSERT INTO reviews (user, rating, content) VALUES (%s, %s, %s)"
cursor.executemany(sql, comments)
conn.commit()
cursor.close()
conn.close()
```

This content is for learning and reference only.