Talking with the Scenery: An Interactive Travel Recommendation System (Data Collection and Preprocessing)

Article link: https://blog.csdn.net/chenxucn/article/details/139338177

Contents

- 2. Crawler Design
  - 2.3 Xiaohongshu Crawler

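2. Crawler Design

Crawler design is a critical part of data collection because it directly determines both the quality and the quantity of the data. This chapter covers each aspect of that design: choice of crawling tools, crawling strategy, handling of anti-crawling mechanisms, data extraction and storage, and data cleaning. The goal is to collect high-quality travel data efficiently from platforms such as Qunar (去哪儿网), Mafengwo (马蜂窝), and Xiaohongshu (小红书).

2.3 Xiaohongshu Crawler

XiaoHongShuSpider.py is a crawler that collects posts about "东北旅游" (travel in Northeast China) from the Xiaohongshu website. It drives a browser through Selenium to simulate user behavior so that dynamically loaded data becomes available, and it parses the page content with BeautifulSoup. The code is analyzed step by step below.

2.3.1 Initialization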

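import json
import time
from datetime import datetime

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

result_list = []

chrome_driver_path = 'D:\\App\\chromedriver\\chromedriver.exe'  # replace with your own ChromeDriver path
s = Service(chrome_driver_path)
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(service=s, options=options)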

result_list = []: initializes an empty list that holds the tags extracted while crawling.
chrome_driver_path: sets the path to the ChromeDriver executable.
s = Service(chrome_driver_path): creates the ChromeDriver service.
options = webdriver.ChromeOptions(): initializes the Chrome browser options.
options.add_experimental_option('excludeSwitches', ['enable-automation']) and options.add_argument("--disable-blink-features=AutomationControlled"): configure the browser to reduce the chance that the site detects it as automated.
driver = webdriver.Chrome(service=s, options=options): launches the Chrome browser.

2.3.2 Data-processing functions

chinese_months = {
    1: '一月',
    2: '二月',
    3: '三月',
    4: '四月',
    5: '五月',
    6: '六月',
    7: '七月',
    8: '八月',
    9: '九月',
    10: '十月',
    11: '十一月',
    12: '十二月'
}

chinese_months: a dictionary that maps month numbers to Chinese month names.

def scroll_page(driver, distance):
    # Scroll to the position `distance` pixels above the bottom of the page
    driver.execute_script(f"window.scrollTo(0, document.body.scrollHeight - {distance});")
    time.sleep(1)  # wait for the page to load new content

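scroll_page(driver, distance): uses JavaScript to scroll to the position distance pixels above the bottom of the page, then waits 1 second so that newly triggered content can load.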
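A fixed number of scrolls can either stop too early or waste time once the feed is exhausted. The following is a minimal sketch of one alternative, not the author's method: keep scrolling until the page height stops growing. The name scroll_until_stable and the parameters pause and max_rounds are hypothetical; the function only assumes that driver exposes execute_script, as a Selenium WebDriver does.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Hypothetical helper: `driver` only needs an `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the last scroll loaded nothing new
        last_height = new_height
    return rounds
```

The max_rounds cap keeps the loop bounded on feeds that never stop loading.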

def remove_after_comment(text):
    index = text.find("【备注】")
    if index != -1:
        return text[:index]
    else:
        return text

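remove_after_comment(text): returns the text that precedes the first "【备注】" ("Remarks") marker, dropping the marker and everything after it. If the marker is absent, the text is returned unchanged.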
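The same truncation can be written with str.partition, which returns the whole string as its first element when the marker is missing. This is only an illustrative sketch: remove_after_marker and its marker parameter are hypothetical names, and the sample string is made up.

```python
def remove_after_marker(text, marker="【备注】"):
    """Return the part of `text` before the first `marker` (or all of it)."""
    head, _, _ = text.partition(marker)
    return head


# Made-up sample note text for illustration
sample = "Day 1: 哈尔滨冰雪大世界【备注】门票需提前预约"
print(remove_after_marker(sample))
```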

def get_season(month):
    if 3 <= month <= 5:
        return '春天'
    elif 6 <= month <= 8:
        return '夏天'
    elif 9 <= month <= 11:
        return '秋天'
    else:
        return '冬天'

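get_season(month): returns the season for a given month: March to May is spring (春天), June to August is summer (夏天), September to November is autumn (秋天), and the remaining months are winter (冬天).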

def change(text):
    if '/' in text:
        try:
            date_obj = datetime.strptime(text, '%Y/%m/%d')
            month_name = chinese_months[date_obj.month]
            season = get_season(date_obj.month)
            result_list.append(f'{month_name}')
            result_list.append(f'{season}')
        except ValueError:
            pass
    elif text.isdigit():
        if int(text) < 99:
            result_list.append(f'{text}天')
        else:
            result_list.append(f'{text}元')
    else:
        if text not in result_list:
            result_list.append(text)

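change(text): normalizes one piece of text and appends the result to result_list. A date in YYYY/MM/DD format becomes a month name and a season; a purely numeric string is treated as a number of days (天) if it is below 99 and as a price in yuan (元) otherwise; any other text is appended as-is unless it is already in the list.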
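Because change writes to the global result_list, it is awkward to test in isolation. The sketch below restates the same branching as a pure function that returns the tags instead. The names to_tags, season_of, and CHINESE_MONTHS are hypothetical, and unlike change, this variant leaves de-duplication to the caller.

```python
from datetime import datetime

CHINESE_MONTHS = ['一月', '二月', '三月', '四月', '五月', '六月',
                  '七月', '八月', '九月', '十月', '十一月', '十二月']


def season_of(month):
    """Same month-to-season mapping as get_season."""
    if 3 <= month <= 5:
        return '春天'
    if 6 <= month <= 8:
        return '夏天'
    if 9 <= month <= 11:
        return '秋天'
    return '冬天'


def to_tags(text):
    """Return the tags for one piece of text without touching any global state."""
    if '/' in text:
        try:
            date_obj = datetime.strptime(text, '%Y/%m/%d')
        except ValueError:
            return []  # not a YYYY/MM/DD date: ignored, as in change()
        return [CHINESE_MONTHS[date_obj.month - 1], season_of(date_obj.month)]
    if text.isdigit():
        unit = '天' if int(text) < 99 else '元'  # small numbers are days, larger ones prices
        return [f'{text}{unit}']
    return [text]


print(to_tags('2024/01/15'))
print(to_tags('5'))
print(to_tags('3000'))
```

Note that the threshold of 99 is a heuristic taken from the original code: a trip of 99 days or more, or a price below 99 yuan, would be labeled with the wrong unit.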

2.3.3 Collecting article links

driver.get('https://www.xiaohongshu.com/search_result?keyword=%25E4%25B8%259C%25E5%258C%2597%25E6%2597%2585%25E6%25B8%25B8&source=web_explore_feed')
time.sleep(5)  # wait for the page to load

driver.get(...): opens the search results page for the keyword.
time.sleep(5): pauses for 5 seconds to give the page time to load.
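The keyword parameter in the URL above is the search phrase 东北旅游 percent-encoded twice: decoding it twice yields the original phrase. The sketch below rebuilds the URL with urllib.parse.quote so that a different keyword can be substituted. Whether the site actually requires the double encoding is not stated in the source; the sketch simply reproduces the URL as given.

```python
from urllib.parse import quote, unquote

keyword = "东北旅游"
encoded_once = quote(keyword)        # UTF-8 bytes, percent-encoded
encoded_twice = quote(encoded_once)  # every '%' becomes '%25'

search_url = (
    "https://www.xiaohongshu.com/search_result"
    f"?keyword={encoded_twice}&source=web_explore_feed"
)
print(search_url)
print(unquote(unquote(encoded_twice)))  # decodes back to the original keyword
```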

for _ in range(10):  # adjust the count as needed so that enough content is loaded
    scroll_page(driver, 1000)

The scroll_page function is called repeatedly so that the page loads more results.

soup = BeautifulSoup(driver.page_source, 'html.parser')
article_links = []
for a_tag in soup.find_all('a', href=True):
    href = a_tag['href']
    if '/explore/' in href:
        article_links.append('https://www.xiaohongshu.com' + href)

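soup = BeautifulSoup(driver.page_source, 'html.parser'): parses the rendered page with BeautifulSoup.
The loop then walks over every a tag, keeps the links whose href contains /explore/, prefixes them with the site domain, and stores them in the article_links list.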
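The loop above can collect the same link several times if a note appears in more than one a tag. As a rough sketch of the same filtering with de-duplication, the example below uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies. ExploreLinkParser, extract_links, and the sample HTML are all hypothetical and do not reflect the real page markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.xiaohongshu.com"


class ExploreLinkParser(HTMLParser):
    """Collect unique links whose href contains '/explore/'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or "/explore/" not in href:
            return
        url = urljoin(BASE_URL, href)  # leaves already-absolute URLs unchanged
        if url not in self._seen:
            self._seen.add(url)
            self.links.append(url)


def extract_links(html):
    parser = ExploreLinkParser()
    parser.feed(html)
    return parser.links


# Made-up markup for illustration only
sample_html = """
<a href="/explore/abc123">note 1</a>
<a href="/explore/abc123">note 1 again</a>
<a href="/user/profile/42">profile</a>
<a href="https://www.xiaohongshu.com/explore/def456">note 2</a>
"""
print(extract_links(sample_html))
```

Using urljoin rather than string concatenation also avoids producing a malformed URL when an href is already absolute.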

with open('小红书东北链接.txt', 'w', encoding='utf-8') as file:
    for link in article_links:
        file.write(link + '\n')

All article links are written to the file 小红书东北链接.txt, one per line.

2.3.4 Fetching article content

with open('小红书东北链接.txt', 'r', encoding='utf-8') as file:
    for line in file:
        result_list = []
        line = line.strip()
        last_part = line.rsplit('/', 1)[-1]  # note ID at the end of the URL
        time.sleep(1)  # brief pause between requests
        driver.get(line)
        driver.maximize_window()

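The previously saved link file is read and each link is processed line by line.
For each link, the crawler opens the page and maximizes the browser window.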

        try:
            wait = WebDriverWait(driver, 10)
            wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'note-content')))
        except TimeoutException:
            continue

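An explicit wait blocks for up to 10 seconds until the note-content element is visible. If it does not appear in time, the link is skipped.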

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        content_div = soup.find('div', {'class': 'note-content'})
        if not content_div:
            continue
        text_content = content_div.get_text(strip=True, separator='\n')

BeautifulSoup parses the article page and extracts the text inside the note-content element. Links without that element are skipped.

        data = {
            'instruction': '',
            'summary': '',
            'output': remove_after_comment(text_content)
        }

        with open('小红书东北.json', 'a', encoding='utf-8') as out_file:
            json.dump(data, out_file, ensure_ascii=False, indent=4)
            out_file.write('\n')
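Each article is stored as a record with empty instruction and summary fields and the cleaned text in output. Appending pretty-printed objects this way produces a file of concatenated JSON objects, which json.load cannot parse in a single call. One possible alternative, shown here only as a sketch, is the JSON Lines layout with one compact object per line. append_record, load_records, and the .jsonl file name used in the usage example are hypothetical.

```python
import json


def append_record(path, record):
    """Append one record as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as out_file:
        out_file.write(json.dumps(record, ensure_ascii=False) + '\n')


def load_records(path):
    """Read the file back, one JSON object per non-empty line."""
    with open(path, 'r', encoding='utf-8') as in_file:
        return [json.loads(line) for line in in_file if line.strip()]
```

A call such as append_record('notes.jsonl', data) would replace the json.dump step, and load_records('notes.jsonl') returns the records as a list of dictionaries for later preprocessing.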