1. Search tools

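When working with agents, we often rely on tools, especially to retrieve knowledge the model does not already have. Following the course outline, this section focuses on getting familiar with search tools and on understanding the difference between agentic search and regular search.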

2. A simple agentic search example

2.1 Setting up the environment

# libraries
from dotenv import load_dotenv
import os
from tavily import TavilyClient

# load environment variables from .env file
_ = load_dotenv()

# connect
client = TavilyClient(api_key=os.environ.get("TAVILY_API_KEY"))

2.2 Running a test search

# run search
result = client.search("What is in Nvidia's new Blackwell GPU?",
                       include_answer=True)

# print the answer
result["answer"]

Output:

'The Nvidia Blackwell GPU is part of the Blackwell architecture set to power the RTX 50-series graphics cards. It is designed to deliver significant performance improvements, potentially quadrupling the performance of its predecessor. The Blackwell B200 GPU is expected to deliver up to 20 petaflops of compute power and is aimed at powering the next generation of AI supercomputers. This GPU is also highlighted for its compatibility with AWS, enabling advanced virtualization and networking capabilities for AI research and development projects. The exact specifications, such as TPCs, GPCs, memory bus, and GDDR configuration of the Nvidia Blackwell GPUs, have also been leaked.'

3. Regular search

# choose location (try to change to your own city!)

city = "San Francisco"

query = f"""
    what is the current weather in {city}?
    Should I travel there today?
    "weather.com"
"""

3.1 Finding URLs that match the query

import requests
from bs4 import BeautifulSoup
from duckduckgo_search import DDGS
import re

ddg = DDGS()

def search(query, max_results=6):
    try:
        results = ddg.text(query, max_results=max_results)
        return [i["href"] for i in results]
    except Exception as e:
        print(f"returning previous results due to exception reaching ddg: {e}")
        results = [ # cover case where DDG rate limits due to high deeplearning.ai volume
            "https://weather.com/weather/today/l/USCA0987:1:US",
            "https://weather.com/weather/hourbyhour/l/54f9d8baac32496f6b5497b4bf7a277c3e2e6cc5625de69680e6169e7e38e9a8",
        ]
        return results  


for i in search(query):
    print(i)

Output:

https://weather.com/weather/tenday/l/San Francisco CA USCA0987:1:US
https://weather.com/weather/today/l/San+Francisco+CA+USCA0987
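
The query mentions "weather.com" only as a hint, so the search engine may still return pages from other sites. Below is a minimal sketch of filtering the returned URLs by host using only the standard library. The helper name `filter_by_domain` and the sample URLs are hypothetical and not part of the course code.

```python
from urllib.parse import urlparse

def filter_by_domain(urls, domain):
    """Keep only URLs whose host is `domain` or one of its subdomains."""
    kept = []
    for url in urls:
        # hostname is None for malformed URLs, so fall back to an empty string
        host = (urlparse(url).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            kept.append(url)
    return kept

# hypothetical search results, for illustration only
sample_urls = [
    "https://weather.com/weather/today/l/USCA0987:1:US",
    "https://www.weather.com/weather/tenday/l/USCA0987:1:US",
    "https://example.com/san-francisco-weather",
    "https://notweather.com/fake",
]

filtered = filter_by_domain(sample_urls, "weather.com")
for url in filtered:
    print(url)
```

Comparing the parsed host, rather than checking whether the string "weather.com" appears anywhere in the URL, avoids accepting look-alike domains such as the last sample entry.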

3.2 Scraping the page content

def scrape_weather_info(url):
    """Scrape content from the given URL"""
    if not url:
        return "Weather information could not be found."
    
    # fetch data
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return "Failed to retrieve the webpage."

    # parse result
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

# use DuckDuckGo to find websites and take the first result
url = search(query)[0]

# scrape first website
soup = scrape_weather_info(url)

print(f"Website: {url}\n\n")
print(str(soup.body)[:50000]) # limit long outputs

3.3 Extracting the data we need

# extract text
weather_data = []
for tag in soup.find_all(['h1', 'h2', 'h3', 'p']):
    text = tag.get_text(" ", strip=True)
    weather_data.append(text)

# combine all elements into a single string
weather_data = "\n".join(weather_data)

# collapse each run of whitespace (including newlines) into a single space
weather_data = re.sub(r'\s+', ' ', weather_data)

print(f"Website: {url}\n\n")
print(weather_data)

Output:

Website: https://weather.com/weather/tenday/l/San Francisco CA USCA0987:1:US


recents Specialty Forecasts 10 Day Weather - San Francisco, CA Coastal Flood Advisory Tonight Sun 21 | Night Partly to mostly cloudy. Low near 55F. Winds SW at 10 to 20 mph. Mon 22 Mon 22 | Day Partly cloudy. Expect mist and reduced visibilities at times. High 72F. Winds WSW at 10 to 20 mph. Mon 22 | Night Partly cloudy early with increasing clouds overnight. Low 56F. Winds W at 10 to 15 mph. Tue 23 Tue 23 | Day Intervals of clouds and sunshine. High 74F. Winds W at 10 to 20 mph. Tue 23 | Night Partly cloudy skies. Low 59F. Winds WNW at 10 to 15 mph. Wed 24 Wed 24 | Day Partly cloudy skies. High 73F. Winds W at 10 to 20 mph. Wed 24 | Night Partly cloudy skies during the evening. Fog developing overnight. Low 56F. Winds W at 10 to 15 mph. Thu 25 Thu 25 | Day Sunshine and clouds mixed. High around 70F. Winds W at 10 to 20 mph. Thu 25 | Night A few clouds from time to time. Low around 55F. Winds W at 10 to 20 mph. Fri 26 Fri 26 | Day Sunshine and clouds mixed. High around 65F. Winds W at 10 to 20 mph. Fri 26 | Night Partly cloudy skies during the evening will give way to cloudy skies overnight. Low 54F. Winds W at 10 to 20 mph. Sat 27 Sat 27 | Day Considerable clouds early. Some decrease in clouds later in the day. High 66F. Winds W at 10 to 20 mph. Sat 27 | Night Mostly cloudy. Low around 55F. Winds W at 10 to 20 mph. Sun 28 Sun 28 | Day Intervals of clouds and sunshine. High 67F. Winds W at 10 to 20 mph. Sun 28 | Night Partly cloudy skies in the evening, then becoming cloudy overnight. Low 54F. Winds W at 10 to 20 mph. Mon 29 Mon 29 | Day Cloudy skies early, followed by partial clearing. High 69F. Winds W at 10 to 20 mph. Mon 29 | Night Partly cloudy during the evening followed by cloudy skies overnight. Low around 55F. Winds W at 10 to 20 mph. Tue 30 Tue 30 | Day Cloudy skies early, followed by partial clearing. High 71F. Winds W at 10 to 20 mph. Tue 30 | Night Partly cloudy. Low near 55F. Winds W at 10 to 20 mph. 
Wed 31 Wed 31 | Day Intervals of clouds and sunshine. High 74F. Winds W at 10 to 20 mph. Wed 31 | Night Partly cloudy. Low 56F. Winds WNW at 10 to 15 mph. Thu 01 Thu 01 | Day Sunshine and clouds mixed. High 74F. Winds WNW at 10 to 20 mph. Thu 01 | Night Partly to mostly cloudy. Low 56F. Winds WNW at 10 to 20 mph. Fri 02 Fri 02 | Day Partly cloudy skies. High 76F. Winds WNW at 10 to 20 mph. Fri 02 | Night Partly cloudy. Low 56F. Winds W at 10 to 15 mph. Sat 03 Sat 03 | Day Sunshine and clouds mixed. High 77F. Winds W at 10 to 20 mph. Sat 03 | Night Partly cloudy early with increasing clouds overnight. Low 57F. Winds WNW at 10 to 15 mph. Sun 04 Sun 04 | Day Partly cloudy skies. High 77F. Winds WNW at 10 to 20 mph. Sun 04 | Night A few clouds from time to time. Low 57F. Winds WNW at 10 to 20 mph. Don't Miss Radar Summer Skin Essentials Home, Garage & Garden That's Not What Was Expected Outside Size Of A Cruise Ship To Infinity & Beyond Fact Versus Fiction Stay Safe Air Quality Index Air quality is considered satisfactory, and air pollution poses little or no risk. Health & Activities Seasonal Allergies and Pollen Count Forecast Grass pollen is low in your area Cold & Flu Forecast Flu risk is low in your area We recognize our responsibility to use data and technology for good. We may use or share your data with our data vendors. Take control of your data. © The Weather Company, LLC 2024
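
The scraped text is still one flat string, so getting usable values out of it takes another parsing step. Below is a rough sketch of pulling per-period forecasts out of that text with a regular expression. The pattern is an assumption inferred from the sample output above (entries shaped like "Mon 22 | Day ... High 72F."), the name `parse_forecasts` is hypothetical, and the approach would break as soon as the page layout changes.

```python
import re

# Assumed entry shape: "<Day DD> | <Day|Night> <summary> <High|Low> [near|around] <NN>F."
FORECAST_RE = re.compile(
    r"(?P<day>[A-Z][a-z]{2} \d{2}) \| (?P<period>Day|Night) "
    r"(?P<summary>.*?) "
    r"(?P<kind>High|Low)(?: near| around)? (?P<temp>\d+)F\."
)

def parse_forecasts(text):
    """Return one dict per forecast entry found in the flattened page text."""
    forecasts = []
    for m in FORECAST_RE.finditer(text):
        forecasts.append({
            "day": m.group("day"),
            "period": m.group("period"),
            "summary": m.group("summary"),
            "kind": m.group("kind"),
            "temp_f": int(m.group("temp")),
        })
    return forecasts

# short excerpt of the scraped text shown above
sample_text = (
    "Tonight Sun 21 | Night Partly to mostly cloudy. Low near 55F. "
    "Winds SW at 10 to 20 mph. Mon 22 Mon 22 | Day Partly cloudy. "
    "Expect mist and reduced visibilities at times. High 72F. "
    "Winds WSW at 10 to 20 mph. Mon 22 | Night Partly cloudy early with "
    "increasing clouds overnight. Low 56F. Winds W at 10 to 15 mph."
)

forecasts = parse_forecasts(sample_text)
for f in forecasts:
    print(f["day"], f["period"], f["kind"], f["temp_f"])
```

This kind of site-specific post-processing is exactly the extra work that regular search pushes onto the caller.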

4. Agentic search

4.1 Running the search

# run search
result = client.search(query, max_results=1)

# print first result
data = result["results"][0]["content"]

print(data)

Output:

{'location': {'name': 'San Francisco', 'region': 'California', 'country': 'United States of America', 'lat': 37.78, 'lon': -122.42, 'tz_id': 'America/Los_Angeles', 'localtime_epoch': 1721608080, 'localtime': '2024-07-21 17:28'}, 'current': {'last_updated_epoch': 1721607300, 'last_updated': '2024-07-21 17:15', 'temp_c': 16.2, 'temp_f': 61.2, 'is_day': 1, 'condition': {'text': 'Sunny', 'icon': '//cdn.weatherapi.com/weather/64x64/day/113.png', 'code': 1000}, 'wind_mph': 13.0, 'wind_kph': 20.9, 'wind_degree': 248, 'wind_dir': 'WSW', 'pressure_mb': 1012.0, 'pressure_in': 29.89, 'precip_mm': 0.0, 'precip_in': 0.0, 'humidity': 77, 'cloud': 1, 'feelslike_c': 16.2, 'feelslike_f': 61.2, 'windchill_c': 16.2, 'windchill_f': 61.2, 'heatindex_c': 16.2, 'heatindex_f': 61.2, 'dewpoint_c': 12.4, 'dewpoint_f': 54.4, 'vis_km': 10.0, 'vis_miles': 6.0, 'uv': 5.0, 'gust_mph': 16.5, 'gust_kph': 26.5}}

4.2 Formatting the output

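import ast
import json
from pygments import highlight, lexers, formatters

# parse the result: the content is a Python dict literal (single-quoted), so
# ast.literal_eval is more robust than swapping quotes and calling json.loads,
# which fails as soon as a value contains an apostrophe
parsed_json = ast.literal_eval(data)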

# pretty print JSON with syntax highlighting
formatted_json = json.dumps(parsed_json, indent=4)
colorful_json = highlight(formatted_json,
                          lexers.JsonLexer(),
                          formatters.TerminalFormatter())

print(colorful_json)

Output:

{
    "location": {
        "name": "San Francisco",
        "region": "California",
        "country": "United States of America",
        "lat": 37.78,
        "lon": -122.42,
        "tz_id": "America/Los_Angeles",
        "localtime_epoch": 1721608080,
        "localtime": "2024-07-21 17:28"
    },
    "current": {
        "last_updated_epoch": 1721607300,
        "last_updated": "2024-07-21 17:15",
        "temp_c": 16.2,
        "temp_f": 61.2,
        "is_day": 1,
        "condition": {
            "text": "Sunny",
            "icon": "//cdn.weatherapi.com/weather/64x64/day/113.png",
            "code": 1000
        },
        "wind_mph": 13.0,
        "wind_kph": 20.9,
        "wind_degree": 248,
        "wind_dir": "WSW",
        "pressure_mb": 1012.0,
        "pressure_in": 29.89,
        "precip_mm": 0.0,
        "precip_in": 0.0,
        "humidity": 77,
        "cloud": 1,
        "feelslike_c": 16.2,
        "feelslike_f": 61.2,
        "windchill_c": 16.2,
        "windchill_f": 61.2,
        "heatindex_c": 16.2,
        "heatindex_f": 61.2,
        "dewpoint_c": 12.4,
        "dewpoint_f": 54.4,
        "vis_km": 10.0,
        "vis_miles": 6.0,
        "uv": 5.0,
        "gust_mph": 16.5,
        "gust_kph": 26.5
    }
}
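
Because the agentic search result is already structured, individual fields can be read directly instead of being scraped out of free text. Below is a minimal sketch that turns such a report into a one-line summary. The function name `summarize_weather` is hypothetical, and the sample dict is a trimmed copy of the output above rather than a live API response.

```python
def summarize_weather(report):
    """Build a one-line summary from a report dict shaped like the output above."""
    location = report["location"]
    current = report["current"]
    return (
        f"{location['name']}, {location['region']}: "
        f"{current['condition']['text']}, "
        f"{current['temp_f']}°F ({current['temp_c']}°C), "
        f"wind {current['wind_mph']} mph {current['wind_dir']}, "
        f"humidity {current['humidity']}%"
    )

# trimmed copy of the parsed result shown above
sample_report = {
    "location": {"name": "San Francisco", "region": "California"},
    "current": {
        "temp_c": 16.2,
        "temp_f": 61.2,
        "condition": {"text": "Sunny"},
        "wind_mph": 13.0,
        "wind_dir": "WSW",
        "humidity": 77,
    },
}

summary = summarize_weather(sample_report)
print(summary)
```

A summary like this could then be passed back to the agent to help answer the "should I travel there today?" part of the query.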

5. Summary

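As the examples show, agentic search and regular search differ noticeably in how much code they require. Regular search needs separate steps to find URLs, scrape the page, and extract text, and it still ends with unstructured output. Agentic search returns structured data from a single call, so it is more concise and efficient and needs far less filtering.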