Overview
While learning web scraping, I found that most online tutorials target structurally simple static pages such as campus information sites; tutorials on capturing information from dynamic pages are much less complete. Using the task of scraping 200 articles from the Forbes business news site as an example, this post provides a detailed walkthrough.
Configuring chromedriver
One option is to manually install a chromedriver that matches your current Chrome version; see https://blog.csdn.net/m0_63229791/article/details/139396077 and https://blog.csdn.net/weixin_67531112/article/details/128207021 .
However, the official chromedriver downloads only go up to version 114.x, while my Chrome was already on 127.x, so that approach clearly no longer applies.
A more convenient option is to let webdriver_manager download and configure a matching ChromeDriver automatically at startup:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def init_driver():
    # Initialize ChromeDriver
    # Configure Chrome options
    options = webdriver.ChromeOptions()
    # Let webdriver_manager download and configure a matching ChromeDriver
    service = Service(ChromeDriverManager().install())
    # Launch the browser
    driver = webdriver.Chrome(service=service, options=options)
    return driver
Clicking the homepage button automatically
The homepage does not render every article at once; there is usually a "load more" button, so to collect more articles we need to click that button before scraping.
# Try to click the "load more" button
try:
    button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable(
            (By.CSS_SELECTOR, 'button._18BedXz4.iWceBwQC.Sn26m-xQ.st6yY9Jv[data-testid="variants"]'))
    )
    button.click()
    print("Button clicked successfully")
    time.sleep(5)
except Exception as e:
    print(f"Could not click the button: {e}")
WebDriverWait(driver, 10) sets a timeout: the maximum time to wait for the element. If the element becomes ready within that window, the wait returns early instead of running out the clock.
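The underlying idea, polling a condition until it succeeds or the timeout expires, can be sketched in plain Python. The helper name wait_until below is my own and is not part of Selenium; it only illustrates the semantics:

```python
import time

def wait_until(predicate, timeout=10.0, poll=0.1):
    """Poll predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result  # ready early: return immediately, no need to wait out the timeout
        time.sleep(poll)
    raise TimeoutError("condition not met within timeout")

# Example: a condition that only becomes true on the third poll
state = {"calls": 0}
def ready():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until(ready, timeout=2.0, poll=0.01))  # True, returned on the 3rd poll
```

Selenium's WebDriverWait does essentially this, with the predicate supplied by an expected_conditions helper such as element_to_be_clickable.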
By.CSS_SELECTOR, 'button._18BedXz4.iWceBwQC.Sn26m-xQ.st6yY9Jv[data-testid="variants"]' is the locator for the Forbes "load more" button:
- the tag type is button
- the class is that long _18BedXz4.iWceBwQC... string
- [data-testid="variants"] narrows the match further

If this is hard to work out, you can also copy the element's XPath from the browser devtools and locate it with By.XPATH instead.
Scraping valid URLs from the homepage
Once enough articles have been loaded, we can scrape the page:
# Locate the div block that contains the article links
csf_block = driver.find_element(By.CSS_SELECTOR, 'div.ZQt9W[data-test-e2e="stream articles"]')
# Collect all article links inside that block
articles = csf_block.find_elements(By.CSS_SELECTOR, 'a[href^="https://www.forbes.com/sites/"]')
urls = [article.get_attribute('href') for article in articles]
# Serialize the link list to a JSON string
urls_json = json.dumps(urls, indent=4)
# Save the JSON string to a file
with open('articles.json', 'w') as file:
    file.write(urls_json)
with open('articles.json', 'r') as file:
    urls = json.load(file)
# Deduplicate, and drop URLs that lack a date segment
filtered_urls = list(set(url for url in urls if '/2024' in url))
# Save the filtered URL list back to a JSON file
with open('filtered_articles.json', 'w') as file:
    json.dump(filtered_urls, file, indent=4)
with open('filtered_articles.json', 'r') as file:
    urls = json.load(file)
print(f"Collected {len(filtered_urls)} valid URLs. Starting extraction.")
Some of the JSON save/load steps could be omitted; I wrote them so I could inspect intermediate results while developing. (It was only after inspecting the output that I added the filtered_urls step: the raw extraction contains many duplicate URLs, because an article's title and its image share the same link, so every clickable area leads to the same address.)
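The dedup-and-filter step can be pulled out into a small pure function, which makes it easy to check against sample data. The helper name filter_article_urls is my own:

```python
def filter_article_urls(urls, year="2024"):
    """Deduplicate URLs and keep only those containing a /<year> date segment."""
    return sorted(set(url for url in urls if f"/{year}" in url))

sample = [
    "https://www.forbes.com/sites/a/2024/08/20/story-one/",
    "https://www.forbes.com/sites/a/2024/08/20/story-one/",  # duplicate (image link)
    "https://www.forbes.com/sites/b/2023/01/01/old-story/",  # wrong year: dropped
    "https://www.forbes.com/sites/c/2024/08/21/story-two/",
]
print(filter_article_urls(sample))  # the two unique 2024 URLs remain
```

Using sorted(...) instead of list(set(...)) also makes the output order deterministic, which helps when diffing the saved JSON between runs.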
Extracting a single article's information
The code below extracts an article's title, date, author name, and body text, and saves them together with its URL.
def extract_article_data(driver, article_url):
    clear_cache(driver)
    driver.get(article_url)
    time.sleep(2)  # pause 2 seconds to mimic a human visitor
    try:
        # title = driver.find_element(By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[1]/div/h1').text
        title_element = WebDriverWait(driver, 60).until(
            EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[1]/div/h1'))
        )
        title = title_element.text
        time.sleep(1)  # pause 1 second to mimic a human visitor
        # Wait so that the <time> tag has finished loading
        try:
            date_element = WebDriverWait(driver, 90).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.content-data time'))
            )
        except Exception:
            try:
                date_element = WebDriverWait(driver, 90).until(
                    EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.content-data.metrics-text.color-body.light-text.color-base.default-color.masthead-metrics-text.metrics-text.masthead-alignment time'))
                )
            except Exception as e:
                print(f"Error extracting date: {e}")
                date_element = None
        date = date_element.text if date_element else ""
        time.sleep(1)  # pause 1 second to mimic a human visitor
        try:
            author_element = WebDriverWait(driver, 60).until(
                EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[2]/div/div[1]/div/div/div[1]/span'))
            )
        except Exception:
            try:
                author_element = WebDriverWait(driver, 60).until(
                    EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[2]/div/div[1]/div/div/div/span'))
                )
            except Exception as e:
                print(f"Error extracting author: {e}")
                author_element = None
        author = author_element.text if author_element else ""
        # author = driver.find_element(By.CSS_SELECTOR, 'span.fs-author-name a').text
        time.sleep(1)  # pause 1 second to mimic a human visitor
        # Wait for the article body to finish loading
        # Try the first selector
        try:
            content_element = WebDriverWait(driver, 240).until(
                EC.visibility_of_element_located(
                    (By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[3]/div[1]')
                )
            )
        except Exception:
            # Fall back to the second selector
            try:
                content_element = WebDriverWait(driver, 240).until(
                    EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, 'div.article-body.fs-article.fs-responsive-text.current-article')
                    )
                )
            except Exception as e:
                print(f"Error extracting content: {e}")
                content_element = None
        if content_element is None:
            return None
        paragraphs = content_element.find_elements(By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[3]/div[1]//p')
        # paragraphs = content_element.find_elements(By.CSS_SELECTOR, 'div.article-body.fs-article.fs-responsive-text.current-article > p')
        content = "\n".join([p.text for p in paragraphs])
        return {"title": title, "date": date, "author": author, "url": article_url, "content": content}
    except Exception as e:
        print(f"Error extracting article: {e}")
        return None
During extraction I use two locator strategies, By.CSS_SELECTOR and By.XPATH, so that as much of the information as possible can be found. For the body, extraction is simplified to a search-and-join: once the content block is located, only the <p> elements directly under it are collected (no deeper recursive search), and the paragraphs are joined with newline characters. This reduces unnecessary search time (more on that below).
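The join step itself is plain string handling and can be factored out; the helper name join_paragraphs is my own, and I also drop whitespace-only paragraphs, which the original one-liner keeps:

```python
def join_paragraphs(texts):
    """Join paragraph texts with newlines, skipping empty or whitespace-only ones."""
    return "\n".join(t.strip() for t in texts if t.strip())

paras = ["First paragraph.", "  ", "Second paragraph.", ""]
print(join_paragraphs(paras))  # two lines; the blank entries are dropped
```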
A few details
With the above in place, I found during scraping that the first few articles went through fine, but later ones could barely be extracted at all, and stack-related errors kept appearing. Suspecting a capacity problem, I added a cache-clearing step and a periodic driver restart:
def clear_cache(driver):
    # Clear browser cookies (note: delete_all_cookies removes cookies, not the HTTP cache itself)
    driver.delete_all_cookies()
and:
# Restart the WebDriver instance every 4 articles
if (idx + 1) % 4 == 0:
    driver.quit()
    driver = init_driver()
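The restart-every-N pattern can be exercised with a stub driver to confirm it recycles at the right points. The FakeDriver class below is purely illustrative and stands in for a real Selenium driver:

```python
class FakeDriver:
    """Stand-in for a Selenium driver; only counts how many instances were created."""
    instances = 0
    def __init__(self):
        FakeDriver.instances += 1
    def quit(self):
        pass

def init_driver():
    return FakeDriver()

driver = init_driver()
for idx in range(10):
    # ... scrape article number idx here ...
    # Restart the driver every 4 articles
    if (idx + 1) % 4 == 0:
        driver.quit()
        driver = init_driver()

print(FakeDriver.instances)  # 3: the initial driver plus restarts after articles 4 and 8
```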
Summary
The complete code:
import json
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

def init_driver():
    # Initialize ChromeDriver
    # Configure Chrome options
    options = webdriver.ChromeOptions()
    # Let webdriver_manager download and configure a matching ChromeDriver
    service = Service(ChromeDriverManager().install())
    # Launch the browser
    driver = webdriver.Chrome(service=service, options=options)
    return driver

def clear_cache(driver):
    # Clear browser cookies (note: delete_all_cookies removes cookies, not the HTTP cache itself)
    driver.delete_all_cookies()

# Extract the data of a single article
def extract_article_data(driver, article_url):
    clear_cache(driver)
    driver.get(article_url)
    time.sleep(2)  # pause 2 seconds to mimic a human visitor
    try:
        # title = driver.find_element(By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[1]/div/h1').text
        title_element = WebDriverWait(driver, 60).until(
            EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[1]/div/h1'))
        )
        title = title_element.text
        time.sleep(1)  # pause 1 second to mimic a human visitor
        # Wait so that the <time> tag has finished loading
        try:
            date_element = WebDriverWait(driver, 90).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.content-data time'))
            )
        except Exception:
            try:
                date_element = WebDriverWait(driver, 90).until(
                    EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.content-data.metrics-text.color-body.light-text.color-base.default-color.masthead-metrics-text.metrics-text.masthead-alignment time'))
                )
            except Exception as e:
                print(f"Error extracting date: {e}")
                date_element = None
        date = date_element.text if date_element else ""
        time.sleep(1)  # pause 1 second to mimic a human visitor
        try:
            author_element = WebDriverWait(driver, 60).until(
                EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[2]/div/div[1]/div/div/div[1]/span'))
            )
        except Exception:
            try:
                author_element = WebDriverWait(driver, 60).until(
                    EC.visibility_of_element_located((By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[2]/div/div[1]/div/div/div/span'))
                )
            except Exception as e:
                print(f"Error extracting author: {e}")
                author_element = None
        author = author_element.text if author_element else ""
        # author = driver.find_element(By.CSS_SELECTOR, 'span.fs-author-name a').text
        time.sleep(1)  # pause 1 second to mimic a human visitor
        # Wait for the article body to finish loading
        # Try the first selector
        try:
            content_element = WebDriverWait(driver, 240).until(
                EC.visibility_of_element_located(
                    (By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[3]/div[1]')
                )
            )
        except Exception:
            # Fall back to the second selector
            try:
                content_element = WebDriverWait(driver, 240).until(
                    EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, 'div.article-body.fs-article.fs-responsive-text.current-article')
                    )
                )
            except Exception as e:
                print(f"Error extracting content: {e}")
                content_element = None
        if content_element is None:
            return None
        paragraphs = content_element.find_elements(By.XPATH, '/html/body/div[1]/article/main/div[2]/div[2]/div[1]/div[3]/div[1]//p')
        # paragraphs = content_element.find_elements(By.CSS_SELECTOR, 'div.article-body.fs-article.fs-responsive-text.current-article > p')
        content = "\n".join([p.text for p in paragraphs])
        return {"title": title, "date": date, "author": author, "url": article_url, "content": content}
    except Exception as e:
        print(f"Error extracting article: {e}")
        return None

# Main function: scrape articles and save them as JSON
def scrape_forbes_articles():
    driver = init_driver()
    url = "https://www.forbes.com/business/?sh=74da965e535f"
    articles_data = []
    try:
        filtered_urls = []
        driver.get(url)
        # Keep loading until enough valid URLs have been collected
        # (400 leaves headroom for extracting 200 articles)
        while len(filtered_urls) < 400:
            time.sleep(3)  # pause 3 seconds to mimic a human visitor
            # Try to click the "load more" button
            try:
                button = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable(
                        (By.CSS_SELECTOR, 'button._18BedXz4.iWceBwQC.Sn26m-xQ.st6yY9Jv[data-testid="variants"]'))
                )
                button.click()
                print("Button clicked successfully")
                time.sleep(5)
            except Exception as e:
                print(f"Could not click the button: {e}")
            # Locate the div block that contains the article links
            csf_block = driver.find_element(By.CSS_SELECTOR, 'div.ZQt9W[data-test-e2e="stream articles"]')
            # Collect all article links inside that block
            articles = csf_block.find_elements(By.CSS_SELECTOR, 'a[href^="https://www.forbes.com/sites/"]')
            urls = [article.get_attribute('href') for article in articles]
            # Serialize the link list to JSON and save it
            urls_json = json.dumps(urls, indent=4)
            with open('articles.json', 'w') as file:
                file.write(urls_json)
            with open('articles.json', 'r') as file:
                urls = json.load(file)
            # Deduplicate, and drop URLs that lack a date segment
            filtered_urls = list(set(url for url in urls if '/2024' in url))
            # Save the filtered URL list back to a JSON file
            with open('filtered_articles.json', 'w') as file:
                json.dump(filtered_urls, file, indent=4)
            with open('filtered_articles.json', 'r') as file:
                urls = json.load(file)
        print(f"Collected {len(filtered_urls)} valid URLs. Starting extraction.")
        for idx, article_url in enumerate(urls):
            article_data = extract_article_data(driver, article_url)
            if article_data:
                articles_data.append(article_data)
                # Save the accumulated data to JSON after each extracted article
                with open('forbes_business_articles.json', 'w', encoding='utf-8') as f:
                    json.dump(articles_data, f, ensure_ascii=False, indent=4)
                if len(articles_data) >= 200:
                    break
            time.sleep(3)  # pause 3 seconds between articles to mimic human reading
            # Restart the WebDriver instance every 4 articles
            if (idx + 1) % 4 == 0:
                driver.quit()
                driver = init_driver()
    finally:
        driver.quit()
    # Save the data to a JSON file
    with open('forbes_business_articles.json', 'w', encoding='utf-8') as f:
        json.dump(articles_data, f, ensure_ascii=False, indent=4)
    print(f"Saved {len(articles_data)} articles to 'forbes_business_articles.json'.")

# Run the scraper
scrape_forbes_articles()
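Because the results file is rewritten after every article, a crashed run loses very little, and it can be resumed by skipping URLs that are already saved. A minimal sketch; the helper name load_done_urls is my own:

```python
import json
import os

def load_done_urls(path="forbes_business_articles.json"):
    """Return the set of URLs already present in the results file, if it exists."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return {item["url"] for item in json.load(f)}

# Usage inside the main extraction loop:
# done = load_done_urls()
# for idx, article_url in enumerate(urls):
#     if article_url in done:
#         continue  # already scraped in a previous run
#     ...
```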
The final results look like this:
{
"title": "FC Barcelona Turns Attention To Signing Rafael Leao, Reports Mundo Deportivo",
"date": "Aug 20, 2024,",
"author": "Tom Sanderson\nSenior Contributor",
"url": "https://www.forbes.com/sites/tomsanderson/2024/08/20/fc-barcelona-turns-attention-to-signing-rafael-leao-reports-mundo-deportivo/",
"content": "FC Barcelona has turned its attention towards trying to sign highly-rated AC Milan forward Rafael Leao according to Mundo Deportivo, which cited anonymous sources in an exclusive report.\nBarca has thus far failed to land Euro 2024 winner Nico Williams, as raising his Athletic Club release clause of $68.7 million (€62 million) including taxes is a difficult task during times of financial hardship.\nThe Catalans were able to hash out a deal for his Spain teammate Dani Olmo, though that involved less fixed money and variables to take the final amount RB Leipzig will receive for the former La Masia product past $60 million (€54 million).\nIf MD's report is accurate, however, it is confusing that Barca would try to approach Rafael Leao in these circumstances.\nThe newspaper quotes the Portuguese winger's asking price as $100 million (€90 million), but it is hoped that President Joan Laporta can make the most of his relationship with super agent Jorge Mendes to work something out with Leao's Italian employers.\nMendes represents Leao at the San Siro, and though there are alternatives to both the 25-year-old and Williams elsewhere, such as Kingsley Coman and Federico Chiesa, Laporta wants to finish what has been one of the most underwhelming summer transfer windows in club history by making a marquee signing that would perhaps hush some of the grumbles that many Culers are beginning to have about his tenure and the dealings of Sporting Director Deco.\nLaporta is said to be an admirer of Leao's, just as he was of another Mendes client in Joao Felix.\nAfter his loan at Barca finished on June 30, however, it now looks as though Felix is on his way to sign permanently for Chelsea from Atletico Madrid as part of a complicated operation which also - whether directly or not - entails Conor Gallagher heading to the Spanish capital.\nSome potential developments that could help free up much-needed cash for Barca to deliver Leao include the potential exits of Vitor Roque and Ilkay Gundogan.\nWhile the Brazilian is being linked to Sporting Lisbon, Gundogan could possibly rejoin his former club Manchester City in what would be a sensational return to the Etihad for the German."
},
{
"title": "Oprah Endorses Kamala Harris At DNC: Her Political History—And Rare Endorsements—Explained",
"date": "Aug 21, 2024,",
"author": "Sara Dorn\nForbes Staff",
"url": "https://www.forbes.com/sites/saradorn/2024/08/21/oprah-appearing-at-dnc-tonight-what-to-know-about-her-political-history-and-rare-endorsements/",
"content": "Oprah Winfrey spoke Wednesday at the Democratic National Convention—offering her coveted endorsement to Vice President Kamala Harris and continuing her support for Democratic candidates.\nWinfrey made her political convention debut in Wednesday’s surprise appearance, telling the crowd that “soon, very soon, we’re going to be teaching our daughters and sons about how this child of an Indian mother and a Jamaican father . . . grew up to become the 47th President of the United States,” referring to Harris. Criticizing GOP vice presidential nominee Sen. JD Vance, R-Ohio, over his widely criticized insult of “childless cat ladies,” Winfrey said “when a house is on fire . . . if the place happens to belong to a childless cat lady — well we try to get that cat out too.” In an appeal to independent voters, Winfrey said she is “a registered independent who is proud to vote again and again and again because I’m a proud American and that’s what Americans do.”\nSome research on Winfrey’s Obama endorsement contradict the University of Maryland findings. Nearly 70% of Americans said Winfrey’s endorsement would not influence their vote and 15% said it would make them less likely to vote for a candidate, according to a 2007 Pew Research Center study.\nWinfrey has donated relatively small sums (in comparison with her net worth) to various state and federal Democratic political candidates and committees since 1992, according to OpenSecrets, which shows her most recent donations were in 2022.\nWe estimate Winfrey’s net worth at about $3 billion, making her America’s 14th-wealthiest self-made woman. She has parlayed her success as a daytime talk show host—which ended in 2011—into a sizable media empire.\nWinfrey spoke after former President Bill Clinton and former House Speaker Nancy Pelosi. Democratic vice presidential nominee, Minnesota Gov. Tim Walz, will close out the third day of the convention with the final speech of the night. Singer Stevie Wonder performed earlier in the evening Wednesday, and singer John Legend is scheduled to perform during the 10 p.m. EDT hour.\nKamala Harris’ Approval Rating Jumps In Weeks After Becoming Nominee As Democrats Increasingly Back Her (Forbes)\nTim Walz-JD Vance Polls: Walz More Popular Than Vance In Early Surveys (Forbes)\nTrump Vs. Harris 2024 Polls: Harris Leads By 3 Points Halfway Through DNC (Forbes)"
},
{
"title": "Tim Walz-JD Vance Polls: Walz More Popular Than Vance In Early Surveys",
"date": "Aug 21, 2024,",
"author": "Molly Bohannon\nForbes Staff",
"url": "https://www.forbes.com/sites/mollybohannon/2024/08/21/tim-walz-jd-vance-polls-walz-more-popular-than-vance-in-early-surveys/",
"content": "Ohio Sen. JD Vance and Minnesota Gov. Tim Walz have each been thrust into the spotlight as the running mate picks of former President Donald Trump and Vice President Kamala Harris, and a new poll released Wednesday shows more Americans hold Walz in a higher regard than Vance—though many voters are still unfamiliar with both candidates.\nGet Forbes Breaking News Text Alerts: We’re launching text message alerts so you'll always know the biggest stories shaping the day’s headlines. Text “Alerts” to (201) 335-0739 or sign up here.\nWalz will address the Democratic National Convention in Chicago late Wednesday, ahead of Harris’ acceptance speech Thursday.\nHow the numbers shift as Walz becomes more of a household name. Recent polling reflects that Walz is still significantly lesser known than Vance: By the time Walz was announced as Harris’ running mate, Vance had been campaigning as Trump’s pick for more than two weeks, which could mean his favorability and unfavorability numbers will be higher as voters have had more time to form an opinion on him. Before getting the nod from Trump, Vance was a fairly well-known senator, whereas Walz was somewhat lower-profile.\nIn the past, experts have been skeptical of whether vice presidential picks have much influence on an election’s outcome, but with November's race expected to be highly contested, even small boosts from vice presidential picks could be game changers. In the latest FiveThirtyEight averages, Harris had virtually erased the growing lead Trump had on President Joe Biden and is now ahead by nearly three points. Both campaigns have cast their vice presidential picks as politicians who can speak to voters in key midwestern swing states, though Vance is also seen as a Trump loyalist and an appeal to his MAGA base, and some parts of the Democratic base pushed for Walz due to his support for some progressive priorities. Joel Goldstein, professor emeritus at Saint Louis University and an expert on vice presidencies, recently told Minnesota Public Radio that while most people are going to vote on their perception of the presidential candidates, the vice presidential picks provide insight into how they make decisions, which can help voters decide who to support. Running mates can also make differences in their home states, Goldstein said—though in this case, both Minnesota and Ohio are unlikely to be key swing states.\nWe estimate Vance is worth about $10 million, while Walz has an estimated net worth of just north of $1 million. Vance has made his money on his best-selling memoir, “Hillbilly Elegy,” along with real estate investments that we estimate total about $4 million. Walz, on the other hand, owns no property, stocks or bonds and his wealth is based on his and his wife’s pensions for their work in teaching and government."
},
{
"title": "Prince Charts A Pair Of Top 10s On The Same Ranking With Two Longtime Favorites",
"date": "Aug 20, 2024,",
"author": "Hugh McIntyre\nSenior Contributor",
"url": "https://www.forbes.com/sites/hughmcintyre/2024/08/20/prince-charts-a-pair-of-top-10s-on-the-same-ranking-with-two-longtime-favorites/",
"content": "Prince was the kind of musician who loved to use his music for different purposes. He didn’t just record songs and albums and release them—he tried new things and entered new fields with his work. Known as an incredible live performer, he also found great success in the movie world, in addition to the music industry.\nTwo of Prince’s most popular forays into the film business are performing incredibly well in the U.K. this week. The rocker fills several spaces on the Official Soundtrack Albums chart, and both of his current wins are living inside the highest tier on the ranking of the bestselling soundtrack albums in the country.\nPrince occupies a pair of spaces inside the top 10 alone on the Official Soundtrack Albums tally. One of his projects climbs, while the other returns to the list—impressively inside the loftiest tier.\nPurple Rain rises ever so slightly this time around. The blockbuster release lifts from No. 6 to No. 5, returning to the highest half of the top 10.\nMeanwhile, the full-length that accompanied the Batman movie released in 1989 is back on the chart. The set reappears at No. 10 on the Official Soundtrack Albums list in the U.K. this week. It’s previously climbed as high as No. 4, a position it’s not too far away from at the moment.\nPurple Rain is one of the longest-charting and most successful soundtracks of all time. The title has now spent 924 frames on the Official Soundtrack Albums tally in the U.K. Batman, meanwhile, has only managed 14 stays.\nPrince’s Batman project is one of several returning champions to the Official Soundtrack Albums chart, and it ranks as the loftiest of the bunch. Further down on the roster come titles like High School Musical, Drive, Les Miserables (a staged concert version), and Dirty Dancing, among others."
},
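Note the trailing comma in the scraped date strings ("Aug 20, 2024,"). If you need structured dates downstream, a small normalizer can clean them up; the helper name parse_forbes_date is my own:

```python
from datetime import datetime

def parse_forbes_date(raw):
    """Parse a scraped date like 'Aug 20, 2024,' into a datetime.date."""
    cleaned = raw.strip().rstrip(",").strip()
    return datetime.strptime(cleaned, "%b %d, %Y").date()

print(parse_forbes_date("Aug 20, 2024,"))  # 2024-08-20
```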
Done! The same approach can be adapted to just about any dynamic site.