Lend Me a Sharp Eye! Scraping Key Data from a Web Page in Just Three Minutes

This article is 1,589 characters long, with an estimated reading time of 10 minutes.


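Some say data will replace oil as one of the scarcest resources of the future. Whether or not that claim holds, data or information (in any form) has undoubtedly become one of the most valuable intangible assets of the 21st century.

Data is extremely powerful and has many uses: it can predict future sales trends to generate profit, and it can be used in healthcare to diagnose early-stage tuberculosis and save patients' lives. What a data scientist has to work out is how to extract valuable data from all kinds of sources.

This article will help you master an essential skill of the data age: how to extract data from a website using Python libraries. I will demonstrate by extracting news stories about different sports, such as cricket, badminton, and tennis, from the inshorts website.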

Step 1: Import the required libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

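Step 2: Make a web request and parse the response with BeautifulSoup

First, look at the source code for a specific news category. On the page you will see different kinds of news; focus on one particular story and use Beautiful Soup to extract its source code. The news article can be viewed alongside its corresponding source code.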


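Use the requests library and call .get() on the URL to retrieve the page's HTML. Then use the Beautiful Soup library to parse that HTML in Python. Depending on the type of information you want to extract, you can use the .find() function to filter it out of different HTML tags (for example <div> and <span>).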

dummy_url = "https://inshorts.com/en/read/badminton"
data_dummy = requests.get(dummy_url)
soup = BeautifulSoup(data_dummy.content, 'html.parser')
soup
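
The request above assumes the server answers quickly and successfully. As a minimal sketch of a more defensive variant, the helper names fetch_soup and parse_html and the 10-second timeout below are assumptions for illustration and do not come from the original code:

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Parse an HTML string or bytes with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper).

    The 10-second timeout is an assumed value, not taken from the article.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return parse_html(response.content)


# Offline check on an inline snippet, so no network access is needed.
sample = parse_html("<html><head><title>Badminton News</title></head></html>")
print(sample.title.string)
```

Splitting parsing from downloading keeps the parsing step easy to try out on saved or hand-written HTML before pointing it at a live site.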

After completing the steps above and parsing the HTML, we can inspect the parsed markup for this particular news story.

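We can see that the article's title sits under the <div class="news-card-title news-right-box"> class. Looking further, the title is inside a <span> tag whose itemprop attribute is "headline", so it can be accessed with the .find() function.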

news1 = soup.find_all('div', class_=["news-card-title news-right-box"])[0]
title = news1.find('span', attrs={'itemprop': "headline"}).string
print(title)

We get the following output:

Shuttler Jayaram wins Dutch Open Grand Prix
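
To see how this lookup behaves without hitting the live site, here is a small sketch run on a hand-written HTML snippet. The markup is an assumption modeled on the class and attribute names used in the code above, not a copy of the real page, and the headlines are made up.

```python
from bs4 import BeautifulSoup

# Hand-written markup (an assumption) that mimics the structure described above.
html = """
<div class="news-card-title news-right-box">
  <a href="#"><span itemprop="headline">Example headline one</span></a>
</div>
<div class="news-card-title news-right-box">
  <a href="#"><span itemprop="headline">Example headline two</span></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A class_ string containing a space is matched against the exact class attribute value.
cards = soup.find_all("div", class_="news-card-title news-right-box")

# .find() returns the first matching descendant of each card.
titles = [card.find("span", attrs={"itemprop": "headline"}).string for card in cards]
print(titles)
```

Here find_all() collects every title card, while find() narrows each card down to the single tag that carries the headline.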

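Similarly, to access the news content, note that it sits under the <div class="news-card-content news-right-box"> class. We can also see that the body of the story is inside a <div> tag whose itemprop attribute is "articleBody", so it can be accessed with the .find() function.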

news1 = soup.find_all('div', class_=["news-card-content news-right-box"])[0]
content = news1.find('div', attrs={'itemprop': "articleBody"}).string
print(content)

Indian Shuttler Ajay Jayaram clinched $50k Dutch Open Grand Prix at Almere in Netherlands on Sunday, becoming the first Indian to win badminton Grand Prix tournament under a new scoring system. Jayaram defeated Indonesia's Ihsan Maulana Mustofa 10-11, 11-6, 11-7, 1-11, 11-9 in an exciting final clash. The 27-year-old returned to the circuit in August after a seven-month injury layoff.
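
One caveat worth knowing: .string returns None when a tag holds more than one child node, for example when the body text contains bold markup or a link. The sketch below, run on made-up markup, shows .get_text() as a possible fallback; whether the real page ever needs it is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second body mixes text with a nested <b> tag.
html = """
<div class="news-card-content news-right-box">
  <div itemprop="articleBody">Plain body text.</div>
</div>
<div class="news-card-content news-right-box">
  <div itemprop="articleBody">Body with <b>bold</b> markup.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
bodies = soup.find_all("div", attrs={"itemprop": "articleBody"})

print(bodies[0].string)      # a single text child, so .string returns it
print(bodies[1].string)      # mixed children, so .string is None
print(bodies[1].get_text())  # concatenates all nested text
```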

In a similar way, we can extract any other information, such as images, author names, and timestamps.
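
As a rough illustration of the same pattern applied to other fields, the sketch below uses hypothetical markup. The class names "news-card", "author", and "time" and the image URL are assumptions for demonstration only; the real page would need to be inspected to find the actual tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names and values are illustrative
# assumptions, not taken from the real page.
html = """
<div class="news-card">
  <img src="https://example.com/photo.jpg" alt="match photo">
  <span class="author">Jane Doe</span>
  <span class="time">08:30 am</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="news-card")

image_url = card.find("img")["src"]  # tag attributes are read like dictionary keys
author = card.find("span", class_="author").string
published = card.find("span", class_="time").string
print(image_url, author, published)
```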

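Step 3: Build the dataset

Next, we repeat this process for three news categories and store every article's title, content, and category in a data frame. I will use three different URLs, apply the same steps to each one, and collect all the articles, their content, and their categories as lists.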

urls = ["https://inshorts.com/en/read/cricket",
        "https://inshorts.com/en/read/tennis",
        "https://inshorts.com/en/read/badminton"]

news_data_content, news_data_title, news_data_category = [], [], []

for url in urls:
    # The last path segment of the URL is the category name.
    category = url.split('/')[-1]
    data = requests.get(url)
    soup = BeautifulSoup(data.content, 'html.parser')
    news_title = []
    news_content = []
    news_category = []
    # Pair each title card with its content card.
    for headline, article in zip(soup.find_all('div', class_=["news-card-title news-right-box"]),
                                 soup.find_all('div', class_=["news-card-content news-right-box"])):
        news_title.append(headline.find('span', attrs={'itemprop': "headline"}).string)
        news_content.append(article.find('div', attrs={'itemprop': "articleBody"}).string)
        news_category.append(category)
    news_data_title.extend(news_title)
    news_data_content.extend(news_content)
    news_data_category.extend(news_category)

# Combine the three lists into one data frame, one column per list.
df1 = pd.DataFrame(news_data_title, columns=["Title"])
df2 = pd.DataFrame(news_data_content, columns=["Content"])
df3 = pd.DataFrame(news_data_category, columns=["Category"])
df = pd.concat([df1, df2, df3], axis=1)
df.sample(10)

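The output is a random sample of 10 rows from the data frame, with Title, Content, and Category columns.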
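
As a side note on the design, the three lists can also be replaced by a single list of row dictionaries passed straight to pd.DataFrame. The sketch below is one possible alternative, not the original author's method: the helper category_from_url is hypothetical, the rows are made-up placeholders rather than scraped data, and the trailing slash on the second URL is added only to show an edge case.

```python
import pandas as pd


def category_from_url(url):
    """Return the last path segment, ignoring any trailing slash (hypothetical helper)."""
    return url.rstrip("/").split("/")[-1]


urls = [
    "https://inshorts.com/en/read/cricket",
    "https://inshorts.com/en/read/tennis/",  # trailing slash added to show the edge case
    "https://inshorts.com/en/read/badminton",
]

# Placeholder rows (made up) standing in for scraped titles and bodies.
records = [
    {
        "Title": f"Example {category_from_url(u)} headline",
        "Content": f"Example {category_from_url(u)} body.",
        "Category": category_from_url(u),
    }
    for u in urls
]

# One dictionary per row keeps each title, body, and category together.
df = pd.DataFrame(records, columns=["Title", "Content", "Category"])
print(df["Category"].tolist())
```

Keeping each article in one record means a title can never drift out of alignment with its content or category, which is a risk when three parallel lists are extended separately.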