A BeautifulSoup blog-scraping example
- Target: Hatena Bookmark (はてなブックマーク), a Japanese site
- A for loop scrapes the first two pages of posts in each blog category
- Uses the Python BeautifulSoup library
Step 1: scrape the text and link of every category
'''
BeautifulSoup is imported to parse the web page,
requests to fetch the HTML,
and pandas to build the result table.
'''
from bs4 import BeautifulSoup
import requests
import pandas as pd
columns = ["TAG", "URL"]
df = pd.DataFrame(columns=columns)
base_url = "https://b.hatena.ne.jp"
home_url = requests.get("https://b.hatena.ne.jp/hotentry/all").content
soup = BeautifulSoup(home_url, "html.parser", from_encoding="utf8")
li_tag = []
li_url = []
tree = soup.find_all("div", class_="navi-link js-navi-link")
for branch in tree:
    tag = branch.a.string            # category name: the text of the <a> tag
    link = branch.find("a")["href"]  # relative URL of the category page
    li_tag.append(tag)
    li_url.append(base_url + link)
# Collect the scraped categories into the DataFrame
df = pd.DataFrame({"TAG": li_tag, "URL": li_url})
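With the category URLs collected, the plan above calls for crawling the first two pages of each category. A minimal sketch of how those page URLs could be assembled is below; the `?page=` query parameter is an assumption about Hatena Bookmark's pagination scheme, not something confirmed by the site, so verify it in the browser before crawling.

```python
def build_page_urls(category_urls, pages=2):
    """Return the URLs of the first `pages` pages for each category.

    Page 1 is the category URL itself; later pages append a
    hypothetical ?page=N query parameter (an assumption).
    """
    urls = []
    for cat in category_urls:
        for page in range(1, pages + 1):
            urls.append(cat if page == 1 else f"{cat}?page={page}")
    return urls

# Example: two pages for a single (assumed) category URL
print(build_page_urls(["https://b.hatena.ne.jp/hotentry/it"]))
```

Each resulting URL can then be fetched with `requests.get` and parsed with BeautifulSoup exactly as in Step 1.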