Goal

For my graduation project I need to scrape grass carp prices from a price website and turn them into a pandas DataFrame for further processing.
Approach

After inspecting the page source, the CSS selectors for the required "date" and "price" fields are:
#_date #_price
So the full page source is fetched with the requests library:
url = "https://price.xxxxx.cn/product/1321-p2.html"
strhtml = requests.get(url)
Then BeautifulSoup (from bs4) parses the source and extracts the target tags:
soup = BeautifulSoup(strhtml.text, 'lxml')
data1 = soup.select("#_date")
data2 = soup.select("#_price")
This returns the matching tags. Since only each tag's value attribute is needed, it is extracted through the Tag object's get method and split on commas:
values1 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data1] if data1 else []
values2 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data2] if data2 else []
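The extraction step can be checked on a small inline snippet. The markup below is an assumption that mimics the selectors above (hidden fields whose value attribute holds a comma-separated list); it is not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the page's hidden fields (assumed format).
html = '''
<input id="_date" value="2023-01-01,2023-01-02,2023-01-03">
<input id="_price" value="15.5,16.0,16.2">
'''
soup = BeautifulSoup(html, "html.parser")

data1 = soup.select("#_date")
data2 = soup.select("#_price")

# Split each tag's value attribute on commas; guard against a missing attribute.
values1 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data1]
values2 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data2]

print(values1)  # [['2023-01-01', '2023-01-02', '2023-01-03']]
print(values2)  # [['15.5', '16.0', '16.2']]
```

Each selector matches one tag, so each result is a list containing a single list of items.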
Printing the values reveals that the price list has three fewer items than the date list. I could not find the cause, so I padded it manually with three prices close to the neighbouring values. Note that a .loc slice over labels that are not yet in the index selects nothing, so a slice assignment would silently change nothing; appending one label at a time does enlarge the frame:

missing_data = ["16.0", "16.0", "16.0"]
for i, v in enumerate(missing_data, start=df.index[-1] + 1):
    df.loc[i, "Value2"] = v
Printing the data again shows that both columns now contain the same number of items, so the data is usable.
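Rather than eyeballing the printed output, the counts can be compared programmatically. A minimal sketch with made-up stand-ins for the scraped lists:

```python
# Toy stand-ins for the scraped lists (assumed shape: one tag per selector).
values1 = [["2023-01-01", "2023-01-02", "2023-01-03"]]
values2 = [["15.5", "16.0", "16.2"]]

# Compare the total number of items, not just the number of tags.
n_dates = sum(len(v) for v in values1)
n_prices = sum(len(v) for v in values2)
assert n_dates == n_prices, f"{n_dates} dates vs {n_prices} prices"
```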
Each cell still holds a comma-separated group of values, so to make the data easier to inspect in Excel the lists are expanded with explode:
df = df.explode("Value1").reset_index(drop=True)
df = df.explode("Value2").reset_index(drop=True)
However, this left 600 rows in the DataFrame df: exploding the two columns one after the other expands the rows twice, so the second explode ran over rows that were already expanded. The erroneous code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://price.xxxxxx.cn/product/1321-p2.html"
strhtml = requests.get(url)
print(strhtml)
soup = BeautifulSoup(strhtml.text, 'lxml')
data1 = soup.select("#_date")
data2 = soup.select("#_price")
values1 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data1] if data1 else []
values2 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data2] if data2 else []
print(values1)
print(values2)
df = pd.DataFrame({"Value1": values1, "Value2": values2})
# Pad the three missing prices, one label at a time so .loc actually enlarges the frame
missing_data = ["16.0", "16.0", "16.0"]
for i, v in enumerate(missing_data, start=df.index[-1] + 1):
    df.loc[i, "Value2"] = v
print(df)
# BUG: exploding the columns one after the other multiplies the rows,
# pairing every date with every price instead of keeping them aligned.
df = df.explode("Value1").reset_index(drop=True)
df = df.explode("Value2").reset_index(drop=True)
df = df.drop_duplicates(subset=["Value1", "Value2"])  # result must be reassigned to take effect
print(df)
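The row explosion above is easy to reproduce on toy data: exploding Value1 first multiplies the rows, and exploding Value2 afterwards multiplies them again, yielding a cross product of dates and prices instead of paired rows:

```python
import pandas as pd

# One row whose cells each hold a 3-item list (mirrors the scraped shape).
df = pd.DataFrame({"Value1": [["a", "b", "c"]], "Value2": [["1", "2", "3"]]})

df = df.explode("Value1").reset_index(drop=True)  # 3 rows
df = df.explode("Value2").reset_index(drop=True)  # 3 x 3 = 9 rows: cross product

print(len(df))  # 9
```

With the real data the same multiplication is what inflated the frame to 600 rows.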
After some analysis, the fix is to build Value1 and Value2 as two separate DataFrame objects, explode each one on its own, and then concatenate them side by side as two columns of a single DataFrame. Revised code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://price.xxxxx.cn/product/1321-p2.html"
strhtml = requests.get(url, timeout=10)  # timeout so a dead server does not hang the script
soup = BeautifulSoup(strhtml.text, 'lxml')

data1 = soup.select("#_date")
data2 = soup.select("#_price")
print(data1)
print(data2)

# Split each tag's value attribute into a list of items
values1 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data1] if data1 else []
values2 = [tag.get("value").split(",") if tag.get("value") else [] for tag in data2] if data2 else []

# Explode each column in its own DataFrame so the expansions stay independent
df1 = pd.DataFrame({"Value1": values1})
df2 = pd.DataFrame({"Value2": values2})
df1_exploded = df1.explode("Value1").reset_index(drop=True)
df2_exploded = df2.explode("Value2").reset_index(drop=True)

# Concatenate side by side; with equal lengths each date lines up with its price
df_merged = pd.concat([df1_exploded, df2_exploded], axis=1)
print(df_merged)
Running this produces the correct result.
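As an alternative to the two-DataFrame approach, pandas 1.3+ lets explode take a list of columns and pair their elements positionally, as long as the lists in each row have equal lengths (which is why the missing prices had to be padded first). A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"Value1": [["2023-01-01", "2023-01-02"]],
                   "Value2": [["15.5", "16.0"]]})

# Exploding both columns together keeps each date paired with its price.
df = df.explode(["Value1", "Value2"]).reset_index(drop=True)
print(df)
```

From here, pd.to_numeric(df["Value2"]) converts the prices to floats and df.to_excel(...) writes the spreadsheet mentioned in the goal.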