We've finally reached the last installment!! It's almost done, so let me start by showing the finished product!
Below is the processing step for the final version from last time. Don't mind my naming: this script processes the CSV produced at the end of the middle installment, and if my file names don't match yours, don't worry about it either; I've saved far too many copies. The reason there are total1, 2, and 3 files is that each train averages about 10 stop stations, which works out to roughly 80,000 rows per day; since Excel has a display limit on rows, I recommend processing about 10 days at a time. This run covers the remaining days of the last batch; for the earlier batches you only need to change the date list. Finally, deduplicate each of the three total1/2/3 CSVs. Note that the dedup must be on the two columns 车次 (train number) and 站点名称 (station name) together, because the same train can have different intermediate stops on different dates. Then combine the three deduplicated totals and deduplicate once more, and the result is what we want.
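The dedup-and-merge step described above can be sketched in a few lines of pandas. The tiny inline DataFrames below are stand-ins for the three total CSVs (in practice you would `pd.read_csv` each file); the column names match the crawler's output:

```python
import pandas as pd

# Tiny stand-ins for the three total CSVs (hypothetical sample rows).
total1 = pd.DataFrame({"车次": ["G1", "G1"], "站点名称": ["北京南", "上海虹桥"], "时间": ["2021-01-02"] * 2})
total2 = pd.DataFrame({"车次": ["G1", "G1"], "站点名称": ["北京南", "南京南"], "时间": ["2021-01-05"] * 2})
total3 = pd.DataFrame({"车次": ["G1"], "站点名称": ["上海虹桥"], "时间": ["2021-01-08"]})

# Step 1: deduplicate each batch on (车次, 站点名称) -- stop lists can differ
# by date, so the dedup must use both columns, not the train number alone.
parts = [df.drop_duplicates(subset=["车次", "站点名称"]) for df in (total1, total2, total3)]

# Step 2: concatenate the deduplicated batches and deduplicate once more across them.
merged = pd.concat(parts, ignore_index=True).drop_duplicates(subset=["车次", "站点名称"])
print(len(merged))  # 3 unique (train, station) pairs survive
```

With real data you would finish with `merged.to_csv(..., index=False, encoding="utf-8")`.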
import pandas as pd
import requests
import csv
import json
import os
import time

# url = "https://kyfw.12306.cn/otn/queryTrainInfo/query?leftTicketDTO.train_no=24000013030W&leftTicketDTO.train_date={}&rand_code="
# test_url_wu = "https://kyfw.12306.cn/otn/queryTrainInfo/query?leftTicketDTO.train_no=2400000K211K&leftTicketDTO.train_date=2020-12-11&rand_code="
base_url = "https://kyfw.12306.cn/otn/queryTrainInfo/query?leftTicketDTO.train_no={}&leftTicketDTO.train_date={}&rand_code="
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
}
train_date_list = ["2021-01-02", "2021-01-03",
                   "2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07", "2021-01-08"]
proxies = {
    # format -- protocol: protocol://ip:port  (key must be lowercase "http")
    "http": "http://113.121.36.73:9999"
}  # route requests through a proxy
##########################################################################
# Test request
# response = requests.get(url=test_url_wu, headers=headers, proxies=proxies)
# data_list = json.loads(response.content)["data"]["data"]
# if data_list is not None:
#     print("got data")
# else:
#     print("no data")
###########################################################################
if not os.path.exists("12306爬虫原始数据处理"):
    os.mkdir("12306爬虫原始数据处理")  # must match the path checked above
# with open("12306爬虫原始数据处理/车次详情站点信息_total版.csv", "w", newline="", encoding="utf-8") as f:
with open("12306爬虫原始数据处理/车次详情站点信息_total版3.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["车次", "车次编号", "站点序号", "站点名称", "出发站", "终点站", "经停站总数", "时间"])
    for i in range(0, 7):
        count = 0
        df = pd.read_csv(r"D:\Contest_test\pachong_self\实习爬虫\12306爬虫原始数据处理\所有车次信息5.csv")
        for index, row in df.iterrows():
            train_name = row["车次"]
            train_no = row["编号"]
            start_station = row["出发站"]
            end_station = row["终点站"]
            total_num = row["经停站站总数"]
            train_no_use = train_no.strip("#")
            full_url = base_url.format(train_no_use, train_date_list[i])
            response = requests.get(url=full_url, headers=headers, proxies=proxies)
            flag = 1
            while flag == 1:
                try:
                    data_list = json.loads(response.content)["data"]["data"]
                    if data_list is not None:
                        for tmp in data_list:
                            station_no = tmp["station_no"]
                            station_name = tmp["station_name"]
                            writer.writerow(
                                [train_name, train_no, station_no, station_name, start_station,
                                 end_station, total_num, train_date_list[i]])
                            # print(train_name, train_no, station_no, station_name, start_station,
                            #       end_station, total_num, train_date_list[i])
                    else:
                        count = count + 1
                    flag = 0
                except Exception:
                    # response was not valid JSON (rate limiting etc.) -- retry the request
                    response = requests.get(url=full_url, headers=headers, proxies=proxies)
        print("number of trains with no data on {}: {}".format(train_date_list[i], count))
        # time.sleep(0.1)
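One caveat about the `while flag == 1` loop above: the `except` branch re-requests immediately and forever, so a persistently bad proxy will spin the script. A small bounded-retry helper (the names here are my own, not from the original script) is one way to cap that:

```python
import time

def fetch_with_retry(fetch, max_tries=5, delay=0.5):
    """Call fetch() until it returns without raising, up to max_tries times.

    fetch: a zero-argument callable, e.g. a lambda wrapping the
    requests.get + json.loads pair from the crawler above.
    Returns the first successful result, or None if every attempt failed.
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            time.sleep(delay)  # back off briefly before the next attempt
    return None

# Usage with a flaky callable that stands in for a rate-limited request:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("bad JSON")  # simulate 12306 returning a non-JSON page
    return {"data": {"data": []}}

print(fetch_with_retry(flaky))  # {'data': {'data': []}}
```

In the crawler you would replace the bare re-request in the `except` branch with a call like this, and treat a `None` result the same as an empty `data_list`.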
This marks my very first blog post; the layout is ugly and not cute at all, oh well, so be it. My approach is pure brute force, but I hope it helps you a little! Let's keep at it together!!!