1.1 Scraping 12306: Fetching the Information


adream307


Article link: https://blog.csdn.net/adream307/article/details/51406247


1.1 Fetching the information

#!/bin/bash
curl --insecure --user-agent "Mozilla/5.0 (X11; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0" "https://kyfw.12306.cn/otn/lcxxcx/query?purpose_codes=ADULT&queryDate=$1&from_station=SHH&to_station=BJP" | grep -oP "(?<={)[^{}]+(?=})" | sed -r 's/.*station_train_code":"([^"]+).*start_station_name":"([^"]+).*end_station_name":"([^"]+).*start_time":"([^"]+).*arrive_time":"([^"]+).*ze_num":"([^"]+).*zy_num":"([^"]+).*swz_num":"([^"]+).*/\1 \2 \3 \4 \5 \6 \7 \8/'
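
To make the curl part of the script easier to read, here is a minimal sketch that spells the same call out piece by piece. It assumes the three pieces are the `--insecure` option, the `--user-agent` option, and the query URL, which is a reading of the command itself. The names `build_query_url` and `fetch_trains` are hypothetical helpers, not part of the original script. The station codes default to the SHH and BJP values used above.

```shell
#!/bin/bash
# Sketch only: the curl call from fetch_sh-bj.sh, split into named pieces.
# build_query_url and fetch_trains are hypothetical helper names.

USER_AGENT="Mozilla/5.0 (X11; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0"
BASE_URL="https://kyfw.12306.cn/otn/lcxxcx/query"

# Assemble the query URL from a travel date (YYYY-MM-DD) and two station codes.
# The station codes default to the ones hard-coded in the original script.
build_query_url() {
    local date="$1" from="${2:-SHH}" to="${3:-BJP}"
    printf '%s?purpose_codes=ADULT&queryDate=%s&from_station=%s&to_station=%s\n' \
        "$BASE_URL" "$date" "$from" "$to"
}

# Run curl with the three pieces: --insecure, --user-agent, and the URL.
fetch_trains() {
    curl --insecure \
         --user-agent "$USER_AGENT" \
         "$(build_query_url "$@")"
}
```

With these definitions loaded, `fetch_trains 2016-05-14` would issue the same request as the curl stage of the script, and its output could be piped into the same grep and sed stages.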

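The fetch_sh-bj.sh script shown above may still leave you scratching your head.
But don't worry: make it through the night and you will see the first light of dawn.
Take a sip of 24-karat-pure plain boiled water to settle your nerves, and let me walk you through the story behind fetch_sh-bj.sh.
As mentioned earlier, fetch_sh-bj.sh can be divided into three parts.
In this subsection we start with the part that fetches the information: curl.
The curl command itself can be split into three segments: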

Segment 1: the `--insecure` option

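The insecure option tells curl not to verify the website's certificate.
I suspect many readers had a similar experience the first time they booked tickets on the 12306 website: on opening the booking page, the browser popped up a warning along the lines of "This page is not trusted. Do you want to continue?"
Figure: Firefox's warning that the current page is not trusted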

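When curl fetches the ticket information, it does much the same thing as the browser.
If the insecure option is not given, curl reports that certificate verification failed.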
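
From inside a script, this failure can be recognized by curl's exit status: curl exits with code 60 when it cannot verify the server certificate against its CA bundle. The sketch below maps an exit status to a short message. The name `describe_curl_exit` is a hypothetical helper, not part of the original script.

```shell
#!/bin/bash
# Sketch only: turn a curl exit status into a short human-readable message.
# describe_curl_exit is a hypothetical helper name.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        # 60: the peer certificate could not be verified with the known CA certificates
        60) echo "certificate verification failed" ;;
        *)  echo "curl failed with exit code $1" ;;
    esac
}
```

A script could run curl without `--insecure` and then call `describe_curl_exit $?` to decide whether a certificate problem was the cause. curl's own error text names `--cacert` as the way to supply an alternate CA bundle instead of skipping verification.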

cyf@cyf$ curl --user-agent "Mozilla/5.0 (X11; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0" "https://kyfw.12306.cn/otn/lcxxcx/query?purpose_codes=ADULT&queryDate=2016-05-14&from_station=SHH&to_station=BJP"
curl: (60) SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate v