Web Scraping for Beginners: Regular Expressions and the re Module

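I. Data Categories

Data fetched from a website generally falls into two categories: structured and unstructured. Structured data follows a regular, predictable layout; unstructured data has no obvious pattern, or mixes several kinds of content together.

1. Structured data

Example: a JSON file. It is usually converted to a dictionary, and values are then read by key.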
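As a minimal sketch of that idea, the snippet below parses JSON text with the standard-library `json` module and reads values by key. The sample payload is made up for illustration; only the key names `comments`, `content`, and `imageListCount` mirror names that appear later in this article.

```python
import json

# Hypothetical JSON text, shaped loosely like a comment API response.
raw = '{"comments": [{"content": "Good value"}, {"content": "Fast delivery"}], "imageListCount": 2}'

# json.loads turns JSON text into Python objects (dict, list, str, int, ...).
data = json.loads(raw)

# Ordinary dictionary and list access then retrieves the values.
contents = [item["content"] for item in data["comments"]]
print(contents)
print(data["imageListCount"])
```

When the data comes from `requests`, calling `.json()` on the response performs the same conversion.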

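2. Unstructured data

Example: the HTML source of a web page. Values are usually extracted with a mix of lxml, bs (BeautifulSoup), parsel, and regular expressions.

The rest of this article puts this into practice by scraping the reviews of a product.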
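For the regular-expression route, a small sketch using the standard-library `re` module might look like the following. The HTML fragment and its class names are invented for illustration and do not come from any real page.

```python
import re

# Hypothetical HTML fragment; real pages are far messier.
html = '''
<div class="comment"><p class="text">Good value</p></div>
<div class="comment"><p class="text">Fast delivery</p></div>
'''

# The non-greedy group (.*?) stops at the first closing </p> after each opening tag.
pattern = re.compile(r'<p class="text">(.*?)</p>')

# findall returns the captured group for every match, in document order.
texts = pattern.findall(html)
print(texts)
```

Regular expressions work well for short, predictable fragments like this one; for deeply nested markup, a parser such as lxml or parsel is usually the more robust choice.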

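II. Scraping Steps

2.1 Send the request (covered in the previous article)

2.2 Get the data

Open the product page, find the reviews, and use Ctrl+F to search for the request that returns them.

Code

import requests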

url = "https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1722161535569&body=%7B%22productId%22%3A100128967368%2C%22score%22%3A0%2C%22sortType%22%3A5%2C%22page%22%3A0%2C%22pageSize%22%3A10%2C%22isShadowSku%22%3A0%2C%22fold%22%3A1%2C%22bbtf%22%3A%22%22%2C%22shield%22%3A%22%22%7D&h5st=20240728181215573%3Bzygzgzgt5n69mgg3%3Bfb5df%3Btk03wa5cf1c2718n6i3e1l1V4shzsNxLA9yL0zKaGsuGsUzPdG0q5YgTwF3NLp0PqLwKMLcnKhT3CMB_CzF4l3rpy8Rr%3B6e2fa33c92dd4052e66fda0f714a5b14%3B4.7%3B1722161535573%3BVKZ1m5ZwhDVgj9rNkk8oppAIAvifUJbzS2DoBNIrvSrGf2_AgJcuqKg-gbIS-gTyexG5hGTu66LkypwmOuHTKEqlVzxcr9vHxz2WTwCPrh_s3EuyuBuaTyr0YgXpYo4H4hwE_OnVxo57-FrqA4mpDJSdz28ZTeLekHnOAwl7jH7_pHg9uy76eM_09DQ9nm_tgqnc-zSF8OFhDIPhPV21p-Jb6-rjveGNfEa6knTzalxosRT3cJr_LfKCjcmSy-vEvJZN2oICxL8UD3NTptwQFphFKWIO6VVyTZBtbsvBJOus_rczNtd-xB8g84H8xuF9vo3WqqN7RzKLgpkRsiEjI8SR9KpJyOF1HptguJyAcmpz9FVa-_T-g6J4q5nDjO4hPpAWp-cec9Q_-mS7FdRC9srUJubl-eO9P2bZ9xv8mS0LeglHLB7D7Xp4SHc8_S4BZFpQQZt5vphfUMkRHoQrAkNwRxgUx_ac2JkEwzCI0z9TSWVNQ2l8IniXQdz5GGL6_tV9xS56w7vk5ey8BXFY4RqOq3cPNsbA77CD9gDvmkWdNApyR6QeR5uk1rPMyhr7pQljp5LKEBpK9DipKa--Z1kZEsmeoFQYdlaSjERAl5d0ggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3%3B2fa3cd43c062efe2ecffd02104eecb33&x-api-eid-token=jdd03KQF6GQUPDEBWOYLXVYEBFFH2A6CYEAZKJIAQZ4WBOF3VV4SVPXCF465HQIDUIETXVBPAYHUAFGQHBK7URLLAEL3QRUAAAAMQ7DJV4IAAAAAAC4GZOFHD4NGYV4X&loginType=3&uuid=181111935.1676817357.1722098687.1722136653.1722161521.3"
# Send the HTTP request
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0',
    "authority":"api.m.jd.com",
    # The cookie is written as two adjacent string literals, which Python joins into one string.
    "Cookie":"unpl=JF8EAJpnNSttDEIEDR4FSUERTVlTWwhbTx4GOmBRAQ5QGV1QEwobFBh7XlVdWBRKER9sbhRVXFNOUQ4eBCsSEXteU11bD00VB2xXVgQFDQ8WUUtBSUt-SVxRWFULShIAaWYHZG1bS2QFGjIbFRZMVFRYXwxNJwJfYDVkbVpOXAYeCysTIEptFgoBD0kRB2piSFRaXkxdBR0AHxQgSm1X; __jdv=76161171|baidu-search|t_262767352_baidusearch|cpc|304792042703_0_e8a946cb07564db584d6edb9c8d99179|1722098687985; __jdu=1676817357; areaId=19; PCSYCityID=CN_440000_440100_0; shshshfpa=8ce41290-fad5-fb9d-f83d-1d16eb11ca5e-1722098689; shshshfpx=8ce41290-fad5-fb9d-f83d-1d16eb11ca5e-1722098689; jsavif=1; mba_muid=1676817357; mt_xid=V2_52007VwMVVFpRUlwdTR9sVTNWQgZbXwdGG0EZXhliUBoHQQsADhtVTFgFNQQbUVULBw1KeRpdBmIfE1RBW1tLHkgSWQFsAhJiX2hSahdIG1gEZwcbUlVdWlMeSR1eBWQzF1NUXg%3D%3D; ipLoc-djd=19-1601-50258-129167; mba_sid=17220986914544814580665214385.3; TrackID=12ZmKyVqv2jxp36JwNCPAx7T9eynNnF5TWvNUsibQjLjUkTwpVEppGJIM3S7Y2appP_Lj6ngaWMpkq_fYn91DmqYDvZnpDi7JPIvOT6D_szkg_n-mVGZFxhQ5RfjsrMbM; thor=40DBD2CDABF0F02944485497D092C6905D9DBA8C119C981D47FD5305C0205D57624B3C54ED2C39C2B4ADF4185024054AFD3632583359727DB25295BCE5D4B98E59F052EAA39AF7261D3283201380908D721413D8B29182DFB893C8FA083E6213FD6025E8A8F101A0A7B4816BC422D6245136B73CA5ACAB53122BA23B14C4C694073B172EC39F149514C4C68768761BEE; pinId=WVRVFL3KGNDsti_0vFSaPA; pin=wdCcURIgChYhDF; unick=%E6%B2%A1%E5%90%83%E9%A5%B1%E4%BD%86%E5%BE%88%E9%97%B2; ceshi3.com=103; _tp=vjWyFcCiJK0r9gVMsWBvUQ%3D%3D; _pst=wdCcURIgChYhDF; token=27709efed065e98865d5be5da33077d9,3,956721; 3AB9D23F7A4B3C9B=KQF6GQUPDEBWOYLXVYEBFFH2A6CYEAZKJIAQZ4WBOF3VV4SVPXCF465HQIDUIETXVBPAYHUAFGQHBK7URLLAEL3QRU; __jda=181111935.1676817357.1722098687.1722098687.1722098688.1; __jdc=181111935; __jdb=181111935.14.1676817357|1.1722098688; 3AB9D23F7A4B3CSS=jdd03KQF6GQUPDEBWOYLXVYEBFFH2A6CYEAZKJIAQZ4WBOF3VV4SVPXCF465HQIDUIETXVBPAYHUAFGQHBK7URLLAEL3QRUAAAAMQ6UPXKDAAAAAADPGGPTW2M7AHRQX; _gia_d=1; "
             "flash=3_ThhCk8KdzE8-gbbOUJW9uQrmeT6Zo-g70t_2uIj3uza6P9c6FtjWBeWGTc-pB_FGDNDMJGmuamKpwuvljJqM1ZHoZkrwLX4rBpZqgGj5QT4TFqN3j4KMMDw4sBllDg1qFQ9LCDBx1kClUbB6Is60ZKaOZy5d7TqcOd1r3_JANIXYPDgQdes*;",
}
# .json() parses the JSON response body into a Python dictionary
json_data = requests.get(url, headers=headers).json()

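2.3 Parse the data

The figure shows the data returned by the request; the next step is to parse it.

Extract the data with dictionary lookups

In the browser's developer tools, the comments field of the response corresponds to json_data['comments']. Other fields, such as imageListCount, can be extracted the same way.

# Each element of json_data['comments'] is a dictionary describing one review
comment_list = json_data['comments']
for comment in comment_list:
    content = comment['content']
    print(content)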
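Direct indexing raises `KeyError` if a key is absent, for example when the server returns an error payload instead of reviews. A slightly more defensive variant is sketched below; the helper name `extract_contents` and the sample dictionaries are hypothetical, and only the `comments` and `content` keys come from the code above.

```python
def extract_contents(json_data):
    """Return the text of every comment, skipping malformed entries."""
    contents = []
    # .get avoids KeyError when 'comments' is missing; `or []` also
    # covers the case where the key is present but set to None.
    for comment in json_data.get("comments") or []:
        text = comment.get("content")
        if text:
            contents.append(text)
    return contents


# Hypothetical sample data: the second entry has no 'content' key.
sample = {"comments": [{"content": "Good value"}, {"id": 1}, {"content": "Fast delivery"}]}
print(extract_contents(sample))
print(extract_contents({}))
```

The function returns an empty list rather than failing, so the caller can decide whether an empty result means "no reviews" or "request rejected".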

To collect more data, wrap the request in a loop and fetch the pages in batches.

III. Complete Code
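One way to drive such a loop is to regenerate the `body` query parameter for each page. The URL above carries `body` as URL-encoded JSON with `page` and `pageSize` fields, and the sketch below rebuilds that value for a given page number. The helper name `build_body` is hypothetical. It is also an untested assumption that the server accepts a changed `body` with the other parameters left as they are: values such as `h5st` and `t` look like signatures or timestamps and may need to be refreshed for each request.

```python
import json
from urllib.parse import quote, unquote


def build_body(page, product_id=100128967368, page_size=10):
    """Return the URL-encoded `body` value for one page of comments."""
    body = {
        "productId": product_id, "score": 0, "sortType": 5,
        "page": page, "pageSize": page_size,
        "isShadowSku": 0, "fold": 1, "bbtf": "", "shield": "",
    }
    # Compact separators reproduce the whitespace-free JSON seen in the URL;
    # safe="" percent-encodes every reserved character, including ':' and ','.
    return quote(json.dumps(body, separators=(",", ":")), safe="")


for page in range(3):
    encoded = build_body(page)
    # Decode again to confirm which page number each value carries.
    print(page, json.loads(unquote(encoded))["page"])
```

For page 0, the result is identical to the `body` value in the original URL, so the remaining work is substituting it into the URL and pausing between requests.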

import requests
import json

url = "https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1722161535569&body=%7B%22productId%22%3A100128967368%2C%22score%22%3A0%2C%22sortType%22%3A5%2C%22page%22%3A0%2C%22pageSize%22%3A10%2C%22isShadowSku%22%3A0%2C%22fold%22%3A1%2C%22bbtf%22%3A%22%22%2C%22shield%22%3A%22%22%7D&h5st=20240728181215573%3Bzygzgzgt5n69mgg3%3Bfb5df%3Btk03wa5cf1c2718n6i3e1l1V4shzsNxLA9yL0zKaGsuGsUzPdG0q5YgTwF3NLp0PqLwKMLcnKhT3CMB_CzF4l3rpy8Rr%3B6e2fa33c92dd4052e66fda0f714a5b14%3B4.7%3B1722161535573%3BVKZ1m5ZwhDVgj9rNkk8oppAIAvifUJbzS2DoBNIrvSrGf2_AgJcuqKg-gbIS-gTyexG5hGTu66LkypwmOuHTKEqlVzxcr9vHxz2WTwCPrh_s3EuyuBuaTyr0YgXpYo4H4hwE_OnVxo57-FrqA4mpDJSdz28ZTeLekHnOAwl7jH7_pHg9uy76eM_09DQ9nm_tgqnc-zSF8OFhDIPhPV21p-Jb6-rjveGNfEa6knTzalxosRT3cJr_LfKCjcmSy-vEvJZN2oICxL8UD3NTptwQFphFKWIO6VVyTZBtbsvBJOus_rczNtd-xB8g84H8xuF9vo3WqqN7RzKLgpkRsiEjI8SR9KpJyOF1HptguJyAcmpz9FVa-_T-g6J4q5nDjO4hPpAWp-cec9Q_-mS7FdRC9srUJubl-eO9P2bZ9xv8mS0LeglHLB7D7Xp4SHc8_S4BZFpQQZt5vphfUMkRHoQrAkNwRxgUx_ac2JkEwzCI0z9TSWVNQ2l8IniXQdz5GGL6_tV9xS56w7vk5ey8BXFY4RqOq3cPNsbA77CD9gDvmkWdNApyR6QeR5uk1rPMyhr7pQljp5LKEBpK9DipKa--Z1kZEsmeoFQYdlaSjERAl5d0ggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3%3B2fa3cd43c062efe2ecffd02104eecb33&x-api-eid-token=jdd03KQF6GQUPDEBWOYLXVYEBFFH2A6CYEAZKJIAQZ4WBOF3VV4SVPXCF465HQIDUIETXVBPAYHUAFGQHBK7URLLAEL3QRUAAAAMQ7DJV4IAAAAAAC4GZOFHD4NGYV4X&loginType=3&uuid=181111935.1676817357.1722098687.1722136653.1722161521.3"
# Send the HTTP request
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0',
    "authority":"api.m.jd.com",
    # The cookie is written as two adjacent string literals, which Python joins into one string.
    "Cookie":"unpl=JF8EAJpnNSttDEIEDR4FSUERTVlTWwhbTx4GOmBRAQ5QGV1QEwobFBh7XlVdWBRKER9sbhRVXFNOUQ4eBCsSEXteU11bD00VB2xXVgQFDQ8WUUtBSUt-SVxRWFULShIAaWYHZG1bS2QFGjIbFRZMVFRYXwxNJwJfYDVkbVpOXAYeCysTIEptFgoBD0kRB2piSFRaXkxdBR0AHxQgSm1X; __jdv=76161171|baidu-search|t_262767352_baidusearch|cpc|304792042703_0_e8a946cb07564db584d6edb9c8d99179|1722098687985; __jdu=1676817357; areaId=19; PCSYCityID=CN_440000_440100_0; shshshfpa=8ce41290-fad5-fb9d-f83d-1d16eb11ca5e-1722098689; shshshfpx=8ce41290-fad5-fb9d-f83d-1d16eb11ca5e-1722098689; jsavif=1; mba_muid=1676817357; mt_xid=V2_52007VwMVVFpRUlwdTR9sVTNWQgZbXwdGG0EZXhliUBoHQQsADhtVTFgFNQQbUVULBw1KeRpdBmIfE1RBW1tLHkgSWQFsAhJiX2hSahdIG1gEZwcbUlVdWlMeSR1eBWQzF1NUXg%3D%3D; ipLoc-djd=19-1601-50258-129167; mba_sid=17220986914544814580665214385.3; TrackID=12ZmKyVqv2jxp36JwNCPAx7T9eynNnF5TWvNUsibQjLjUkTwpVEppGJIM3S7Y2appP_Lj6ngaWMpkq_fYn91DmqYDvZnpDi7JPIvOT6D_szkg_n-mVGZFxhQ5RfjsrMbM; thor=40DBD2CDABF0F02944485497D092C6905D9DBA8C119C981D47FD5305C0205D57624B3C54ED2C39C2B4ADF4185024054AFD3632583359727DB25295BCE5D4B98E59F052EAA39AF7261D3283201380908D721413D8B29182DFB893C8FA083E6213FD6025E8A8F101A0A7B4816BC422D6245136B73CA5ACAB53122BA23B14C4C694073B172EC39F149514C4C68768761BEE; pinId=WVRVFL3KGNDsti_0vFSaPA; pin=wdCcURIgChYhDF; unick=%E6%B2%A1%E5%90%83%E9%A5%B1%E4%BD%86%E5%BE%88%E9%97%B2; ceshi3.com=103; _tp=vjWyFcCiJK0r9gVMsWBvUQ%3D%3D; _pst=wdCcURIgChYhDF; token=27709efed065e98865d5be5da33077d9,3,956721; 3AB9D23F7A4B3C9B=KQF6GQUPDEBWOYLXVYEBFFH2A6CYEAZKJIAQZ4WBOF3VV4SVPXCF465HQIDUIETXVBPAYHUAFGQHBK7URLLAEL3QRU; __jda=181111935.1676817357.1722098687.1722098687.1722098688.1; __jdc=181111935; __jdb=181111935.14.1676817357|1.1722098688; 3AB9D23F7A4B3CSS=jdd03KQF6GQUPDEBWOYLXVYEBFFH2A6CYEAZKJIAQZ4WBOF3VV4SVPXCF465HQIDUIETXVBPAYHUAFGQHBK7URLLAEL3QRUAAAAMQ6UPXKDAAAAAADPGGPTW2M7AHRQX; _gia_d=1; "
             "flash=3_ThhCk8KdzE8-gbbOUJW9uQrmeT6Zo-g70t_2uIj3uza6P9c6FtjWBeWGTc-pB_FGDNDMJGmuamKpwuvljJqM1ZHoZkrwLX4rBpZqgGj5QT4TFqN3j4KMMDw4sBllDg1qFQ9LCDBx1kClUbB6Is60ZKaOZy5d7TqcOd1r3_JANIXYPDgQdes*;",
}
# .json() parses the JSON response body into a Python dictionary
json_data = requests.get(url, headers=headers).json()

# print(json_data)
# Parse the JSON data: each element of 'comments' describes one review
comment_list = json_data['comments']
for comment in comment_list:
    content = comment['content']
    print(content)
