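1. Toutiao. This part includes resolving video addresses (which involves an algorithm for a signed parameter: a request to the video address must carry two parameters, r and s).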
Toutiao search page: https://m.toutiao.com/search/?keyword=
Interface that returns the search data:
https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E4%BC%98&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1559618259319
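Below is the algorithm used to sign the video request. It is based on CRC32 (Xigua Video):

    import binascii
    import random
    from urllib.parse import urlparse

    import requests

    def right_shift(val, n):
        # Treat a negative value as an unsigned 32-bit integer before shifting.
        return val >> n if val >= 0 else (val + 0x100000000) >> n

    r = str(random.random())[2:]
    # Either of the two addresses below works.
    # vid is the video's unique ID; it can be pulled from the HTML page with a regex.
    url_ship = 'http://i.snssdk.com/video/urls/v/1/toutiao/mp4/%s' % vid
    url_ship = 'https://ib.365yg.com/video/urls/v/1/toutiao/mp4/%s' % vid
    n = urlparse(url_ship).path + '?r=' + r
    c = binascii.crc32(n.encode())  # CRC32 checksum of the path plus the r parameter
    s = right_shift(c, 0)  # uses the helper defined above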
Once r and s are available, the request can be sent:
    response_json = requests.get(url_ship + "?r=%s&s=%s" % (r, s))
    dict_main_url = response_json.json()
A JSON document comes back containing the real address (annoyingly, it is still time-limited).
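The signing steps above can be folded into one helper. This is only a sketch: the function name sign_video_url and the placeholder VIDEO_ID are made up for illustration, and masking with 0xFFFFFFFF is used in place of right_shift(c, 0) because both give the same unsigned 32-bit value.

```python
import binascii
import random
from urllib.parse import urlparse


def sign_video_url(base_url, r=None):
    """Return base_url with the r and s query parameters appended."""
    if r is None:
        # Random digit string, built the same way as in the notes above.
        r = str(random.random())[2:]
    to_sign = urlparse(base_url).path + "?r=" + r
    # Masking keeps the checksum unsigned regardless of Python version.
    s = binascii.crc32(to_sign.encode()) & 0xFFFFFFFF
    return "%s?r=%s&s=%d" % (base_url, r, s)


# "VIDEO_ID" is a placeholder, not a real vid.
example = sign_video_url(
    "https://ib.365yg.com/video/urls/v/1/toutiao/mp4/VIDEO_ID", r="12345"
)
```

The returned URL can be passed straight to requests.get; whether the endpoint still accepts this signature is not something the sketch can guarantee.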
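2. Sogou WeChat search. Search for articles by keyword and scrape the article content (this also involves following a redirect to get the real URL).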
Interface: an article is fetched by visiting its URL directly. The following is the interface that returns those URLs:
    import urllib.parse

    first_url = "https://weixin.sogou.com/weixin?"
    params = {
        "type": 2,
        "s_form": "input",
        "query": key,
        "ie": "utf8",
        "page": page
    }
    url = first_url + urllib.parse.urlencode(params)
The URL obtained:
http://weixin.sogou.com/api/share?timestamp=1560158525&signature=qIbwY*nI6KU9tBso4VCd8lYSesxOYgLcHX5tlbqlMR8N6flDHs4LLcFgRw7FjTAOiGnso7stPxQuVFh3xMrL-7Imwh3ol2kxfJyecefRvF1AHehq*TmWfdp4Sz1aPeJyCaeMYnZc-PGu1FNA3aMAydF37auv5nWCwzCtikq-2mRyGoUc-4m*rXX2ulbWQgE6QFCk8-c95LNUk7chCSPYuZNxv8Ico0X7L2lG1gqXlhc=
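This is a time-limited article link, valid for 6 hours. (The only way I currently know to get a permanent link is the link shared from inside WeChat, which is a permanent short link.)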
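Since the link only lives for six hours, it can help to know when a stored link has gone stale. The sketch below assumes that the timestamp query parameter is the Unix time at which the link was issued; that is a guess from the URL shape, not something confirmed here. The function names are made up.

```python
from urllib.parse import urlparse, parse_qs

LINK_TTL_SECONDS = 6 * 60 * 60  # six-hour lifetime reported above


def link_expiry(url):
    """Return the Unix time at which the link is assumed to expire, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("timestamp")
    if not values:
        return None
    # Assumption: timestamp marks the moment the link was issued.
    return int(values[0]) + LINK_TTL_SECONDS


def is_expired(url, now):
    """True when the link carries a timestamp and its assumed lifetime is over."""
    expiry = link_expiry(url)
    return expiry is not None and now >= expiry
```

Passing time.time() as now gives a quick filter before re-requesting a fresh link.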
This also involves obtaining the SNUID parameter. Visiting a page returns an SNUID value. If you run short, just visit repeatedly to fill a pool in redis.
The code:
    import requests

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Cookie": "ABTEST=5|1562224152|v1; "
                  "IPLOC=CN4401; "
                  "SUID=23198C3D3020910A000000005CF4E970; "
                  "PHPSESSID=tqprlvdubu26a7dg6946ao7ln1;"
                  "SUV=0047C07F3D8C19235CF5C96993884216; "
                  "weixinIndexVisited=1; "
                  "ad=0qd8bZllll2tM2cAlllllV1tNZ6lllll5YZY@lllll9llllll8UiYK@@@@@@@@@@; "
    }
    # Get the SNUID
    response = requests.get(url, headers=headers, proxies=currentnow_proxies)
    # print(requests.utils.dict_from_cookiejar(response.cookies))  # cookies as a dict
    try:
        SNUID = requests.utils.dict_from_cookiejar(response.cookies)["SNUID"]
    except KeyError:
        # No SNUID in the response: retry through the digui_sunid helper, which also returns new proxies.
        SNUID, currentnow_proxies_new = digui_sunid(currentnow_proxies)
        currentnow_proxies = currentnow_proxies_new
        # print(SNUID, "got it")
    else:
        # print(SNUID, "succeeded on the first try")
        pass
    finally:
        # Store the SNUID in the redis pool.
        redis_Verification.sunid_cun(SNUID)
That takes care of the SNUID.
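To make the pool idea concrete, here is a small sketch using only the standard library. SnuidPool is a hypothetical in-memory stand-in for the redis pool (it is not the redis_Verification helper used above), and extract_snuid reads the value from a raw Set-Cookie header instead of a requests cookie jar.

```python
from collections import deque
from http.cookies import SimpleCookie


def extract_snuid(set_cookie_header):
    """Pull the SNUID value out of a raw Set-Cookie header, or return None."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get("SNUID")
    return morsel.value if morsel is not None else None


class SnuidPool:
    """In-memory stand-in for the redis pool (hypothetical helper)."""

    def __init__(self):
        self._values = deque()

    def put(self, snuid):
        # Ignore empty values and duplicates.
        if snuid and snuid not in self._values:
            self._values.append(snuid)

    def get(self):
        # Rotate so that successive calls hand out different values.
        snuid = self._values[0]
        self._values.rotate(-1)
        return snuid

    def __len__(self):
        return len(self._values)
```

Handing values out round-robin spreads requests across the pool; with real redis a list or set would play the same role.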
    {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate, br',
        "Cookie": 'CXID=0470605E8267AD799D585C978932B771; ssuid=1913477052; '
                  'IPLOC=CN4401; '
                  'SUID=23198C3D3020910A000000005CF4E970; sw_uuid=7451527757; start_time=1559554451931; weixinIndexVisited=1; '
                  'SUV=0047C07F3D8C19235CF5C96993884216; JSESSIONID=aaaovhQyMPOyvetfBxjRw; PHPSESSID=tqprlvdubu26a7dg6946ao7ln1; ad=0qd8bZllll2tM2cAlllllV1tNZ6lllll5YZY@lllll9llllll8UiYK@@@@@@@@@@; '
                  'ABTEST=5|1562224152|v1; pgv_pvi=2210944000; pgv_si=s8315080704; '
                  'SNUID=' + snuid + '; sct=45',
    }
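These are the headers that must be sent when calling the interface, including the snuid obtained earlier. The set of required fields was found by trial and error in Postman.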
The snuid is required when getting URLs by keyword. Visiting the resulting URL later to fetch the actual article content does not need it.
3. Alibaba (1688) company yellow pages. Nothing difficult here: rotating the IP and the UA is enough.
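A minimal sketch of that rotation, assuming a list of user-agent strings and a list of proxies are already at hand. The values below are placeholders, and next_request_settings is a made-up name.

```python
import itertools
import random

# Placeholder values: substitute real user-agent strings and proxy addresses.
USER_AGENTS = ["example-agent/1.0", "example-agent/2.0"]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

_proxy_cycle = itertools.cycle(PROXIES)


def next_request_settings():
    """Return (headers, proxies) for the next request."""
    proxy = next(_proxy_cycle)  # proxies are used in round-robin order
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # UA is picked at random
    proxies = {"http": proxy, "https": proxy}
    return headers, proxies
```

Both return values fit the headers and proxies keyword arguments of requests.get.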
Interface: the search interface, used to get the memberid, the unique ID of a company.
    url = "https://s.1688.com/selloffer/offer_search.htm?" \
          "keywords=%BA%BC%D6%DD&" \
          "button_click=top&" \
          "earseDirect=false&" \
          "n=y&" \
          "netType=1%2C11&beginPage=3"
Detail page interface:
    url = "https://corp.1688.com/page/index.htm?" \
          "memberId=%s&" \
          "fromSite=company_site&" \
          "tab=companyWeb_detail" % memberid
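The keywords value %BA%BC%D6%DD in the search URL is the GBK percent-encoding of 杭州, so a helper that takes an arbitrary keyword probably needs to encode it the same way. A sketch follows; the function names are made up, and the assumption that the site wants GBK for every keyword comes only from this one example.

```python
from urllib.parse import quote, urlencode


def search_url(keyword, page):
    """Build the 1688 search URL for a keyword and page number."""
    # Assumption: the site expects GBK percent-encoding for keywords.
    encoded = quote(keyword, encoding="gbk")
    return (
        "https://s.1688.com/selloffer/offer_search.htm?"
        "keywords=%s&button_click=top&earseDirect=false&n=y&"
        "netType=1%%2C11&beginPage=%d" % (encoded, page)
    )


def detail_url(memberid):
    """Build the company detail page URL for a memberid."""
    query = urlencode({
        "memberId": memberid,
        "fromSite": "company_site",
        "tab": "companyWeb_detail",
    })
    return "https://corp.1688.com/page/index.htm?" + query
```

search_url("杭州", 3) reproduces the search URL shown above.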
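4. Weitao data. The key is getting the cookie, which raises the question of how many accounts you have: on a phone no login is needed, but on the web it is. (Perpetual-motion logic: use a contentid to find its related tags, use those tags to find more contentids, and repeat.)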
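The perpetual-motion loop can be sketched as a breadth-first walk. The two fetch functions are passed in as parameters because their real implementations (the tag lookup and the tag-to-content lookup) are the network calls described further down; crawl, tags_for_content, contents_for_tag and limit are names made up for this sketch.

```python
from collections import deque


def crawl(seed_content_ids, tags_for_content, contents_for_tag, limit=1000):
    """Alternate contentid -> tags -> contentid until nothing new turns up."""
    seen_contents = set(seed_content_ids)
    seen_tags = set()
    queue = deque(seed_content_ids)
    # limit is a soft cap: it is only checked between content ids.
    while queue and len(seen_contents) < limit:
        content_id = queue.popleft()
        for tag in tags_for_content(content_id):
            if tag in seen_tags:
                continue  # each tag is expanded only once
            seen_tags.add(tag)
            for found in contents_for_tag(tag):
                if found not in seen_contents:
                    seen_contents.add(found)
                    queue.append(found)
    return seen_contents
```

Tracking seen tags as well as seen contentids is what keeps the loop from spinning on the same tags forever.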
Getting the required parameters: the three cookie values the token depends on.
    import requests

    tk_dic = {}
    index = 2
    num = 20
    url = 'https://acs.m.taobao.com/h5/mtop.taobao.social.feed.aggregate/1.0/'
    appKey = '12574478'
    # Current timestamp
    # t = str(int(time.time() * 1000))
    data = '{"params":"{\\"nodeId\\":\\"\\",\\"sellerId\\":\\"50852803\\",\\"pagination\\":{\\"direction\\":\\"1\\",\\"hasMore\\":\\"true\\",\\"pageNum\\":\\"' + str(
        index) + '\\",\\"pageSize\\":\\"' + str(num) + '\\"}}","cursor":"' + str(
        index) + '","pageNum":"' + str(
        index) + '","pageId":5703,"env":"1"}'
    params = {
        'appKey': appKey,
        'data': data
    }
    # Send an unsigned request just to obtain the cookies
    html = requests.get(url, params=params)
    tk_dic["_m_h5_tk"] = html.cookies['_m_h5_tk']
    tk_dic["_m_h5_tk_enc"] = html.cookies['_m_h5_tk_enc']
    # token = _m_h5_tk.split('_')[0]
    tk_dic["cookie_t"] = html.cookies['t']
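The hand-escaped data string is easy to get wrong. As a sketch, the same payload can be produced with json.dumps, serialising the inner object first and embedding it as a string in the outer one. build_feed_data is a made-up name, and the field values are the ones from the snippet above.

```python
import json


def build_feed_data(seller_id, page_num, page_size, page_id=5703):
    """Build the nested JSON payload for the feed request."""
    inner = {
        "nodeId": "",
        "sellerId": str(seller_id),
        "pagination": {
            "direction": "1",
            "hasMore": "true",
            "pageNum": str(page_num),
            "pageSize": str(page_size),
        },
    }
    outer = {
        # The inner object travels as a JSON string inside the outer object.
        "params": json.dumps(inner, separators=(",", ":")),
        "cursor": str(page_num),
        "pageNum": str(page_num),
        "pageId": page_id,
        "env": "1",
    }
    # Compact separators match the hand-written string, which has no spaces.
    return json.dumps(outer, separators=(",", ":"))
```

Matching the original byte for byte matters once the same string is fed into the sign calculation.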
How the headers are put together:
    headers = {
        "User-Agent": random.choice(setting.USER_AGENTS),
        "Referer": "https://market.m.taobao.com/apps/market/content/index.html?contentId=233072184671",
        "Sec-Fetch-Mode": "no-cors",
        "Cookie": "t=" + html.cookies['t'] + "; _m_h5_tk=" + html.cookies['_m_h5_tk'] + "; _m_h5_tk_enc=" + html.cookies['_m_h5_tk_enc'] + "; "
                  "_nk_=tb17056546156; isg=BPLyKbVF-17lO8diUf2_CS5dQz7Ug_YdvPUJ1bzLHqWQT5JJpBNGLfgtP6rWJG61; sg=62b; skt=e8e9db2c46a59606; dnk=tb17056546156; _cc_=URm48syIZQ%3D%3D; _tb_token_=e3b5b65748e5b; lgc=tb17056546156; csg=6294031f; unb=2206426511902; munb=2206426511902; tracknick=tb17056546156; "
    }
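The first three values in the cookie can be obtained without limit using the method above, so they are not a problem. The long string after them is the information for one account and can only be obtained by actually logging in.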
Everything below is keyed on the contentid, which is the unique ID of an article.
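Generating the sign from the params passed in (worked out by locating the parameter, finding the JS that builds it, setting breakpoints, and stepping through to watch it being generated):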
    import hashlib

    query_params = '{"contentId":' + str(contentid) + ',"source":"daren","type":"h5","params":"","business_spm":"","track_params":""}'
    # token comes from the _m_h5_tk cookie (see the commented-out split('_')[0] line in the cookie code)
    md5_1 = token + "&" + str(t) + "&" + "12574478" + "&" + query_params
    # Compute the MD5 hash
    m = hashlib.md5()
    m.update(bytes(md5_1, encoding="utf-8"))
    sign = m.hexdigest()
t is the timestamp (in milliseconds, as in the commented-out line in the cookie code).
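Put together, the sign calculation can be sketched as below. The helper names are made up. Taking the token to be the part of _m_h5_tk before the first underscore is an assumption based on the commented-out line in the cookie code above.

```python
import hashlib
import time

APP_KEY = "12574478"  # appKey used throughout the notes above


def token_from_cookie(m_h5_tk):
    # Assumption: the token is the part of _m_h5_tk before the first underscore.
    return m_h5_tk.split("_")[0]


def make_sign(token, t, data, app_key=APP_KEY):
    """MD5 hex digest of token&t&appKey&data, as in the notes above."""
    raw = "&".join([token, str(t), app_key, data])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def now_millis():
    """Current time as a millisecond timestamp."""
    return int(time.time() * 1000)
```

The same t and data that go into make_sign also have to appear in the request URL, otherwise the server-side check cannot match.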
Interface for getting contentids from tags (the labels related to an article):

    "https://h5api.m.taobao.com/h5/mtop.taobao.beehive.list.findcontentlist/1.0/?jsv=2.4.5&appKey=12574478&t=" + str(t) + "&sign=" + sign + "&api=mtop.taobao.beehive.list.findContentList&v=1.0&preventFallback=true&type=jsonp&dataType=jsonp&callback=mtopjsonp2&data=" + params
Interface for the article content itself:
    url = 'https://h5api.m.taobao.com/h5/mtop.taobao.beehive.detail.contentservicenewv2/1.0/?jsv=2.5.1&appKey=12574478&t=' + str(millis) + '&sign=' + sign + '&api=mtop.taobao.beehive.detail.contentservicenewv2&v=1.0&AntiCreep=true&AntiFlood=true&type=jsonp&dataType=jsonp&callback=mtopjsonp1&data=' + query_params
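Both interfaces ask for type=jsonp, so the body presumably comes back wrapped in the callback name (mtopjsonp1 or mtopjsonp2). A sketch of building the detail URL and unwrapping such a response is below. The function names are made up, and unlike the plain string concatenation above, urlencode percent-encodes the data value.

```python
import json
import re
from urllib.parse import urlencode

# callbackName( ... ) with an optional trailing semicolon
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Strip the callback wrapper from a JSONP body and decode the JSON."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


def build_detail_url(t, sign, data):
    """Build the content detail URL with the query string percent-encoded."""
    query = urlencode({
        "jsv": "2.5.1",
        "appKey": "12574478",
        "t": t,
        "sign": sign,
        "api": "mtop.taobao.beehive.detail.contentservicenewv2",
        "v": "1.0",
        "AntiCreep": "true",
        "AntiFlood": "true",
        "type": "jsonp",
        "dataType": "jsonp",
        "callback": "mtopjsonp1",
        "data": data,
    })
    return ("https://h5api.m.taobao.com/h5/"
            "mtop.taobao.beehive.detail.contentservicenewv2/1.0/?" + query)
```

The sign itself should still be computed over the raw, unencoded data string.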