Recursively Crawling All Articles, Videos, and Weitoutiao Posts Published by a Given Toutiao User Within One Month, in Python (1)

Latest recommended article published 2024-07-27 23:04:11

2401_84139095

Category: Programmer. Tags: python, programming languages

Article link: https://blog.csdn.net/2401_84139095/article/details/138494167

import csv
import hashlib
import json
import random
import re
import time
from base64 import b64decode
from datetime import datetime
from zlib import crc32

import execjs  # provided by the PyExecJS package
import requests
from lxml import etree


def headers():
    # Pool of desktop (PC) browser User-Agent strings
    user_agent_list = [
        # Opera
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        # Firefox
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        # Safari
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        # Chrome
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        # 360 Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        # Taobao Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        # Liebao (Cheetah) Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        # QQ Browser
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        # Sogou Browser
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        # Maxthon Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        # UC Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    ]
    # Pick a random User-Agent for each call
    user_agent = random.choice(user_agent_list)
    return {'User-Agent': user_agent}

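# Fixed headers used for the list (feed) API requests
headers_a = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
}

# Proxy IP (defined here, but not passed to any of the requests shown in this post)
proxy = {
    'http': '183.57.44.62:808'
}

# Cookie values
cookies = {'s_v_web_id': 'b68312370162a4754efb0510a0f6d394'}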

# Get _signature
def get_signature(user_id, max_behot_time):
    with open('newsign.js', 'r', encoding='utf-8') as f:
        js_data = f.read()
    execjs.get()  # raises if no JavaScript runtime is available
    # Reproduces TAC.sign(userInfo.id + "" + i.param.max_behot_time)
    return execjs.compile(js_data).call('tac', str(user_id) + str(max_behot_time))

# Get as and cp
def get_as_cp():
    # Computes the as and cp parameters; based on Toutiao's encryption JS file home_4abea46.js
    now = round(time.time())  # current Unix time in seconds
    e = hex(int(now)).upper()[2:]  # hex() gives the hexadecimal string of an integer
    # MD5 of the decimal timestamp string, as an uppercase hex digest
    i = hashlib.md5(str(int(now)).encode('utf-8')).hexdigest().upper()
    if len(e) != 8:
        # Fixed fallback used when the hex timestamp is not 8 characters long
        return {'as': '479BB4B7254C150',
                'cp': '7E0AC8874BB0985'}
    n = i[:5]   # first five characters of the digest
    a = i[-5:]  # last five characters of the digest
    r = ''
    s = ''
    for k in range(5):
        s = s + n[k] + e[k]
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    print('zz:', zz)
    return zz
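To make the interleaving in get_as_cp easier to check, here is a minimal sketch of the same computation with the timestamp passed in as an argument. The name `compute_as_cp` and its `timestamp` parameter are illustrative assumptions, not part of the original script; the fallback values are the ones used above.

```python
import hashlib


def compute_as_cp(timestamp):
    """Sketch of the as/cp derivation for a given Unix timestamp (hypothetical helper)."""
    e = format(int(timestamp), 'X')  # uppercase hex of the timestamp
    digest = hashlib.md5(str(int(timestamp)).encode('utf-8')).hexdigest().upper()
    if len(e) != 8:
        # Same fixed fallback as in get_as_cp
        return {'as': '479BB4B7254C150', 'cp': '7E0AC8874BB0985'}
    head, tail = digest[:5], digest[-5:]
    # MD5 head interleaved with the first five hex digits
    s = ''.join(head[k] + e[k] for k in range(5))
    # Hex digits 3..7 interleaved with the MD5 tail
    r = ''.join(e[k + 3] + tail[k] for k in range(5))
    return {'as': 'A1' + s + e[-3:], 'cp': e[:3] + r + 'E1'}


params = compute_as_cp(1600000000)  # 1600000000 == 0x5F5E1000
print(params)
```

Both values are 15 characters long: `as` is `A1`, ten interleaved characters, and the last three hex digits; `cp` is the first three hex digits, ten interleaved characters, and `E1`.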

# Get as, cp and _signature (deprecated)
def get_js():
    # Read the whole JS file and compile it with PyExecJS
    with open(r"juejin.js", 'r', encoding='UTF-8') as f:
        htmlstr = f.read()
    ctx = execjs.compile(htmlstr)
    return ctx.call('get_as_cp_signature')


# Example call, left disabled because this function is deprecated:
# print(json.loads(get_js())['as'])

# Article data
break_flag = []


def wenzhang(url=None, max_behot_time=0, n=0, csv_name=0):
    max_qingqiu = 50  # maximum number of failed requests before giving up
    headers1 = ['Published', 'Title', 'Source', 'Images', 'Content']
    user_id = url.split('/')[-2]
    as_cp = get_as_cp()  # computed once so that as and cp share one timestamp
    first_url = 'https://www.toutiao.com/c/user/article/?page_type=1&user_id=%s&max_behot_time=%s&count=20&as=%s&cp=%s&_signature=%s' % (
        user_id, max_behot_time, as_cp['as'], as_cp['cp'],
        get_signature(user_id, max_behot_time))
    # Matches Latin letters, digits and markup punctuation
    pattern = re.compile(r"[a-zA-Z~\-_!@#$%^+*&\/?|:.<>{}()';=\d]")
    while n < max_qingqiu and not break_flag:
        try:
            print(url)
            r = requests.get(first_url, headers=headers_a, cookies=cookies)
            data = json.loads(r.text)
            print(data)
            # A response without 'next' raises KeyError and is retried below
            max_behot_time = data['next']['max_behot_time']
            if not max_behot_time:
                # No further pages: stop instead of requesting the same URL forever
                break_flag.append(1)
                break
            for i in data['data']:
                detail_url = 'https://www.toutiao.com/i' + str(i.get('group_id', ''))
                try:
                    if i['article_genre'] != 'article':
                        continue
                    res = requests.get(detail_url, headers=headers(), cookies=cookies)
                    time.sleep(1)
                    article_title = re.findall("title: '(.*?)'", res.text)
                    article_content = re.findall("content: '(.*?)'", res.text, re.S)[0]
                    # Undo the escaping of the embedded HTML
                    # ('&quot;' and '&#x3D;' are assumed to be the entities meant here)
                    article_content = (article_content.replace('&quot;', '')
                                       .replace('u003C', '<').replace('u003E', '>')
                                       .replace('&#x3D;', '=').replace('u002F', '/')
                                       .replace('\\', ''))
                    # Collect the image URLs before the markup is stripped
                    article_image = etree.HTML(article_content).xpath('//img/@src')
                    article_content = re.sub(pattern, '', article_content)
                    article_time = re.findall("time: '(.*?)'", res.text)
                    article_source = re.findall("source: '(.*?)'", res.text, re.S)
                    result_time = str(article_time[0]).split(' ')[0].split('-')
                    print(result_time)
                    # Age of the article in days
                    cha = (datetime.now() - datetime(int(result_time[0]), int(result_time[1]),
                                                     int(result_time[2]))).days
                    print(cha)
                    if 30 < cha <= 32:
                        print('Done')
                        break_flag.append(1)
                        continue
                    if cha > 32:
                        print('Done')
                        break_flag.append(1)
                        break
                    row = {'Published': article_time[0], 'Title': article_title[0].strip('"'),
                           'Source': article_source[0], 'Images': article_image,
                           'Content': article_content.strip()}
                    with open('/toutiao/' + str(csv_name) + '_articles.csv', 'a', newline='',
                              encoding='gb18030') as f:
                        f_csv = csv.DictWriter(f, headers1)
                        if f.tell() == 0:  # write the header only once, for a new file
                            f_csv.writeheader()
                        f_csv.writerow(row)
                    print('Crawling article:', article_title[0].strip('"'), article_time[0], detail_url)
                    time.sleep(1)
                except Exception as e:
                    print(e, detail_url)
            if not break_flag:
                # Recurse to fetch the next page
                wenzhang(url=url, max_behot_time=max_behot_time, csv_name=csv_name, n=n)
            return  # the recursive call has handled all remaining pages
        except KeyError:
            n += 1
            print('Request attempt ' + str(n), first_url)
            time.sleep(1)
            if n == max_qingqiu:
                print('Maximum number of requests exceeded')
                break_flag.append(1)
        except Exception as e:
            print(e)
            n += 1  # count other failures too, so that the loop always terminates
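wenzhang combines a retry loop with one recursive call per page. A similar control flow can be written as a single loop that follows `next.max_behot_time` until it is empty. The sketch below is only an illustration: `iter_pages` and the `fetch_page` callable (standing in for the request plus JSON decoding) are hypothetical names, and the fake responses are made up.

```python
def iter_pages(fetch_page, max_retries=50):
    """Yield each page's item list, following next.max_behot_time (hypothetical helper).

    fetch_page(max_behot_time) is assumed to return the decoded JSON dict.
    """
    max_behot_time = 0
    failures = 0
    while failures < max_retries:
        data = fetch_page(max_behot_time)
        try:
            next_cursor = data['next']['max_behot_time']
        except KeyError:
            failures += 1  # response without 'next': retry the same cursor
            continue
        yield data.get('data', [])
        if not next_cursor:  # an empty cursor means there are no further pages
            return
        max_behot_time = next_cursor


# Fake fetcher simulating one failed response followed by two pages
responses = iter([
    {},  # missing 'next', so it is retried
    {'next': {'max_behot_time': 111}, 'data': ['a', 'b']},
    {'next': {'max_behot_time': 0}, 'data': ['c']},
])
pages = list(iter_pages(lambda cursor: next(responses)))
print(pages)  # [['a', 'b'], ['c']]
```

Written this way, the retry counter and the cursor live in one place, and the caller decides when to stop consuming pages (for example, once an item is older than the cutoff).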

# Article detail page data (already merged into the article data function)
def get_wenzhang_detail(url, csv_name=0):
    headers1 = ['Published', 'Title', 'Source', 'Content']
    res = requests.get(url, headers=headers_a, cookies=cookies)
    time.sleep(1)
    article_title = re.findall("title: '(.*?)'", res.text)
    article_content = re.findall("content: '(.*?)'", res.text, re.S)
    # Strip Latin letters, digits and markup punctuation from the content
    pattern = re.compile(r"[a-zA-Z~\-_!@#$%^+*&\/?|:.<>{}()';=\d]")
    article_content = re.sub(pattern, '', article_content[0])
    article_time = re.findall("time: '(.*?)'", res.text)
    article_source = re.findall("source: '(.*?)'", res.text, re.S)
    result_time = str(article_time[0]).split(' ')[0].split('-')
    print(result_time)
    # Age of the article in days
    cha = (datetime.now() - datetime(int(result_time[0]), int(result_time[1]), int(result_time[2]))).days
    print(cha)
    if cha > 8:
        return None
    row = {'Published': article_time[0], 'Title': article_title[0].strip('"'), 'Source': article_source[0],
           'Content': article_content.strip()}
    with open('/toutiao/' + str(csv_name) + '_articles.csv', 'a', newline='') as f:
        f_csv = csv.DictWriter(f, headers1)
        if f.tell() == 0:  # write the header only once, for a new file
            f_csv.writeheader()
        f_csv.writerow(row)
    print('Crawling article:', article_title[0].strip('"'), article_time[0], url)
    time.sleep(0.5)
    return 'ok'
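The age check above splits the time string by hand. As a minimal sketch, the same number can be obtained with `datetime.strptime`, assuming the `YYYY-MM-DD HH:MM:SS` format implied by the splitting logic. The helper name `days_since` and its `now` parameter are hypothetical and exist only to make the example reproducible.

```python
from datetime import datetime


def days_since(published, now=None):
    """Whole days between the publication date (time of day ignored) and now."""
    day = datetime.strptime(published.split(' ')[0], '%Y-%m-%d')
    now = now or datetime.now()
    return (now - day).days


reference = datetime(2020, 3, 1, 12, 0, 0)
print(days_since('2020-02-20 08:30:00', now=reference))  # 10
print(days_since('2020-01-15 23:59:59', now=reference))  # 46
```

With this helper, the one-month cutoff becomes a plain comparison such as `days_since(article_time[0]) > 32`.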

# Video data
break_flag_video = []


def shipin(url, max_behot_time=0, csv_name=0, n=0):
    max_qingqiu = 20  # maximum number of failed requests before giving up
    headers2 = ['Published', 'Title', 'Source', 'Video URL']
    user_id = url.split('/')[-2]
    as_cp = get_as_cp()  # computed once so that as and cp share one timestamp
    first_url = 'https://www.toutiao.com/c/user/article/?page_type=0&user_id=%s&max_behot_time=%s&count=20&as=%s&cp=%s&_signature=%s' % (
        user_id, max_behot_time, as_cp['as'], as_cp['cp'],
        get_signature(user_id, max_behot_time))
    while n < max_qingqiu and not break_flag_video:
        try:
            res = requests.get(first_url, headers=headers_a, cookies=cookies)
            data = json.loads(res.text)
            print(data)
            # A response without 'next' raises KeyError and is retried below
            max_behot_time = data['next']['max_behot_time']
            if not max_behot_time:
                # No further pages: stop instead of requesting the same URL forever
                break_flag_video.append(1)
                break
            for i in data['data']:
                detail_url = 'https://www.ixigua.com/i' + str(i.get('item_id', ''))
                try:
                    start_time = i['behot_time']
                    video_title = i['title']
                    video_source = i['source']
                    resp = requests.get(detail_url, headers=headers())
                    r = str(random.random())[2:]
                    url_part = "/video/urls/v/1/toutiao/mp4/{}?r={}".format(
                        re.findall('"video_id":"(.*?)"', resp.text)[0], r)
                    # The API expects the CRC32 of the path and query as parameter s
                    s = crc32(url_part.encode())
                    api_url = "https://ib.365yg.com{}&s={}".format(url_part, s)
                    resp = requests.get(api_url, headers=headers())
                    j_resp = resp.json()
                    video_url = j_resp['data']['video_list']['video_1']['main_url']
                    # main_url is base64-encoded
                    video_url = b64decode(video_url.encode()).decode()
                    # Age of the video in days
                    age_days = (int(time.time()) - start_time) / 86400
                    print(age_days)
                    if 30 < age_days <= 32:
                        print('Done')
                        break_flag_video.append(1)
                        continue
                    if age_days > 32:
                        print('Done')
                        break_flag_video.append(1)
                        break
                    row = {'Published': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)),
                           'Title': video_title, 'Source': video_source,
                           'Video URL': video_url}
                    with open('/toutiao/' + str(csv_name) + '_videos.csv', 'a', newline='',
                              encoding='gb18030') as f:
                        f_csv = csv.DictWriter(f, headers2)
                        if f.tell() == 0:  # write the header only once, for a new file
                            f_csv.writeheader()
                        f_csv.writerow(row)
                    print('Crawling video:', video_title, detail_url, video_url)
                    time.sleep(3)
                except Exception as e:
                    print(e, detail_url)
            if not break_flag_video:
                # Recurse to fetch the next page
                shipin(url=url, max_behot_time=max_behot_time, csv_name=csv_name, n=n)
            return  # the recursive call has handled all remaining pages
        except KeyError:
            n += 1
            print('Request attempt ' + str(n), first_url)
            time.sleep(3)
            if n == max_qingqiu:
                print('Maximum number of requests exceeded')
                break_flag_video.append(1)
        except Exception as e:
            print(e)
            n += 1  # count other failures too, so that the loop always terminates
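shipin resolves the real video address in two steps: it signs the API path with a CRC32 checksum, then base64-decodes the `main_url` field of the response. The sketch below isolates those two steps. The helper names `build_video_api_url` and `decode_main_url` are hypothetical, the host and path are the ones used above, and the sample video id and encoded URL are made up.

```python
import random
from base64 import b64decode, b64encode
from zlib import crc32


def build_video_api_url(video_id, r=None):
    """Build the signed API URL for a video id (hypothetical helper)."""
    if r is None:
        r = str(random.random())[2:]  # random digits, as in shipin
    url_part = "/video/urls/v/1/toutiao/mp4/{}?r={}".format(video_id, r)
    s = crc32(url_part.encode())  # unsigned CRC32 of the path and query
    return "https://ib.365yg.com{}&s={}".format(url_part, s)


def decode_main_url(main_url):
    """Decode the base64-encoded main_url field into a plain URL."""
    return b64decode(main_url.encode()).decode()


print(build_video_api_url("v0200example", r="12345"))
sample = b64encode(b"http://example.com/video.mp4").decode()  # made-up value
print(decode_main_url(sample))  # http://example.com/video.mp4
```

Keeping the two steps in small functions makes them testable without any network access.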

# Weitoutiao (micro-headline) posts
break_flag_weitoutiao = []


def weitoutiao(url, max_behot_time=0, n=0, csv_name=0):
    max_qingqiu = 20  # maximum number of failed requests before giving up
    headers3 = ['Published', 'Source', 'Title', 'Images', 'Content']
    while n < max_qingqiu and not break_flag_weitoutiao:
        try:
            first_url = 'https://www.toutiao.com/api/pc/feed/?category=pc_profile_ugc&utm_source=toutiao&visit_user_id=%s&max_behot_time=%s' % (
                url.split('/')[-2], max_behot_time)
            print(first_url)
            res = requests.get(first_url, headers=headers_a, cookies=cookies)
            data = json.loads(res.text)
            print(data)
            max_behot_time = data['next']['max_behot_time']
            weitoutiao_list = data['data']
            for i in weitoutiao_list:
                try:
                    detail_url = 'https://www.toutiao.com/a' + str(i['concern_talk_cell']['id'])
                    print(detail_url)
                    resp = requests.get(detail_url, headers=headers(), cookies=cookies)
                    start_time = re.findall("time: '(.*?)'", resp.text, re.S)
                    weitoutiao_name = re.findall("name: '(.*?)'", resp.text, re.S)
                    weitoutiao_title = re.findall("title: '(.*?)'", resp.text, re.S)
                    # The square brackets are escaped so that they match literally
                    weitoutiao_images = re.findall(r'images: \["(.*?)"\]', resp.text, re.S)
                    print(weitoutiao_images)
                    if weitoutiao_images:
                        weitoutiao_image = 'http:' + weitoutiao_images[0].replace('u002F', '/').replace('\\', '')
                    # ... (the listing ends here in the source)
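For the `images` field extracted in weitoutiao, the square brackets have to be escaped in the regular expression; otherwise `[...]` is read as a character class and the pattern does not match the field. The following sketch runs the extraction on a made-up page snippet. The helper name `extract_first_image` is hypothetical, and splitting on `","` to keep only the first entry is an added assumption about how multiple images are listed.

```python
import re


def extract_first_image(page_text):
    """Return the first image URL from an 'images: [...]' field, or None."""
    found = re.findall(r'images: \["(.*?)"\]', page_text, re.S)
    if not found:
        return None
    first = found[0].split('","')[0]  # keep only the first entry of the list
    # Turn the escaped u002F sequences back into '/' and add the scheme
    return 'http:' + first.replace('u002F', '/').replace('\\', '')


snippet = r'images: ["\u002F\u002Fp1.example.com\u002Fimg\u002Fabc"]'
print(extract_first_image(snippet))  # http://p1.example.com/img/abc
```

The replacements mirror the ones in weitoutiao: `u002F` becomes `/`, and the leftover backslashes are dropped.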
