[Python] Scraping and Downloading Instagram Post Information, Images, and Videos

Latest recommended article published on 2024-08-08 16:19:04

greatmasonw


Tags: python html

Article link: https://blog.csdn.net/aloe_gel/article/details/105345095


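This project uses a Python crawler to collect the details of an Instagram account's posts, including the time, username, full name, text, like count, comment count, image descriptions, and image/video links, and to download the images and videos. It involves HTML parsing, an HTTPS proxy, and browser automation.

Summary generated automatically by CSDN.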

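Contents
0. Project overview
1. Scraping the links of all posts on a profile page
2. Scraping and downloading the information, images, and videos on a post page
3. Complete code

0. Project overview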

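The goal of this project is to take an Instagram username, a profile page link, or a post page link as input and to output, for that account's posts: ① a text document containing the current time, the time the post was uploaded (in the local time zone), the username, the full name, the post text, the like count, the comment count, the image descriptions (when the post contains images), the image links (when the post contains images), the video view count (when the post contains videos), and the video links (when the post contains videos); ② the images (when the post contains images) and the videos (when the post contains videos).

The project first needs to import the following libraries: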

from selenium import webdriver
from multiprocessing import Pool
import json, time, os

The project's global variables are as follows:

sslPort = 
fxBinaryPath = ''
geckodriverPath = ''
pageDownJS = 'document.documentElement.scrollTop = 100000000'
outputPath = ''
ariaPath = ''
httpsProxy = 'https://127.0.0.1:{}/'.format(str(sslPort))

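Explanation:

sslPort: the local port of an HTTPS proxy that can reach Instagram.
fxBinaryPath: the absolute path to the Firefox browser's firefox.exe.
geckodriverPath: the absolute path to geckodriver.exe.
pageDownJS: the JavaScript code used to scroll the page down.
outputPath: the output path.
ariaPath: the absolute path to aria2c.exe.
httpsProxy: the HTTPS proxy used by GNU Wget for Windows.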
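Since most of these globals are file-system paths that start out empty, it can help to check them before launching the browser. The following is only a minimal sketch, not part of the original project: the helper names find_missing_paths and build_proxy_url, the demo settings, and port 8080 are all made up for illustration.

```python
import os
import tempfile


def find_missing_paths(settings):
    """Return the names of settings that are empty or do not exist on disk."""
    return [name for name, path in settings.items()
            if not path or not os.path.exists(path)]


def build_proxy_url(port):
    """Format the local proxy address the same way httpsProxy does."""
    return 'https://127.0.0.1:{}/'.format(port)


# Demo: a temporary directory stands in for real install locations.
with tempfile.TemporaryDirectory() as tmp:
    demo_settings = {
        'outputPath': tmp,                             # exists
        'ariaPath': os.path.join(tmp, 'aria2c.exe'),   # never created
        'geckodriverPath': '',                         # left empty
    }
    missing = find_missing_paths(demo_settings)

print(missing)
print(build_proxy_url(8080))  # 8080 is an arbitrary example port
```

Running such a check at the start of the script would report unset paths up front instead of failing later inside Selenium or the downloader.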

The basic structure of the project is as follows:

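def Driver():
	# Driver builds a Firefox browser instance and returns it
	pass

class DOWNLOAD:
	# DOWNLOAD is a multiprocess download tool
	pass

class POST:
	# POST scrapes and downloads the information, images, and videos on a post page
	pass

class PROFILE:
	# PROFILE scrapes the links of all posts on a profile page
	pass

def Main():
	# Main is the entry point: it takes an Instagram username, a profile page link,
	# or a post page link, and drives the classes and functions above
	pass

if __name__ == '__main__':
	Main()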

The project's run flow can be seen in the Main function:

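def Main():
	# The PROFILE methods shown below refer to a module-level fxDriver,
	# so the browser instance is published as a global here
	global fxDriver
	fxDriver = Driver()
	inputUrl = input('Please input instagram link or username: ')
	
	if '/p/' in inputUrl:
		# A post page link: scrape that single post
		POST(fxDriver, inputUrl).Main()
	else:
		# A bare username: turn it into a profile page link
		if 'www.instagram.com' not in inputUrl:
			inputUrl = 'https://www.instagram.com/{}/'.format(inputUrl)
		# PROFILE.__init__ takes only the profile URL
		urlList = PROFILE(inputUrl).Main()
		if urlList:
			total = len(urlList)
			for i, url in enumerate(urlList, 1):
				POST(fxDriver, url).Main()
				print('\n\n{:.2f} % completed.\n\n'.format(i / total * 100))
	
	fxDriver.quit()
	# Start over and prompt for the next input
	Main()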
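The branching at the top of Main can be isolated into a small pure function, which makes the three accepted input forms easy to check without a browser. This is a sketch only; classify_input and the sample inputs are hypothetical names and values, not part of the original code.

```python
def classify_input(text):
    """Return ('post', url) for a post link, otherwise ('profile', url)."""
    text = text.strip()
    if '/p/' in text:
        # Post page links contain the '/p/' path segment.
        return 'post', text
    if 'www.instagram.com' not in text:
        # A bare username: build the profile URL the same way Main does.
        text = 'https://www.instagram.com/{}/'.format(text)
    return 'profile', text


print(classify_input('https://www.instagram.com/p/ABC123/'))
print(classify_input('https://www.instagram.com/some_user/'))
print(classify_input('some_user'))
```

Like the original check, this relies on simple substring tests, so a link without the www. prefix would be treated as a username.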

1. Scraping the links of all posts on a profile page

The basic structure of this step is as follows:

def Main(self):
	try:
		fxDriver.get(self.profileUrl)
		urlList = self.GetWholePage()
		return urlList
	except Exception as e:
		print(e)

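Explanation: ① The browser opens the profile page. ② self.GetWholePage() scrapes the links of all posts on the profile page and builds the link list urlList.

self.GetWholePage() is as follows: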

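def GetWholePage(self):
	updateCount = self.Update()
	fxDriver.execute_script(pageDownJS)
	
	# Click the "更多帖子" ("more posts") button if the Chinese-language page shows one.
	# find_element_by_xpath is the Selenium 3 API; Selenium 4 uses find_element(By.XPATH, ...).
	try:
		fxDriver.find_element_by_xpath('//div[contains(text(), "更多帖子")]').click()
	except Exception as e:
		print(e)
	
	locY, urlDict = self.GetLocY()
	
	while True:
		fxDriver.execute_script(pageDownJS)
		while True:
			# Poll until the last link sits further down the page, i.e. new posts have loaded
			locYNew, urlDictNew = self.JudgeLoading(locY, urlDict)
			# Order the links by their position on the page
			urlList = [t[0] for t in sorted(urlDictNew.items(), key = lambda x:x[1])]
			
			if len(urlList) >= updateCount:
				return urlList[: updateCount]
			
			if locYNew is None:
				continue
			else:
				locY = locYNew
				urlDict = urlDictNew
				break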

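Explanation:

self.Update() computes the number of posts that need to be updated.
fxDriver.execute_script(pageDownJS) scrolls the page to the bottom by executing the JavaScript code pageDownJS.
self.GetLocY() returns locY, the Y coordinate of the last post link found in the profile page's HTML, and urlDict, a dictionary of all post links currently loaded.
self.JudgeLoading(locY, urlDict) compares the Y coordinate passed in with the Y coordinate 0.5 seconds later to judge whether the scroll triggered by pageDownJS has finished loading new posts.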
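The scroll, poll, and merge cycle described above can be tried out without a browser. The sketch below is a simulation under stated assumptions: FakePage is a hypothetical stand-in for the Firefox instance, new content is detected by link count rather than by Y coordinate, and the max_polls bound is an addition that the original loop does not have.

```python
import time


class FakePage:
    """Hypothetical page: each scroll reveals one more batch after a few polls."""

    def __init__(self, total, batch=3, polls_to_load=2):
        self.total = total
        self.batch = batch
        self.polls_to_load = polls_to_load
        self.loaded = min(batch, total)
        self.pending = 0

    def scroll_to_bottom(self):
        # Scrolling only triggers loading while unseen posts remain.
        if self.loaded < self.total:
            self.pending = self.polls_to_load

    def visible_links(self):
        # Each poll counts down the simulated loading delay.
        if self.pending:
            self.pending -= 1
            if self.pending == 0:
                self.loaded = min(self.total, self.loaded + self.batch)
        return {'/p/{}/'.format(i): i for i in range(self.loaded)}


def collect_links(page, wanted, poll_interval=0.0, max_polls=50):
    """Scroll and poll until `wanted` links are seen or polling gives up."""
    seen = dict(page.visible_links())
    polls = 0
    while len(seen) < wanted and polls < max_polls:
        page.scroll_to_bottom()
        before = len(seen)
        while polls < max_polls:
            time.sleep(poll_interval)
            polls += 1
            seen.update(page.visible_links())
            if len(seen) > before:
                break  # new links arrived, scroll again
    # Order by position value, then keep only as many as requested.
    ordered = [url for url, pos in sorted(seen.items(), key=lambda kv: kv[1])]
    return ordered[:wanted]


print(collect_links(FakePage(total=8), wanted=7))
```

The bound matters when the page holds fewer posts than expected: without it, a loop of this shape would keep polling forever once nothing new loads.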

self.Update() is as follows:

def Update(self):
	# Find the inline script whose JSON payload carries the profile data
	for e in fxDriver.find_elements_by_xpath('//script[@type="text/javascript"]'):
		try:
			jsonText = e.get_attribute('textContent')
			if 'viewerId' in jsonText:
				# Parse the span from the first '{' to the last '}' as JSON
				jsonData = json.loads(jsonText[jsonText.find('{'): jsonText.rfind('}') + 1])['entry_data']['ProfilePage'][0]['graphql']['user']
				break
		except Exception:
			continue
	
	# Total number of posts reported by the page
	postCount = jsonData['edge_owner_to_timeline_media']['count']
	username = jsonData['username']
	folder = '{}\\{}'.format(outputPath, username)
	
	# Count the subfolders already present, one per downloaded post
	if os.path.exists(folder):
		downloadCount = len([x for x in os.listdir(folder) if os.path.isdir('{}\\{}'.format(folder, x))])
	else:
		downloadCount = 0
	
	updateCount = postCount - downloadCount
	
	return updateCount

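Explanation: ① Parse the post count from the page. ② Count how many posts have already been downloaded. ③ Compute the number of posts that need to be updated.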
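The three steps can be reproduced offline with a made-up script body and a temporary folder. This is a sketch under assumptions: sample_script is invented and much flatter than the real page data, the helper names are hypothetical, and os.path.join replaces the Windows-only '\\' formatting used in the article.

```python
import json
import os
import tempfile


def extract_json(script_text):
    """Parse the span from the first '{' to the last '}' as JSON."""
    start = script_text.find('{')
    end = script_text.rfind('}') + 1
    return json.loads(script_text[start:end])


def count_subfolders(folder):
    """Count the directories directly inside folder (0 if it does not exist)."""
    if not os.path.isdir(folder):
        return 0
    return sum(1 for name in os.listdir(folder)
               if os.path.isdir(os.path.join(folder, name)))


# Invented script body: a JavaScript assignment wrapping a JSON object.
sample_script = ('window.exampleData = {"user": {"username": "demo_user", '
                 '"edge_owner_to_timeline_media": {"count": 5}}};')
user = extract_json(sample_script)['user']
post_count = user['edge_owner_to_timeline_media']['count']

with tempfile.TemporaryDirectory() as output_path:
    folder = os.path.join(output_path, user['username'])
    os.makedirs(os.path.join(folder, 'post_1'))
    os.makedirs(os.path.join(folder, 'post_2'))
    # Plain files are ignored; only subfolders count as downloaded posts.
    with open(os.path.join(folder, 'notes.txt'), 'w') as f:
        f.write('not a post folder')
    download_count = count_subfolders(folder)

update_count = post_count - download_count
print(update_count)
```

Because the count is a plain difference, any extra or missing subfolder in the user's directory shifts the number of posts fetched on the next run.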

self.GetLocY() is as follows:

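def GetLocY(self):
	urlDict = {}
	locY = 0  # stays 0 when no post link is found
	
	for e in fxDriver.find_elements_by_xpath('//a[contains(@href, "/p/")]'):
		locY = e.location['y']
		locX = e.location['x']
		url = e.get_attribute('href')
		# Position value used for sorting: Y dominates, X breaks ties within a row
		urlDict[url] = locX/1000 + locY
	
	return locY, urlDict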

Explanation: loop over the a tags whose 'href' attribute contains '/p/' to obtain each post link together with the Y coordinate of its tag.
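The value stored for each link, locX/1000 + locY, doubles as a sort key: Y decides the row and the scaled X orders links within a row. A small sketch with invented coordinates (position_key and tiles are hypothetical names) shows the effect:

```python
def position_key(x, y):
    """Combine coordinates so that sorting goes row by row, left to right."""
    return x / 1000 + y


# Invented layout: three thumbnails in the first row, one in the second.
tiles = {
    '/p/c/': (600, 0),
    '/p/a/': (0, 0),
    '/p/b/': (300, 0),
    '/p/d/': (0, 300),
}
ordered = sorted(tiles, key=lambda url: position_key(*tiles[url]))
print(ordered)
```

The ordering holds as long as x / 1000 stays smaller than the vertical gap between rows, which is the case for ordinary page widths.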

self.JudgeLoading(locY, urlDict) is as follows:

def JudgeLoading(self, locY, urlDict):
	time.sleep(0.5)		
	locYNew, urlDictNew = self.GetLocY()
	
	if locYNew > locY:
		urlDictNew.update(urlDict)
	else:
		locYNew = None
	
	return locYNew, urlDictNew
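One detail of JudgeLoading is worth spelling out: urlDictNew.update(urlDict) copies the earlier entries into the fresh dictionary, so links seen before are kept, and for a link present in both the earlier position value wins. The sketch below uses invented values to show this. The merge is presumably needed because earlier links may no longer be present in the page after scrolling, but that reason is an assumption, not something the article states.

```python
# Invented position values from two consecutive polls.
old_links = {'/p/a/': 0.0, '/p/b/': 0.3}
new_links = {'/p/b/': 10.3, '/p/c/': 300.0}

merged = dict(new_links)   # copy so the demo leaves new_links untouched
merged.update(old_links)   # same call order as urlDictNew.update(urlDict)

print(merged)
```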

The modules above are combined into a class as follows:

class PROFILE:
	
	def __init__(self, profileUrl):
		self.profileUrl = profileUrl
	
	def Update(self):
		for e in fxDriver.find_elements_by_xpath('//script[@type="text/javascript"]'):
			try:
				jsonText = e.get_attribute('textContent')
				if 'viewerId' in jsonText:
					jsonData = json.loads(jsonText[jsonText.find('{'): jsonText.rfind('}') + 1])['entry_data']['ProfilePage'][0]['graphql']['user']
					break
			except Exception:
				continue
		
		postCount = jsonData['edge_owner_to_timeline_media']['count']
		username = jsonData['username']
		folder = '{}\\{}'.format(outputPath, username)
		
		if os.path.exists(folder):
			downloadCount = len([x for x in os.listdir(folder) if os.path.isdir('{}\\{}'.format(folder, x))])
		else:
			downloadCount = 0
		
		updateCount = postCount - downloadCount
		
		return updateCount
	
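	def GetLocY(self):
		urlDict = {}
		locY = 0  # stays 0 when no post link is found
		
		for e in fxDriver.find_elements_by_xpath('//a[contains(@href, "/p/")]'):
			locY = e.location['y']
			locX = e.location['x']
			url = e.get_attribute('href')
			# Position value used for sorting: Y dominates, X breaks ties within a row
			urlDict[url] = locX/1000 + locY
		
		return locY, urlDict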
	
	def JudgeLoading(self, locY, urlDict):
		time.sleep(0.5)		
		locYNew, urlDictNew = self.GetLocY()
		
		if locYNew > locY:
			urlDictNew.update(urlDict)
		else:
			locYNew = None
		
		return locYNew, urlDictNew
	
	def GetWholePage(self)
