[Study Notes] Crawling Flickr Images and Metadata with Python

To reproduce the results of a senior labmate's paper, I needed to crawl data from Flickr: only the photo metadata, not the images themselves.

(Approach I eventually succeeded; approach II failed. I am keeping this record mainly for my own reference.)

I. Using Python's icrawler package

icrawler is a lightweight framework that ships with a built-in Flickr crawler, but:

① it cannot fetch the photos' metadata, and with my limited skills I could not modify or correctly call its source code;

② I did not know how to invoke the crawler assembled from a feeder, a parser and a downloader; I wanted to use the feeder, parser and downloader separately, but could not figure out how the url_queue and task_queue connect them (a toy sketch of the pattern follows).
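For future me: the connection is a plain producer and consumer chain, feeder -> url_queue -> parser -> task_queue -> downloader. Below is a toy illustration of that pattern with made-up names; it is not icrawler's actual code, which uses its own CachedQueue and thread pools internally:

import queue
import threading

url_queue = queue.Queue()   # feeder -> parser
task_queue = queue.Queue()  # parser -> downloader

def feeder():
    # put the search-page URLs on the url_queue
    for page in range(1, 4):
        url_queue.put('https://api.example.com/search?page={}'.format(page))
    url_queue.put(None)  # sentinel: no more pages

def parser():
    # read page URLs, emit one task per photo found on the page
    while True:
        url = url_queue.get()
        if url is None:
            break
        for pid in ('a', 'b'):  # pretend each page yields two photos
            task_queue.put({'page_url': url, 'photo_id': pid})
    task_queue.put(None)  # sentinel: no more tasks

def downloader():
    # consume tasks until the sentinel arrives
    while True:
        task = task_queue.get()
        if task is None:
            break
        print('handling', task)

threads = [threading.Thread(target=f) for f in (feeder, parser, downloader)]
for t in threads:
    t.start()
for t in threads:
    t.join()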

After studying class inheritance and method overriding in Python, I managed to make icrawler fetch exactly what I wanted.
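Concretely, the trick is to subclass icrawler's downloader and override its download method. The sketch below is my outline based on the icrawler docs, not the exact code I ran; class and method signatures can differ between icrawler versions:

import datetime
import json

from icrawler import ImageDownloader
from icrawler.builtin import FlickrImageCrawler

class MetaDownloader(ImageDownloader):
    def download(self, task, default_ext, timeout=5, **kwargs):
        # Instead of fetching the image bytes, append the parsed task dict
        # (it carries the photo's URL) to a JSON-lines file.
        with open('flickr_meta.jsonl', 'a', encoding='utf-8') as f:
            f.write(json.dumps(task) + '\n')
        task['success'] = True  # mark the task finished, as icrawler's own downloader does

crawler = FlickrImageCrawler('YOUR_API_KEY', downloader_cls=MetaDownloader)
crawler.crawl(max_num=4000, tags='Beijing', has_geo=1,
              min_taken_date=datetime.date(2019, 1, 1),
              max_taken_date=datetime.date(2019, 12, 31))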

But, just as in Part II, repeated requests eventually fail. (Here is a snippet of a run in progress, followed by the errors it ends in.)

2022-08-16 22:37:38,497 - INFO - downloader - image #197	https://www.flickr.com/photos/79191095@N00/51821259303/
save....
2022-08-16 22:37:40,248 - INFO - downloader - image #198	https://www.flickr.com/photos/79191095@N00/51821258453/
2022-08-16 22:37:42,918 - INFO - downloader - image #199	https://www.flickr.com/photos/greathan/51820563139/
save....
2022-08-16 22:37:43,764 - INFO - downloader - image #200	https://www.flickr.com/photos/tomros_pics/51588916180/
save....
2022-08-16 22:37:45,271 - INFO - downloader - image #201	https://www.flickr.com/photos/rebelsabu/51516613806/
save....
2022-08-16 22:37:47,779 - INFO - downloader - image #202	https://www.flickr.com/photos/rebelsabu/51489771857/
save....
2022-08-16 22:37:48,984 - INFO - downloader - image #203	https://www.flickr.com/photos/hysnikapo/51225667481/
save....
2022-08-16 22:37:50,819 - INFO - downloader - image #204	https://www.flickr.com/photos/rebelsabu/51210131734/
save....
2022-08-16 22:37:52,343 - INFO - downloader - image #205	https://www.flickr.com/photos/shyish/51170912410/
save....
save....
2022-08-16 22:37:53,178 - INFO - downloader - image #206	https://www.flickr.com/photos/shyish/51163693309/
2022-08-16 22:37:57,853 - INFO - downloader - image #207	https://www.flickr.com/photos/shyish/51157978260/
save....
save....
2022-08-16 22:37:59,355 - INFO - downloader - image #208	https://www.flickr.com/photos/shyish/51140112694/
2022-08-16 22:37:59,734 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=2, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2
2022-08-16 22:38:00,098 - INFO - downloader - image #209	https://www.flickr.com/photos/shyish/51139336116/
save....
save....
2022-08-16 22:38:00,762 - INFO - downloader - image #210	https://www.flickr.com/photos/shyish/51130136567/
2022-08-16 22:38:01,612 - INFO - downloader - image #211	https://www.flickr.com/photos/shyish/51128767872/
save....
save....
2022-08-16 22:38:03,031 - INFO - downloader - image #212	https://www.flickr.com/photos/shyish/51128728456/
2022-08-16 22:38:05,131 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=2, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 1
2022-08-16 22:38:08,047 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:10,847 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=2, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 0
2022-08-16 22:38:13,051 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:16,248 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=3, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2
2022-08-16 22:38:18,055 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:21,715 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=3, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 1
2022-08-16 22:38:23,058 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:27,169 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=3, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 0
2022-08-16 22:38:28,066 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:33,080 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:33,567 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=4, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2
2022-08-16 22:38:38,085 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:39,123 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=4, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 1
2022-08-16 22:38:43,096 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:44,513 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=4, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 0

How I worked around it:

Crawling Flickr with icrawler requires a VPN to get over the Great Firewall; switching to a different VPN node every few thousand images keeps the crawl going.

II. Writing my own Flickr crawler, based on what I learned from a Bilibili web-scraping course

(1) Use FlickrFeeder from icrawler.builtin's flickr module to get the URLs of the pages that list the photos I need

1. Modifying the FlickrFeeder source

The url_queue between the feeder and the parser is a Python generator object, which I did not know how to print. So inside the FlickrFeeder class's feed method I added "urllist = []", "urllist.append(complete_url)" and "print(urllist)" to collect the URLs into urllist:

urllist = []   # added by me
for i in range(page, page + page_max):
    if self.signal.get('reach_max_num'):
        break
    complete_url = '{}&page={}'.format(url, i)
    while True:
        try:
            self.output(complete_url, block=False)
        except:
            if self.signal.get('reach_max_num'):
                break
        else:
            break
    self.logger.debug('put url to url_queue: {}'.format(complete_url))   # complete_url is a str
    urllist.append(complete_url)   # added by me
print(urllist)   # type(urllist) --> list   # added by me

2. FlickrFeeder calls the flickr.photos.search API method

Just pass in the parameters:

import datetime

import requests
from icrawler.builtin.flickr import FlickrFeeder

signal = {'signal1': 'reach_max_num'}  # feed() only calls signal.get('reach_max_num'), so a plain dict is enough
session = requests.Session()
apikey = 'YOUR_API_KEY'  # the Flickr API key you applied for
feeder = FlickrFeeder(thread_num=1, signal=signal, session=session)  # instantiate a feeder
feeder.feed(apikey=apikey, max_num=4000, tags=['Hong Kong'],
            min_taken_date=datetime.date(2013, 1, 1),
            max_taken_date=datetime.date(2013, 1, 31),
            has_geo=1)

With the modification from step 1, running feed() prints the urllist.

① Apply for an apikey here: The App Garden on Flickr (https://www.flickr.com/services/apps/create/)

② The Flickr API method documentation is here: Flickr Services (https://www.flickr.com/services/api/)

(2) Parse the urllist obtained in (1) to get each photo_id, then fetch the metadata I need with the Flickr API's flickr.photos.getInfo method

1. The urllist printed in (1) is base_url plus page={1..40}. Even when a search actually has 140 pages of results, it sometimes only returns pages 1 through 40. So pick any URL from the list, delete its trailing page={} part, and treat everything before page as the base_url.

Requesting the base_url returns the true pages count:

import json
import requests

# get_pages_url: the search URL from above with the trailing &page={} removed
response = json.loads(requests.get(get_pages_url).content.decode(encoding='utf-8'))
pages = int(response['photos']['pages'])

A for i in range(pages): loop then covers every page.

2. Requesting each page's URL, e.g. https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=958bac39f63d627982cebbffc6733a4e&format=json&nojsoncallback=1&tags=%5B%27hong+kong%27%5D&min_taken_date=2014-04-01&max_taken_date=2014-04-30&has_geo=1, returns the id of every photo found. The photo id is the required parameter for the next step: requesting and parsing the photo metadata.

Two nested loops, the outer over every page and the inner over every photo_id on that page, collect all the photos:

import json
import requests
from urllib.parse import urlencode

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}  # fake a browser User-Agent

proxies = {"https": None}  # substitute your own proxy IP here if you have one


def get_photolist(get_url, pages, apikey):
    photolist = []  # every photo that matches the search parameters
    session = requests.Session()  # one session reused for all requests
    for i in range(pages):  # loop over every result page
        # the Flickr API parameter that selects a result page is 'page'
        response = session.get(get_url, params={'page': str(i + 1)},
                               proxies=proxies, headers=headers)
        content = json.loads(response.content.decode(encoding='utf-8'))
        response.close()  # read the body before closing the response
        photos = content['photos']['photo']  # the list of photos on this page
        for photo in photos:  # loop over every photo on the page
            photo_id = photo['id']
            base_url = 'https://api.flickr.com/services/rest/?'
            params = {
                'method': 'flickr.photos.getInfo',  # getInfo returns taken time, longitude, latitude, tags, image url, etc.
                'api_key': apikey,
                'photo_id': photo_id,
                'format': 'json',
                'nojsoncallback': 1
            }
            ret = session.get(base_url + urlencode(params), proxies=proxies)
            info = json.loads(ret.content.decode())

            infolist = []  # the metadata fields I need for this photo
            infolist.append(photo_id)
            infolist.append(info['photo']['owner']['nsid'])
            infolist.append(info['photo']['owner']['username'])
            infolist.append(info['photo']['dates']['taken'])
            infolist.append(info['photo']['location']['longitude'])
            infolist.append(info['photo']['location']['latitude'])

            # locality = info['photo']['location']['locality']['_content']
            # infolist.append(locality)

            # url = info['photo']['urls']['url'][0]['_content']
            # infolist.append(url)

            tags = info['photo']['tags']['tag']
            tag_str = ", ".join(tag['raw'] for tag in tags)
            infolist.append(tag_str)

            photolist.append(infolist)

    return photolist
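For orientation, here is the abbreviated shape of the flickr.photos.getInfo JSON that the code above indexes into. The values are placeholders I made up; only the keys are implied by the code:

info_example = {
    'photo': {
        'id': '51128728456',
        'owner': {'nsid': '00000000@N00', 'username': 'someone'},
        'dates': {'taken': '2021-04-18 10:00:00'},
        'location': {'latitude': '22.28', 'longitude': '114.17',
                     'locality': {'_content': 'Hong Kong'}},
        'urls': {'url': [{'_content': 'https://www.flickr.com/photos/...'}]},
        'tags': {'tag': [{'raw': 'hong kong'}, {'raw': 'harbour'}]},
    }
}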

(3) Save the metadata to an Excel sheet

I won't go into detail here; a minimal sketch of one way to do it follows.
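This sketch is my addition, not the code I actually ran; it uses pandas (with openpyxl installed for Excel output), and the column names are my own labels for the fields collected in step (2):

import pandas as pd

# photolist as returned by get_photolist() in step (2); the row below is a
# made-up placeholder so the snippet runs on its own
photolist = [['51128728456', '00000000@N00', 'someone', '2021-04-18 10:00:00',
              '114.17', '22.28', 'hong kong, harbour']]
columns = ['photo_id', 'nsid', 'username', 'taken_time',
           'longitude', 'latitude', 'tags']
pd.DataFrame(photolist, columns=columns).to_excel('flickr_metadata.xlsx', index=False)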

III. Unsolved problems

The code in Part II really did download the metadata and save it to an Excel sheet, but I cannot get past Flickr's limits on crawler requests. The data volume I need is fairly large: one year alone has over 30,000 matching photos, and I need six years of data. By now the Flickr server appears to have blocked me, and I simply cannot keep crawling. Things I tried:

Switched Wi-Fi networks

Switched proxy IPs

Added try/except (see the sketch below)

Added time.sleep()

Added response.close()

Used the Fiddler packet-capture tool

Set the default socket timeout

Also tried the method from the link below; it didn't work either:

python 爬虫:https; HTTPSConnectionPool(host='z.jd.com', port=443) - 简书 (jianshu.com)

None of it worked. Sob ┭┮﹏┭┮
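For the record, the try/except plus time.sleep pattern I tried looked roughly like the sketch below. This is my reconstruction, with a function name and parameters of my own choosing; it slows the failures down but did not get around Flickr's limits:

import time
import requests

def get_with_retry(session, url, retries=3, backoff=10, timeout=10):
    for attempt in range(retries):
        try:
            resp = session.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as e:
            print('attempt {} failed: {}'.format(attempt + 1, e))
            time.sleep(backoff * (attempt + 1))  # wait longer after each failure
    raise RuntimeError('all {} attempts failed for {}'.format(retries, url))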
