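@Introduction
Project requirements
Image links scraped from WeChat official account articles fail to load once they are inserted into new HTML. The plan is therefore to download the images to a cloud host and replace the original WeChat links with links to the cloud host, so that the images display correctly.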
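Approach
- Download the images to the cloud host (Scrapy ImagesPipeline)
- Replace the default WeChat links with the cloud host's link addresses
- Start the Nginx service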
Detailed steps
Downloading the images:
- Edit settings.py to add the image pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
- Also in settings.py, add the image storage path
IMAGES_STORE = '/path/to/valid/dir'
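- In items.py, add the fields (keys) used to store the images
import scrapy

class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()  # URLs for the pipeline to download
    images = scrapy.Field()      # filled in by the pipeline after download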
- In the spider, extract the links and store them under the corresponding key
# ... other fields ...
item['image_urls'] = response.css('div#js_content').css('img::attr(data-src)').getall()
# ... other fields ...
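To make explicit what the selector above collects, here is a minimal sketch using only the standard library's `html.parser` in place of Scrapy's selectors. It gathers the `data-src` attribute of every `img` inside `div#js_content`. The selector reads `data-src` rather than `src`, presumably because the article page keeps the real image URL there. The class name `DataSrcCollector`, the helper `extract_data_src`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class DataSrcCollector(HTMLParser):
    """Collect data-src values of <img> tags nested inside <div id="js_content">."""

    def __init__(self, container_id="js_content"):
        super().__init__()
        self.container_id = container_id
        self.depth = 0   # number of open <div> levels at or below the container
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # nested div inside the container
            elif attrs.get("id") == self.container_id:
                self.depth = 1           # entering the container itself
        elif tag == "img" and self.depth > 0 and attrs.get("data-src") is not None:
            self.urls.append(attrs["data-src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_data_src(html):
    """Return the data-src URLs found inside the container div, in document order."""
    collector = DataSrcCollector()
    collector.feed(html)
    return collector.urls


# Hypothetical sample page fragment
sample = (
    '<div id="js_content">'
    '<p><img data-src="https://example.com/a" src="placeholder"/></p>'
    '<div><img data-src="https://example.com/b"></div>'
    '</div>'
    '<img data-src="https://example.com/outside">'
)
print(extract_data_src(sample))
```

In the spider itself the one-line CSS selector is the simpler choice. This sketch only spells out the traversal it performs.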
For details on this part, see the Scrapy image pipeline documentation.
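After the download, the pipeline fills the `images` field with one dict per image, including the original `url` and the `path` relative to `IMAGES_STORE`. By default that path is `full/<SHA-1 of the URL>.jpg`, which is why a `full/` directory shows up in the later steps. The sketch below mirrors that default naming outside Scrapy. The function name `default_image_path` and the sample URL are hypothetical, and a project that overrides the pipeline's `file_path` method will produce different names.

```python
import hashlib


def default_image_path(url):
    """Mirror the default ImagesPipeline naming: full/<sha1 hex digest of the URL>.jpg."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{image_guid}.jpg"


# Hypothetical image URL, for illustration only
url = "https://example.com/some/wechat/image"
entry = {"url": url, "path": default_image_path(url)}
print(entry)
```

The `url` and `path` keys are the two that the link replacement step relies on.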
Replacing the links with the cloud host's addresses
import ast
from bs4 import Tag

def change_img_links(df):
    '''Point each <img> at its downloaded copy on the cloud host.
    args:
        df, pandas dataframe
    '''
    base_url = '*****'
    for row in df.itertuples():
        # row.images is parsed with literal_eval into a list of dicts
        # with 'url' and 'path' keys; do this once per row
        img_link_local = ast.literal_eval(row.images)
        for ele in row.new_content:
            if isinstance(ele, Tag) and ele.name == 'img' and img_link_local \
                    and ele.has_attr('data-src'):
                img_link = ele.attrs['data-src']
                for img_local in img_link_local:
                    if img_link == img_local['url']:
                        # first match wins: set src to the cloud host URL
                        ele.attrs['src'] = base_url + img_local.get('path')
                        break
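The inner loop above scans the whole image list for every `img` tag. As a possible alternative, the list can be turned into a dictionary once per row, so that each tag needs a single lookup. This is only a sketch: the helper name `build_src_map`, the sample data, and the base URL `http://example.com:99/` are hypothetical.

```python
import ast


def build_src_map(images_repr, base_url):
    """Map each original image URL to its cloud host URL.

    images_repr is the string form of the Scrapy `images` list,
    e.g. "[{'url': ..., 'path': 'full/....jpg'}]".
    """
    images = ast.literal_eval(images_repr) if images_repr else []
    return {
        img["url"]: base_url + img["path"]
        for img in images
        if img.get("path")  # skip entries without a stored file
    }


# Hypothetical row content
images_repr = "[{'url': 'https://example.com/a', 'path': 'full/abc.jpg'}]"
src_map = build_src_map(images_repr, "http://example.com:99/")
print(src_map.get("https://example.com/a"))
```

Inside `change_img_links`, the replacement would then reduce to looking up `ele.attrs['data-src']` in the map and assigning the result to `ele.attrs['src']` when a match exists.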
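Setting up the nginx service (OS: Linux, Ubuntu 18.04)
- Install the nginx service (see: Install nginx). One problem along the way: running make produced the following error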
cc1: all warnings being treated as errors
objs/Makefile:460: recipe for target 'objs/src/core/ngx_murmurhash.o' failed
make[1]: *** [objs/src/core/ngx_murmurhash.o] Error 1
make[1]: Leaving directory '/usr/local/nginx-1.11.3'
Makefile:8: recipe for target 'build' failed
make: *** [build] Error 2
The make error was then fixed by following the link below:
Install nginx: fixing make.
- Configure the server
server {
    listen 99;  ## port 80 was already in use, so the port was changed
    server_name localhost;
    location / {
        root html;
        index index.html index.htm;
    }
    location /full/ {  ## requests to /full/ map to the physical path /mnt/weixin_images/full/ on the cloud host
        root /mnt/weixin_images;
        autoindex on;
    }
}
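The mapping in the `location /full/` block follows from how nginx's `root` directive works: the full request URI is appended to the root directory. The small sketch below reproduces that rule and, for comparison only, the `alias` directive, which replaces the matched location prefix instead. `alias` is not used in the configuration above, and the function names and sample file name are hypothetical.

```python
import posixpath


def resolve_with_root(root, uri):
    """nginx `root`: file path = root + full request URI."""
    return posixpath.normpath(root + uri)


def resolve_with_alias(alias, location, uri):
    """nginx `alias`: the matched location prefix is replaced by the alias path."""
    return posixpath.normpath(alias + uri[len(location):])


# With the configuration above, /full/abc.jpg is served from /mnt/weixin_images/full/abc.jpg
print(resolve_with_root("/mnt/weixin_images", "/full/abc.jpg"))

# The same result with alias would need the full directory spelled out
print(resolve_with_alias("/mnt/weixin_images/full/", "/full/", "/full/abc.jpg"))
```

Since `IMAGES_STORE` already contains the `full/` subdirectory created by the image pipeline, pointing `root` at the store directory lets the `path` values from the `images` field be used as URL paths unchanged.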
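- Start the Nginx service
At this point a 403 error came up. It was fixed by following the link below.
Link: fixing 403 forbidden.
Thanks to the articles written by all these experts, the nginx service finally started successfully~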