A Small Image-Grabbing Web Crawler in C++
I recently had the chance to intern at a company, and practice quickly showed me how much I still don't know, especially about network programming. So I wrote a small socket-based web crawler to get a feel for how things fit together on the network. Haha.
The program's main job: given a starting web page, grab the images on that page and on the pages it links to.
The approach: take the page at the front of the page_url queue, extract the href links of its <a> tags and the src links of its <img> tags, and append them to the page_url queue and the image_url queue respectively. Then repeat: pop the head of page_url and extract its links. It is a plain breadth-first traversal.
Since I'm a beginner and the program is not very efficient, it stops collecting new URLs once too many pages have been queued.
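In outline, the crawl loop looks like this (just a sketch; the comment stands in for the real fetch-and-extract functions shown below):
#include <list>
#include <queue>
#include <string>
void Crawl(const std::string& main_url, std::list<std::string>& image_urls){
    std::queue<std::string> page_urls;   // pages still to visit
    page_urls.push(main_url);
    while (!page_urls.empty()){
        std::string page = page_urls.front();
        page_urls.pop();
        // fetch 'page', push every <a href> link onto page_urls,
        // and append every <img src> link to image_urls ...
    }
}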
Now for the code!
The functions.h header:
#ifndef FUNCTIONS_H
#define FUNCTIONS_H
#include <string>
#include <iostream>
#include <fstream>
#include <list>
#include "winsock2.h"
#include <time.h>
#include <queue>
#include <hash_set>
#pragma comment(lib, "ws2_32.lib")
bool GetHttpRespond(const std::string& url, char * &respond, int& bytes_read);
bool ParseUrl(const std::string& url, std::string& host, std::string& resource);
bool GetPagesAndImages(std::string main_url, std::list<std::string>& images_url);
bool UrlDownloadToFile(std::list<std::string> images_url);
std::string GetFileName(const std::string url);
#endif
functions.cpp:
#include "functions.h"
#include <iostream>
#include <urlmon.h>
#define DEFAULT_PAGE_BUF_SIZE 1048576   // initial receive buffer: 1 MiB
#define A_TAG_SIZE 500                  // max length of an extracted <a href> link
#define IMAGE_TAG_SIZE 500              // max length of an extracted <img src> link
#define TEMP_SIZE 500                   // scratch buffers in GetFileName
bool GetHttpRespond(const std::string& url, char * &respond, int& bytes_read){
//初始化WSA
WSADATA wsaData;
if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0)
{
std::cout << "WSA failed to start!\n";
return false;
}
//解析url
std::string host,resource;
if (!ParseUrl(url, host, resource)){
std::cout << "can't Parse :" << url << std::endl;
return false;
}
//建立socket
hostent* p_host=gethostbyname(host.c_str());
if (!p_host){
std::cout << "page is invalid! :" << url << "\n";
return false;
}
SOCKET _socket = socket(PF_INET,SOCK_STREAM,0);
if (_socket == INVALID_SOCKET){
std::cout << "socket failed to create!\n";
return false;
}
//获取sockadd
SOCKADDR_IN addr_in;
addr_in.sin_family = PF_INET;
addr_in.sin_port = htons(80);
memcpy(&addr_in.sin_addr,p_host->h_addr,4);
//建立连接
if (connect(_socket, (SOCKADDR *)& addr_in, sizeof(addr_in))!=0){
std::cout << "connection failed!\n";
closesocket(_socket);
return false;
}
//准备http请求头
std::string request = "GET " + resource + " HTTP/1.1\r\n" + "Host:" + host + "\r\n" + "Connection:close\r\n\r\n";
    // Send the request
    if (send(_socket, request.c_str(), (int)request.size(), 0) == SOCKET_ERROR){
        std::cout << "can't send request!\n";
        closesocket(_socket);
        return false;
    }
    // Receive the response into a growable buffer
    char* recv_data = (char*)malloc(DEFAULT_PAGE_BUF_SIZE);
    int temp = 1, content_size = DEFAULT_PAGE_BUF_SIZE;
    if (!recv_data){
        std::cout << "malloc failed!\n";
        closesocket(_socket);
        return false;
    }
    memset(recv_data, 0, content_size); // without this, stale bytes can show up as garbage
    while (temp){
        temp = recv(_socket, recv_data + bytes_read, content_size - bytes_read, 0);
        if (temp > 0){
            bytes_read += temp;
        }
        else if (temp == SOCKET_ERROR){
            std::cout << "can't receive data!\n";
            free(recv_data);
            closesocket(_socket);
            return false;
        }
        // Double the buffer when little space remains
        if (content_size - bytes_read < 100){
            content_size *= 2;
            char* bigger = (char*)realloc(recv_data, content_size);
            if (!bigger){
                std::cout << "realloc failed!\n";
                free(recv_data);
                closesocket(_socket);
                return false;
            }
            recv_data = bigger;
        }
    }
    // Null-terminate and hand the malloc'd buffer to the caller
    recv_data[bytes_read] = '\0';
    respond = recv_data;
    closesocket(_socket);
    return true;
}
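A minimal usage sketch (www.example.com is just a placeholder; bytes_read must be 0 on entry because the function accumulates into it):
char* respond = nullptr;
int bytes_read = 0;                      // must start at 0
if (GetHttpRespond("http://www.example.com/", respond, bytes_read)){
    std::cout << "received " << bytes_read << " bytes\n";
    free(respond);                       // the caller owns the malloc'd buffer
}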
bool ParseUrl(const std::string& url, std::string& host, std::string& resource){
char p_host[100], p_resource[1000];
const char *p_url;
memset(p_host, 0, sizeof(p_host));
memset(p_resource, 0, sizeof(p_resource));
//去除http://协议部分
p_url = strstr(url.c_str(),"http://");
if (!p_url) return false;
else p_url += strlen("http://");
//域名错误处理
if (strstr(url.c_str(), "/") == 0)
return false;
if (strstr(url.c_str(), "\r") || strstr(url.c_str(), "\n"))
return false;
//分割域名,获取主机名和资源名
sscanf_s(p_url, "%[^/]%s", p_host, sizeof(p_host), p_resource, sizeof(p_resource));
host = p_host;
resource = p_resource;
return true;
}
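A quick check of how the split behaves (the URL is only an example):
// ParseUrl("http://www.baidu.com/img/bd_logo.png", host, resource)
//   -> host = "www.baidu.com", resource = "/img/bd_logo.png"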
bool GetPagesAndImages(std::string main_url, std::list<std::string>& images_url){
    std::queue<std::string> page_urls;
    page_urls.push(main_url);
    // Breadth-first traversal over the page queue; once more than 100 pages
    // are queued, stop collecting new page URLs
    int push_flag = 1;
    while (!page_urls.empty()){
        const char* a_ptr = nullptr;
        char* respond = nullptr;
        int bytes_read = 0;
        if (!GetHttpRespond(page_urls.front(), respond, bytes_read)){
            page_urls.pop();
            continue;
        }
        page_urls.pop();
        a_ptr = respond;
        if (page_urls.size() > 100)
            push_flag = 0;
        // Collect page links from <a href="..."> tags
        if (push_flag){
            while ((a_ptr = strstr(a_ptr, "<a href=\"")) != nullptr){
                a_ptr += strlen("<a href=\"");
                char temp[A_TAG_SIZE] = "";
                sscanf_s(a_ptr, "%[^\"]", temp, (unsigned)sizeof(temp));
                a_ptr++;
                page_urls.push(temp);   // relative links get skipped later by ParseUrl
            }
        }
        // Collect image links from <img src="..."> tags
        a_ptr = respond;
        while ((a_ptr = strstr(a_ptr, "<img src=\"")) != nullptr){
            a_ptr += strlen("<img src=\"");
            char temp[IMAGE_TAG_SIZE] = "";
            sscanf_s(a_ptr, "%[^\"]", temp, (unsigned)sizeof(temp));
            a_ptr++;
            images_url.push_back(temp);
        }
        free(respond);
    }
    return true;
}
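One easy improvement (my addition, not in the original code): the same page can end up in the queue many times, so a set of already-seen URLs would avoid refetching. A sketch with std::set:
#include <queue>
#include <set>
#include <string>
// Push a URL only the first time it is seen.
void PushUnique(std::queue<std::string>& q, std::set<std::string>& visited, const std::string& u){
    if (visited.insert(u).second)   // insert() reports whether u was new
        q.push(u);
}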
bool UrlDownloadToFile(std::list<std::string>& images_url){
    // Create the output folder via the Windows shell
    std::string path = "md d:\\CrawlerImages";
    system(path.c_str());
    // Download each image
    while (!images_url.empty()){
        std::ofstream out;
        const char* p_bak = nullptr;
        char* respond = nullptr;
        int bytes_read = 0;
        if (!GetHttpRespond(images_url.front(), respond, bytes_read)){
            std::cout << "failed to fetch image! url: " << images_url.front() << std::endl;
            images_url.pop_front();
            continue;
        }
        // The image bytes start right after the blank line that ends the HTTP header
        p_bak = respond;
        const char* p_find = strstr(p_bak, "\r\n\r\n");
        if (!p_find){
            std::cout << "can't find the end of the HTTP header\n";
            images_url.pop_front();
            free(respond);
            continue;
        }
        p_find += strlen("\r\n\r\n");
        std::string filename = GetFileName(images_url.front());
        out.open("D:\\CrawlerImages\\" + filename, std::ios::binary);
        out.write(p_find, bytes_read - (p_find - p_bak));
        out.close();
        images_url.pop_front();
        free(respond);
    }
    return true;
}
std::string GetFileName(const std::string url){
    std::string filename;
    char host_temp[TEMP_SIZE], filename_temp[TEMP_SIZE];
    const char* p_url = nullptr;
    memset(host_temp, 0, sizeof(host_temp));
    memset(filename_temp, 0, sizeof(filename_temp));
    // Strip the "http://" scheme
    p_url = strstr(url.c_str(), "http://");
    if (!p_url) return "";   // not an absolute http URL
    else p_url += strlen("http://");
    // Split off the host name; the resource path becomes the file name
    sscanf_s(p_url, "%[^/]%s", host_temp, (unsigned)sizeof(host_temp), filename_temp, (unsigned)sizeof(filename_temp));
    filename = filename_temp;
    for (size_t i = 0; i < filename.size(); i++)
        if (filename[i] == '/')
            filename[i] = '^';   // flatten the path into a valid file name
    return filename;
}
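For example, every '/' in the resource path becomes '^':
// GetFileName("http://www.baidu.com/img/bd_logo.png") -> "^img^bd_logo.png"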
main.cpp:
#include<iostream>
#include"functions.h"
using namespace std;
int main(){
string url;
std::cout << "plase input the url of web\n(like 'http://www.baidu.com/', the last '/' is important!):\n";
while (std::cin >> url){
std::list<string> images_url;
GetPagesAndImages(url, images_url);
UrlDownloadToFile(images_url);
std::cout << "\n\n\n\n\nEnd...\nPress any keys to continue...\n";
system("pause");
system("cls");
std::cout << "plase input the url of web\n(like 'http://www.baidu.com/', the last '/' is important!):\n";
}
};
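By the way, the code assumes MSVC (sscanf_s and #pragma comment(lib, ...) are Microsoft-specific), so a build from a Developer Command Prompt might look like:
cl /EHsc main.cpp functions.cpp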
Notes
For convenience, the images are not sorted into folders or anything, so the result is messy. They are stored in CrawlerImages on the D: drive, which is created automatically if it does not exist. So remember to clean it up after you are done.
Also, a question for the experts: why do some of the extracted links come out like this?
For example:
Original page link: http://www.meinanzi.com/
Link stored in the queue: http://www.mei\r\nnanzi.c\r\nom/
When the following check in ParseUrl is removed:
if (strstr(url.c_str(), "\r") || strstr(url.c_str(), "\n"))
return false;
the program crashes.
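A plausible explanation (my assumption, not verified against that site): the request is HTTP/1.1, so the server is allowed to answer with "Transfer-Encoding: chunked", in which case hex chunk-size lines framed by \r\n are spliced into the body and can land in the middle of a URL. A minimal sketch of stripping them (DecodeChunked is a hypothetical helper, not part of the code above; the caller would first check the response header for "Transfer-Encoding: chunked"):
#include <cstdlib>
#include <cstring>
#include <string>
// 'body' must point just past the "\r\n\r\n" that ends the HTTP header.
std::string DecodeChunked(const char* body){
    std::string out;
    const char* p = body;
    while (true){
        char* end = nullptr;
        long size = strtol(p, &end, 16);  // each chunk starts with a hex size
        if (size <= 0 || !end) break;     // "0\r\n\r\n" marks the end of the body
        p = strstr(end, "\r\n");          // skip the rest of the size line
        if (!p) break;
        p += 2;                           // chunk data starts here
        out.append(p, size);
        p += size + 2;                    // skip the data and its trailing "\r\n"
    }
    return out;
}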