Approaches to calling a Python script from C#

A while ago I wrote a scrapy-based crawler for work, but the data processing was done by a C# program, so the C# side needed a way to invoke the crawler.

Here is a record of the solutions:

Solution 1:

Run the crawler from the command line, detect when it has finished, and then do the data processing.

Here C# invokes PowerShell to run the commands:

using System;
using System.Collections.Generic;
using System.Text;
using System.Management.Automation.Runspaces;   // requires a reference to System.Management.Automation (PowerShell SDK)
using EntityModel;                               // the author's model library containing ShellParameter

static void Main(string[] args)
{
    var getPath = @"cd <target directory>";
    var startSpider = "scrapy crawl <spider name>";
    List<string> getshellcmdlist = new List<string>();
    List<ShellParameter> getpatalist = new List<ShellParameter>
    {
        new ShellParameter { ShellKey = "Name", ShellValue = "Spider*" }
    };

    getshellcmdlist.Add(getPath);
    getshellcmdlist.Add(startSpider);

    Console.WriteLine("Running, please wait...");
    string getresult = Program.ExecuteShellScript(getshellcmdlist, getpatalist);
    if (getresult != null)
    {
        // print the captured output
        Console.WriteLine("Result:");
        Console.WriteLine(getresult);
    }
}


/// <summary>
/// Core method for executing PowerShell scripts
/// </summary>
/// <param name="getshellstrlist">List of shell script lines to run</param>
/// <param name="getshellparalist">Parameters used by the scripts</param>
/// <returns>The captured output, or null on failure</returns>
public static string ExecuteShellScript(List<string> getshellstrlist, List<ShellParameter> getshellparalist)
{
    string getresultstr = null;
    try
    {
        // create the runspace and a pipeline to run the commands in
        Runspace newspace = RunspaceFactory.CreateRunspace();
        Pipeline newline = newspace.CreatePipeline();

        // open the runspace
        newspace.Open();

        if (getshellstrlist.Count > 0)
        {
            foreach (string getshellstr in getshellstrlist)
            {
                // add each script line as a command
                Console.WriteLine(getshellstr);
                newline.Commands.AddScript(getshellstr);
            }
        }

        // attach the parameters, one per command, in order
        if (getshellparalist != null && getshellparalist.Count > 0)
        {
            int count = 0;
            foreach (ShellParameter getshellpar in getshellparalist)
            {
                // Alternatively, inject a .NET object into the session state so the
                // PowerShell script can access and manipulate it directly via $key:
                // newspace.SessionStateProxy.SetVariable(getshellpar.ShellKey, getshellpar.ShellValue);
                CommandParameter cmdpara = new CommandParameter(getshellpar.ShellKey, getshellpar.ShellValue);
                newline.Commands[count].Parameters.Add(cmdpara);
                count++;
            }
        }

        // invoke the pipeline and collect the output
        var getoutput = newline.Invoke();
        if (getoutput != null)
        {
            StringBuilder getbuilder = new StringBuilder();
            foreach (var getresstr in getoutput)
            {
                getbuilder.AppendLine(getresstr.ToString());
            }
            getresultstr = getbuilder.ToString();
        }

        // close the runspace
        newspace.Close();
    }
    catch (Exception se)
    {
        // swallow the exception, but at least report it
        Console.WriteLine(se.Message);
    }
    return getresultstr;
}
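
As a side note: if all that is needed is to launch the crawler and wait for it to finish, a PowerShell runspace is not strictly required. Below is a minimal sketch using System.Diagnostics.Process; the directory and spider name are placeholders, and it assumes scrapy is on the PATH.

using System;
using System.Diagnostics;

class SpiderRunner
{
    static void Main()
    {
        // Launch "scrapy crawl <spider name>" in the scrapy project directory.
        var psi = new ProcessStartInfo
        {
            FileName = "scrapy",                         // assumes scrapy is on the PATH
            Arguments = "crawl <spider name>",           // placeholder spider name
            WorkingDirectory = @"<target directory>",    // placeholder scrapy project root
            UseShellExecute = false,
            RedirectStandardOutput = true
        };

        using (var proc = Process.Start(psi))
        {
            // Read the crawler's console output and wait for it to finish.
            string output = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();

            Console.WriteLine(output);
            Console.WriteLine("Spider finished, exit code " + proc.ExitCode);
            // ... data processing can start here ...
        }
    }
}

Either way, once the process has exited the C# side knows the crawl is complete and can start processing the data.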

The PowerShell invocation code was adapted from an article found online; unfortunately I can no longer locate the original author.

Solution 2:

Since the program has to run on customers' machines, which cannot be assumed to have a Python/scrapy environment, Solution 1 alone is clearly not enough.

Solution 2 is to package the scrapy program into a standalone executable.

Packaging a Python program is, in my experience, quite a hassle; after digging through a lot of material I eventually managed to package it with pyinstaller.

The steps are as follows:

Create a crawl.py file in the root of the scrapy project; it plays the same role as a main.py entry point.

See the official documentation: https://doc.scrapy.org/en/latest/topics/practices.html

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'dailymail' is the name of one of the spiders in the project.
process.crawl('dailymail', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished

Next, explicitly import in crawl.py every third-party module the framework needs at runtime. Scrapy loads many of its components dynamically by name, so pyinstaller cannot discover them through static analysis; without explicit imports they end up missing from the bundle.

My imports are as follows:

import win32com
import urllib

import scrapy.spiderloader
import scrapy.statscollectors
import scrapy.logformatter
import scrapy.dupefilters
import scrapy.squeues
 
import scrapy.extensions.spiderstate
import scrapy.extensions.corestats
import scrapy.extensions.telnet
import scrapy.extensions.logstats
import scrapy.extensions.memusage
import scrapy.extensions.memdebug
import scrapy.extensions.feedexport
import scrapy.extensions.closespider
import scrapy.extensions.debug
import scrapy.extensions.httpcache
import scrapy.extensions.statsmailer
import scrapy.extensions.throttle

import scrapy.core.scheduler
import scrapy.core.engine
import scrapy.core.scraper
import scrapy.core.spidermw
import scrapy.core.downloader
 
import scrapy.downloadermiddlewares.stats
import scrapy.downloadermiddlewares.httpcache
import scrapy.downloadermiddlewares.cookies
import scrapy.downloadermiddlewares.useragent
import scrapy.downloadermiddlewares.httpproxy
import scrapy.downloadermiddlewares.ajaxcrawl
import scrapy.downloadermiddlewares.chunked
import scrapy.downloadermiddlewares.decompression
import scrapy.downloadermiddlewares.defaultheaders
import scrapy.downloadermiddlewares.downloadtimeout
import scrapy.downloadermiddlewares.httpauth
import scrapy.downloadermiddlewares.httpcompression
import scrapy.downloadermiddlewares.redirect
import scrapy.downloadermiddlewares.retry
import scrapy.downloadermiddlewares.robotstxt
 
import scrapy.spidermiddlewares.depth
import scrapy.spidermiddlewares.httperror
import scrapy.spidermiddlewares.offsite
import scrapy.spidermiddlewares.referer
import scrapy.spidermiddlewares.urllength
 
import scrapy.pipelines
 
import scrapy.core.downloader.handlers.http
import scrapy.core.downloader.contextfactory
 
import scrapy.pipelines.images  # the images pipeline is used in this project

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'dailymail' is the name of one of the spiders in the project.
process.crawl('dailymail', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished

Then run the pyinstaller packaging command:

pyinstaller --clean --win-private-assemblies <python files>

I happily ran the resulting exe, only to have it crash and close immediately.

Running it from cmd showed that some scrapy data files were missing.

So I went to C:\Users\<username>\AppData\Local\Programs\Python\Python36-32\Lib\site-packages\scrapy and located the files mime.types and VERSION.

Copy those two files, go into \dist\crawl\, create a folder named scrapy there, and paste them in.

Run the exe in the dist folder again, and it now works.

Back in the C# program, change the `scrapy crawl <spider name>` command line to ./crawl.exe.
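
With the helper from Solution 1 left unchanged, only the command strings need to be adjusted. A rough sketch (the directory is a placeholder for wherever crawl.exe was deployed):

// Sketch: reuse ExecuteShellScript from Solution 1, but launch the packaged exe
// instead of "scrapy crawl <spider name>".
var getPath = @"cd <directory containing crawl.exe>";
var startSpider = @".\crawl.exe";
List<string> getshellcmdlist = new List<string> { getPath, startSpider };

Console.WriteLine("Running, please wait...");
string getresult = Program.ExecuteShellScript(getshellcmdlist, null);   // no extra parameters needed
Console.WriteLine(getresult);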

That's all.

 

You are welcome to follow my blog, 爱吃回锅肉的胖子; my technical articles are published there first.
