短视频批量关键词采集提取 2024最新版第二篇

一:概述

  1. :在上一篇中我们讲解了视频下载工具的性能,在继续提升我们的视频下载工具功能时,我们考虑整理出更多功能。这一功能-不仅允许用户输入特定的关键词搜索批量下载,还完成了单个视频分享链接批量下载、博主单个视频批量解析下载。

(2):通常情况下如果直接访问详情页进行提取会导致访问过快或者是长时间访问被屏蔽。

所以直接通过详情页在生产环境中正式使用时不方便。会影响使用。 换另外一种方法,不通过详情页进行提取。如果不用详情页提取解析时会很麻烦,用了详情页解析很简单。但是为了考虑到长期稳定使用就废弃了通过视频详情页获取相关数据。

下面我们介绍更新后的部分解析源码和开发流程。Wxexchange

二:解析内容的源码

因为上一篇不完整,这里其实就是补充第二篇关于源码思路。

这段的源码只给出视频批量下载所用到的解析

 (注:不同的视频地址解析方法不一样,这里给出的是经过我们在使用过程中分析出来的,可以保持不用cookie 不会因为访问过快导致 IP屏蔽等)

视频名称 对应的解析标签和源码 这里使用的是 正则表达式

  string title_b = "";

            // 使用正则表达式提取 content 属性的值

            string pattern = @"<meta\s+name=""lark:url:video_title""\s+content=""([^""]+)""";

            Match match = Regex.Match(html, pattern);

            if (match.Success)

            {

                // 获取匹配到的 content 属性值

                string contentValue = match.Groups[1].Value;

                title_b = contentValue;

                if (title_b.Trim() == "抖音-记录美好生活")

                {

                    title_b = "";

                }

                Console.WriteLine("Content Value: " + contentValue);

            }

            else

            {

                Console.WriteLine("No meta tag found or content attribute not present.");

三:开发流程

反向获取分享链接的真实地址-

获取到分享链接时,分享链接是DY加密后的,无法获取到对应的视频ID。直接访问会直接进入详情页。刚才说了如果进入详情页会导致被屏蔽或者是被验证。所以我们不能直接访问。

第一步是我们先通过字符串函数获取到分享地址中的http加密URL。然后通过 HttpWebRequest进行访问获取到DY然后获得反向真实链接。获得后责是详情页的真实链接

代码如下

  string video_code = "";

   string referer = "";

string cookie = "";

 HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

                                req.Method = "HEAD";

                                req.Referer = referer;

                                req.AllowAutoRedirect = false;

                                WebResponse response = req.GetResponse();

                            

                                video_code= response.Headers["Location"];

                                response.Close();

第二步获得DY真实链接后的操作;获得真实跳转链接后通过字符串函数加正则表达式来获取视频ID,并且进行拼接视频层地址然后进行访问解析,下列介绍需要解析的代码:

短视频的标题解析、对应的短视频作者解析、对应的短视频发布时间解析、对应的视频临时地址解析。

部分代码示下——

1): public string title_ceng(string html)

        {

            string title_b = "";

            // 使用正则表达式提取 content 属性的值

            string pattern = @"<meta\s+name=""lark:url:video_title""\s+content=""([^""]+)""";

            Match match = Regex.Match(html, pattern);

            if (match.Success)

            {

                // 获取匹配到的 content 属性值

                string contentValue = match.Groups[1].Value;

                title_b = contentValue;

                if (title_b.Trim() == "抖音-记录美好生活")

                {

                    title_b = "";

                }

                Console.WriteLine("Content Value: " + contentValue);

            }

            else

            {

                Console.WriteLine("No meta tag found or content attribute not present.");

            }

            return title_b;

        }

(2):public string zuozhe_ceng(string html)

        {

            string zuozhe = "";

            string htmlContent = html;

            try

            {

                // string title = "";//获取title值 标题 视频名称

                Regex regex = new Regex(@"<span class=""j5WZzJdp y7epAOXf hVNC9qgC"">(.*?)</span>", RegexOptions.IgnoreCase);

                Match match = regex.Match(htmlContent);

                if (match.Success)

                {

                    // 获取匹配到的第一个组(即<title>和</title>之间的内容)

                    zuozhe = match.Groups[1].Value;

                    //发布时间:

                    zuozhe = zuozhe.Replace("<span>", "");

                    zuozhe = zuozhe.Replace("/", "");

                }

            }

            catch

            {

                //MessageBox.Show("608");

            }

            return zuozhe;

        }

(3):public string shipin_shijian_ceng(string html)

        {

            string shipin_dates = "";

            string htmlContent = html;

            try

            {

                // string title = "";//获取title值 标题 视频名称

                Regex regex = new Regex(@"<span class=""time"">(.*?)</span>", RegexOptions.IgnoreCase);

                Match match = regex.Match(htmlContent);

                if (match.Success)

                {

                    // 获取匹配到的第一个组(即<title>和</title>之间的内容)

                    shipin_dates = match.Groups[1].Value.Trim();

                    //发布时间:

                    shipin_dates = shipin_dates.Replace("<span>", "");

                    shipin_dates = shipin_dates.Replace("/", "");

                    shipin_dates = shipin_dates.Replace("·", "");

                    shipin_dates = shipin_dates.Replace("日", "");

                    shipin_dates = shipin_dates.Replace("年", "-");

                    shipin_dates = shipin_dates.Replace("月", "-");

                    string day = "";

                    // try

                    // {

                    Regex yearRegex = new Regex(@"\b\d{4}\b");

                    Regex dateRegex = new Regex(@"\b\d{1,2}-\d{1,2}\b");

                    // 判断字符串中是否包含年份信息

                    if (yearRegex.IsMatch(shipin_dates.Trim()))

                    {

                        // Console.WriteLine("输入字符串包含年份信息");

                    }

                    else if (dateRegex.IsMatch(shipin_dates.Trim()))

                    {

                        // Console.WriteLine("输入字符串不包含年份信息,但包含日期信息");

                        shipin_dates = "2024-" + shipin_dates.Trim();

                    }

                    else

                    {

                        Console.WriteLine("输入字符串既没有年份信息,也不符合日期格式");

                        #region

                        //DateTime shipin_dates_y = Convert.ToDateTime(shipin_dates);

                        //if (shipin_dates_y.Year != 1)

                        //{

                        //    shipin_dates = "2004-" + shipin_dates.Trim ();

                        //    Console.WriteLine("这个日期变量包含年份。");

                        //}

                        //else

                        //{

                        //    shipin_dates = "2004-" + shipin_dates.Trim ();

                        //    //Console.WriteLine("这个日期变量不包含年份。");

                        //}

                        #endregion

                        //  }

                        //  catch

                        // {

                        char delimiter = '·';

                        int index1 = shipin_dates.IndexOf(delimiter);

                        if (index1 != -1)

                        {

                            string textBeforeDelimiter = shipin_dates.Substring(0, index1);

                            shipin_dates = textBeforeDelimiter;

                            Console.WriteLine("Text before delimiter: " + textBeforeDelimiter);

                        }

                        if (shipin_dates.Contains("天"))

                        {

                            //  pinglun_riqi_yuanshi = extraInfo;

                            int index = shipin_dates.IndexOf("天");

                            day = shipin_dates.Substring(0, index);

                            DateTime dt = DateTime.Now.Date.AddDays(-Convert.ToInt32(Convert.ToInt32(day)));

                            shipin_dates = dt.ToShortDateString();

                        }

                        if (shipin_dates.Contains("月"))

                        {

                            //pinglun_riqi_yuanshi = extraInfo;

                            int index = shipin_dates.IndexOf("月");

                            day = shipin_dates.Substring(0, index);

                            DateTime dt = DateTime.Now.Date.AddMonths(-Convert.ToInt32(Convert.ToInt32(day)));

                            shipin_dates = dt.ToShortDateString();

                        }

                        if (shipin_dates.Contains("小时"))

                        {

                            // pinglun_riqi_yuanshi = extraInfo;

                            int index = shipin_dates.IndexOf("小时");

                            day = shipin_dates.Substring(0, index);

                            DateTime dt = DateTime.Now.Date.AddHours(-Convert.ToInt32(Convert.ToInt32(day)));

                            shipin_dates = dt.ToString();

                        }

                        if (shipin_dates.Contains("分钟"))

                        {

                            //pinglun_riqi_yuanshi = extraInfo;

                            int index = shipin_dates.IndexOf("分钟");

                            day = shipin_dates.Substring(0, index);

                            DateTime dt = DateTime.Now.Date.AddMinutes(-Convert.ToInt32(Convert.ToInt32(day)));

                            shipin_dates = dt.ToString();

                        }

                        if (shipin_dates.Contains("周"))

                        {

                            //  pinglun_riqi_yuanshi = extraInfo;

                            int index = shipin_dates.IndexOf("周");

                            day = shipin_dates.Substring(0, index);

                            int week = (Convert.ToInt32(day) * 7);

                            DateTime dt = DateTime.Now.Date.AddDays(-Convert.ToInt32(week));

                            shipin_dates = dt.ToShortDateString();

                        }

                        if (shipin_dates.Contains("年"))

                        {

                            //  pinglun_riqi_yuanshi = extraInfo;

                            int index = shipin_dates.IndexOf("年");

                            day = shipin_dates.Substring(0, index);

                            DateTime dt = DateTime.Now.AddYears(-Convert.ToInt32(Convert.ToInt32(day)));

                            shipin_dates = dt.ToShortDateString();

                        }

                        //判断当前时间是否和视频时间 是否大于

                        DateTime a = DateTime.Now; // 当前时间

                        DateTime b = DateTime.ParseExact(shipin_dates, "yyyy-MM-dd", System.Globalization.CultureInfo.InvariantCulture);// 视频时间,假设为 2022-05-27

                        TimeSpan interval = a - b; // 计算时间间隔

                        if (Math.Abs(interval.TotalDays) <= 730) // 判断时间间隔是否小于等于两年//这个里面的值  通过字段获取

                        {

                            Console.WriteLine("视频时间和当前时间在两年内");

                        }

                        else

                        {

                            Console.WriteLine("视频时间和当前时间不在两年内");

                        }

                    }

                }

            }

            catch

            {

                //MessageBox.Show("608");

            }

            return shipin_dates;

        }

(4):public string mp4_ceng(string html)

        {

            string mp4_url = "";

            string input = html;

            string srcValue = GetSrcFromSourceTag(input);

            mp4_url = srcValue;

            return mp4_url;

        }

        static string GetSrcFromSourceTag(string input)

        {

            Regex regex = new Regex(@"<source[^>]+src\s*=\s*""([^""]+)""");

            Match match = regex.Match(input);

            if (match.Success)

            {

                return match.Groups[1].Value;

            }

            else

            {

                return null; // 或者抛出异常,视情况而定

            }

        }

在下一篇文章中,我们会详细介绍如何操作单个视频批量下载的解析逻辑

  • 11
    点赞
  • 19
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值