Scraping WeChat Official Account Article Content

Straight to the code:

```php
<?php
// Let the script run without a time limit
set_time_limit(0);

class Gather {

    private $url;
    private $path;

    public function __construct($url, $path) {
        $this->url  = $url;
        $this->path = $path;
    }

    public function fetch() {
        return $this->transform($this->url, $this->path);
    }

    // Download a remote image and save it locally, deriving the file
    // extension from the image's MIME type
    private function createPic($url, $path, $name) {
        $img  = file_get_contents($url);
        $info = getimagesize($url);
        $type = str_replace('image/', '', $info['mime']);
        $fileName = $path . DIRECTORY_SEPARATOR . $name . ".$type";
        file_put_contents($fileName, $img);
        return $fileName;
    }

    private function transform($url, $path) {
        if (!file_exists($path)) {
            mkdir($path, 0777, true);
        }
        $content = file_get_contents($url);
        $data = array();

        preg_match('/<title>(.*)<\/title>/i', $content, $result);
        $data['title'] = $result[1]; // article title

        preg_match('/var\s+msg_cdn_url\s*=\s*"([^\s]*)"/', $content, $result);
        $data['cover'] = $result[1]; // cover image CDN URL

        preg_match('/var\s+msg_desc\s*=\s*"([^\s]*)"/', $content, $result);
        $data['description'] = $result[1]; // article summary

        // Extract the main article body
        preg_match('/<div\s+class="rich_media_content\s*"\s+id="js_content">(.*?)<\/div>/is', $content, $result);

        // Find the hotlink-protected images inside the body
        preg_match_all('/data-src="[a-zA-Z]+:\/\/[^\s]*(?:mmbiz|mmbiz_jpg|mmbiz_gif)\/[^\s]*\/\d*\?(?:[^\s]*=[^\s]*)*"|data-src="[a-zA-Z]+:\/\/[^\s]*(?:mmbiz|mmbiz_jpg|mmbiz_gif)\/[^\s]*\/\d+"|background-image\s*:\s*url\s*\(\s*[a-zA-Z]+:\/\/[^\s]*mmbiz\/[^\s]*\/\d+|background-image\s*:\s*url\s*\(\s*[a-zA-Z]+:\/\/[^\s]*mmbiz\/[^\s]*\/\d+\?[^\s]*=[^\s]*/is', $result[1], $result2);

        // If the body contains hotlink-protected images, localize them
        if (!empty($result2[0])) {
            $urlList  = array();
            $nameList = array();
            foreach ($result2[0] as $value) {
                // Pull the image URL and its name segment out of the
                // data-src / background-image value
                preg_match('/[a-zA-Z]+:\/\/[^\s]*(?:mmbiz|mmbiz_jpg|mmbiz_gif)\/([^\s\/]*)\/\d*\?(?:[^\s]*=[^\s]*)*[^"]|[a-zA-Z]+:\/\/[^\s]*(?:mmbiz|mmbiz_jpg|mmbiz_gif)\/([^\s\/]*)\/\d+/', $value, $temp);
                $temp = array_values(array_filter($temp));
                $urlList[]  = $temp[0]; // full image URL
                $nameList[] = $temp[1]; // name segment, reused as the local file name
            }
            $path = realpath($path);
            foreach ($urlList as $value) {
                $name = array_shift($nameList);
                // Save the image locally and rewrite the body to reference the local copy
                $fileName = $this->createPic($value, $path, $name);
                $result[1] = str_replace($value, $fileName, $result[1]);
            }
        }
        // Turn the lazy-loading data-src attributes into plain src attributes
        $result[1] = str_replace("data-src", "src", $result[1]);
        // Return the processed article body
        $data['content'] = trim($result[1]);
        return $data;
    }
}
```
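
For completeness, here is a minimal usage sketch of the class above. The article URL, the `./images` directory, and the `Gather.php` file name are placeholders of my own, not values from the original post:

```php
<?php
// Assumes the Gather class above is saved as Gather.php (hypothetical file name)
require 'Gather.php';

$url  = 'https://mp.weixin.qq.com/s/xxxxxxxx'; // placeholder article URL
$path = './images';                            // local directory for downloaded images

$gather = new Gather($url, $path);
$data   = $gather->fetch();

echo $data['title'] . "\n";       // article title
echo $data['description'] . "\n"; // article summary
// $data['content'] holds the article HTML with image URLs rewritten to local files
```

Note that `file_get_contents()` on a remote URL only works when `allow_url_fopen` is enabled in `php.ini`; otherwise the fetching would have to go through cURL instead.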

 

Reposted from: https://my.oschina.net/u/145255/blog/1609514

A crawler is a program that automatically fetches web page data, and it can be used to collect content from WeChat Official Accounts. Below is a simple Python crawler example that uses the `requests` library to send HTTP requests and the `BeautifulSoup` library to parse the returned HTML:

```python
import requests
from bs4 import BeautifulSoup

def get_wechat_article(url):
    # Send a GET request to fetch the article page
    response = requests.get(url)
    # Check whether the request succeeded
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Parse the article title, author, and publication date
        title = soup.find('title').text.strip()
        author = soup.find(id="js_content").find_previous("h2", class_="rich_media_title").text.strip()
        date = soup.find(id="js_content").find_next_sibling("span").text.strip()
        # Parse the article body paragraph by paragraph
        article_text = ""
        for paragraph in soup.find_all("p"):
            article_text += paragraph.text.strip() + "\n\n"
        return {'title': title, 'author': author, 'date': date, 'content': article_text}
    else:
        print(f"Request failed with status code {response.status_code}")
        return None

# Usage example
url = "https://mp.weixin.qq.com/s/YsJZxXjwO7oBzRyvLk986A"  # WeChat Official Account article link
article_info = get_wechat_article(url)
if article_info is not None:
    print(f"Title: {article_info['title']}\nAuthor: {article_info['author']}\nDate: {article_info['date']}")
    print("\nContent:\n")
    print(article_info['content'])
else:
    print("Failed to fetch the article.")
```

Note that this example is only a basic skeleton with real limitations, and it will break whenever the site's structure changes. A production scraper needs to handle many more edge cases and errors.