关于PHP简易爬虫小知识

最新推荐文章于 2024-04-30 15:58:32 发布

chenjiafeng123

最新推荐文章于 2024-04-30 15:58:32 发布

阅读量153

点赞数

设计

我们这里约定下，要处理的内容（可能是url，用户名之类的）我们都叫他实体（entity）。
考虑到扩展性这里采用了队列的概念，待处理的实体全部存储在队列中，每次处理的时候，
从队列中拿出一个实体，处理完成之后存储，并将新抓取到的实体存入队列中。当然了这里
还需要做存储去重处理，入队去重处理，防止处理程序做无用功。

  +--------+ +-----------+ +----------+
  | entity | |  enqueue  | |  result  |
  |  list  | | uniq list | | uniq list|
  |        | |           | |          |
  |        | |           | |          |
  |        | |           | |          |
  |        | |           | |          |
  +--------+ +-----------+ +----------+

当每个实体进入队列的时候入队排重队列设置入队实体标志为一后边不再入队，当处理完
实体，得到结果数据，处理完结果数据之后将结果诗句标志如结果数据排重list，当然了
，这里你也可以做更新处理，代码中可以做到兼容。

                     +-------+                     |  开始 |                     +---+---+                         |                         v                     +-------+  enqueue deep为1的实体                     | init  |-------------------------------->                      +---+---+  set 已经入过队列 flag                         |                             v                        +---------+ empty queue  +------+            +------>| dequeue +------------->| 结束 |            |       +----+----+              +------+            |            |                                       |            |                                       |            |                                       |            v                                       |    +---------------+  enqueue deep为deep+1的实体                         |    | handle entity |------------------------------>             |    +-------+-------+  set 已经入过队列 flag                         |            |                                   |            |                                   |            v                                   |    +---------------+  set 已经处理过结果 flag            |    | handle result |-------------------------->             |    +-------+-------+                         |            |                                 +------------+

爬取策略（反作弊应对）

为了爬取某些网站，最怕的就是封ip，封了ip入过没有代理就只能呵呵呵了。因此，爬取
策略还是很重要的。

爬取之前可以先在网上搜搜待爬取网站的相关信息，看看之前有没有前辈爬取过，吸收他
门的经验。然后就是是自己仔细分析网站请求了，看看他们网站请求的时候会不会带上特
定的参数？未登录状态会不会有相关的cookie？最后就是尝试了，制定一个尽可能高的抓
取频率。

如果待爬取网站必须要登录的话，可以注册一批账号，然后模拟登陆成功，轮流去请求，
如果登录需要验证码的话就更麻烦了，可以尝试手动登录，然后保存cookie的方式（当然
，有能力可以试试ocr识别）。当然登陆了还是需要考虑上一段说的问题，不是说登陆了就
万事大吉，有些网站登录之后抓取频率过快会封掉账号。

所以，尽可能还是找个不需要登录的方法，登录被封账号，申请账号、换账号比较麻烦。

抓取数据源和深度

初始数据源选择也很重要。我要做的是一个每天抓取一次，所以我找的是带抓取网站每日
更新的地方，这样初始化的动作就可以作为全自动的，基本不用我去管理，爬取会从每日
更新的地方自动进行。

抓取深度也很重要，这个要根据具体的网站、需求、及已经抓取到的内容确定，尽可能全
的将网站的数据抓过来。

优化

在生产环境运行之后又改了几个地方。

第一就是队列这里，改为了类似栈的结构。因为之前的队列，deep小的实体总是先执行，
这样会导致队列中内容越来越多，内存占用很大，现在改为栈的结构，递归的先处理完一个
实体的所以深度，然后在处理下一个实体。比如说初始10个实体（deep=1），最大爬取深度
是3，每一个实体下面有10个子实体，然后他们队列最大长度分别是：

1 2	`队列（lpush,rpop） => 1000个` `修改之后的队列（lpush，lpop） => 28个`

上面的两种方式可以达到同样的效果，但是可以看到队列中的长度差了很多，所以改为第二
中方式了。

最大深度限制是在入队的时候处理的，如果超过最大深度，直接丢弃。另外对队列最大长度
也做了限制，让制意外情况出现问题。

代码

下面就是又长又无聊的代码了，本来想发在github，又觉得项目有点小，想想还是直接贴出来吧，不好的地方还望看朋友们直言不讳，不管是代码还是设计。

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

 
      abstract 
      class 
      SpiderBase 
     
      { 
     
      /** 
     
      * @var 处理队列中数据的休息时间开始区间 
     
      */ 
     
      public 
      $startMS 
      = 1000000; 
     
      /** 
     
      * @var 处理队列中数据的休息时间结束区间 
     
      */ 
     
      public 
      $endMS 
      = 3000000; 
     
      /** 
     
      * @var 最大爬取深度 
     
      */ 
     
      public 
      $maxDeep 
      = 1; 
     
      /** 
     
      * @var 队列最大长度，默认1w 
     
      */ 
     
      public 
      $maxQueueLen 
      = 10000; 
     
      /** 
     
      * @desc 给队列中插入一个待处理的实体 
     
      *       插入之前调用 @see isEnqueu 判断是否已经如果队列 
     
      *       直插入没如果队列的 
     
      * 
     
      * @param $deep 插入实体在爬虫中的深度 
     
      * @param $entity 插入的实体内容 
     
      * @return bool 是否插入成功 
     
      */ 
     
      abstract 
      public 
      function 
      enqueue( 
      $deep 
      ,  
      $entity 
      ); 
     
      /** 
     
      * @desc 从队列中取出一个待处理的实体 
     
      *      返回值示例，实体内容格式可自行定义 
     
      *      [ 
     
      *          "deep" => 3, 
     
      *          "entity" => "balabala" 
     
      *      ] 
     
      * 
     
      * @return array 
     
      */ 
     
      abstract 
      public 
      function 
      dequeue(); 
     
      /** 
     
      * @desc 获取待处理队列长度 
     
      * 
     
      * @return int  
     
      */ 
     
      abstract 
      public 
      function 
      queueLen(); 
     
      /** 
     
      * @desc 判断队列是否可以继续入队 
     
      * 
     
      * @param $params mixed 
     
      * @return bool 
     
      */ 
     
      abstract 
      public 
      function 
      canEnqueue( 
      $params 
      ); 
     
      /** 
     
      * @desc 判断一个待处理实体是否已经进入队列 
     
      *  
     
      * @param $entity 实体 
     
      * @return bool 是否已经进入队列 
     
      */ 
     
      abstract 
      public 
      function 
      isEnqueue( 
      $entity 
      ); 
     
      /** 
     
      * @desc 设置一个实体已经进入队列标志 
     
      *  
     
      * @param $entity 实体 
     
      * @return bool 是否插入成功 
     
      */ 
     
      abstract 
      public 
      function 
      setEnqueue( 
      $entity 
      ); 
     
      /** 
     
      * @desc 判断一个唯一的抓取到的信息是否已经保存过 
     
      * 
     
      * @param $entity mixed 用于判断的信息 
     
      * @return bool 是否已经保存过 
     
      */ 
     
      abstract 
      public 
      function 
      isSaved( 
      $entity 
      ); 
     
      /** 
     
      * @desc 设置一个对象已经保存 
     
      * 
     
      * @param $entity mixed 是否保存的一句 
     
      * @return bool 是否设置成功 
     
      */ 
     
      abstract 
      public 
      function 
      setSaved( 
      $entity 
      ); 
     
      /** 
     
      * @desc 保存抓取到的内容 
     
      *       这里保存之前会判断是否保存过，如果保存过就不保存了 
     
      *       如果设置了更新，则会更新 
     
      * 
     
      * @param $uniqInfo mixed 抓取到的要保存的信息 
     
      * @param $update bool 保存过的话是否更新 
     
      * @return bool 
     
      */ 
     
      abstract 
      public 
      function 
      save( 
      $uniqInfo 
      ,  
      $update 
      ); 
     
      /** 
     
      * @desc 处理实体的内容 
     
      *       这里会调用enqueue 
     
      * 
     
      * @param $item 实体数组，@see dequeue 的返回值 
     
      * @return  
     
      */ 
     
      abstract 
      public 
      function 
      handle( 
      $item 
      ); 
     
      /** 
     
      * @desc 随机停顿时间 
     
      * 
     
      * @param $startMs 随机区间开始微妙 
     
      * @param $endMs 随机区间结束微妙 
     
      * @return bool 
     
      */ 
     
      public 
      function 
      randomSleep( 
      $startMS 
      ,  
      $endMS 
      ) 
     
      { 
     
      $rand 
      = rand( 
      $startMS 
      ,  
      $endMS 
      ); 
     
      usleep( 
      $rand 
      ); 
     
      return 
      true; 
     
      } 
     
      /** 
     
      * @desc 修改默认停顿时间开始区间值 
     
      * 
     
      * @param $ms int 微妙 
     
      * @return obj $this 
     
      */ 
     
      public 
      function 
      setStartMS( 
      $ms 
      ) 
     
      { 
     
      $this 
      ->startMS =  
      $ms 
      ; 
     
      return 
      $this 
      ; 
     
      } 
     
      /** 
     
      * @desc 修改默认停顿时间结束区间值 
     
      * 
     
      * @param $ms int 微妙 
     
      * @return obj $this 
     
      */ 
     
      public 
      function 
      setEndMS( 
      $ms 
      ) 
     
      { 
     
      $this 
      ->endMS =  
      $ms 
      ; 
     
      return 
      $this 
      ; 
     
      } 
     
      /** 
     
      * @desc 设置队列最长长度，溢出后丢弃 
     
      * 
     
      * @param $len int 队列最大长度 
     
      */ 
     
      public 
      function 
      setMaxQueueLen( 
      $len 
      ) 
     
      { 
     
      $this 
      ->maxQueueLen =  
      $len 
      ; 
     
      return 
      $this 
      ; 
     
      } 
     
      /** 
     
      * @desc 设置爬取最深层级 
     
      *       入队列的时候判断层级，如果超过层级不做入队操作 
     
      * 
     
      * @param $maxDeep 爬取最深层级 
     
      * @return obj 
     
      */ 
     
      public 
      function 
      setMaxDeep( 
      $maxDeep 
      ) 
     
      {    
     
      $this 
      ->maxDeep =  
      $maxDeep 
      ; 
     
      return 
      $this 
      ; 
     
      } 
     
      public 
      function 
      run() 
     
      { 
     
      while 
      ( 
      $this 
      ->queueLen()) { 
     
      $item 
      =  
      $this 
      ->dequeue(); 
     
      if 
      ( 
      empty 
      ( 
      $item 
      )) 
     
      continue 
      ; 
     
      $item 
      = json_decode( 
      $item 
      , true); 
     
      if 
      ( 
      empty 
      ( 
      $item 
      ) ||  
      empty 
      ( 
      $item 
      [ 
      "deep" 
      ]) ||  
      empty 
      ( 
      $item 
      [ 
      "entity" 
      ])) 
     
      continue 
      ; 
     
      $this 
      ->handle( 
      $item 
      ); 
     
      $this 
      ->randomSleep( 
      $this 
      ->startMS,  
      $this 
      ->endMS); 
     
      } 
     
      } 
     
      /** 
     
      * @desc 通过curl获取链接内容 
     
      *   
     
      * @param $url string 链接地址 
     
      * @param $curlOptions array curl配置信息 
     
      * @return mixed 
     
      */ 
     
      public 
      function 
      getContent( 
      $url 
      ,  
      $curlOptions 
      = []) 
     
      { 
     
      $ch 
      = curl_init(); 
     
      curl_setopt_array( 
      $ch 
      ,  
      $curlOptions 
      ); 
     
      curl_setopt( 
      $ch 
      , CURLOPT_URL,  
      $url 
      ); 
     
      if 
      (!isset( 
      $curlOptions 
      [CURLOPT_HEADER])) 
     
      curl_setopt( 
      $ch 
      , CURLOPT_HEADER, 0); 
     
      if 
      (!isset( 
      $curlOptions 
      [CURLOPT_RETURNTRANSFER])) 
     
      curl_setopt( 
      $ch 
      , CURLOPT_RETURNTRANSFER, 1); 
     
      if 
      (!isset( 
      $curlOptions 
      [CURLOPT_USERAGENT])) 
     
      curl_setopt( 
      $ch 
      , CURLOPT_USERAGENT,  
      "Mozilla/5.0 (Macintosh; Intel Mac" 
      ); 
     
      $content 
      = curl_exec( 
      $ch 
      ); 
     
      if 
      ( 
      $errorNo 
      = curl_errno( 
      $ch 
      )) { 
     
      $errorInfo 
      = curl_error( 
      $ch 
      ); 
     
      echo 
      "curl error : errorNo[{$errorNo}], errorInfo[{$errorInfo}]\n" 
      ; 
     
      curl_close( 
      $ch 
      ); 
     
      return 
      false; 
     
      } 
     
      $httpCode 
      = curl_getinfo( 
      $ch 
      ,CURLINFO_HTTP_CODE); 
     
      curl_close( 
      $ch 
      ); 
     
      if 
      (200 !=  
      $httpCode 
      ) { 
     
      echo 
      "http code error : {$httpCode}, $url, [$content]\n" 
      ; 
     
      return 
      false; 
     
      } 
     
      return 
      $content 
      ; 
     
      } 
     
      } 
     
      abstract 
      class 
      RedisDbSpider  
      extends 
      SpiderBase 
     
      { 
     
      protected 
      $queueName 
      =  
      "" 
      ; 
     
      protected 
      $isQueueName 
      =  
      "" 
      ; 
     
      protected 
      $isSaved 
      =  
      "" 
      ; 
     
      public 
      function 
      construct( 
      $objRedis 
      = null,  
      $objDb 
      = null,  
      $configs 
      = []) 
     
      { 
     
      $this 
      ->objRedis =  
      $objRedis 
      ; 
     
      $this 
      ->objDb =  
      $objDb 
      ; 
     
      foreach 
      ( 
      $configs 
      as 
      $name 
      =>  
      $value 
      ) { 
     
      if 
      (isset( 
      $this 
      -> 
      $name 
      )) { 
     
      $this 
      -> 
      $name 
      =  
      $value 
      ; 
     
      } 
     
      } 
     
      } 
     
      public 
      function 
      enqueue( 
      $deep 
      ,  
      $entities 
      ) 
     
      { 
     
      if 
      (! 
      $this 
      ->canEnqueue([ 
      "deep" 
      => 
      $deep 
      ])) 
     
      return 
      true; 
     
      if 
      ( 
      is_string 
      ( 
      $entities 
      )) { 
     
      if 
      ( 
      $this 
      ->isEnqueue( 
      $entities 
      )) 
     
      return 
      true; 
     
      $item 
      = [ 
     
      "deep" 
      =>  
      $deep 
      , 
     
      "entity" 
      =>  
      $entities 
     
      ]; 
     
      $this 
      ->objRedis->lpush( 
      $this 
      ->queueName, json_encode( 
      $item 
      )); 
     
      $this 
      ->setEnqueue( 
      $entities 
      ); 
     
      }  
      else 
      if 
      ( 
      is_array 
      ( 
      $entities 
      )) { 
     
      foreach 
      ( 
      $entities 
      as 
      $key 
      =>  
      $entity 
      ) { 
     
      if 
      ( 
      $this 
      ->isEnqueue( 
      $entity 
      )) 
     
      continue 
      ; 
     
      $item 
      = [ 
     
      "deep" 
      =>  
      $deep 
      , 
     
      "entity" 
      =>  
      $entity 
     
      ]; 
     
      $this 
      ->objRedis->lpush( 
      $this 
      ->queueName, json_encode( 
      $item 
      )); 
     
      $this 
      ->setEnqueue( 
      $entity 
      ); 
     
      } 
     
      } 
     
      return 
      true; 
     
      } 
     
      public 
      function 
      dequeue() 
     
      { 
     
      $item 
      =  
      $this 
      ->objRedis->lpop( 
      $this 
      ->queueName); 
     
      return 
      $item 
      ; 
     
      } 
     
      public 
      function 
      isEnqueue( 
      $entity 
      ) 
     
      { 
     
      $ret 
      =  
      $this 
      ->objRedis->hexists( 
      $this 
      ->isQueueName,  
      $entity 
      ); 
     
      return 
      $ret 
      ? true : false; 
     
      } 
     
      public 
      function 
      canEnqueue( 
      $params 
      ) 
     
      { 
     
      $deep 
      =  
      $params 
      [ 
      "deep" 
      ]; 
     
      if 
      ( 
      $deep 
      >  
      $this 
      ->maxDeep) { 
     
      return 
      false; 
     
      } 
     
      $len 
      =  
      $this 
      ->objRedis->llen( 
      $this 
      ->queueName); 
     
      return 
      $len 
      <  
      $this 
      ->maxQueueLen ? true : false; 
     
      } 
     
      public 
      function 
      setEnqueue( 
      $entity 
      ) 
     
      { 
     
      $ret 
      =  
      $this 
      ->objRedis->hset( 
      $this 
      ->isQueueName,  
      $entity 
      , 1); 
     
      return 
      $ret 
      ? true : false; 
     
      } 
     
      public 
      function 
      queueLen() 
     
      { 
     
      $ret 
      =  
      $this 
      ->objRedis->llen( 
      $this 
      ->queueName); 
     
      return 
      intval 
      ( 
      $ret 
      ); 
     
      } 
     
      public 
      function 
      isSaved( 
      $entity 
      ) 
     
      { 
     
      $ret 
      =  
      $this 
      ->objRedis->hexists( 
      $this 
      ->isSaved,  
      $entity 
      ); 
     
      return 
      $ret 
      ? true : false; 
     
      } 
     
      public 
      function 
      setSaved( 
      $entity 
      ) 
     
      { 
     
      $ret 
      =  
      $this 
      ->objRedis->hset( 
      $this 
      ->isSaved,  
      $entity 
      , 1); 
     
      return 
      $ret 
      ? true : false; 
     
      } 
     
      } 
     
      class 
      Test  
      extends 
      RedisDbSpider 
     
      { 
     
      /** 
     
      * @desc 构造函数，设置redis、db实例，以及队列相关参数 
     
      */ 
     
      public 
      function 
      construct( 
      $redis 
      ,  
      $db 
      ) 
     
      { 
     
      $configs 
      = [ 
     
      "queueName" 
      =>  
      "spider_queue:zhihu" 
      , 
     
      "isQueueName" 
      =>  
      "spider_is_queue:zhihu" 
      , 
     
      "isSaved" 
      =>  
      "spider_is_saved:zhihu" 
      , 
     
      "maxQueueLen" 
      => 10000 
     
      ]; 
     
      parent::construct( 
      $redis 
      ,  
      $db 
      ,  
      $configs 
      ); 
     
      } 
     
      public 
      function 
      handle( 
      $item 
      ) 
     
      { 
     
      $deep 
      =  
      $item 
      [ 
      "deep" 
      ]; 
     
      $entity 
      =  
      $item 
      [ 
      "entity" 
      ]; 
     
      echo 
      "开始抓取用户[{$entity}]\n" 
      ; 
     
      echo 
      "数据内容入库\n" 
      ; 
     
      echo 
      "下一层深度如队列\n" 
      ; 
     
      echo 
      "抓取用户[{$entity}]结束\n" 
      ; 
     
      } 
     
      public 
      function 
      save( 
      $addUsers 
      ,  
      $update 
      ) 
     
      { 
     
      echo 
      "保存成功\n" 
      ; 
     
      } 
     
      }

chenjiafeng123

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
关于PHP简易爬虫小知识

设计我们这里约定下，要处理的内容（可能是url，用户名之类的）我们都叫他实体（entity）。考虑到扩展性这里采用了队列的概念，待处理的实体全部存储在队列中，每次处理的时候，从队列中拿出一个实体，处理完成之后存储，并将新抓取到的实体存入队列中。当然了这里还需要做存储去重处理，入队去重处理，防止处理程序做无用功。 +--------+ +-----------+ +----------+ | ...
复制链接

扫一扫