I wrote some crawlers these days. They are not very robust, but it feels terrific when they bring the data back.
Here are some points I found important in the process.
1. Fetching the HTML: I used urllib to download the page source.
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html
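The bare urlopen call above works on sites that don't filter clients. Some servers reject urllib's default User-Agent or hang indefinitely, so a slightly more defensive variant sends a browser-like header and a timeout. This is a minimal sketch, not part of the original post; the header value and timeout are illustrative assumptions:

```python
import urllib.request

def get_html(url, timeout=10):
    # Some sites reject urllib's default User-Agent, so send a browser-like
    # one (the value here is an illustrative choice, not from the post).
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # A timeout keeps the crawler from hanging forever on a dead server.
    with urllib.request.urlopen(req, timeout=timeout) as page:
        return page.read()
```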
2. The BeautifulSoup class: with bs4, it is convenient to extract the data we want.
html = getHtml(url)
soup = BeautifulSoup(html, 'html.parser')
getImg(soup)
getUrl(soup)
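If you want to see the same idea without the bs4 dependency, the standard library's html.parser can also walk the tags and pull attributes. This is a dependency-free sketch of what soup.find_all('img') plus img.get('data-url') does, not the post's original approach:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the data-url attribute of every <img> tag, mirroring
    what the post does with soup.find_all('img')."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if "data-url" in attrs:
                self.srcs.append(attrs["data-url"])

collector = ImgSrcCollector()
collector.feed('<div><img data-url="http://a.example/1.jpg"><img src="x.png"></div>')
print(collector.srcs)  # ['http://a.example/1.jpg']
```

BeautifulSoup is still the more convenient tool once pages get messy; the stdlib parser just shows there is no magic involved.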
3. Regular expressions: the re library is brilliant for searching and matching strings.
for img in soup.find_all('img'):
    print(img, '\n')
    src = str(img.get('data-url'))
    if re.match(r'^https?:/{2}\w.+$', src):
        tot = tot + 1
        urllib.request.urlretrieve(src, "%s.jpg" % tot)
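It helps to see what the post's pattern actually accepts and rejects. Note that str(img.get('data-url')) produces the literal string "None" when the attribute is missing, which the pattern conveniently filters out; protocol-relative links (starting with "//") are rejected too, which is why the full program later retries them with "http:" prepended:

```python
import re

# The post's pattern: "http" or "https", then "://", then a word character
# followed by at least one more character.
pattern = re.compile(r'^https?:/{2}\w.+$')

print(bool(pattern.match("http://www.58pic.com/1.jpg")))  # True
print(bool(pattern.match("//img.example.com/1.jpg")))     # False: no scheme
print(bool(pattern.match("None")))                        # False: missing attribute
```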
4. The os module: this is helpful for dealing with files and directories.
os.mkdir("58pic")
os.chdir("58pic")
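One pitfall of the two lines above: os.mkdir raises FileExistsError on a second run, and os.chdir changes global state for the whole process. A sketch of a slightly safer variant, with os.makedirs(exist_ok=True) and explicit path building instead of chdir (the use of a temporary base directory here is only for illustration, not from the post):

```python
import os
import tempfile

# exist_ok=True makes repeated runs a no-op instead of raising.
base = os.path.join(tempfile.mkdtemp(), "58pic")
os.makedirs(base, exist_ok=True)
os.makedirs(base, exist_ok=True)  # second call does not raise

# Build target paths explicitly rather than relying on the working directory.
target = os.path.join(base, "1.jpg")
print(os.path.isdir(base))  # True
```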
The complete code is as follows:
import os
import re
import urllib.request
from bs4 import BeautifulSoup

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

def getImg(soup):
    # Download every <img> whose data-url looks like an absolute URL.
    global tot
    for img in soup.find_all('img'):
        print(img, '\n')
        src = str(img.get('data-url'))
        if re.match(r'^https?:/{2}\w.+$', src):
            tot = tot + 1
            urllib.request.urlretrieve(src, "%s.jpg" % tot)
        # Retry protocol-relative URLs ("//host/...") with an explicit scheme.
        src = "http:" + src
        if re.match(r'^https?:/{2}\w.+$', src):
            tot = tot + 1
            urllib.request.urlretrieve(src, "%s.jpg" % tot)

def getUrl(soup):
    # Pick the next page: the first unvisited absolute link containing 'image'.
    global tmp
    global url
    for ai in soup.find_all('a'):
        href = str(ai.get('href'))
        if not re.match(r'^https?:/{2}\w.+$', href):
            continue
        if href.find('image') == -1:
            continue
        if href == tmp:
            continue
        url = href
        break

tot = 0
os.mkdir("58pic")
os.chdir("58pic")
url = "http://www.58pic.com/"
tmp = ""
while tot < 100:
    print(url, '\n')
    if url == tmp:  # no new page was found, so stop
        break
    tmp = url
    html = getHtml(url)
    soup = BeautifulSoup(html, 'html.parser')
    getImg(soup)
    getUrl(soup)
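The driver loop has two stop conditions: 100 images downloaded, or the next page equals the current one. That control flow can be exercised without touching the network by replacing getHtml and getUrl with a hypothetical page map. A minimal sketch under those stand-in assumptions:

```python
# Simulate the crawl loop's control flow with stub data: each page names
# the next page, and the last page links to itself, so url == tmp triggers
# the break just as in the real crawler.
pages = {
    "page1": "page2",
    "page2": "page3",
    "page3": "page3",  # last page links to itself
}

visited = []
url, tmp = "page1", ""
while len(visited) < 100:
    if url == tmp:  # no new page was found, so stop
        break
    tmp = url
    visited.append(url)
    url = pages[url]  # stand-in for getHtml + getUrl

print(visited)  # ['page1', 'page2', 'page3']
```

Testing the loop this way also makes the subtle ordering visible: tmp must be updated before fetching, or a page that links to itself would be crawled twice.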