My First Crawler

I wrote a few crawlers recently. They are not very powerful, but it feels terrific when they bring the data back.


Here are the points I found most important in the process.


1. Fetching HTML: I used urllib.request to download the page's HTML.


def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html
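Some sites reject requests that lack a browser-like User-Agent, and a slow server can stall the call indefinitely. The sketch below is a more defensive variant, not the code used in this post: the names `build_request` and `fetch_html`, the User-Agent string, and the 10-second timeout are all assumptions made for illustration.

```python
import urllib.request

# Hypothetical User-Agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (compatible; my-first-crawler)"

def build_request(url):
    # Attach a User-Agent header so the request looks less like a bare script.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    # The timeout (in seconds) keeps a stalled server from hanging the crawler.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = page.headers.get_content_charset() or "utf-8"
        return page.read().decode(charset, errors="replace")
```

Unlike getHtml, this version returns a decoded string rather than bytes. BeautifulSoup accepts either.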



2. The BeautifulSoup class: with bs4, it is convenient to extract the data we want.


html = getHtml(url)
soup = BeautifulSoup(html, 'html.parser')
getImg(soup)
getUrl(soup)
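To see what the soup object offers without touching the network, here is a small sketch that parses a hand-written page. The HTML string and the example.com addresses are made up for illustration; only `find_all` and `get` are the calls the crawler relies on.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for a downloaded one (illustrative only).
sample_html = """
<html><body>
  <img data-url="http://example.com/a.jpg">
  <img src="logo.png">
  <a href="http://example.com/image/1.html">next</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; get returns None when the attribute is missing.
image_urls = [img.get("data-url") for img in soup.find_all("img")]
links = [a.get("href") for a in soup.find_all("a")]

print(image_urls)
print(links)
```

The second image has no data-url attribute, so its entry is None. That is why the crawler has to filter the values before downloading anything.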



3. Regular expressions: the re library is brilliant at searching and matching strings.


for img in soup.find_all('img'):
    print(img, '\n')
    src = str(img.get('data-url'))
    if re.match(r'^https?:/{2}\w.+$', src):
        tot = tot + 1
        urllib.request.urlretrieve(src, "%s.jpg" % tot)
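The pattern above can be tried on a few sample strings to see what it accepts. In this sketch the `normalize` helper and the example.com addresses are hypothetical. The string "None" is included because that is what str() produces when a tag has no data-url attribute.

```python
import re

# Same pattern as above: "http" or "https", then "://", a word character,
# and at least one more character.
URL_PATTERN = re.compile(r'^https?:/{2}\w.+$')

def normalize(src):
    # Hypothetical helper: give protocol-relative addresses ("//host/path")
    # an explicit scheme so they can be matched and downloaded.
    if src.startswith("//"):
        return "http:" + src
    return src

candidates = ["http://example.com/a.jpg", "//example.com/b.jpg", "None", "/static/c.jpg"]
accepted = [normalize(s) for s in candidates if URL_PATTERN.match(normalize(s))]
print(accepted)
```

Only the first two candidates survive. The missing attribute and the site-relative path are both rejected.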



4. The os module: this is helpful for working with files and directories.


os.mkdir("58pic")
os.chdir("58pic")
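One thing to watch: os.mkdir raises FileExistsError if the folder already exists, so running the crawler a second time fails at this step. A possible alternative, sketched here with a hypothetical `prepare_folder` helper, is os.makedirs with exist_ok=True, together with building full paths instead of changing the working directory.

```python
import os
import tempfile

def prepare_folder(parent, name="58pic"):
    # exist_ok=True makes a second run succeed even though the folder already exists.
    folder = os.path.join(parent, name)
    os.makedirs(folder, exist_ok=True)
    return folder

# A temporary directory is used as the parent here so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as parent:
    folder = prepare_folder(parent)
    folder_again = prepare_folder(parent)  # no FileExistsError the second time
    target = os.path.join(folder, "%s.jpg" % 1)
    print(os.path.basename(target))
```

In a real run the parent would simply be the current directory, and `target` would be the path handed to the download call.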



The complete code is as follows:


import os
import re
import urllib.request
from bs4 import BeautifulSoup

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

def getImg(soup):
    global tot
    for img in soup.find_all('img'):
        print(img, '\n')
        # The image address lives in 'data-url'; str() turns a missing value into 'None'.
        src = str(img.get('data-url'))
        # Protocol-relative addresses ("//host/path") need a scheme before downloading.
        if src.startswith('//'):
            src = "http:" + src
        if re.match(r'^https?:/{2}\w.+$', src):
            tot = tot + 1
            urllib.request.urlretrieve(src, "%s.jpg" % tot)

def getUrl(soup):
    global tmp
    global url
    for ai in soup.find_all('a'):
        href = str(ai.get('href'))
        if not re.match(r'^https?:/{2}\w.+$', href):
            continue
        if href.find('image') == -1:
            continue
        if href == tmp:
            continue
        url = href
        break

tot = 0
os.mkdir("58pic")
os.chdir("58pic")
url = "http://www.58pic.com/"
tmp = ""

while tot < 100:
    print(url, '\n')
    if url == tmp:
        break
    tmp = url
    html = getHtml(url)
    soup = BeautifulSoup(html, 'html.parser')
    getImg(soup)
    getUrl(soup)
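The loop above only checks the next URL against the page it just visited, so two pages that link to each other could still be visited again and again. One possible refinement, shown here as a sketch rather than as part of the crawler, is to remember every visited URL. The `crawl` function and its `next_url` callback are hypothetical, and the fake site below replaces real downloads so the idea can be tried offline.

```python
def crawl(start_url, next_url, limit=100):
    # next_url is a hypothetical callback: given a page URL, it returns
    # the next URL to visit, or None when there is nowhere left to go.
    visited = []
    seen = set()
    url = start_url
    while url is not None and url not in seen and len(visited) < limit:
        seen.add(url)
        visited.append(url)
        url = next_url(url)
    return visited

# A fake three-page site whose links form a cycle: a -> b -> c -> a.
fake_links = {"a": "b", "b": "c", "c": "a"}
print(crawl("a", fake_links.get))
```

The walk stops as soon as it would return to a page it has already seen, instead of relying on the image counter to end the loop.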
