ToBeAMensch's Column

A Programming Enthusiast's Basement

Python, Crawler and Raspberry Pi

I left the Raspberry Pi alone for quite a while, since I didn't think I could achieve much with it in the short run. But now that I've started learning Python, I'm picking the Raspberry Pi back up.

Today I'm putting together another rough little script to introduce two web-crawling tools: BeautifulSoup and mechanize.
I'm simply glad to have started learning about Python. Emphasis on "about"…

from bs4 import BeautifulSoup
import mechanize

start = "http://" + raw_input("Where would you like to start searching?\n")
br = mechanize.Browser()
r = br.open(start)
html = r.read()  # Read the page's HTML into a string.

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))  # Prints the href of every link on the page.
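For readers on Python 3 (where `raw_input` and mechanize's old Python 2 workflow don't apply), the same link-listing idea can be sketched with nothing but the standard library's `html.parser`. The sample page below is a made-up stand-in for a real HTTP response:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Hypothetical page content; in practice this would come from br.open(...).read().
page = '<html><body><a href="/pics/">pics</a> <a href="about.htm">about</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # -> ['/pics/', 'about.htm']
```

This trades BeautifulSoup's convenience for zero third-party dependencies; for anything beyond listing hrefs, bs4 is still the more comfortable tool.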


Full code from "Learn Raspberry Pi Programming with Python"; see the notes in the comments, they are important.

import mechanize
import time
from bs4 import BeautifulSoup
import os
import urllib

def downloadProcess(html, base, filetype, linkList):
    "This does the actual file downloading."
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all('a'):
        linkText = str(link.get('href'))  # Assumes the href is a path relative to the site root, not a full http:// URL.

        if filetype in linkText:
            slashList = [i for i, ind in enumerate(linkText) if ind == '/']  # Indices of every '/' in the link.
            directoryName = linkText[(slashList[0] + 1):slashList[1]]  # Slice out the first directory name.
            if not os.path.exists(directoryName):
                os.makedirs(directoryName)

            image = urllib.URLopener()
            linkGet = base + linkText
            filesave = linkText.lstrip("/")  # Strip the leading slash to get a local relative path.
            image.retrieve(linkGet, filesave)

        elif "htm" in linkText:  # Covers both .html and .htm pages.
            linkList.append(link)


start = "http://" + raw_input("Where would you like to start searching?\n")
filetype = raw_input("What file type are you looking for?\n")

slashList = [i for i, ind in enumerate(start) if ind == '/']

if len(slashList) >= 3:
    third = slashList[2]
    base = start[:third]  # Everything up to the third slash, i.e. "http://host".
else:
    base = start

br = mechanize.Browser()
r = br.open(start)
html = r.read()
linkList = []

print "Parsing " + start
downloadProcess(html,base,filetype,linkList)

for leftover in linkList:
    time.sleep(0.1)  # Be polite: pause briefly between requests.
    linkText = str(leftover.get('href'))
    print "Parsing " + base + linkText
    br = mechanize.Browser()
    r = br.open(base + linkText)
    html = r.read()
    linkList = []
    downloadProcess(html, base, filetype, linkList)  # Note: linkList is rebound each pass, so the crawl only goes one level beyond the start page.
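The index arithmetic in the listing above is easy to misread, so here is the same slicing applied to a hypothetical URL and href, written in Python 3 syntax:

```python
# Replicates the slash-index logic from the listing (hypothetical inputs).
start = "http://example.com/gallery/index.html"  # made-up start URL
slash_list = [i for i, ch in enumerate(start) if ch == "/"]
# The first two slashes belong to "http://", so the third one (index 2)
# marks the end of the host part.
base = start[:slash_list[2]] if len(slash_list) >= 3 else start

link_text = "/pics/photo1.jpg"  # an href relative to the site root
slashes = [i for i, ch in enumerate(link_text) if ch == "/"]
directory_name = link_text[slashes[0] + 1:slashes[1]]  # text between the 1st and 2nd slash
file_save = link_text.lstrip("/")  # drop the leading slash for a local path

print(base)            # http://example.com
print(directory_name)  # pics
print(file_save)       # pics/photo1.jpg
```

This also shows why the book's code assumes root-relative hrefs: an href like `photo1.jpg` with fewer than two slashes would make the `slashes[1]` lookup raise an IndexError.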
Copyright notice: This is the blogger's original article; do not repost without permission. https://blog.csdn.net/u011410413/article/details/51560220
Tags: python, raspberry pi
Category: Python