Python, Crawler and Raspberry Pi

Original post · 2016-06-02 00:14:58

I left the Raspberry Pi alone for quite a while, since I didn't think I could achieve much with it in the short run. But now that I've started Python, I'm starting on the Raspberry Pi too.

Today I'm fabricating another little piece of crap to introduce two new web-crawling tools: BeautifulSoup and mechanize.
I'm simply glad that I've started to learn about Python. Emphasis on "about"…

from bs4 import BeautifulSoup
import mechanize

start = "http://" + raw_input("Where would you like to start searching?\n")
br = mechanize.Browser()
r = br.open(start)
html = r.read()  # This is where the page's HTML actually gets fetched.

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))  # Prints every href on the page, mostly relative sub-directory links.
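
The hrefs printed here are usually relative paths. If you want absolute URLs instead, one option (my own sketch using the Python 2 standard library, not part of the snippet above) is urlparse.urljoin, shown with a hypothetical base URL:

from urlparse import urljoin

base = "http://www.example.com/docs/"  # hypothetical starting page
print urljoin(base, "/images/pi.jpg")  # -> http://www.example.com/images/pi.jpg
print urljoin(base, "intro.html")      # -> http://www.example.com/docs/intro.html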


The full code below is from "Learn Raspberry Pi Programming with Python"; see the comment on the href handling, that is the important part.

import mechanize
import time
from bs4 import BeautifulSoup
import os
import urllib

def downloadProcess(html, base, filetype, linkList):
    "This does the actual file downloading."
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all('a'):
        linkText = str(link.get('href'))  # Note: we assume the href is a relative path like "/dir/file", not a full "http://" URL.

        if filetype in linkText:
            slashList = [i for i, ind in enumerate(linkText) if ind == '/']  # indices of every '/' in the link
            directoryName = linkText[(slashList[0] + 1):slashList[1]]  # slice out the first directory name
            if not os.path.exists(directoryName):
                os.makedirs(directoryName)

            image = urllib.URLopener()
            linkGet = base + linkText
            filesave = linkText.lstrip("/")
            image.retrieve(linkGet, filesave)

        elif "htm" in linkText:  # matches both ".htm" and ".html"
            linkList.append(link)

start = "http://" + raw_input("Where would you like to start searching?\n")
filetype = raw_input("What file type are you looking for?\n")

numSlash = start.count('/')
slashList = [i for i, ind in enumerate(start) if ind == '/']

if (len(slashList) >= 3): 
    third = slashList[2]
    base = start[:third]

else:
    base = start

br = mechanize.Browser()
r = br.open(start)
html = r.read()
linkList = []

print "Parsing" + start
downloadProcess(html,base,filetype,linkList)

for leftover in linkList:

    time.sleep(0.1)
    linkText = str(leftover.get('href'))
    print "Parsing" + base + linkText
    br = mechanize.Browser()
    r = br.open(base + linkText)
    html = r.read()
    linkList = []
    downloadProcess(html,base,filetype,linkList) # The recursion: depth goes first.
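
One caveat: this listing is Python 2 code. raw_input, the print statement, and urllib.URLopener are all gone in Python 3. Under Python 3 the download step might look roughly like the following sketch (with a hypothetical URL, not from the book):

import os
import urllib.request

linkGet = "http://www.example.com/images/pi.jpg"        # hypothetical absolute URL
fileSave = linkGet.split("//", 1)[1].split("/", 1)[1]   # -> "images/pi.jpg"
os.makedirs(os.path.dirname(fileSave), exist_ok=True)   # create the target directory if needed
urllib.request.urlretrieve(linkGet, fileSave)           # fetch the file and save it locally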