I left the Raspberry Pi alone for quite a while, as I did not think I could achieve much with it in the short run. But since I started learning Python, I have picked the Raspberry Pi back up.
Today I'm putting together another rough little script to introduce a new web crawler, built with BeautifulSoup and mechanize.
Still, I'm glad that I have started to learn about Python — emphasis on "about"…
from bs4 import BeautifulSoup
import mechanize
import time
import urllib
import string
# Prompt for a starting URL, fetch that page with mechanize, and print
# the href of every anchor (<a>) tag found on it.
start = "http://" + raw_input("Where would you like to start searching?\n")

br = mechanize.Browser()
r = br.open(start)
html = r.read()  # raw HTML of the fetched page

soup = BeautifulSoup(html)
for link in soup.find_all('a'):
    # Each href is typically a sub-directory/relative path on the site.
    print(link.get('href'))
The full code below is from "Learn Raspberry Pi Programming with Python"; see the note at the end — that part is important.
import mechanize
import time
from bs4 import BeautifulSoup
import re
import string
import os
import urllib
def downloadProcess(html, base, filetype, linkList):
    """Download every link on the page whose href contains *filetype*.

    html     -- raw HTML of the page to scan
    base     -- scheme + host prefix (e.g. "http://example.com") used to
                rebuild an absolute URL from a relative href
    filetype -- substring to look for in each href (e.g. ".pdf")
    linkList -- list, mutated in place, that collects htm/html links
                for the caller to crawl later

    NOTE(review): assumes each matching href is site-relative and starts
    with "/" (e.g. "/dir/file.pdf") — an absolute "http://..." href would
    break the directory slicing below. TODO confirm against the crawled site.
    """
    soup = BeautifulSoup(html)
    for link in soup.find_all('a'):
        linkText = str(link.get('href'))
        if filetype in linkText:
            # Indices of every '/' in the href; the first two delimit the
            # top-level directory name: "/dir/file.pdf" -> "dir".
            slashList = [i for i, ind in enumerate(linkText) if ind == '/']
            directoryName = linkText[(slashList[0] + 1):slashList[1]]
            if not os.path.exists(directoryName):
                os.makedirs(directoryName)
            image = urllib.URLopener()
            linkGet = base + linkText
            # str.lstrip replaces the deprecated string.lstrip() module function.
            filesave = linkText.lstrip("/")
            image.retrieve(linkGet, filesave)
        elif "htm" in linkText:  # matches both ".htm" and ".html"
            linkList.append(link)
start = "http://" + raw_input("Where would you like to start searching?\n")
filetype = raw_input("What file type are you looking for?\n")
numSlash = start.count('/')
slashList = [i for i, ind in enumerate(start) if ind == '/']
if (len(slashList) >= 3):
third = slashList[2]
base = start[:third]
else:
base = start
br = mechanize.Browser()
r = br.open(start)
html = r.read()
linkList = []
print "Parsing" + start
downloadProcess(html,base,filetype,linkList)
for leftover in linkList:
time.sleep(0.1)
linkText = str(leftover.get('href'))
print "Parsing" + base + linkText
br = mechanize.Browser()
r = br.open(base + linkText)
html = r.read()
linkList = []
downloadProcess(html,base,filetype,linkList) # The recursion: depth goes first.