I wrote some crawlers these days. They are not very robust, but it feels terrific when they bring the data back.
Here are some points I found important in the process.
1. Fetching the HTML: I used urllib to download the page source.
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html
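The bare urlopen call above works on sites that don't filter clients. Some servers reject urllib's default User-Agent or hang indefinitely, so a slightly more defensive variant sends a browser-like header and a timeout. This is a minimal sketch, not part of the original post; the header value and timeout are illustrative assumptions:

```python
import urllib.request

def get_html(url, timeout=10):
    # Some sites reject urllib's default User-Agent, so send a browser-like
    # one (the value here is an illustrative choice, not from the post).
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # A timeout keeps the crawler from hanging forever on a dead server.
    with urllib.request.urlopen(req, timeout=timeout) as page:
        return page.read()
```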
2. The BeautifulSoup class: with bs4, it is convenient to extract the data we want.
html = getHtml(url)
soup = BeautifulSoup(html, 'html.parser')
getImg(soup)
getUrl(soup)
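If you want to see the same idea without the bs4 dependency, the standard library's html.parser can also walk the tags and pull attributes. This is a dependency-free sketch of what soup.find_all('img') plus img.get('data-url') does, not the post's original approach:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the data-url attribute of every <img> tag, mirroring
    what the post does with soup.find_all('img')."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if "data-url" in attrs:
                self.srcs.append(attrs["data-url"])

collector = ImgSrcCollector()
collector.feed('<div><img data-url="http://a.example/1.jpg"><img src="x.png"></div>')
print(collector.srcs)  # ['http://a.example/1.jpg']
```

BeautifulSoup is still the more convenient tool once pages get messy; the stdlib parser just shows there is no magic involved.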
3. Regular expressions: the re library is brilliant for searching and matching strings.
for img in soup.find_all('img'):
    print(img, '\n')
    src = str(img.get('data-url'))
    if re.match(r'^https?:/{2}\w.+$', src):
        tot = tot + 1
        urllib.request.urlretrieve(src, "%s.jpg" % tot)
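It helps to see what the post's pattern actually accepts and rejects. Note that str(img.get('data-url')) produces the literal string "None" when the attribute is missing, which the pattern conveniently filters out; protocol-relative links (starting with "//") are rejected too, which is why the full program later retries them with "http:" prepended:

```python
import re

# The post's pattern: "http" or "https", then "://", then a word character
# followed by at least one more character.
pattern = re.compile(r'^https?:/{2}\w.+$')

print(bool(pattern.match("http://www.58pic.com/1.jpg")))  # True
print(bool(pattern.match("//img.example.com/1.jpg")))     # False: no scheme
print(bool(pattern.match("None")))                        # False: missing attribute
```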
4. The os module: this is helpful for dealing with files and directories.
os.mkdir("58pic")
os.chdir("58pic")
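One pitfall of the two lines above: os.mkdir raises FileExistsError on a second run, and os.chdir changes global state for the whole process. A sketch of a slightly safer variant, with os.makedirs(exist_ok=True) and explicit path building instead of chdir (the use of a temporary base directory here is only for illustration, not from the post):

```python
import os
import tempfile

# exist_ok=True makes repeated runs a no-op instead of raising.
base = os.path.join(tempfile.mkdtemp(), "58pic")
os.makedirs(base, exist_ok=True)
os.makedirs(base, exist_ok=True)  # second call does not raise

# Build target paths explicitly rather than relying on the working directory.
target = os.path.join(base, "1.jpg")
print(os.path.isdir(base))  # True
```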
The complete code is as follows:
import os
import re
import urllib.request
from bs4 import BeautifulSoup

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

def getImg(soup):
    # Download every <img> whose data-url looks like an absolute URL.
    global tot
    for img in soup.find_all('img'):
        print(img, '\n')
        src = str(img.get('data-url'))
        if re.match(r'^https?:/{2}\w.+$', src):
            tot = tot + 1
            urllib.request.urlretrieve(src, "%s.jpg" % tot)
        # Retry protocol-relative URLs ("//host/...") with an explicit scheme.
        src = "http:" + src
        if re.match(r'^https?:/{2}\w.+$', src):
            tot = tot + 1
            urllib.request.urlretrieve(src, "%s.jpg" % tot)

def getUrl(soup):
    # Pick the next page: the first unvisited absolute link containing 'image'.
    global tmp
    global url
    for ai in soup.find_all('a'):
        href = str(ai.get('href'))
        if not re.match(r'^https?:/{2}\w.+$', href):
            continue
        if href.find('image') == -1:
            continue
        if href == tmp:
            continue
        url = href
        break

tot = 0
os.mkdir("58pic")
os.chdir("58pic")
url = "http://www.58pic.com/"
tmp = ""
while tot < 100:
    print(url, '\n')
    if url == tmp:  # no new page was found, so stop
        break
    tmp = url
    html = getHtml(url)
    soup = BeautifulSoup(html, 'html.parser')
    getImg(soup)
    getUrl(soup)
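The driver loop has two stop conditions: 100 images downloaded, or the next page equals the current one. That control flow can be exercised without touching the network by replacing getHtml and getUrl with a hypothetical page map. A minimal sketch under those stand-in assumptions:

```python
# Simulate the crawl loop's control flow with stub data: each page names
# the next page, and the last page links to itself, so url == tmp triggers
# the break just as in the real crawler.
pages = {
    "page1": "page2",
    "page2": "page3",
    "page3": "page3",  # last page links to itself
}

visited = []
url, tmp = "page1", ""
while len(visited) < 100:
    if url == tmp:  # no new page was found, so stop
        break
    tmp = url
    visited.append(url)
    url = pages[url]  # stand-in for getHtml + getUrl

print(visited)  # ['page1', 'page2', 'page3']
```

Testing the loop this way also makes the subtle ordering visible: tmp must be updated before fetching, or a page that links to itself would be crawled twice.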