Reference for this post: http://tonl.iteye.com/blog/1918245
Python version: 2.7, 64-bit Windows build.
Download Python: http://www.python.org/getit/
- Python 2.7.5 Windows X86-64 Installer (Windows AMD64 / Intel 64 / X86-64 binary -- does not include source). Run it to install Python.
First, write the following spider.py script:
# -*- coding: utf-8 -*-
from urllib import urlopen
import os
import sys


class Spider:
    """
    Download the web pages listed in a given file.
    """

    def __init__(self, filename, downloadPath):
        """
        Store the URL list file and the download directory;
        exit if either one does not exist.
        """
        if not os.path.isfile(filename):
            print 'the given file does not exist, the program will exit'
            sys.exit(1)
        if not os.path.isdir(downloadPath):
            print 'the given download path does not exist, the program will exit'
            sys.exit(1)
        self.fname = filename
        self.dpath = downloadPath

    def download(self):
        """
        Download every URL in the file, one URL per line.
        """
        fp = open(self.fname, 'r')
        for line in fp:
            url = line.strip()  # drop the trailing newline before using the URL
            if not url:
                continue  # skip blank lines
            # Build the file name from the letters and digits of the URL.
            tempname = filter(str.isalnum, url)
            if 'html' in url:
                tempname = tempname.replace('html', '.html')
            else:
                tempname = tempname + '.html'
            self.download_html(url, os.path.join(self.dpath, tempname))
        fp.close()

    def download_html(self, website, filename):
        """
        Download the page at the given address and save it under filename.
        """
        response = urlopen(website)
        data = response.read()
        # 'wb' overwrites on re-run and avoids newline translation on Windows.
        fp = open(filename, 'wb')
        fp.write(data)
        fp.close()


def test():
    """
    Entry point: read the two command-line arguments and run the spider.
    """
    filename = sys.argv[1]
    downloadPath = sys.argv[2]
    spider = Spider(filename, downloadPath)
    spider.download()


if __name__ == '__main__':
    test()
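The script above takes two arguments. The first is a file listing the addresses of the pages to download, one per line, typically in this format (websites.txt):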
http://blog.csdn.net/fansy1990
http://www.baidu.com
The second argument is the directory where the downloaded pages are stored.
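To make the naming rule concrete, here is a minimal sketch of how spider.py turns one line of websites.txt into a file name inside that directory. `page_filename` is a hypothetical helper written for this illustration (it is not part of the script), and it is meant to run on both Python 2.7 and Python 3.

```python
def page_filename(url):
    """Build an output file name from one line of websites.txt, following spider.py."""
    url = url.strip()
    # Keep only letters and digits, like filter(str.isalnum, line) in the script.
    name = "".join(ch for ch in url if ch.isalnum())
    if "html" in url:
        # Every "html" left in the stripped name becomes ".html".
        return name.replace("html", ".html")
    # Otherwise just append the extension.
    return name + ".html"
```

With the two sample addresses, `page_filename("http://www.baidu.com")` gives `httpwwwbaiducom.html` and `page_filename("http://blog.csdn.net/fansy1990")` gives `httpblogcsdnnetfansy1990.html`. Note that two addresses differing only in punctuation would map to the same file name.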
Then run it from the command line:
python D:\spider.py D:\websites.txt D:\download_tmp
Then look in download_tmp on drive D for the downloaded files. If they are there, the setup is correct.
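If you prefer to check the result from Python rather than by browsing the folder, the following is one possible sketch. `downloaded_pages` is a hypothetical helper name, and treating a zero-byte file as a failed download is an assumption of this example, not something the script reports. It should run on both Python 2.7 and Python 3.

```python
import os


def downloaded_pages(download_dir):
    """Return (name, size_in_bytes) for every .html file in download_dir, sorted by name."""
    pages = []
    for name in sorted(os.listdir(download_dir)):
        full = os.path.join(download_dir, name)
        # Only regular files with the extension the spider writes are counted.
        if os.path.isfile(full) and name.endswith(".html"):
            pages.append((name, os.path.getsize(full)))
    return pages
```

Calling `downloaded_pages(r"D:\download_tmp")` after a run should return one entry per line of websites.txt; an empty list, or entries with size 0, suggests the download did not work.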
Finally, write the following Java program. It needs the jython-*.jar package imported into the project (I downloaded version 2.2):
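package test;

import java.io.IOException;

public class PyTest {
    /**
     * Runs spider.py as an external process and waits for it to finish.
     *
     * @param args unused
     * @throws IOException if the python process cannot be started
     * @throws InterruptedException if the wait is interrupted
     */
    public static void main(String[] args) throws IOException, InterruptedException {
        String py_path = "D:\\spider.py";
        String websites = "D:\\websites.txt";
        String outDir = "D:\\tmp";
        // Launch the python found on PATH with the script and its two arguments.
        Process pr = Runtime.getRuntime().exec("python " + py_path + " " + websites + " " + outDir);
        pr.waitFor();
        System.out.println("done ...");
    }
}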
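To run the program above, set the Environment properties of the run configuration in Eclipse: add a PATH variable whose value is the Python installation directory.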
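The Java program starts the interpreter by the bare name python, so the launch can only succeed if some directory on PATH contains it. The sketch below is a rough diagnostic for checking a PATH value, not part of the original setup. `find_on_path` is a hypothetical helper name, and it only tries a `.exe` suffix on Windows instead of the full PATHEXT list.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of the first match for program on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows the command "python" is normally the file python.exe.
    candidates = [program]
    if os.name == "nt" and not program.lower().endswith(".exe"):
        candidates.append(program + ".exe")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # ignore empty PATH entries
        for name in candidates:
            full = os.path.join(directory, name)
            if os.path.isfile(full):
                return full
    return None
```

`find_on_path("python")` returns the location of the interpreter when the current PATH can resolve it and `None` otherwise. Passing the value you typed into Eclipse as the `path` argument checks that setting directly.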
After it runs, you will see this message:
*sys-package-mgr*: can't create package cache dir, *jython-2.2.jar\cachedir\packages'
You can ignore this; it does not affect the program.
Share, grow, be happy.
When reposting, please credit the blog: http://blog.csdn.net/fansy1990