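Crawl specific information from a web page.
Techniques used: regular expressions, input/output streams
Example: extracting an email address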
import java.net.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.io.*;
public class Spider
{
    public static void main(String[] args)
    {
        try
        {
            URL url = new URL("http://daoshi.eol.cn/page.php?exp_id=14702");
            URLConnection urlConn = url.openConnection();
            // try-with-resources closes every stream, even if an exception is thrown
            try (InputStream is = urlConn.getInputStream();
                 InputStreamReader isr = new InputStreamReader(is);
                 BufferedReader br = new BufferedReader(isr);
                 FileOutputStream fos = new FileOutputStream("1.txt"))
            {
                // Read the whole page into one string; readLine() drops the line breaks
                String strLine;
                StringBuilder sb = new StringBuilder();
                while ((strLine = br.readLine()) != null)
                {
                    sb.append(strLine);
                }
                // Group 2 captures the table cell that follows the "E-mail" label
                String regExp = "E-mail </strong></td>(.{1,2})\\s*<td width=\"150\">(.{1,50})</td>";
                StringBuilder sbSec = new StringBuilder();
                Pattern pat = Pattern.compile(regExp);
                Matcher mat = pat.matcher(sb);
                if (mat.find())
                {
                    sbSec.append(mat.group(2)).append("\n");
                }
                // Write the extracted address to 1.txt
                fos.write(sbSec.toString().getBytes());
            }
        }
        catch (MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
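The matching step can be tried without any network access. The following is only a minimal sketch: it reuses the regular expression from Spider, while the class name EmailRegexDemo, the helper extractEmails, and the HTML snippet are made up for illustration. It loops over find() so that every match is collected, not just the first one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexDemo {
    // Same pattern as in Spider: group 2 is the cell after the "E-mail" label.
    static final Pattern EMAIL_CELL = Pattern.compile(
        "E-mail </strong></td>(.{1,2})\\s*<td width=\"150\">(.{1,50})</td>");

    // Hypothetical helper: returns group 2 of every match found in the text.
    static List<String> extractEmails(CharSequence html) {
        List<String> found = new ArrayList<>();
        Matcher mat = EMAIL_CELL.matcher(html);
        while (mat.find()) {
            found.add(mat.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up markup in the shape the pattern expects, not the real page.
        String html = "<td><strong>E-mail </strong></td> "
                    + "<td width=\"150\">someone@example.com</td>";
        System.out.println(extractEmails(html)); // prints [someone@example.com]
    }
}
```

Note that `.{1,50}` is greedy, so if another `</td>` follows within 50 characters, group 2 can capture more than the address. The reluctant form `.{1,50}?` is one possible adjustment.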
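Notes:
1. After the program runs, you need to refresh (for example, the project folder in your IDE) before the result file 1.txt becomes visible.
2. Chinese text may come out garbled.
Fix: specify the encoding when reading the input and writing the output instead of relying on the platform default. For the input, change
InputStreamReader isr=new InputStreamReader(is);
to
InputStreamReader isr=new InputStreamReader(is,"UTF-8");
The output can be handled the same way, for example by calling getBytes("UTF-8") instead of getBytes().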
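The effect of the charset argument can be seen in a small offline sketch. The class name CharsetDemo and the helper readAll are hypothetical, and a ByteArrayInputStream stands in for the network stream. The same UTF-8 bytes are decoded once with the matching charset and once with a wrong one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: reads the whole stream, decoding bytes with cs.
    static String readAll(InputStream in, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, cs))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u4E2D\u6587" is the two-character word for "Chinese";
        // escapes keep the source file ASCII-only.
        String text = "\u4E2D\u6587";
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        String ok = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String garbled = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(text));      // prints true
        System.out.println(garbled.equals(text)); // prints false
    }
}
```

Using the StandardCharsets constants rather than the string "UTF-8" avoids the checked UnsupportedEncodingException. Both forms select the same charset.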