Java Text I/O + Web Crawler

老阿姨叫我宝贝-

Last modified: 2022-03-29 21:48:26

Tags: java

First published: 2022-03-29 21:42:08

Article link: https://blog.csdn.net/m0_65333099/article/details/123826569

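Text I/O

1. The File class
2. Writing data with PrintWriter
3. Closing resources automatically
4. Reading data with Scanner
5. A text-replacement program
6. Reading data from the Web
7. Web crawler
8. The robots protocol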

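1. The File class

The File class can be used to obtain the properties of a file or directory.
UML diagram: (image not included)
Creating a File object does not create a file on disk; the object simply wraps a path, whether or not a file exists there. The File class has no methods for reading or writing file contents.
For example: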

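public class Main{
    public static void main(String[] args){
        // Wrap a path; this does not open or create the file
        java.io.File file = new java.io.File("C:/Users/DELL/Desktop/ttt.txt");
        System.out.println(file.isFile());       // is it a regular file?
        System.out.println(file.canWrite());     // is it writable?
        System.out.println(file.canRead());      // is it readable?
        System.out.println(file.isDirectory());  // is it a directory?
        System.out.println(file.getPath());      // path as given, with platform separators
        System.out.println(file.getName());      // last component of the path
        System.out.println(file.length());       // size in bytes
    }
}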

Output (when ttt.txt exists and holds 15 bytes):

true
true
true
false
C:\Users\DELL\Desktop\ttt.txt
ttt.txt
15
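
To make the relationship between a File object and the file on disk concrete, here is a minimal sketch. The class name FileExistenceDemo and the file name notes.txt are assumptions made for illustration; the sketch works in a temporary directory so that no real files are touched.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class FileExistenceDemo {
    // Returns {exists before createNewFile(), exists after createNewFile()}.
    public static boolean[] demo() throws IOException {
        // Work in a fresh temporary directory so no real files are touched.
        File dir = Files.createTempDirectory("file-demo").toFile();
        File file = new File(dir, "notes.txt"); // hypothetical file name

        boolean before = file.exists(); // constructing a File creates nothing on disk
        file.createNewFile();           // creates an empty file at that path
        boolean after = file.exists();

        // Clean up the temporary file and directory.
        file.delete();
        dir.delete();
        return new boolean[] { before, after };
    }

    public static void main(String[] args) throws IOException {
        boolean[] r = demo();
        System.out.println("exists before createNewFile(): " + r[0]); // false
        System.out.println("exists after createNewFile():  " + r[1]); // true
    }
}
```

The File object is only a handle for a path, so it can be constructed for a file that does not exist yet and then checked with exists().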

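2. Writing data with PrintWriter

The PrintWriter class provides methods for writing data to a text file. First create a PrintWriter object:

java.io.PrintWriter output = new java.io.PrintWriter("C:/Users/DELL/Desktop/ttt.txt");

Use the print() method to write data to the file: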

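import java.io.IOException;

public class Main{
    public static void main(String[] args) throws IOException{
        // Creates the file, or truncates it if it already exists
        java.io.PrintWriter output = new java.io.PrintWriter("C:/Users/DELL/Desktop/ttt.txt");
        output.print("Hello World!");
        // Flush buffered data and release the file
        output.close();
    }
}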

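1. If the file does not exist, the PrintWriter constructor creates a new file.
If the file already exists, its current contents are discarded.
2. The PrintWriter constructor may throw a checked exception (FileNotFoundException, a subclass of IOException), so declare it in the method header or catch it.
3. Call close() when you are finished; otherwise buffered data may not be written to the file.

Code that prevents the file from being overwritten by accident: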

import java.io.IOException;

public class Main{
    public static void main(String[] args) throws IOException{
        java.io.File file = new java.io.File("C:/Users/DELL/Desktop/ttt.txt");
        // Refuse to overwrite an existing file
        if(file.exists()){
            System.exit(1);
        }
        java.io.PrintWriter output = new java.io.PrintWriter("C:/Users/DELL/Desktop/ttt.txt");
        output.print("aaaaa");
        output.close();
    }
}

If ttt.txt already exists, the program exits; otherwise it creates ttt.txt and writes aaaaa to it.
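
Exiting is one way to protect an existing file. Another option, not covered in the text above, is to append to the file instead of truncating it. The sketch below assumes that wrapping a FileWriter opened in append mode is acceptable for your use case; the class name AppendDemo and the helper appendLine are made up for illustration.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;

public class AppendDemo {
    // Appends one line to the file; existing contents are kept.
    public static void appendLine(File file, String text) throws IOException {
        // The second FileWriter argument (true) selects append mode.
        PrintWriter out = new PrintWriter(new FileWriter(file, true));
        out.println(text);
        out.close(); // flush buffered data and release the file
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a real path.
        File file = File.createTempFile("append-demo", ".txt");
        file.deleteOnExit();
        appendLine(file, "first");
        appendLine(file, "second");
        System.out.println(Files.readAllLines(file.toPath())); // [first, second]
    }
}
```

Both lines survive because the second call opens the file in append mode rather than creating a fresh PrintWriter on the path.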

3. Closing resources automatically

JDK 7 introduced the try-with-resources syntax, which closes the file automatically.
The code is as follows:

import java.io.IOException;

public class Main{
    public static void main(String[] args) throws IOException{
        java.io.File file = new java.io.File("C:/Users/DELL/Desktop/ttt.txt");
        if(file.exists()){
            System.exit(1);
        }
        // Resources declared in the parentheses are closed when the block ends
        try(
            java.io.PrintWriter output = new java.io.PrintWriter("C:/Users/DELL/Desktop/ttt.txt");
        ){
            output.println("aaaaa");
            output.print("aaaa");
        }
    }
}
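
To see what try-with-resources does behind the scenes, the following sketch records when each resource is opened and closed. The Resource class and the events list are hypothetical helpers written for this illustration; any class that implements AutoCloseable can be used in the same way.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoCloseDemo {
    // Records events so the order of operations can be inspected.
    static final List<String> events = new ArrayList<>();

    // Hypothetical resource that logs when it is opened and closed.
    static class Resource implements AutoCloseable {
        private final String name;

        Resource(String name) {
            this.name = name;
            events.add("open " + name);
        }

        @Override
        public void close() {
            events.add("close " + name);
        }
    }

    public static List<String> run() {
        events.clear();
        // Both resources are closed automatically when the block exits.
        try (Resource a = new Resource("a"); Resource b = new Resource("b")) {
            events.add("body");
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(run()); // [open a, open b, body, close b, close a]
    }
}
```

No explicit close() call appears in the try block, and the resources are closed in the reverse of the order in which they were declared.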

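4. Reading data with Scanner

The following code creates an object that reads data from the keyboard:

Scanner sc = new Scanner(System.in);

The following code creates an object that reads data from a file. Pass a File object; if a String is passed, the Scanner reads the string itself rather than the file:

Scanner sc = new Scanner(new java.io.File("C:/Users/DELL/Desktop/ttt.txt"));

The complete code is as follows: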

import java.io.IOException;
import java.util.Scanner;

public class Main{
    public static void main(String[] args) throws IOException{
        java.io.File file = new java.io.File("C:/Users/DELL/Desktop/ttt.txt");
        if(file.exists()){
            System.exit(1);
        }
        // Write two lines to the file
        try(
            java.io.PrintWriter output = new java.io.PrintWriter("C:/Users/DELL/Desktop/ttt.txt");
        ){
            output.println("a b c d e");
            output.print("aaaa");
        }
        // Read the file back one whitespace-separated token at a time
        Scanner sc = new Scanner(file);
        while(sc.hasNext())
        {
            System.out.println(sc.next());
        }
        sc.close();
    }
}
Output:
a
b
c
d
e
aaaa

Closing the input file is not strictly required, but doing so releases the resources the file occupies.
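
A common slip is to pass a path string straight to the Scanner constructor. The sketch below contrasts the two constructors; the class name ScannerSourceDemo, the helper tokens, and the path C:/data/ttt.txt are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerSourceDemo {
    // Collects every whitespace-delimited token the scanner produces, then closes it.
    public static List<String> tokens(Scanner sc) {
        List<String> result = new ArrayList<>();
        while (sc.hasNext()) {
            result.add(sc.next());
        }
        sc.close();
        return result;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a real path.
        File file = File.createTempFile("scanner-demo", ".txt");
        file.deleteOnExit();
        try (PrintWriter out = new PrintWriter(file)) {
            out.println("a b c");
        }

        // Scanner(File) reads the contents of the file.
        System.out.println(tokens(new Scanner(file))); // [a, b, c]

        // Scanner(String) scans the characters of the string itself;
        // the hypothetical path below is never opened.
        System.out.println(tokens(new Scanner("C:/data/ttt.txt"))); // [C:/data/ttt.txt]
    }
}
```

The second call compiles and runs without error, which is why this mistake is easy to miss.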

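5. A text-replacement program

import java.io.IOException;
import java.util.Scanner;

public class Main{
    public static void main(String[] args) throws IOException {
        // Expected arguments: sourceFile targetFile oldString newString
        if(args.length != 4){
            System.out.println("Error1!");
            System.exit(1);
        }
        // The source file must exist
        java.io.File sourceFile = new java.io.File(args[0]);
        if(!sourceFile.exists()){
            System.out.println("Error2!");
            System.exit(2);
        }
        // The target file must not exist, so nothing is overwritten
        java.io.File targetFile = new java.io.File(args[1]);
        if(targetFile.exists()){
            System.out.println("Error3!");
            System.exit(3);
        }
        try(
            Scanner sc = new Scanner(sourceFile);
            java.io.PrintWriter output = new java.io.PrintWriter(targetFile);
        ){
            // Copy line by line, replacing as we go
            while(sc.hasNextLine())
            {
                String s1 = sc.nextLine();
                // Note: replaceAll treats args[2] as a regular expression
                String s2 = s1.replaceAll(args[2],args[3]);
                output.println(s2);
            }
        }
    }
}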

args[0] holds the path of the source file, args[1] holds the path of the target file, and each occurrence of the string args[2] is replaced with args[3].
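
One detail worth knowing: String.replaceAll interprets its first argument as a regular expression, while String.replace treats it as literal text. The sketch below shows the difference on a made-up input line; the class and method names are hypothetical.

```java
public class LiteralReplaceDemo {
    // Regex-based replacement: oldStr is compiled as a regular expression.
    public static String regexReplace(String line, String oldStr, String newStr) {
        return line.replaceAll(oldStr, newStr);
    }

    // Literal replacement: oldStr is matched character for character.
    public static String literalReplace(String line, String oldStr, String newStr) {
        return line.replace(oldStr, newStr);
    }

    public static void main(String[] args) {
        String line = "v1.0 and v1x0";
        // In a regex, '.' matches any character, so "1.0" also matches "1x0".
        System.out.println(regexReplace(line, "1.0", "2"));   // v2 and v2
        // The literal version only replaces the exact text "1.0".
        System.out.println(literalReplace(line, "1.0", "2")); // v2 and v1x0
    }
}
```

If the search string may contain characters such as . * or ?, replace is likely the safer choice for a plain find-and-replace tool.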

6. Reading data from the Web

Given the URL (commonly called the web address) of a file on the Web, a program can also read data from the Web.

import java.util.Scanner;

public class Main{
    public static void main(String[] args){
        System.out.print("Enter a URL:");
        Scanner sc = new Scanner(System.in);
        String inputURL = sc.nextLine();

        try{
            java.net.URL url = new java.net.URL(inputURL);
            int count = 0;
            // Read the remote resource line by line, like a local file
            Scanner input = new Scanner(url.openStream());
            while(input.hasNextLine())
            {
                String line = input.nextLine();
                count += line.length();
            }
            input.close();
            // Total number of characters, not counting line terminators
            System.out.println(count);
        }
        catch(java.net.MalformedURLException ex){
            System.out.println("Invalid URL");
        }
        catch(java.io.IOException ex){
            System.out.println("no such file");
        }
    }
}

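The program reads a URL string and creates a URL object for it, then calls openStream() to open an input stream on the URL and wraps that stream in a Scanner object named input. Data is then read from the input stream just as it would be from a local file.
MalformedURLException and IOException are checked exceptions, so they must be caught or declared.
MalformedURLException: thrown when the URL string is malformed, for example when the http:// prefix is missing.
IOException: thrown when the URL string is well formed but the resource cannot be found or opened.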
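
The two failure modes can be tried without a network connection, because openStream() also works for file: URLs. The following is a minimal sketch under that assumption; the class name UrlReadDemo, the helper countChars, and the address www.example.com/index.html are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Scanner;

public class UrlReadDemo {
    // Counts the characters (excluding line terminators) behind a URL.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs but still works
    public static int countChars(String address) throws IOException {
        URL url = new URL(address); // throws MalformedURLException if no protocol is given
        int count = 0;
        try (Scanner input = new Scanner(url.openStream())) {
            while (input.hasNextLine()) {
                count += input.nextLine().length();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A temporary local file, addressed through a file: URL, stands in for a web page.
        File file = File.createTempFile("url-demo", ".txt");
        file.deleteOnExit();
        try (PrintWriter out = new PrintWriter(file)) {
            out.println("Hello");
            out.println("World!");
        }
        System.out.println(countChars(file.toURI().toURL().toString())); // 11

        try {
            countChars("www.example.com/index.html"); // no protocol prefix
        } catch (MalformedURLException ex) {
            System.out.println("Invalid URL");
        }
    }
}
```

Because MalformedURLException is a subclass of IOException, it has to be caught before IOException when both are handled in separate catch clauses.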

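7. Web Crawler

Starting from the URL entered by the user, the crawler scans that page and finds URL1, URL2, URL3, URL4…
In URL1 it finds URL11, URL12, URL13…
In URL2 it finds URL21, URL22, URL23…
In URL3 it finds URL31, URL32, URL33…
…
This is similar to breadth-first search.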
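
The traversal order can be illustrated without any network access by replacing real pages with an in-memory map from each page to the links it contains. This is only a sketch: the class name BfsOrderDemo, the method crawlOrder, and the tiny link graph are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class BfsOrderDemo {
    // Returns the pages in the order a breadth-first crawl would visit them.
    public static List<String> crawlOrder(Map<String, List<String>> links, String start, int limit) {
        Queue<String> queue = new ArrayDeque<>();    // pages waiting to be visited
        Set<String> visited = new LinkedHashSet<>(); // visited pages, in visit order
        queue.offer(start);
        while (!queue.isEmpty() && visited.size() < limit) {
            String page = queue.poll();
            if (visited.add(page)) { // add() returns false if the page was already visited
                for (String next : links.getOrDefault(page, Collections.emptyList())) {
                    if (!visited.contains(next)) {
                        queue.offer(next);
                    }
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // Hypothetical link graph: each key is a page, each value lists the links on it.
        Map<String, List<String>> links = new HashMap<>();
        links.put("URL", Arrays.asList("URL1", "URL2"));
        links.put("URL1", Arrays.asList("URL11", "URL12"));
        links.put("URL2", Arrays.asList("URL21", "URL")); // links back to the start page
        System.out.println(crawlOrder(links, "URL", 10));
        // [URL, URL1, URL2, URL11, URL12, URL21]
    }
}
```

All pages one link away from the start are visited before any page two links away, which is the defining property of breadth-first search. The back link to URL is skipped because that page has already been visited.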

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Scanner;

public class Main{
    public static void main(String[] args){
        Scanner sc = new Scanner(System.in);
        String startURL = sc.nextLine();

        Queue<String> queue = new LinkedList<String>();    // URLs waiting to be crawled
        ArrayList<String> list = new ArrayList<String>();  // URLs already crawled
        queue.offer(startURL);
        // Stop after 10 pages or when no URLs remain
        while(!queue.isEmpty() && list.size() < 10)
        {
            String URL = queue.poll();
            if(!list.contains(URL)){
                list.add(URL);
                System.out.println("Crawl: " + URL);
                // Queue every link on this page that has not been crawled yet
                for(String newURL : getURL(URL)){
                    if(!list.contains(newURL)){
                        queue.offer(newURL);
                    }
                }
            }
        }
    }
    // Returns every http:// link found in the page at the given URL
    public static ArrayList<String> getURL(String URL){
        ArrayList<String> list2 = new ArrayList<String>();
        try{
            java.net.URL url = new java.net.URL(URL);
            Scanner sc = new Scanner(url.openStream());
            while(sc.hasNextLine())
            {
                String line = sc.nextLine();
                // Search each line from its beginning
                int start = line.indexOf("http://");
                while(start >= 0)
                {
                    // A link is assumed to end at the next double quote
                    int end = line.indexOf("\"",start);
                    if(end > 0){
                        list2.add(line.substring(start,end));
                        start = line.indexOf("http://",end);
                    }
                    else break;
                }
            }
            sc.close();
        }
        catch(Exception ex){
            System.out.println("Error!");
        }
        return list2;
    }
}
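
The crawler above only recognizes links that start with http:// and end at a double quote. As a variation, the link extraction step could be written with a regular expression that also accepts https:// links. This is a sketch rather than a robust HTML parser; the class name LinkExtractDemo, the method extractLinks, and the sample line are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractDemo {
    // Matches http:// or https:// followed by characters that are not
    // quotes, whitespace, or angle brackets.
    private static final Pattern LINK = Pattern.compile("https?://[^\"'\\s<>]+");

    // Returns every link found in one line of text, in order of appearance.
    public static List<String> extractLinks(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical line of HTML.
        String line = "<a href=\"http://a.com/x\">A</a> <img src=\"https://b.org/i.png\">";
        System.out.println(extractLinks(line)); // [http://a.com/x, https://b.org/i.png]
    }
}
```

A method like this could stand in for the indexOf loop inside getURL, applied to each line read from the page.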