[JAVA] Crawling Baidu for Lines of Poetry Containing a Keyword


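I didn't feel like doing homework over winter break, and having just finished the Java basics I was itching to build something, so here is this rookie's first attempt.
For learning and discussion only; commercial use is strictly prohibited.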


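First we need to find out which keywords can be crawled. Before that, write a doGet function that fetches the content at a given URL (website).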

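// Imports used by the snippets in this post.
// JSONObject, JSONArray and JSONException are assumed to come from the third-party org.json library.
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.ListIterator;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public static String doGet(String website) {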
   HttpURLConnection httpURLConnection = null;
   InputStream inputStream = null;
   BufferedReader bufferedReader = null;
   String result = null;

   try {
       URL url = new URL(website);
       httpURLConnection = (HttpURLConnection)url.openConnection();
       httpURLConnection.setRequestMethod("GET");  // set the request method
       httpURLConnection.setConnectTimeout(3000);  // connect timeout: 3 seconds
       httpURLConnection.setReadTimeout(6000);     // read timeout: 6 seconds
       httpURLConnection.connect();
       if (httpURLConnection.getResponseCode() == 200){
           inputStream = httpURLConnection.getInputStream();
           bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
           StringBuffer stringBuffer = new StringBuffer();
           String temp = null;
           while ((temp = bufferedReader.readLine()) != null){
               stringBuffer.append(temp);
               stringBuffer.append("\r\n");
           }
           result = stringBuffer.toString();
       }
   } catch (MalformedURLException e) {
       e.printStackTrace();
   } catch (IOException e) {
       e.printStackTrace();
    } finally {
        if (null != bufferedReader){
            try {
                bufferedReader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (null != inputStream){
            try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
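        if (null != httpURLConnection){     // still null if the URL was malformed or openConnection() failed
            httpURLConnection.disconnect();
        }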
    }

    return result;
}
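
The nested null checks in the finally block above can be shortened with try-with-resources. The following is only a minimal sketch of the reading part of doGet, under the assumption that the stream is UTF-8 text; the class name StreamText and the method readAll are hypothetical and not part of the original code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
final class StreamText {
    // Reads the whole stream as UTF-8 text and ends every line with "\r\n", as doGet does.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) even if readLine() throws
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\r\n");
            }
        } catch (IOException e) {
            // rethrow unchecked so callers are not forced to declare IOException
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("a\nb".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```

In doGet, the argument would be httpURLConnection.getInputStream(); the connection itself still needs its own disconnect() call.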

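Next, write the code that fetches the keywords.
The stat0, subtitle, and query values in website are arbitrary here; they are not the point.
The method returns a list containing the keywords it found.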

public static ArrayList<String> getTypes() {
    ArrayList<String> arrayList = new ArrayList<>();
    StringBuffer website = new StringBuffer("http://opendata.baidu.com/api.php?from_mid=1&format=json&ie=utf-8&oe=utf-8&stat0=风&subtitle=含有山的诗句&query=含有山的诗句&rn=6&pn=0&resource_id=28239&cb=jQuery110208838176116082497_");
    website.append(System.currentTimeMillis());
    website.append("&_=");
    website.append(System.currentTimeMillis());

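    String result = doGet(website.toString());
    if (result == null) {       // doGet returns null when the request fails
        return arrayList;
    }
    // keep only the text from the first "{" to the last "}", dropping the cb=... callback wrapper
    result = result.substring(result.indexOf("{"), result.lastIndexOf("}") + 1);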

    JSONObject jsonObject = null;
    JSONArray jsonArray = null;

    try {
        jsonObject = new JSONObject(result);
        jsonArray = jsonObject.getJSONArray("data").getJSONObject(0).getJSONObject("OtherInfo").getJSONArray("stat0");
        for (int i = 0; i < jsonArray.length(); i++){
            arrayList.add(jsonArray.getJSONObject(i).getString("sa"));
        }
    } catch (JSONException e) {
        e.printStackTrace();
    }
    return arrayList;
}
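
The substring call in getTypes throws a StringIndexOutOfBoundsException when the response contains no braces at all. A slightly more defensive variant could look like the sketch below; the class name Jsonp and the method stripCallback are hypothetical, and the sample response in the usage note is made up for illustration.

```java
// Hypothetical helper, not part of the original post.
final class Jsonp {
    // Returns the text from the first '{' to the last '}' of a callback-wrapped response,
    // or null when the input is null or contains no such span.
    static String stripCallback(String response) {
        if (response == null) {
            return null;
        }
        int start = response.indexOf('{');
        int end = response.lastIndexOf('}');
        if (start < 0 || end < start) {
            return null;
        }
        return response.substring(start, end + 1);
    }
}
```

For example, stripCallback("cb_1({\"data\":[]});") yields {"data":[]}, which can then be passed to new JSONObject(...).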

Looking at the web page and then at website, we can see that rn is the number of lines per page and pn is the current page number minus 1.
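
To make the role of rn and pn concrete, here is a small sketch that assembles the request URL from its parts and percent-encodes the Chinese parameter values. The class name PoemUrl and the method build are hypothetical; the cb and _ parameters from the original URL are left out for brevity and would be appended the same way as in getTypes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original post.
final class PoemUrl {
    private static final String BASE =
            "http://opendata.baidu.com/api.php?from_mid=1&format=json&ie=utf-8&oe=utf-8";

    // keyword fills stat0, query fills subtitle and query; rn and pn are passed through unchanged.
    static String build(String keyword, String query, int rn, int pn) {
        return BASE
                + "&stat0=" + encode(keyword)
                + "&subtitle=" + encode(query)
                + "&query=" + encode(query)
                + "&rn=" + rn
                + "&pn=" + pn
                + "&resource_id=28239";
    }

    // Percent-encodes a parameter value as UTF-8.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform, so this should never happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("山", "含有山的诗句", 6, 0));
    }
}
```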

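With this key information in hand, write the code that fetches every line of poetry on one page.
Here I record the content in a temporary variable so it can be used directly later [definitely not because I'm lazy].
If an exception occurs, the crawl would jump straight to the next page without recording the current page's content. My perfectionism couldn't stand that, so I wrapped the request in a do-while to avoid skipped pages.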

private static StringBuffer TempDatas = new StringBuffer();

public static void getContents(int page, String type) {
    String result = null;
    String temp = null;
    boolean isWhile = false;
    do {
        StringBuffer website = new StringBuffer("http://opendata.baidu.com/api.php?from_mid=1&format=json&ie=utf-8&oe=utf-8&stat0=");
        website.append(type.substring(type.indexOf("有") + 1, type.indexOf("的")));
        website.append("&subtitle=含有山的诗句&query=含有山的诗句&rn=6&pn=");
        website.append(page);
        website.append("&resource_id=28239&cb=jQuery1102045186558424078416_");
        website.append(System.currentTimeMillis());
        website.append("&_=");
        website.append(System.currentTimeMillis());
        System.out.println(website);
        result = doGet(website.toString());
        System.out.println(result);
        JSONObject jsonObject = null;
        JSONArray jsonArray = null;
        try {
            result = result.substring(result.indexOf("{"), result.lastIndexOf("}") + 1);
            jsonObject = new JSONObject(result).getJSONArray("data").getJSONObject(0);
            jsonArray = jsonObject.getJSONArray("disp_data");
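            StringBuffer pageData = new StringBuffer();     // collect this page locally so a failed attempt leaves nothing half-written in TempDatas
            int count = 0;
            while (count < jsonArray.length()) {            // a page may hold fewer than 6 entries
                temp = jsonArray.getJSONObject(count).getString("ename");       // the line of poetry
                temp += "————";
                temp += jsonArray.getJSONObject(count).getJSONObject("author").getString("text");       // the author
                System.out.println(temp);
                pageData.append(temp);
                pageData.append("\r\n");
                count++;
            }
            TempDatas.append(pageData);                     // only reached when the whole page parsed cleanly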
            isWhile = false;
        } catch (JSONException e) {
            isWhile = true;
            e.printStackTrace();
        } catch (StringIndexOutOfBoundsException e) {
            isWhile = true;
            e.printStackTrace();
        } catch (NullPointerException e) {
            isWhile = true;
            e.printStackTrace();
        }
    }while (isWhile);
}
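
The do-while in getContents retries without any limit, so a page that keeps failing would loop forever. One possible alternative is to cap the number of attempts. The sketch below shows the idea in isolation; the class name Retry and the method withLimit are hypothetical, and the choice of how many attempts to allow is left to the caller.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the original post.
final class Retry {
    // Calls task until it returns without throwing, at most maxAttempts times.
    // If every attempt throws, an IllegalStateException wrapping the last failure is thrown.
    static <T> T withLimit(int maxAttempts, Supplier<T> task) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;   // remember the failure and try again
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }
}
```

In this crawler, the task would be the fetch-and-parse step for one page, and the caller would decide whether giving up means skipping the page or stopping the run.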

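With all the preparation done, we can happily write the output.
This part could probably be optimized with multithreading, but I'm still a rookie, so I didn't write it [really it's because multithreading garbled the output and I don't know how to fix it].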

private final static int NUMBER_OF_SENTENCES = 1080;   // lines fetched per keyword; must be a multiple of 6
private final static int ONCE_OF_WRITING = 360;        // lines per write to the file; must be a multiple of 6
public static void output(File file, String type, boolean isAppend){
   try {
       if (!file.exists()) {
           file.createNewFile();
       } else if (!isAppend){
           file.delete();
           file.createNewFile();
       }
   } catch (IOException e) {
       e.printStackTrace();
   }

   write(file, type + "\r\n");         // write the heading (the "含有X的诗句" type string), then a line break

   int count = 0;
   while(true) {
       getContents(count, type);
       count = count + 6;
       if (count % ONCE_OF_WRITING == 0) {     // flush to the file each time the line count reaches a multiple of ONCE_OF_WRITING
           write(file, TempDatas.toString());
           TempDatas.delete(0, TempDatas.length());
       }
       if (count % NUMBER_OF_SENTENCES == 0) { // stop this keyword once NUMBER_OF_SENTENCES lines have been fetched
           return;
       }
   }
}
public static void write(File file, String content){

    FileOutputStream fileOutputStream = null;
    BufferedOutputStream bufferedOutputStream = null;
    try {
        if (!file.exists()) file.createNewFile();
        fileOutputStream = new FileOutputStream(file, true);
        bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
        bufferedOutputStream.write(content.getBytes("utf-8"));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
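            // either stream is still null if opening the file failed
            if (bufferedOutputStream != null) {
                bufferedOutputStream.close();   // flushes the buffer and closes the wrapped FileOutputStream
            } else if (fileOutputStream != null) {
                fileOutputStream.close();
            }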
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
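
On the multithreading idea mentioned above: one likely cause of garbled output is several threads appending to the single shared TempDatas buffer. A possible way around it is to let each task build and return its own String, and have one thread collect the results in order. The sketch below assumes this design; the class name ParallelCrawl, the method fetchAll, and the fetcher parameter (a stand-in for the real per-keyword download) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of the original post.
final class ParallelCrawl {
    // Runs fetcher once per keyword on a fixed-size thread pool
    // and returns the results in the same order as the keywords.
    static List<String> fetchAll(List<String> keywords, Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String keyword : keywords) {
                // each task produces its own String; no buffer is shared between threads
                Callable<String> task = () -> fetcher.apply(keyword);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());   // blocks until that keyword's task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every keyword is written to its own file, the collecting thread can call write once per result without any locking.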

Write the main function, and we're done.

public static void main(String[] args) {

        ArrayList<String> typeArrayList = getTypes();
        ListIterator<String> listIterator = typeArrayList.listIterator();
        int index = 0;
        while(listIterator.hasNext()){
            typeArrayList.set(index, "含有" + listIterator.next() + "的诗句");
            index++;
        }

        System.out.println(typeArrayList);
        ArrayList<File> fileArrayList = new ArrayList<>();
        listIterator = typeArrayList.listIterator();
        index = 0;
        while (listIterator.hasNext()) {
            listIterator.next();
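            // FilePath is not defined in this post; it is assumed to be a static String holding the output directory prefix
            fileArrayList.add(new File(FilePath + typeArrayList.get(index) + ".txt"));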
            index++;
        }


        ArrayList<Thread> threads = new ArrayList<>();

        listIterator = typeArrayList.listIterator();
        index = 0;
        while (listIterator.hasNext()) {
            listIterator.next();
            output(fileArrayList.get(index), typeArrayList.get(index), false);
            index++;
        }
    }

Heaven knows how I managed to spend two days on these two-hundred-odd lines of code.
