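[JAVA] Scraping Baidu for lines of poetry that contain a keyword
I didn't feel like doing my homework over winter break, and having just finished the Java basics I was itching to build something, so here is this rookie first attempt.
For learning and discussion only; commercial use is strictly prohibited.
First we need to find out which keywords can be scraped. Before that, write a Get function that fetches the content of website.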
The snippets in this post need the following imports (the JSON classes are assumed to come from the org.json library):
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.ListIterator;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
public static String doGet(String website) {
HttpURLConnection httpURLConnection = null;
InputStream inputStream = null;
BufferedReader bufferedReader = null;
String result = null;
try {
URL url = new URL(website);
httpURLConnection = (HttpURLConnection)url.openConnection();
httpURLConnection.setRequestMethod("GET"); // use the GET request method
httpURLConnection.setConnectTimeout(3000); // connect timeout: 3 seconds
httpURLConnection.setReadTimeout(6000); // read timeout: 6 seconds
httpURLConnection.connect();
if (httpURLConnection.getResponseCode() == 200){
inputStream = httpURLConnection.getInputStream();
bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
StringBuffer stringBuffer = new StringBuffer();
String temp = null;
while ((temp = bufferedReader.readLine()) != null){
stringBuffer.append(temp);
stringBuffer.append("\r\n");
}
result = stringBuffer.toString();
}
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (null != bufferedReader){
try {
bufferedReader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if (null != inputStream){
try {
inputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
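if (null != httpURLConnection){ // still null if the URL was malformed or the connection could not be opened
httpURLConnection.disconnect();
}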
}
return result;
}
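The nested close calls in the finally block can be shortened with try-with-resources. The following is only a sketch of the reading step, not a replacement for doGet: the class name StreamText and the method readAll are made up for this example, and it takes any InputStream so it can be tried without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original code.
class StreamText {
    /** Reads the whole stream as UTF-8 text, ending every line with "\r\n" like doGet does. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the stream it wraps)
        // even when readLine throws, so no finally block is needed.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\r\n");
            }
        }
        return sb.toString();
    }
}
```

Inside doGet this could be called as StreamText.readAll(httpURLConnection.getInputStream()) once the response code has been checked.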
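Next, write the code that fetches the keywords.
The stat0, subtitle and query values in website here are arbitrary; they are not the point.
The function returns a list holding the keywords it found.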
public static ArrayList<String> getTypes() {
ArrayList<String> arrayList = new ArrayList<>();
StringBuffer website = new StringBuffer("http://opendata.baidu.com/api.php?from_mid=1&format=json&ie=utf-8&oe=utf-8&stat0=风&subtitle=含有山的诗句&query=含有山的诗句&rn=6&pn=0&resource_id=28239&cb=jQuery110208838176116082497_");
website.append(System.currentTimeMillis());
website.append("&_=");
website.append(System.currentTimeMillis());
String result = doGet(website.toString());
if (result == null) return arrayList; // doGet returns null when the request fails
result = result.substring(result.indexOf("{"), result.lastIndexOf("}") + 1);
JSONObject jsonObject = null;
JSONArray jsonArray = null;
try {
jsonObject = new JSONObject(result);
jsonArray = jsonObject.getJSONArray("data").getJSONObject(0).getJSONObject("OtherInfo").getJSONArray("stat0");
for (int i = 0; i < jsonArray.length(); i++){
arrayList.add(jsonArray.getJSONObject(i).getString("sa"));
}
} catch (JSONException e) {
e.printStackTrace();
}
return arrayList;
}
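The substring call in getTypes is there because the response is JSONP: the JSON object is wrapped in a call to the callback named by the cb parameter. Here is a small sketch of that unwrapping step with the failure cases handled. The class name Jsonp and the method unwrap are invented for this example.

```java
// Hypothetical helper, not part of the original code.
class Jsonp {
    /**
     * Returns the JSON object text inside a JSONP response such as cb({...}),
     * or null when the input is null or contains no {...} pair.
     */
    static String unwrap(String response) {
        if (response == null) {
            return null; // e.g. doGet failed and returned null
        }
        int start = response.indexOf('{');
        int end = response.lastIndexOf('}');
        if (start < 0 || end < start) {
            return null; // no JSON object in the response
        }
        return response.substring(start, end + 1);
    }
}
```

Returning null instead of throwing lets the caller decide whether to retry or give up, rather than running into a StringIndexOutOfBoundsException.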
Take a look at the web page, then at website: rn is the number of lines per page, and pn is the current page number minus 1.
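To make the role of rn and pn concrete, here is a sketch that assembles the request URL from named parameters and URL-encodes the Chinese keyword. The class QueryUrls and its methods are made up for this example. It leaves out the subtitle, query, cb and _ parameters that the real URL carries, and I have not checked whether the endpoint accepts a request without them. pn is passed through exactly as given.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original code.
class QueryUrls {
    /** Joins base + "?" + URL-encoded key=value pairs, in insertion order. */
    static String build(String base, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            sb.append(separator)
              .append(URLEncoder.encode(entry.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            separator = '&';
        }
        return sb.toString();
    }

    /** Builds a request for one page: rn lines per request, pn as the paging value. */
    static String pageUrl(String keyword, int rn, int pn)
            throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("format", "json");
        params.put("stat0", keyword); // the keyword, e.g. a single character
        params.put("rn", String.valueOf(rn));
        params.put("pn", String.valueOf(pn));
        params.put("resource_id", "28239");
        return build("http://opendata.baidu.com/api.php", params);
    }
}
```

Keeping the parameters in a map makes it obvious which ones change between requests (stat0 and pn) and which stay fixed.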
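With this key information in hand, write the code that fetches all the lines on one page.
Here I record the content in a temporary variable so it can be used directly later [definitely not because I'm lazy].
If an exception occurs, the code would jump straight to the next page without recording the current page's content. Since that bothers my perfectionism, I wrapped it in a do-while so that no page gets skipped.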
private static StringBuffer TempDatas = new StringBuffer();
public static void getContents(int page, String type) {
String result = null;
String temp = null;
boolean isWhile = false;
do {
StringBuffer website = new StringBuffer("http://opendata.baidu.com/api.php?from_mid=1&format=json&ie=utf-8&oe=utf-8&stat0=");
website.append(type.substring(type.indexOf("有") + 1, type.indexOf("的")));
website.append("&subtitle=含有山的诗句&query=含有山的诗句&rn=6&pn=");
website.append(page);
website.append("&resource_id=28239&cb=jQuery1102045186558424078416_");
website.append(System.currentTimeMillis());
website.append("&_=");
website.append(System.currentTimeMillis());
System.out.println(website);
result = doGet(website.toString());
System.out.println(result);
JSONObject jsonObject = null;
JSONArray jsonArray = null;
try {
result = result.substring(result.indexOf("{"), result.lastIndexOf("}") + 1);
jsonObject = new JSONObject(result).getJSONArray("data").getJSONObject(0);
jsonArray = jsonObject.getJSONArray("disp_data");
int count = 0;
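while (count < jsonArray.length()) { // stop at the number of entries actually returned, which may be fewer than 6
temp = jsonArray.getJSONObject(count).getString("ename"); // the line of poetry
temp += "————";
temp += jsonArray.getJSONObject(count).getJSONObject("author").getString("text"); // the author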
System.out.println(temp);
TempDatas.append(temp);
TempDatas.append("\r\n");
count++;
}
isWhile = false;
} catch (JSONException e) {
isWhile = true;
e.printStackTrace();
} catch (StringIndexOutOfBoundsException e) {
isWhile = true;
e.printStackTrace();
} catch (NullPointerException e) {
isWhile = true;
e.printStackTrace();
}
}while (isWhile);
}
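One catch with the do-while above is that it never gives up: if a page keeps failing, the loop runs forever. Below is a sketch of the same retry idea with an upper limit. The class Retry and the method withLimit are invented for this example, and it only catches unchecked exceptions (which covers NullPointerException and StringIndexOutOfBoundsException). If JSONException is a checked exception in your JSON library, it would have to be wrapped first.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the original code.
class Retry {
    /**
     * Calls task until it returns without throwing, at most maxAttempts times.
     * Rethrows the last failure when every attempt fails.
     */
    static <T> T withLimit(int maxAttempts, Supplier<T> task) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A page fetch could then be wrapped as Retry.withLimit(3, () -> fetchPage(page, type)), where fetchPage stands for a version of getContents that returns its text instead of appending to TempDatas.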
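With all the preparation done, we can happily write the output.
This part could be optimized with multithreading [probably], but I'm still a rookie so I didn't write it [really it's because multithreading garbles the output and I don't know how to fix that].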
private final static int NUMBER_OF_SENTENCES = 1080; // lines to fetch per keyword; must be a multiple of 6
private final static int ONCE_OF_WRITING = 360; // lines per write to the file; must be a multiple of 6
public static void output(File file, String type, boolean isAppend){
try {
if (!file.exists()) {
file.createNewFile();
} else if (!isAppend){
file.delete();
file.createNewFile();
}
} catch (IOException e) {
e.printStackTrace();
}
write(file, type + "\r\n"); // write the keyword heading followed by a line break
int count = 0;
while(true) {
getContents(count, type);
count = count + 6;
if (count % ONCE_OF_WRITING == 0) { // flush to the file whenever the line count is a multiple of ONCE_OF_WRITING
write(file, TempDatas.toString());
TempDatas.delete(0, TempDatas.length());
}
if (count % NUMBER_OF_SENTENCES == 0) { // stop this keyword once NUMBER_OF_SENTENCES is reached
return;
}
}
}
public static void write(File file, String content){
FileOutputStream fileOutputStream = null;
BufferedOutputStream bufferedOutputStream = null;
try {
if (!file.exists()) file.createNewFile();
fileOutputStream = new FileOutputStream(file, true);
bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
bufferedOutputStream.write(content.getBytes("utf-8"));
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
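if (null != bufferedOutputStream) bufferedOutputStream.close(); // flushes, and also closes the wrapped fileOutputStream
else if (null != fileOutputStream) fileOutputStream.close(); // either may be null if opening the file failed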
} catch (IOException e) {
e.printStackTrace();
}
}
}
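For comparison, the append-to-file step can also be written with java.nio.file.Files, which opens, writes and closes in one call. This is a sketch under my own naming (FileAppender and append are not from the original code), and it takes a Path rather than a File.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original code.
class FileAppender {
    /** Appends content to path as UTF-8, creating the file first if it does not exist. */
    static void append(Path path, String content) throws IOException {
        Files.write(path,
                content.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,   // create the file when missing
                StandardOpenOption.APPEND);  // otherwise add to the end
    }
}
```

An existing File can be passed as file.toPath(). The exception is left to the caller here instead of being printed, so a failed write does not go unnoticed.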
Write the main function and we're done.
public static void main(String[] args) {
ArrayList<String> typeArrayList = getTypes();
ListIterator<String> listIterator = typeArrayList.listIterator();
int index = 0;
while(listIterator.hasNext()){
typeArrayList.set(index, "含有" + listIterator.next() + "的诗句");
index++;
}
System.out.println(typeArrayList);
ArrayList<File> fileArrayList = new ArrayList<>();
listIterator = typeArrayList.listIterator();
index = 0;
while (listIterator.hasNext()) {
listIterator.next();
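// FilePath is not defined in the code shown in this post; it is assumed to be a String constant holding the output directory
fileArrayList.add(new File(FilePath + typeArrayList.get(index) + ".txt"));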
index++;
}
ArrayList<Thread> threads = new ArrayList<>();
listIterator = typeArrayList.listIterator();
index = 0;
while (listIterator.hasNext()) {
listIterator.next();
output(fileArrayList.get(index), typeArrayList.get(index), false);
index++;
}
}
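About the garbled output with multiple threads: my guess (an assumption, I have not verified it) is that all threads would append to the single static TempDatas buffer, so their lines get mixed together. One way around that is to give every task its own result and collect the results in order afterwards. Here is a sketch with a fixed thread pool. ParallelFetch and fetchAll are names invented for this example, and the fetch function stands in for whatever downloads the text for one keyword.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of the original code.
class ParallelFetch {
    /**
     * Runs fetch once per keyword on a fixed thread pool and returns the results
     * in the same order as the keywords. Each task produces its own String,
     * so no buffer is shared between threads.
     */
    static List<String> fetchAll(List<String> keywords,
                                 Function<String, String> fetch,
                                 int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String keyword : keywords) {
                Callable<String> task = () -> fetch.apply(keyword);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that keyword is done
            }
            return results;
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }
}
```

The main thread can then write each result to its own file one after another, so the file writes themselves never overlap.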
Goodness knows how I managed to spend two days on these two hundred-odd lines of code.