Android Web Scraping with Retrofit
In this tutorial, we’ll be implementing Web Scraping in our Android Application. We will be scraping Journaldev.com to get all the words listed on the home page. We’ll be using the Retrofit library to read web pages.
Android Retrofit Converters
We've covered Retrofit extensively in the following tutorials:
- Retrofit Basics
- Retrofit And RxJava
- Retrofit Offline Caching
- Retrofit Calling In Intervals
- Retrofit Downloading Files
- Retrofit MVP Dagger RxJava
- Retrofit Downloading And Showing Progress in Notifications
Most of the time, we have used Gson to serialise/deserialise JSON responses.
For this, we've added a Gson converter (GsonConverterFactory) to our Retrofit Builder.
There can be instances when you just need plain text as the response body from the network call.
In such cases, instead of the Gson converter, we need to use the Scalars Converter.
To use Scalar Converters, add the following dependency alongside the Retrofit and OkHttp dependencies in build.gradle:
implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'
To add Scalar Converters to the Retrofit Builder, do the following:
Retrofit retrofit = new Retrofit.Builder()
        .addConverterFactory(ScalarsConverterFactory.create())
        .baseUrl("BASE URL")
        .client(okHttpClient)
        .build();
We can add multiple converters to the builder as well. The order is important, because Retrofit chooses the first compatible converter.
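To see why the order matters, here is a toy model in plain Java of "first compatible converter wins". It is only an illustration: the Factory class, the SCALARS and JSON constants, and the pick method are hypothetical names invented for this sketch and are not part of Retrofit. It assumes the scalar-style converter accepts only simple types such as String, while the JSON-style converter accepts any type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class ConverterOrderSketch {

    /** Hypothetical stand-in for a converter factory (not a Retrofit type). */
    static final class Factory {
        final String name;
        final Predicate<Class<?>> handles;

        Factory(String name, Predicate<Class<?>> handles) {
            this.name = name;
            this.handles = handles;
        }
    }

    // Assumption: a scalar-style converter only accepts a few simple types.
    static final Factory SCALARS =
            new Factory("scalars", type -> type == String.class || type == Integer.class);

    // Assumption: a JSON-style converter claims it can handle every type.
    static final Factory JSON = new Factory("json", type -> true);

    /** Returns the name of the first factory, in registration order, that accepts the type. */
    static String pick(List<Factory> factories, Class<?> type) {
        for (Factory factory : factories) {
            if (factory.handles.test(type)) {
                return factory.name;
            }
        }
        throw new IllegalArgumentException("No converter for " + type);
    }

    public static void main(String[] args) {
        // Scalars registered first: a String response is handled as plain text.
        System.out.println(pick(Arrays.asList(SCALARS, JSON), String.class));
        // JSON registered first: it claims the String before scalars is ever asked.
        System.out.println(pick(Arrays.asList(JSON, SCALARS), String.class));
    }
}
```

In this model, registering the catch-all converter first means the narrower one is never consulted, which is why a scalars converter is normally added before a JSON converter.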
Scalar Converters can use the RequestBody and ResponseBody classes from OkHttp as the type.
ResponseBody allows receiving any type of response data using response.body() in the enqueue method.
The only disadvantage: You need to handle the RequestBody object creation yourself.
In the following section, we'll use the Scalars Converter to fetch the website passed in the Retrofit request as a plain string. We'll extract all the words from the text and show the count of each word in a RecyclerView.
Also, we'll add a filter function that orders the words by their count. We'll use a HashMap to store the word/count pairs and sort it by value.
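A plain HashMap does not keep any particular order, so a sorted result has to be copied into an order-preserving map such as LinkedHashMap. The sketch below shows one way to do that with Java streams. It is an alternative illustration, not the code used in this project (MainActivity uses a Comparator-based version); the class and method names are made up for the example, and on Android the stream API assumes API level 24+ or library desugaring.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SortByCountSketch {

    /** Returns a new map whose iteration order is highest count first. */
    static Map<String, Integer> sortByCountDesc(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                // Compare entries by value, largest first.
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                // Collect into a LinkedHashMap so the sorted order is preserved.
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first,
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("java", 1);
        counts.put("android", 5);
        counts.put("retrofit", 3);

        // Iterates in descending order of count.
        sortByCountDesc(counts).forEach((word, count) ->
                System.out.println(word + " = " + count));
    }
}
```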
Project Structure
The dependencies in the build.gradle are:
implementation 'com.squareup.retrofit2:retrofit:2.4.0'
implementation 'com.squareup.okhttp3:logging-interceptor:3.9.1'
implementation 'com.android.support:design:28.0.0'
implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'
implementation 'org.jsoup:jsoup:1.10.1'
Code
The code for activity_main.xml is defined below:
<?xml version="1.0" encoding="utf-8"?>
<android.support.constraint.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent">

    <android.support.v7.widget.RecyclerView
        android:id="@+id/wordList"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical"
        app:layoutManager="android.support.v7.widget.LinearLayoutManager"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintLeft_toLeftOf="parent"
        app:layout_constraintRight_toRightOf="parent"
        app:layout_constraintTop_toTopOf="parent" />

    <android.support.design.widget.FloatingActionButton
        android:id="@+id/fab"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_margin="16dp"
        android:src="@drawable/ic_filter_list"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintLeft_toLeftOf="parent"
        app:layout_constraintRight_toRightOf="parent" />

</android.support.constraint.ConstraintLayout>
The code for the ApiService.java class is given below:
package com.journaldev.androidwebscrapingretrofit;

import retrofit2.Call;
import retrofit2.http.GET;

public interface ApiService {

    // "." means no extra path, so the request goes to the base URL itself.
    @GET(".")
    Call<String> getStringResponse();
}
"." is used to specify no path. Thus only the base URL is used.
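Relative paths in the annotation are resolved against the base URL much like links on a web page. As a rough illustration only, the sketch below uses the JDK's java.net.URI rather than Retrofit's own resolver, so it is an analogy under that assumption; the class and method names are made up for the example.

```java
import java.net.URI;

public class BaseUrlSketch {

    /** Resolves a relative path against a base URL using the JDK's URI rules. */
    static String resolve(String baseUrl, String relativePath) {
        return URI.create(baseUrl).resolve(relativePath).toString();
    }

    public static void main(String[] args) {
        String base = "https://www.journaldev.com/";
        // "." resolves to the base URL itself.
        System.out.println(resolve(base, "."));
        // A relative segment is appended after the trailing slash of the base URL.
        System.out.println(resolve(base, "android"));
    }
}
```

Note that the base URL ends with a trailing slash; without it, relative resolution can produce unexpected results.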
The code for the MainActivity.java is given below:
package com.journaldev.androidwebscrapingretrofit;

import android.support.design.widget.FloatingActionButton;
import android.support.v7.app.AppCompatActivity;
import android.os.Bundle;
import android.support.v7.widget.RecyclerView;
import android.view.View;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;

import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import okhttp3.OkHttpClient;
import okhttp3.logging.HttpLoggingInterceptor;
import retrofit2.Call;
import retrofit2.Callback;
import retrofit2.Response;
import retrofit2.Retrofit;
import retrofit2.converter.scalars.ScalarsConverterFactory;

public class MainActivity extends AppCompatActivity {

    RecyclerView recyclerView;
    FloatingActionButton fab;
    HashMap<String, Integer> occurrences = new HashMap<>();
    WordsAdapter wordsAdapter;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        recyclerView = findViewById(R.id.wordList);
        fab = findViewById(R.id.fab);

        // Log full request/response bodies while developing.
        OkHttpClient okHttpClient = new OkHttpClient().newBuilder()
                .addInterceptor(new HttpLoggingInterceptor().setLevel(HttpLoggingInterceptor.Level.BODY))
                .build();

        // The Scalars converter returns the response body as a plain String.
        Retrofit retrofit = new Retrofit.Builder()
                .addConverterFactory(ScalarsConverterFactory.create())
                .baseUrl("https://www.journaldev.com/")
                .client(okHttpClient)
                .build();

        final ApiService apiService = retrofit.create(ApiService.class);
        Call<String> stringCall = apiService.getStringResponse();

        stringCall.enqueue(new Callback<String>() {
            @Override
            public void onResponse(Call<String> call, Response<String> response) {
                if (response.isSuccessful()) {
                    String responseString = response.body();
                    // Strip the HTML markup and keep only the visible text.
                    Document doc = Jsoup.parse(responseString);
                    responseString = doc.text();
                    createHashMap(responseString);
                }
            }

            @Override
            public void onFailure(Call<String> call, Throwable t) {
                // The request failed; handle or log the error here.
            }
        });

        // Tapping the FAB re-displays the words sorted by count, highest first.
        fab.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View view) {
                occurrences = sortByValueDesc(occurrences);
                wordsAdapter = new WordsAdapter(MainActivity.this, occurrences);
                recyclerView.setAdapter(wordsAdapter);
            }
        });
    }

    private void createHashMap(String responseString) {
        // Replace every non-alphanumeric character with a space, then split on runs of spaces.
        responseString = responseString.replaceAll("[^a-zA-Z0-9]", " ");
        String[] splitWords = responseString.trim().split(" +");

        for (String word : splitWords) {
            // Skip empty tokens and purely numeric tokens.
            if (word.isEmpty() || StringUtil.isNumeric(word)) {
                continue;
            }
            Integer oldCount = occurrences.get(word);
            if (oldCount == null) {
                oldCount = 0;
            }
            occurrences.put(word, oldCount + 1);
        }

        wordsAdapter = new WordsAdapter(this, occurrences);
        recyclerView.setAdapter(wordsAdapter);
    }

    public static HashMap<String, Integer> sortByValueDesc(Map<String, Integer> map) {
        List<Map.Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        // LinkedHashMap keeps the sorted insertion order.
        HashMap<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : list) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
The following code parses the HTML response and extracts its plain text:
Document doc = Jsoup.parse(responseString);
responseString = doc.text();
Inside createHashMap, we replace all special characters with spaces and leave purely numeric tokens out of the HashMap.
sortByValueDesc uses a Comparator to compare the values and returns the entries in descending order of count, stored in a LinkedHashMap so that the order is preserved.
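The counting step can be tried out on its own, without Android or Jsoup. The following is a minimal sketch under a few assumptions: the class and method names are made up for the example, a simple digit-only regex stands in for Jsoup's StringUtil.isNumeric, and Map.merge is used for brevity (on Android it assumes API level 24+ or library desugaring).

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    /** Counts how often each non-numeric word occurs in the given text. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();

        // Replace anything that is not a letter or digit with a space,
        // then split on runs of whitespace.
        String[] tokens = text.replaceAll("[^a-zA-Z0-9]", " ").trim().split("\\s+");

        for (String token : tokens) {
            // Skip empty tokens and tokens made up only of digits.
            if (token.isEmpty() || token.matches("\\d+")) {
                continue;
            }
            // Insert 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("Retrofit, Retrofit! and 2018 Android");
        System.out.println(counts);
    }
}
```

As in the tutorial's version, the count is case-sensitive, so "Android" and "android" are treated as different words.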
The code for list_item_words.xml, which contains the layout for the RecyclerView rows, is given below:
<?xml version="1.0" encoding="utf-8"?>
<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="wrap_content"
    android:background="?attr/selectableItemBackground"
    android:padding="24dp">

    <TextView
        android:id="@+id/txtWord"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_alignParentStart="true"
        android:layout_centerVertical="true" />

    <TextView
        android:id="@+id/txtCount"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_alignParentEnd="true"
        android:layout_centerVertical="true" />

</RelativeLayout>
The code for the WordsAdapter.java class is given below:
package com.journaldev.androidwebscrapingretrofit;

import android.content.Context;
import android.support.annotation.NonNull;
import android.support.v7.widget.RecyclerView;
import android.view.LayoutInflater;
import android.view.View;
import android.view.ViewGroup;
import android.widget.TextView;

import java.util.HashMap;

public class WordsAdapter extends RecyclerView.Adapter<WordsAdapter.WordsHolder> {

    HashMap<String, Integer> modelList;
    Context mContext;
    // Snapshot of the map's keys, so that rows can be looked up by position.
    private String[] mKeys;

    class WordsHolder extends RecyclerView.ViewHolder {

        TextView txtWord;
        TextView txtCount;

        public WordsHolder(View itemView) {
            super(itemView);
            txtWord = itemView.findViewById(R.id.txtWord);
            txtCount = itemView.findViewById(R.id.txtCount);
        }
    }

    public WordsAdapter(Context context, HashMap<String, Integer> modelList) {
        this.modelList = modelList;
        mContext = context;
        mKeys = modelList.keySet().toArray(new String[modelList.size()]);
    }

    @NonNull
    @Override
    public WordsHolder onCreateViewHolder(@NonNull ViewGroup parent, int viewType) {
        View view = LayoutInflater.from(mContext).inflate(R.layout.list_item_words, parent, false);
        return new WordsHolder(view);
    }

    @Override
    public void onBindViewHolder(@NonNull WordsHolder holder, int position) {
        holder.txtWord.setText(mKeys[position]);
        holder.txtCount.setText(String.valueOf(modelList.get(mKeys[position])));
    }

    @Override
    public int getItemCount() {
        return modelList.size();
    }
}
The output of the above application is given below:
The output above shows every word present on the JournalDev home page at the time of writing this tutorial, along with its frequency.
This brings an end to this tutorial. You can download the project from the link below:
Translated from: https://www.journaldev.com/23448/android-web-scraping-with-retrofit