Android Web Scraping with Retrofit

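This article shows how to implement web scraping in an Android app with the Retrofit library: it fetches all the text on the JournalDev.com home page, parses the HTML with the Jsoup library, counts how often each word occurs, and displays the results in a RecyclerView.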

In this tutorial, we’ll be implementing Web Scraping in our Android Application. We will be scraping Journaldev.com to get all the words listed on the home page. We’ll be using the Retrofit library to read web pages.


Android Retrofit Converters

We have covered Retrofit at length in earlier tutorials.

Most of the time, we have used Gson to serialize and deserialize JSON responses.
For this, we added a GsonConverterFactory to the Retrofit builder.

Sometimes you only need plain text as the response body of a network call.
In such cases, we use the Scalars converter instead of the Gson converter.

To use the Scalars converter, add the following dependency to build.gradle alongside the Retrofit and OkHttp dependencies:

implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'

To add the Scalars converter to the Retrofit builder, do the following:

Retrofit retrofit = new Retrofit.Builder()
                .addConverterFactory(ScalarsConverterFactory.create())
                .baseUrl("BASE URL")
                .client(okHttpClient).build();

We can add multiple converters to the builder as well. The order matters, because Retrofit picks the first converter that can handle the requested type.

As an alternative to the Scalars converter, you can use the RequestBody and ResponseBody classes from OkHttp as the types.

ResponseBody lets you receive any type of response data through response.body() in the enqueue callback.

The only disadvantage: you need to handle the RequestBody object creation yourself.

To parse the HTML, we’ll use the Jsoup library.

In the following section, we’ll use the Scalars converter to receive the page requested through Retrofit as a plain string. We’ll extract all the words from the text and show each word with its count in a RecyclerView.

We’ll also add a filter button that sorts the words by count. We’ll use a HashMap to store the word/count pairs and sort it by value.
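The counting and sorting idea can be tried outside Android first. The following is a minimal sketch, not the tutorial's own code: the class name WordCountSketch, its method names, and the sample words are assumptions chosen for illustration. It counts words in a HashMap and then copies the entries into a LinkedHashMap ordered by count, highest first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, used only to illustrate the idea.
public class WordCountSketch {

    // Counts how often each word occurs in the given array.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Copies the entries into a LinkedHashMap ordered by count, highest first.
    // A plain HashMap has no defined iteration order, so the sorted result
    // must live in an order-preserving map.
    static LinkedHashMap<String, Integer> sortByCountDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        LinkedHashMap<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] words = {"java", "android", "java", "kotlin", "java", "android"};
        System.out.println(sortByCountDesc(countWords(words)));
    }
}
```

Words with equal counts keep whatever relative order the HashMap happened to yield, so ties are not ordered alphabetically in this sketch.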


Project Structure

The dependencies in build.gradle are:

implementation 'com.squareup.retrofit2:retrofit:2.4.0'
implementation 'com.squareup.okhttp3:logging-interceptor:3.9.1'
implementation 'com.android.support:design:28.0.0'
implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'
implementation 'org.jsoup:jsoup:1.10.1'

Code

The code for the activity_main.xml is defined below:


<?xml version="1.0" encoding="utf-8"?>
<android.support.constraint.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent">


    <android.support.v7.widget.RecyclerView
        android:id="@+id/wordList"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical"
        app:layoutManager="android.support.v7.widget.LinearLayoutManager"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintLeft_toLeftOf="parent"
        app:layout_constraintRight_toRightOf="parent"
        app:layout_constraintTop_toTopOf="parent" />


    <android.support.design.widget.FloatingActionButton
        android:id="@+id/fab"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_margin="16dp"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintLeft_toLeftOf="parent"
        android:src="@drawable/ic_filter_list"
        app:layout_constraintRight_toRightOf="parent" />

</android.support.constraint.ConstraintLayout>

The code for the ApiService.java class is given below:


package com.journaldev.androidwebscrapingretrofit;

import retrofit2.Call;
import retrofit2.http.GET;

public interface ApiService {


    @GET(".")
    Call<String> getStringResponse();
}
The "." in @GET(".") specifies an empty relative path, so only the base URL is requested.
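How a "." reference collapses onto the base URL can be illustrated with the JDK alone. This is only an analogy, under the assumption that the same relative-reference rules apply: Retrofit resolves paths through OkHttp's HttpUrl, not through java.net.URI, and the class name BaseUrlSketch and the "tutorials" path are hypothetical.

```java
import java.net.URI;

// Hypothetical demo class; uses java.net.URI as a stand-in for the
// URL resolution that Retrofit performs internally.
public class BaseUrlSketch {

    // Resolves a relative reference against a base URL.
    static String resolve(String baseUrl, String relativePath) {
        return URI.create(baseUrl).resolve(relativePath).toString();
    }

    public static void main(String[] args) {
        // "." refers to the current directory of the base, i.e. the base itself.
        System.out.println(resolve("https://www.journaldev.com/", "."));
        // A non-empty relative path is appended to the base.
        System.out.println(resolve("https://www.journaldev.com/", "tutorials"));
    }
}
```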

The code for the MainActivity.java is given below:


package com.journaldev.androidwebscrapingretrofit;

import android.support.design.widget.FloatingActionButton;
import android.support.v7.app.AppCompatActivity;
import android.os.Bundle;
import android.support.v7.widget.RecyclerView;
import android.view.View;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;

import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import okhttp3.OkHttpClient;
import okhttp3.logging.HttpLoggingInterceptor;
import retrofit2.Call;
import retrofit2.Callback;
import retrofit2.Response;
import retrofit2.Retrofit;
import retrofit2.converter.scalars.ScalarsConverterFactory;

public class MainActivity extends AppCompatActivity {


    RecyclerView recyclerView;
    FloatingActionButton fab;
    HashMap<String, Integer> occurrences = new HashMap<>();
    WordsAdapter wordsAdapter;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        recyclerView = findViewById(R.id.wordList);

        fab = findViewById(R.id.fab);

        OkHttpClient okHttpClient = new OkHttpClient().newBuilder().addInterceptor(new HttpLoggingInterceptor().setLevel(HttpLoggingInterceptor.Level.BODY))
                .build();

        Retrofit retrofit = new Retrofit.Builder()
                .addConverterFactory(ScalarsConverterFactory.create())
                .baseUrl("https://www.journaldev.com/")
                .client(okHttpClient).build();


        final ApiService apiService = retrofit.create(ApiService.class);


        Call<String> stringCall = apiService.getStringResponse();
        stringCall.enqueue(new Callback<String>() {
            @Override
            public void onResponse(Call<String> call, Response<String> response) {
                if (response.isSuccessful()) {

                    String responseString = response.body();
                    Document doc = Jsoup.parse(responseString);
                    responseString = doc.text();
                    createHashMap(responseString);
                }

            }

            @Override
            public void onFailure(Call<String> call, Throwable t) {
                // Surface network errors instead of silently ignoring them.
                t.printStackTrace();
            }
        });

        fab.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View view) {


                occurrences = sortByValueDesc(occurrences);

                wordsAdapter = new WordsAdapter(MainActivity.this, occurrences);
                recyclerView.setAdapter(wordsAdapter);


            }
        });

    }

    private void createHashMap(String responseString) {


        responseString = responseString.replaceAll("[^a-zA-Z0-9]", " ");

        String[] splitWords = responseString.split(" +");

        for (String word : splitWords) {

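            // Skip the empty token that split() yields for leading spaces, and skip pure numbers.
            if (word.isEmpty() || StringUtil.isNumeric(word)) {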
                continue;
            }

            Integer oldCount = occurrences.get(word);
            if (oldCount == null) {
                oldCount = 0;
            }
            occurrences.put(word, oldCount + 1);
        }

        wordsAdapter = new WordsAdapter(this, occurrences);
        recyclerView.setAdapter(wordsAdapter);
    }

    public static HashMap<String, Integer> sortByValueDesc(Map<String, Integer> map) {
        List<Map.Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        HashMap<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : list) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }


}

The following lines parse the HTML and extract its plain text:

Document doc = Jsoup.parse(responseString);
responseString = doc.text();

Inside createHashMap, we replace every special character with a space and leave purely numeric tokens out of the map.
sortByValueDesc uses a Comparator to order the entries by value, highest first, and returns them in a LinkedHashMap so that the order is preserved.
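The cleaning step can be checked on its own with a short standalone sketch. The class name TokenizeSketch and the sample sentence are assumptions for illustration, and the regular expression [0-9]+ stands in for Jsoup's StringUtil.isNumeric so that the example needs no external library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class mirroring the cleaning rules described above.
public class TokenizeSketch {

    // Replaces every non-alphanumeric character with a space, splits on runs
    // of spaces, and drops empty tokens and tokens made only of digits.
    static List<String> tokenize(String text) {
        String cleaned = text.replaceAll("[^a-zA-Z0-9]", " ");
        List<String> words = new ArrayList<>();
        for (String token : cleaned.split(" +")) {
            if (token.isEmpty() || token.matches("[0-9]+")) {
                continue;
            }
            words.add(token);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Java 8: Streams, Lambdas & more!"));
    }
}
```

Because the pattern keeps only ASCII letters and digits, words containing accented or non-Latin characters are split apart; that matches the behaviour of the tutorial's regular expression rather than fixing it.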


The code for the list_item_words.xml which contains the layout for RecyclerView rows is given below:


<?xml version="1.0" encoding="utf-8"?>
<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="wrap_content"
    android:background="?attr/selectableItemBackground"
    android:padding="24dp">


    <TextView
        android:id="@+id/txtWord"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_alignParentStart="true"
        android:layout_centerVertical="true" />

    <TextView
        android:id="@+id/txtCount"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_alignParentEnd="true"
        android:layout_centerVertical="true" />

</RelativeLayout>

The code for the WordsAdapter.java class is given below:


package com.journaldev.androidwebscrapingretrofit;

import android.content.Context;
import android.support.annotation.NonNull;
import android.support.v7.widget.RecyclerView;
import android.view.LayoutInflater;
import android.view.View;
import android.view.ViewGroup;
import android.widget.TextView;

import java.util.HashMap;

public class WordsAdapter extends RecyclerView.Adapter<WordsAdapter.WordsHolder> {

    HashMap<String, Integer> modelList;
    Context mContext;
    private String[] mKeys;

    class WordsHolder extends RecyclerView.ViewHolder {


        TextView txtWord;
        TextView txtCount;

        public WordsHolder(View itemView) {
            super(itemView);


            txtWord = itemView.findViewById(R.id.txtWord);
            txtCount = itemView.findViewById(R.id.txtCount);
        }
    }

    public WordsAdapter(Context context, HashMap<String, Integer> modelList) {
        this.modelList = modelList;
        mContext = context;
        mKeys = modelList.keySet().toArray(new String[modelList.size()]);
    }

    @NonNull
    @Override
    public WordsHolder onCreateViewHolder(@NonNull ViewGroup parent, int viewType) {
        View view = LayoutInflater.from(mContext).inflate(R.layout.list_item_words, parent, false);
        return new WordsHolder(view);
    }

    @Override
    public void onBindViewHolder(@NonNull WordsHolder holder, int position) {
        holder.txtWord.setText(mKeys[position]);
        holder.txtCount.setText(String.valueOf(modelList.get(mKeys[position])));
    }

    @Override
    public int getItemCount() {
        return modelList.size();
    }


}

The output of the above application is given below:


So the above output shows all words present on the home page of JournalDev at the time of writing this tutorial with their frequency.


This brings an end to this tutorial. You can download the project from the link below:


Translated from: https://www.journaldev.com/23448/android-web-scraping-with-retrofit
