Getting Started with Web Crawling

1. Task description

Fetch company staff information from a given website and write the data to a JSON file.

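2. Key techniques

  • Use Jsoup to select elements from web pages

  • Use Alibaba's fastjson to read and write the JSON file

  • Use recursion to record company, department, and employee information level by level

  • Use reflection to query the objects loaded from the JSON file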

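3. Implementation approach

(1) Organization structure classes

Create Company, Department, and Employee classes to hold the information at each level of the hierarchy. Company and Department implement the GetBranches interface, which fetches the branches of the current object; for Company, that means the Departments that belong to it. The Company class is shown here as an example: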

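import com.alibaba.fastjson.annotation.JSONField;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Company implements GetBranches<Department> {
    private String name;
    private String id;
    private List<Department> departments;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public List<Department> getDepartments() {
        return departments;
    }

    public void setDepartments(List<Department> departments) {
        this.departments = departments;
    }

    // Uniform accessor for the next level down; GetInfo looks it up by name via reflection.
    // Excluded from JSON so the departments are not written a second time under "branch".
    @JSONField(serialize = false)
    public List<Department> getBranch() {
        return this.departments;
    }

    // Fetches the department list page at url and builds one Department per table row.
    @Override
    public List<Department> getBranches(String url) throws IOException {
        List<Department> departmentList = new ArrayList<>();
        Document document = Jsoup.connect(url).get();
        Elements elements = document.getElementsByClass("xwtable").select("tr");
        // Start at 1 to skip the first row (presumably the table header).
        for (int i = 1; i < elements.size(); i++) {
            Element element = elements.get(i);
            Department department = new Department();
            department.setName(element.select("td").get(1).text());
            department.setId(element.select("span").text());
            // "abs:href" resolves a relative link against the page URL, so the result
            // can be passed straight to Jsoup.connect.
            String employeeUrl = element.select("td").get(0).select("a").attr("abs:href");
            // Descend one level: the department loads its own employees.
            department.setEmployees(department.getBranches(employeeUrl));
            departmentList.add(department);
        }
        return departmentList;
    }

    @Override
    public String toString() {
        return "Company{" +
                "id='" + id + '\'' +
                ", name='" + name + '\'' +
                ", departments=" + departments +
                '}';
    }
}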
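
The GetBranches interface and the lower levels are not listed in this post. The following is a minimal sketch of how the pieces might fit together, with the interface signature inferred from the method Company overrides. To keep it runnable without a network or Jsoup, a hypothetical FakeSite map stands in for the web pages, and the *Sketch classes, their fields, and all sample rows are illustrative assumptions rather than the original code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Assumed shape of the interface, inferred from how Company overrides it. */
interface GetBranches<T> {
    List<T> getBranches(String url) throws IOException;
}

/** Stand-in for the website: maps a URL to rows of {id, name, childUrl}. Hypothetical data. */
final class FakeSite {
    static final Map<String, String[][]> PAGES = new HashMap<>();
    static {
        PAGES.put("/company/01", new String[][] {
            {"0101", "Engineering", "/department/0101"},
            {"0102", "Sales", "/department/0102"}});
        PAGES.put("/department/0101", new String[][] {
            {"010101", "Alice", ""}, {"010102", "Bob", ""}});
        PAGES.put("/department/0102", new String[][] {
            {"010201", "Carol", ""}});
    }

    static String[][] rows(String url) throws IOException {
        String[][] rows = PAGES.get(url);
        if (rows == null) {
            throw new IOException("No such page: " + url);
        }
        return rows;
    }
}

/** Leaf level: an employee has no branches, so it does not implement GetBranches. */
final class EmployeeSketch {
    final String id;
    final String name;

    EmployeeSketch(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

/** Middle level: each row of a department page becomes one employee. */
final class DepartmentSketch implements GetBranches<EmployeeSketch> {
    final String id;
    final String name;
    List<EmployeeSketch> employees = new ArrayList<>();

    DepartmentSketch(String id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public List<EmployeeSketch> getBranches(String url) throws IOException {
        List<EmployeeSketch> result = new ArrayList<>();
        for (String[] row : FakeSite.rows(url)) {
            result.add(new EmployeeSketch(row[0], row[1]));
        }
        return result;
    }
}

/** Top level: each row becomes a department, which then loads its own page. */
final class CompanySketch implements GetBranches<DepartmentSketch> {
    @Override
    public List<DepartmentSketch> getBranches(String url) throws IOException {
        List<DepartmentSketch> result = new ArrayList<>();
        for (String[] row : FakeSite.rows(url)) {
            DepartmentSketch department = new DepartmentSketch(row[0], row[1]);
            // Descend one level: the department fills itself from the linked page.
            department.employees = department.getBranches(row[2]);
            result.add(department);
        }
        return result;
    }
}
```

Calling new CompanySketch().getBranches("/company/01") walks the company page and then each linked department page, which is the same level-by-level descent that Company.getBranches performs against the real site.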

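(2) Utility class for writing and reading data

The getDocFromUrl method writes the data fetched from the web pages to a JSON file. The getInfoFromJson method searches the organization structure objects loaded from that JSON file and prints the record matching the ID the user asks for. The code is as follows: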

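import com.alibaba.fastjson.JSON;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class GetInfo {
    // Crawls the company list page at url, follows each company down to its employees,
    // and writes the whole hierarchy to file as one JSON array.
    public static void getDocFromUrl(String url, File file) throws IOException {
        Document document = Jsoup.connect(url).get();
        Elements elements = document.getElementsByClass("xwtable").select("tr");
        List<Company> companyList = new ArrayList<>();
        // Start at 1 to skip the first row (presumably the table header).
        for (int i = 1; i < elements.size(); i++) {
            Company company = new Company();
            Element element = elements.get(i);
            company.setName(element.select("td").get(1).text());
            company.setId(element.select("span").text());
            // "abs:href" resolves a relative link against the page URL.
            String departmentUrl = element.select("td").get(0).select("a").attr("abs:href");
            company.setDepartments(company.getBranches(departmentUrl));
            companyList.add(company);
        }
        // try-with-resources closes the stream even if the write fails;
        // an explicit UTF-8 encoding keeps non-ASCII names intact on any platform.
        try (FileOutputStream fileOutputStream = new FileOutputStream(file, false)) {
            fileOutputStream.write(JSON.toJSONString(companyList).getBytes(StandardCharsets.UTF_8));
        }
    }

    // Prints the object in list whose id equals str. If an id is only a prefix of str,
    // the search continues in that object's branch (found via its getBranch method).
    public static void getInfoFromJson(List<?> list, String str) throws Exception {
        // An empty or missing level cannot contain the id.
        if (list == null || list.isEmpty()) {
            System.out.println("Wrong id!");
            return;
        }
        Class<?> type = list.get(0).getClass();
        Field field = type.getDeclaredField("id");
        Method method = type.getDeclaredMethod("toString");
        field.setAccessible(true);
        boolean flag = false;
        for (Object obj : list) {
            String id = (String) field.get(obj);
            if (str.equals(id)) {
                System.out.println(method.invoke(obj));
                flag = true;
                break;
            } else if (id != null && str.startsWith(id)) {
                Method getBranch = obj.getClass().getDeclaredMethod("getBranch");
                List<?> branch = (List<?>) getBranch.invoke(obj);
                getInfoFromJson(branch, str);
                flag = true;
                break;
            }
        }
        if (!flag) {
            System.out.println("Wrong id!");
        }
    }
}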
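
The lookup relies on IDs being hierarchical, so that a parent's ID is a prefix of each child's ID. The following is a minimal, dependency-free sketch of that reflective prefix search. The Node and PrefixLookup classes and the sample IDs are hypothetical, and the method returns the match instead of printing it so the behavior is easy to check.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

/** Hypothetical node type standing in for Company, Department, and Employee. */
final class Node {
    private final String id;
    private final List<Node> children;

    Node(String id, Node... children) {
        this.id = id;
        this.children = Arrays.asList(children);
    }

    public List<Node> getBranch() {
        return children;
    }

    @Override
    public String toString() {
        return "Node{id='" + id + "'}";
    }
}

final class PrefixLookup {
    /** Returns toString() of the object whose id equals query, or null if none matches. */
    static String find(List<?> level, String query) throws ReflectiveOperationException {
        for (Object obj : level) {
            // Read the private "id" field without knowing the concrete class.
            Field idField = obj.getClass().getDeclaredField("id");
            idField.setAccessible(true);
            String id = (String) idField.get(obj);
            if (query.equals(id)) {
                return obj.toString();
            }
            if (id != null && query.startsWith(id)) {
                // The id is a proper prefix: continue the search one level down.
                Method getBranch = obj.getClass().getDeclaredMethod("getBranch");
                return find((List<?>) getBranch.invoke(obj), query);
            }
        }
        return null;
    }
}
```

Like getInfoFromJson, this sketch descends only into the first object whose ID is a prefix of the query, so it assumes that no two siblings have IDs where one is a prefix of the other.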

(3) Main class

Sets the URL of the website and handles user input: each line entered is looked up as an ID, and -1 exits.

import com.alibaba.fastjson.JSON;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;
import java.util.Scanner;

public class DoWorm {
    public static void main(String[] args) throws Exception {
        File file = new File("./result.json");
        // Crawl only when no cached result exists yet.
        if (!file.exists()) {
            String url = "http://localhost:8080/company";
            GetInfo.getDocFromUrl(url, file);
        }
        // Read the whole file in one call and decode it as UTF-8.
        String jsonText = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        List<Company> companyList = JSON.parseArray(jsonText, Company.class);
        Scanner sc = new Scanner(System.in);
        // Each input line is treated as an id to look up; "-1" or end of input ends the session.
        while (sc.hasNextLine()) {
            String str = sc.nextLine();
            if ("-1".equals(str)) {
                break;
            }
            GetInfo.getInfoFromJson(companyList, str);
        }
    }
}
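
The main class treats result.json as a cache: the site is crawled only when the file is missing, and every later run reads the saved JSON. The following sketch isolates that pattern using only the standard library. JsonCache and loadOrCreate are hypothetical names, and the producer argument stands in for the crawl.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

final class JsonCache {
    /**
     * Returns the text stored at path. If the file does not exist yet,
     * the producer is called once and its result is saved first.
     */
    static String loadOrCreate(Path path, Supplier<String> producer) throws IOException {
        if (Files.notExists(path)) {
            Files.write(path, producer.get().getBytes(StandardCharsets.UTF_8));
        }
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

One consequence of this design is that stale data is never refreshed automatically; deleting result.json forces a new crawl on the next run.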
