How to Sanitize HTML in Java

Latest recommended article published on 2024-09-22 07:00:00

allway2

Tags: java, html, junit

Article link: https://blog.csdn.net/allway2/article/details/125960896

Whenever our web application receives text that will be rendered as HTML, we must sanitize that text to avoid potential XSS attacks.

OWASP provides a good tool to help us sanitize HTML.
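To make the risk concrete, here is a minimal sketch of what happens when user text is concatenated into markup unchanged. It uses only the JDK; the class and method names (NaiveCommentRenderer, renderComment) are hypothetical and not part of the original sample project.

```java
// Hypothetical example: shows the problem only, not a recommended approach
final class NaiveCommentRenderer {

    // Builds an HTML fragment by plain string concatenation, with no sanitization
    static String renderComment(String comment) {
        return "<div class=\"comment\">" + comment + "</div>";
    }

    public static void main(String[] args) {
        // User-supplied text that carries a script element
        String malicious = "Nice post!<script>alert(1)</script>";

        // The script element reaches the page untouched
        System.out.println(renderComment(malicious));
    }
}
```

Because the script element survives intact, a browser rendering this fragment would execute it. Sanitizing the comment before rendering is what prevents that.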

Set up the project

We will use:

JUnit 5 to run our tests
AssertJ for fluent assertions
OWASP HTML Sanitizer to sanitize HTML content

Here is our Maven pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.blebail.blog.sample</groupId>
    <artifactId>java-sanitize-html</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>java-sanitize-html</name>
    <url>https://github.com/baptistelebail/samples/tree/master/java-sanitize-html</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <!-- HTML Sanitizer -->
        <dependency>
            <groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
            <artifactId>owasp-java-html-sanitizer</artifactId>
            <version>20191001.1</version>
        </dependency>

        <!-- JUnit -->
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter</artifactId>
            <version>5.6.0</version>
            <scope>test</scope>
        </dependency>

        <!-- AssertJ -->
        <dependency>
            <groupId>org.assertj</groupId>
            <artifactId>assertj-core</artifactId>
            <version>3.15.0</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.22.2</version>
            </plugin>
        </plugins>
    </build>
</project>

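Define the sanitization policies

Suppose our web application is a community site with articles and comments.

Our application will process and render HTML in three use cases:

A user submits a comment
A publisher submits an article
A user updates their profile description (for which we provide a <my-element> component that users can use)

We will therefore need three sanitization policies:

A strict policy that removes all HTML from comments
A policy that allows the common HTML elements found in articles (headings, paragraphs, images, links, ...)
A custom policy that allows only our <my-element> element

We create the SanitizationPolicy interface: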

public interface SanitizationPolicy {

    /**
     * Sanitizes the string according to the policy
     * @param input the input string to be sanitized
     * @return the sanitized string
     */
    String sanitize(String input);
}

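It simply sanitizes the input string.

OWASP HTML Sanitizer offers several ways to create a sanitization policy (which OWASP calls a PolicyFactory): we can build one by hand with HtmlPolicyBuilder, or use the prebuilt policies in Sanitizers.*, which can be combined with the and() method.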
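As a rough illustration of the two approaches, the sketch below builds one policy by hand and one from prebuilt pieces. It assumes the owasp-java-html-sanitizer dependency is on the classpath; the class name PolicyFactoryExamples and the two constant names are made up for this example.

```java
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

// Hypothetical holder class, not part of the original sample project
final class PolicyFactoryExamples {

    // Hand-built policy: only <p> and <b> are allowed, all other elements are removed
    static final PolicyFactory PARAGRAPHS_AND_BOLD = new HtmlPolicyBuilder()
            .allowElements("p", "b")
            .toFactory();

    // Prebuilt policies joined with and(): inline formatting plus links
    static final PolicyFactory FORMATTING_AND_LINKS =
            Sanitizers.FORMATTING.and(Sanitizers.LINKS);

    public static void main(String[] args) {
        String input = "<p>Hello <b>world</b></p><script>alert(1)</script>";

        // Each policy keeps only what it explicitly allows
        System.out.println(PARAGRAPHS_AND_BOLD.sanitize(input));
        System.out.println(FORMATTING_AND_LINKS.sanitize(input));
    }
}
```

Both approaches are allowlists: anything a policy does not explicitly allow is dropped, so an empty HtmlPolicyBuilder yields the strictest possible policy.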

We will implement the interface with an enum that has three values:

STRICT
ARTICLE
CUSTOM

Each one is associated with a specific PolicyFactory.

We create the HtmlSanitizationPolicy enum:

import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public enum HtmlSanitizationPolicy implements SanitizationPolicy {

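    // No elements allowed: every tag is stripped and only the text remains
    STRICT(new HtmlPolicyBuilder()
            .toFactory()),

    // Prebuilt policies combined with and(): block elements, inline formatting,
    // style attributes, images, and links
    ARTICLE(Sanitizers.BLOCKS
            .and(Sanitizers.FORMATTING)
            .and(Sanitizers.STYLES)
            .and(Sanitizers.IMAGES)
            .and(Sanitizers.LINKS)),

    // Only our custom <my-element> element is allowed
    CUSTOM(new HtmlPolicyBuilder()
            .allowElements("my-element")
            .toFactory());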

    private final PolicyFactory policyFactory;

    HtmlSanitizationPolicy(PolicyFactory policyFactory) {
        this.policyFactory = policyFactory;
    }

    @Override
    public String sanitize(String input) {
        return policyFactory.sanitize(input);
    }
}

We can now apply any of our policies with HtmlSanitizationPolicy.<POLICY>.sanitize(...).
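One way application code might consume a policy is to depend on the SanitizationPolicy interface and receive the concrete policy from outside. The sketch below is only an illustration: CommentService and prepareForDisplay are hypothetical names, the interface is repeated so the example compiles on its own, and the policy used in main is a stand-in, not a real sanitizer.

```java
// Repeated from the article so this sketch compiles on its own
interface SanitizationPolicy {
    String sanitize(String input);
}

// Hypothetical service, not part of the original sample project
final class CommentService {

    private final SanitizationPolicy policy;

    CommentService(SanitizationPolicy policy) {
        this.policy = policy;
    }

    // Applies the configured policy to a raw comment before it is stored or rendered
    String prepareForDisplay(String rawComment) {
        return policy.sanitize(rawComment);
    }

    public static void main(String[] args) {
        // Stand-in policy for the demo only: it escapes '<' and is NOT a real sanitizer
        SanitizationPolicy standIn = input -> input.replace("<", "&lt;");

        CommentService service = new CommentService(standIn);
        System.out.println(service.prepareForDisplay("Hi <b>there</b>"));
    }
}
```

In the real project the service would be constructed with new CommentService(HtmlSanitizationPolicy.STRICT). Depending on the interface keeps the service free of OWASP types and makes it easy to swap policies in tests.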

Write the tests

We will verify our policies with a few tests:

Links (<a>) and JavaScript (<script>) are not allowed by our STRICT policy
JavaScript (<script>) is not allowed by our ARTICLE policy, but common article elements (<p>, <strong>, style="...", <img>, <a>, <h1>, <h2>, ...) are allowed. We will use a JUnit 5 ParameterizedTest with @ValueSource to test a few samples
Links (<a>) and JavaScript (<script>) are not allowed by our CUSTOM policy, but <my-element> is

We create HtmlSanitizationPolicyTest in src/test/java:

import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import static org.assertj.core.api.Assertions.assertThat;

public final class HtmlSanitizationPolicyTest {

    @Test
    public void shouldNotAllowLinksOrJavaScriptOnStrictPolicy() {
        String text = "Text with <a href=\"https://example.com\">a link</a> " +
                "and<script>alert('javascript');</script>";

        String sanitized = HtmlSanitizationPolicy.STRICT.sanitize(text);

        assertThat(sanitized).isEqualTo("Text with a link and");
    }

    @Test
    public void shouldNotAllowJavaScriptOnArticlePolicy() {
        String text = "Text with <a href=\"https://example.com\" rel=\"nofollow\">a link</a> " +
                "and<script>alert('javascript');</script>";

        String sanitized = HtmlSanitizationPolicy.ARTICLE.sanitize(text);

        assertThat(sanitized).isEqualTo("Text with <a href=\"https://example.com\" rel=\"nofollow\">a link</a> and");
    }

    @ParameterizedTest
    @ValueSource(strings = {
            "A <h1>Title</h1> and a <p>paragraph</p>",
            "<strong>Strong</strong> and <em>emphasized</em>",
            "Code with <span style=\"color:red\">style</span>",
            "An <img src=\"https://example.com/img.jpg\" width=\"200\" />",
            "A <a href=\"https://example.com\" rel=\"nofollow\">link</a>"
    })
    public void shouldAllowCommonArticleElementsOnArticlePolicy(String text) {
        String sanitized = HtmlSanitizationPolicy.ARTICLE.sanitize(text);

        assertThat(sanitized).isEqualTo(text);
    }

    @Test
    public void shouldNotAllowLinksOrJavaScriptOnCustomPolicy() {
        String text = "Text with <a href=\"https://example.com\" rel=\"nofollow\">a link</a> " +
                "and<script>alert('javascript');</script> and <my-element>Mine</my-element>";

        String sanitized = HtmlSanitizationPolicy.CUSTOM.sanitize(text);

        assertThat(sanitized).isEqualTo("Text with a link and and <my-element>Mine</my-element>");
    }

    @Test
    public void shouldAllowMyElementOnCustomPolicy() {
        String text = "Text with <my-element>Mine</my-element>";

        String sanitized = HtmlSanitizationPolicy.CUSTOM.sanitize(text);

        assertThat(sanitized).isEqualTo(text);
    }
}