I'm learning Jsoup and have this HTML:
[...]
Content
Content
Content
[...]
I use Jsoup.parse() and document.select("p") to grab the "content", and that works nicely. But...
[...]
Content
Content
Content
[...]
In this case, I see that Jsoup.parse() converts this code to:
[...]
Content
Content
Content
[...]
How can I keep the order of nested paragraphs with Jsoup (div 4 & 5 inside of div 3)?
Adding an example:
HTML file:
TitleText
Text
Text
Text
Text
Text
Java code:
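import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetch and parse the page, then print the parsed tree as HTML
Document doc = Jsoup.connect(URL_with_HTML).get();
System.out.println(doc.outerHtml());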
Return:
TitleText Text
Text Text
Text Text
Is this correct? I'm using Jsoup 1.6.1. I expected Jsoup to return nested paragraphs instead of the output above.
Solution
Nested paragraphs do not exist in HTML. The prior paragraph is closed automatically since Jsoup implements the WHATWG HTML5 specification:
A p tag is automatically closed by any of the following: address, article, aside, blockquote, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table, or ul. Therefore a second <p> start tag closes the paragraph before it instead of nesting inside it.
An end tag whose name is p (i.e. </p>) that does not have a corresponding start tag is a parse error and is replaced with <p></p>. Therefore a stray </p> becomes <p></p>.
So jsoup is correct and your HTML is invalid.
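The two rules above can be checked with a small experiment. This is only a sketch: the class name ParagraphAutoCloseDemo and the sample markup are made up for illustration and are not the HTML from the question, and it assumes a jsoup version that follows the HTML5 tree-building rules quoted above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParagraphAutoCloseDemo {

    // Number of <p> elements in the parsed tree.
    static int countParagraphs(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Number of <p> elements that ended up inside another <p>.
    static int countNestedParagraphs(String html) {
        return Jsoup.parse(html).select("p p").size();
    }

    public static void main(String[] args) {
        // Hypothetical input: an attempt to nest one paragraph in another.
        String html = "<p>Outer<p>Inner</p></p>";

        Document doc = Jsoup.parse(html);

        // Show the tree jsoup actually built.
        System.out.println(doc.body().html());

        // List every paragraph with its text; an empty one comes from the stray </p>.
        for (Element p : doc.select("p")) {
            System.out.println("p: \"" + p.text() + "\"");
        }

        System.out.println("paragraphs: " + countParagraphs(html));
        System.out.println("nested paragraphs: " + countNestedParagraphs(html));
    }
}
```

With this input the paragraphs should come out as siblings ("Outer", "Inner", and an empty one produced by the extra </p>), and the "p p" selector should match nothing.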
Be sure to understand that your HTML is invalid because you have too many </p> end tags, not because the paragraphs are "nested". Nesting cannot happen because each open <p> gets auto-closed. The later </p> is then redundant, because its "corresponding" <p> was already auto-closed before.
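To see that the surplus </p> is what the parser objects to, jsoup can be asked to record parse errors. The sketch below assumes a jsoup release that provides Parser.setTrackErrors and Parser.getErrors (this may be newer than the 1.6.1 mentioned in the question); the class name StrayEndTagErrors and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

public class StrayEndTagErrors {

    // Parses the HTML with error tracking enabled and returns how many
    // parse errors were recorded (capped at 10 by setTrackErrors).
    static int countParseErrors(String html) {
        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Jsoup.parse(html, "", parser);
        return parser.getErrors().size();
    }

    public static void main(String[] args) {
        // Hypothetical input with one </p> too many.
        String html = "<p>Outer<p>Inner</p></p>";

        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Jsoup.parse(html, "", parser);

        // Print each recorded error with its position in the input.
        for (ParseError error : parser.getErrors()) {
            System.out.println(error.getPosition() + ": " + error.getErrorMessage());
        }
    }
}
```

Error tracking is off by default, which is why the plain Jsoup.parse() call silently repairs the markup instead of complaining.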