gzencode java: Jsoup and gzip-compressed HTML content (Android)

I've been trying all day to make this work, but it's still not right. I've checked so many posts around here and tested so many different implementations that I don't know where to look anymore...

Here is my situation: I have a small PHP test file (gz.php) on my server which looks like this:

header("Content-Encoding: gzip");
print("\x1f\x8b\x08\x00\x00\x00\x00\x00");
$contents = gzcompress("Is it working?", 9);
print($contents);

This is the simplest version I could come up with, and it works fine in any web browser.

Now I have an Android activity using Jsoup that has this code:

URL url = new URL("http://myServerAdress.com/gz.php");
doc = Jsoup.parse(url, 1000);

This causes an EOFException with an empty message on the "Jsoup.parse" line.

I've read everywhere that Jsoup is supposed to parse gzipped content without having to do anything special, but obviously, there's something missing.

I've tried many other approaches, such as using Jsoup.connect().get(), or InputStream, GZIPInputStream and DataInputStream. I also tried the gzdeflate() and gzencode() functions from PHP, but no luck either. I even tried leaving out the Content-Encoding header in PHP and inflating the content later... but that was no more effective than it was clever.

It has to be something "stupid" I'm missing, but I just can't tell what... Does anybody have an idea?

(PS: I'm using Jsoup 1.7.0, the latest version as of now.)

Solution

The asker indicated in a comment that gzcompress was writing a CRC that was both incorrect and incomplete. According to the information the asker found, the operative code is:

// Display the header of the gzip file
// Thanks ck@medienkombinat.de!
// Only display this once
echo "\x1f\x8b\x08\x00\x00\x00\x00\x00";

// Figure out the size and CRC of the original for later
$Size = strlen($contents);
$Crc = crc32($contents);

// Compress the data
$contents = gzcompress($contents, 9);

// We can't just output it here, since the CRC is messed up.
// If I try to "echo $contents" at this point, the compressed
// data is sent, but not completely. There are four bytes at
// the end that are a CRC. Three are sent. The last one is
// left in limbo. Also, if we "echo $contents", then the next
// byte we echo will not be sent to the client. I am not sure
// if this is a bug in 4.0.2 or not, but the best way to avoid
// this is to put the correct CRC at the end of the compressed
// data. (The one generated by gzcompress looks WAY wrong.)
// This will stop Opera from crashing, gunzip will work, and
// other browsers won't keep loading indefinitely.
//
// Strip off the old CRC (it's there, but it won't be displayed
// all the way -- very odd)
$contents = substr($contents, 0, strlen($contents) - 4);

// Show only the compressed data
echo $contents;

// Output the CRC, then the size of the original
// (gzip_PrintFourChars is a helper that is not shown in this excerpt)
gzip_PrintFourChars($Crc);
gzip_PrintFourChars($Size);
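
For readers on the Java side, the same repair can be sketched with java.util.zip. This is only an illustration under a few assumptions: the class and method names (GzipTrailerFix, zlibCompress, writeFourBytes, toGzip, gunzip) are made up for this example, Deflater in its default zlib mode stands in for PHP's gzcompress, and writeFourBytes assumes the missing PHP helper writes each value least significant byte first, which is the byte order the gzip trailer uses. As far as I can tell, the reason eight hand-written header bytes are enough is that the two-byte zlib header lands where the last two bytes of the ten-byte gzip header would be.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

public class GzipTrailerFix {

    // Stand-in for PHP's gzcompress($data, 9): zlib-wrapped deflate output.
    static byte[] zlibCompress(byte[] data) {
        Deflater deflater = new Deflater(9); // default mode adds the zlib wrapper
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Assumed counterpart of gzip_PrintFourChars: four bytes, least significant first.
    static void writeFourBytes(ByteArrayOutputStream out, long value) {
        for (int i = 0; i < 4; i++) {
            out.write((int) (value & 0xFF));
            value >>>= 8;
        }
    }

    // Mirrors the PHP steps: header, compressed data minus its last four bytes,
    // then the CRC-32 and the length of the original data.
    static byte[] toGzip(byte[] original) {
        CRC32 crc = new CRC32();
        crc.update(original);
        byte[] zlib = zlibCompress(original);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[] {0x1f, (byte) 0x8b, 0x08, 0, 0, 0, 0, 0}, 0, 8);
        out.write(zlib, 0, zlib.length - 4); // strip the old checksum
        writeFourBytes(out, crc.getValue());
        writeFourBytes(out, original.length);
        return out.toByteArray();
    }

    // Decodes with the same class jsoup is said to rely on.
    static String gunzip(byte[] gz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = toGzip("Is it working?".getBytes(StandardCharsets.UTF_8));
        System.out.println(gunzip(gz));
    }
}
```

If the stream decodes cleanly here, the header and trailer layout is one that GZIPInputStream accepts. In PHP itself, gzencode() is the function meant to produce a complete gzip stream, in which case no hand-written header should be printed in front of it.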

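Jonathan Hedley commented, "jsoup just uses a normal Java GZIPInputStream to parse the gzip, so you'd hit that issue with any Java program." The EOFException is presumably due to the incomplete trailer: a gzip stream is expected to end with a 4-byte CRC-32 followed by the 4-byte size of the original data, and the reader runs out of input before it has read both.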
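
To check this without a server or jsoup, the layout produced by gz.php can be rebuilt in memory and handed straight to GZIPInputStream. The following is a rough sketch, not the asker's actual setup: TruncatedTrailerDemo, buildLikeGzPhp and tryGunzip are hypothetical names, and Deflater in its default zlib mode is assumed to be a fair stand-in for gzcompress. The exact exception type may differ between runtimes (Android reported EOFException; other JVMs may report a different IOException subclass), so the sketch only checks that decoding fails with some IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

public class TruncatedTrailerDemo {

    // Mimics gz.php: eight header bytes followed by unmodified zlib output.
    static byte[] buildLikeGzPhp(String text) {
        Deflater deflater = new Deflater(9); // zlib wrapper, assumed equivalent to gzcompress
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[] {0x1f, (byte) 0x8b, 0x08, 0, 0, 0, 0, 0}, 0, 8);
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns the simple class name of the IOException, or null if decoding succeeded.
    static String tryGunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[256];
            while (in.read(buf) != -1) {
                // drain the stream; the trailer is only checked once the data ends
            }
            return null;
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String failure = tryGunzip(buildLikeGzPhp("Is it working?"));
        System.out.println(failure == null ? "decoded without error" : "failed with " + failure);
    }
}
```

Browsers appear to be more forgiving about a short or wrong trailer, which would explain why the page displayed fine there while the Java reader rejected it.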
