Digging into the HTTP Archive #2

Continuing from earlier tonight, let's see how you can use the HTTP archive as a starting point and continue examining the Internet at large.

Task: figure out what % of the JPEGs out there on the web today are progressive vs baseline. Ann Robson has an article for the perfplanet calendar later tonight with all the juicy details.

Problemo: there's no such information in HTTPArchive. However, there's a requests table with a list of URLs, as you can see in the previous post.

Solution: Get a list of 1000 random jpegs (mimeType='image/jpeg'), download them all and run imagemagick's identify to figure out the percentage.

How?

You have a copy of the DB as described in the previous post. Now connect to mysql (assuming you have an alias by now):

$ mysql -u root httparchive

Now just for kicks, let's get one jpeg:

mysql> select requestid, url, mimeType from requests \
    where mimeType = 'image/jpeg' limit 1;
+-----------+--------------------------------------------+------------+
| requestid | url                                        | mimeType   |
+-----------+--------------------------------------------+------------+
| 404421629 | http://www.studymode.com/education-blog....| image/jpeg |
+-----------+--------------------------------------------+------------+
1 row in set (0.01 sec)

Looks promising.

Now let's select 1000 random images and, at the same time, dump them into a file. For convenience, let's make this file a shell script so it's easy to run. Its contents will be one curl command per line. Let's use mysql to do all the string concatenation.

Testing with one image:

mysql> select concat('curl -o ', requestid, '.jpg "', url, '"') from requests\
    where mimeType = 'image/jpeg' limit 1;
+-----------------------------------------------------------+
| concat('curl -o ', requestid, '.jpg "', url, '"')         |
+-----------------------------------------------------------+
| curl -o 404421629.jpg "http://www.studymode.com/educ..."  |
+-----------------------------------------------------------+
1 row in set (0.00 sec)

All looks good. I'm using the requestid as file name, so the experiment is always reproducible.

mysql> SELECT concat('curl -o ', requestid, '.jpg "', url, '"')
  INTO OUTFILE '/tmp/jpegs.sh' 
  LINES TERMINATED BY '\n' FROM requests
  WHERE mimeType = 'image/jpeg'
  ORDER by rand() 
  LIMIT 1000;
Query OK, 1000 rows affected (2 min 25.04 sec)

Lo and behold, about two and a half minutes later, we have generated a shell script in /tmp/jpegs.sh that looks like:

curl -o 422877532.jpg "http://www.friendster.dk/file/pic/user/SellDiablo_60.jpg"
curl -o 406113210.jpg "http://profile.ak.fbcdn.net/hprofile-ak-ash4/370543_100004326543130_454577697_q.jpg"
curl -o 423577106.jpg "http://www.moreliainvita.com/Banner_index/Cantinelas.jpg"
curl -o 429625174.jpg "http://newnews.ca/apics/92964906IMG_9424--1.jpg"
....
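
If you would rather keep the string concatenation out of SQL, roughly the same script can be generated in the shell. This is only a sketch under assumptions, not what was used for this experiment: tsv_to_curl is a hypothetical helper, it expects requestid<TAB>url lines on stdin (for example from mysql -B -N -e "..."), and the 30-second --max-time cap is an arbitrary value picked so that a dead host cannot stall the whole run.

```shell
# Hypothetical helper: read "requestid<TAB>url" lines on stdin and
# print one curl command per line, mirroring the format of /tmp/jpegs.sh.
# -s silences the progress meter, -L follows redirects,
# --max-time 30 (an assumed value) caps each transfer at 30 seconds.
tsv_to_curl() {
    while IFS="$(printf '\t')" read -r id url; do
        # Skip blank or incomplete lines.
        [ -n "$id" ] && [ -n "$url" ] || continue
        printf 'curl -s -L --max-time 30 -o %s.jpg "%s"\n' "$id" "$url"
    done
}
```

Redirect the output of tsv_to_curl to a file and you get a script of the same shape as the one above, just with the extra curl options.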

Now, nothing left to do but run this script and download a bunch of images:

$ mkdir /tmp/jpegs
$ cd /tmp/jpegs
$ sh ../jpegs.sh

curl output flashes by and some minutes later you have almost 1000 images, mostly NSFW. Not 1000 because of timeouts, unreachable hosts, etc.

$ ls | wc -l
     983
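
To see which of the 1000 requests did not make it, you can compare the script against the download directory. A minimal sketch, assuming the script lines have exactly the curl -o <requestid>.jpg "<url>" shape shown earlier; missing_downloads is a made-up name, and treating an empty file as a failed download is my assumption too.

```shell
# Hypothetical helper: print the requestid of every curl line in the
# script ($1) whose output file is absent or empty in the directory ($2).
missing_downloads() {
    script=$1
    dir=$2
    # Pull the numeric requestid out of each "curl -o <id>.jpg ..." line.
    sed -n 's/^curl -o \([0-9][0-9]*\)\.jpg .*/\1/p' "$script" |
    while read -r id; do
        # -s is true only for files that exist and are non-empty.
        [ -s "$dir/$id.jpg" ] || echo "$id"
    done
}
```

For the setup in this post, the call would be missing_downloads /tmp/jpegs.sh /tmp/jpegs.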

Now back to the original task: how many baseline and how many progressive JPEGs:

$ identify -verbose *.jpg | grep "Interlace: None" | wc -l
     XXX
$ identify -verbose *.jpg | grep "Interlace: JPEG" | wc -l
     YYY

For the actual values of XXX and YYY, check Ann's post later tonight 🙂
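
If you want percentages rather than two raw counts, both greps can be folded into one pass. This is just a sketch: interlace_summary is a hypothetical helper that reads identify -verbose output on stdin and relies on the same "Interlace: None" and "Interlace: JPEG" lines as the commands above. Files that identify cannot parse are simply not counted.

```shell
# Hypothetical helper: summarize `identify -verbose` output from stdin.
# "Interlace: None" is counted as baseline, "Interlace: JPEG" as progressive.
interlace_summary() {
    awk '
        /Interlace: None/ { baseline++ }
        /Interlace: JPEG/ { progressive++ }
        END {
            total = baseline + progressive
            if (total == 0) { print "no JPEGs identified"; exit }
            printf "baseline: %d (%.1f%%)\n", baseline, 100 * baseline / total
            printf "progressive: %d (%.1f%%)\n", progressive, 100 * progressive / total
        }'
}
```

Usage, from inside the download directory: identify -verbose *.jpg 2>/dev/null | interlace_summary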

Also turns out 983 - XXX - YYY = 26 because some of the downloaded images were not really images, but 404 pages and other non-image files.
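
One quick way to spot those impostors is to look at the first two bytes of each file, since a JPEG stream starts with the bytes FF D8. The sketch below is a rough heuristic and not part of the original experiment; is_jpeg and list_non_jpegs are hypothetical names, and a file can pass this check and still be truncated or corrupt.

```shell
# Hypothetical helper: succeed if the file starts with the JPEG
# start-of-image marker (bytes FF D8).
is_jpeg() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "ffd8" ]
}

# Hypothetical helper: print every *.jpg in the directory ($1)
# that does not start with that marker.
list_non_jpegs() {
    for f in "$1"/*.jpg; do
        # Guard against the unexpanded pattern when there are no matches.
        [ -e "$f" ] || continue
        is_jpeg "$f" || echo "$f"
    done
}
```

Running list_non_jpegs /tmp/jpegs | wc -l gives a count you can compare with the leftover number above.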

Translated from: https://www.phpied.com/digging-into-the-http-archive-2/
