Digging into the HTTP Archive #2

cunbei2644

Original link: https://www.phpied.com/digging-into-the-http-archive-2/

Continuing from earlier tonight, let's see how you can use the HTTP archive as a starting point and continue examining the Internet at large.

Task: figure out what % of the JPEGs out there on the web today are progressive vs baseline. Ann Robson has an article for the perfplanet calendar later tonight with all the juicy details.

Problemo: there's no such information in the HTTP Archive. However, there is a table called requests with a list of URLs, as you can see in the previous post.

Solution: Get a list of 1000 random jpegs (mimeType='image/jpeg'), download them all and run imagemagick's identify to figure out the percentage.

How?

You have a copy of the DB as described in the previous post. Now connect to mysql (assuming you have an alias by now):

$ mysql -u root httparchive

Now just for kicks, let's get one jpeg:

mysql> select requestid, url, mimeType from requests \
    where mimeType = 'image/jpeg' limit 1;
+-----------+--------------------------------------------+------------+
| requestid | url                                        | mimeType   |
+-----------+--------------------------------------------+------------+
| 404421629 | http://www.studymode.com/education-blog....| image/jpeg |
+-----------+--------------------------------------------+------------+
1 row in set (0.01 sec)

Looks promising.

Now let's fetch 1000 random image URLs and, at the same time, dump them into a file. For convenience, let's make this file a shell script so it's easy to run. Its contents will be one curl command per line. Let's use mysql to do all the string concatenation.

Testing with one image:

mysql> select concat('curl -o ', requestid, '.jpg "', url, '"') from requests\
    where mimeType = 'image/jpeg' limit 1;
+-----------------------------------------------------------+
| concat('curl -o ', requestid, '.jpg "', url, '"')         |
+-----------------------------------------------------------+
| curl -o 404421629.jpg "http://www.studymode.com/educ..."  |
+-----------------------------------------------------------+
1 row in set (0.00 sec)

All looks good. I'm using the requestid as file name, so the experiment is always reproducible.

mysql> SELECT concat('curl -o ', requestid, '.jpg "', url, '"')
    INTO OUTFILE '/tmp/jpegs.sh'
    LINES TERMINATED BY '\n'
    FROM requests
    WHERE mimeType = 'image/jpeg'
    ORDER BY rand()
    LIMIT 1000;

Query OK, 1000 rows affected (2 min 25.04 sec)

Lo and behold, about two and a half minutes later, we have generated a shell script in /tmp/jpegs.sh that looks like:

curl -o 422877532.jpg "http://www.friendster.dk/file/pic/user/SellDiablo_60.jpg"
curl -o 406113210.jpg "http://profile.ak.fbcdn.net/hprofile-ak-ash4/370543_100004326543130_454577697_q.jpg"
curl -o 423577106.jpg "http://www.moreliainvita.com/Banner_index/Cantinelas.jpg"
curl -o 429625174.jpg "http://newnews.ca/apics/92964906IMG_9424--1.jpg"
....
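
Since the script is generated from URLs found in the wild, it may be worth a quick sanity check before running it. What follows is only a sketch, assuming the one-curl-per-line layout shown above; the function name count_suspicious_lines is made up for illustration. It counts lines that do not look like a plain curl command, or whose URL contains a character the shell treats specially inside double quotes (a double quote, a dollar sign, a backtick, or a backslash).

```shell
# Hypothetical helper, not part of the original post.
# Prints how many lines of the given script do NOT match the expected
# shape:  curl -o <digits>.jpg "<url without ", $, ` or \>"
count_suspicious_lines() {
  # -v inverts the match, -c counts lines; grep exits 1 when the count
  # is 0, so "|| true" keeps that from looking like a failure.
  grep -E -v -c '^curl -o [0-9]+\.jpg "[^"$`\]*"$' "$1" || true
}

# Example: count_suspicious_lines /tmp/jpegs.sh
```

A result of 0 means every line has the expected shape; anything else is a hint to open the file and look before running it.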

Now, nothing left to do but run this script and download a bunch of images:

$ mkdir /tmp/jpegs
$ cd /tmp/jpegs
$ sh ../jpegs.sh

curl output flashes by and some minutes later you have almost 1000 images, mostly NSFW. Not 1000 because of timeouts, unreachable hosts, etc.

$ ls | wc -l
     983
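
If the timeouts and unreachable hosts get annoying, one variation is to skip the generated script and loop over the list directly. This is a rough sketch under a few assumptions: the input is "requestid<TAB>url" lines (something like mysql -N -B -e "select requestid, url from requests ..." should produce that shape), the 30 second limit is an arbitrary pick, and the names fetch_all and FETCH_CMD are invented here so the download command can be swapped out.

```shell
# Hypothetical helper, not part of the original post.
# Reads "requestid<TAB>url" lines on stdin and downloads each URL to
# <requestid>.jpg in the current directory.
fetch_all() {
  while IFS="$(printf '\t')" read -r id url; do
    # Skip blank lines.
    [ -n "$id" ] || continue
    # -sS: quiet, but still print errors; --max-time: give up after 30s.
    # The URL is passed as a single argument, so the shell never parses it.
    "${FETCH_CMD:-curl}" -sS --max-time 30 -o "$id.jpg" "$url" \
      || echo "failed: $id" >&2
  done
}

# Example: fetch_all < /tmp/jpegs.tsv   (the .tsv path is an assumption)
```

The failed request ids end up on stderr, so it is easy to see which ones account for the gap between 1000 and what ls reports.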

Now back to the original task: how many baseline and how many progressive JPEGs:

$ identify -verbose *.jpg | grep "Interlace: None" | wc -l
     XXX
$ identify -verbose *.jpg | grep "Interlace: JPEG" | wc -l
     YYY
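
The two commands above run identify over every file twice. A small sketch that does the counting in one pass and also prints the percentage the task asks for; tally_interlace is a made-up name, and it only assumes the "Interlace: None" and "Interlace: JPEG" lines that the greps above already rely on.

```shell
# Hypothetical helper, not part of the original post.
# Reads `identify -verbose` output on stdin and prints the baseline and
# progressive counts plus the progressive share of the two.
tally_interlace() {
  awk '
    /Interlace: None/ { baseline++ }
    /Interlace: JPEG/ { progressive++ }
    END {
      total = baseline + progressive
      if (total == 0) { print "no JPEGs identified"; exit 1 }
      printf "baseline=%d progressive=%d progressive_pct=%.1f\n",
             baseline, progressive, 100 * progressive / total
    }'
}

# Example: identify -verbose *.jpg 2>/dev/null | tally_interlace
```

Files that identify cannot parse produce no Interlace line at all, so they are left out of both counts and out of the percentage.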

For the actual values of XXX and YYY, check Ann's post later tonight 🙂

Also turns out 983 - XXX - YYY = 26 because some of the downloaded images were not really images, but 404 pages and other non-image files.
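
One cheap way to spot those impostors without ImageMagick is to look at the first two bytes: a JPEG file starts with the bytes FF D8. A sketch, with the function names is_jpeg and count_non_jpegs invented for illustration; passing this check does not guarantee that identify can parse the file, it only weeds out obvious non-JPEGs such as HTML error pages.

```shell
# Hypothetical helpers, not part of the original post.

# Succeeds if the file starts with the JPEG start-of-image marker FF D8.
is_jpeg() {
  magic=$(od -An -tx1 -N2 "$1" 2>/dev/null | tr -d ' \n')
  [ "$magic" = "ffd8" ]
}

# Prints how many *.jpg files in the given directory fail that check.
count_non_jpegs() {
  n=0
  for f in "$1"/*.jpg; do
    # With no matches the glob stays literal; skip it.
    [ -e "$f" ] || continue
    is_jpeg "$f" || n=$((n + 1))
  done
  echo "$n"
}

# Example: count_non_jpegs /tmp/jpegs
```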
