How to Scrape a List of Topics from a Subreddit Using Bash


[Image: Linux terminal on an Ubuntu laptop. Fatmawati Achmad Zaenuri / Shutterstock.com]

Reddit offers JSON feeds for each subreddit. Here’s how to create a Bash script that downloads and parses a list of posts from any subreddit you like. This is just one thing you can do with Reddit’s JSON feeds.


Installing Curl and JQ

We’re going to use curl to fetch the JSON feed from Reddit and jq to parse the JSON data and extract the fields we want from the results. Install these two dependencies using apt-get on Ubuntu and other Debian-based Linux distributions. On other Linux distributions, use your distribution’s package management tool instead.


sudo apt-get install curl jq
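
The package names are the same on most distributions; for example, on Fedora or another dnf-based system (an assumption about your setup), the equivalent would be:

sudo dnf install curl jq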

Fetch Some JSON Data from Reddit

Let’s see what the data feed looks like. Use curl to fetch the latest posts from the MildlyInteresting subreddit:


curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json

Note the options used before the URL: -s forces curl to run in silent mode so that we don’t see any output except the data from Reddit’s servers. The next option and the parameter that follows, -A "reddit scraper example", sets a custom user agent string that helps Reddit identify the service accessing its data. The Reddit API servers apply rate limits based on the user agent string, so setting a custom value causes Reddit to segment our rate limit away from other callers and reduces the chance that we get an HTTP 429 Too Many Requests error.


The output should fill up the terminal window and look something like this:


[Screenshot: Scrape a subreddit from Bash]

There are lots of fields in the output data, but all we’re interested in are Title, Permalink, and URL. You can see an exhaustive list of types and their fields on Reddit’s API documentation page: https://github.com/reddit-archive/reddit/wiki/JSON


Extracting Data from the JSON Output

We want to extract Title, Permalink, and URL from the output data and save them to a tab-delimited file. We could use text processing tools like sed and grep, but we have another tool at our disposal that understands JSON data structures, called jq. For our first attempt, let’s use it to pretty-print and color-code the output. We’ll use the same call as before, but this time, pipe the output through jq and instruct it to parse and print the JSON data.


curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq .

Note the period that follows the command. This expression (jq’s identity filter) simply parses the input and prints it as-is. The output comes back nicely formatted and color-coded:


[Screenshot: Extract data from a subreddit's JSON in Bash]
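
The same identity filter works on any JSON snippet, which makes it a quick way to sanity-check that jq is installed and parsing correctly; for example:

echo '{"title": "example"}' | jq .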

Let’s examine the structure of the JSON data we get back from Reddit. The root result is an object that contains two properties: kind and data. The latter holds a property called children, which includes an array of posts to this subreddit.
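
Trimmed down to just the parts this article uses (with field values elided), that structure looks roughly like this:

{
  "kind": "Listing",
  "data": {
    "children": [
      {
        "kind": "t3",
        "data": {
          "title": "...",
          "url": "...",
          "permalink": "..."
        }
      }
    ]
  }
}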


Each item in the array is an object that also contains two fields called kind and data. The properties we want to grab are in the data object. jq expects an expression that can be applied to the input data and produces the desired output. It must describe the contents in terms of their hierarchy and membership in an array, as well as how the data should be transformed. Let’s run the whole command again with the correct expression:


curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq '.data.children | .[] | .data.title, .data.url, .data.permalink'

The output shows Title, URL, and Permalink each on their own line:


[Screenshot: Parse contents of a subreddit from the Linux command line]

Let’s dive into the jq command we called:


jq '.data.children | .[] | .data.title, .data.url, .data.permalink'

There are three expressions in this command, separated by two pipe symbols. The results of each expression are passed to the next for further evaluation. The first expression filters out everything except the array of Reddit listings. The second expression iterates over that array, emitting each listing as a separate object. The third expression acts on each of those objects and extracts three properties. More information about jq and its expression syntax can be found in jq’s official manual.
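
If you want to see what each stage contributes, you can run the expressions incrementally against the same feed; each command below adds one more stage:

curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq '.data.children'
curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq '.data.children | .[]'
curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq '.data.children | .[] | .data.title'

The first prints the bare array of posts, the second emits each post as its own object, and the third pulls a single property out of each.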


Putting It All Together in a Script

Let’s put the API call and the JSON post-processing together in a script that will generate a file with the posts we want. We’ll add support for fetching posts from any subreddit, not just /r/MildlyInteresting.


Open your editor and copy the contents of this snippet into a file called scrape-reddit.sh:


#!/bin/bash

# Exit early with a usage message if no subreddit name was supplied.
if [ -z "$1" ]
  then
    echo "Please specify a subreddit"
    exit 1
fi

SUBREDDIT=$1
NOW=$(date +"%m_%d_%y-%H_%M")
OUTPUT_FILE="${SUBREDDIT}_${NOW}.txt"

# Fetch the feed, reduce it to three lines per post with jq, then
# regroup each trio of lines into a single tab-delimited record.
curl -s -A "bash-scrape-topics" "https://www.reddit.com/r/${SUBREDDIT}.json" | \
        jq '.data.children | .[] | .data.title, .data.url, .data.permalink' | \
        while read -r TITLE; do
                read -r URL
                read -r PERMALINK
                # tr strips the double quotes jq leaves around strings.
                echo -e "${TITLE}\t${URL}\t${PERMALINK}" | tr --delete \" >> "${OUTPUT_FILE}"
        done

This script will first check if the user has supplied a subreddit name. If not, it exits with an error message and a non-zero return code.


Next, it will store the first argument as the subreddit name, and build up a date-stamped filename where the output will be saved.
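
For example, running it as ./scrape-reddit.sh MildlyInteresting at noon on January 31, 2019 would produce a file named MildlyInteresting_01_31_19-12_00.txt (the exact name depends, of course, on when you run it).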


The action begins when curl is called with a custom header and the URL of the subreddit to scrape. The output is piped to jq, where it’s parsed and reduced to three fields: Title, URL, and Permalink. These lines are read one at a time and saved into variables using the read command, all inside of a while loop that continues until there are no more lines to read. The last line of the inner while block echoes the three fields, delimited by a tab character, and then pipes it through the tr command so that the double quotes can be stripped out. The output is then appended to a file.
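
The three-reads-per-record pattern is easier to see in isolation; this toy pipeline (nothing Reddit-specific about it) regroups three input lines into one tab-delimited line:

printf 'first\nsecond\nthird\n' | while read -r A; do
        read -r B
        read -r C
        echo -e "${A}\t${B}\t${C}"
done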


Before we can execute this script, we must ensure that it has been granted execute permissions. Use the chmod command to apply these permissions to the file:


chmod u+x scrape-reddit.sh

And, lastly, execute the script with a subreddit name:


./scrape-reddit.sh MildlyInteresting

An output file is generated in the same directory, and its contents will look something like this:


[Screenshot: Scrape and view topics from a subreddit in Bash]

Each line contains the three fields we’re after, separated using a tab character.
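
Because tab is cut’s default field delimiter, it’s easy to slice out a single column; for example, to list just the titles (substitute your actual output filename):

cut -f 1 MildlyInteresting_01_31_19-12_00.txt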


Going Further

Reddit is a goldmine of interesting content and media, and it’s all easily accessed using its JSON API. Now that you have a way to access this data and process the results you can do things like:


  • Grab the latest headlines from /r/WorldNews and send them to your desktop using notify-send (see the sketch after this list)

  • Integrate the best jokes from /r/DadJokes into your system’s Message-Of-The-Day

  • Get today’s best picture from /r/aww and make it your desktop background
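
As a starting point, here’s a minimal sketch of the first idea, assuming notify-send (from libnotify) is installed on your desktop:

HEADLINES=$(curl -s -A "bash-scrape-topics" "https://www.reddit.com/r/WorldNews.json" | \
        jq -r '.data.children | .[0:5] | .[] | .data.title')
# -r makes jq print raw strings, so no quote-stripping is needed,
# and .[0:5] keeps the notification to the five newest posts.
notify-send "Latest from /r/WorldNews" "${HEADLINES}"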

All this is possible using the data provided and the tools you have on your system. Happy hacking!


Translated from: https://www.howtogeek.com/405471/how-to-scrape-a-list-of-topics-from-a-subreddit-using-bash/
