rsync - Spidering Hacks

Hack 92 Mirroring Web Sites with wget and rsync

Mirroring Directly with the Server

In this case, you have access to the server. Most likely, you want to mirror your own site or perhaps some other data. For this, rsync (http://rsync.samba.org/) is the ideal tool. rsync is a versatile tool for mirroring or backing up data across computers. There are multiple ways of using rsync between machines; however, here we are going to use ssh. This is the easiest to configure and has the added advantage of providing good security for the files being transferred.

Obviously, you will need ssh installed and configured on both systems. You will also need to make sure you can log in to the system you want to mirror from. Next, you'll need to determine which directories you want to mirror and where you want them mirrored to on your system. With that in mind, you just need to run rsync, passing it the necessary options:

rsync -a -e ssh remote.machine.com:/some/directory /local/directory

The -a option tells rsync you want to mirror the directory; it sets a series of options that make rsync keep timestamps, permissions, user and group ownership, soft links, and so on for all the files, and it recurses through the directory. The -e option followed by ssh tells rsync to use ssh to connect to the remote server; if you are not using public key authentication, ssh will prompt you for the password when it connects to the server. The next argument is the server to connect to, followed by the directory to mirror from and, finally, the directory to mirror to. It is safest if the last argument is a directory that already exists on your system: rsync creates new directories inside it (here /local/directory/directory, because the source path has no trailing slash) but does not create missing parent directories.
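Where the files end up depends on whether the source path ends in a slash, which is easy to get wrong. The following is a minimal sketch of that rule only; landing_path is a hypothetical helper written for this illustration, not part of rsync, and it does not run rsync or touch the network.

```shell
#!/bin/sh
# landing_path SRC DEST FILE
# Hypothetical helper: prints where FILE (a file directly inside SRC)
# would land under DEST, following rsync's trailing-slash rule.
landing_path() {
    src=$1
    dest=${2%/}          # drop one trailing slash from the destination
    file=$3
    path=${src#*:}       # strip a leading "host:" prefix, if present
    case $path in
        */) # Trailing slash: the contents of SRC go straight into DEST.
            printf '%s/%s\n' "$dest" "$file" ;;
        *)  # No trailing slash: SRC itself is recreated inside DEST.
            printf '%s/%s/%s\n' "$dest" "$(basename "$path")" "$file" ;;
    esac
}

landing_path remote.machine.com:/some/directory  /local/directory index.html
# prints /local/directory/directory/index.html
landing_path remote.machine.com:/some/directory/ /local/directory index.html
# prints /local/directory/index.html
```

If you want /local/directory to hold the mirrored files directly, add the trailing slash to the source path in the rsync command.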

Before getting into more options, it is a good idea to take a look at what this command just did. The mirrored directory should now appear the same as the directory on the server. Every file should be the same and should have the same timestamps, permissions, and so on. If you ran the command as root, the mirrored files will also have the same user and group ownership, assuming those users and groups exist on this system.

Now, we'll talk a bit about how rsync works; it was designed for exactly what we are using it for. It checks each file to see if changes were made; by default, it compares the file's size and modification time on both sides. If a file has changed, rsync attempts to update the local copy by sending only the parts that differ. For new files, it sends the whole file. This is great, because not only is it better at checking for changes than wget's use of HTTP, but it also tries to send only the data necessary to update the file, saving nicely on bandwidth.
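To make the change check concrete, here is a rough sketch of the default size-and-timestamp comparison written as a shell function. It only illustrates the idea: needs_transfer is a hypothetical name, the logic is a simplification of what rsync does internally, and the -nt and -ot tests are shell extensions (available in bash, dash, and ksh) rather than strict POSIX.

```shell
#!/bin/sh
# needs_transfer SRC DEST
# Returns 0 (true) if DEST looks out of date compared to SRC,
# using only file size and modification time.
needs_transfer() {
    src=$1
    dest=$2
    # A file missing from the destination is always sent in full.
    [ -e "$dest" ] || return 0
    # Different sizes mean the file has changed.
    [ "$(wc -c < "$src")" -ne "$(wc -c < "$dest")" ] && return 0
    # Same size: the file counts as changed if the timestamps differ.
    [ "$src" -nt "$dest" ] || [ "$src" -ot "$dest" ]
}

# Small local demonstration in a scratch directory.
tmp=$(mktemp -d)
printf 'hello' > "$tmp/src"
needs_transfer "$tmp/src" "$tmp/dest" && echo "transfer: destination missing"
cp "$tmp/src" "$tmp/dest"
touch -r "$tmp/src" "$tmp/dest"    # copy the timestamp, as rsync -a would
needs_transfer "$tmp/src" "$tmp/dest" || echo "skip: same size and timestamp"
rm -rf "$tmp"
```

This also shows why preserving timestamps (part of -a) matters: without matching timestamps, every file would look changed on the next run.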

Now, onto the other options. The -z option is one you will usually want; it tells rsync to compress the data stream, decreasing bandwidth and, unless the files are already compressed or the network is very fast, most likely making the entire process go faster.

The -v option tells rsync to spit out the names of the files it is syncing; this works well when coupled with the --progress and --stats options. The former adds a progress indicator to each file as it is downloaded, and the latter details statistics about the entire mirroring operation.

The -u option tells rsync to only update files (it does not touch local files with a timestamp newer than the one on the server). This option is useful only if you modify the files locally and want to keep those changes. If you intend to keep a fully accurate mirror of the remote site, do not use this option, and keep in mind that without it any changes you make to the files locally will be overwritten.

Finally, the --delete option deletes local files that no longer exist on the server. If a file is deleted on the server, it will also be deleted from your mirror. Again, this is very useful if you want to maintain an exact mirror of the files on the server.
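The options above combine into two typical invocations: an exact mirror (with --delete) and a mirror that keeps newer local files (with -u). The sketch below only prints the command line for each case so you can inspect it before running it; mirror_command and its mode names are hypothetical, chosen for this illustration.

```shell
#!/bin/sh
# mirror_command MODE SRC DEST
# Prints (does not run) an rsync command line for the given mode:
#   exact       delete local files that are gone from the server
#   keep-local  leave local files that are newer than the server's copy
mirror_command() {
    mode=$1
    src=$2
    dest=$3
    # Archive mode, compression, file names, per-file progress, summary.
    opts="-az -v --progress --stats"
    case $mode in
        exact)      opts="$opts --delete" ;;
        keep-local) opts="$opts -u" ;;
        *)          echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    printf 'rsync %s -e ssh %s %s\n' "$opts" "$src" "$dest"
}

mirror_command exact remote.machine.com:/some/directory/ /local/directory/
mirror_command keep-local remote.machine.com:/some/directory/ /local/directory/
```

Because --delete removes files, it is worth adding rsync's -n (dry run) option to the printed command the first time; rsync then lists what it would transfer and delete without changing anything.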

Hacking the Hack

The way we use rsync here is secure and easy. However, if you do not have ssh installed, you may be looking for an alternative. There are basically two other options.

One is to use rsync with rsh instead of ssh. This still requires setup on the server; rsh is the more traditional tool, but it is considerably less secure than ssh. If you use rsh, make sure you have it set up correctly on your server and change the -e option accordingly: rsync versions before 2.6.0 use rsh by default, so removing -e ssh is enough, while later versions default to ssh and need -e rsh instead.

Another option is to run rsync as a service on the server. This option does not have the security of ssh, but it allows you to use rsync without having ssh set up. To do this, you still need rsync installed on both machines, but you have to create an rsync configuration file on the server and make sure rsync runs as a service.

To begin, you'll want the following command run at startup on the server:

rsync --daemon

Then, you will want to create a configuration file for rsync, such as the following:

[backup]
path = /some/directory

Put this in the /etc/rsyncd.conf file and have rsync start as shown previously. Now, when you connect to the rsync server, you will want to change the options a bit. Instead of remote.machine.com:/some/directory, you will want remote.machine.com::backup. This tells rsync to connect to the backup module on the rsync server. You will also want to omit the -e ssh option. There is more you can do with the rsyncd.conf file, including restricting access based on usernames, setting read-only access, and so on. For a complete list of options, view the manpage for rsyncd.conf by typing man rsyncd.conf.
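As an illustration of the access restrictions mentioned above, the following sketch writes a slightly fuller module definition. The parameter names (comment, read only, auth users, secrets file) are documented in the rsyncd.conf manpage; the user name mirror, the secrets file location, and the write_rsyncd_conf helper are assumptions made for this example.

```shell
#!/bin/sh
# write_rsyncd_conf FILE
# Hypothetical helper: writes an example rsyncd.conf to FILE.
write_rsyncd_conf() {
    conf=$1
    cat > "$conf" <<'EOF'
[backup]
path = /some/directory
comment = read-only mirror source
# Clients may download from this module but not upload to it.
read only = yes
# Only this (assumed) user may connect; the password lives in the secrets file.
auth users = mirror
secrets file = /etc/rsyncd.secrets
EOF
}

# Write the example to a scratch file and show it, rather than touching /etc.
example=$(mktemp)
write_rsyncd_conf "$example"
cat "$example"
rm -f "$example"
```

The secrets file holds one user:password pair per line and should not be readable by other users. A client would then connect with something like rsync -az mirror@remote.machine.com::backup /local/directory and be prompted for that password.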
