I recently worked with a client that had about 200GB of data, mostly small files like images and PDFs, stored on a web server. Using BitTorrent Sync, I got all of that data from the server to a local computer with very little effort. When I began working with them, they had no backups of the data at all; their webhost did daily backups, but those were essentially inaccessible to us, beyond the ability to request that a backup be restored. My initial plan was simple: purchase a small, low-power computer with a 4TB hard drive, far more than enough for the 200GB of data and for future growth. I would then get the data from the server to that computer and deliver it to the client's office, so they would have an on-site backup of all of their data.
The question is, what software solution would be able to sync that amount of data, quickly and easily?
I had four requirements for this project, each of which was non-negotiable: price, speed, real-time operation, and security. It would later turn out that BitTorrent Sync solved each of these quite well, although it did have a couple of small drawbacks that needed to be mitigated.
I came up with a number of ideas, none of which was particularly good, but any of which might have worked. My first idea was to use Dropbox, an easy-to-use, free service with which I was very familiar. The plan was to use a free Dropbox account, which gets me 2GB of storage, and use it to transfer only the new files nightly. The client uses a custom-written application on their server, and luckily for me the program is still being developed. I was able to get the developer to copy all new files not only to the existing data directory but also to a new directory made for me. I would tie that directory to Dropbox, so that all future data would be synced between the server's Dropbox folder and the computer's Dropbox folder. A script running every few minutes on the computer would then pull the data out of that folder and put it elsewhere, leaving the Dropbox folder as empty as possible for as much time as possible.
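The pull-and-clear step could be as simple as a scheduled script along these lines. This is only a sketch: the folder paths in the usage comment are placeholders, and a real version would also want to skip files Dropbox is still syncing.

```python
import shutil
from pathlib import Path

def drain(src: Path, dst: Path) -> int:
    """Move every file out of the Dropbox folder into long-term storage,
    preserving relative paths, so the 2GB quota frees up for the next batch.
    Returns the number of files moved."""
    dst.mkdir(parents=True, exist_ok=True)
    moved = 0
    for item in sorted(src.rglob("*")):
        if item.is_file():
            target = dst / item.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(item), str(target))
            moved += 1
    return moved

# Example (hypothetical paths), run every few minutes via cron or Task Scheduler:
# drain(Path.home() / "Dropbox" / "incoming", Path("/srv/backup-data"))
```

Because the function only ever moves files out of the Dropbox folder, the worst case after a crash mid-run is a partially drained folder, which the next scheduled run finishes.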
This probably would have worked. The initial sync would still be a problem, but I had a solution for that too: just dump all the files into Dropbox anyway. The free account only has 2GB of space, but Dropbox doesn't simply die when you hit that limit; it syncs 2GB and no more. I could therefore have dumped all the data in there and let 2GB of it sync, then have the desktop pull data out every minute, allowing more of it to sync. It's certainly not an elegant solution, but once the backlog was cleared, I could continue the process with only the changed files, which would add up to under 2GB daily. There was also a potential concern about how Dropbox would behave with a single file larger than 2GB. That wasn't an issue with my data, which was a large number of tiny files, but whether Dropbox would consistently allow large files to sync would definitely be worth researching.
This plan was free, in that I would not be paying for a Dropbox for Business account. It certainly wouldn't be fast, because it could only sync in 2GB chunks, although I could potentially keep the data moving in and out relatively quickly (which might have gotten the account flagged for suspicious activity). It would be relatively real-time: files would be copied to the Dropbox directory as soon as they were uploaded to the server, and from there would begin syncing to the computer immediately. And it would be reasonably secure, since Dropbox is a fairly secure platform.
Another idea was to use FTP or SFTP. We already had a FileZilla FTP server running on the server for ease of file transfer. I could have exposed the entire data directory via FTP and then run a script on the desktop that checks for new files, or keeps an index of files and checks for changes periodically.
This solution would also be free, given that both the FileZilla client and server are free. However, the programming aspect of polling an FTP server really turned me off from the whole idea. In terms of speed, the FTP server/client combination can pretty easily use up all available bandwidth, which is nice – but the true speed would be based on the efficiency of the polling script. It would indeed be real-time, in that the directory itself would be accessible via the FTP server. In terms of security, it can be fairly easily set up with SSL/TLS for security, which is a big plus.
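The polling script I was dreading might have looked something like the sketch below, using `ftplib` from the Python standard library. The host, credentials, and directory names are placeholders, and it assumes the server supports the MLSD listing command (which FileZilla Server does).

```python
import ftplib
import json
from pathlib import Path

INDEX_FILE = Path("ftp_index.json")  # remembers name -> size of files already fetched

def needs_fetch(seen: dict, name: str, facts: dict) -> bool:
    """Decide whether an MLSD entry is a file we haven't downloaded yet,
    or one whose size changed since the last poll."""
    return facts.get("type") == "file" and seen.get(name) != facts.get("size")

def poll_once(host: str, user: str, password: str,
              remote_dir: str, local_dir: str) -> int:
    """One polling pass: fetch new/changed files, then persist the index.
    Returns the number of files downloaded."""
    seen = json.loads(INDEX_FILE.read_text()) if INDEX_FILE.exists() else {}
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    fetched = 0
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        for name, facts in ftp.mlsd():
            if not needs_fetch(seen, name, facts):
                continue  # already have this file at this size
            with open(local / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
            seen[name] = facts.get("size")
            fetched += 1
    INDEX_FILE.write_text(json.dumps(seen))
    return fetched
```

Even this simplified version hints at the pain points: it doesn't recurse into subdirectories, a size-only index misses in-place edits that keep the same byte count, and every poll re-lists the whole directory.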
CrashPlan is another product with which I'm familiar; you can see my review of the Family plan here. It's a fairly robust program, and the free option lets you use the program to sync two computers, which is nice. If you pay, you get an additional sync destination: their cloud backup service. However, based on my review, the program itself simply isn't performant; as the number of files increases, so does the RAM requirement. This means that although the program is free, it would be eating up precious resources on the server and never giving them back. As long as CrashPlan is running, it will demand a ton of RAM. So it's free, but not really.
Speed is also a major issue for CrashPlan. As I mention throughout my review, CrashPlan can be extremely slow, with little to no explanation. The only support they offer is basically to restart your router (seriously). In a home environment, that's incredibly unhelpful; in a business environment, impossible. I wasn't going to interrupt my client's internet access because the solution I chose got slow. The solution would be real-time, in that the folders linked by CrashPlan would be synchronized and changes would be propagated from the server to the PC. Unfortunately, due to the slow speed of propagation, "real-time" is relative.
In terms of security, CrashPlan is pretty secure. Their free version uses the same Blowfish encryption as their paid version, but the free version uses a key strength of 128-bit while their paid version uses 448-bit. That’s kind of annoying, but they’re a business so I’m lucky they’re giving any part of their program away for free, I guess. Still, it’s another reason not to go with them as an option.
In the end, I went with Resilio Sync (formerly Bittorrent Sync). This option has everything I need – in terms of price, it’s completely free. In terms of speed, Resilio Sync is able to eat up 100% of my available upload bandwidth on my server, or 100% of my download bandwidth on my desktop, whichever is the bottleneck. It also gives me the ability to limit the inbound or outbound bandwidth so it’s not causing problems for other services. In terms of real-time connectivity, so long as both the server and the desktop are running the Resilio Sync service, pretty much as soon as I put a file in on the server I see it begin to sync on the desktop. Works perfectly.
Security is an interesting aspect. In order to connect the server and the desktop, you first set up a share on the server, and then you get a private key for that share. You can then insert that private key on the desktop, and it will begin syncing. According to NetworkWorld.com’s analysis of the Hackito article on BitTorrent Sync security, it shouldn’t be used for sensitive data for a number of reasons. Now, on the one hand, I’m not using it for inherently sensitive data. On the other hand, I have mitigation protocols in place for any of the issues mentioned at the end of the NetworkWorld article.
The choice of Resilio Sync as the synchronization solution comes with a number of bonus features that are pretty cool. The first is that the synchronization is based on torrent technology – so far, my whole use case has been about syncing one server and one desktop. If I were to get a second desktop, however, it would join the pool of syncing computers. Both the server and the existing desktop could act as a “seed”, providing data to the new desktop. Because of this, the files would be downloading from two sources instead of just one, theoretically doubling the download speed. More realistically, if there was an upload bottleneck on the side of either the existing desktop or the server, that bottlenecked speed would be supplemented by the additional speed of the second seed.
Another great feature is the Archive. One potential downside of a truly real-time synchronization option is that if, for some reason, all of the files get deleted from the server, those deletions would be propagated to the desktop, which would synchronize by deleting all of its files as well. The good news is that Resilio Sync has a section called the Archive, which holds files even after they are deleted. You can customize how long files are kept and how large the Archive can grow; I have those settings set high enough that the entire 200GB of data can sit in the Archive for up to 30 days, if need be. Since I would most likely be getting panicked phone calls the same day everything got deleted, 30 days is overkill, but it's nice to know it will all be there.
I mentioned that when setting up the relationship between the two Resilio Sync clients, I needed to share a key. The cool thing is that there are actually two kinds of keys: a standard key and a read-only key. A standard key keeps both folders synchronized, meaning that a change made on either the server or the desktop is sent to the other machine. But I don't want changes on the desktop to be sent to the server; if for some reason the desktop gets messed with, the server needs to continue to work properly. By providing the desktop with a read-only key, it can "read" the changes made on the server and stay synchronized, but any changes made on the desktop will not be "written" back to the server. I can rest assured that even though I'm putting the desktop in my client's office, someone tampering with it would not affect the server at all.
Resilio Sync does have a couple of drawbacks, though. Although I have the client software set up on both the server and the desktop, the actual syncing server is proprietary. Without delving into too much detail about how torrents work, Resilio owns the tracker server, and I can’t run my own. Basically, the tracker server keeps track of which computers have which data, and lets them know about each other. So, when my PC uses the private key provided by my server, the tracker server pairs them up so they can begin syncing.
The issue is one of security: how do I really know that Resilio isn't giving away that private key elsewhere? How do I know they don't have their own backdoor in the program, letting them get at all of my files? There are a few reasons why I feel safe even with those questions. The first is that the software itself lets me restrict the IP addresses that are allowed to get my files. This means that even if someone else had my private key, they wouldn't have the correct IP address. The desktop will be at my client's office, which has a static IP, so I simply set that IP address as the "predefined host" and I'm good to go.
Of course, for this to be truly secure, I would have to trust the Resilio Sync software when it says it's using only the predefined hosts. I could monitor outbound connections to determine whether any services running on the server are connecting to unknown IP addresses; however, because my backup is sizeable, I have an even easier option: bandwidth monitoring. I disabled it during the initial sync of the 200GB, but re-enabled it once that was done. Now, even if another computer manages to find my secret key and spoof my client's IP address, I'd be notified by the webhosting company if my server started sending out massive amounts of data. Even if the program itself is untrustworthy (and admittedly, mentioning torrent technology in a production environment may get you laughed at), it still wouldn't be too much of a problem: perhaps they could get at a little of the data before I got the notification, but they most certainly couldn't get all 200GB before I could react (and in the future, I will be scripting this reaction).
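A first pass at scripting that reaction could be a watchdog that samples the server's outbound byte counter and stops the sync service when the rate looks wrong. This is only a sketch under several assumptions: a Linux server, an interface named `eth0`, a systemd unit named `resilio-sync`, and a 5 MB/s threshold are all placeholders to be tuned for the real environment.

```python
import subprocess

# Hypothetical threshold: nothing legitimate on this server should
# sustain 5 MB/s outbound once the initial sync is finished.
THRESHOLD_BPS = 5 * 1024 * 1024

def tx_bytes(iface: str = "eth0") -> int:
    """Cumulative transmitted bytes for an interface, read from /proc/net/dev
    (Linux only). Field 9 of the split line is the TX bytes counter."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                return int(line.split()[9])
    raise ValueError(f"interface {iface!r} not found")

def exceeded(before: int, after: int, seconds: float) -> bool:
    """True if the average outbound rate over the sample window
    beat the threshold."""
    return (after - before) / seconds > THRESHOLD_BPS

def react() -> None:
    """Panic response: stop the sync service so no more data leaves.
    'resilio-sync' is an assumed systemd unit name."""
    subprocess.run(["systemctl", "stop", "resilio-sync"], check=False)

# Watchdog loop (sketch): sample tx_bytes(), sleep 60 seconds, sample again,
# and call react() if exceeded(first, second, 60) is True.
```

Stopping the service is a blunt instrument, but for a backup target that's the right trade-off: a false positive just pauses syncing until I investigate, while a true positive cuts off the exfiltration immediately.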