How to clone a site
Have you recently discovered a website full of content that interests you but, worried that it could disappear at any moment, you would like to “keep” a copy of it on your computer? If this is your case, then you’ve come to the right place at the right time: in the course of this guide, in fact, I’ll explain how to clone a site using a series of programs created specifically for this purpose.
However, I would like to clarify, right from the start, a fundamental aspect of the matter: once the website has been downloaded, the “local” copy will remain exactly as it was at the time of download, even if the computer is connected to the Internet. This means, in practice, that any updates made by the legitimate owner of the website will not be “reflected” in the local copy on your computer: to get them, you will need to start a new download.
In any case, downloading a website is not difficult but, as you’ll find out in a moment, it’s necessary to apply very precise settings in order to avoid ending up with less than satisfactory results. So, if you’re interested in learning more, continue reading this guide: I’m sure that, in a few minutes, you’ll have a clear understanding of all the details of the matter. Happy reading and good luck with everything!
Index
Preliminary information
How to clone a website
HTTrack (Windows)
SiteSucker (macOS)
Other programs to copy a Web site
How to copy a website and edit it
Preliminary information
Before getting into the heart of this guide, I think it is necessary to give you some explanations about the operation you are going to perform.
First of all, you should know that a Web site generally consists of several Web pages, each of which contains numerous elements: plain text, images, multimedia content of various kinds, and more or less direct links to external resources, such as CSS style sheets (for proper layout), scripts, and numerous other categories of Web resources. All the contents of a Web page, including any links to external resources, are included in its source code.
Downloading an entire site requires, first of all, a crawl phase performed on its home page: this means that the software downloads the initial page, analyzes its source code and downloads all the elements it references (the other pages it links to, multimedia components, scripts and so on). This operation is then repeated, recursively, on all the pages of the site under analysis and on the elements they point to.
Ideally, it is like following the structure of a tree, starting from the apex: the tip of the tree is the home page (first level), which, through “direct” branches, points to the sub-pages it links to (second level); these pages, in turn, link to other pages, other hypertext links and other content (third level), and so on.
By default, programs dedicated to downloading websites try to follow every link present on the analyzed pages and download what it points to, and that, in most cases, means two things.
A complete download of a large website is almost unthinkable because of the huge amount of data that would have to be transferred to your computer. For example, don’t try to download all the pages of Wikipedia or similar portals: you might not be able to complete the download even after months!
For a downloaded website to be consistent and usable, it is imperative to limit the crawler to the analysis of third or, at most, fourth level links, starting from the home page. Ideally, it is also advisable to set a limit on the size of the downloaded files, in order to avoid multimedia files needlessly filling up the computer’s storage.
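To give you a concrete idea of how these limits translate into practice, here is a minimal sketch based on the free wget utility (presented later in this guide); the address example.com, the three-level depth, the excluded extensions and the 500 MB quota are just placeholder values to adapt to your case:

    wget -r -l 3 -R mp4,avi,zip --quota=500m https://example.com

Here, -r enables the recursive crawl, -l 3 stops it at the third level, -R skips files with the listed extensions and --quota interrupts the download once roughly 500 MB have been retrieved.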
Finally, remember that, unless you use a staging environment, a website downloaded locally will rarely behave exactly like the site browsed in the “classic” way (i.e. through its legitimate address): this happens because, very often, some elements necessary for correct display (databases, server-side Web applications, external scripts and so on) are not available for download, as they are accessible only from within the Web server hosting the site.
How to clone a website
What do you say? Have you fully understood what I have just explained to you and, with the necessary precautions in mind, are you ready to clone a site you would like to keep a local copy of? If so, you can rely on one of the programs I’m going to present below.
HTTrack (Windows)
HTTrack is a free and open source program, available for Windows, macOS and Linux, that allows you to take the entire content of a website and save it to a folder of your choice on your PC.
Although the program is available for all three major desktop platforms, it only has a ready-to-use graphical interface for Windows: for simplicity’s sake, I’ll limit myself to covering that operating system.
So, to download HTTrack for Windows, go to the program’s website (www.httrack.com), click on the Download tab and then click on the link httrack_x64-x.y.z.exe, if you are using 64-bit Windows, or httrack-x.y.z.exe, if you are using 32-bit Windows, in order to start downloading the installation package of the program.
Once you get the file (e.g. httrack_x64-3.49.2.exe), run it and click the Yes and Next buttons, then check the box next to I accept the agreement, click the Next button 4 more times and then the Install button. To exit the setup and start the program, remove the checkmark from the box next to View history.txt file and click the Finish button.
Once the program starts, select your preferred language from the Language preference drop-down menu, click the OK button and then close the program and start it again, using the icon added to the Start menu, to apply the new language settings.
On the initial screen of the program, click the Next button to start a new project, specify the name and category of the project in the appropriate fields and select the folder where you want to save everything by clicking the […] button next to the Basic path box.
When you’re done, click the Next button, set the Action drop-down menu to Download website(s), and specify the website’s homepage web address (e.g. aranzulla.co.uk) in the text box immediately below.
After that, click on the Define Options… button, go to the Limits tab and, using the available boxes and text fields, specify the maximum depth of internal links (e.g. 3), the maximum depth of external links (e.g. 3), the maximum size of HTML files, other file types and the entire site (in bytes). The Filters tab, on the other hand, allows you to exclude or include specific types of files in the download.
Once you’ve made the necessary adjustments, click on the OK button and press the Next and Finish buttons to start downloading the website, which can take up to several hours: it all depends on the number and size of the files that make up the site. Once the download is complete, all you have to do is press the View Web button to start browsing the downloaded site. Easy, isn’t it?
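For the record, the Windows installation also includes a command-line version of the program (httrack.exe), which can be handy for repeating the same download without going through the wizard. Purely as an indicative sketch, assuming you want to copy example.com into the C:\Sites\example folder with a maximum depth of 3 levels, the command might look like this:

    httrack "https://example.com/" -O "C:\Sites\example" -r3

Here -O indicates the destination folder and -r3 limits the mirror depth, reflecting the settings seen above; check the output of httrack --help for the full list of options available in your version.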
SiteSucker (macOS)
If you use a Mac, I suggest you turn instead to SiteSucker: a program that, using an easy-to-use interface, allows you to download an entire website, with the possibility of customizing the analysis options. SiteSucker is available on the Mac App Store and costs €5.49.
After purchasing, installing and launching the program, first click on the Settings button to apply the appropriate restrictions to be used during crawling. Go to the Restrictions tab and check the box next to the restriction options you want to use: personally, I recommend limiting the Maximum number of levels (between 3 and 4), Maximum number of files (400 or less) and Maximum file size.
When you’re done, click the File Type tab, set the drop-down menu at the top to Do not allow specified file types and place a checkmark next to the file types you want to exclude from downloading (if you want to define new ones, use the Custom Templates tab). To save the applied settings, first click the Save as user defaults button and then click OK.
Once you return to the main screen of the program, use the Folder button to specify the directory in which to download the site’s files (if you don’t want to use the default one), enter the address of the home page of the site you are interested in into the URL text field and, to start downloading the site right away, click the Start Download button.
Once you’ve finished downloading the site, all you have to do is press the File button to open the home page of the downloaded site locally.
Note: if you do not wish to purchase SiteSucker, you can download one of the older versions of the software, freely available on the program’s website. They work in a similar, although more limited, way to what you have seen above.
Other programs to copy a Web site
If you feel that the solutions shown above are not for you, you can consider using some other, equally effective website copying programs. Below are some of them.
Cyotek WebCopy (Windows) – a free, English-language program that allows you to download an entire website, or just part of it, applying content restriction rules as needed.
Website Ripper Copier (Windows) – a program whose interface is not the most polished, but which includes all the functions you need to get the job done. It is free for the first 30 days of use, after which you need to purchase a license (currently €44.64 + VAT).
wget (macOS/Linux) – a command-line download utility for macOS and Linux (on most Linux distributions it is preinstalled) that, when properly configured, allows you to download entire websites. It works from the Terminal, as shown in the example below.
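To give you an idea of how wget works, here is a minimal sketch that creates a locally browsable copy of a site; example.com is, of course, a placeholder address and the three-level limit reflects the advice given at the beginning of this guide:

    wget --recursive --level=3 --convert-links --page-requisites --adjust-extension --no-parent https://example.com

In short: --recursive and --level=3 control the depth of the crawl, --convert-links rewrites the links so that they work offline, --page-requisites also downloads the images and style sheets needed to display each page, --adjust-extension saves pages with the .html extension and --no-parent prevents the crawler from climbing above the starting address. The files end up in a folder named after the site, from which you can open the pages with your browser.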
How to copy a website and edit it
Have you created your first website and published it online, and would you now like to copy it somewhere else and modify it, so that you can test your improvements in a safe environment, keeping the “official” site safe from changes that could break it?
In this case, a staging platform may come in handy: in case you haven’t heard of it, this is a feature offered by a large number of hosting providers that allows you to copy your website “on the fly”, in a couple of clicks, to a separate environment, through which you can safely edit it.
One platform that allows you to easily create and configure a staging environment is Aruba: I told you about this possibility, in detail, in my guide to Aruba WordPress hosting.