Using wget To Download Entire Websites



Basic wget Commands:
To download a file from the Internet, type:

wget http://www.example.com/downloads.zip

If you are downloading a large file, such as an ISO image, this could take some time. If your Internet connection goes down, you would normally have to start the download again, which is very annoying if you are pulling a 700 MB ISO image over a slow connection. To get around this, use the -c parameter, which resumes the download after any interruption. For example:
wget -c http://www.example.com/linux.iso
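
If the connection is flaky, -c can also be combined with --tries so wget keeps retrying on its own; the retry count below is just an example value:

wget -c --tries=10 http://www.example.com/linux.iso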

I have come across some websites that do not allow you to download files using a download manager. To get around this, change the user agent that wget reports:
wget -U mozilla http://www.example.com/image.jpg

This passes wget off as a Mozilla web browser.
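
Since -U (--user-agent) accepts any string, you can also pass a fuller browser identifier if a site is fussier about what it sees; the exact string below is only an illustration:

wget -U "Mozilla/5.0 (X11; Linux x86_64)" http://www.example.com/image.jpg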

Downloading Entire Sites:
wget is also able to download an entire website. Because this can put a heavy load on the server, wget obeys the site's robots.txt file by default.
wget -r -p http://www.example.com

The -r parameter makes the download recursive, and the -p parameter tells wget to fetch all page requisites, such as images and stylesheets, so the downloaded HTML files look how they should.
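
If you only want part of a site rather than the whole thing, you can limit the recursion depth with -l; the depth of 2 below is just an example value:

wget -r -l 2 -p http://www.example.com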

So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:
wget -r -p -e robots=off http://www.example.com


Many sites will not let you download the entire site, and will check your browser's identity. To get around this, use -U mozilla as explained above.
wget -r -p -e robots=off -U mozilla http://www.example.com

A lot of website owners will not like the fact that you are downloading their entire site. If the server sees you fetching a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds between downloads, which wget does when you include --wait=X (where X is the number of seconds).
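
As a rough sketch, a two-second pause between requests (the value is arbitrary) would look like this:

wget --wait=2 -r -p -e robots=off -U mozilla http://www.example.com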

You can also use the --random-wait parameter to let wget choose a random number of seconds to wait. To include this in the command:
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com


Other Useful wget Parameters:
--limit-rate=20k : Limits the rate at which wget downloads files (20 KB/s in this case).
-b : Runs wget in the background, so the download continues after you log out. Very useful if you are connected to your home PC via SSH.
-o $HOME/wget_log.txt : Logs the output of the wget command to a text file in your home directory. Useful when running wget in the background, as you can check the log for any errors. A combined example is shown below.
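
Putting a few of these together, a resumable background download with a log file and a rate limit might look something like this (the ISO URL is just a placeholder):

wget -b -c --limit-rate=20k -o $HOME/wget_log.txt http://www.example.com/linux.iso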





7 Responses to "Using wget To Download Entire Websites"

Sull said...

Thanks for this!


17 February 2009 at 15:50
Dmitriy said...

My method:
wget -rkpNl5 www.sysadmin.md


22 June 2009 at 12:03
CheckMEout!!!!! said...

Yes, but how can you specify the range the random wait should vary over? Something like --random-wait+0.0-40?


9 October 2009 at 16:33
Moglagh said...

This post has been very helpful. I am a total n00b trying to learn the ropes of Ubuntu Linux. I am currently running 11.04 Natty Narwhal on an HP mini 1010nr. My question is: after I copy a web page with the wget command, where does it store the downloaded site?


31 October 2011 at 18:32
SamwiseGamgee said...

It stores the download in /proc/2144/cwd (if it isn't there, just search the filesystem).
Thanks for the tips from this article. I am downloading the LSL wiki to learn it, but I am having trouble getting a viewer.


29 April 2012 at 00:30
Prasant94 said...

Also include the -k flag; it will convert all the links so you can view the entire site offline.


14 January 2013 at 18:09
cheap web design said...

Awesome collection, thanks for sharing this.


14 February 2013 at 16:04
