Download a Website with Wget

Last modified: Wednesday, March 23rd, 2016

Scraping a Site with Wget

wget -E -k -p -r -nH \
 -P /path/to/output/directory \
 -X /skip-me,/skip-me-too \
 https://example.com/

The above command will download all of the files from the target site and place them in the directory /path/to/output/directory/, except for the contents of /skip-me and /skip-me-too.

The site will then be available for offline viewing, with the exception of the excluded folders. Links to assets in those folders will still point at the live site, and Wget will not pull down files from any domain outside the target site, either.
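As a quick sanity check, the command can be exercised against a tiny local site. This is a sketch, not from the original article: it assumes wget and python3 are on your PATH, and the port (8123) and temp paths are illustrative.

```shell
# Build a two-page fixture site, one page inside a folder we will exclude.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/site/skip-me"
printf '<html><body><a href="skip-me/x.html">x</a></body></html>' > "$tmp/site/index.html"
printf '<html><body>hidden</body></html>' > "$tmp/site/skip-me/x.html"

# Serve the fixture in the background.
(cd "$tmp/site" && exec python3 -m http.server 8123) >/dev/null 2>&1 &
server=$!
sleep 1

# Mirror it with the flags from the article, skipping /skip-me.
wget -E -k -p -r -nH -P "$tmp/out" -X /skip-me http://127.0.0.1:8123/ || true
kill "$server"

# The front page should be local; the excluded folder should not exist.
test -f "$tmp/out/index.html" && echo "index mirrored"
test ! -e "$tmp/out/skip-me" && echo "/skip-me excluded"
```

Because of -nH, the mirror lands directly in the output directory rather than under a 127.0.0.1:8123/ host folder.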

Command Explained


  • -E (--adjust-extension)
    Appends the .html extension to downloaded files where appropriate (e.g. pages served as text/html whose URLs lack an .html suffix).

  • -k (--convert-links)
    Converts links in downloaded documents to point at the local copies, so the site can be browsed offline.

  • -p (--page-requisites)
    Download stylesheets, images, and any other files needed to view the page locally.

  • -r (--recursive)
    Enter subfolders and get their contents, and so on.

  • -nH
    Disable host-prefixed filenames. Prevents the creation of root folders like hostname.tld.

  • -P (--directory-prefix)
    Save the downloaded files in this directory.

  • -X (--exclude-directories)
    Do not download files from within these directories. Links to these pages will be prefixed with the fully qualified domain name and will not be available locally.
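For repeated scrapes, the same behavior can be configured once in a startup file instead of on the command line. A sketch of an equivalent ~/.wgetrc, assuming the option names from the wgetrc command set and the same placeholder paths as above:

```
# -E
adjust_extension = on
# -k
convert_links = on
# -p
page_requisites = on
# -r
recursive = on
# -nH
add_hostdir = off
# -P
dir_prefix = /path/to/output/directory
# -X
exclude_directories = /skip-me,/skip-me-too
```

With this in place, a bare `wget https://example.com/` picks up all of the options automatically.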


GNU Wget 1.17.1 Manual
