Download a Website with Wget

Last modified: Wednesday, March 23rd, 2016

Scraping a Site with Wget

wget -E -k -p -r -nH \
 -P /path/to/output/directory \
 -X /skip-me,/skip-me-too \
 https://example.com/

The above command will download all of the files from the target site and place them in the directory /path/to/output/directory/, except for the contents of /skip-me and /skip-me-too.

The site will then be available for offline viewing, with the exception of the excluded folders. Links to assets in those folders will still point at the live site, and Wget will not pull down files from any domain outside the target site, either.
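As a quick sanity check, the command can be exercised against a tiny local site. This is a sketch, not from the original article: it assumes wget and python3 are on your PATH, and the port (8123) and temp paths are illustrative.

```shell
# Build a two-page fixture site, one page inside a folder we will exclude.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/site/skip-me"
printf '<html><body><a href="skip-me/x.html">x</a></body></html>' > "$tmp/site/index.html"
printf '<html><body>hidden</body></html>' > "$tmp/site/skip-me/x.html"

# Serve the fixture in the background.
(cd "$tmp/site" && exec python3 -m http.server 8123) >/dev/null 2>&1 &
server=$!
sleep 1

# Mirror it with the flags from the article, skipping /skip-me.
wget -E -k -p -r -nH -P "$tmp/out" -X /skip-me http://127.0.0.1:8123/ || true
kill "$server"

# The front page should be local; the excluded folder should not exist.
test -f "$tmp/out/index.html" && echo "index mirrored"
test ! -e "$tmp/out/skip-me" && echo "/skip-me excluded"
```

Because of -nH, the mirror lands directly in the output directory rather than under a 127.0.0.1:8123/ host folder.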

Command Explained


  • -E (--adjust-extension)
    Appends the .html extension to downloaded files where appropriate (e.g. pages served as text/html whose URLs lack an .html suffix).

  • -k (--convert-links)
    Converts links in downloaded documents to point at the local copies, so the site can be browsed offline.

  • -p (--page-requisites)
    Download stylesheets, images, and any other files needed to view the page locally.

  • -r (--recursive)
    Enter subfolders and get their contents, and so on.

  • -nH
    Disable host-prefixed filenames. Prevents the creation of root folders like hostname.tld.

  • -P (--directory-prefix)
    Save the downloaded files in this directory.

  • -X (--exclude-directories)
    Do not download files from within these directories. Links to these pages will be prefixed with the fully qualified domain name and will not be available locally.
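For repeated scrapes, the same behavior can be configured once in a startup file instead of on the command line. A sketch of an equivalent ~/.wgetrc, assuming the option names from the wgetrc command set and the same placeholder paths as above:

```
# -E
adjust_extension = on
# -k
convert_links = on
# -p
page_requisites = on
# -r
recursive = on
# -nH
add_hostdir = off
# -P
dir_prefix = /path/to/output/directory
# -X
exclude_directories = /skip-me,/skip-me-too
```

With this in place, a bare `wget https://example.com/` picks up all of the options automatically.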


GNU Wget 1.17.1 Manual
