wget for data hoarding
Using wget for Large Siterips
wget is a powerful command-line tool for downloading files from the web. It is particularly useful for performing large siterips, where you want to download an entire website or a specific portion of it. In this guide, we will explore some of the most commonly used flags and options that can help with siterips using wget.
Basic Usage
To start a siterip, you can use the following command:
wget -r -np -k -p -e robots=off -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36' https://example.com
This command recursively downloads the website (-r) without ascending into parent directories (-np), fetches the images, stylesheets, and other assets each page needs (-p), and converts links so the local copy can be browsed offline (-k). It also ignores robots.txt (-e robots=off) and sends a user agent string commonly used by web browsers (-U), which helps on websites that block wget's default user agent. Note that wget stays on the starting host by default, so subdomains are not included.
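If you do want the site's subdomains, wget has to be told to span hosts while staying inside the domain. A minimal sketch, assuming everything you want lives under example.com:
wget -r -np -k -p -H -D example.com -e robots=off -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36' https://example.com
Here -H allows wget to leave the starting host, and -D example.com restricts that spanning to hosts whose names end in example.com, so sub.example.com is followed but unrelated third-party hosts are not.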
Downloading Specific File Types
If you only want to download specific file types, you can use the following command:
wget -r -np -k -p -e robots=off -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36' -A jpg,jpeg,gif,png,mp4,webm,webp,mp3,ogg,flac,zip,rar,tar.gz,tar.xz,7z,exe,iso,apk,deb,msi,torrent https://example.com
This command only keeps files with the specified extensions; you can add or remove extensions as needed. Note that wget still downloads HTML pages temporarily so it can follow their links, then deletes any that do not match the accept list.
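For rips that run for hours, it also helps to throttle the crawl and make it resumable. A sketch with illustrative values (the wait time, rate limit, and log file name are assumptions to tune for the site):
wget -r -np -k -p -e robots=off -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36' -w 2 --random-wait --limit-rate=500k -c -o rip.log https://example.com
Here -w 2 waits two seconds between requests, --random-wait varies that delay so the crawl looks less bot-like, --limit-rate=500k caps bandwidth, -c resumes partially downloaded files if the job is interrupted, and -o rip.log writes output to a log you can check later.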