I used to run HTTrack (Windows, GUI version), but most of the time it didn't give me what I wanted.
Here is a complete wget solution:
wget -r --no-parent --mirror --page-requisites --adjust-extension --convert-links --continue -e robots=off https://www.website.com/
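For reference, here is a minimal sketch of the same command with one flag per line and a short note on each. The function name `mirror_site` and the `DRY_RUN` variable are my own illustrative names, not part of wget. I left out `-r` because `--mirror` already turns recursion on. With `DRY_RUN` set, the command is only printed, not executed.

```shell
#!/bin/sh
# Sketch only: mirror_site and DRY_RUN are illustrative names, not wget features.
mirror_site() {
    url=$1
    set -- --mirror                # shorthand for -r -N -l inf --no-remove-listing
    set -- "$@" --no-parent        # never ascend above the starting directory
    set -- "$@" --page-requisites  # also fetch images, CSS, etc. needed to render pages
    set -- "$@" --adjust-extension # save files with matching .html/.css suffixes
    set -- "$@" --convert-links    # rewrite links so the local copy is browsable
    set -- "$@" --continue         # resume partially downloaded files
    set -- "$@" -e robots=off      # ignore robots.txt
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} wget "$@" "$url"
}

# Prints the assembled command; drop DRY_RUN=1 to actually download.
DRY_RUN=1 mirror_site "https://www.website.com/"
```

Running it without `DRY_RUN` starts the real download into the current directory.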
UPD: After a while, though, that still wasn't what I wanted :) I needed to crawl every page that belongs to this website's domain, PLUS the files (such as PDFs) that the website references but that live outside the site's domain. wget, at least with the command above, won't give you this. HTTrack in its console incarnation, however, does. Here is the solution:
httrack --near -u2 -p7 https://YOUR_SITE
Note that its GUI version won't do! I found no switch in the GUI where these flags could be activated, and no place to add command-line flags manually.
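Here is a small sketch that spells out the httrack flags used above, as I understand them from HTTrack's option list. The names `crawl_site` and `DRY_RUN` are hypothetical helpers of mine, not part of HTTrack. With `DRY_RUN` set, the command is only printed.

```shell
#!/bin/sh
# Sketch only: crawl_site and DRY_RUN are illustrative names, not HTTrack features.
crawl_site() {
    url=$1
    set -- --near    # also get non-HTML files linked from a page, even if hosted elsewhere
    set -- "$@" -u2  # always check the document type
    set -- "$@" -p7  # get HTML files first, then treat the other files
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} httrack "$@" "$url"
}

# Prints the assembled command; drop DRY_RUN=1 to actually crawl.
DRY_RUN=1 crawl_site "https://YOUR_SITE"
```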