I need to download all of the pages of a website under www.xyz.com. How do I accomplish this in an automated fashion?
I know I cannot duplicate dynamic content but I should at least be able to download all of the static content.
10 Answers
http://www.httrack.com/ is a Windows & Linux tool for that task.
“HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.”
“It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.”
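If it helps, a typical command-line invocation looks something like this (the output directory and the domain filter here are assumptions you would adapt to your site):

    httrack "http://www.xyz.com/" -O ./mirror "+*.xyz.com/*" -v

The -O flag sets the local output directory, and the trailing filter keeps the crawl from wandering off the xyz.com domain.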
Thank you both. I am trying httrack first; if that fails, I will try wget next.
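For reference, the wget invocation I have in mind is something like this (the exact flag set is my guess at what a full static mirror needs):

    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://www.xyz.com/

Here --mirror turns on recursive retrieval with timestamping, --convert-links rewrites links so the copy browses locally, --adjust-extension adds .html where needed, --page-requisites grabs images/CSS/JS for each page, and --no-parent keeps the crawl under the given path.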
httrack failed; it cannot handle HTTPS.
wget also failed. It retrieved only a few of the pages; the ones hidden behind links generated by JavaScript (within the JavaScript menu) were never downloaded. Are there additional options?
I use a tool called Web Site Downloader. http://www.web-site-downloader.com/entire/
The only time I have had it fail is when there were required proxy server settings I did not have. However, for links generated client-side with JavaScript, you are probably going to have to click through by hand and save each such page individually. That’s one reason not to assign mission-critical behavior like linking to client-side scripting. Another good reason is that most search spiders do not run JavaScript interpreters, so such content is invisible to them as well.
@ETpro: Thank you. I have solved the problem as follows:
1. Have the downloader mirror the site to a local directory.
2. Run a web server pointed at the mirror.
3. Click EVERYTHING (there is a click-and-drag-to-open-links Firefox add-on).
4. Have the downloader fetch everything logged as a 404 by the local web server (see the sketch after this list).
5. Repeat.
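A rough sketch of steps 2 and 4, assuming Python’s built-in web server and an Apache-style access log (field 7 is the request path, field 9 is the status code; the host prefix is just the example domain from the question, so adjust both to your setup):

    # Step 2: serve the mirror locally, sending the request log to a file
    cd mirror && python3 -m http.server 8080 2> ../access.log

    # Step 4 (run from the parent directory): collect every path that
    # returned 404 and hand the list to wget
    awk '$9 == 404 {print "http://www.xyz.com" $7}' access.log | sort -u > missing.txt
    wget --input-file=missing.txt --force-directories --directory-prefix=mirror

--force-directories recreates the URL paths on disk so the fetched files land in the right place inside the mirror.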
wget is cool, but indeed it can’t handle JavaScript. What you also could have done is use the DownThemAll! Firefox extension, I think.
@Vincentt I added that to my list for next time.
Thanks!