Httrack follow redirects

I am trying to mirror web pages starting from a URL provided by the user (with a set of depth restrictions, of course). Wget did not pick up links from CSS/JS, so I decided to use httrack.

I am trying to mirror some sites like this:

 # httrack http://onet.pl -r6 --ext-depth=6 -O ./a "+*"

This site responds with a 301 redirect to http://www.onet.pl:80, and httrack just downloads an index.html page containing:

 <a HREF="onet.pl/index.html" >Page has moved</a> 

and nothing else! When I run:

 # httrack http://www.onet.pl -r6 --ext-depth=6 -O ./a "+*"

it does what I want.

Is there any way to make httrack follow the redirect? Currently I just prepend "www." to the URL passed to httrack, but that is not a real solution (it does not work for every URL a user might supply). Are there any better tools for mirroring sites on Linux?

2 answers

On the main httrack forum, one of the developers said that this is not possible.

The correct solution is to use another web mirroring tool.


You can use this script to determine the real destination URL first, and then run httrack against this URL:

 function getCorrectUrl($url) {
     $ch = curl_init();
     curl_setopt($ch, CURLOPT_HEADER, true);          // include the response headers in the output
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it
     curl_setopt($ch, CURLOPT_URL, $url);
     $out = curl_exec($ch);
     if ($out === false) {
         return $url;  // request failed, fall back to the original URL
     }

     // line endings is the wonkiest piece of this whole thing
     $out = str_replace("\r", "", $out);

     // only look at the headers
     $headers_end = strpos($out, "\n\n");
     if ($headers_end !== false) {
         $out = substr($out, 0, $headers_end);
     }

     $headers = explode("\n", $out);
     foreach ($headers as $header) {
         // header names are case-insensitive, so match "location: " as well
         if (stripos($header, "Location: ") === 0) {
             return trim(substr($header, 10));
         }
     }

     return $url;
 }
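
If you would rather stay in the shell, the same idea can be sketched roughly as below. This is only a sketch under a few assumptions: `extract_location`, `resolve_url` and `mirror_site` are names made up for illustration, `curl` and `awk` are assumed to be installed, and the httrack options are simply the ones from the question. `extract_location` mirrors the header parsing of the PHP function above, while `resolve_url` instead lets curl follow the whole redirect chain and report the final URL through its `%{url_effective}` write-out variable.

```shell
#!/bin/sh

# Print the value of the first Location header found in raw HTTP response
# headers read from stdin; print nothing if there is none.
extract_location() {
    tr -d '\r' | awk '
        tolower($0) ~ /^location:/ {
            sub(/^[^:]*:[ \t]*/, "")   # drop the header name and leading blanks
            print
            exit
        }
    '
}

# Hypothetical helper: let curl follow redirects (-L), discard the body,
# and print the URL it finally ended up at.
resolve_url() {
    curl -s -L -o /dev/null -w '%{url_effective}' "$1"
}

# Hypothetical wrapper: resolve the URL first, then mirror the result
# with the same httrack options as in the question.
mirror_site() {
    target=$(resolve_url "$1") || return 1
    httrack "$target" -r6 --ext-depth=6 -O ./a "+*"
}
```

With these definitions loaded, `mirror_site http://onet.pl` should hand httrack the already-redirected address. Note that this only resolves redirects on the start URL; redirects met later during the crawl are still up to httrack.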