Entries tagged crawler

Crawling with Web Storage in PHP

Posted on 28. September 2019

Some pages show an extra screen before the real site appears, e.g. for age verification (porn or drug-related sites, for now anyway) or country/currency selection (travel sites and big-company sites). From what I have seen, the selection is sometimes stored in a cookie and sometimes in Web Storage.

Crawling with Cookies is no problem with Guzzle.
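For reference, a minimal sketch of that approach (the cookie name age_verified and the domain example.com are made up; use whatever the target site actually sets):

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

// Pre-fill the cookie that the interstitial screen would normally set.
$jar = CookieJar::fromArray(['age_verified' => '1'], 'example.com');

// The client sends the cookie with every request.
$client = new Client(['cookies' => $jar]);
$response = $client->get('https://example.com/');
echo $response->getBody();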

But for example, at https://cannabis.wiki the age verification is stored in Local Storage, as you can see in the Chrome DevTools:

[Screenshot: Chrome DevTools showing the ageConfirmed entry in Local Storage for cannabis.wiki]

Using spatie/crawler we can inject a Browsershot (Puppeteer) instance that sets the localStorage key ageConfirmed through custom JavaScript:

$js = "localStorage.setItem('ageConfirmed', '1');";
$browsershot->setOption('addScriptTag', json_encode(['content' => $js]));

However, the code is only injected after the page has already loaded. Therefore we have to use JavaScript to reload the page:

$js = "localStorage.setItem('ageConfirmed', '1');location.reload()";
$browsershot->setOption('addScriptTag', json_encode(['content' => $js]));
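Putting it together, the wiring could look roughly like this (a sketch assuming a spatie/crawler version that exposes setBrowsershot(); YourCrawlObserver stands in for your own CrawlObserver implementation):

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

$js = "localStorage.setItem('ageConfirmed', '1');location.reload()";

// Browsershot instance with the localStorage snippet injected into every page.
$browsershot = new Browsershot();
$browsershot->setOption('addScriptTag', json_encode(['content' => $js]));

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ->setCrawlObserver(new YourCrawlObserver()) // your own CrawlObserver subclass
    ->startCrawling('https://cannabis.wiki');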

You can test this from the command line, e.g. by calling ->save($pathToFile); and having a look at the resulting screenshot to see whether it worked.
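A quick way to check outside of the crawler, assuming Browsershot and Puppeteer are installed:

use Spatie\Browsershot\Browsershot;

$js = "localStorage.setItem('ageConfirmed', '1');location.reload()";

Browsershot::url('https://cannabis.wiki')
    ->setOption('addScriptTag', json_encode(['content' => $js]))
    ->save('/tmp/age-check.png'); // inspect the screenshot: the age gate should be gone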

Simple wget crawler for a list of files

Posted on 6. August 2015

This script can be helpful for downloading a set of files from a web server that you don't have (S)FTP access to. The input file consists of a list of filenames, one name per line.


#!/bin/bash
# Downloads every file named in the input file (one filename per line)
# from the base URL in WEBSERVER, skipping files that already exist locally.
file=$1
WEBSERVER="http://webserver.tld/folder/"
while IFS= read -r line; do
    FULLURL="$WEBSERVER$line"
    # -nc (--no-clobber): do not download a file that is already present
    wget -nc "$FULLURL"
done < "$file"

First, the first command-line argument is saved into the variable file. Then the web server address is stored in the WEBSERVER variable. IFS stands for Internal Field Separator; clearing it and using read -r makes the while loop read the file line by line without trimming whitespace or interpreting backslashes. Inside the loop, the line that was read is concatenated with the web server address into FULLURL, and wget fetches that URL. The -nc (no-clobber) option skips the download if the file already exists in the current folder.
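For example, assuming the script is saved as download-list.sh, made executable, and filenames.txt contains the list of filenames:

./download-list.sh filenames.txt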

You can find the script on GitHub.