Entries tagged browsershot

Crawling with Web Storage in PHP

Posted on 28. September 2019 Comments

Some pages have an extra screen before the real site appears, e.g. for age verification (like porn, for now anyways, or drug sites) or country/currency selection (like travel and big company sites). As far as I researched this, sometimes the selection is stored in a cookie, sometimes in Web Storage.

Crawling with Cookies is no problem with Guzzle.

But for example, at https://cannabis.wiki the age verification is stored in the Local Storage as you can see in the Chrome DevTools:

image

Using spatie/crawler we can inject a Browsershot (puppeteer) instance, that sets the localStorage key ageConfirmed through custom JavaScript:

$js = "localStorage.setItem('ageConfirmed', '1');";
$browsershot->setOption('addScriptTag', json_encode(['content' => $js]));

However, the code is only injected after the side is already loaded. Therefore we have to use JavaScript to reload the page:

$js = "localStorage.setItem('ageConfirmed', '1');location.reload()";
$browsershot->setOption('addScriptTag', json_encode(['content' => $js]));

You can test this on the console e.g. by using ->save($pathToFile); and have a look at the screenshot to see if it worked.