We use our web crawler, known technically as Schemabot, for two purposes. First, it discovers schema markup on your website to inform reports like our Analyzer and Trend Report. Second, if you are a Highlighter customer, it can generate markup asynchronously.


Reason 1: In both scenarios, our crawler goes through your website to collect or generate the most current data. The crawler typically runs weekly and works through your entire website to refresh the data. Each crawl attempts to find a sitemap and queue all of its webpages. The crawl then starts on the home page and follows and queues the links discovered on each page.
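The queue-and-follow behavior described above is essentially a breadth-first traversal. Here is a minimal sketch of that crawl order, using an in-memory "site" instead of real HTTP requests; the structure and names are illustrative, not Schemabot's actual implementation.

```python
from collections import deque

# Toy link graph standing in for a real website (illustrative only).
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/widget", "/"],
    "/about": [],
    "/products/widget": ["/about"],
}

def crawl(start="/"):
    """Breadth-first crawl: queue the start page, then follow and
    queue every link discovered on each page, skipping pages that
    were already queued."""
    visited = []
    queue = deque([start])
    seen = {start}
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl())
# ['/', '/products', '/about', '/products/widget']
```

Note how the `seen` set guarantees each page is fetched only once per crawl, even when pages link back to each other.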


Reason 2: For each page load, the crawler performs a GET request to retrieve the HTML, then renders the page's HTML and JavaScript to get a complete picture of the website. Modern JavaScript frameworks enable many complex interfaces and features, which is great. In some cases, the JavaScript may execute callbacks to your website, and if it does, those may appear as a second request with Schemabot as the User-Agent. WordPress websites, for example, may have plugins that run JavaScript. These may show up as POST callbacks to /wp-admin/admin-ajax.php, so your server logs might look as if the crawl is making duplicate requests, when in fact it is the JS rendering. To diagnose a situation like this, open the webpage and use the Developer Tools > Network tab to see the callbacks your page makes to the server.
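If you'd rather check your server logs directly, a quick way to tell the initial page load apart from JS-driven callbacks is to group Schemabot requests by HTTP method. This is a hypothetical sketch (the log lines and regex are assumptions about a common combined-log layout, not an official tool):

```python
import re

# Sample log lines in a simplified combined format (illustrative only).
LOG_LINES = [
    '203.0.113.7 - - [10/May/2024:12:00:01] "GET /blog/post HTTP/1.1" 200 "Schemabot"',
    '203.0.113.7 - - [10/May/2024:12:00:02] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 "Schemabot"',
]

PATTERN = re.compile(r'"(?P<method>GET|POST) (?P<path>\S+) [^"]*" \d+ "(?P<agent>[^"]*)"')

def classify(lines):
    """Split Schemabot requests into page loads (GET) and
    JS-rendering callbacks (POST)."""
    loads, callbacks = [], []
    for line in lines:
        m = PATTERN.search(line)
        if not m or "Schemabot" not in m.group("agent"):
            continue
        (loads if m.group("method") == "GET" else callbacks).append(m.group("path"))
    return loads, callbacks

loads, callbacks = classify(LOG_LINES)
print(loads)      # ['/blog/post']
print(callbacks)  # ['/wp-admin/admin-ajax.php']
```

In the example, the GET is the crawl itself and the POST to /wp-admin/admin-ajax.php is a callback triggered during rendering, not a duplicate crawl of the page.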


Other Reasons: Honestly, the web is vast, and we can't predict every scenario. We sometimes encounter broken relative links or GET parameters that create strange recursive URL patterns. While we have heuristics to avoid this, if it looks like the crawler is stuck somewhere or behaving inefficiently, please let us know. We don't want to waste your resources or ours, so let's optimize the effort.
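To illustrate the kind of heuristic that guards against recursive GET-parameter patterns, here is a sketch of URL normalization (not Schemabot's actual logic): sorting and de-duplicating query parameters so the same page reached through different parameter orderings is only queued once.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    """Canonicalize a URL so equivalent variants compare equal in the
    crawler's visited set: sort query parameters, drop repeated keys,
    and strip trailing slashes from the path."""
    parts = urlsplit(url)
    # dict() keeps the last value for a repeated key, collapsing
    # accumulating patterns like ?page=1&page=2&page=3
    query = urlencode(sorted(dict(parse_qsl(parts.query)).items()))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))

seen = set()
for url in [
    "https://example.com/list?page=2&sort=asc",
    "https://example.com/list?sort=asc&page=2",
    "https://example.com/list/?page=2&page=2&sort=asc",
]:
    seen.add(normalize(url))
print(len(seen))  # 1 (all three are the same logical page)
```

Without normalization, each parameter permutation would look like a new page and could keep the crawler busy indefinitely.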