Overlay | Cued | Tests | Config | Per Page | Archives | Bottom

JavaScript website crawler you can run on your Mac in Safari (client-side)

About this JavaScript Client-side Website Crawler...

Before you start: enable the 'Develop' menu (Safari > Preferences > Advanced), then in the 'Develop' menu check the option to 'Disable Cross-Origin Restrictions'. Refresh this crawler webpage itself, enter a website address and click CRAWL. Then let your Mac create actionable data to improve your site! The crawler outputs spreadsheets for things like SEO headlines, keyword repeat density analysis, crawling Google for competitor SEO analysis, QA sanity checks, debugging code, HTML source extraction... and much, much more. This crawler does not butter your toast. It does pretty much everything else though. (LOL)
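For the curious, here is a rough sketch (not this tool's actual source) of the general idea behind a single client-side crawl step: open the page in a pop-up window, wait for it to load, then read its DOM. It assumes cross-origin restrictions are disabled as described above; the window name, field names and polling interval are illustrative only.

// Minimal sketch of one client-side crawl step (illustrative, not the tool's real code).
// Assumes 'Disable Cross-Origin Restrictions' is on in Safari's Develop menu.
function crawlOnePage(url, onPageData) {
  var win = window.open(url, 'crawlerWin', 'width=1200,height=800'); // pop-up crawl window
  var timer = setInterval(function () {
    try {
      if (!win || !win.document || win.document.readyState !== 'complete') return;
      clearInterval(timer);
      var doc = win.document;
      onPageData({
        url: win.location.href, // final URL (after any redirect)
        title: doc.title,
        h1: (doc.querySelector('h1') || {}).textContent || '',
        links: Array.prototype.map.call(doc.querySelectorAll('a[href]'), function (a) { return a.href; }),
        bodyText: doc.body ? doc.body.innerText : ''
      });
      win.close();
    } catch (e) { /* page not readable yet (or blocked); keep waiting */ }
  }, 500);
}

// Usage: crawlOnePage('https://example.com/', function (data) { console.log(data); });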

If I had to guess, I'd say this crawler is a bit like a super-simplified version of what Google's page crawler might be. Perhaps.

About this client-side JavaScript website crawler

I built this JavaScript website crawler to help people scan entire sites, looking at every page and capturing data about each one: SEO reviews, content audits, debugging code, finding the needle in the haystack, extraction of page text and page code, and much more, including:

- Scan pages for keyword repeat density automatically on each page crawled, in one of three ways (see the sketch after this list):
- smart filters (based on the keywords you enter and the characters and words around them)
- strict filters (just the keywords you enter)
- no filter (all the words on the page)

- Find the "needle in the haystack" (up to 5 needles per crawl) with this client-side JavaScript website crawler, then get a list of all the incoming links to pages with the "needle in the haystack"

- An actionable fix-it list. Easy. Just type in generic parts of "error page" text (e.g. "not found") to find all pages with that text (the needle) throughout your site (the haystack). Then, "1-click" later, get the list of all the incoming links to all those pages. It goes a few levels deep and saves potentially hours of spreadsheet dribble drabble tosh. I hated doing that by hand. So I automated it.

- AA accessibility for images and alt tag descriptions... made easy. It makes a spreadsheet checklist of images and their "alt tag" text, so you can spot the missing alt values and add them for good AA accessibility ratings.

- Search Engine Optimization (SEO) site audits (capture the links list, H1, H2, H3, H4 headlines, URLs, body text and more, automatically with this client-side JavaScript website crawler)

- Likely Landing Page spreadsheet (beta) - copy, paste, sort by keyword density in descending order, then filter your view by the keyword you want... and the pages near the top of the list will most likely be the ones users land on after searching Google, Bing, DuckDuckGo and others.

- View all pages of your site, in both mobile and desktop view side-by-side for QA testing, with this client-side JavaScript website crawler

- Crawl from your site and (optionally) follow links to other domains, subdomains and/or URLs with a wildcard term (like your brand name) in them... So if you are crawling a site that has many subdomains, you can crawl them all in one go. Or follow links with your brand mentioned in any URL found on any site.

- Find broken links and redirected links easily - site-wide or multi-site wide, with this client-side JavaScript website crawler

- Crawl test sites that aren't public yet. So if you are building a new site to replace one that is live now, you can crawl and compare them both.
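As a rough illustration of the keyword repeat density idea in the first item above, here is a minimal sketch of one way to count repeats under the three filter modes. The 'smart'/'strict'/'none' behaviour shown is my reading of those modes, not this tool's exact logic, and the function name is made up.

// Sketch of keyword repeat density counting (illustrative; the tool's exact filter logic may differ).
// mode: 'strict' = count only the keywords you enter,
//       'smart'  = also count words sitting right next to those keywords on the page,
//       'none'   = count every word on the page.
function keywordDensity(bodyText, keywords, mode) {
  var words = bodyText.toLowerCase().match(/[a-z0-9'-]+/g) || [];
  var wanted = keywords.map(function (k) { return k.toLowerCase(); });
  var counts = {};

  words.forEach(function (word, i) {
    var keep =
      mode === 'none' ||
      wanted.indexOf(word) !== -1 ||
      (mode === 'smart' && (wanted.indexOf(words[i - 1]) !== -1 || wanted.indexOf(words[i + 1]) !== -1));
    if (keep) counts[word] = (counts[word] || 0) + 1;
  });

  // Return rows sorted by repeat count, ready to paste into a spreadsheet.
  return Object.keys(counts)
    .map(function (w) { return { word: w, count: counts[w], density: counts[w] / words.length }; })
    .sort(function (a, b) { return b.count - a.count; });
}

// Usage: keywordDensity(document.body.innerText, ['crawler', 'seo'], 'smart');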


How to use this Client-side JavaScript Website Crawler:


- go to http://crawler.fyi/crawler/
- type in your homepage URL (make sure you are NOT logged into a CMS for the duration of your crawl)
- press CRAWL
It will then crawl your site, looping through a few steps on each page. When it has completed every page of the site, it will prompt you to copy and paste the data into a Spreadsheet.

Options you can change, while this Client-side JavaScript Website Crawler is crawling your site pages:


- adjust the speed of the crawl to match how fast your pages load, by pressing SLOWER or FASTER a few times; the duration of the delay between steps will change accordingly.
- Copy and paste CUED, DONE, and SEO DATABASE while the crawl is in progress, and let it carry on crawling


Options to set up before starting a crawl, for this Client-side JavaScript Website Crawler:


- If you want to crawl from more than one domain, you can. With this client-side JavaScript website crawler, you can crawl up to 30 domains/subdomains in one go. Or add a wildcard domain, and it will crawl any link containing it. Great for crawling big corporate sites and internal intranet sites too.
- Type in some text or code that you are looking for on each page, and it will help you find those "needles in the haystack": click the (x) to close the top layer, scroll down to the NEEDLE IN HAYSTACK text fields, and type in up to 5 different bits of text or code you are looking for. Then start your crawl as normal, and it will list any pages that contain those "needles in the haystack".
- Extract all HTML for each page, and convert it to a single line of HTML that can be put into a spreadsheet.
- Extract by DIV-ID or CLASS-NAME too, if you like. So you can capture just the bits you want on your crawl data spreadsheet, with this client-side JavaScript website crawler

Check it out at:
http://crawler.fyi/crawler/

Use it and enjoy it! And if you have an idea to extend it, please DO share. Don't hack/blag/pretend you wrote that code and try to sell it. That'd be stupid. And illegal.

Roadmap: (These are features not yet coded up, but planned or possibly planned for future versions.)
- Insertable string manipulation and database-building function manipulation in the core script.
Via POST, à la PHP. You enter changes to the Build SEO Database function/code, press POST, it writes to the form results page and you execute your code changes. They are retained in the form under the crawl controller modal where they were entered, so you can save them into your favourite text editor or as a browser form-field preset; customising is easy. Do this at your own risk.
- Combine 'needle in the haystack 1' with 'needle in the haystack 2' into a new variable called bothNeedles, then save it into your edited Build SEO Database function above.
- Option to split SEO Database into multiple textareas, by domain, if crawling more than one domain, and a checkbox is checked for it. Might be handy for some people? Dunno.
- Old site versus New Site - Side by Side Page comparison with prompts for QA feedback and notes that get added to the SEO Database inline with Page Crawl Data.
- Whatever other ideas I can think of or you suggest.

Who made this JavaScript Client-side Website Crawler?

Terence Chisholm

Check me out at terencechisholm.com/cv

Legal

This is mine. Copyright © Terence Chisholm 2021. All rights reserved.

Use at your own risk. And Log out of your CMS first.

If you crawl links that trigger things like deleting pages on your site, it's not my fault. ALWAYS log out of your CMS before crawling your site. Otherwise it will find CMS links that edit, delete or otherwise mess up your website. Also, watch the screen as it crawls, and check the CUED list to be sure no links that you don't want to activate are in the 'cued' list for it to crawl.

If you use this crawler on your site, make sure you are NOT logged into your CMS. If you allow the crawler to activate links that DELETE, REVISE, EDIT, PUBLISH or otherwise alter content in the Content Management System that you used to build your site, then the crawler will follow each of those links and delete or mess up things on your site. Log out of your CMS before crawling your site. That way, it will only follow links that the public can follow too, and it will not mess up your site. Alternatively, you can watch the CUED links fill up, then pause and remove unwanted links from that list of links it will crawl. You can also set options to NOT crawl links that contain any bit of text you specify. For example, if you use WordPress, then "/wp-login" and "/wp-admin" are folders you won't want to crawl. So you can add those to the section marked "Don't crawl URLs with these characters in it (instead treat them like external links):" above.

Crawl data...

Cued URLs:

Done URLs:

External/Filtered out URLs:

SEO DB spreadsheet:

If the above one gets too big... put the SEO DB spreadsheet here whilst looping, and move it above after the full crawl is done.
(If crawling more than 100 pages, every 100 pages we will copy and paste it all into this textarea below, so that the time that takes doesn't delay the per-page looping on the rest of the crawl.)

Voice Recording Note on page when crawling (use slower speed crawl for this):

Flag:
Images & Alt Tags spreadsheet:

All Needles in Haystack spreadsheet (part 1):

Incoming Links to Needle in Haystack pages spreadsheet (part 2):
After a crawl where searches for 'needles' show you the pages that contain them, this optional set of spreadsheet rows can find and add all the incoming links to those pages. So if you use the needle-in-haystack feature to find unpublished/broken pages, you can use this to find out which pages link to each unpublished/broken page too. Job done.
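A rough sketch of how this part 2 step might be derived from the crawl data: given each crawled page's list of outgoing links and the list of pages where a needle was found, invert the link map. The data shapes and function name here are assumptions for illustration, not this tool's internal format.

// Sketch: build an "incoming links" list for pages that contained a needle (illustrative only).
// crawled     = [{ url: 'https://site/a', links: ['https://site/b', ...] }, ...]   (assumed shape)
// needlePages = ['https://site/b', ...]   (pages where a needle was found)
function incomingLinksToNeedlePages(crawled, needlePages) {
  var incoming = {};
  needlePages.forEach(function (p) { incoming[p] = []; });

  crawled.forEach(function (page) {
    page.links.forEach(function (link) {
      if (incoming.hasOwnProperty(link) && incoming[link].indexOf(page.url) === -1) {
        incoming[link].push(page.url); // page.url links to a needle page
      }
    });
  });

  // One tab-delimited row per needle page, ready to paste into a spreadsheet.
  return needlePages.map(function (p) {
    return p + '\t' + incoming[p].join(' | ');
  }).join('\n');
}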


Part 2 scan (after crawl)
Landing Page(s) Spreadsheet to figure the likely landing page based on Keyword Repeat Density
Looking for repeating keywords from the page URL, Page Title & Body Text visible on page just after the page loads.
After the crawl, copy and paste this into a spreadsheet, then sort by Density in descending order, then filter by the Keywords column in Excel to view only the keywords you are reviewing, and you will see the most likely top landing pages in order from most keyword dense to least. (Note: Google may rank some pages higher without much keyword repeat density, if Google thinks that site's Page Rank is directly related to the topic or that it is a known/endorsed/trusted/expert site on the subject. For example, when searching for "us news", Google knows that sites like CNN are MORE known for "us news" than your blog will be, no matter how many times you put "us news" all over your site.)


If the above one gets too big... put the Landing Page(s) spreadsheet here whilst looping, and move it above after the full crawl is done.
(If crawling more than 100 pages, every 100 pages we will copy and paste it all into this textarea below, so that the time that takes doesn't delay the per-page looping on the rest of the crawl.)

Test Steps Results spreadsheet Per Page (BETA):

Test Steps Results1 Collated (site wide):


Crawl configuration


Scan URL:

URL of Pixel to Inject when capturing page data:
SEO Keywords/Key-phrases
Keyword repeat density scans on each crawled page. (comma separated list)


Filter the words as you scan for repeating keywords?






ScrollBy (pixel number):

Scroll Interval (milliseconds number):
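The ScrollBy and Scroll Interval settings presumably control how the crawl window is scrolled while each page is open, so lazy-loaded content has a chance to appear before data capture. A minimal sketch of that kind of auto-scroll, with the two numbers as parameters; the function name is mine, not the tool's.

// Sketch: auto-scroll a crawl window so lazy-loaded content appears before data is grabbed.
// scrollByPx = the "ScrollBy (pixel number)" setting
// intervalMs = the "Scroll Interval (milliseconds number)" setting
function autoScroll(win, scrollByPx, intervalMs, onDone) {
  var timer = setInterval(function () {
    win.scrollBy(0, scrollByPx);
    var doc = win.document.documentElement;
    var atBottom = win.pageYOffset + win.innerHeight >= doc.scrollHeight - 1;
    if (atBottom) {
      clearInterval(timer);
      if (onDone) onDone();
    }
  }, intervalMs);
}

// Usage: autoScroll(popupWindow, 400, 250, function () { /* grab page data next */ });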


Crawl Search Engines, to get data from top-ranking competitor sites
before crawling your site, for automated SEO comparison?


Name: BaseURL: Filter out links with:
Name: BaseURL: Filter out links with:

Running a normal crawl that adds new links to the Cued URLs list above?



Want to write an SEODB (Search Engine Optimiser) Database?



Want to save a copy of each crawled page data item to a web server?


If so, put in the Endpoint URL you want to post this data to:

Use: https://web.archive.org/save/ to post to the archive.org Internet Archive (Public archive of web pages as they were found on given days)
Or use: whatever endpoint URL you enter.


Field name: Enter the name of the field you want this custom token string to be posted from. Check which system is receiving the data string, and ask them for their schema (the list of incoming data field names they accept). Then pick the one you want to post your custom data string into, and enter that field name below:


Submit via POST or GET method, or inject a form here and submit it?


(Use the 'myCustomTokensDataString' field for this form, name the form with the ID & name "MyBespokeForm", and we will trigger a submission of that form for each page crawled.)

If the data is sent via GET, it will put the URL it's just scanned, after the Endpoint URL you enter above. (Sample: https://web.archive.org/save/http://terry.site or https://your.com/save.php?url=http://terry.site)
If the data is sent via POST, in this data schema:
pageURL = Page URL
pageDATA = HTML Source of page
seoDB_LINE = Page SEO Data: (In this tab delimited fields schema: SEE FIRST HEADER ROW OF SEO DB)
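A sketch of the two send modes described above, using the pageURL / pageDATA / seoDB_LINE field names from the schema. The GET example simply appends the scanned URL to the endpoint (as in the archive.org sample); the POST example builds and submits a hidden form. This is an illustration of the described behaviour, not the tool's actual code.

// Sketch of sending crawl data to an endpoint, per the schema above (illustrative only).

// GET: append the just-scanned URL to the endpoint, e.g. https://web.archive.org/save/http://terry.site
function sendViaGet(endpointUrl, pageUrl) {
  var img = new Image();     // fire-and-forget request; the response is ignored
  img.src = endpointUrl + pageUrl;
}

// POST: submit pageURL, pageDATA and seoDB_LINE as form fields to your own endpoint.
function sendViaPost(endpointUrl, pageUrl, pageHtml, seoDbLine) {
  var form = document.createElement('form');
  form.method = 'POST';
  form.action = endpointUrl;
  form.target = '_blank';    // don't navigate the crawler page itself

  [['pageURL', pageUrl], ['pageDATA', pageHtml], ['seoDB_LINE', seoDbLine]].forEach(function (pair) {
    var field = document.createElement('input');
    field.type = 'hidden';
    field.name = pair[0];
    field.value = pair[1];
    form.appendChild(field);
  });

  document.body.appendChild(form);
  form.submit();
  form.remove();
}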


Or instead of the above options, if you want to load a form page template (with tokens) from your own server, enter the URL of that below:
[Load it]

Your form code should NOT contain the <form> and </form> tags, but should have all the fields you would put between those tags, with tokens in place of values.
For each page crawled, your form code will be injected into a DIV on this page, with the tokens replaced by values, and an auto-submit of the form to your data endpoint.
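A sketch of how such a token form template could be injected and auto-submitted per page, using the 'MyBespokeForm' ID/name and 'myCustomTokensDataString' field mentioned above. The token syntax and the container ID used here are illustrative assumptions, not the tool's actual names.

// Sketch: inject a user-supplied form template (fields only, no <form> tags),
// replace tokens with this page's values, and auto-submit it (illustrative only).
function injectAndSubmit(templateHtml, endpointUrl, tokens) {
  // tokens is an object such as { '{{pageURL}}': url, '{{pageTITLE}}': title } (token names assumed)
  var filled = templateHtml;
  Object.keys(tokens).forEach(function (token) {
    filled = filled.split(token).join(tokens[token]);
  });

  var container = document.getElementById('formInjectionHere') || document.body; // assumed container DIV
  var form = document.createElement('form');
  form.id = 'MyBespokeForm';
  form.name = 'MyBespokeForm';
  form.method = 'POST';
  form.action = endpointUrl;
  form.target = '_blank';
  form.innerHTML = filled;   // the template's fields, tokens already replaced

  container.appendChild(form);
  form.submit();
  form.remove();
}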

Show mobile view too?



Test Steps?

rebuildEntriesDB from WebForm


Where phrases with Latin characters (other than XXXXXX) are extracted and highlighted for translation QA confirmation, so after importing a translation you can spot missing texts, which might be:
- Checks for any required needle in the haystack, such as the code that says what language the page is in, like: <option value="es" selected="selected">, then collates missing texts only for pages that have that required needle, not for pages in other languages
- Finds and makes a list of hard-coded words/phrases in templates which are in Latin-based characters and might need to be integrated into the CMS in editable and translatable ways
- Words/phrases with Latin characters and/or numbers


Home URL to start crawl on:

Staging URL to start Compare crawl on (optional):
Add your staging root domain below, to compare the 'Home URL' root domain pages to those on the 'staging root' domain (below) and check for different text on the pages (when the character length of the body text differs between the Home and Staging versions, it will show the difference in the SEODB output):


Complete Domains List to follow links on in crawl (the last 4 may auto-cue):

(These last 4 domains will be added to the cue after config is saved)

Don't crawl URLs with these characters in it (instead treat them like external/omitted links to be filtered out of the cue and are not crawled):

After the omission filters above are applied, you can allow some links to bypass omission by adding a
wildcard part of the URL (for example: "/en/" would crawl ALL links to any URL that has "/en/" contained within it.)

Wildcard (optional) filter to also require these characters in the URLs (leave blank for all, or add "mysite.com" here and "/en/" above to crawl pages with BOTH "/en/" and "mysite.com" in their URLs)
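Taken together, the omit list, the bypass wildcard and this required wildcard describe a filter roughly like the sketch below. This is my reading of the settings; the exact precedence in the tool may differ.

// Sketch of the URL filtering described above (my reading; the tool's precedence may differ).
// omitParts      = the "Don't crawl URLs with these characters in it" list, e.g. ['/wp-admin', '/wp-login']
// bypassWildcard = wildcard that lets a URL skip omission, e.g. '/en/'
// requiredPart   = optional wildcard the URL must also contain, e.g. 'mysite.com' ('' = no requirement)
function shouldCrawl(url, omitParts, bypassWildcard, requiredPart) {
  if (requiredPart && url.indexOf(requiredPart) === -1) return false; // missing required text

  var omitted = omitParts.some(function (part) { return url.indexOf(part) !== -1; });
  if (omitted && !(bypassWildcard && url.indexOf(bypassWildcard) !== -1)) return false;

  return true; // goes into the cued list; otherwise it is treated as an external/filtered-out link
}

// shouldCrawl('https://mysite.com/en/page', ['/wp-admin'], '/en/', 'mysite.com')  -> true
// shouldCrawl('https://mysite.com/wp-admin/edit', ['/wp-admin'], '/en/', '')      -> false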


Loop delay speeds: Steps 1,2,3...
(step 1 is for new page loading & js injection, 2 = grab page data, 3 = send back data & repeat loop)
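Those three delays suggest a per-page loop along these lines. A rough sketch with placeholder function names, just to show how three chained timeouts could map to the steps; it is not the tool's actual loop.

// Sketch of the three-step per-page loop implied above (function names are placeholders).
// delays = [step1Ms, step2Ms, step3Ms] -- the three "loop delay speeds"
function crawlLoop(queue, delays, openPage, grabPageData, sendBackData) {
  if (queue.length === 0) return;                    // nothing left in the cue
  var url = queue.shift();

  openPage(url);                                     // step 1: new page loading & JS injection
  setTimeout(function () {
    var data = grabPageData();                       // step 2: grab page data
    setTimeout(function () {
      sendBackData(url, data);                       // step 3: send back data...
      setTimeout(function () {                       // ...then repeat the loop for the next cued URL
        crawlLoop(queue, delays, openPage, grabPageData, sendBackData);
      }, delays[2]);
    }, delays[1]);
  }, delays[0]);
}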

Pop Up Window Format String(s)
(desktop & mobile, desktop only)



Fix it lists, automatically created using "Needle in Haystack" searches on each page:
Enter a text or code string below, and the crawler will highlight where it finds the entered text/code in the page.
Case-insensitive needles in haystack, where a search for "tHiS", "this" or "THIS" returns the SAME RESULTS.

Needle 1: in + times
Needle 2: in
Needle 3: in
Needle 4: in
Case-sensitive needles in haystack, where a search for "tHiS", "this" or "THIS" each returns DIFFERENT RESULTS.
Needle 5: in

Check for required text in the URL before adding new links found (on that page) to the crawler cued list?
If you want to add pages to the crawl cue that have /en/ in them, type: "/en/" below - and when we land on a page without /en/ in the URL, we won't add any links found on that page to the crawl cue.
Check for required text/code before adding new links found (on that page) to the crawler cued list? Required in URL:

If you want to check for "text or code" that should appear on every page you crawl, enter it below, and we will flag any pages which don't have this text/code on it as a "Required Needle in Haystack that's missing".
Required: in
Sample: <option value="es" selected="selected"> = Spanish Lang selected on page load.
If a page doesn't have the required needle text/code (above), do you still want to add the links you find on that page to the crawling cue?
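A sketch of how the five needles and the required needle could be checked against a page's HTML: needles 1-4 case-insensitive, needle 5 case-sensitive, as described above. The shapes and names here are illustrative, not the tool's internals.

// Sketch: check a page's HTML for the five needles plus the required needle (illustrative only).
// Needles 1-4 are case-insensitive; needle 5 is case-sensitive, per the settings above.
function checkNeedles(pageHtml, needles, requiredNeedle) {
  var lowerHtml = pageHtml.toLowerCase();
  var found = needles.map(function (needle, i) {
    if (!needle) return false;
    return i < 4
      ? lowerHtml.indexOf(needle.toLowerCase()) !== -1  // needles 1-4: case-insensitive
      : pageHtml.indexOf(needle) !== -1;                // needle 5: case-sensitive
  });

  var requiredFound = !requiredNeedle || pageHtml.indexOf(requiredNeedle) !== -1;
  return { found: found, requiredFound: requiredFound }; // requiredFound -> "Required Needle Found on this page?"
}

// Usage: checkNeedles(doc.documentElement.outerHTML,
//                     ['not found', '', '', '', 'tHiS'],
//                     '<option value="es" selected="selected">');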



Bespoke content extraction for part or all of the HTML source?




If YES, enter the DIV-ID or CLASS name (for content you want to extract on this crawl) below:


If YES, when extracting my HTML, get the outer HTML or inner HTML?
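A sketch of the bespoke extraction step: grab the element by DIV-ID (or the first element with that class name) and take its outer or inner HTML, flattened to one line so it fits in a spreadsheet cell. This is an illustration of the option described above, not the tool's code.

// Sketch: bespoke extraction by DIV-ID or CLASS name, outer vs inner HTML (illustrative only).
function extractBespoke(doc, idOrClass, useOuterHtml) {
  var el = doc.getElementById(idOrClass) || doc.getElementsByClassName(idOrClass)[0];
  if (!el) return '';                                 // nothing matching on this page

  var html = useOuterHtml ? el.outerHTML : el.innerHTML;
  // Flatten to a single line so it pastes cleanly into one spreadsheet cell.
  return html.replace(/[\r\n\t]+/g, ' ').replace(/ {2,}/g, ' ').trim();
}

// Usage: extractBespoke(crawlWindow.document, 'main-content', true);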




Crawl links to PDF and Images as you would pages?



Windows during loop to:



Treat what comes after question marks in URLs as:


Treat hashtags in URLs as:
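These two settings amount to a URL normalisation choice: either strip the query string and/or hash fragment so variants count as the same page, or keep them so each variant is cued separately. A small sketch of that idea; the option names are mine.

// Sketch: treat query strings and hash fragments as part of the page URL, or strip them (illustrative only).
// queryMode / hashMode: 'keep' = each variant is cued as its own page, 'strip' = treat it as the same page.
function normaliseUrl(url, queryMode, hashMode) {
  var u = new URL(url);
  if (hashMode === 'strip') u.hash = '';
  if (queryMode === 'strip') u.search = '';
  return u.toString();
}

// normaliseUrl('https://site.com/page?utm=x#top', 'strip', 'strip') -> 'https://site.com/page'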




Want to Auto Start at the time below, each day, then after the crawl, auto-refresh the window and start again the next day?


If yes, start at what time each day? (24 hour clock time)

If yes, at 23:59 we will clear your browser's crawl data and start fresh just after midnight, then wait for the time entered above and restart your daily crawl.
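A sketch of a daily auto-start check along those lines: compare the clock against the configured start time a couple of times a minute, and kick off the crawl when it matches. The 23:59 data reset described above is not shown; only the start-time check is.

// Sketch: auto-start the crawl at a set time each day (illustrative only).
// startTime is a 24-hour "HH:MM" string, e.g. "03:30".
function scheduleDailyCrawl(startTime, startCrawl) {
  var started = false;
  setInterval(function () {
    var now = new Date();
    var hhmm = ('0' + now.getHours()).slice(-2) + ':' + ('0' + now.getMinutes()).slice(-2);
    if (hhmm === startTime && !started) {
      started = true;
      startCrawl();                            // kick off today's crawl
    }
    if (hhmm === '00:00') started = false;     // arm again for the next day
  }, 30 * 1000);                               // check twice a minute
}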



Open Overlay (press 'ESC' key) & Start Crawling



Per Page Data captured


Just scanned this URL:

Redirected URL (if Scan URL link redirects, this will show the destination's redirected URL):

Short Link URL (if shortlink in head of page):

Canonical URL (if in head of page):

Compare URL (if comparing Staging to Production):
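These URL fields can all be read from the loaded page. A sketch of where each one typically comes from (the redirected URL being the crawl window's final location, the shortlink and canonical coming from <link> tags in the head); the function and field names are illustrative.

// Sketch: where the URL fields above typically come from on a loaded page (illustrative only).
function readUrlFields(win, scannedUrl) {
  var doc = win.document;
  var shortlink = doc.querySelector('link[rel="shortlink"]');
  var canonical = doc.querySelector('link[rel="canonical"]');
  return {
    scannedUrl: scannedUrl,                   // the URL we asked to scan
    redirectedUrl: win.location.href !== scannedUrl ? win.location.href : '',  // destination if it redirected
    shortlinkUrl: shortlink ? shortlink.href : '',
    canonicalUrl: canonical ? canonical.href : ''
  };
}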


Page title:

Page Links Count:

Links on page:

Page H1:

Page H2s:


Page H3s:


Page H4s:


Page H5s:


Page H6s:


Required Needle Found on this page? (YES or NO)


No. of Forms on Page:

SEO Tips:

Page Body (HTML):

Page Body (TEXT):

Compare Page Body (TEXT):

Page Raw HTML:

Extracted HTML:

DOCTYPE:

Meta Keywords:

Meta Description:

Images:

Images (without Alt Tags) checking for accessibility:

Count of Images (without Alt Tags) found on this page:
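A sketch of the images/alt-text accessibility check behind these three fields: list every image with its alt text and count the ones with no alt value. Illustrative only, not the tool's code.

// Sketch: list images with their alt text and count the ones missing alt values (illustrative only).
function imageAltAudit(doc) {
  var images = Array.prototype.slice.call(doc.querySelectorAll('img'));
  var rows = images.map(function (img) {
    return img.src + '\t' + (img.getAttribute('alt') || '[MISSING ALT]');
  });
  var missingCount = images.filter(function (img) {
    var alt = img.getAttribute('alt');
    return alt === null || alt.trim() === '';
  }).length;

  return { rows: rows.join('\n'), missingCount: missingCount }; // paste rows into the spreadsheet
}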




Next up in the cue:


Next crawl, if auto starting again:


If saving a copy of source HTML of the crawled pages to a webserver...


URL:

SEODB LINE:

CUSTOM TOKEN DATA:

PAGE DATA:

Copyright © 2021 Terence Chisholm. All rights reserved.





Pages sent to Your GET Endpoint Archive URL:
(Numbered slots 1-20 fill in here, one per page sent.)

Form Injection Here