Before you start: enable the 'Develop' menu (in Safari > Preferences > Advanced), then in the 'Develop' menu check the option 'Disable Cross-Origin Restrictions'. Refresh this crawler webpage itself, enter a website address and click CRAWL. Then let your Mac create actionable data to improve your site! The crawler outputs spreadsheets for things like: SEO headlines, keyword repeat density analysis, competitor SEO analysis via Google crawls, QA sanity checks, debugging code, HTML source extraction... and much, much more. This crawler does not butter your toast. It does pretty much everything else though. (LOL)
If I had to guess, I'd say this crawler is a super-simplified version of what Google's page crawler might be. Perhaps.
- Scan each crawled page for keyword repeat density automatically, in one of three ways:
- smart filters (based on the keywords you enter and the characters and words around them)
- strict filters (just the keywords you enter)
- no filter (all the words on the page)
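The three filter modes above could be sketched roughly like this. This is a minimal illustration, not the crawler's actual code; the function name, mode strings and tokenising regex are all my own assumptions:

```javascript
// Sketch of the three keyword-density modes: 'smart' (keyword plus the
// characters around it, i.e. any word containing it), 'strict' (exact
// keyword matches only), and 'none' (count every word on the page).
function densityReport(pageText, keywords, mode) {
  const words = pageText.toLowerCase().match(/[a-z0-9']+/g) || [];
  const counts = {};

  if (mode === 'none') {
    // No filter: tally every word found on the page.
    for (const w of words) counts[w] = (counts[w] || 0) + 1;
  } else {
    for (const kw of keywords.map(k => k.toLowerCase())) {
      counts[kw] = mode === 'strict'
        ? words.filter(w => w === kw).length        // exact matches only
        : words.filter(w => w.includes(kw)).length; // 'smart': word contains keyword
    }
  }
  return counts;
}
```

So for a page containing "Crawl the site. Crawling finds crawl errors.", a strict filter on "crawl" counts 2, while the smart filter also picks up "crawling" and counts 3.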
- An actionable fix-it list. Easy. Just type in generic parts of "error page" text (e.g. "not found") to find all pages with that text (the needle) throughout your site (the haystack). Then, "1-click" later, get the list of all the incoming links to all those pages, going a few levels deep and saving potentially hours of spreadsheet dribble-drabble tosh. Hated it. So I automated it.
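The "incoming links" step boils down to inverting the link graph the crawl has already collected. A hedged sketch, with my own illustrative names (the real crawler's data structures may differ):

```javascript
// Given each crawled page's outgoing links, invert the graph to list the
// incoming links for the target (e.g. "not found") pages the crawl flagged.
function incomingLinks(outgoing, targets) {
  // outgoing: { pageUrl: [linkedUrl, ...] }, targets: [flaggedUrl, ...]
  const incoming = {};
  for (const t of targets) incoming[t] = [];
  for (const [page, links] of Object.entries(outgoing)) {
    for (const link of links) {
      if (incoming[link]) incoming[link].push(page); // page links to a flagged URL
    }
  }
  return incoming; // ready to paste into the fix-it spreadsheet
}
```

For example, if both /a and /b link to a dead /404 page, the result lists /a and /b as its incoming links, which is exactly the fix-it list described above.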
- AA accessibility for images and alt tag descriptions... made easy. It builds a spreadsheet checklist of images and their "alt tag" text, so you can spot the missing alt values and add them for a good AA accessibility rating.
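A rough sketch of building one checklist row per image. A real DOM parser would be more robust than a regex, but this keeps the idea self-contained; the function name and the "(MISSING)" marker are my own illustration:

```javascript
// Extract every <img> tag from a page's HTML and record its src and alt text,
// flagging images whose alt attribute is absent entirely.
function imageAltChecklist(html) {
  const rows = [];
  for (const tag of html.match(/<img\b[^>]*>/gi) || []) {
    const src = (tag.match(/src\s*=\s*"([^"]*)"/i) || [])[1] || '';
    const alt = (tag.match(/alt\s*=\s*"([^"]*)"/i) || [])[1]; // undefined if missing
    rows.push({ src, alt: alt ?? '(MISSING)' }); // flag absent alt attributes
  }
  return rows;
}
```

Sorting the resulting spreadsheet by the alt column groups all the "(MISSING)" rows together, which is the quickest way to work through them.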
- Likely Landing Page spreadsheet (beta) - copy, paste, sort by keyword density in descending order, then filter your view by the keyword you want... and the pages near the top of the list will most likely be the ones users land on after searching Google, Bing, DuckDuckGo and others.
- Crawl your site and (optionally) follow links to other domains, subdomains and/or URLs containing a wildcard term (like your brand name)... So if you are crawling a site that has many subdomains, you can crawl them all in one go. Or follow links that mention your brand in any URL found on any site.
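The follow/don't-follow decision for each discovered link might look something like this. All the names here (shouldFollow, followSubdomains, wildcard) are my own illustration of the options described above, not the tool's actual code:

```javascript
// Decide whether a discovered link should be crawled or treated as external,
// based on the start host, an optional subdomain toggle, and an optional
// wildcard term (e.g. your brand name) matched anywhere in the URL.
function shouldFollow(linkUrl, startHost, opts) {
  const host = new URL(linkUrl).hostname;
  if (host === startHost) return true;                  // same site: always follow
  if (opts.followSubdomains &&
      host.endsWith('.' + startHost.replace(/^www\./, ''))) {
    return true;                                        // e.g. blog.example.com
  }
  if (opts.wildcard && linkUrl.includes(opts.wildcard)) {
    return true;                                        // brand term anywhere in the URL
  }
  return false;                                         // treat as external
}
```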
- Crawl test sites that aren't public yet. Say you are building a new site to replace one that is live now: you can crawl and compare them both.
- go to http://crawler.fyi/crawler/
- type in your homepage URL (make sure you are NOT logged into a CMS for the duration of your crawl)
- press CRAWL
It will then crawl your site, looping through a few steps on each page. When it has completed every page of the site, it will prompt you to copy and paste the data into a spreadsheet.
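The loop above, sketched minimally. The fetchPage function is injectable so the sketch runs without a network; the queue/visited structure and every name here are assumptions of mine, not the crawler's actual implementation:

```javascript
// Breadth-first crawl loop: dequeue a URL, fetch its HTML, hand the page to a
// callback (which would collect the spreadsheet data), then enqueue any links
// found on the page, resolving relative URLs against the current page.
async function crawl(startUrl, fetchPage, onPage) {
  const queue = [startUrl];
  const visited = new Set();
  while (queue.length) {
    const url = queue.shift();
    if (visited.has(url)) continue;        // skip pages already crawled
    visited.add(url);
    const html = await fetchPage(url);     // fetch the page's HTML
    onPage(url, html);                     // collect data for the spreadsheet
    for (const m of html.matchAll(/href\s*=\s*"([^"#]+)"/gi)) {
      try { queue.push(new URL(m[1], url).href); } catch {} // resolve relative links
    }
  }
  return visited;                          // every URL the crawl completed
}
```

In the real tool, fetchPage would be a cross-origin request from the browser, which is exactly why the 'Disable Cross-Origin Restrictions' step at the top is needed.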
- Type in some text or code you are looking for on each page, and it will help you find those "needles in the haystack". Click the (x) to close the top layer, scroll down to the NEEDLE IN HAYSTACK text fields, and type in up to 5 different bits of text or code you are looking for. Then start your crawl as normal, and it will list any pages that contain those needles.
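Per page, the needle check itself is just a substring test against the page's HTML for each of the (up to 5) needles. A small sketch, with names of my own choosing:

```javascript
// Return which of the entered needles appear in a crawled page's HTML,
// ignoring case and skipping any needle fields left blank.
function findNeedles(pageHtml, needles) {
  const haystack = pageHtml.toLowerCase();
  return needles
    .filter(n => n && n.trim())                    // ignore empty needle fields
    .filter(n => haystack.includes(n.toLowerCase()));
}
```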
- Extract all the HTML for each page and convert it to a single line of HTML that can be pasted into a spreadsheet cell.
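Flattening to a single line matters because a pasted newline starts a new spreadsheet row and a tab jumps to the next column. One possible sketch (the function name is mine):

```javascript
// Collapse a page's HTML source to one line so it fits in a single
// spreadsheet cell when pasted.
function htmlToSingleLine(html) {
  return html
    .replace(/[\r\n]+/g, ' ')   // newlines would start a new spreadsheet row
    .replace(/\t/g, ' ')        // tabs would jump to the next column
    .replace(/ {2,}/g, ' ')     // collapse the resulting runs of spaces
    .trim();
}
```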
Check it out at: http://crawler.fyi/crawler/
Use it and enjoy it! And if you have an idea to extend it, please DO share. Don't hack/blag/pretend you wrote that code and try to sell it. That'd be stupid. And illegal.
Roadmap: (These are features not yet coded up, but planned or possibly planned for future versions.)
- Insertable string manipulation and database-building function manipulation in the core script.
Via POST, PHP-style: you enter changes to the Build SEO Database function/code, press POST, it writes to the form results page and your code changes are executed. They are retained in the form under the crawl controller modal where you entered them, so you can save them into your favourite text editor or as a browser form-field preset; customising is easy. Do it at your own risk.
- Combine 'needle in the haystack 1' and 'needle in the haystack 2' into a new variable called bothNeedles, then save it into your edited Build SEO Database function above.
- Option to split SEO Database into multiple textareas, by domain, if crawling more than one domain, and a checkbox is checked for it. Might be handy for some people? Dunno.
- Old site versus New Site - Side by Side Page comparison with prompts for QA feedback and notes that get added to the SEO Database inline with Page Crawl Data.
- Whatever other ideas I can think of or you suggest.
Check me out at terencechisholm.com/cv
This is mine. Copyright © Terence Chisholm 2018. All rights reserved.
Use at your own risk. And Log out of your CMS first.
If you crawl links that trigger things like deleting pages on your site, it's not my fault. ALWAYS log out of your CMS before crawling your site. Otherwise the crawler will find and follow CMS links that delete, revise, edit, publish or otherwise alter content in the Content Management System you used to build your site, and it will mess things up. Logged out, it will only follow links that the public can follow too. Also, watch the screen as it crawls: you can let the CUED links fill up, then pause and remove any unwanted links from the list it is about to crawl. You can also set options NOT to crawl links containing any bit of text you specify. For example, if you use WordPress, "/wp-login" and "/wp-admin" are folders you won't want to crawl, so add those to the section marked "Don't crawl URLs with these characters in it (instead treat them like external links):" above.
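That "don't crawl URLs with these characters in it" option is, at heart, a substring blocklist applied before a link is queued. A minimal sketch, with names of my own invention:

```javascript
// Treat a URL as external (never crawl it) if it contains any of the
// user-entered blocked fragments, e.g. WordPress admin paths.
function isBlocked(url, blockedFragments) {
  return blockedFragments.some(frag => frag && url.includes(frag));
}
```

With ['/wp-login', '/wp-admin'] as the blocklist, a link like https://example.com/wp-admin/post.php?action=delete is skipped rather than activated, which is the whole point of the warning above.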
If saving a copy of source HTML of the crawled pages to a webserver...