How to Find All Existing and Archived URLs on a Website

There are lots of reasons you might want to find all of the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors

In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.

To get around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
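If scraping the interface is too fiddly, Archive.org also exposes its index through the Wayback Machine CDX API, which can return captured URLs for a domain programmatically. Below is a minimal sketch; the query parameters are the documented CDX options as I understand them, so verify against the current docs, and example.com is a placeholder domain.

```python
import requests

# Wayback Machine CDX API: one row per capture; collapse=urlkey deduplicates
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

params = {
    "url": "example.com/*",  # replace with your domain
    "output": "json",
    "fl": "original",        # only return the original URL column
    "collapse": "urlkey",    # one row per unique URL
    "limit": 10000,
}

rows = requests.get(CDX_ENDPOINT, params=params, timeout=60).json()
# The first row is the header ("original"); the rest are single-column rows
archived_urls = [row[0] for row in rows[1:]]
print(f"{len(archived_urls)} archived URLs found")
```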

Moz Pro
While you'd typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.
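If you do go the API route, the sketch below shows roughly what a request might look like. Treat the endpoint, parameter names, and response shape as assumptions about the Moz Links API rather than a definitive reference, and check Moz's current documentation; the credentials are placeholders.

```python
import requests

# Assumed Moz Links API v2 endpoint; verify against Moz's current docs
ENDPOINT = "https://lsapi.seomoz.com/v2/links"
AUTH = ("YOUR_ACCESS_ID", "YOUR_SECRET_KEY")  # placeholder credentials

payload = {
    "target": "example.com",        # your site (placeholder)
    "target_scope": "root_domain",  # assumed parameter name
    "limit": 50,
}

resp = requests.post(ENDPOINT, json=payload, auth=AUTH, timeout=60)
resp.raise_for_status()
# Assumed response shape: a list of link records under "results"
for link in resp.json().get("results", []):
    print(link.get("target"))
```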

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
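If the built-in export is too small, the Search Console API's Search Analytics query method can page through far more rows. Here is a minimal sketch using google-api-python-client; the service-account file, property URL, and date range are placeholders, and it requests the page dimension so each returned row is a URL.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder service-account key with access to the Search Console property
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

site_url = "https://www.example.com/"  # your verified property
pages, start_row = set(), 0

while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-12-31",
        "dimensions": ["page"],
        "rowLimit": 25000,   # API maximum per request
        "startRow": start_row,
    }
    resp = service.searchanalytics().query(siteUrl=site_url, body=body).execute()
    rows = resp.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(pages)} URLs with search impressions")
```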

Indexing → Pages report:


This section provides exports filtered by issue type, although these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
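For pulls beyond the UI limit, the GA4 Data API can run the same report programmatically. Below is a minimal sketch with the official google-analytics-data client, assuming application-default credentials and a placeholder property ID; the filter mirrors the /blog/ example above.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses application-default credentials

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    # Keep only blog URLs, mirroring the /blog/ segment example
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(blog_paths)} blog URL paths pulled from GA4")
```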

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be massive, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process (a minimal parsing sketch follows below).
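As a starting point, even a short script can pull the requested paths out of a common-format access log. This is a minimal sketch assuming a combined/common log format file named access.log; real server and CDN logs vary, so adjust the pattern to your format.

```python
import re

# Matches the quoted request line in common/combined access logs,
# e.g.  "GET /blog/post-1 HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]*"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = REQUEST_RE.search(line)
        if match:
            # Strip query strings so each path is counted once
            paths.add(match.group(1).split("?")[0])

print(f"{len(paths)} unique URL paths requested")
```
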
Merge, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
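If you go the Jupyter Notebook route, a pandas sketch along these lines can combine and deduplicate the lists; the file names are placeholders for whatever exports you ended up with, and it assumes each file is a single column of URLs with no header row.

```python
import pandas as pd

# Placeholder file names: one exported URL list per source
sources = ["archive_org.csv", "gsc_pages.csv", "ga4_pages.csv", "log_paths.csv"]

frames = [pd.read_csv(path, names=["url"], header=None) for path in sources]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().astype(str)

# Consistent formatting so duplicates actually collapse; how far you normalize
# (protocol, casing, query strings) depends on your site
urls = urls.str.strip().str.rstrip("/")

deduped = urls.drop_duplicates().sort_values()
deduped.to_csv("all_urls.csv", index=False, header=False)
print(f"{len(deduped)} unique URLs written to all_urls.csv")
```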

And voilà, you now have a comprehensive list of current, old, and archived URLs. Good luck!
