There are many reasons you might need to find all of the URLs on a website, but your exact goal will determine what you're searching for. For instance, you might want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
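If you do turn up an old sitemap, extracting its URLs takes only a few lines. Here's a minimal sketch in Python; the filename is a placeholder, and a sitemap index file (which lists child sitemaps rather than pages) would need the sitemap/loc path instead.

```python
# A minimal sketch, assuming a standard sitemap.xml saved as "old-sitemap.xml"
# (the filename is a placeholder). Pulls every <loc> URL into a text file.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")
urls = [loc.text.strip() for loc in tree.getroot().findall("sm:url/sm:loc", NS)]

with open("sitemap-urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Recovered {len(urls)} URLs")
```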
Archive.org
Archive.org is an invaluable, donation-funded tool for SEO tasks. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To get around the missing export button, use a browser scraping plugin like Dataminer.io. Even so, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
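Alternatively, the Wayback Machine's CDX API returns the same data programmatically, which sidesteps the missing export button and, with pagination, the interface's 10,000-URL ceiling. A minimal sketch, assuming example.com as the target domain:

```python
# A minimal sketch against the Wayback Machine's CDX API; "example.com" is a
# placeholder domain. For very large sites, add the "page" parameter to paginate.
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",      # everything under the domain
        "output": "json",
        "fl": "original",            # return only the original URL field
        "collapse": "urlkey",        # deduplicate repeat captures of one URL
        "filter": "statuscode:200",  # skip redirects and error pages
    },
    timeout=120,
)
resp.raise_for_status()
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # the first row is a header
print(f"Fetched {len(urls)} archived URLs")
```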
Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
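For larger exports, here's a rough sketch of what an API pull might look like. The endpoint, payload fields, and response shape are assumptions based on Moz's Links API v2, so confirm the details against Moz's API documentation before relying on them:

```python
# A rough sketch only: the endpoint, field names, and response shape are
# assumptions based on Moz's Links API v2 - verify against the official docs.
import requests

ACCESS_ID = "your-access-id"    # placeholder credentials
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),   # HTTP Basic auth
    json={
        "target": "example.com/",
        "target_scope": "root_domain",
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()
for link in resp.json().get("results", []):
    print(link)
```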
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
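Here's a minimal sketch of pulling pages with impressions via the Search Console API; it assumes you've already set up OAuth credentials, and the dates are illustrative:

```python
# A minimal sketch of the Search Console API's searchanalytics.query method.
# Assumes `creds` is an authorized google-auth Credentials object you've
# already created; dates and rowLimit are illustrative.
from googleapiclient.discovery import build

service = build("searchconsole", "v1", credentials=creds)

body = {
    "startDate": "2024-01-01",
    "endDate": "2024-03-31",
    "dimensions": ["page"],
    "rowLimit": 25000,  # the per-request maximum; paginate with "startRow"
}
response = service.searchanalytics().query(
    siteUrl="https://example.com/", body=body
).execute()

pages = [row["keys"][0] for row in response.get("rows", [])]
print(f"Found {len(pages)} pages with impressions")
```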
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report.
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
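If you'd rather skip the UI limits entirely, the GA4 Data API exposes the same page data. A minimal sketch, assuming the google-analytics-data client library, Application Default Credentials, and a placeholder property ID:

```python
# A minimal sketch using the GA4 Data API to pull page paths, avoiding the UI's
# export ceiling. Assumes Application Default Credentials are configured and
# "123456789" is replaced with your GA4 property ID.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    limit=100000,
)
report = client.run_report(request)
paths = [row.dimension_values[0].value for row in report.rows]
print(f"Collected {len(paths)} page paths")
```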
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process, and even a short script goes a long way (see the sketch below).
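As one illustration, here's a minimal sketch that extracts unique request paths from an access log in the common/combined format. Real log formats vary by server and CDN, so treat the regex as a starting point:

```python
# A minimal sketch for pulling unique URL paths out of an access log in the
# common/combined log format; "access.log" is a placeholder filename.
import re

# Matches the request portion of a combined-format line: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1).split("?")[0])  # drop query strings

print(f"{len(paths)} unique paths seen in the log")
```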
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
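In a notebook, pandas makes this step quick. A minimal sketch, assuming each source was saved as a single-column, headerless file of URLs (the filenames are placeholders) and that the site is HTTPS-only:

```python
# A minimal sketch of the combine-and-deduplicate step. Assumes each input is a
# single-column, headerless file of URLs; the filenames are placeholders.
import pandas as pd

sources = ["archive-org.csv", "gsc-pages.csv", "ga4-paths.csv", "log-paths.csv"]
frames = [pd.read_csv(f, names=["url"], header=None) for f in sources]
urls = pd.concat(frames, ignore_index=True)

# Normalize before deduplicating: trim whitespace, fold http:// into https://
# (assuming an HTTPS-only site), and strip trailing slashes so near-duplicates
# collapse together.
urls["url"] = (
    urls["url"]
    .str.strip()
    .str.replace(r"^http://", "https://", regex=True)
    .str.rstrip("/")
)
urls = urls.drop_duplicates().sort_values("url")
urls.to_csv("all-urls.csv", index=False)
print(f"{len(urls)} unique URLs")
```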
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!