Over the past year, more and more webmasters have been posting on Twitter and Reddit that they’re having trouble getting large portions of the websites they work on indexed.
Anecdotally, since November 2021, I’ve also seen more and more indexing fluctuations across websites of all sizes, with the greatest impact on websites of 100k URLs and of 100 million URLs plus.
Search engines are, at their core, businesses: profit-driven organizations with multiple shareholders and invested stakeholders who ultimately want to see financial gain.
Yes, there are social entity search engines, e.g., Ecosia, who give a percentage of their profits to planting trees, which by their own publicly available financial information is around 45-50% of revenue generated each month. But Ecosia has an agreement with Microsoft, from which Microsoft makes a profit, and Microsoft does a lot of the heavy lifting in producing a working and “live” index to present on a search results page (SERP).
The likes of (Microsoft) Bing, Google, Yandex, and DuckDuckGo are designed to make a profit – and the “organic” part of the search experience, the primary data source with which search engines populate the SERPs, is a loss leader, of sorts. The search engines make money through selling advertisements, additional services such as the relatively new local services ads (Google), and diversification into other product areas, like food delivery (Yandex Lavka).
Indexing, processing, and storing large quantities of data costs money and consumes energy, which in turn produces pollutants and demand on infrastructure. Google currently runs 23 data centers globally, which from my research I think puts it in the top 10 companies for the number of data centers owned and operated.
So indexing and storing the content the world produces on the internet has been pivotal to the business model. Whoever was able to categorize, sort, and understand that information to produce better results for users gained opportunity and market share, and with it the real estate to sell: opportunities to be seen, to be clicked, and to sell.
The thing is, we’re hitting an indexing ceiling.
The internet grew quickly because it had all of human history to write about and feature; now, aside from some small niches and edge cases, we’re regurgitating the same content, the same “how to” guides, and the same products and information.
In fact, at SEO Day in March 2022, Gary Illyes highlighted a stat that about 60% of the internet is duplicate.
So with this in mind, why would Google (and other search engines) need to consistently crawl, parse, and index all the content on the internet (something that Google says is impossible to do anyway)?
With the viewpoint that 60% of the internet is duplicated, why not conserve crawling efforts? And by that measure, reduced crawling means reduced load and resources required to constantly keep re-tokenizing and assessing existing content.
To an extent, Google already does this with the “quality threshold”.
URLs (content pages) that fall below the quality threshold fall out of the index. They can be temporarily reprieved through an indexing request in Google Search Console, which has been confirmed to give a URL a short-term boost before it eventually falls back down.
We also see this in Google Search Console, where Google will override the declared canonical and report the status “Duplicate, Google chose different canonical than user”.
Google is already making judgment calls on which URLs they’re including in the index based on what they perceive to be:
- Direct duplication of the page content
- Duplication of the value proposition
- Duplication of user satisfaction, e.g., the content serves the same purpose as other content on the website on a different URI when it comes to answering a user query
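For auditing your own site, the first of these (direct duplication of page content) is the easiest to approximate. The sketch below compares two pages by the overlap of their three-word “shingles” (Jaccard similarity). This is a common textbook technique, not a description of how Google measures duplication; the function names and the shingle size are assumptions made for illustration. Duplication of value proposition or of user satisfaction can’t be measured this way and still needs human judgment.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical page copy; in practice this would be the extracted main content.
page_a = "how to index a page fast"
page_b = "how to index a page quickly"
score = jaccard_similarity(page_a, page_b)
```

Pairs of URLs with a high score are candidates for consolidation or canonicalization. Where you draw the line is a judgment call, as no threshold is published.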
We’ve also seen this in the past, with websites losing indexing (and total ranking keywords) off the back of an update (confirmed or otherwise). More often than not, these have been keywords that the site didn’t deserve to rank for in the first place, and this is Google “streamlining” its index to provide only the most relevant and worthy results.
Outside of this, we can see elements of index pruning in the August 2022 Helpful Content update.
Like the majority of Google updates, it was made to improve the search results for the user and to counter the tactics of SEOs. Penguin tackled link spam, Panda tackled the early days of content… Crap.
The Helpful Content update is needed because of AI content.
AI content is not bad; it is fine when used correctly and when it adds value. But a lot of AI content has been produced to scale content production: the regurgitation of existing topics for those 500-word blog posts that fulfill SEO retainers.
This great scaling in content production means that more and more new URIs are being produced, adding to the load on Google and other search engines and to the resources they need to index them. So how do you curb the increased demand for resources?
The Helpful Content update.
This update isn’t anything new. In Google’s own blog post outlining the update, they reference the long-standing advice they’ve been giving for years (a 2011 blog post referencing Panda), and references to content being helpful have existed in the Quality Rater Guidelines for as long as EAT (which gets a lot more coverage).
Also in the QRGs, alongside EAT, are the notions of beneficial purpose and page quality. A couple of key takeaways from the current quality rater guidelines around beneficial purpose are:
Most pages are created to be helpful for people, thus having a beneficial purpose.
Some pages are created merely to make money, with little or no effort to help people. Some pages are even created to
Highest quality pages are created to serve a beneficial purpose and achieve their purpose very well. The distinction between High and Highest is based on the quality and quantity of MC, as well as the level of reputation and E-A-T.
Beneficial purpose is also used to describe characteristics of high quality, medium quality, and low-quality pages.
The quality raters are also given strong guidance around beneficial purpose and how important it is when identifying Page Quality and assigning a score on the “needs met” scale:
If a page lacks a beneficial purpose, it should always be rated Lowest Page Quality regardless of the page’s Needs Met rating or how well-designed the page may be.
So the Helpful Content update raises the bar on what counts as helpful: content with a strong beneficial purpose, from a source demonstrating expertise in the primary (and closely related) topics. Content that doesn’t express expertise, opinion, or something “new” is duplicated content (even if it’s unique in terms of mechanics), and it serves no purpose in the index because it adds no value.
Content that’s helpful also helps users accurately forecast their experience with your brand, product, or service.
User Experience Forecasting is something I’ve been advocating for the past couple of years and have seen success in taking this mindset and approach with client pages in terms of indexing, and ranking. You can read my original article on this from 2020 here, a deck I put together on this for a University talk here, and the episode of the Voices of Search Podcast covering the topic here.
But what would happen if search engines stopped crawling most of the internet? How would they discover new content?
We already have, in my opinion, the answers.
When Google rolls out a new update, it takes time to propagate across the globe, and you typically see it affect high-demand queries (and their SERPs), e.g., those with high average monthly search volumes. The more people search a query, the more important it typically is to businesses: more eyes mean more potential clicks and sales, and so more opportunity to sell adverts by keeping those SERPs as high quality as possible.
By maintaining crawls and reprocessing for “important” queries, the search engines can maintain the important SERPs.
So how will the search engines discover new content, if they’re not actively crawling the internet for it? How do companies get the new content on their websites discovered? We know the answer.
By having webmasters submit URLs via IndexNow, you remove the “crawl” process completely for a large number of websites and URLs, greatly reducing the cost and resources required.
Bing and Yandex currently support this, and Google is testing IndexNow for sustainability.
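As a rough illustration of what submitting URLs looks like in practice, here is a minimal sketch of a bulk IndexNow submission using only the Python standard library. The JSON fields (host, key, keyLocation, urlList) and the shared api.indexnow.org endpoint follow the published IndexNow protocol; the function names, the example.com host, and the key value are placeholders of my own, and you should check the current protocol documentation before relying on the details.

```python
import json
import urllib.request
from urllib.parse import urlparse

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"
MAX_URLS_PER_REQUEST = 10_000  # per-request limit stated in the protocol


def build_indexnow_payload(host, key, urls, key_location=None):
    """Build the JSON body for a bulk IndexNow submission."""
    # Only URLs on the submitting host are valid, so drop anything else.
    same_host = [u for u in urls if urlparse(u).netloc == host]
    if not same_host:
        raise ValueError("no URLs belong to host " + host)
    if len(same_host) > MAX_URLS_PER_REQUEST:
        raise ValueError("too many URLs for one request; split into batches")
    payload = {"host": host, "key": key, "urlList": same_host}
    if key_location:
        # Optional: where the key file lives if it is not at the site root.
        payload["keyLocation"] = key_location
    return payload


def submit_to_indexnow(payload, endpoint=INDEXNOW_ENDPOINT):
    """POST the payload and return the HTTP status code.

    urllib raises HTTPError for 4xx/5xx responses, e.g. an invalid key.
    """
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Placeholder host, key, and URLs; nothing is sent until submit_to_indexnow is called.
example_payload = build_indexnow_payload(
    host="www.example.com",
    key="your-indexnow-key",
    urls=["https://www.example.com/new-page", "https://www.example.com/updated-page"],
)
```

The key has to match a text file hosted on your own domain, which is how the search engine verifies that you control the site. A 200 or 202 response means the submission was received; it does not guarantee that the URLs will be indexed, which is why prioritizing what you submit still matters.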
So if this happens, what do we do?
The first thing we’ll need to address is a change in mindset, and this will mean some difficult conversations with clients, with the understanding that:
- 100% indexation is unlikely to be a reality
- We can’t just submit every non-indexed page for indexing; we need to be tactful about this and prioritize
- Content volume and quantity depend on what’s required on a query and topic basis
So in a practical sense, which pages should we care about most (outside of those that drive traffic, leads, and revenue)?
Brand and brand compounds
Pages that are important for queries researching your brand, navigating to pages for a purpose with your brand (e.g., dan taylor coupons, dan taylor discounts), and brand navigation (e.g., dan taylor terms and conditions).
You need to own this space for both new prospective customers, as well as retaining existing ones, as owning this space helps prevent misinformation from reaching your user base.
By extension, this also includes your brand entity hubs.
Smarter content production and search term targeting
This is also the opportunity for us to take our existing, standard SERP analysis, and go one further.
Right now a lot of tools and audits classify by result type using search intent (Commercial, Informational, Navigational, Transactional) and mistakenly assume that all informational and all commercial are the same.
When Google talks about quality thresholds, they also talk about “source types”, and different source types have different thresholds for the same search query.
The most common misinterpretation of this I come across involves a query with multiple common interpretations and a fractured intent, where the SERP is predominantly informational and made up of non-commercial entities, e.g., TechCrunch, Mashable, PCMag, Cosmopolitan, and a company thinks its blog can compete. For some queries, it can, but if Google is clearly not showing a commercial source type publishing informational content for that query, it’s unlikely you’ll move onto page one unless Google determines (based on user data) that it needs to change the SERP.
Google’s John Mueller has also publicly commented on the approach to ranking for queries with fractured intent/an ambiguous interpretation:
So I think, first of all, a query like “programming” is so ambiguous that there is no absolute right or wrong when it comes to ranking something there.
So that’s something where I would assume that the results that you see there are going to be kind of mixed and it’s going to be hard to just say, I’m going to create a piece of content on the topic of “programming” and Google will rank it number one.
This is consistent with the rhetoric in a lot of Google documents.
EAT is important, as is the topic scope
This is another element of search that has been hiding in plain sight (and talked about by Google) a number of times.
We’re all very familiar with the concepts of EAT (Expertise, Authoritativeness, Trustworthiness) and YMYL (Your Money or Your Life), and how these relate to beneficial purpose and the quality threshold required for the source type.
This oftentimes gets oversimplified to just “authorship”, when we know that the content author is not a direct ranking factor. But the author (depending on the page’s beneficial purpose) can have an impact, depending on their known level of qualification and authority in the field, not their reputation.
This I find interesting, given that Prize and Notability are two metrics used to rank entities, so in theory, a medical practitioner with a higher reputation and stature should outweigh a medical practitioner without the same reputation – and then the author of content piece A, the higher ranked entity, should have a higher EAT level and quality than content piece B, written by the lesser known entity… But we’re well and truly in the weeds here.
Either way, if you’re writing content with the goal of getting search traffic with authors who write about various topics on your website and no (or worse, a copy and paste) author biography, you’re not demonstrating why your content should be trusted.
Even if this is conjecture, and this article goes down as a 2,000-word conspiracy theory – all of the search engines are pointing towards the same goal, and that’s increasing the quality of search results pages for users.
If at the same time they’re able to reduce costs, and energy consumption (and then probably carbon emissions), that’s great.
If they’re not, everything they’re showing us in the updates and SERP changes, and everything they say publicly, still points towards the same goal: to be considered for indexing, processing, and ranking, you need to offer unique perspectives and helpful, high-quality content in the field (or a related field) in which you’ve demonstrated expertise.