How TF*IDF & LSI Optimisation Can Lead To Correlative Results

Caveat: Google does not use LSI (latent semantic indexing). If anything, it uses a form of PLSI (probabilistic LSI), but that is not something a tool can compute at the level Google does.

Alongside a large number of helpful articles, an equal number of poor articles and pieces of dated advice keep surfacing in conversations across Twitter, LinkedIn, and client briefs. The same two culprits come up each time: TF*IDF and LSI.

The correlation vs. causation fallacy is something we can’t get away from in SEO; it is common in a lot of articles, ranking studies, and other guides.

This is, in my opinion, driven by three key factors:

The need to show capabilities and ROI of SEO implementations performed by the consultant/agency
The need to show relevancy and innovation, and new thought processes performed by the consultant/agency
A lack of understanding, on the part of the consultant/agency, of the wider ecosystem and the field of SEO

The first two points are from a brand/commercial/ego perspective, whereas the third, in my opinion, is from a laziness perspective.

Ten years ago, SEO (search engine optimization) was a very different practice and in many ways a lot more linear, with cause and effect a lot more… plausible. Since then, Google and other search engines have evolved from basic sort engines using relatively basic information retrieval practices into more sophisticated systems.

TF*IDF & LSI

As machine learning and content analysis reach new levels, we as SEOs have to find ways of adding process and scalability to our actions.

This has led to TF*IDF and LSI emerging as front-running trends.

TF*IDF = Term Frequency × Inverse Document Frequency
LSI = Latent Semantic Indexing

Both are valid principles within information retrieval; however, their validity as an SEO practice, as an SEO tool, or as a means of analysis is questionable.

For example, Google has published a number of whitepapers on semantic topic modeling in modern search, none of which mention LSI.

CTR and dwell time have also come under a lot of criticism as supposedly important ranking factors, with a number of SEOs coming down on either side of the fence.

Latent Semantic Indexing

LSI uses a technique called singular value decomposition to scan unstructured data within documents and identify relationships between the concepts contained therein.

In essence, it finds the hidden (latent) relationships between words (semantics) in order to improve information understanding (indexing).

It provided a significant step forward for the field of text comprehension as it accounted for the contextual nature of language.
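For readers who want to see what “singular value decomposition” means here, the usual textbook form of LSI can be written out as below. This is a sketch of the general technique only, not of anything a search engine runs, and the symbols are introduced here purely for illustration: X is a term-by-document matrix (m terms, n documents) and k is the number of latent “concepts” kept.

```latex
% Truncated SVD of the m x n term-document matrix X, keeping the k largest singular values
X \;\approx\; X_k \;=\; U_k \, \Sigma_k \, V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% A query vector q (term counts) is folded into the same k-dimensional concept space
\hat{q} \;=\; \Sigma_k^{-1} \, U_k^{\top} \, q

% Documents are then compared with the query by cosine similarity in that space,
% where \hat{d}_j is the j-th row of V_k
\operatorname{sim}(\hat{q}, \hat{d}_j) \;=\;
\frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Note that everything is derived from X, so the “latent” relationships are only as good as the collection of documents the matrix was built from.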

TF*IDF

Term Frequency × Inverse Document Frequency is an old and basic content metric. For me, Bill Slawski summed up TF*IDF on Twitter as eloquently as it can be:

TF*IDF is a statistical method that can tell a search engine the importance of a word used on a page may be to the corpus that page exists in. It is not a semantic analysis tool nor SEO Tool. It is a tool that a search engine with access to its own corpus information can use.

TF*IDF, for me, is a current flavor of the month and will pass in time. Yes, Google may use it, and John Mueller has even acknowledged its role in information retrieval. But just because Google uses it doesn’t mean that we can replicate the same results, simply because we don’t have access to the same corpus.
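To make the corpus point concrete, here is a minimal sketch in Python. It assumes a common smoothed form of IDF (adding 1 to both counts) and two tiny made-up corpora; the function name `tf_idf` and the example documents are hypothetical and not taken from any tool or from Google.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF*IDF of `term` in one document, relative to `corpus` (a list of token lists)."""
    # Term frequency: share of the document's tokens that are this term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for d in corpus if term in d)
    # Smoothed IDF (assumption): add 1 to both counts so unseen terms cannot divide by zero.
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term))
    return tf * idf

page = "seo audit checklist for technical seo".split()

# Hypothetical corpus A: every other document also talks about SEO.
corpus_a = [page,
            "seo tips for beginners".split(),
            "local seo guide".split()]

# Hypothetical corpus B: the same page sits among unrelated documents.
corpus_b = [page,
            "chocolate cake recipe".split(),
            "weekend hiking routes".split()]

score_a = tf_idf("seo", page, corpus_a)  # "seo" is everywhere, so it carries no weight
score_b = tf_idf("seo", page, corpus_b)  # "seo" is rare, so it looks important
```

The page is identical in both cases, yet the score for “seo” is zero against the first corpus and positive against the second. A tool that builds its corpus from, say, the top ranking pages is measuring against that sample, not against the corpus a search engine holds.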

TF*IDF isn’t even the best information retrieval algorithm; others, such as Okapi BM25, take more document factors into account. If anything, TF*IDF evolved into BM25, and then further into what we can broadly classify as “machine learning”.

Wikipedia has even run experiments comparing BM25 with machine learning. Key takeaways from the First assessment of learning-to-rank are:
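As a rough illustration of the “more document factors” that BM25 considers, the sketch below scores a single query term with the standard Okapi BM25 formula. The parameter defaults (k1 = 1.2, b = 0.75) are commonly quoted values, and the corpus statistics are made up for the example; none of this reflects how any particular search engine is configured.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single query term to one document's score."""
    # Non-negative IDF variant: the "+ 1" inside the log keeps common terms at or above zero.
    idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    # Length normalisation: documents longer than average are penalised, controlled by b.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Saturating term frequency: each repeat of the term adds less than the one before.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Hypothetical corpus statistics, chosen only for illustration.
stats = dict(avg_doc_len=500, n_docs=1000, docs_with_term=50)

# Same average-length document, with the term used 1, 2, 10 and 20 times.
scores = {tf: bm25_term_score(tf, doc_len=500, **stats) for tf in (1, 2, 10, 20)}

# Same 10 mentions, but spread across a document four times the average length.
long_doc = bm25_term_score(10, doc_len=2000, **stats)
```

Two things fall out of this that plain TF*IDF does not capture: going from 1 to 2 mentions moves the score more than going from 10 to 20, and the same number of mentions counts for less in a much longer document.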

We [Wikipedia Search Platform Team] found that users were slightly more likely to engage with MLR-provided results than with BM25 results (assessed via the clickthrough rate and a preference statistic).
We [Wikipedia Search Platform Team] also found that users with machine learning-ranked results were statistically significantly more likely to click on the first search result first than users with BM25-ranked results, which indicates that we are onto something.

Studies have shown TF*IDF to be a good term frequency metric, not a relevancy metric. That is fine in some niches where competition is relatively low, but it is also why it’s important to focus on context vectors and entities as well.

Mueller’s Advice On TF*IDF

John Mueller, during a Webmaster Hangout, referenced the use of TF*IDF for weeding out stop words (i.e. words like “and”, “the”, and “that”). That seems a fitting use for an older technology such as this.

A basic algorithm like this could very well be limited to contributing to the simple task of identifying stop words.
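A small sketch of how that could work in principle: words that appear in every document of a collection get an inverse document frequency of zero, which makes them easy to flag. The three example sentences and the function name `idf_table` are invented for illustration, and this is not a description of how Google does it.

```python
import math

def idf_table(corpus):
    """Map each word to log(N / number of documents containing it); `corpus` is a list of word sets."""
    n = len(corpus)
    vocab = set().union(*corpus)
    return {w: math.log(n / sum(1 for d in corpus if w in d)) for w in vocab}

# Hypothetical mini-collection, for illustration only.
docs = [
    "the crawler fetches the page and the links",
    "the index stores the terms and the positions",
    "the ranker scores the page and the query",
]
corpus = [set(d.split()) for d in docs]
idf = idf_table(corpus)

# Words present in every document have an IDF of exactly zero.
stop_word_candidates = sorted(w for w, value in idf.items() if value == 0.0)
```

Here only “and” and “the” end up as candidates, while a word like “page”, which appears in two of the three documents, keeps a small positive weight.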

So, don’t just focus on artificially adding keywords. Make sure that you’re doing something where all of the new algorithms will continue to look at your pages and say, well this is really awesome stuff. We should show it more visibly in the search results.

Correlations Are Not Always Bad

In my opinion, if you’re optimizing content in one way or another, whether it be blindly using TF*IDF or another tool’s metric, it can lead to positive results.

Naturally, if you’re investing in increasing content quality, words on the page, and number of topics, as well as feeding in keyword research and competitor research – you’re going to see some form of positive gain.
