Python Script: SEO Content Analysis of your Competitor

Analyzing the content of your competitors can provide you with valuable insights about your own operations and goals. This basic Python script can give you information on n-Grams in seconds.

This Python Script is a very basic version of a content analysis of your competitor. The main idea is to get a quick overview of what the writing focus looks like. A very lean approach is to fetch all URLs in the sitemap, parse the URL slugs and run a n-Gram analysis on it. If you want to know more about n-Gram analysis, please also have a look at our Free N-Gram Tool. You can apply it not only for URLs but also keywords, titles, etc.

As a result you will get a list of used n-Grams in the URL slugs together with the amount of pages that used this n-Gram. This analysis will only take a few seconds even on large sitemaps and will run with less than 50 lines of code.

SEO Content Analysis with n-Grams

Additional approaches

If you want to get deeper insights I can recommend to go on with these approaches:

  • Fetch the content of each URL in the sitemap
  • Create n-Grams found in headlines
  • Create n-Grams found in content
  • Extract keywords with Textrank or Rake
  • Extract known entities for your SEO business

But let’s start simple and take a first look into the rabbit hole with this script. Based on your feedback I may add some more sophisticated approaches. Before you run the script you just need to enter the sitemap URL you want to analyze. After running the script, you will find your results in sitemap_ngrams.csv. Open it in Excel or Google Sheets and have fun with analyzing the data.

Bonus Tip:

If you run it twice – once for your competitor and once for your own sitemap – you can easily create a Gap-Analysis within Excel.

Here is the Python code:

# Pemavor.com Sitemap Content Analyzer
# Author: Stefan Neefischer

import advertools as adv
import pandas as pd


def sitemap_ngram_analyzer(site):

    sitemap = adv.sitemap_to_df(site)
    sitemap = sitemap.dropna(subset=["loc"]).reset_index(drop=True)

    # Some sitemaps keeps urls with "/" on the end, some is with no "/"
    # If there is "/" on the end, we take the second last column as slugs
    # Else, the last column is the slug column
    slugs = sitemap['loc'].dropna()[sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-2].str.replace('-', ' ')
    slugs2 = sitemap['loc'].dropna()[~sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-1].str.replace('-', ' ')

    # Merge two series
    slugs = list(slugs) + list(slugs2)

    # adv.word_frequency automatically removes the stop words
    word_counts_onegram = adv.word_frequency(slugs)
    word_counts_twogram = adv.word_frequency(slugs, phrase_len=2)

    output_csv = pd.concat([word_counts_onegram, word_counts_twogram], ignore_index=True)\
                    .rename({'abs_freq':'Count','word':'Ngram'}, axis=1)\
                    .sort_values('Count', ascending=False)

    #Save input csv with scores
    output_csv.to_csv('sitemap_ngrams.csv', index=False)
    print("csv file saved")


# Provide the Sitemap that should be analyzed
site = "https://searchengineland.com/sitemap_index.xml"
sitemap_ngram_analyzer(site)
#the results will be saved to sitemap_ngrams.csv file

Join the conversation on LinkedIn

More Similar Posts

Menu