SEO Keyword Clustering with Python

Gaining new insights for your SEO projects takes fewer than 50 lines of Python code.

Today I will show you how to run your own keyword grouping tool in Python. It works for thousands of keywords and surfaces main topics that were not visible before. I also put the whole concept into a free online keyword grouping tool on our website, so feel free to take the shortcut. If you want to use Python for similar SEO and PPC use cases, keep on reading!

The basic concept of the clustering script looks like this:

  • Read the keyword list from a file (queries.csv):
    A good free keyword source to start with is Google's Keyword Planner or your own queries from Google Search Console. Of course, you can also use third-party SEO keyword tools. The bigger the keyword list, the better the clustering results.
  • Apply stemming to every word within the query:
    The script uses the Snowball Stemmer from the Python NLTK module, which works per language (set to English here). The classic Porter Stemmer, also part of NLTK, is left in the script as a commented-out alternative; it targets English only. Stemming reduces words to their root form, which helps us group related words together.
  • Use TfidfVectorizer to create a feature vector over all queries:
    Clustering algorithms work with numbers, so we transform every keyword into a word vector. The vector has one entry per stemmed term that the vectorizer keeps from the input keyword set (single words and two-word phrases in this script), and each entry holds that term's TF-IDF weight.
  • Run a cluster algorithm on top of the query vectors:
    In this script we use sklearn for the keyword clustering. It offers several clustering algorithms that are easy to use. You may wonder why I use DBSCAN instead of the more common k-means algorithm: k-means needs a good “k” (the number of clusters) up front, and I do not want to bother you with estimating it using approaches like the elbow method.
  • Look at the results in clustered_queries.csv:
    You will find the keyword clusters in this output file. Keywords that belong to the same group are joined with a pipe delimiter. When you run the script for the first time on a new keyword set, some of the clusters may not look good. Try adjusting the SENSITIVITY and MIN_CLUSTERSIZE parameters used for DBSCAN clustering; this can improve your results.
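
To get a feel for what the SENSITIVITY parameter does before running the full script, here is a minimal sketch on a handful of made-up keywords. The keyword list, the helper name cluster_labels and the two eps values are assumptions chosen for illustration; stemming and the vectorizer fine tuning are left out to keep it short.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up keywords: two obvious topics plus one outlier.
keywords = [
    "buy running shoes",
    "buy running shoes online",
    "running shoes sale",
    "python seo script",
    "python seo tutorial",
    "garden furniture",
]

# Default TF-IDF settings; every row is an L2-normalised keyword vector.
vectors = TfidfVectorizer().fit_transform(keywords)


def cluster_labels(eps):
    """Run DBSCAN with the given eps and return one label per keyword."""
    return DBSCAN(eps=eps, min_samples=2).fit(vectors).labels_.tolist()


# A small eps only groups near-identical keywords: here everything is noise (-1).
print(cluster_labels(0.2))  # [-1, -1, -1, -1, -1, -1]

# A larger eps groups the two topics; the outlier stays noise.
print(cluster_labels(1.2))  # [0, 0, 0, 1, 1, -1]
```

Keywords that share no term at all are about 1.41 apart in this vector space, so values of eps above that would merge everything into one cluster.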

This is what the complete script looks like:

# Free Keyword Clustering
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
# from nltk.stem import PorterStemmer  # alternative stemmer, English only
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

#-------------------------------------
LANGUAGE = 'english'  # used for snowball stemmer and stop words (sklearn only ships an 'english' stop word list)
SENSITIVITY = 0.2     # DBSCAN eps (max. distance between neighbours); lower values give tighter clusters
MIN_CLUSTERSIZE = 2   # DBSCAN min_samples: minimum number of keywords per cluster

snow_stemmer = SnowballStemmer(language=LANGUAGE)
# porter_stemmer = PorterStemmer()


def stem_list(keywords):
    """Return the keywords with every word reduced to its stem."""
    stemmed_list = []
    for keyword in keywords:
        # use porter_stemmer.stem(word) here to try the Porter Stemmer instead
        stem_words = [snow_stemmer.stem(word) for word in keyword.split(" ")]
        stemmed_list.append(" ".join(stem_words))
    return stemmed_list


# queries.csv: your queries that should be clustered (first column, first row is the header)
df = pd.read_csv('queries.csv', delimiter=',')
labellist = df.iloc[:, 0].astype(str).to_list()  # original keywords, used in the output
textlist = stem_list(labellist)                  # stemmed keywords, used for clustering

# Turn every stemmed keyword into a TF-IDF vector
tfidf_vectorizer = TfidfVectorizer(max_df=0.2, max_features=10000, min_df=0.01,
                                   stop_words=LANGUAGE, use_idf=True, ngram_range=(1, 2))
tfidf_matrix = tfidf_vectorizer.fit_transform(textlist)

# Cluster the vectors; one label per keyword, in input order
ds = DBSCAN(eps=SENSITIVITY, min_samples=MIN_CLUSTERSIZE).fit(tfidf_matrix)
clusters = ds.labels_.tolist()

# Join the keywords of every cluster with a pipe and write the result
cluster_df = pd.DataFrame(clusters, columns=['Cluster'])
keywords_df = pd.DataFrame(labellist, columns=['Keyword'])
result = pd.merge(cluster_df, keywords_df, left_index=True, right_index=True)
grouping = result.groupby(['Cluster'])['Keyword'].apply(' | '.join).reset_index()
grouping.to_csv("clustered_queries.csv", index=False)
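
One detail worth knowing when you read the output: DBSCAN gives the label -1 to every keyword that did not fit into any cluster, so the script writes all of those into a single row with Cluster -1. If you prefer to keep them apart, something like the following sketch could work. The labels and keywords here are made up and stand in for clusters and labellist from the script.

```python
import pandas as pd

# Hypothetical stand-ins for `clusters` and `labellist` from the script.
clusters = [0, 0, -1, 1, 1, -1]
keywords = ["red shoes", "red shoe", "tax advice", "seo tool", "seo tools", "weather"]

result = pd.DataFrame({"Cluster": clusters, "Keyword": keywords})

# Label -1 marks noise: keywords without enough close neighbours.
noise = result[result["Cluster"] == -1]
clustered = result[result["Cluster"] != -1]

# Same pipe-joined grouping as in the script, but only for real clusters.
grouping = clustered.groupby("Cluster")["Keyword"].apply(" | ".join).reset_index()

print(grouping)
print(noise["Keyword"].tolist())
```

The unclustered keywords are often a useful list of their own, for example to rerun with a larger SENSITIVITY value.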

This was just one example of what a solution can look like. You may have noticed that the vectorizer and the DBSCAN clustering algorithm take several parameters. This is the fine-tuning part and might differ for your project. Of course, you can also use the well-known k-means algorithm for your clustering.
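
If you want to try k-means, a rough sketch with sklearn's KMeans could look like this. The keywords, the value N_CLUSTERS = 2 and the range of k values are assumptions for illustration; with k-means you have to pick the number of clusters yourself, and every keyword is assigned to a cluster (there is no noise label).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up keywords with two obvious topics.
keywords = [
    "buy running shoes",
    "buy running shoes online",
    "running shoes sale",
    "python seo script",
    "python seo tutorial",
    "python seo course",
]
N_CLUSTERS = 2  # assumption: must be chosen up front for k-means

vectors = TfidfVectorizer().fit_transform(keywords)

# random_state makes the run repeatable; n_init restarts guard against bad starts.
km = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(vectors)
labels = km.labels_.tolist()

for label, keyword in zip(labels, keywords):
    print(label, keyword)

# Starting point for the elbow method: inertia (within-cluster sum of squares)
# for several values of k. Look for the k after which it stops dropping sharply.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors).inertia_
    for k in range(1, 5)
}
print(inertias)
```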
