SEO keyword clustering with Python

Gaining new insights for your SEO projects takes fewer than 50 lines of Python code.

Today, I’ll show you how to run your own keyword grouping tool in Python. It works for thousands of keywords and gives you insights into main topics that weren’t visible before. The same concept is also available as a free online keyword grouping tool on our website, so feel free to take the shortcut. If you want to use Python for similar SEO and PPC use cases, keep reading.


    The basic concept of the clustering script looks like this:

    • Read the keyword list from a file (queries.csv):
      A good free keyword source to start with is Google Keyword Planner or your own queries from Google Search Console. Of course, you can also use third-party SEO keyword tools. The bigger the keyword list, the better the results of your clustering tend to be.
    • Apply stemming to every word within the query:
      We use the stemmers available in the Python NLTK module. The script below uses the Snowball stemmer, which is language-specific and is configured through its language parameter. The Porter stemmer, left in the script as a commented-out alternative, is designed for English. Stemming reduces words to their root form, which helps us group variants of the same word together.
    • Use TfidfVectorizer to create a feature vector over all queries:
      Clustering algorithms work with numbers, so we transform every keyword into a vector. Each vector has one entry per stemmed term found in the input keyword set (the script uses single words and two-word phrases), and each entry holds that term’s TF-IDF weight.
    • Run a cluster algorithm on top of the query vectors:
      In this script, we use scikit-learn (sklearn) to do the keyword clustering. It offers several clustering algorithms that are easy to use. You may wonder why I use DBSCAN instead of the more common k-means algorithm: I don’t want to bother you with estimating a good “k” (number of clusters) for k-means using approaches like the elbow method.
    • Look at the results in clustered_queries.csv:
      You’ll find the keyword clusters in this output file. Keywords that belong to the same group are joined with a pipe delimiter. When you run the script for the first time on a new keyword set, some of the clusters may not look very good. Try adjusting the SENSITIVITY and MIN_CLUSTERSIZE parameters that are used for DBSCAN clustering. This can improve your results.
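
    To get a feeling for how the DBSCAN parameters behave before tuning the real script, a small experiment can help. The following is only a minimal sketch under stated assumptions: the helper name sweep_eps and the sample queries are made up for illustration, and it uses the default TfidfVectorizer settings instead of the ones from the script. It counts how many clusters and how many unclustered queries each eps value (SENSITIVITY in the script) produces.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def sweep_eps(queries, eps_values, min_samples=2):
    """Return one (eps, n_clusters, n_unclustered) tuple per eps value.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Assumption: default vectorizer settings, not the tuned ones from the script.
    matrix = TfidfVectorizer().fit_transform(queries)
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(matrix).labels_
        # DBSCAN marks queries that belong to no cluster with the label -1.
        n_clusters = len(set(labels) - {-1})
        n_unclustered = int((labels == -1).sum())
        rows.append((eps, n_clusters, n_unclustered))
    return rows


if __name__ == "__main__":
    # Made-up sample queries, only used to demonstrate the helper.
    sample_queries = [
        "running shoes", "running shoes sale", "cheap running shoes",
        "python tutorial", "python tutorial beginners", "learn python tutorial",
        "pizza recipe",
    ]
    for eps, n_clusters, n_unclustered in sweep_eps(sample_queries, [0.2, 0.5, 0.8, 1.2]):
        print(f"eps={eps}: {n_clusters} clusters, {n_unclustered} unclustered")
```

    Very small eps values tend to leave most queries unclustered, while very large values merge everything into one group. A useful setting for your keyword set will probably lie somewhere in between.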

    This is what the complete script looks like:

    # Free Keyword Clustering
    import pandas as pd
    from nltk.stem import PorterStemmer
    from nltk.stem.snowball import SnowballStemmer
    from sklearn.cluster import DBSCAN
    from sklearn.feature_extraction.text import TfidfVectorizer

    #-------------------------------------
    # LANGUAGE is used for the Snowball stemmer and the stop word list
    # (scikit-learn only ships a built-in stop word list for 'english')
    LANGUAGE = 'english'
    SENSITIVITY = 0.2    # DBSCAN eps: lower values give tighter clusters and more unclustered queries
    MIN_CLUSTERSIZE = 2  # DBSCAN min_samples: neighbours (including the query itself) needed to form a cluster
    #-------------------------------------

    snow_stemmer = SnowballStemmer(language=LANGUAGE)
    porter_stemmer = PorterStemmer()  # alternative stemmer, see the commented line below

    def stem_list(queries):
        """Return the queries with every word reduced to its stem."""
        stemmed_list = []
        for query in queries:
            stem_words = []
            for word in query.split():
                x = snow_stemmer.stem(word)
                #x = porter_stemmer.stem(word)
                stem_words.append(x)
            stemmed_list.append(" ".join(stem_words))
        return stemmed_list

    # queries.csv: your queries that should be clustered (first column; the first row is read as header)
    df = pd.read_csv('queries.csv', delimiter=',')
    labellist = df.iloc[:, 0].dropna().astype(str).to_list()  # original queries, used as labels in the output
    textlist = stem_list(labellist)

    # Turn the stemmed queries into TF-IDF vectors (single words and two-word phrases)
    tfidf_vectorizer = TfidfVectorizer(max_df=0.2, max_features=10000, min_df=0.01, stop_words=LANGUAGE, use_idf=True, ngram_range=(1,2))
    tfidf_matrix = tfidf_vectorizer.fit_transform(textlist)

    # Cluster the vectors; DBSCAN assigns the label -1 to queries that fit no cluster
    ds = DBSCAN(eps=SENSITIVITY, min_samples=MIN_CLUSTERSIZE).fit(tfidf_matrix)
    clusters = ds.labels_.tolist()

    # Join the keywords of each cluster with a pipe delimiter and write the result
    cluster_df = pd.DataFrame(clusters, columns=['Cluster'])
    keywords_df = pd.DataFrame(labellist, columns=['Keyword'])
    result = pd.merge(cluster_df, keywords_df, left_index=True, right_index=True)
    grouping = result.groupby(['Cluster'])['Keyword'].apply(' | '.join).reset_index()
    grouping.to_csv("clustered_queries.csv", index=False)
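
    One detail worth knowing when you read the output: DBSCAN gives the label -1 to every query it could not assign to a cluster, so the row with Cluster -1 is not a real topic but a collection of leftovers. The sketch below shows one possible way to separate the two. The function name split_noise and the demo data are assumptions made for illustration; the Cluster and Keyword column names match the result frame built in the script.

```python
import pandas as pd


def split_noise(result):
    """Split a Cluster/Keyword frame into grouped clusters and unclustered keywords.

    Hypothetical helper for illustration; not part of the original script.
    """
    is_noise = result["Cluster"] == -1  # DBSCAN labels unclustered points as -1
    grouped = (
        result[~is_noise]
        .groupby("Cluster")["Keyword"]
        .apply(" | ".join)
        .reset_index()
    )
    unclustered = result.loc[is_noise, "Keyword"].tolist()
    return grouped, unclustered


if __name__ == "__main__":
    # Made-up demo data in the same shape as the script's result frame.
    demo = pd.DataFrame({
        "Cluster": [0, 0, -1, 1, 1, -1],
        "Keyword": ["red shoes", "red shoe", "pizza recipe",
                    "python course", "python courses", "weather today"],
    })
    grouped, unclustered = split_noise(demo)
    print(grouped)
    print(unclustered)
```

    If the unclustered list is very long, that can be a hint to raise SENSITIVITY a little or to feed the script a larger keyword list.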
    

    This is just one example of what a solution can look like. You may have noticed that the vectorizer and the DBSCAN clustering algorithm both take several parameters. This is the fine-tuning part, and the best values might be different for your project. Of course, you can also use the well-known k-means algorithm for your clustering.
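
    If you would like to try k-means instead, the clustering step could look roughly like the following. This is a minimal sketch under stated assumptions, not a drop-in replacement: the helper name kmeans_groups, the sample queries and the choice of k=2 are made up for illustration, and the default TfidfVectorizer settings are used. Unlike DBSCAN, k-means needs the number of clusters up front and assigns every query to a cluster, so there is no group of unclustered keywords.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def kmeans_groups(queries, k, seed=0):
    """Group queries into k clusters and return {cluster_label: [queries]}.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Assumption: default vectorizer settings, not the tuned ones from the script.
    matrix = TfidfVectorizer().fit_transform(queries)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(matrix)
    groups = {}
    for label, query in zip(labels, queries):
        groups.setdefault(int(label), []).append(query)
    return groups


if __name__ == "__main__":
    # Made-up sample queries, only used to demonstrate the helper.
    sample_queries = [
        "running shoes", "running shoes sale", "cheap running shoes",
        "python tutorial", "python tutorial beginners", "learn python tutorial",
    ]
    for label, members in sorted(kmeans_groups(sample_queries, k=2).items()):
        print(label, " | ".join(members))
```

    The open question with this approach remains the choice of k, which is exactly the estimation step that DBSCAN lets you skip.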

    Next Steps: Useful keyword sources to run your clustering script

    Of course, you need good keyword sources to run the cluster logic on. The autosuggest feature of search engines is a great way to extract relevant keywords. We shared a Python solution that does this with Google autosuggest. If you like, you can add that code to get keyword extraction and clustering within the same script run. If you want to start right away with keyword data from autocomplete, have a look at our free online version of autosuggest keyword extraction.

    Do you need a custom solution with Python?

    One-size-fits-all solutions can’t quite meet everybody’s unique needs. We know that. If you want to explore what is possible for your project, we can provide a custom Python solution. Contact us to get started.
