Using Python & Sheets for Competitor Keyword Research

Update: While this approach is still useful, we've developed a topic clustering algorithm that tells us exactly which content to create, and in what order, to consistently deliver insanely fast results. If anything here doesn't work (APIs change over time), reach out to me on LinkedIn.

There are two primary things we’re focused on when it comes to our SEO consultancy:

  • Delivering revenue for our clients by leveraging data
  • Leveraging technology to improve accuracy without diluting quality

Saving time, energy, and resources is absolutely critical for an SEO agency to stay profitable and scale.

Many agencies do this by delegating the work to junior SEOs who lack experience or by outsourcing it entirely to white-label vendors.

If checks and balances aren't put in place, this inevitably leads to quality being diluted, if not obliterated.

We take a different approach at Ardent Growth. Instead of scaling by delegation, we scale by intelligent automation. 

How we do this with keyword research is what I’ll be discussing in this article as well as how you can do it too. 

Let’s dive in. 

Overview

We’re going to do a deep dive into the standard way we do keyword research (for existing content) in the Blueprint community. We’ll be discussing:

  • The theory behind the approach
  • An explanation of the process

Then we’ll explore ways we can get more out of it by:

  • Collecting more data
  • Building more accurate projections using a bit of very simple calculus
  • Automating it using a dash of Python and some basic Google Sheet formulas

Note: If you want access to Ryan Stewart's templates from The Blueprint Training, you'll need to purchase them. If you don't have them, you can still use the method outlined below to get the same results — it'll just take a little more elbow grease.

The Keyword Research Process

The goal of this type of keyword research is to identify existing pages that have a good chance of ranking with far less investment than producing new content.

Existing pages can often be much easier to rank because they may already have some existing equity in the form of links and established topical authority. 

This approach is especially helpful when kicking off a campaign with a brand-new client: a few quick wins early on help you establish credibility from the start, which has a direct impact on your retention rates. I'll be using a lawyer SEO campaign we helped out with as an example in this post.

Setting Up the Keyword Research File

The way we set up our keyword research file depends somewhat on pulling in data from an already completed website quality audit (a type of SEO audit we do for clients).

Pulling data from the Blueprint WQA into the Blueprint Keyword Research file

In our website quality audit, we identify which URLs will benefit the most from on-page updates — we tag these URLs with a URL action called “Update On-Page.” These are the URLs we target and pull into our keyword research phase for further analysis.

From our website quality audit we pull in the following columns of data and drop them into our keyword research file:

  • URL Actions
  • URLs
  • Category
  • The URL's main keyword, its associated monthly search volume, and current ranking
  • The URL's best keyword, its associated monthly search volume, and current ranking
  • The number of organic sessions the URL has received over the past 12 months
  • The number of backlinks the URL has

Pulling in data from the Blueprint Training WQA to the Keyword Research for Existing Pages file
Pulling in GA data from the WQA to the Keyword Research file

Adding Domain Rating & URL Rating from Ahrefs

One of the final elements we add to the spreadsheet before we dive into the actual keyword research process is the Domain Rating and URL Rating (Ahrefs metrics) of each URL.

Note: The standard Blueprint process actually pulls in Moz’s DA/PA instead of Ahrefs’ DR/UR but either is acceptable so long as you stay consistent.

Determining the Primary Keyword & Secondary Keywords the Page Should be Ranking For

There are three primary approaches we take to determine the primary keyword a page should be ranking for:

  1. Examining the page's existing keywords
  2. Manual inspection and keywords explorer
  3. Competitor analysis

I won’t be going in-depth on how to determine the primary keyword and secondary keywords a page should be targeting in this article. That’s something I fully expect most people who read this to have already figured out. 

If you want a quick way to get a primary keyword, just default to the parent keyword returned in Ahrefs' keyword research tool, so long as the intent you see on the SERPs aligns with your brand or content piece.

Determining Searcher’s Intent

After determining our primary and secondary keywords, we then determine what the searcher’s intent will be for those queries.

Search Intent Mapped to Marketing Funnel Stages

Getting the searcher’s intent right is critical. If you get it wrong, you’re going to have a difficult (if not impossible) time ranking on the first page for your target keywords.

You can learn more about how to determine search intent and user journey here.

Collecting Competitor Data from the SERPs

This is where things begin to get laborious and also where most members contract a virtual assistant who can manage data entry — because that’s exactly what this next step is: pure data entry.

Collecting competitor data from the SERPs for SEO

In the standard approach, we scrape the top 3 results for our primary keywords and collect data points on them to add to our spreadsheet. These include:

  • The content type of the URL (e.g. resource guide, blog post, service page, landing page, etc.)
  • The number of dofollow referring domains pointing at the URL (not domain)
  • An estimate of the page’s link velocity over the past 12 months
  • The PA and DA (we use UR and DR) of the URL

Comparing Our Stats to the Competition on the SERPs

After finally collecting all of the data for the top 3 ranking pages for each of the primary keywords we want to target, we then crosswalk that against our own data and use several custom formulas to help see how we stack up.

Comparing SEO competitor data to see how we stack up

The cross-walked data for each query, across all three results, consists of:

  • The average RD 
  • The average DR
  • The average UR
  • The average monthly link velocity
  • A DR opportunity score
  • A UR opportunity score
  • A link opportunity score
  • A custom aggregate opportunity score
  • A custom priority score
  • The average content type
  • A content analysis formula that lets us know if our content type is fitting
  • A target pages assessment, which tells us whether or not we should focus on that URL at the beginning of a campaign

The end result of this report is a clear picture of which pages you should tackle first, and in what order, to get fast results for your client’s campaign.

Instead of basing everything on gut instinct, you will have something backed up with data. This systematized approach is how you can effectively manage keyword research at scale across a variety of clients.

Adapting Blueprint Keyword Research

Ryan and David did a great job developing this process and they both continue to iterate on things as time goes on. 

That being said, one of the beautiful things about the Blueprint Training is that we are encouraged to adapt and modify the sheets to make them fit our own preferences and needs.

The base approach is solid and has resulted in several wins for my own clients (we grew the organic traffic of one health and wellness client by over 518% in a 3-month period using the standard approach) — however, there were a few things I thought we could iterate on to make them even better.

These include:

  • Calculating a more accurate link velocity by including more data points and some basic calculated fields
  • Building more accurate link budget projections using a bit of basic math
  • Calculating a value ratio to improve the accuracy of our priority score
  • Collecting more data from the SERPs to improve the accuracy of our opportunity score
  • Speeding up the data collection process using a bit of Python and simple Google Sheet formulas

True Velocity & Estimated Catch Up Rate

In the original iteration of Ryan’s Keyword Research template, there’s a single column for link velocity.

The way this is calculated involves opening the Domain Overview report in Ahrefs, looking at the Backlinks profile graph, and estimating the average velocity at which the domain has been earning unique referring domains over the past 12 months.

Calculating link velocity for SEO using Ahrefs

True Link Velocity

This was one of the first processes we decided to improve as we ran more and more keyword research deliverables for clients.

Rather than estimating the link velocity, we figured it would be a simple fix to just accurately calculate it by adding two additional columns to the template:

  • One for the number of unique referring domains 12 months ago
  • One for the number of unique referring domains as of now
Calculating true link velocity for SEO using a basic formula

Adding these two fields meant we didn’t have to rely on the arithmetic abilities of a virtual assistant who happened to be collecting the data for us. They simply had to stick to pure data entry. 

Note: You can now export the Referring Domains graph from the Ahrefs overview report. If you're doing this for a large set of keywords you can have a VA export these for you to cut down on costs. Just make sure you have a good quality control process in place for data integrity.

We also changed the RD Velocity field into a calculated field to make things easier on us as well. 

That formula is structured as follows:

=arrayformula(if(X3:X <> "", ((X3:X-W3:W)/12),""))

The first condition (X3:X <> "") is simply checking that data exists in the RD End column — if it doesn’t, the formula won’t run (which saves on computational resources).

The second half of the formula is just taking the difference between RD End and RD Start and dividing it by 12 to calculate the monthly rate at which the URL has been acquiring unique referring domains over the past year.
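For example (with purely illustrative numbers), a URL that had 40 referring domains a year ago (RD Start) and has 64 today (RD End) has been earning (64 - 40) / 12 = 2 new referring domains per month.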

Estimated Catch Up Rate

One thing we always had a difficult time estimating was how much the client needed to budget for link building in order to be effective.

Determining this with absolute accuracy is practically impossible — so instead we decided to take an approach that would at least get us closer to reality than what we were currently doing (which was basically pissing in the dark). 

Calculating the monthly link velocity needed to catch up to the competition in SEO

We added a calculated field to estimate the monthly velocity at which we would have to acquire referring domains to catch up at the individual URL level.

=iferror(arrayformula(if(A4:A <> "",(AR4:AR-H4:H)/(BH1),"")),0)

We calculated this by taking the difference between the average referring domains (Avg RD — Column AR) column and the column that recorded the number of referring domains for our client’s URL (Column H) and then dividing that by a variable field we called Catchup by Month.
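As a hypothetical example: if the top-ranking pages for a query average 36 referring domains, our client's URL has 12, and Catchup by Month is set to 6, we'd need (36 - 12) / 6 = 4 new referring domains per month to close the gap in that window.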

Using a dynamic variable to calculate monthly link velocity for SEO

This allowed us to change the model dynamically and see the various budget ranges we could be working in — thus helping us find a good fit that would still deliver the client results while also staying within their budget.

We generated those ranges by adding two more columns to calculate a floor and ceiling for budgets (as represented below).

Calculating a floor and ceiling for link budgets

For folks who aren’t familiar with the symbols: 

  • ⌊x⌋ → Floor
  • ⌈x⌉ → Ceiling

These are basically just the upper and lower bounds of our budget estimates. We calculated these by taking the estimated link velocity required and multiplying it by two variables: one representing the low end of what we might expect to spend on a link, and one representing the high end.
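As a purely illustrative example: if a URL needs 4 new referring domains per month and we assume a low-end cost of $150 per link and a high-end cost of $400, the floor works out to 4 × $150 = $600 per month and the ceiling to 4 × $400 = $1,600 per month. Those per-link costs are placeholders, so plug in whatever your own link acquisition actually costs.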

Again, these are just estimates and as such shouldn’t be treated as definitive representations of what you can expect a campaign to cost.

Some Flaws with this Approach

Content Quality vs Link Costs

It would be irresponsible of me to not point out a few key flaws with this method of determining an estimated link budget. 

The first is that it does not take into account indeterminate variables like the quality of the content you’re building links to. The higher your content quality, the more likely you are to land a greater number of referring domains without having to spend as much on “link acquisition.” 

That being said, nature abhors a vacuum, and everything balances out one way or another. Any budget you save on links is almost invariably offset by the budget you spend improving the content quality.

However, in my experience, it’s far cheaper in the long run to spend more on high-quality content and develop links passively over time than it is to try to shortcut things with a dedicated link budget.

Quality & Relevancy of Referring Domains

This model also fails to account for referring domain quality, in two ways. First, when collecting the number of referring domains for each competitor on the SERPs, it does not filter out low-quality RDs.

This invariably pollutes the data to some extent and could be remedied a bit by setting a quality filter before you count RDs (more about that at the end).

Secondly, it does not take into account the average quality or relevancy of the referring domains that the competitors are getting vs what you may be able to obtain.

It is our position that domain quality can and should be measured logarithmically. We can simplify this by using Ahrefs Domain Rating metric as a guideline. 

As such, if the average DR of your competitors' referring domains is 25 and your average is 50 — it’s reasonable to predict that you will need fewer referring domains to catch up with your competition than the model forecasts.

The same applies if the topical relevance of the competitors' referring domains is a far stretch from their primary topic and yours are not.

Isolating Links as a Ranking Factor is Not Reliable

If you think links are the only factor worth betting all of your SEO campaign efforts on, then you’re simply mistaken.

Are they important? Absolutely. 

However, don’t ignore the power of good on-page SEO and UX as well. Our core belief is that links and good on-page help you get to the first page and superb UX helps you climb your way to #1 and maintain that position.

Improving Opportunity Score

What’s the easiest way to increase the accuracy of our opportunity score?

Simple.

Add more data points.

If you recall, the opportunity score in this model is based on averages from the top 3 results as they relate to our client’s site.

To improve our score we decided to collect the SERP data (Content-Type, RDs, DR, UR, and Link Velocity) of the top 10 results. This more than triples the number of data points we have to work with when calculating our averages.

I’m betting you’re already seeing the problem with doing this though, right?

We’re trying to speed up and scale the rate at which we can collect our data and conduct our analysis so we can actually get to improving things sooner — how are we going to do that if we’re more than tripling the amount of data we have to collect?

Can we just hire more virtual assistants to collect the data for us? Sure. But the more data that needs to be collected, the higher the odds of mistakes being made during entry.

The same applies when you add more VAs into the mix. I’m convinced adding more people to a problem like this only increases the margin of error.

So what are we to do?

Let’s try automating things...even if just a tiny bit. 

A Dash of Python & Sheets Wizardry

Okay, so here’s where things are going to get a little more complex and perhaps even a bit foolish.

My approach here is still a work in progress so please, by all means — feel free to contribute!

The Core Problem is Still Data Collection

When we examine how we collect the data, the primary thing that consumes the most time is pulling the organic results from the SERPs for each keyword and pairing that up with data from Ahrefs.

Now, there are several solutions to this problem, and the easiest, which we went with first, is using the Ahrefs toolbar to quickly export data from the search results into an Excel file that we could use to populate our competitor data columns.

For a bit, we even built out dynamic Google search query links in the sheet that we could drop into a tool like URL Opener and open up all the SERPs at once (and then export the data from the Ahrefs toolbar).

This is still probably the simplest option if you want to go that route. The one thing that will still slow you down is if you’re doing local SEO (which is what we do a lot of) and need to change your geo-location using a Chrome extension like GS Location Changer.

Note: If you're not doing Local SEO you can call the Ahrefs API to get these data points directly in Google Sheets. It doesn't work for Local SEO because Ahrefs does not have the means to get geo-specific SERP data via their API, nor the means to identify if a local pack appears in the SERPs (which is the quickest indicator of local search intent). The Semrush API can do this if you want to use that instead.

We like to use code whenever we can though — if for no other reason than to justify our western KY college educations — so that’s why we decided to see what we could figure out using Python and the Query function in Google Sheets.

Fetching Results with SERP API

Create a Filter and Export Your Keywords as a CSV File

The first thing we do is create a new worksheet in our Keyword Research workbook that we call keyword_csv (you can name it whatever you want though). In this worksheet, I collect the keywords from the main Keywords worksheet by using a filter that excludes empty results.

=FILTER(Keywords_BETA!O2:O, Keywords_BETA!O2:O <> "")

A column of data ready for export

With that, I then export it as a CSV and upload it to Github so I can quickly import the data using Python later.

Note: There are definitely more efficient ways to do this step besides exporting it as a CSV to Github. You could import the data directly from the sheet using Python as well — so pardon the hasty implementation. We'll improve on this over time.

Once you have it loaded up on Github (or however you decide to approach it) view the CSV as a raw file on Github and keep that tab up because you’ll need that URL shortly.

Fire Up Colab and Write Your Python Program

The code you’ll need is below, so copy and paste it if you want. However, if you keep reading I’ll explain how it works a little bit as well.

Okay, so here’s a general overview of how this script works:

  1. Imports a set of keywords from a CSV file hosted on Github and adds them to a queue.
  2. Loops through the keyword queue, using the SERP API (you'll need an API key for this to work) to get the top 10 organic results from Google based on the provided geolocation.
  3. Loops through the 10 organic results for each keyword to extract the corresponding link and print it as output.

Don't forget, you'll need to upload your keyword data to Github as a CSV so you can set the url variable on line 5. We suggest using the URL for the raw CSV data from Github here.
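Here's a minimal sketch of that flow if you need a starting point. It assumes you're hitting SerpApi's REST endpoint with the requests library; the CSV URL, API key, and location values below are placeholders you'll need to swap for your own (CSV_URL plays the role of the url variable mentioned in the note above).

import csv
import io
from collections import deque

import requests

CSV_URL = "https://raw.githubusercontent.com/your-user/your-repo/main/keyword_csv.csv"  # placeholder: your raw Github CSV URL
API_KEY = "YOUR_SERPAPI_KEY"  # placeholder: your SERP API key
LOCATION = "Louisville, Kentucky, United States"  # placeholder: the geo-location you want SERPs for

# 1. Import the keywords from the raw CSV on Github and add them to a queue.
rows = csv.reader(io.StringIO(requests.get(CSV_URL).text))
keyword_queue = deque(row[0].strip() for row in rows if row and row[0].strip())

# 2. Loop through the queue and request the top 10 organic results for each keyword.
while keyword_queue:
    keyword = keyword_queue.popleft()
    params = {
        "engine": "google",
        "q": keyword,
        "location": LOCATION,
        "num": 10,
        "api_key": API_KEY,
    }
    response = requests.get("https://serpapi.com/search.json", params=params)
    response.raise_for_status()
    organic_results = response.json().get("organic_results", [])[:10]

    # 3. Print each ranking URL so it can be pasted into the keywords_serp_data sheet.
    print(keyword)
    for result in organic_results:
        print(result.get("link", ""))
    print()  # blank line between keywords

The output is each keyword followed by its ten ranking URLs, which maps directly onto the copy-and-paste step in the next section.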

How to Make this Function Better:
This could have been made much simpler using a data frame object and using that to output to a CSV. I have a method for this in beta. I'll update this post in Q1'21 when I have it cleaned up and finalized.
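Until that cleaned-up version is ready, here's a rough sketch of how the DataFrame approach could work, assuming pandas and the same SerpApi response shape as above. Treat it as a starting point rather than the finished beta method.

import pandas as pd
import requests

API_KEY = "YOUR_SERPAPI_KEY"  # placeholder
LOCATION = "Louisville, Kentucky, United States"  # placeholder
keywords = ["car accident lawyer", "dui lawyer"]  # example keywords, or reuse the queue built in the sketch above

# Collect one row per (keyword, position, url), then dump everything to a single CSV.
rows = []
for keyword in keywords:
    params = {"engine": "google", "q": keyword, "location": LOCATION, "num": 10, "api_key": API_KEY}
    response = requests.get("https://serpapi.com/search.json", params=params)
    for position, result in enumerate(response.json().get("organic_results", [])[:10], start=1):
        rows.append({"keyword": keyword, "position": position, "url": result.get("link", "")})

pd.DataFrame(rows).to_csv("keyword_serp_data.csv", index=False)

A single tidy table (keyword, position, url) imports straight into Sheets, so you can skip the copy-and-paste step entirely.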

Using Ahrefs Batch Analysis: So Much Cooler than the Toolbar

When you run this program it will give you an output of the organic results for each query you passed to the API.

We created a sheet called keywords_serp_data that we copy and paste these URLs into for safekeeping, reference, and future use.

We then take the URLs and run them through Ahrefs Batch Analysis (in batches of up to 200, as that’s the max Batch Analysis can run at once).
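If you're pulling the top 10 results for more than 20 keywords you'll blow past that 200-URL limit quickly, so a small helper like the sketch below can split the list into paste-ready chunks (the serp_urls.txt filename is just a placeholder for wherever you saved the URLs from the previous step).

# Split a list of URLs into chunks of 200, one text file per chunk, for Ahrefs Batch Analysis.
def chunk_urls(urls, size=200):
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

with open("serp_urls.txt") as f:  # one URL per line
    urls = [line.strip() for line in f if line.strip()]

for n, batch in enumerate(chunk_urls(urls), start=1):
    with open(f"batch_{n}.txt", "w") as out:
        out.write("\n".join(batch))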

Be sure that you set the Protocol to “As Specified” and Target Mode to “URL.” Setting Protocol to “As Specified” keeps the http/s protocol the same as your input (otherwise it will strip it) and setting Target mode to URL ensures you get URL Rating data in your results as well.

Ahrefs batch analysis for quick data collection during SERP analysis

Once you’ve run your Batch Analysis go ahead and export it and import that data into another new worksheet in your Keyword Research workbook (we named ours ahrefs_batch_analysis because we’re creative).

Automating SEO with Python and Google Sheets

Once you have your batch analysis data in your worksheet you’re ready to begin manipulating things a bit using some Google Sheets formulas. If you're new to Google Sheets and need a crash course, check out Ben Collins' website for some amazing Google Sheets tutorials.

Using Query, Transpose, Split, Rept, Trim and the Rest of the Kitchen Sink to Do Something Simple

Okay, so what we want to accomplish is a way to take the URLs for each keyword and match them up with the data from the Ahrefs batch analysis.

Note: There are undoubtedly easier ways to do this...you're encouraged to develop an easier method.

We created another sheet called keyword_serp_data and in it, added our set of keywords we collected SERP data for.

We add just the URLs that we ran through batch analysis into Column C starting at C1 (no header because reasons).

Then we add the following formula into B1 and let the magic happen:

=arrayformula(trim(transpose(split(query(rept(E2:E23&"♠", 10), ,999^99), "♠"))))

What we’re doing here is creating keyword keys for each of our URLs so that we know which keyword each one belongs to: REPT repeats each keyword 10 times (delimited by the ♠ character), QUERY collapses those repetitions into a single string, SPLIT breaks that string back apart on ♠, and TRANSPOSE and TRIM stack the results vertically as clean values, leaving each keyword repeated once for every one of its 10 URLs in Column C.

Using a Simple Transpose on a Query to Format Things Properly

The last thing we’ll do in this sheet (which you can already see in action in the GIF above) is add a query nested inside of a transpose function to get our layout the way we want it.

=transpose(query(B:C,"select C where B = '"&E2&"'"))

We did this by adding that formula to cell F2. The end result will look something like this:

SEO keyword research for this article

Wrapping it Up with a Few Simple VLOOKUPS

The last things we’ll be adding to finalize our sheet are some relatively straightforward VLOOKUP functions to grab the data we need to populate our competitor SERP data columns:

  • URL of the result
  • RD to result
  • UR of the result
  • DR of the result

You’ll need to adjust the columns you’re targeting as you move it between fields, as well as the value you want to return from the keyword_serp_data worksheet.

This formula collects the top organic ranking URL for each of the primary keywords:

=arrayformula(iferror(if(O3:O <> "", vlookup(O3:O,keyword_serp_data!E:O,2,false),""),""))

This collects the top organic ranking URL's referring domain count:

=arrayformula(iferror(if(O3:O <> "", VLOOKUP(T3:T,ahrefs_batch_analysis!B:I,8,false),""),""))

The rest of the formulas are being left as an exercise for the reader. In seriousness though, if you need help — just drop us a message via the contact form or email me and I’ll make time to make this all make sense for ya.

Next: Iterate & Improve

This method isn’t perfect and it may seem like a lot of work but it actually only takes about 5 minutes to do once you have everything set up.

Unfortunately, we have not yet figured out a way to feasibly automate the link velocity portion of this sheet. I’ve been reviewing the API documentation for SEMrush and playing around with it to see if I can get historical records of referring domains by date, the way we can in the Domain Overview GUI, but I have yet to figure it out.

Ahrefs is of no help either. They do not currently have a way to extract that historical data from the Backlink profile overview graph in bulk. They did recently add the ability to export the overview graphs as CSVs though, so maybe they'll add a way to export these in bulk down the road...

Making the Python process more seamless is a primary goal: reducing the manual actions a user has to take to get things set up, cleaning up the code overall, and adding some conditional statements to help categorize content types for you by identifying patterns with regex.

I hope this was helpful to anyone who likes to dabble with a bit of programming and is always looking for ways to improve the efficiency of their current processes or at the very least, learn along the way.

If you liked this article and want to increase the likelihood that other people join me in solving some of these problems, be sure to share it. That’s probably the best way to get others involved.
