Backup Google Search Console Data via MongoDB (Python)

02/07/2023

Google Search Console data expires over time, and with it one's own data sovereignty slowly erodes: important data and even whole keyword sets simply disappear. A few lines of code are enough to secure them. I use this short script, among other things, in the context of Bayesian optimization of meta descriptions through simulations. The scope and size of the backup vary from customer case to customer case; here I only show, in a few lines, how the basic metrics can be loaded into a database in a semi-automated way.

Requirements for using the script

As of today (February 2023), the script works with the following packages/requirements:

  • Google API Python client installation (pip install --upgrade google-api-python-client)
  • pymongo for the MongoDB connection (if transfer to a database is desired)
  • Activation of the Google Search Console API via this link: https://console.cloud.google.com/
  • To run the Jupyter Notebook you can use, e.g., Anaconda Navigator (https://docs.anaconda.com/navigator/index.html)
  • Additional libraries may be necessary; if so, consult the error messages and the respective documentation
  • And of course read access to the corresponding Google Search Console Property
  • A MongoDB database has already been set up (https://www.mongodb.com/docs/v4.2/installation/)

The Script

TL;DR: https://github.com/marvinjoers/gsc-backup-by-time-range/blob/main/GSC_Backup_Data.ipynb

Import the following modules/libraries:

import os
import sys
from googleapiclient import sample_tools
import pymongo

Connect database

Set the connection data for MongoDB. Alternative database systems can also be used at this point, for example if more trust is placed in a MySQL database (https://dev.mysql.com/doc/connector-python/en/connector-python-installation-binary.html); a minimal MySQL sketch follows the snippet below.

clientDB = pymongo.MongoClient("mongodb://___")  # connection string
storage = clientDB["___"]                        # database name
collection = storage["___"]                      # collection name
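
For the MySQL alternative mentioned above, a minimal sketch using mysql-connector-python could look like this (all connection values are placeholders):

import mysql.connector

# Placeholder credentials; replace with your own connection data
cnx = mysql.connector.connect(host="___", user="___", password="___", database="___")
cursor = cnx.cursor()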

Google Search Console Parameters

property_uri = '___'
start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'  # same day as start_date for a single-day backup

# Set os.path.abspath('./') so that client_secrets.json can be found
# Obtain the file for the GSC API at https://console.cloud.google.com/
service, flags = sample_tools.init([], 'searchconsole', 'v1', __doc__, os.path.abspath('./'), scope='https://www.googleapis.com/auth/webmasters.readonly')
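
As a quick sanity check that the authentication works, you can list the properties the authorized account can read; sites() is part of the same Search Console API (this check is an optional addition, not required by the script):

verify = service.sites().list().execute()  # returns the properties this account may access
print(verify)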

Request Data

Next, the request must be defined. What exactly should be saved depends on the individual case. In a simplified case we only want to save the aggregated clicks, impressions, CTR, and position for the search type WEB (which does not include Discover or Google News).

TOTAL_REQUEST = {
    'startDate': start_date,
    'endDate': end_date,
    'searchType': 'WEB',
    'rowLimit': 1
}

The documentation (https://developers.google.com/webmaster-tools/v1/searchanalytics/query) describes the available filter options.
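
A request broken down by search query could, for example, look like this (a sketch based on the documented dimensions parameter; 25,000 is the documented maximum number of rows per request):

QUERY_REQUEST = {
    'startDate': start_date,
    'endDate': end_date,
    'searchType': 'WEB',
    'dimensions': ['query'],  # one row per search query
    'rowLimit': 25000         # documented maximum per request
}

Each returned row then contains a 'keys' list with the query string in addition to the four metrics.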

Evaluate Response

With the following command:

TOTAL_RESPONSE = service.searchanalytics().query(siteUrl=property_uri, body=TOTAL_REQUEST).execute()

the required data is loaded into TOTAL_RESPONSE. Via a previously defined auxiliary function:

def getValues(r):
    return [r[0]['clicks'], r[0]['impressions'], r[0]['ctr'], r[0]['position']]

rows_TTL = getValues(TOTAL_RESPONSE['rows'])

we can store the values of the first (and only) returned data set in an array via the dict keys. The form in which the data should be stored in the database is up to you. Alternatively, since every single row of the response corresponds to a dict, the single dict can be conveniently loaded directly into MongoDB. The date may have to be appended to that dict, because TOTAL_RESPONSE contains neither the start nor the end date. If broken-down (filtered) data is to be stored, the GSC request and the MongoDB schema must be adapted accordingly.
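
A minimal sketch of that alternative, reusing the response from above:

row = TOTAL_RESPONSE['rows'][0]  # each row of the response is already a dict
row['date'] = start_date         # append the date, since the response carries none
collection.insert_one(row)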

Load Data

The easiest way is to prepare your own dict and load the data extracted into rows_TTL into the collection with custom keys. The most basic version would be the following sample:

temp = {}
temp["date"] = start_date
temp["your_clicks"] = rows_TTL[0]
temp["your_impressions"] = rows_TTL[1]
temp["your_ctr"] = rows_TTL[2]
temp["your_pos"] = rows_TTL[3]
temp[...] = ...  # add further custom fields as needed
collection.insert_one(temp)
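
To verify the write, you can read the document straight back out of the collection (an optional check, not part of the backup itself):

check = collection.find_one({"date": start_date})
print(check)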

Get Data by Time Range

Alternatively, you can use a for-loop to save all 16 months of available GSC data in one go and subsequently only export the most recent date:

import pandas
from datetime import date, timedelta

sdate = date(2021, 12, 27)  # first day to back up
edate = date(2023, 2, 5)    # day after the last day to back up
all_dates = pandas.date_range(sdate, edate - timedelta(days=1), freq='d').strftime('%Y-%m-%d').tolist()
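
all_dates now holds one ISO-formatted date string per day, e.g. ['2021-12-27', '2021-12-28', …, '2023-02-04'].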

...

for d in all_dates:
    # Reuse the request defined above, querying one day at a time
    TOTAL_REQUEST['startDate'] = d
    TOTAL_REQUEST['endDate'] = d
    TOTAL_RESPONSE = service.searchanalytics().query(siteUrl=property_uri, body=TOTAL_REQUEST).execute()
    if 'rows' in TOTAL_RESPONSE:  # days without data return no rows
        rows_TTL = getValues(TOTAL_RESPONSE['rows'])
        collection.insert_one({"date": d, "your_clicks": rows_TTL[0], "your_impressions": rows_TTL[1], "your_ctr": rows_TTL[2], "your_pos": rows_TTL[3]})