At Mocavo we take a lot of pride in our technology. Whether it’s a core product like Mocavo’s custom-built genealogy search engine, or experimental tools like our handwriting recognition technology, we have some incredibly talented engineers creating technological breakthroughs to bring family history resources online.
One particular challenge of searching more than 275,000 datasets for billions of names is ensuring that results are lightning-quick. A fast site is good for many reasons; most importantly, our users can quickly make new discoveries, and search engines like Google can expose our free content to new people. In fact, over the past few years Google has incorporated site speed as a critical factor in its ranking algorithms; through meticulous testing, they found that faster sites = happier users. And they reward swift sites accordingly.
In the summer of 2013, our engineering team set out to re-architect the way we serve up our site to make it lightning fast for users as well as search engines. This process of optimizing a site for search engines is known as SEO (Search Engine Optimization). I’d like to share our progress.
First, let’s put some context around the size of our collection: my colleague Derrick provided an excellent overview of how we manage our datacenter, which includes over a petabyte of storage. A petabyte is roughly one quadrillion bytes, a 16-digit number, and over 1,000 times the storage capacity of the average new PC sold today. For context, a petabyte’s worth of music would play continuously for over 2,000 years; a petabyte of movies would fill over 223,000 DVDs.*
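For the curious, here’s a quick back-of-the-envelope check of those comparisons. The bitrate and disc capacity below are our own illustrative assumptions, so treat the results as ballpark figures rather than exact ones:

```python
# Rough check of the petabyte comparisons above (illustrative assumptions only).
PETABYTE = 2 ** 50                  # binary petabyte, ~1.13e15 bytes

mp3_bytes_per_sec = 128_000 / 8     # assuming ~128 kbps MP3 audio
seconds_of_music = PETABYTE / mp3_bytes_per_sec
years_of_music = seconds_of_music / (60 * 60 * 24 * 365)

dvd_bytes = 4.7e9                   # single-layer DVD capacity in bytes
dvd_count = PETABYTE / dvd_bytes

print(f"~{years_of_music:,.0f} years of music")   # roughly 2,200 years
print(f"~{dvd_count:,.0f} DVDs")                  # roughly 240,000 DVDs
```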
That’s a huge amount of data!
So to create a lightning-fast site, we had to figure out a novel way to distribute this massive (and growing) dataset across the servers in our datacenter and still retrieve results very quickly with our genealogy search engine.
Changing Search Retrieval Time
Last summer, before we started this project, our average end-user load time was more than 4 seconds. Roughly a quarter of that time was spent retrieving search results.
After several weeks of testing out different methodologies, our engineering team created a system of custom caching servers that can store and index all of the content very quickly. That means that our site stores a ‘copy’ of every record on our servers and can retrieve this content without a lot of processing overhead. This new caching mechanism allows us to retrieve some records from our search engine in under 20 milliseconds and most book pages in under 75 milliseconds.
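We haven’t published the internals of those caching servers, but the underlying pattern is a familiar one: keep a ready-to-serve copy of each record keyed by its ID, and only fall back to the full search stack on a miss. Here’s a minimal cache-aside sketch of that idea; the class and method names are illustrative, not our actual code:

```python
# Minimal cache-aside sketch (illustrative names, not our production code).
class RecordCache:
    def __init__(self, backend):
        self.backend = backend   # the full, slower search/retrieval stack
        self.store = {}          # in production: a dedicated caching tier, not a dict

    def get_record(self, record_id):
        record = self.store.get(record_id)
        if record is None:                        # miss: do the expensive lookup once
            record = self.backend.fetch(record_id)
            self.store[record_id] = record        # keep the copy for next time
        return record
```

Spread across dedicated servers and pre-warmed with every record, a lookup like this reduces to a single key fetch, which is how individual records can come back in tens of milliseconds.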
Additionally, we spent some time simplifying other client- and server-side code, further reducing page load time. Once all of these changes were deployed in October, our aggregate page load times were cut roughly in half:
Other SEO Friendly Changes
Another important SEO consideration for us was to update and standardize our URL structure. During the previous two years of rapid growth, our URLs took on a variety of styles, some legible to users, others not so much. And some of those styles were less than ideal in terms of search engine friendliness.
For example, for our popular Social Security Death Index collection, we had all of the following URL styles at the same time:
As a user browsing through a set of results on Google, which style most intuitively indicates what the page is about? Indeed, the third version quickly tells you the who, what and when about the page. Something like /ssdi/16889126271618839064 doesn’t communicate a whole lot of context.
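Building that descriptive style is mostly a matter of deriving a slug from the record’s own fields instead of exposing an internal ID. A minimal sketch, with illustrative field names and URL pattern rather than our actual routing code:

```python
import re

def slugify(text):
    """Lower-case a string and replace punctuation and spaces with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def record_url(collection, name, year):
    """Build a URL that says who, what, and when instead of an opaque ID."""
    return f"/{slugify(collection)}/{slugify(name)}-{year}"

# record_url("Social Security Death Index", "John Smith", 1962)
#   -> "/social-security-death-index/john-smith-1962"
```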
So after careful consideration of cases like this, we overhauled the entire URL structure of the site and then submitted new sitemaps to Google.
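Generating those sitemaps is conceptually simple: an XML file listing the canonical URLs. A simplified sketch follows; in practice the list is split across many files, since the sitemaps protocol caps each file at 50,000 URLs.

```python
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Render a minimal sitemap.xml body for a list of canonical URLs."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )
```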
Google Crawl Rate
With the combination of a faster site and consolidated URL structure, Googlebot is now eating up our content as fast as it can. In August 2013, Google crawled as few as 75,000 pages per day as the site took over 1.5 seconds to deliver a single page. But after we rolled out and tweaked the custom caching solution, the time for Googlebot to download a page dropped to roughly 242 milliseconds.
As the page loading time decreased, Googlebot increased the number of pages per day that it crawled. Today they’re accessing about 2 million pages per day; that’s over 23 pages per second!
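For anyone checking the math, that rate works out directly from the daily total:

```python
pages_per_day = 2_000_000
pages_per_second = pages_per_day / (24 * 60 * 60)
print(f"{pages_per_second:.1f} pages per second")   # ≈ 23.1
```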
It took a few weeks for Google to digest the various changes, but we’re proud to report that the number of Mocavo pages indexed by Google has increased nearly 10-fold in a few months. Here is a great screenshot from Google Webmaster Tools showing the evolution of our site in the Google index:
A Big Slice of the Web
But just how big is that? According to estimates from http://worldwidewebsize.com/, there are somewhere between 20 and 50 billion web pages online. That means Mocavo’s index represents somewhere between 0.11% and 0.29% of the entire web. And it’s growing every day!
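Working backwards from those percentages gives a rough sense of the index size they imply; the page counts below are back-derived estimates from the stated range, not separately reported figures:

```python
# Back-derive the implied index size from the stated share of the web.
# 0.11% is the share against the high estimate of the web's size,
# 0.29% the share against the low estimate.
web_low, web_high = 20e9, 50e9

implied_a = 0.0011 * web_high    # ≈ 55 million pages
implied_b = 0.0029 * web_low     # ≈ 58 million pages
print(f"implied index size: about {implied_a/1e6:.0f} to {implied_b/1e6:.0f} million pages")
```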
We’re quite proud of this investment in SEO, as our growth in Google’s index means our content is available to an even greater audience, all of it free forever.