Building a Search Engine: Introduction
So this is my first blog post about my new project.
I’ve decided to chronicle this process from (nearly) the beginning, long before the outcome is known or is, really, anywhere in sight. Many years ago I read an interview with Joel Spolsky in which the interviewer asked him if he had advice for anyone trying to replicate Fog Creek’s success. He said something to the effect of:
Sure. Start a blog 5 years ago and make it immensely popular. The rest is easy.
Sage advice.
Background
My name is Mike, and I’m a programmer living in San Francisco. For the last 3 or so years I’ve been running a small dev shop called RestlessDev, freelancing and consulting with clients in the Bay Area.
Freelancing is exciting work; you get to work with interesting companies doing interesting things, and I do enjoy the free time between gigs. The one downside to the lifestyle is that it is inherently unpredictable. Projects come and go regularly, and you need to save a lot of money to make sure you can ride out the valleys. In the past I’ve had some other passive income streams set up (primarily Facebook and Twitter apps), but changes in those platforms have made them much harder than they used to be to maintain and grow without significant work.
Criteria
While searching around for ideas on what sorts of things I could build next to supplement my income, I started to distill some criteria for the new project:
- It shouldn’t require much maintenance: Time is very valuable, and the less I spend keeping it running, the better.
- It shouldn’t require network effects to be useful: Too many product ideas are of the “this-will-be-great-when-we-have-10-million-users” variety. Getting that sort of usage is very difficult, and is usually only accomplished through excellent marketing and product design that emphasizes virality. I prefer building products that I want to use myself, and rarely have I ever wanted to use something meant to be viral.
- Monetization should be straightforward: The goal of this project is to provide passive income. If it doesn’t have an obvious path to monetization, it wouldn’t really be worth making.
- It should be interesting to build: As someone who makes websites and apps for a living, I don’t come across many unknowns in my day-to-day work; most projects are permutations of things I have been doing for years, using well-known tools and frameworks. This project should take me out of my comfort zone and let me learn something new.
Why a Search Engine?
Lots of projects would meet these criteria while being a lot simpler to build than a search engine. Why make it hard on myself?
- It’s pure: Very few products are as conceptually simple as a search engine. Take one or more input terms and produce a listing of results that give people the information they are searching for. Unlike many other products there isn’t a lot of ambiguity.
- There is a good model for success: Google is a wonderful product, and if you’re building a search engine, you can always compare your results to theirs to see how well you are doing. Many other product categories involve searching around in the dark for something, hoping you will find something people need. With search engines you already know the target. You just need to do something better.
- The incumbent has some holes: Now, just because Google is a wonderful product doesn’t mean that they are the best at everything. They can’t go after every vertical, nor should they, given that they are a general-purpose tool relied on by hundreds of millions or billions of people every day. Therein lies opportunity.
- Solving the problems of a search engine gives you the tools to solve other problems too: Most of the other ideas I have involve massive amounts of data, web spiders, indexers, and things like that. Since I need to figure out all of those things anyway, I can reuse them again and again on other projects.
- It emphasizes engineering over design: At least initially. This plays to my personal strengths.
So What is the Plan?
Building a search engine is a daunting task, made even more daunting by the fact that I am not using many off-the-shelf software components to do it. The search engine will run on a custom database (built to run on top of an existing distributed filesystem) and the web pages will be served up from a custom web framework.
The rationale here is this: To make a search engine work, you need to download many, many terabytes of data from all over the internet, and once you choose a software platform to build on top of, it becomes incredibly difficult to move later. From what I’ve seen, each of the popular NoSQL databases works really well up until the point it doesn’t, and I don’t want to spend months building around the nuances of something I don’t understand intimately. By building my own database (using some existing components) I’ll know all of the tradeoffs I’ve made and can optimize around them. Assuming it works (an admittedly big “if”), owning the database also becomes something of a competitive advantage against other small search engines built on standard platforms. My database will be optimized for the problem at hand and will be integrated very tightly with the web framework, leading to faster iteration and rapid feature development.
Of course, I know this won’t be easy. I’ve built a database before, though, as well as several web frameworks, so I’m pretty confident that I can get it done.
The project has already started and will proceed through the following steps:
- Build the database. This is written in Java, and will consist of a variable number of identical nodes running in parallel, each having responsibility for writes to different segments of the data on a shared distributed filesystem. This is what I am making now.
- Test the database. Run it through its paces and develop use cases for it. Start up a few nodes and make sure they work together under load. Do some optimizing as needed.
- Build the spider. This is a small program that crawls around the web, grabbing content and sending it into the database for indexing.
- Test the spider.
- Build an indexer. This program takes the pages sent in by the spider and indexes them, extracting links to add to the queue, analyzing on-page relevance for search terms, checking links to other pages, etc. This is where most of the magic of the search engine happens, as much of the source data is generated at this stage.
- When the database, spider, and indexer seem to be ready, I will purchase some hardware to put them on and start slurping up content from around the web.
- Build a throwaway version of the web interface in PHP that I can use to test the search engine while it does its thing. This won’t be styled in any way, and will just be a sanity check to make sure everything is going well.
- Build the web framework. This will also be in Java, built around an embedded web server, probably Jetty. Do this while the initial index is being built.
- Start building out the web site on the new framework. Include the monetization.
- When everything is ready to go, deploy the web framework to an Amazon EC2 instance and open it up to the world.
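To make the first step a bit more concrete, here is a rough sketch of how “identical nodes, each responsible for writes to different segments” could be wired up. Everything in it is an assumption for illustration only: the SegmentRouter name, the fixed segment count, the CRC32 hash, and the modulo assignment of segments to nodes are all hypothetical, and the real database may end up looking nothing like this.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: maps a record key to a segment, and a segment to the
// single node that is allowed to write to it. Not the actual database design.
final class SegmentRouter {
    private final int segmentCount; // assumed fixed number of data segments
    private final int nodeCount;    // number of identical nodes currently running

    public SegmentRouter(int segmentCount, int nodeCount) {
        if (segmentCount <= 0 || nodeCount <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        this.segmentCount = segmentCount;
        this.nodeCount = nodeCount;
    }

    // Hash the key so that the same key always lands in the same segment.
    public int segmentFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % segmentCount); // getValue() is non-negative
    }

    // Simplest possible ownership rule: segments are dealt out round-robin.
    public int ownerOf(int segment) {
        return segment % nodeCount;
    }

    // The node that should receive a write for this key.
    public int nodeForWrite(String key) {
        return ownerOf(segmentFor(key));
    }
}
```

With a router like `new SegmentRouter(64, 4)`, every write for a given URL would be sent to the same one of four nodes, so no two nodes ever write to the same segment on the shared filesystem. A modulo scheme like this reshuffles ownership whenever the node count changes, which is one of the tradeoffs a real design would have to address.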
That’s it: Conceptually simple but tactically difficult.
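The spider and indexer steps in the plan above feed each other: the indexer pulls links out of each page and pushes them back onto the queue for the spider. Here is a minimal in-memory sketch of that loop. The CrawlFrontier class and its method names are hypothetical, and the regular expression is a deliberate simplification (it only sees double-quoted href attributes); a real indexer would use a proper HTML parser and a persistent queue.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the queue shared by the spider and the indexer.
final class CrawlFrontier {
    // Simplification: matches only href="..." with double quotes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    private final Queue<URI> queue = new ArrayDeque<>();
    private final Set<URI> seen = new HashSet<>();

    // Queue a URL unless it has been offered before; returns true if it was new.
    public boolean offer(URI url) {
        if (seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false;
    }

    // Next URL for the spider to fetch, or null when the queue is empty.
    public URI next() {
        return queue.poll();
    }

    // Pull links out of a page and resolve them against the page's own URL.
    public static List<URI> extractLinks(URI pageUrl, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String raw = m.group(1).trim();
            int hash = raw.indexOf('#');          // drop any #fragment
            if (hash >= 0) raw = raw.substring(0, hash);
            if (raw.isEmpty()) continue;
            try {
                URI resolved = pageUrl.resolve(raw).normalize();
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved);          // ignore mailto:, javascript:, etc.
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it rather than fail the whole page.
            }
        }
        return links;
    }

    // Indexer side: queue every link on the page, returning how many were new.
    public int indexPage(URI pageUrl, String html) {
        int added = 0;
        for (URI link : extractLinks(pageUrl, html)) {
            if (offer(link)) added++;
        }
        return added;
    }
}
```

The spider would call next(), fetch that page, and hand the HTML to indexPage(), which is where relevance analysis would also happen. The seen set is what keeps the crawl from looping forever on pages that link to each other.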
Documenting the Process
As I build out the search engine, I will write a number of blog entries about it. Some will be insightful, some will be nonsense. Some will be productive, and others false starts.
I was recently talking about this with a friend of mine, an accomplished computer science type, and it was incredible how liberating it felt to have someone to bounce my thoughts off of. Whenever you take on a challenge that pushes you, it can be an isolating experience; you don’t know what you don’t know, and your head can go down many rabbit holes while you search for answers.
My hope is that by making the process public, the holes will be illuminated just a little bit more.
