restlessdev blog

Building a Search Engine: Introduction

Google's First Production Server by Jurvetson, on Flickr

So this is my first blog post about my new project.

I’ve decided to chronicle this process from (nearly) the beginning, long before the outcome is known or is, really, anywhere in sight. I read an interview many years ago given by Joel Spolsky where the interviewer asked him if he had advice for anyone trying to replicate Fog Creek’s success. He said something to the effect of:

Sure. Start a blog 5 years ago and make it immensely popular. The rest is easy.

Sage advice. 

Background

My name is Mike, and I’m a programmer living in San Francisco. For the last 3 or so years I’ve been running a small dev shop called RestlessDev, freelancing and consulting with clients in the Bay Area. 

Freelancing is exciting work; you get to work with interesting companies doing interesting things, and I do enjoy the free time between gigs. The one downside to the lifestyle is that it is inherently unpredictable. Projects come and go regularly, and you need to save a lot of money to make sure you can ride out the valleys. In the past I’ve had some other passive income streams set up (primarily Facebook and Twitter apps), but changes in those platforms have made them much harder to maintain and grow than they used to be without significant work.

Criteria

While searching around for ideas on what sorts of things I could build next to supplement my income, I started to distill some criteria for the new project:

  1. It shouldn’t require much maintenance:  Time is very valuable, and the less I spend keeping it running, the better.
  2. It shouldn’t require network effects to be useful:  Too many product ideas are of the “this-will-be-great-when-we-have-10-million-users” variety. Getting that sort of usage is very difficult, and is usually only accomplished through excellent marketing and product design that emphasizes virality. I prefer building products that I want to use myself, and rarely have I ever wanted to use something meant to be viral.
  3. Monetization should be straightforward:  The goal of this project is to provide passive income. If it doesn’t have an obvious path to monetization, it wouldn’t really be worth making. 
  4. It should be interesting to build:  As someone who makes websites and apps for a living, I don’t come across many unknowns in my day-to-day work; most are permutations on things I have been doing for years, using well-known tools and frameworks. This project should take me out of the comfort zone and let me learn something new.

Why a Search Engine?

Lots of projects would meet these criteria while being a lot simpler to build than a search engine. Why make it hard on myself?

  • It’s pure: Very few products are as conceptually simple as a search engine. Take one or more input terms and produce a listing of results that give people the information they are searching for. Unlike many other products there isn’t a lot of ambiguity.
  • There is a good model for success: Google is a wonderful product, and if you’re building a search engine, you can always compare your results to theirs to see how well you are doing. Many other product categories involve searching around in the dark for something, hoping you will find something people need. With search engines you already know the target. You just need to do something better.
  • The incumbent has some holes:  Now, just because Google is a wonderful product doesn’t mean that they are the best at everything. They can’t go after every vertical, nor should they given the fact that they are a general purpose tool relied on by hundreds of millions or billions of people every day. Therein lies opportunity.
  • Solving the problems of a search engine gives you the tools to solve other problems too: Most of the other ideas I have involve massive amounts of data, web spiders, indexers, and things like that. Since I need to figure out all of those things anyway, I can reuse them again and again on other projects.
  • It emphasizes engineering over design: At least initially. This plays to my personal strengths.

So What is the Plan?

Building a search engine is a daunting task, made even more daunting by the fact that I am not using many off-the-shelf software components to do it. The search engine will run on a custom database (built to run on top of an existing distributed filesystem) and the web pages will be served up from a custom web framework. 

The rationale here is this: To make a search engine work, you need to download many, many terabytes of data from all over the internet, and once you choose a software platform to build on top of, it becomes incredibly difficult to move later. From what I’ve seen, each of the popular NoSQL databases works really well up until the point it doesn’t, and I don’t want to spend months building around the nuances of something I don’t understand intimately. By building my own database (using some existing components) I’ll know all of the tradeoffs I’ve made and can optimize around them. Assuming it works (an admittedly big “if”), owning the database also becomes something of a competitive advantage against other small search engines built on standard platforms. My database will be optimized for the problem at hand and will be integrated very tightly with the web framework, leading to faster iteration and rapid feature development.

Of course, I know this won’t be easy. Still, I’ve built a database before, as well as several web frameworks, so I’m pretty confident that I can get it done.

The project has already started and will proceed over the following steps:

  1. Build the database. This is written in Java, and will consist of a variable number of identical nodes running in parallel, each having responsibility for writes to different segments of the data on a shared distributed filesystem. This is what I am making now.
  2. Test the database. Run it through its paces and develop use cases for it. Start up a few nodes and make sure they work together under load. Do some optimizing as needed.
  3. Build the spider. This is a small program that crawls around the web, grabbing content and sending it into the database for indexing.
  4. Test the spider.
  5. Build an indexer. This program takes the pages sent in by the spider and indexes them, extracting links to add to the queue, analyzing on-page relevance for search terms, checking links to other pages, etc. This is where most of the magic of the search engine happens, as much of the source data is generated at this stage.
  6. When the database, spider, and indexer seem to be ready, I will purchase some hardware to put them on and start slurping up content from around the web. 
  7. Build a throwaway version of the web interface in PHP that I can use to test the search engine while it does its thing. This won’t be styled in any way, and will just be a sanity check to make sure everything is going well.
  8. Build the web framework. This will also be in Java, built around an embedded web server, probably Jetty. Do this while the initial index is being built.
  9. Start building out the web site on the new framework. Include the monetization.
  10. When everything is ready to go, deploy the web framework to an Amazon EC2 instance and open it up to the world.

 That’s it: Conceptually simple but tactically difficult.
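To make steps 3 and 5 a bit more concrete, here is a minimal sketch of how a spider and an indexer might hand work to each other. It is an illustration only, not the actual design: the class name `CrawlSketch`, its method names, the in-memory map standing in for the web, and the naive regex used to find links are all assumptions made for the example. A real version would fetch pages over HTTP, use a proper HTML parser, and write to the database instead of a `HashMap`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a breadth-first spider feeding a tiny inverted index.
final class CrawlSketch {
    // Naive href matcher (an assumption for brevity, not a real HTML parser).
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Extracts link targets from raw HTML, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Strips tags and splits the remaining text into lowercase terms. */
    public static List<String> tokenize(String html) {
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase();
        List<String> terms = new ArrayList<>();
        for (String t : text.split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Crawls the in-memory "web" from the seed and returns term -> URLs containing it. */
    public static Map<String, Set<String>> crawlAndIndex(Map<String, String> web, String seed) {
        Map<String, Set<String>> index = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = web.get(url);      // spider: stand-in for a download
            if (html == null) {
                continue;                    // dead link, nothing to index
            }
            for (String term : tokenize(html)) {   // indexer: record terms
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
            }
            for (String link : extractLinks(html)) { // indexer: queue new links
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> web = new HashMap<>();
        web.put("a", "<p>search engines</p><a href=\"b\">next</a>");
        web.put("b", "<p>custom search database</p><a href=\"a\">back</a><a href=\"c\">gone</a>");
        Map<String, Set<String>> index = crawlAndIndex(web, "a");
        System.out.println(index.get("search"));   // pages containing "search"
        System.out.println(index.get("database")); // pages containing "database"
    }
}
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other, and the queue is the same "extract links to add to the queue" idea from step 5. Relevance scoring, the part where the magic happens, is deliberately left out.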

Documenting the Process

As I build out the search engine, I will write a number of blog entries about it. Some will be insightful, some will be nonsense. Some will be productive, and others false starts. 

I was recently talking to a friend of mine about this, an accomplished computer science type, and it was incredible how liberating it felt having someone to bounce my thoughts off of. Whenever you take on a challenge that pushes you it can be an isolating experience; you don’t know what you don’t know, and your head can go down many rabbit holes while you search for answers. 

My hope is that by making the process public, the holes will be illuminated just a little bit more.

Thoughts on Hackathons

Dave Winer had a very insightful post the other day, Hackathons are nonsense. He started it off with:

Hackathons are how marketing guys wish software were made.

This really resonated with me; as a developer I’ve been to a few hackathons, and they always seemed to have been initiated by non-developers with often unrealistic assumptions about scope and what could be done in a few nights of intense work. All of the (often exaggerated) tales of developers having built (the kernel of) a real, highly popular website/app in a weekend of inspired work have caused certain marketing types to think that all that stands between them and social/mobile/local riches is finding that genius developer. Hackathons are merely the means to an end, a recruiting event masquerading as a social one.

There are also corporate-sponsored hackathons that don’t fit this prototype, but those are mostly meant to drive interest in and adoption of the sponsor’s products at the developer level rather than through the top-down CIO/CTO route. I see those as marketing seminars rather than hackathons.

Hacker News commenters didn’t seem to agree with Dave, though, which surprised me. Perhaps I haven’t been to the right hackathons.

Creating a product is a process with several steps. Amongst these are:

  • Defining a problem
  • Describing a potential solution
  • Scoping the solution to fit it within constraints (time, manpower, money, etc.)
  • UI/UX
  • Visual design
  • Deciding on languages/frameworks or deciding to roll your own
  • Dividing development tasks
  • Environment setup (setting up the servers, etc)
  • Coding
  • Testing

Some of these lend themselves well to group participation with non-technical people, and some of them require intense focus by one person. Most importantly, the two creative items (visual design and coding) can take a long time of iteration by one person to finish. There is a reason why developers work best in offices with doors; during that phase the non-technical people can’t do much but post on Twitter and ask for status updates, neither of which gets the product done faster.

Maybe hackathons do have value in terms of learning new technologies and building camaraderie with other developers. When I’m learning it is good to have someone to bounce things off of, and there is a palpable energy when you get a bunch of creative brains all humming in unison.

Are there any hackers-and-designers-only hackathons?

Why should I care?

I’ve started blogging again and it’s presenting quite a challenge. Every time I sit down to start writing, one question keeps popping into my head. A question I imagine being asked by some unseen reader on some unknown part of the internet, clicking around Tumblr in his boxers, who happened to find my post.

“Why should I care?”

It’s pretty incisive. Bloggers (specifically, tech bloggers) can be broken down into a few categories:

  1. Successful people who you know for some reason. Entrepreneurs who have exited, developers who work at famous companies or who created some technology with many users, VCs who sit on giant funds, Dallas Mavericks owners, etc.
  2. People who network a lot and report breaking news. Or people who review hot new gadgets.
  3. Smart people who think deeply about things which touch us all and can write up witty posts that tie it together in an eloquent and satisfying manner.
  4. People who fall into several of the above categories.
  5. Blabbering idiots with no unique insight, too much time on their hands, and an inflated ego.

Each group has a different response to the question I asked above.

  1. “I’ve been there before, and this is what I’ve learned. Maybe you’ll find it useful too.”
  2. “You know you want these gadgets and/or news items. This is porn for you, and it beats doing actual work.”
  3. “Read this and you will learn more about yourself and the world around you. And posting it to HN will get you 500 karma points!”
  4. “Read this and you will learn more about yourself and the world around you. And posting it to HN will get you 2500 karma points!”
  5. “Derpy derp derp.”

As I’m trying to answer that question, answer number 5 bounces around my head quite a bit. But I know that if I stay at it, discover my voice, and build my audience, I may be able to reach answer 3. 10,000 hours and all that.

So what is your category?