Web Crawler

One of my most exciting projects of last year involved a makeshift web crawler built in Python. It was a great experience, but as I added more and more features that the original script wasn’t designed for, it became apparent that it was due for a rebuild from the ground up.

Goals

  • A robust web crawler
  • Easy to integrate with other existing scripts
  • Easy to reuse for specific projects

Tools Used

  • Node.js
  • NPM
  • ESLint
  • TypeScript

Choosing the Tools

This was admittedly a fairly easy decision. A lot of my existing web tools are built in Node, so ease of integration was a huge factor. I also find it a lot easier to keep code clean and organized in JavaScript than in Python. I always try to learn as many new technologies as I can when building a new project, and I decided that TypeScript would help keep the code clean and easy to follow. I also wanted to get some experience making stand-alone modules, and the ease of installing a package from NPM made it an appealing way to drop the crawler into more specialized projects.

The Process

My first step was figuring out how to write and compile TypeScript. That part was quick and straightforward, so I’m not going to go into detail on it.

From my experience the first time around, I knew that the most difficult part was handling URLs. I didn’t want to rely on too many third-party dependencies, so I set out to make my own URL parser/identifier.
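To give a feel for what this kind of URL handling involves, here is a minimal sketch. It is not the parser from the package: it leans on the URL class that is built into Node rather than a hand-written parser, and the names normalizeUrl and isSameSite are hypothetical. It also assumes that a trailing slash and a fragment do not change which page a link points to, which is not true of every site.

```typescript
// Hypothetical helper (not from the package): resolve a raw href against the
// page it was found on and return a canonical absolute URL, or null if the
// link is not something a crawler should follow.
function normalizeUrl(href: string, base: string): string | null {
  let url: URL;
  try {
    // The built-in URL class resolves relative paths like "../about" for us.
    url = new URL(href, base);
  } catch {
    return null; // href could not be parsed at all
  }

  // Skip mailto:, tel:, javascript: and other non-web schemes.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null;
  }

  // Assumption: "#section" links point at the same page, so drop the fragment.
  url.hash = "";

  // Assumption: "/about/" and "/about" are the same page.
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }

  return url.href;
}

// Hypothetical helper: true when the URL lives on the same host as the crawl root.
function isSameSite(url: string, root: string): boolean {
  return new URL(url).hostname === new URL(root).hostname;
}
```

With these assumptions, normalizeUrl("../about/#team", "https://example.com/blog/post") returns "https://example.com/about", and a mailto: link returns null. Having one canonical form per page is what makes it possible to detect duplicates later on.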

Once I had my parsing working and had decided which URLs I did and did not want to follow, I set out to write tests with all of the edge cases that I could think of. Something that I learned from my original crawler is that it can be very time-consuming and frustrating to test the basic functionality during an actual crawl.
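Edge-case tests like these fit nicely into a table of inputs and expected outputs that runs without touching the network. The sketch below is only an illustration: resolveLink is a hypothetical stand-in for the function under test, and the cases are examples of the kind of links a crawler runs into, not the actual test list.

```typescript
// Hypothetical function under test: resolve an href against its page,
// drop the fragment, and reject anything that is not http(s).
function resolveLink(href: string, base: string): string | null {
  try {
    const url = new URL(href, base);
    if (url.protocol !== "http:" && url.protocol !== "https:") {
      return null;
    }
    url.hash = "";
    return url.href;
  } catch {
    return null;
  }
}

interface LinkCase {
  href: string;
  base: string;
  expected: string | null;
}

// Each row is one edge case; adding a new one is a one-line change.
const cases: LinkCase[] = [
  { href: "/about", base: "https://example.com/blog/", expected: "https://example.com/about" },
  { href: "post-2", base: "https://example.com/blog/", expected: "https://example.com/blog/post-2" },
  { href: "../contact", base: "https://example.com/blog/post", expected: "https://example.com/contact" },
  { href: "#top", base: "https://example.com/blog/", expected: "https://example.com/blog/" },
  { href: "//example.com/x", base: "https://example.com/", expected: "https://example.com/x" },
  { href: "mailto:me@example.com", base: "https://example.com/", expected: null },
  { href: "javascript:void(0)", base: "https://example.com/", expected: null },
];

// Run every case and collect a readable message for each mismatch.
function runCases(): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const actual = resolveLink(c.href, c.base);
    if (actual !== c.expected) {
      failures.push(`"${c.href}" on ${c.base}: expected ${c.expected}, got ${actual}`);
    }
  }
  return failures;
}

console.log(runCases().length === 0 ? "all cases passed" : runCases().join("\n"));
```

Because nothing here makes a request, the whole table runs in milliseconds, which is the point: the slow, flaky part (a real crawl) is kept out of the loop while the URL logic is being worked out.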

The next part was simply a matter of defining the entry point and figuring out how to maintain the page list. The first crawler I made worked off of a list, mainly because my URL processing wasn’t robust enough to avoid adding duplicate pages without it. This time I came equipped with a lot more knowledge and a sturdier foundation, so my initial approach was to find pages recursively. I’m not certain how well this will work on larger sites, so I might switch to a queue-based approach.
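For comparison, a queue-based crawl could look roughly like the sketch below. This is not the package's code: crawl, LinkFetcher and maxPages are hypothetical names, and the function that downloads a page and extracts its (already normalized) links is passed in as a parameter so the traversal can be tried out without any network access.

```typescript
// Hypothetical type: given a page URL, resolve to the normalized links found on it.
type LinkFetcher = (url: string) => Promise<string[]>;

// Breadth-first crawl using an explicit queue instead of recursion.
// Returns the pages in the order they were visited.
async function crawl(
  start: string,
  fetchLinks: LinkFetcher,
  maxPages: number = 1000 // assumed safety limit, not from the original post
): Promise<string[]> {
  const seen = new Set<string>([start]); // every URL ever queued, to avoid duplicates
  const queue: string[] = [start];       // pages waiting to be fetched
  const visited: string[] = [];

  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift() as string;
    visited.push(url);

    for (const link of await fetchLinks(url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}

// Usage with a fake in-memory "site" standing in for real HTTP requests.
const fakeSite: Record<string, string[]> = {
  "/": ["/a", "/b"],
  "/a": ["/", "/b"],
  "/b": ["/c"],
  "/c": [],
};

crawl("/", async (url) => fakeSite[url] ?? []).then((pages) => {
  console.log(pages); // [ '/', '/a', '/b', '/c' ]
});
```

The trade-off is mostly about depth: a recursive crawl grows a chain of pending calls as it follows links deeper into a site, while the queue keeps all pending work in one array and makes it easy to add a page limit or pause and resume.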

I now had a working crawler, so it was just a matter of packaging it up for NPM. After some confusion stemming from the different types of JavaScript packages, I got everything figured out. Shortly after, I realized that I wanted to make it easy to crawl multiple sites at the same time, so I tweaked the package a bit from there.
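Crawling several sites at once mostly comes down to starting the crawls together and waiting for all of them. The sketch below shows one way to do that with Promise.all. It does not use the package's real API: crawlSites and SiteCrawler are hypothetical names, and the per-site crawl function is passed in as a parameter.

```typescript
// Hypothetical type: crawl one site and resolve to the list of pages found.
type SiteCrawler = (rootUrl: string) => Promise<string[]>;

// Start every crawl at once and collect the results per root URL.
async function crawlSites(
  roots: string[],
  crawlSite: SiteCrawler
): Promise<Map<string, string[]>> {
  const results = new Map<string, string[]>();

  await Promise.all(
    roots.map(async (root) => {
      try {
        results.set(root, await crawlSite(root));
      } catch {
        // Assumption: one failing site should not abort the others.
        // A real tool would probably want to record the error as well.
        results.set(root, []);
      }
    })
  );

  return results;
}
```

Catching errors per site matters here because Promise.all rejects as soon as any one promise rejects, which would otherwise throw away the results from every site that crawled successfully.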

Once I had launched on NPM with minimal documentation and a not-overly-fleshed-out web crawler package, I was absolutely amazed by the initial response. In the first week there were 679 downloads. I can only claim one of those as my own, so it is kind of crazy to me.

Next Steps

There are a few small issues regarding status codes that I want to get sorted out. I also want to write a proper test suite and some better documentation.

https://www.npmjs.com/package/simple-node-site-crawler