Multi-threaded web crawler in Ruby


Ruby has built-in support for threads, yet it's rarely used, even in situations where it could be very handy, such as crawling the web. While crawling can be a pretty slow process, the majority of that time is spent waiting for IO data from the remote server. This is the perfect case for threads.

Theory

Multi-Thread vs Multi-process

For those of you who have never worked with threads it might be confusing: what is the difference between a thread and a process? An application with multiple threads is in fact a single process, even though at first glance it might look like more than one thing is happening at the same time. In truth the application switches between threads continuously so that each and every one of them gets a piece of the total processor time for itself. This is also what happens when you run multiple applications on a single-core processor – the operating system takes care of switching between dozens of running processes depending on hard drive, network and CPU usage.

Threads within a single process share memory, so they can use the same variables and resources. This is important, as it makes creating multi-threaded applications much easier. On the other hand, a multi-process application is capable of using more than one processor core. Yet since the processes do not share memory, it's difficult to pass data between them and consequently it's much more difficult to write such code.
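A tiny standalone snippet (illustrative only, not part of the crawler) shows the practical consequence of that shared memory: two threads can write to the same Array, something two separate processes could not do without extra plumbing such as pipes, sockets or a database.

# Illustrative only: both threads push to the same Array because they
# share the process's memory.
results = []

t1 = Thread.new { results << 'from thread 1' }
t2 = Thread.new { results << 'from thread 2' }

[t1, t2].each(&:join)    # wait for both threads to finish
puts results.inspect     # => ["from thread 1", "from thread 2"] (order may vary)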

More information (as always) can be found on Wikipedia:

Thread (computing)

Ruby thread support

We will use three major components from Ruby’s thread library:

• Thread

This is the core class that allows us to run multiple parts of code simultaneously.

• Queue

Queue is an Array-like class that will let us schedule jobs to be run by multiple threads. It supports both blocking and non-blocking modes and works in FIFO (first in, first out) order. This means that the first job added to the queue is also the first one processed.

• Mutex

Mutex is responsible for synchronizing access to shared resources. It ensures that the code wrapped in its block is executed by only one thread at a time, so two threads can never run the same critical section simultaneously. A short standalone sketch of all three pieces working together follows this list.
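Here is a minimal, self-contained sketch (illustrative only, not taken from the crawler's code) of how the three pieces cooperate: a Queue holds the jobs, a small pool of Threads drains it, and a Mutex keeps the console output tidy.

require 'thread'

# Illustrative sketch: a queue of jobs, a small pool of threads and a mutex.
queue = Queue.new
mutex = Mutex.new

10.times { |i| queue << i }

workers = 3.times.map do
  Thread.new do
    until queue.empty?
      job = queue.pop(true) rescue nil   # non-blocking pop; nil if another thread emptied the queue first
      next unless job
      mutex.synchronize { puts "Processed job #{job}" }
    end
  end
end

workers.each(&:join)   # wait for every worker before the script exits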

The Code

The idea is to write a small application that will crawl external sites and fetch some basic information about a few US TV series. It will utilize Ruby on Rails’ ActiveRecord library to access the database, though other than that, the rest is pure Ruby.

Parts of the code have been omitted from this article, as it focuses mostly on thread-related topics. The full source code can be found on GitHub @ https://github.com/kdurski/crawler.

The application is divided into three major components (a rough sketch of the base classes they build on follows this list):

• Module

Modules will be the core components. When running the application we will supply a list of modules it’s supposed to run. This way our crawler can be scheduled to perform different tasks at different times (for example using a crontab). One of the module’s tasks is to create multiple threads and tell the crawlers what to do.

• Crawler

Crawler classes will be used to visit remote websites and fetch data. A single crawler can be used within more than one module.

• Model

Models, just like in Ruby on Rails, are responsible for storing and retrieving data to and from the database.
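The base classes themselves are not listed in this article; below is a rough, hypothetical sketch of what they might look like. The class names match the autoload calls in the next section, but everything else – including the Mechanize agent – is an assumption made purely for illustration; the real implementation is in the repository.

require 'mechanize'   # assumed HTTP client, used only for this sketch

module Crawler
  # Hypothetical sketch only – the real base classes live in the repository.
  class ModuleBase
    # Concrete modules override #run: create a thread pool and schedule jobs.
    def run
      raise NotImplementedError
    end

    protected

    # The usage example further down calls agent.dup; a Mechanize agent is
    # assumed here purely for illustration.
    def agent
      @agent ||= Mechanize.new
    end
  end

  class CrawlerBase
    def initialize(agent)
      @agent = agent
    end

    # Concrete crawlers override #crawl: fetch a page and persist a model.
    def crawl
      raise NotImplementedError
    end
  end
end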

Crawler module

The most important part here is the Crawler module. It’s responsible for setting up the environment and connecting to the database.

Pseudo code:

module Crawler
  autoload :Main,        'crawler/main'
  autoload :Threads,     'crawler/threads'
  autoload :ModuleBase,  'crawler/module_base'
  autoload :CrawlerBase, 'crawler/crawler_base'

  # @return [void]
  def self.setup_env; end

  # @return [Mutex]
  def self.mutex
    @mutex ||= ::Mutex.new
  end
end

Crawler.setup_env

We can see a few autoload calls here for major components that live inside the lib/ directory. Below that there is the setup_env method that will set up the environment for our application – that is, establish a connection with the database, add the app/ directories to the $LOAD_PATH variable and finally require all of the files under the app/ directory.
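The method body is left empty in the pseudo code above; here is a hedged sketch of what it might contain. The directory layout and the config/database.yml file name are assumptions for illustration – the real implementation is in the repository.

require 'yaml'
require 'active_record'

module Crawler
  # Hypothetical implementation of the environment setup.
  def self.setup_env
    root = File.expand_path('..', __dir__)

    # Add the app/ subdirectories to the load path and require every file in them.
    Dir[File.join(root, 'app', '*')].each { |dir| $LOAD_PATH.unshift(dir) }
    Dir[File.join(root, 'app', '**', '*.rb')].sort.each { |file| require file }

    # Establish the ActiveRecord connection from a config file (name assumed).
    db_config = YAML.load_file(File.join(root, 'config', 'database.yml'))
    ActiveRecord::Base.establish_connection(db_config)
  end
end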

Below that is the mutex method, and this is where it starts to get slightly interesting. We create a single Mutex instance and memoize it in the @mutex variable, so whenever we need a mutex inside our app we can access it via Crawler.mutex.

Crawler::Threads class

This is the code supporting the core feature of the application so it’s necessary to discuss it in detail.

require 'thread'

module Crawler
  class Threads
    attr_reader :size, :queue, :threads

    def initialize(size = nil)
      @size = size || ENV['THREADS'].to_i
      @size = 1 if @size <= 0

      @threads = []
      @queue = Queue.new
    end

    def add(*args, &block)
      queue << [block, args]
    end

    def start
      set_up_thread_pool
      threads.each(&:join)
    end

    private

    def set_up_thread_pool
      1.upto(size).each do |i|
        @threads << spawn_thread(i)
      end
    end

    def spawn_thread(i)
      Thread.new do
        until @queue.empty?
          block, args = @queue.pop(true) rescue nil
          block.call(*args) if block
        end
      end
    end
  end
end

As you can see it’s pretty small – that’s because it’s pretty easy to use threads in Ruby as long as you know a few concepts.

Firstly, we initialize a few variables:

@size will tell us how many threads to spawn – it can be supplied manually and defaults to the environment variable THREADS (or 1);

@threads is an empty array in which we will keep track of our threads;

@queue will store our jobs that need to be done.

Each job has to be added to the queue by calling the #add method. It accepts optional arguments and a block. Those blocks are added to the @queue and will be run in the same order (remember the FIFO acronym?). I assume you are familiar with the concept of blocks in Ruby, but if you are not you might want to Google "blocks in Ruby" and read one of the many informative articles.
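As a quick illustration (not taken from the crawler's modules), any positional arguments passed to #add are handed straight to the block once a worker thread pops the job off the queue:

threads = Crawler::Threads.new(2)

# The extra argument ends up as the block parameter when a worker runs the job.
threads.add('https://example.com') do |url|
  puts "Crawling #{url}"
end

threads.start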

Once we have added all the necessary jobs we can call the #start method. It will initialize the threads and call #join on each of them. Doing this is pretty important – it tells our application to wait for all the threads before continuing. Without it the threads would run in the background, but once the main thread finished its own work it would exit and kill the entire application.

The thread's job is pretty simple – as seen inside #spawn_thread, it fetches a block from the queue by calling the #pop method on it and then runs the block with the arguments supplied earlier to the #add method. Note the true argument passed to #pop – it tells the queue to operate in non-blocking mode. The difference is pretty significant (a tiny standalone example follows this list):

• In non-blocking mode, once the queue is empty, #pop raises a ThreadError exception and the thread moves on (hence the rescue call);

• In blocking mode, the thread would wait for a new job to be added to the queue. At the same time our main thread would be waiting for the child thread to finish its execution. Such a condition is called a deadlock and, after a short time, results in the application error ‘No live threads left. Deadlock?’.
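A tiny standalone illustration of the difference (assumed, not from the crawler's code):

queue = Queue.new   # deliberately left empty

# Non-blocking: pop(true) raises ThreadError straight away when the queue is empty.
begin
  queue.pop(true)
rescue ThreadError
  puts 'queue empty, moving on'
end

# Blocking: a plain pop would sleep forever waiting for a job; with the main
# thread join-ing on it, Ruby aborts with "No live threads left. Deadlock?".
# worker = Thread.new { queue.pop }
# worker.join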

Usage

We can use our Crawler::Threads class to crawl multiple pages at the same time. A small module doing that can be seen below:

class SerieModule < Crawler::ModuleBase
  def run
    threads = Crawler::Threads.new

    series.each do |serie_id|
      threads.add do
        crawler = SerieCrawler.new(agent.dup)
        crawler.serie_id = serie_id
        crawler.crawl
      end
    end

    threads.start
  end

  protected

  def series
    YAML.load(File.binread(Crawler.root.join('config', 'imdb.yml')))['series']
  end
end

Now we can run some code and see the results. Let’s try it with a single thread:

crawler@master kamil% ruby crawler.rb serie
Spawning 1 threads...
Saved House of Cards
Saved The Flash
Saved Suits
Saved Arrow
Saved Breaking Bad
Saved Game of Thrones
Saved Daredevil
Saved Stargate SG-1
Saved Stargate: Atlantis
Saved Stargate: Universe
---- Time taken total: 10.84s

It took 10 seconds to visit 10 pages and fetch some basic information from them. Sounds reasonable. What about 10 threads?

crawler@master kamil% THREADS=10 ruby crawler.rb serie
Spawning 10 threads...
Saved Daredevil
Saved Stargate SG-1
Saved The Flash
Saved Stargate: Atlantis
Saved Breaking Bad
Saved Stargate: Universe
Saved Arrow
Saved House of Cards
Saved Suits
Saved Game of Thrones
---- Time taken total: 1.51s

This time it took less than 2 seconds! Clearly there is a significant gain in using threads. The difference comes from the fact that in the single-threaded run most of those 10 seconds are wasted while the application waits for data from the remote server; during that time it does virtually nothing. Additionally, note the difference in the order of output between the two runs. When using a single thread the order is always the same as in the config file. With more than one thread the order becomes unpredictable – some requests simply finish faster than others, regardless of which thread started first.

Thread safety

Since we are on the topic of outputting data to the console, let’s go back a few steps and discuss Mutex. Our code outputs some information to the console using puts. You might not know this, but this very common method is not thread-safe as it does two things:

• it outputs a given string;

• then it outputs the new line (NL) character.

When using puts inside threads, you may randomly see the NL characters appear out of place. For example:

Saved House of CardsSaved Daredevil
Saved Arrow

This happens because the thread was switched in the middle and another one gained control before the NL character was printed. In other words, the puts method is neither atomic nor thread-safe. This is where a mutex can help us – it allows us to create a custom #log method that outputs our information to the console but wraps the call in the mutex:

def log(msg)
  Crawler.mutex.synchronize do
    puts msg
  end
end

This ensures that the console output always appears in order – no other thread can enter the synchronized block, and thus call puts, until the current one has finished.
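A minimal demonstration of the same idea outside the crawler – without the synchronize block the two writes that make up puts can interleave between threads, with it they cannot:

mutex = Mutex.new

threads = 5.times.map do |i|
  Thread.new do
    # Unsafe version (commented out): print "Saved #{i}"; print "\n"
    mutex.synchronize { puts "Saved #{i}" }
  end
end

threads.each(&:join)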

Conclusion

I hope this will help some of you understand what threads are and how to use them. This was written as a side project, since web crawling is a topic that often shows up at my work. The initial code had even more features that I removed, though they can be added back with ease when necessary – for example the ability to use (and switch on the fly) proxies, and even support for the Tor network. The latter is great for anonymity but unfortunately makes the code terribly slow (the entire Tor network is quite slow). Feel free to view and poke around the entire source code at https://github.com/kdurski/crawler.
