November 12, 2015 Technology
Multi-threaded web crawler in Ruby
Ruby has built-in support for threads, yet they are rarely used, even in situations where they could be very handy, such as crawling the web. While crawling can be a pretty slow process, the majority of the time is spent waiting for IO data from the remote server. This is the perfect case for threads.
Multi-Thread vs Multi-process
For those of you who have never worked with threads it might be confusing: what is the difference between a thread and a process? An application with multiple threads is in fact a single process, even though at first glance it might look like more than one thing is happening at the same time. In truth the application switches between threads continuously, so that each and every one of them gets a slice of the total processor time. This is also what happens when you run multiple applications on a single-core processor – the operating system takes care of switching between dozens of running processes depending on hard drive, network and CPU usage.
Threads within a single process share memory, so they can use the same variables and resources. This is important as it makes creating multi-threaded applications much easier. On the other hand, a multi-process application is capable of using more than one processor core. Yet since the processes do not share memory, it's difficult to exchange data between them and consequently it's much harder to write such code.
More information (as always) can be found on Wikipedia.
Ruby thread support
We will use three major components from Ruby’s thread library:
Thread is the core class that allows us to run multiple parts of code simultaneously.
Queue is an Array-like class that will let us schedule jobs to be run by multiple threads. It supports both blocking & non-blocking modes and works in FIFO (first in, first out) mode. This means that the first job added to queue is also the first one processed.
Mutex is responsible for synchronizing access to shared resources. It ensures that only one thread at a time can execute the code inside a synchronized block, so the block always runs to completion before another thread enters it.
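Before diving into the crawler itself, here is a minimal, self-contained sketch of how these three pieces cooperate. It is not part of the crawler code – the dummy jobs and numbers are purely illustrative:

```ruby
require 'thread' # Queue and Mutex (built into core on modern Rubies)

queue   = Queue.new
mutex   = Mutex.new
results = []

# Schedule five dummy jobs; FIFO means 1 is popped before 2, and so on.
(1..5).each { |n| queue << n }

workers = 2.times.map do
  Thread.new do
    until queue.empty?
      n = queue.pop(true) rescue nil # non-blocking pop; nil once the queue drains
      mutex.synchronize { results << n * 2 } if n # guard the shared array
    end
  end
end

workers.each(&:join) # wait for both workers before reading the results
puts results.sort.inspect # => [2, 4, 6, 8, 10]
```

Both workers pull jobs from the same queue, and the mutex guarantees the shared array is never modified by two threads at once.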
The idea is to write a small application that will crawl external sites and fetch some basic information about a few US TV series. It will utilize Ruby on Rails’ ActiveRecord library to access the database, though other than that, the rest is pure Ruby.
Parts of the code have been skipped in this article as it’s supposed to focus mostly on thread-related topics. The full source code can be found on GitHub @ https://github.com/kdurski/crawler.
The application is divided into three major components:
Modules will be the core components. When running the application we will supply a list of modules it’s supposed to run. This way our crawler can be scheduled to perform different tasks at different times (for example using a crontab). One of the module’s tasks is to create multiple threads and tell the crawlers what to do.
Crawler classes will be used to visit remote websites and fetch data. A single crawler can be used within more than one module.
Models, just like in Ruby on Rails, are responsible for storing and retrieving data to and from the database.
The most important part here is the Crawler module. It's responsible for setting up the environment and connecting to the database.
module Crawler
  autoload :Main,        'crawler/main'
  autoload :Threads,     'crawler/threads'
  autoload :ModuleBase,  'crawler/module_base'
  autoload :CrawlerBase, 'crawler/crawler_base'

  # @return [void]
  def self.setup_env; end

  # @return [Mutex]
  def self.mutex
    @mutex ||= ::Mutex.new
  end
end

Crawler.setup_env
We can see a few autoload calls here for the major components that live inside the lib/ directory. Below that there is the setup_env method that will set up the environment for our application – that is, establish a connection with the database, add the app/ directories to the $LOAD_PATH variable and finally load all of the files under them.
Below that is the mutex method, and this is where it starts to get slightly interesting. We create a single Mutex instance and memoize it in the @mutex variable. Whenever we need a mutex inside our app, we can access it by calling Crawler.mutex.
This is the code supporting the core feature of the application so it’s necessary to discuss it in detail.
require 'thread'

module Crawler
  class Threads
    attr_reader :size, :queue, :threads

    def initialize(size = nil)
      @size = size || ENV['THREADS'].to_i
      @size = 1 if @size <= 0
      @threads = []
      @queue = Queue.new
    end

    def add(*args, &block)
      queue << [block, args]
    end

    def start
      set_up_thread_pool
      threads.each(&:join)
    end

    private

    def set_up_thread_pool
      1.upto(size).each do |i|
        @threads << spawn_thread(i)
      end
    end

    def spawn_thread(i)
      Thread.new do
        until @queue.empty?
          block, args = @queue.pop(true) rescue nil
          block.call(*args) if block
        end
      end
    end
  end
end
As you can see it’s pretty small – that’s because it’s pretty easy to use threads in Ruby as long as you know a few concepts.
Firstly, we initialize a few variables:
• @size will tell us how many threads to spawn – it can be supplied manually and defaults to the environment variable THREADS (or 1);
• @threads is an empty array in which we will keep track of our threads;
• @queue will store the jobs that need to be done.
Each job has to be added to the queue by calling the #add method. It accepts optional arguments and a block. Those blocks are added to the @queue and will be run in the same order (remember the FIFO acronym?). I assume you are familiar with the concept of blocks in Ruby, but if you are not you might want to Google "blocks in Ruby" and read one of the many informative articles.
Once we have added all the necessary jobs we can call the #start method. It will initialize the threads and call #join on each of them. Doing this is pretty important – it tells our application to wait for all threads to finish before continuing. Without it the threads would run in the background, but once the main thread finished its own job it would kill the entire application.
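To make the add/start flow concrete, here is a condensed, self-contained stand-in for the class above. The ThreadPool name and the dummy upcasing jobs are illustrative, not from the actual project:

```ruby
require 'thread'

# Minimal stand-in for Crawler::Threads – just enough to show #add and #start.
class ThreadPool
  def initialize(size)
    @size    = size
    @queue   = Queue.new
    @threads = []
  end

  def add(*args, &block)
    @queue << [block, args]
  end

  def start
    @size.times do
      @threads << Thread.new do
        until @queue.empty?
          block, args = @queue.pop(true) rescue nil
          block.call(*args) if block
        end
      end
    end
    @threads.each(&:join) # block here until every worker is done
  end
end

results = Queue.new # Queue doubles as a thread-safe collector
pool = ThreadPool.new(3)
%w[a b c d].each { |id| pool.add(id) { |x| results << x.upcase } }
pool.start

collected = []
collected << results.pop until results.empty?
puts collected.sort.join # => "ABCD"
```

Note that #start only returns after every joined worker has drained the queue, which is exactly why the collected results are complete at that point.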
The thread's job is pretty simple – as seen inside #spawn_thread, it fetches a block from the queue by calling the #pop method on it and then runs the block with the arguments supplied earlier to the #add method. Note the true argument – it tells #pop to run in non-blocking mode. The difference is pretty significant:
• In non-blocking mode, once the queue is empty, #pop raises a ThreadError exception and the thread continues (hence the rescue call);
• In blocking mode, the thread would wait for a new job to be added to the queue. At the same time our main thread would wait for the child thread to finish its code execution. Such a condition is called deadlock and results in an application error ‘No live threads left. Deadlock?’ after a short time.
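Both modes can be observed directly on a small Queue. This snippet is a standalone illustration (the :job symbol and timings are arbitrary):

```ruby
require 'thread'

q = Queue.new

# Non-blocking: popping an empty queue raises ThreadError immediately.
begin
  q.pop(true)
rescue ThreadError => e
  puts "non-blocking pop raised #{e.class}"
end

# Blocking: the pop simply waits until another thread pushes something,
# so there must be a live producer or Ruby aborts with a deadlock error.
producer = Thread.new do
  sleep 0.1
  q << :job
end
puts "blocking pop got #{q.pop}" # waits ~0.1s for the producer
producer.join
```

If the producer thread were removed, the blocking pop would leave no runnable thread at all, which is the deadlock situation described above.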
We can use our Crawler::Threads class to crawl multiple pages at the same time. A small module doing that can be seen below:
class SerieModule < Crawler::ModuleBase
  def run
    threads = Crawler::Threads.new

    series.each do |serie_id|
      threads.add do
        crawler = SerieCrawler.new(agent.dup)
        crawler.serie_id = serie_id
        crawler.crawl
      end
    end

    threads.start
  end

  protected

  def series
    YAML.load(File.binread(Crawler.root.join('config', 'imdb.yml')))['series']
  end
end
Now we can run some code and see the results. Let’s try it with a single thread:
crawler@master kamil% ruby crawler.rb serie
Spawning 1 threads...
Saved House of Cards
Saved The Flash
Saved Suits
Saved Arrow
Saved Breaking Bad
Saved Game of Thrones
Saved Daredevil
Saved Stargate SG-1
Saved Stargate: Atlantis
Saved Stargate: Universe
----
Time taken total: 10.84s
It took 10 seconds to visit 10 pages and fetch some basic information from them. Sounds reasonable. What about 10 threads?
crawler@master kamil% THREADS=10 ruby crawler.rb serie
Spawning 10 threads...
Saved Daredevil
Saved Stargate SG-1
Saved The Flash
Saved Stargate: Atlantis
Saved Breaking Bad
Saved Stargate: Universe
Saved Arrow
Saved House of Cards
Saved Suits
Saved Game of Thrones
----
Time taken total: 1.51s
This time it took less than 2 seconds! Clearly there is a significant gain in using threads. The gain comes from the fact that in the single-threaded run most of those 10 seconds are wasted waiting for data from the remote server; during that time the application does virtually nothing. Additionally, note the difference in the order of output between the two runs. When using a single thread it will always match the order in the config file. With more than one thread, however, the order is unpredictable – some threads finish before others that started earlier.
Since we are on the topic of outputting data to the console, let's go back a few steps and discuss Mutex. Our code outputs some information to the console using puts. You might not know this, but this very common method is not thread-safe, as it does two things:
• it outputs a given string;
• then it outputs the new line (NL) character.
What may randomly occur when using puts inside threads is that NL characters appear out of place. For example:
Saved House of CardsSaved Daredevil
Saved Arrow
This is because the thread was switched in the middle and another one gained control before the NL character was printed. In other words, we can say that the puts method is not atomic or thread-safe. This is where the mutex can help us – it allows us to create a custom #log method that will output our information to the console but wrap it in the mutex:
def log(msg)
  Crawler.mutex.synchronize do
    puts msg
  end
end
This will ensure that the output in the console always appears in order, as no other thread can enter the synchronized block until the current one has printed both the string and the newline character.
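The same idea works outside the crawler as well. In this standalone sketch the module-level MUTEX constant merely stands in for Crawler.mutex, and the "Saved show" messages are made up:

```ruby
require 'thread'

MUTEX = Mutex.new # a single shared mutex, standing in for Crawler.mutex

# Synchronized logger: the string and its trailing newline are written
# as one unit, so lines from different threads can no longer interleave.
def log(msg)
  MUTEX.synchronize { puts msg }
end

threads = 5.times.map do |i|
  Thread.new { log("Saved show ##{i}") }
end
threads.each(&:join)
```

The order of the five lines is still unpredictable, but every line arrives intact with its own newline.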
I hope this will help some of you understand what threads are and how to use them. This was written as a side project since web crawling is a topic that often shows up at my work. The initial code had even more features that I removed, though these can be added with ease when necessary, such as the ability to use (and switch on the fly) proxies and even support for the TOR network. The latter is great for anonymity but unfortunately makes the code terribly slow (this is due to the entire TOR network being quite slow). Feel free to view and poke around the entire source code at https://github.com/kdurski/crawler.
Kamil Durski, Senior Ruby Developer