Multi-threaded web crawler in Ruby
Ruby has built-in support for threads yet it’s barely used, even in situations where it could be very handy, such as crawling the web. While it can be a pretty slow process, the majority of the time is spent on waiting for IO data from the remote server. This is the perfect case to use threads.
Theory
Multi-thread vs multi-process
For those of you who have never worked with threads, it might be confusing: what is the difference between a thread and a process? An application with multiple threads is in fact a single process, even though at first glance it might look like several things are happening at the same time. In truth, the application switches between threads continuously so that each and every one of them gets a slice of the total processor time. This is also what happens when you run multiple applications on a single-core processor – the operating system takes care of switching between dozens of running processes depending on hard drive, network and CPU usage.
Threads within a single process share memory, so they can use the same variables and resources. This is important, as it makes creating multi-threaded applications much easier. On the other hand, a multi-process application is capable of using more than one processor core. Yet since processes do not share memory, it's difficult to exchange data between them and consequently it's much harder to write such code.
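To see that shared memory in action, here is a minimal sketch (not part of the crawler's code) in which a spawned thread mutates an object created by the main thread:

shared = []

t = Thread.new { shared << :from_thread }  # the thread writes to the same array
t.join

p shared  # => [:from_thread]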
More information (as always) can be found on Wikipedia.
Ruby thread support
We will use three major components from Ruby’s thread library:
• Thread
This is the core class that allows us to run multiple parts of code simultaneously.
• Queue
Queue is an Array-like class that will let us schedule jobs to be run by multiple threads. It supports both blocking and non-blocking modes and works in FIFO (first in, first out) order. This means that the first job added to the queue is also the first one processed.
• Mutex
Mutex is responsible for synchronizing access to shared resources. It ensures that the code run within its block is executed by only one thread at a time, so a critical section always completes before another thread can enter it.
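To make these three components concrete before diving into the crawler, here is a minimal sketch (not taken from the project) that uses all of them together:

require 'thread'  # Queue and Mutex live here in older Ruby versions

queue = Queue.new
mutex = Mutex.new
5.times { |i| queue << i }  # schedule five jobs; they are processed FIFO

workers = 2.times.map do
  Thread.new do
    until queue.empty?
      job = queue.pop(true) rescue nil  # non-blocking pop, explained later
      next unless job
      mutex.synchronize { puts "processed job #{job}" }  # serialized output
    end
  end
end
workers.each(&:join)  # wait for both workers to finish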
The Code
The idea is to write a small application that will crawl external sites and fetch some basic information about a few US TV series. It will utilize Ruby on Rails’ ActiveRecord library to access the database, though other than that, the rest is pure Ruby.
Parts of the code have been skipped in this article as it’s supposed to focus mostly on thread-related topics. The full source code can be found on GitHub @ https://github.com/kdurski/crawler.
The application is divided into three major components:
• Module
Modules will be the core components. When running the application we will supply a list of modules it’s supposed to run. This way our crawler can be scheduled to perform different tasks at different times (for example using a crontab). One of the module’s tasks is to create multiple threads and tell the crawlers what to do.
• Crawler
Crawler classes will be used to visit remote websites and fetch data. A single crawler can be used within more than one module.
• Model
Models, just like in Ruby on Rails, are responsible for storing and retrieving data to and from the database.
The Crawler module
The most important part here is the Crawler module. It's responsible for setting up the environment and connecting to the database.
Pseudo code:
module Crawler
  autoload :Main, 'crawler/main'
  autoload :Threads, 'crawler/threads'
  autoload :ModuleBase, 'crawler/module_base'
  autoload :CrawlerBase, 'crawler/crawler_base'

  # @return [void]
  def self.setup_env; end

  # @return [Mutex]
  def self.mutex
    @mutex ||= ::Mutex.new
  end
end

Crawler.setup_env
We can see a few autoload calls here for the major components that live inside the lib/ directory. Below that is the setup_env method that will set up the environment for our application – that is, establish a connection with the database, add the app/ subdirectories to the load path and finally load all of the files under the app/ directory.
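The body of setup_env is stubbed out in the pseudo code; a hypothetical implementation matching that description could look like the sketch below (the adapter and database name are assumptions for illustration – the real settings live in the repository):

def self.setup_env
  require 'active_record'

  # Establish a connection with the database (illustrative settings)
  ActiveRecord::Base.establish_connection(
    adapter:  'sqlite3',
    database: 'db/crawler.sqlite3'
  )

  # Add the app/ subdirectories to the load path...
  Dir[File.join(__dir__, 'app', '*')].each { |dir| $LOAD_PATH.unshift(dir) }

  # ...and load every Ruby file under app/
  Dir[File.join(__dir__, 'app', '**', '*.rb')].sort.each { |file| require file }
end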
Below that is the mutex method, and this is where it starts to get slightly interesting. We create a single Mutex instance and store it inside the @mutex variable. Whenever we need a mutex inside our app, we can access it via Crawler.mutex.
The Crawler::Threads class
This is the code supporting the core feature of the application so it’s necessary to discuss it in detail.
require 'thread'

module Crawler
  class Threads
    attr_reader :size, :queue, :threads

    def initialize(size = nil)
      @size = size || ENV['THREADS'].to_i
      @size = 1 if @size <= 0
      @threads = []
      @queue = Queue.new
    end

    def add(*args, &block)
      queue << [block, args]
    end

    def start
      set_up_thread_pool
      threads.each(&:join)
    end

    private

    def set_up_thread_pool
      1.upto(size).each do |i|
        @threads << spawn_thread(i)
      end
    end

    def spawn_thread(i)
      Thread.new do
        until @queue.empty?
          block, args = @queue.pop(true) rescue nil
          block.call(*args) if block
        end
      end
    end
  end
end
As you can see it’s pretty small – that’s because it’s pretty easy to use threads in Ruby as long as you know a few concepts.
Firstly, we initialize a few variables:
• @size will tell us how many threads to spawn – it can be supplied manually and defaults to the THREADS environment variable (or 1);
• @threads is an empty array in which we will keep track of our threads;
• @queue will store our jobs that need to be done.
Each job has to be added to the queue by calling the add method. It accepts optional arguments and a block. Those blocks are added to the @queue and will be run in the same order they were added (remember the FIFO acronym?). I assume you are familiar with the concept of blocks in Ruby, but if you are not, you might want to Google “blocks in Ruby” and read one of the many informative articles.
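For example, a job with an argument could be scheduled like this (the URL and the worker body are purely illustrative):

threads = Crawler::Threads.new(4)

threads.add('https://example.com') do |url|
  puts "fetching #{url}"  # the block receives the arguments passed to add
end

threads.start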
Once we have added all the necessary jobs, we can call the start method. It will spawn the threads and call join on each of them. Doing this is pretty important – it tells our application to wait for all the threads to finish before continuing. Without it the threads would run in the background, but once the main thread was done with its own job it would kill the entire application.
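A tiny sketch demonstrates the effect – without the join below, the program will usually exit before the thread gets a chance to print:

t = Thread.new do
  sleep 0.1        # simulate some work
  puts 'finished'  # may never be printed if the main thread exits first
end
t.join             # remove this line and 'finished' will usually not appear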
The thread's job is pretty simple – as seen inside spawn_thread, it will fetch a block from the queue by calling the pop method on it and then run the block with the arguments supplied earlier to the add method. Note the true argument – it tells pop to run in non-blocking mode. The difference is pretty significant:
• In non-blocking mode, once the queue is empty, pop will raise a ThreadError exception and continue (hence the rescue call);
• In blocking mode, the thread would wait for a new job to be added to the queue. At the same time our main thread would wait for the child thread to finish its code execution. Such a condition is called a deadlock and, after a short while, results in the fatal error ‘No live threads left. Deadlock?’.
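A minimal reproduction of that deadlock (again, just a sketch) looks like this:

queue = Queue.new             # an empty queue
t = Thread.new { queue.pop }  # blocking pop: waits forever for a job
t.join                        # main thread waits for t – Ruby soon aborts
                              # with 'No live threads left. Deadlock?'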
Usage
We can use our Crawler::Threads class to crawl multiple pages at the same time. A small module doing just that can be seen below:
class SerieModule < Crawler::ModuleBase
  def run
    threads = Crawler::Threads.new

    series.each do |serie_id|
      threads.add do
        crawler = SerieCrawler.new(agent.dup)
        crawler.serie_id = serie_id
        crawler.crawl
      end
    end

    threads.start
  end

  protected

  def series
    YAML.load(File.binread(Crawler.root.join('config', 'imdb.yml')))['series']
  end
end
Now we can run some code and see the results. Let’s try it with a single thread:
crawler@master kamil% ruby crawler.rb serie
Spawning 1 threads...
Saved House of Cards
Saved The Flash
Saved Suits
Saved Arrow
Saved Breaking Bad
Saved Game of Thrones
Saved Daredevil
Saved Stargate SG-1
Saved Stargate: Atlantis
Saved Stargate: Universe
----
Time taken total: 10.84s
It took 10 seconds to visit 10 pages and fetch some basic information from them. Sounds reasonable. What about 10 threads?
crawler@master kamil% THREADS=10 ruby crawler.rb serie
Spawning 10 threads...
Saved Daredevil
Saved Stargate SG-1
Saved The Flash
Saved Stargate: Atlantis
Saved Breaking Bad
Saved Stargate: Universe
Saved Arrow
Saved House of Cards
Saved Suits
Saved Game of Thrones
----
Time taken total: 1.51s
This time it took less than 2 seconds! Clearly there is a significant gain in using threads. The difference comes from the fact that most of those 10 seconds were wasted waiting for data from the remote server – time during which the application does virtually nothing. Additionally, note the difference in the order of the output between the two runs. When using a single thread it will always be the same as in the config file. With more than one thread, however, the order appears random – some threads simply finish faster than others, regardless of which started first.
Thread safety
Since we are on the topic of outputting data to the console, let's go back a few steps and discuss Mutex. Our code outputs some information to the console using puts. You might not know this, but this very common method is not thread-safe, as it does two things:
• it outputs a given string;
• then it outputs the new line (NL) character.
What may randomly occur when using puts inside threads is that you will see NL characters appear out of place. For example:
Saved House of CardsSaved Daredevil
Saved Arrow
This is because the thread was switched in the middle and another one gained control before the NL character was printed. In other words, we can say that the puts method is not atomic, and therefore not thread-safe. This is where a mutex can help us – it allows us to create a custom log method that outputs our information to the console but wraps the call in a mutex:
def log(msg)
  Crawler.mutex.synchronize do
    puts msg
  end
end
This will ensure that the output in the console always appears in order, as no other thread can enter the synchronized section until puts finishes.
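A hypothetical call site inside one of the thread blocks (assuming log is defined on, or mixed into, the module class) would simply swap puts for log:

series.each do |serie_id|
  threads.add do
    # ... crawl as before ...
    log "Saved #{serie_id}"  # illustrative message; each line now prints intact
  end
end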
Conclusion
I hope this will help some of you understand what threads are and how to use them. This was written as a side project, since web crawling is a topic that often shows up at my work. The initial code had even more features, which I removed, though they can easily be added back when necessary: the ability to use (and switch on the fly) proxies, and even support for the TOR network. The latter is great for anonymity but unfortunately makes the code terribly slow (the entire TOR network is quite slow). Feel free to view and poke around the entire source code at https://github.com/kdurski/crawler.