Scraping the Web with Ruby

Marcin Ruszkiewicz - Senior Ruby on Rails Developer

When dealing with data from the web, nowadays we usually think about accessing APIs and consuming JSON. Sometimes that's simply not possible, though: perhaps the API doesn't exist, or we're after a specific data set that has to be assembled from multiple separate pages of content.


In this article we'll be looking at exactly this case. The MMO game Eve Online holds an annual PvP competition, the Alliance Tournament, in which player groups can participate. We're interested in obtaining historical match data so that we can rank alliances by their tournament performance. This data isn't available from any API or single page, so we have to go through all the tournament results and determine how many matches each team won out of the total it played. To do this, we'll be scraping multiple web pages, both static and JavaScript-generated.

Let’s start with something really simple for now. If we go to the Eve University wiki page on the Alliance Tournament, we can see a table of previous tournament winners. What’s more interesting for us is that this table also has links to detailed results that we can use to gather more information on the matches.

If we look closer, we can see two patterns in this table: the more recent tournaments link to Challonge pages that show a whole tournament bracket, while most older results are located on Eve Online’s community site. Based on this we will need to build at least two different scrapers to collect the information from all different sites, and this is exactly what we will do.

You can find all the code examples used in this article on GitHub.

Grabbing website contents in pure Ruby

Before we get to obtaining the results themselves, let’s just grab the winners table from the Eve University page so that we have links for the results locally and don’t have to copy them by hand every time we need to use them.

First, let’s make a winners.rb file and grab the contents of the page. We can just use a few lines of Ruby code to do this, and we don’t need any extra libraries apart from those already included in every Ruby installation:

# frozen_string_literal: true

require "open-uri"

url = "https://wiki.eveuniversity.org/Alliance_Tournament"
html = URI.parse(url).open

puts html.read

When run in the console with Ruby, winners.rb will output the HTML code of the requested page.
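
Note that a network fetch like this can fail. If we want the script to degrade gracefully, we can wrap the call in a small guard (a minimal sketch; open-uri raises OpenURI::HTTPError on non-2xx responses):

# frozen_string_literal: true

require "open-uri"

url = "https://wiki.eveuniversity.org/Alliance_Tournament"

begin
  html = URI.parse(url).open
rescue OpenURI::HTTPError => e
  # e.message contains the status line, e.g. "404 Not Found"
  warn "Failed to fetch #{url}: #{e.message}"
  exit 1
end

puts html.read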

Using Nokogiri to parse HTML content

Acquiring the HTML code alone isn't enough for what we need, though; this data would be much more useful in a more readable format. To convert it, we can use Nokogiri to parse the HTML and extract the data we want. Let's do that now and expand our winners.rb file into its final form:

# frozen_string_literal: true

require "open-uri"
require "nokogiri"
require "json"

url = "https://wiki.eveuniversity.org/Alliance_Tournament"
html = URI.parse(url).open
doc = Nokogiri::HTML(html)

results = []

doc.css("table.wikitable tr").each do |row|
  winners_name = row.css("td:nth-child(3)").text.strip
  tournament_name = row.css("td:nth-child(1)").text.strip
  next if winners_name.empty?
  next if tournament_name.empty?

  tournament_year = row.css("td:nth-child(2)").xpath('text()').text.strip
  tournament_yc_year = row.css("td:nth-child(2) span").text.strip

  bracket_link = row.css("td:nth-child(5) a:contains('Bracket')").attr("href")&.value
  results_link = row.css("td:nth-child(5) a:contains('Results')").attr("href")&.value

  tournament = {
    name: tournament_name,
    year: tournament_year,
    yc: tournament_yc_year,
    winners: winners_name,
    bracket: bracket_link,
    results: results_link
  }

  results << tournament
end

json = results.to_json

File.open("winners.json", "w") do |f|
  f.write json
end

This code is a bit more advanced, so let’s break it down to make it easier to understand.

require "open-uri"
require "nokogiri"
require "json"

url = "https://wiki.eveuniversity.org/Alliance_Tournament"
html = URI.parse(url).open
doc = Nokogiri::HTML(html)

results = []

doc.css("table.wikitable tr").each do |row|
  # ...
end

First, we get the page contents from the internet just as we did before and use Nokogiri to open it. Then we can target the table that interests us with a CSS selector – there’s only one of those on the page, which makes it easy. We also want to have an array for our results.

winners_name = row.css("td:nth-child(3)").text.strip
tournament_name = row.css("td:nth-child(1)").text.strip
next if winners_name.empty?
next if tournament_name.empty?

Some of the rows in the table aren’t interesting – one is the current year’s tournament, which doesn’t have a winner yet, and additionally there were two years in which the tournament didn’t happen at all, so we might as well just skip those rows right away.

tournament_year = row.css("td:nth-child(2)").xpath('text()').text.strip
tournament_yc_year = row.css("td:nth-child(2) span").text.strip

The second column of the table contains two dates: one is the real-world year and the other is its in-game representation. For example, 2023 is represented as YC 125. The normal date is the direct text content of the td tag, while the in-game date sits in an extra span tag inside the table cell.

Targeting the span is easy enough with a CSS selector, but to get the direct text content without including the child elements we need to use the XPath function text() instead.
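
Here's the difference on a small, self-contained example (the markup is illustrative, not the actual wiki HTML):

require "nokogiri"

cell = Nokogiri::HTML.fragment("<div>2023 <span>YC 125</span></div>").at_css("div")

puts cell.text                       # => "2023 YC 125" (includes the span)
puts cell.xpath("text()").text.strip # => "2023" (direct text nodes only)
puts cell.at_css("span").text        # => "YC 125"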

bracket_link = row.css("td:nth-child(5) a:contains('Bracket')").attr("href")&.value
results_link = row.css("td:nth-child(5) a:contains('Results')").attr("href")&.value

Lastly, the links. As explained before, there are two types, so we can just differentiate them through the link’s title and save them in an appropriate field.

tournament = {
  name: tournament_name,
  year: tournament_year,
  yc: tournament_yc_year,
  winners: winners_name,
  bracket: bracket_link,
  results: results_link
}

results << tournament

Now that we have all the data, we can put it together into a hash and push it out into the results array.

json = results.to_json

File.open("winners.json", "w") do |f|
  f.write json
end

And finally, we convert our results to a JSON file and write it to the disk.
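
As a side note, Ruby's File.write does the same thing in one line, if we don't need the block form:

File.write("winners.json", results.to_json)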

Using Vessel to quickly parse multiple pages


Before we dive into getting data from the brackets, let’s start working on obtaining historical results from Eve Online’s community pages. This should be a bit easier since they are just static pages.

This time we’ll be using Vessel instead of just getting the page contents with plain Ruby, which will allow us to quickly scrape multiple web pages in parallel.

# frozen_string_literal: true

require "json"
require "vessel"

class CommunityResults < Vessel::Cargo
  domain "community.testeveonline.com"

  def parse
    page_title = at_css("#content h1.content-title")&.text

    matches = []
    css(".at-results table tr").each do |row|
      team1 = row.at_css("td:nth-child(3) span:nth-child(1)")
      team2 = row.at_css("td:nth-child(3) span:nth-child(3)")
      next unless team1 && team2

      # team1[:class] returns the attribute's value as a String (an Attr node never equals a String)
      winner = team1[:class] == "winner" ? team1 : team2

      matches << {
        team1: team1.text,
        team2: team2.text,
        winner: winner.text
      }
    end

    full_results = {
      name: page_title.sub("Alliance Tournament ", "AT").sub("Results", "").strip,
      matches: matches
    }

    yield full_results
  end
end

results = File.read("winners.json")
json = JSON.parse(results)

urls = []
json.each do |tournament|
  next unless tournament["results"] =~ /community\.testeveonline\.com/
  urls << tournament["results"]
end

match_results = []
CommunityResults.run(start_urls: urls){ |a| match_results << a }

json_results = match_results.to_json
File.open("match_results.json", "w") do |f|
  f.write json_results
end

This is the whole script for scraping all of the match results. Let’s break it down again.

results = File.read("winners.json")
json = JSON.parse(results)

urls = []
json.each do |tournament|
  next unless tournament["results"] =~ /community\.testeveonline\.com/
  urls << tournament["results"]
end

match_results = []
CommunityResults.run(start_urls: urls){ |a| match_results << a }

json_results = match_results.to_json
File.open("match_results.json", "w") do |f|
  f.write json_results
end

This part should be pretty easy by now, as it's mostly just more file operations. We read the JSON file from the last step and add the community-site result URLs to an array. Then we use our parser to get the match results and write them to a separate file.

The interesting thing here is the parser. To use Vessel, you just pass the URL array to the parser class – its run method then passes every value that the class's parse method yields to the supplied block.

This might look a bit obscure at first, so let’s go through the parser class step by step.

class CommunityResults < Vessel::Cargo
  domain "community.testeveonline.com"

  def parse
    # ...
  end
end

We’re going to define our parser class, which inherits from an appropriate Vessel class, and define our parse method. This method will be automatically called by Vessel’s run method, so we need to have it.

page_title = at_css("#content h1.content-title")&.text

Inside the parse method, we're using the same Nokogiri methods as before, with the caveat that we don't have an explicit Nokogiri object. That part is handled automatically by Vessel, so we can simply call the at_css and css methods directly. Here we're getting the title element so we know which tournament we're actually obtaining the results for.

matches = []
css(".at-results table tr").each do |row|
  team1 = row.at_css("td:nth-child(3) span:nth-child(1)")
  team2 = row.at_css("td:nth-child(3) span:nth-child(3)")
  next unless team1 && team2

Next we’ll populate the matches array. There’s a results table on each page, so we go through all of the rows and get the two team names. We can skip the row if we don’t have both names – those will be the table headers.

# team1[:class] returns the attribute's value as a String (an Attr node never equals a String)
winner = team1[:class] == "winner" ? team1 : team2

We also want to know which team was the winner in any given match. A quick inspection of one of the pages shows that they are marking the winner by using the winner class, so we’ll just get the name of whichever team is marked this way.
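
The snippet below demonstrates the check on simplified, illustrative markup (reconstructed from the selectors we use, not copied verbatim from the community pages):

require "nokogiri"

doc = Nokogiri::HTML(<<~HTML)
  <table><tr>
    <td>1</td><td>12:00</td>
    <td><span class="winner">Team A</span> <span>vs</span> <span>Team B</span></td>
  </tr></table>
HTML

row = doc.at_css("tr")
team1 = row.at_css("td:nth-child(3) span:nth-child(1)")
team2 = row.at_css("td:nth-child(3) span:nth-child(3)")
winner = team1[:class] == "winner" ? team1 : team2
puts winner.text # => "Team A"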

matches << {
  team1: team1.text,
  team2: team2.text,
  winner: winner.text
}
end

Now we populate our match results array.

full_results = {
  name: page_title.sub("Alliance Tournament ", "AT").sub("Results", "").strip,
  matches: matches
}

yield full_results

Finally we can output the results of our parsing, after a bit of string substitution to align the format of the tournament name to that which we had in our previous scraper.

Notice that we don’t return the final value here but yield it instead. This allows us to use the block to manipulate all the results, like below:

CommunityResults.run(start_urls: urls){ |a| match_results << a }

Here we just output it to our results array, but we could manipulate it more at this point if we needed to.
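
For instance (a sketch of one possibility), we could tally wins per team directly in the block instead of collecting the raw hashes:

win_counts = Hash.new(0)
CommunityResults.run(start_urls: urls) do |tournament|
  tournament[:matches].each { |match| win_counts[match[:winner]] += 1 }
end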

Using Tanakai to get data from JavaScript pages

We can now go back to getting the most recent tournaments' results from the brackets published on their Challonge pages. Since this is a React-powered website, we can't use a basic Ruby scraper; the content is rendered by JavaScript that a plain HTTP fetch won't execute.

We'll be using a different library this time: Tanakai, a fork of the currently unmaintained but still quite popular Kimurai. We have to rely on the fork because Kimurai's main repository isn't compatible with Ruby 3 yet and will throw errors if we try to use it.

# frozen_string_literal: true

require "open-uri"
require "json"
require "tanakai"

class BracketsSpider < Tanakai::Base
  @name = "brackets_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://challonge.com/atxviii", "https://challonge.com/atxvii", "https://challonge.com/AllianceTournamentXVI"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
    before_request: { delay: 4..7 }
  }

  def parse(response, url:, data: {})
    page_title = response.css(".tournament-banner-header h4")&.text

    matches = []
    response.xpath("//g[contains(@class, 'match')]").each do |match|
      teams = match.xpath(".//text[contains(@class, 'match--player-name')]").map(&:text)
      winner = match.at_xpath(".//text[contains(@class, 'match--player-name')][contains(@class, '-winner')]").text

      matches << {
        team1: teams.first,
        team2: teams.second,
        winner: winner
      }
    end

    full_results = {
      name: page_title.sub("Alliance Tournament ", "AT").strip,
      matches: matches
    }

    save_to "bracket_results.json", full_results, format: :json
  end
end

BracketsSpider.crawl!

This code is somewhat similar to what we used with Vessel, but it has some important differences, partly because Tanakai handles all the file saving for us automatically.

# frozen_string_literal: true

require "open-uri"
require "json"
require "tanakai"

class BracketsSpider < Tanakai::Base
  @name = "brackets_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://challonge.com/atxviii", "https://challonge.com/atxvii", "https://challonge.com/AllianceTournamentXVI"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
    before_request: { delay: 4..7 }
  }

  def parse(response, url:, data: {})
    # ...
  end
end

The first big difference is the setup. Sadly Tanakai doesn’t have any good way to pass the URLs, so we have to hardcode them in the class. This isn’t ideal, but for most purposes – for example, when you know which pages you need to go through or have one website as a starting point – it’s not terrible.
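
That said, @start_urls is just a class-level instance variable, so we could compute it in the class body instead of hardcoding it. Here's a sketch of the idea (untested against Tanakai's internals), reusing the winners.json file from the first scraper:

class BracketsSpider < Tanakai::Base
  @name = "brackets_spider"
  @engine = :selenium_chrome
  # Build the URL list from the bracket links we scraped earlier;
  # only the recent tournaments link to challonge.com brackets
  @start_urls = JSON.parse(File.read("winners.json"))
                    .map { |tournament| tournament["bracket"] }
                    .compact
                    .select { |url| url.include?("challonge.com") }
end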

We also need to pretend that our crawler is a real browser by adding a user agent; otherwise the website will just show us a Cloudflare error page.

page_title = response.css(".tournament-banner-header h4")&.text

As before, we grab the tournament’s name with the CSS selector.

matches = []
response.xpath("//g[contains(@class, 'match')]").each do |match|
  teams = match.xpath(".//text[contains(@class, 'match--player-name')]").map(&:text)
  winner = match.at_xpath(".//text[contains(@class, 'match--player-name')][contains(@class, '-winner')]").text

  matches << {
    team1: teams.first,
    team2: teams.second,
    winner: winner
  }
end

After that we go through the page, filling up our matches array. Since the matches are drawn as SVG elements and carry multiple classes (and there are far more of these elements than there would be rows in a simple table), it's much easier to target them with XPath and its functions – contains() lets us check whether an attribute includes a given value, and we can chain several such checks to find the winner of each match.
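
Here's a tiny, self-contained illustration of the contains() idiom (the SVG fragment is simplified and illustrative; the class names mirror the ones we target above):

require "nokogiri"

svg = Nokogiri::XML(<<~SVG)
  <svg>
    <g class="match -completed">
      <text class="match--player-name -winner">Team A</text>
      <text class="match--player-name">Team B</text>
    </g>
  </svg>
SVG

svg.xpath("//g[contains(@class, 'match')]").each do |match|
  winner = match.at_xpath(".//text[contains(@class, 'match--player-name')][contains(@class, '-winner')]")
  puts winner.text # => "Team A"
end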

If you’re interested in learning more about XPath and its multiple options, you can check out an XPath cheatsheet or its general documentation.

full_results = {
  name: page_title.sub("Alliance Tournament ", "AT").strip,
  matches: matches
}

save_to "bracket_results.json", full_results, format: :json

Finally we prepare our results object and use Tanakai’s save_to helper to output them to a JSON file.

What’s next?

We obtained our results, but the job's not done yet. There are still problems with the data we collected – the matches list we got from the Challonge brackets isn't complete, and parsing the earlier historical results from the community sites revealed some matches that ended in draws, which should be handled as well. Delving into these intricacies warrants a separate article, though; in the meantime, we encourage readers to explore potential solutions and address these challenges on their own.
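
As a starting point, here's a rough sketch of the ranking step itself, assuming the two JSON files produced above and ignoring the draw and completeness issues for now:

require "json"

all_results = JSON.parse(File.read("match_results.json")) +
              JSON.parse(File.read("bracket_results.json"))

wins = Hash.new(0)
played = Hash.new(0)

all_results.each do |tournament|
  tournament["matches"].each do |match|
    played[match["team1"]] += 1
    played[match["team2"]] += 1
    wins[match["winner"]] += 1
  end
end

# Rank teams by total wins across all tournaments
wins.sort_by { |_team, count| -count }.first(10).each do |team, count|
  puts "#{team}: #{count}/#{played[team]} wins"
end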
