How to Build a Web Crawler in Swift 🕷

More often than not, my App Suite is thirsty for information that can be found around the Web.

Unless it’s something that takes a couple of minutes, I like to automate these kinds of tasks as much as possible (it saves hours of work!).

My go-to languages have always been Python and JavaScript; however, the recent launch of John Sundell’s Marathon made me wonder: is Swift ready for scripting? 🤔

Building A Web Crawler in Swift

Since I needed to extract some information from the Internet, this time I’ve decided to give Swift a try.

There are two main ways to write scripts in Swift:

  1. by following Hector Matos’s awesome guide here
  2. by using John Sundell’s Marathon (the README.md is all you need)

It turns out that coding a Swift script is no different from coding a new Swift class, function, etc. I never felt out of place.

Also, building a Web Crawler in Swift is incredibly easy:

import Foundation

// Input your parameters here
let startUrl = URL(string: "https://developer.apple.com/swift/")!
let wordToSearch = "Swift"
let maximumPagesToVisit = 10

// Crawler Parameters
let semaphore = DispatchSemaphore(value: 0)
var visitedPages: Set<URL> = []
var pagesToVisit: Set<URL> = [startUrl]

// Crawler Core
func crawl() {
  guard visitedPages.count < maximumPagesToVisit else {
    print("🏁 Reached max number of pages to visit")
    semaphore.signal()
    return
  }
  guard let pageToVisit = pagesToVisit.popFirst() else {
    print("🏁 No more pages to visit")
    semaphore.signal()
    return
  }
  if visitedPages.contains(pageToVisit) {
    crawl()
  } else {
    visit(page: pageToVisit)
  }
}

func visit(page url: URL) {
  visitedPages.insert(url)
  
  let task = URLSession.shared.dataTask(with: url) { data, response, error in
    defer { crawl() }
    guard
      let data = data,
      error == nil,
      let document = String(data: data, encoding: .utf8) else { return }
    parse(document: document, url: url)
  }
  
  print("🔎 Visiting page: \(url)")
  task.resume()
}

func parse(document: String, url: URL) {
  func find(word: String) {
    if document.contains(word) {
      print("✅ Word '\(word)' found at page \(url)")
    }
  }
  
  func collectLinks() -> [URL] {
    func getMatches(pattern: String, text: String) -> [String] {
      let regex = try! NSRegularExpression(pattern: pattern, options: [.caseInsensitive])
      // NSRange counts UTF-16 code units, so build it from the string's own indices
      let range = NSRange(text.startIndex..<text.endIndex, in: text)
      let matches = regex.matches(in: text, options: [], range: range)
      // capture group 1 holds the URL without the surrounding 'href="' & '"'
      return matches.compactMap { match in
        Range(match.range(at: 1), in: text).map { String(text[$0]) }
      }
    }
    
    let pattern = "href=\"(http://.*?|https://.*?)\""
    let matches = getMatches(pattern: pattern, text: document)
    return matches.compactMap { URL(string: $0) }
  }
  
  find(word: wordToSearch)
  collectLinks().forEach { pagesToVisit.insert($0) }
}

crawl()
semaphore.wait()

I believe the script is pretty self-explanatory: you input a starting web page, the maximum number of webpages to crawl, and the word you’re interested in.

In my case, I didn’t need these inputs to change, but if you do, Hector Matos explains how to write a script that takes arguments here.
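As a rough sketch of what argument handling could look like (this is not taken from Hector’s guide: the `CrawlerInput` type, the `parseInput` helper and the argument order are all hypothetical), the three inputs can be read from `CommandLine.arguments`, falling back to the defaults used in the script:

```swift
import Foundation

// Hypothetical container for the three crawler inputs, with the script's defaults.
struct CrawlerInput {
  var startUrl = URL(string: "https://developer.apple.com/swift/")!
  var wordToSearch = "Swift"
  var maximumPagesToVisit = 10
}

// Hypothetical helper. Assumed argument order: <start url> <word> <max pages>.
// Missing or malformed arguments keep their default value.
func parseInput(from arguments: [String]) -> CrawlerInput {
  var input = CrawlerInput()
  // arguments[0] is the script path, the real arguments start at index 1
  let values = Array(arguments.dropFirst())
  if values.count > 0, let url = URL(string: values[0]), url.scheme != nil {
    input.startUrl = url
  }
  if values.count > 1, !values[1].isEmpty {
    input.wordToSearch = values[1]
  }
  if values.count > 2, let limit = Int(values[2]), limit > 0 {
    input.maximumPagesToVisit = limit
  }
  return input
}

let input = parseInput(from: CommandLine.arguments)
print("Start: \(input.startUrl), word: \(input.wordToSearch), max pages: \(input.maximumPagesToVisit)")
```

Keeping the parsing in its own function makes it easy to try out in a Playground with a hand-written array before wiring it to the real arguments.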

Once launched, the script will start crawling the web from the given page and jump around all the links it can find.
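To get a feel for which links the crawler will jump to, here is a small sketch that applies the same `href` pattern as the script to a hard-coded HTML snippet (the snippet and the `extractLinks` name are made up for illustration). Note that only absolute `http`/`https` links match, so relative links are ignored:

```swift
import Foundation

// Hypothetical standalone helper that applies the script's href pattern to a string.
func extractLinks(from html: String) -> [URL] {
  let pattern = "href=\"(http://.*?|https://.*?)\""
  guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
    return []
  }
  // NSRange is expressed in UTF-16 code units, so derive it from the string's indices
  let range = NSRange(html.startIndex..<html.endIndex, in: html)
  return regex.matches(in: html, options: [], range: range).compactMap { match -> URL? in
    // capture group 1 is the URL between the quotes
    guard let urlRange = Range(match.range(at: 1), in: html) else { return nil }
    return URL(string: String(html[urlRange]))
  }
}

// Made-up sample page: two absolute links and one relative link
let sampleHtml = """
<a href="https://swift.org/">Swift.org</a>
<a href="/relative/path">Relative</a>
<a href="http://example.com/docs">Docs</a>
"""

extractLinks(from: sampleHtml).forEach { print("🔗 \($0)") }
```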

If you want a deeper explanation, please see here.

In conclusion, I’d say that, depending on what you want to do and given the right tools, Swift is definitely a viable option for scripting!

A Note About Semaphores

If you’re wondering why there is a DispatchSemaphore in the script above, it’s because any Swift Script exits as soon as its control flow reaches the end.

If you must handle asynchronicity (like in my case, due to the loading of webpages), you need to have something that makes your script stay alive.
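Stripped of the crawling logic, the pattern looks roughly like the sketch below. The asynchronous work is simulated with `asyncAfter`, and the five-second timeout is an extra safety net I’ve added for this example; neither is part of the crawler script:

```swift
import Foundation

// Starts at 0, so waiting blocks until someone calls signal()
let keepAlive = DispatchSemaphore(value: 0)

// Simulated asynchronous work (stands in for loading a web page)
DispatchQueue.global().asyncAfter(deadline: .now() + 0.5) {
  print("✅ Async work completed")
  keepAlive.signal() // lets the script continue past the wait below
}

print("⏳ Waiting for async work...")
// Without this wait the script would reach its end and exit before the closure runs.
// The timeout (an assumption for this sketch) avoids hanging forever if signal() is never called.
let result = keepAlive.wait(timeout: .now() + 5)
print(result == .success ? "🏁 Done" : "⚠️ Timed out")
```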

Shameless plug: if you want to know more about Semaphores, make sure to read my previous article here.

Scripts and Xcode Playgrounds

If your script allows it, I’d suggest using Xcode Playgrounds while developing it:

This way it’s much faster to test it without the need to keep switching between your terminal and your code editor.

Once ready, you can quickly turn it into a real script.

Code Snippets

For this post I’ve made a small repository, Selenops 🕷 (a spider that flies, yay! 😱), on GitHub: in there you'll find the very same script written as an Xcode Playground, as a Marathon Script, and as a standard “Command Line Tool” Script.

Happy Scripting!

⭑⭑⭑⭑⭑

Further Reading