Generating rich link previews 10x faster

unfurl is one of my favorite open source libraries. Not because it’s clever or complex, but because it’s a tiny Kotlin library that does one thing well: generate rich link previews by extracting a page’s social metadata. You can try out its demo on your computer:

brew install saket/repo/unfurl
unfurl https://dog.ceo

If you’ve ever pasted a link into Slack, Twitter or WhatsApp and watched a link preview appear, you’ve seen unfurling in action. The term “unfurling” sounds fancy, but its implementation is actually very simple:

  1. Make an HTTP request to the URL
  2. Check the response headers to confirm the content is HTML
  3. Parse the entire HTML using jsoup
  4. Extract social metadata such as og:title, twitter:description, etc.
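
To make that concrete, here’s roughly what steps 3 and 4 look like with jsoup. This isn’t unfurl’s actual code; the function name and the list of tags are just for illustration:

import org.jsoup.Jsoup

// Parse the whole page, then pull out the social metadata tags.
// Open Graph tags use the "property" attribute; Twitter cards use "name".
fun extractSocialMetadata(html: String, url: String): Map<String, String> {
  val document = Jsoup.parse(html, url)
  return listOf("og:title", "og:description", "og:image", "twitter:description")
    .mapNotNull { key ->
      document.selectFirst("meta[property=$key], meta[name=$key]")
        ?.attr("content")
        ?.takeIf { it.isNotBlank() }
        ?.let { key to it }
    }
    .toMap()
}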

This works well, but there was one thing I had always disliked about it: unfurl was extremely inefficient. It downloaded an entire web page to read a few HTML tags, only to discard almost all of it. Modern web pages are large, but the data unfurl cares about lives entirely in their <head> section, usually just the first few kilobytes.

Can I avoid downloading entire web pages?

Attempt #1: String matching

My initial (and naive) idea was to stream the HTTP response in chunks and stop reading as soon as bytes.contains("</head>") returned true. Turns out, you can’t reliably detect HTML structure with string matching. It’s such a bad idea that it once drove a Stack Overflow user to the brink of madness. I’m embarrassed to admit I’d never read that famous post until now.
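
For the record, here’s roughly what I had in mind; a sketch, not real unfurl code:

import okhttp3.Response
import okio.Buffer

// Read the response body in small chunks and bail out once "</head>" shows up.
// Appending everything to one builder avoids missing a marker that straddles
// two chunks, but string matching still can't tell a real closing tag apart
// from one sitting inside a comment, a script, or an attribute value.
fun readUntilHeadCloses(httpResponse: Response): String {
  val source = httpResponse.body.source()
  val chunk = Buffer()
  val seen = StringBuilder()
  while (source.read(chunk, 8_192L) != -1L) {
    seen.append(chunk.readUtf8())
    if (seen.contains("</head>", ignoreCase = true)) break
  }
  return seen.toString()
}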

Attempt #2: Range HTTP header

val request = okhttp3.Request.Builder()
  .url(url)
  // Ask the website to send as little of the page as possible,
  // hoping that the social tags are present in the initial 32 KB.
  .header("Range", "bytes=0-32768")
  .build()

My next idea came from analyzing the GET requests made by Slack’s unfurler using Request Catcher: I noticed it was sending a Range header. In practice, this was hit or miss. It worked with some websites like The New York Times, but The Verge and others ignored the header.
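
One wrinkle with this approach (a sketch, not unfurl’s actual code): you can’t assume the body is small just because you asked for a range. Servers that honor the header reply with 206 Partial Content; servers that ignore it reply with 200 and send the whole page anyway.

// okHttpClient is assumed to be a plain OkHttpClient instance.
okHttpClient.newCall(request).execute().use { response ->
  // 206 = the server honored the Range header and sent a partial body.
  // 200 = the header was ignored and the full page is on its way.
  val rangeWasHonored = response.code == 206
}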

Attempt #3: Streaming HTML

I learned something amazing last week: I’m stupid. Jsoup has had native support for streaming HTML all this time! It can parse HTML incrementally in real time and stop the download as soon as a condition is met:

// Before:
fun parseHtml(httpResponse: Response): Node {
  return Jsoup.parse(
    /* in = */ httpResponse.body.source().inputStream(),
    /* charsetName = */ httpResponse.body.contentType()?.charset()?.name(),
    /* baseUri = */ httpResponse.request.url.toString(),
  )
}

// After:
fun parseHtml(httpResponse: Response): Node {
  val streamer = StreamParser(Parser.htmlParser())
  streamer.parse(
    /* input = */ httpResponse.body.charStream().buffered(),
    /* baseUri = */ httpResponse.request.url.toString(),
  )
  return streamer.use {
    // Stream the HTTP response until the <head> block is received.
    it.selectFirst("head")!!
  }
}
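
As far as I can tell, closing the StreamParser (which use does here) also closes the response body it’s reading from, so OkHttp stops pulling bytes off the socket as soon as the <head> element has been parsed.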

The savings are huge. unfurl 2.3.0 is now an order of magnitude faster and downloads far less data. Here’s a comparison of bytes downloaded before and after the change:

  Page                                          Before    After
  Inside Elon’s “extremely hardcore” Twitter    158 KB    24.5 KB
  The 100 Best Movies of the 21st Century       515 KB    32 KB

Bonus: Multiple user agents

Downloading less data unlocked another feature that wasn’t feasible earlier: trying multiple HTTP user agents. Some websites block certain user agents to keep scrapers out, and unfurl looks like a scraper from the outside. Choosing a single user agent that works for all websites has proved tough in the past. A few examples:

  • Kraken frequently returns an HTTP 403 for Chrome’s user agent, but reliably allows Slack’s "Slackbot-LinkExpanding 1.0" and WhatsApp’s "WhatsApp/2" agents.
  • Notion serves a stub HTML page for Slack’s user agent and an HTTP 404 for WhatsApp’s agent.
  • American Airlines times out for Slack’s user agent.
  • Expedia and The Wall Street Journal block every user agent I tried. They’re probably using something more sophisticated.

The solution? Brute force: try multiple user agents, staggered by a small delay. This is now feasible because unfurl no longer downloads entire web pages. Starting with the latest version, unfurl ships with three HTTP user agents by default, so the chances of success are a bit higher.

val httpUserAgents = listOf(
  "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Mobile Safari/537.36",
  "WhatsApp/2",
  "Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)",
)
httpUserAgents.mapIndexed { index, userAgent ->
  flow {
    // Most web pages should be reachable using the first user agent. Remaining
    // requests are staggered so that earlier user agents get a chance to succeed
    // before firing later ones.
    delay(1.seconds * index)
    emit(downloadHtml(url, userAgent))
  }
}
  .merge()
  .filterNotNull()
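  // The first user agent that produces a result wins: firstOrNull() cancels
  // the merged flow, so requests still waiting on their delay never fire.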
  .firstOrNull()

Try it out

Both changes ship in unfurl 2.3.0.

What’s next?

What about websites that can’t be unfurled by a bot? WSJ and Twitter load their content via JavaScript, which unfurl can’t fake. Can I spin up a headless browser on Android in the background? A problem for future Saket.