unfurl is one of my favorite open source libraries. Not because it’s clever or complex, but because it’s a tiny Kotlin library that does one thing well: generate rich link previews by extracting a page’s social metadata. You can try out its demo on your computer:
```sh
brew install saket/repo/unfurl
unfurl https://dog.ceo
```
If you’ve ever pasted a link into Slack, Twitter or WhatsApp and watched a link preview appear, you’ve seen unfurling in action. The term “unfurling” sounds fancy, but its implementation is actually very simple:
- Send a GET request to the URL using OkHttp
- If its content type is `text/html`, download its response body
- Parse the HTML using jsoup
- Find social metadata such as `og:title`, `twitter:description`, etc.
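In code, the whole pipeline fits in a handful of lines. Here’s a minimal sketch of those four steps — not unfurl’s actual source; the function name, client handling, and the exact tags selected are my assumptions:

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup

// A minimal sketch of the four steps above, not unfurl's actual source.
fun unfurlNaively(client: OkHttpClient, url: String): Pair<String?, String?> {
  val request = Request.Builder().url(url).build()
  client.newCall(request).execute().use { response ->
    val body = response.body ?: return null to null
    // Only HTML responses are worth parsing.
    if (body.contentType()?.subtype != "html") return null to null
    // Downloads the *entire* page before parsing begins.
    val document = Jsoup.parse(body.string(), url)
    // Pluck out the social tags.
    val title = document.selectFirst("meta[property=og:title]")?.attr("content")
    val description = document.selectFirst("meta[name=twitter:description]")?.attr("content")
    return title to description
  }
}
```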
This works well, but there’s something I had always disliked about it: unfurl was extremely inefficient. It downloaded an entire web page just to read a few HTML tags, discarding almost all of it. Modern web pages are large, but the data unfurl cares about lives entirely in the `<head>` section, usually just the first few kilobytes.
Can I avoid downloading entire web pages?
## Attempt #1: String matching
My initial (and naive) idea was to stream the HTTP response in chunks and stop reading as soon as `bytes.contains("</head>")` returned true. Turns out, you can’t reliably detect HTML structure with string matching. It’s such a bad idea that it once drove a Stack Overflow user to the brink of madness. I’m embarrassed to admit I’d never read that famous post until now.
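For the record, here’s roughly what that looked like, as a simplified sketch (the function name and chunk size are arbitrary):

```kotlin
import okio.Buffer

// Stream the response in chunks and stop once "</head>" shows up.
// Brittle: misses "</HEAD>", whitespace inside the tag, markup inside
// comments or scripts, and can split multi-byte characters across chunks.
fun readUntilHead(httpResponse: okhttp3.Response): String {
  val source = httpResponse.body!!.source()
  val html = StringBuilder()
  val buffer = Buffer()
  while (!source.exhausted()) {
    source.read(buffer, 8_192)
    html.append(buffer.readUtf8())
    if (html.contains("</head>")) break
  }
  return html.toString()
}
```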
## Attempt #2: Range HTTP header
```kotlin
val request = okhttp3.Request.Builder()
  .url(url)
  // Ask the website to send as little of the page as possible,
  // hoping that the social tags are present in the initial 32 KB.
  .header("Range", "bytes=0-32767")
  .build()
```
My next idea came from analyzing the GET requests made by Slack’s unfurler using Request Catcher. I noticed Slack was sending the Range header. In practice, this was hit or miss. It worked with some websites like The New York Times, but The Verge and others ignored the header.
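Misses are at least easy to detect: a server that honors a Range request replies with HTTP 206 and a Content-Range header. A quick check, assuming the `request` from above and an OkHttpClient named `client`:

```kotlin
// Servers that honor Range reply with HTTP 206 Partial Content.
val response = client.newCall(request).execute()
val honoredRange = response.code == 206 || response.header("Content-Range") != null
```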
## Attempt #3: Streaming HTML
I learned something amazing last week: I’m stupid. Jsoup has had native support for streaming HTML all this time! It can incrementally parse HTML in real time and stop the download as soon as a condition is met:
```kotlin
// Before:
fun parseHtml(httpResponse: Response): Node {
  return Jsoup.parse(
    /* in = */ httpResponse.body.source().inputStream(),
    /* charsetName = */ httpResponse.body.contentType()?.charset()?.name(),
    /* baseUri = */ httpResponse.request.url.toString(),
  )
}
```

```kotlin
// After:
fun parseHtml(httpResponse: Response): Node {
  val streamer = StreamParser(Parser.htmlParser())
  streamer.parse(
    /* input = */ httpResponse.body.charStream().buffered(),
    /* baseUri = */ httpResponse.request.url.toString(),
  )
  return streamer.use {
    // Stream the HTTP response until the <head> block is received.
    checkNotNull(it.selectFirst("head"))
  }
}
```
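Hypothetical usage; the selectors below are my assumption, not unfurl’s actual tag list:

```kotlin
import org.jsoup.nodes.Element

// Everything inside <head> is fully parsed by the time it's returned,
// so the social tags can be queried like any other jsoup element.
val head = parseHtml(httpResponse) as Element
val title = head.selectFirst("meta[property=og:title]")?.attr("content")
val description = head.selectFirst("meta[name=twitter:description]")?.attr("content")
```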
The savings are huge. unfurl 2.3.0 is now an order of magnitude faster and downloads far less data. Here’s a comparison of bytes downloaded before and after the change:
| Page | Before | After |
|---|---|---|
| Inside Elon’s “extremely hardcore” Twitter | 158 KB | 24.5 KB |
| The 100 Best Movies of the 21st Century | 515 KB | 32 KB |
## Bonus: Multiple user agents
The savings in downloaded data allowed me to implement another feature that wasn’t feasible earlier: trying out multiple HTTP user agents. Some websites block certain user agents to discourage scraping, and from the outside, unfurl looks exactly like a scraper. Choosing a single user agent that worked for all websites had proved tough in the past. A few examples I ran into:
- Kraken frequently returns an HTTP 403 for Chrome’s user agent, but reliably allows Slack’s `Slackbot-LinkExpanding 1.0` and WhatsApp’s `WhatsApp/2` agents.
- Notion returns empty HTML for Slack’s user agent and an HTTP 404 for WhatsApp’s agent.
- American Airlines times out for Slack’s user agent.
- Best Buy throttles requests with Slack’s user agent by 8-9s.
- Expedia and Wall Street Journal block all user agents. They’re probably using something more sophisticated than user-agent checks.
The solution? Brute force: try multiple user agents, staggered by a small delay. This is now feasible because unfurl no longer downloads entire web pages. The latest version ships with three HTTP user agents by default, so the chances of success are a bit higher.
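Here’s a simplified, sequential sketch of the idea; the agent strings, the 500 ms delay, and the function shape are my assumptions, not unfurl’s actual implementation:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.withContext
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.Response

// Try each user agent in turn until one of them gets through.
suspend fun fetchWithFallbacks(client: OkHttpClient, url: String): Response? {
  val userAgents = listOf(
    "Slackbot-LinkExpanding 1.0", // Slack's unfurler.
    "WhatsApp/2",                 // WhatsApp's unfurler.
    "Mozilla/5.0",                // A generic, browser-like agent.
  )
  userAgents.forEachIndexed { index, agent ->
    if (index > 0) delay(500) // Stagger each retry by a small delay.
    val request = Request.Builder()
      .url(url)
      .header("User-Agent", agent)
      .build()
    val response = withContext(Dispatchers.IO) { client.newCall(request).execute() }
    if (response.isSuccessful) return response
    response.close()
  }
  return null
}
```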
## What’s next?
I often wonder: could unfurl use a headless browser on Android to download web pages that block bot requests, like The Wall Street Journal and Twitter?