URI Encoding

Sep 22, 2025 11:50 · 848 words · 4 minute read

I had to URI-encode the components of a URL in Node the other day. You’d think this would be trivial, particularly in the language of the browser. Turns out, not so much. I ended up deep-diving the relevant RFCs, so I thought I’d summarize here to save you (well, future me) the effort.

Let’s get into it.

The URL in Question

Given https://app.wiz.io/reports/cicd_scans#~(cicd_scan~'1234),

What do you think a “correct” URI-encoding would be?

  1. https://app.wiz.io/reports/cicd_scans#~(cicd_scan~'1234)

    This is what I got from doing the “obvious” thing, encodeURIComponent. It escaped nothing, which made me think I must be misusing it.

  2. https://app.wiz.io/reports/cicd_scans#~%28cicd_scan~%271234%29

    This is what I got from urlencoder.org, further convincing me I was doing something wrong.

  3. https://app.wiz.io/reports/cicd_scans#%7E%28cicd_scan%7E%271234%29

    This is what I got when I pasted the un-encoded URL into a Chromium address bar.

The answer is these can all be correct, depending on your perspective and intent.
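The first result is easy to reproduce. Here is a minimal sketch, assuming the fragment (everything after the #) is the component being encoded; the variable names are only for illustration:

```javascript
// The fragment from the URL above (everything after "#").
const fragment = "~(cicd_scan~'1234)";

// encodeURIComponent leaves A-Z a-z 0-9 - _ . ! ~ * ' ( ) untouched,
// so this particular fragment comes back unchanged.
const encoded = encodeURIComponent(fragment);
console.log(encoded === fragment); // true

// It does escape characters outside that set, such as space and "#".
console.log(encodeURIComponent("a b#c")); // a%20b%23c
```

So the function is not being misused here; every character in this fragment simply falls inside the set it considers safe.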

Standards History

When it comes to URI (or URL) syntax, different RFCs exist with slightly different and overlapping guidance on what to escape. In the wild, different tools target different RFCs. There are reasons to target each, depending on whether you’re encoding or decoding, and whether you care more about compatibility with others or stricter adherence to the most modern specification.

It’s also legal according to all RFCs to escape any character within a component, so a “better safe than sorry” over-escaping of characters no RFC specifically calls out is a valid approach.
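To see why over-escaping is harmless in practice, here is a rough sketch of the extreme case: percent-encoding every single byte. The escapeEverything helper is hypothetical, and it assumes UTF-8 as the byte encoding:

```javascript
// Hypothetical helper: percent-encode every byte of the UTF-8 form of a string.
const escapeEverything = (s) =>
  Array.from(
    new TextEncoder().encode(s),
    (byte) => "%" + byte.toString(16).toUpperCase().padStart(2, "0"),
  ).join("");

const original = "~(cicd_scan~'1234)";
const overEscaped = escapeEverything(original);

// A standard decoder recovers the original string, escaped or not.
console.log(decodeURIComponent(overEscaped) === original); // true
console.log(decodeURIComponent(original) === original); // true
```

The cost is purely aesthetic: the over-escaped form is unreadable, but it decodes to the same component.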

RFC1738

In 1994, when RFC1738 introduced the idea of a URL, it meandered through various sets of characters that are unsafe, and ultimately defined the escaping rules in an obtuse way. It first defined the reserved and special sets of characters in paragraph form, which I’ll reproduce in a nicer syntax here:

reserved    = ";" | "/" | "?" | ":" | "@" | "=" | "&"

special     = "$" | "-" | "_" | "." | "+" | "!" | "*" | "'" |
              "(" | ")" | ","

It then defined the escaping rules in terms of those:

[O]nly alphanumerics, the special characters […], and reserved characters used for their reserved purposes may be used unencoded within a URL

This construction in terms of what need not be encoded hurts my brain, but the good news is I think we can ignore this RFC as it’s been superseded. It’s only worth noting as the original source for escaping ~ (which appears in neither set above), even though none of the modern RFCs require it. Also, if you ever find yourself pulling your hair out between space, +, and %20, you can thank this RFC for that too.
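Flipping that rule around into "escape everything else" makes it easier to reason about. A minimal sketch, assuming the component is plain data (so reserved characters get escaped too) and UTF-8 bytes for anything non-ASCII; encodeRfc1738 is a made-up name, not a real library function:

```javascript
// Alphanumerics plus the RFC1738 "special" set: $ - _ . + ! * ' ( ) ,
const RFC1738_SAFE = /[A-Za-z0-9$\-_.+!*'(),]/;

// Hypothetical helper: escape every byte that is not in the safe set.
// Reserved characters are escaped as well, since inside a component
// they are not being used for their reserved purpose.
function encodeRfc1738(component) {
  let out = "";
  for (const byte of new TextEncoder().encode(component)) {
    const ch = String.fromCharCode(byte);
    out +=
      byte < 0x80 && RFC1738_SAFE.test(ch)
        ? ch
        : "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(encodeRfc1738("~(cicd_scan~'1234)")); // %7E(cicd_scan%7E'1234)
```

Note how ~ is the only character in the example fragment that gets escaped under this reading.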

RFC2396

In 1998, RFC2396 “updated” RFC1738. I have to imagine this was at least in part to clarify the escaping rules. Its reserved set is clear and precise, only including those characters with delimiting behavior:

reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
              "$" | ","

As mentioned above, encoding to this specification is the behavior you’ll find in JavaScript’s encodeURIComponent. I like it because it’s minimal, meaning your URLs can remain more aesthetically pleasing. It being the behavior of the built-in JavaScript function also means anything decoding URLs will almost certainly have to accept this. That said, it is obsolete, so you really shouldn’t be targeting it for new encoding logic.
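One way to check this empirically is to ask encodeURIComponent which printable ASCII punctuation it leaves alone. A small sketch (the variable names are illustrative):

```javascript
// Collect every printable, non-alphanumeric ASCII character that
// encodeURIComponent returns unchanged.
const unescaped = [];
for (let code = 0x20; code < 0x7f; code++) {
  const ch = String.fromCharCode(code);
  if (!/[A-Za-z0-9]/.test(ch) && encodeURIComponent(ch) === ch) {
    unescaped.push(ch);
  }
}

console.log(unescaped.join("")); // !'()*-._~
```

Those nine characters are exactly the "mark" set that RFC2396 treats as unreserved, and none of them are in the reserved set above.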

RFC3986

Lastly, in 2005, RFC3986 “obsoleted” RFC2396 (and by extension RFC1738). It extends the reserved set with a few more characters, despite them not having any delimiting uses:

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

Encoding to this specification is the behavior you’ll find in Haskell, Python (urllib.parse, since Python 3.7), and Node via non-built-in packages. These are just the languages I sampled, but I’d imagine that most languages' implementations are going to target this newer RFC. The JavaScript documentation for encodeURIComponent also prominently includes a code sample for implementing an RFC3986-compliant encoder on top of it.
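For reference, a sketch modeled on that documentation sample looks roughly like this. It escapes the sub-delims that encodeURIComponent leaves alone (! ' ( ) *) and otherwise defers to the built-in:

```javascript
// Escape the RFC3986 sub-delims that encodeURIComponent does not touch.
function encodeRFC3986URIComponent(str) {
  return encodeURIComponent(str).replace(
    /[!'()*]/g,
    (c) => `%${c.charCodeAt(0).toString(16).toUpperCase()}`,
  );
}

console.log(encodeRFC3986URIComponent("~(cicd_scan~'1234)"));
// ~%28cicd_scan~%271234%29
```

The ~ stays as-is because RFC3986 lists it as unreserved.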

Summary

So, which option is correct?

  1. https://app.wiz.io/reports/cicd_scans#~(cicd_scan~'1234)
  2. https://app.wiz.io/reports/cicd_scans#~%28cicd_scan~%271234%29
  3. https://app.wiz.io/reports/cicd_scans#%7E%28cicd_scan%7E%271234%29

Perhaps unsurprisingly, it depends.

Option 1 is a noise-free non-encoding that will almost certainly continue to function in practice given that the browser language itself still targets RFC2396. Naively using encodeURIComponent and accepting this behavior is safe, because compatibility is unlikely to go away until-and-unless that function itself also changes its behavior.

Option 2 is arguably the most “correct” option, if you are looking to follow standards to the letter, including doing no more work than they explicitly require. If I were in any language other than JavaScript, I would target RFC3986 because it’s both easiest and highly correct.

Finally, Option 3 is a decent compromise in over-encoding to gain a little more compatibility without going too far. I would really only do this if I were both in JavaScript, and had already decided to implement an RFC3986-compliant encoding on top of encodeURIComponent. Once that’s happening, adding ~ as another character to escape is trivial.
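As a rough sketch of what that looks like, here is the same approach with ~ added to the character class. The encodeStrict name and the way the URL is assembled are my own choices for illustration, not anything from a library:

```javascript
// Hypothetical helper: RFC3986 sub-delims plus "~" on top of the built-in.
function encodeStrict(str) {
  return encodeURIComponent(str).replace(
    /[!'()*~]/g,
    (c) => `%${c.charCodeAt(0).toString(16).toUpperCase()}`,
  );
}

// Encode only the fragment, then attach it to the untouched base URL.
const base = "https://app.wiz.io/reports/cicd_scans";
const url = `${base}#${encodeStrict("~(cicd_scan~'1234)")}`;

console.log(url);
// https://app.wiz.io/reports/cicd_scans#%7E%28cicd_scan%7E%271234%29
```

Encoding only the fragment matters here: running the whole URL through the helper would also escape the :, /, and # that are doing real delimiting work.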