
Encoding and Decoding URLs via perl (including decimal to hex conversion)

Version history:
5/24/00: Thanks to 'campbeln' for improving the script - catching a couple small bugs - and turning this into a resource for CPAN.
12/00: Jesús Quiroga provided me with an internationalization of the encoding that handles a wider variety of accented and related characters. Thanks!
11/7/01: Dan Black sent in a very nifty compression of the whole routine into a single line, which I've incorporated below. Thanks!
Wednesday, November 7, 2001

I was trying to do something extraordinarily simple - so simple, my pea brain obviously couldn't figure it out. I wanted to find every instance of a restricted character in a prospective URL and turn it into the "url encoded" equivalent inside a perl script.

Restricted characters include most punctuation marks; to transmit these as part of a URL without causing an error or the wrong interpretation on the receiving end, you need to convert each character into its ASCII code. (ASCII is a long-standing standard that assigns numbers to letters, control characters, and other symbols.)

The character code gets represented in the URL as a percent sign followed by the two-digit hexadecimal (base 16) number for the ASCII code. For instance, an exclamation point is decimal 33 in ASCII, or hex 21. To include this in a URL, you use %21. A space can be represented as a plus sign (+) or as %20 (decimal 32 in ASCII).
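To make the mapping concrete, here is a minimal sketch that prints the decimal ASCII code and the percent-encoded form for a few characters. The helper name percent_code is hypothetical and is not part of the script on this page.

```perl
use strict;
use warnings;

# percent_code is a hypothetical helper name, used only for this illustration.
sub percent_code {
    my ($char) = @_;
    # "%%" prints a literal percent sign; "%02X" prints two uppercase hex digits.
    return sprintf("%%%02X", ord($char));
}

# Show the decimal code and the encoded form side by side.
for my $char ("!", " ", "&", "?") {
    printf("'%s'  decimal %3d  ->  %s\n", $char, ord($char), percent_code($char));
}
```

The exclamation point comes out as %21 and the space as %20, matching the figures above.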

You'd think this would be simple, right? But in my search of the Web and perl documentation, I found a lot of information on decoding URLs and turning hex into decimal, but not the reverse of either.

For instance, if you want to decode a URL, you use a very simple search pattern:

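sub URLDecode {
    my $theURL = $_[0];
    # Plus signs stand for spaces; convert them before decoding %2B
    $theURL =~ tr/+/ /;
    # Turn each %XX pair back into the character with that hex code
    $theURL =~ s/%([a-fA-F0-9]{2})/chr(hex($1))/eg;
    # Strip HTML comments; non-greedy so text between two comments survives
    $theURL =~ s/<!--.*?-->//gs;
    return $theURL;
}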

This pattern takes the hex characters and decodes them back into real characters; the last substitution then strips any HTML comments from the result. The function hex() turns a hex number into decimal; there is no dec() that does the reverse in perl. The "e" at the end of the regexp means "evaluate the replacement pattern as an expression."
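As a small illustration of what the "e" modifier changes, the sketch below runs the same substitution with and without it. The sample string and variable names are made up for the example, and the plus sign is deliberately left alone so that only the effect of "e" shows.

```perl
use strict;
use warnings;

my $input = "Hello%2C+world%21";

# With /e, the replacement is run as perl code for each match.
(my $evaluated = $input) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

# Without /e, the replacement is just an interpolated string.
(my $literal = $input) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/g;

print "$evaluated\n";    # Hello,+world!
print "$literal\n";      # Hellochr(hex(2C))+worldchr(hex(21))
print hex("21"), "\n";   # 33
```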

After a lot of hunting around and installing the URI module for perl - which I could never get to do proper encoding despite man page instructions - I finally figured out how to do this myself:

Note: here's the revised version as of 11/7/01, with the help noted at the top of this page:

sub URLEncode {
    my $theURL = $_[0];
    $theURL =~ s/([\W])/"%" . uc(sprintf("%2.2x",ord($1)))/eg;
    return $theURL;
}

The missing piece was the sprintf formatting. The string "%x" means, "take the input and turn it into a hexadecimal character string." The ord() function converts a character into its ASCII code in decimal; the %2.2x format turns that into a two-digit hex number, padding with a leading zero where needed.
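Here is a short sketch of how the format string affects the output, followed by the same substitution applied to a sample string. The variable names and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# A tab is decimal 9, so it shows the difference padding makes.
my $plain  = sprintf("%x",    ord("\t"));       # "9"  (no padding)
my $padded = sprintf("%2.2x", ord("\t"));       # "09" (at least two digits)
my $upper  = uc(sprintf("%2.2x", ord("\n")));   # "0A" (newline is decimal 10)

print "$plain $padded $upper\n";   # 9 09 0A

# The same substitution as in URLEncode, applied to a sample string.
(my $encoded = "a b&c=d!") =~ s/(\W)/"%" . uc(sprintf("%2.2x", ord($1)))/eg;
print "$encoded\n";                # a%20b%26c%3Dd%21
```

Because \W matches everything except letters, digits, and the underscore, this version also encodes characters such as the hyphen and the period.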

The reason there's no dec() function in perl is, ostensibly, that %x exists in printf. However, given perl's lifeblood of multiple ways to accomplish everything, it's surprising that such a simple feature exists in only one specific form, and one that's hard to find.

I hope this page helps someone save some time and sanity.

(For the historical record, the original script looked like this:

sub URLEncode {
    my $theURL = $_[0];
    my $MetaChars = quotemeta( ';,/?\|=+)(*&^%$#@!~`:');
    $theURL =~ s/([$MetaChars\"\'\x80-\xFF])/"%" . uc(sprintf("%2.2x", ord($1)))/eg;
    $theURL =~ s/ /\+/g;
    return $theURL;
}

)