
Why do regexes use `$` and `^` as line anchors?

March 25, 2024

A history that will satisfy nobody.

Next week is April Cools! A bunch of tech bloggers will be writing about a bunch of non-tech topics. If you’ve got a blog come join us! You don’t need to drive yourself crazy with a 3000-word hell essay, just write something fun and genuine and out of character for you.

But I am writing a 3000-word hell essay, so I’ll keep this one short. Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?

This talk brings up that they first appeared in Ken Thompson’s port of the QED text editor. In his manual he writes:

b) “^” is a regular expression which matches the null
character at the beginning of a line.

c) “$” is a regular expression which matches the null
character before the character <nl> (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.
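To make the manual's definitions concrete, here is a small sketch of how those anchors behave in a modern descendant. It uses Python's `re` module with `re.MULTILINE`, which is an assumption about the closest present-day equivalent, not a reproduction of QED itself. The sample text and variable names are made up for illustration.

```python
import re

text = "first line\nsecond line\nthird line"

# With re.MULTILINE, ^ matches the empty string at the start of each line,
# and $ matches the empty string just before each newline (and at the end
# of the text). Neither one consumes a character.
starts = [m.start() for m in re.finditer(r"^", text, flags=re.MULTILINE)]
ends = [m.start() for m in re.finditer(r"$", text, flags=re.MULTILINE)]

# Anchoring a pattern pins it to one end of the line.
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)
last_words = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts)       # [0, 11, 23]
print(ends)         # [10, 22, 33]
print(first_words)  # ['first', 'second', 'third']
print(last_words)   # ['line', 'line', 'line']
```

The positions in `ends` are the offsets of the two newlines plus the end of the string, which lines up with the manual's "before the newline, usually at the end of a line" wording.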

Okay, but then why did Ken Thompson choose those characters?

I’ll sideline ^ for now and focus on $. The original QED editor didn’t have regular expressions. Its authors (Butler Lampson and Peter Deutsch) wrote an introduction for the ACM. In it they write:

Two minor devices offer additional convenience. The character “.” refers to the current line and the character “$” to the last line in the buffer.

So $ already meant “the end of the buffer”, and Ken adapted it to mean “the end of the line” in regexes.
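Here is a rough sketch of what that addressing convention amounts to. It is not QED's actual implementation: `resolve_address` is a hypothetical helper, and the 1-based line numbering is an assumption borrowed from how ed-style editors are usually described.

```python
def resolve_address(addr, current, buffer):
    """Turn a QED-style address into a 1-based line number.

    Hypothetical helper for illustration only:
    "." means the current line, "$" means the last line in the buffer,
    and anything else is treated as a literal line number.
    """
    if addr == ".":
        return current
    if addr == "$":
        return len(buffer)
    return int(addr)


buffer = ["alpha", "beta", "gamma", "delta"]
current = 2  # assume the editor is sitting on line 2

for addr in [".", "$", "3"]:
    line_no = resolve_address(addr, current, buffer)
    print(addr, "->", line_no, buffer[line_no - 1])
```

With `$` already standing for "the last line" in addresses, reusing it for "the end of the line" inside a pattern is a small step.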

Okay, but then why did Deutsch and Lampson use $ for “end of buffer”?

Things get tenuous

The QED paper mentions they wrote it for the SDS-930 mainframe. Wikipedia claims (without references) that the SDS-930 used a Teletype Model 35 as its input device. The only information I can find about the Model 35 is this sales brochure, which has a blurry picture of the keyboard.

I squinted at it really hard and saw that it’s missing the []{}\|^_@~ symbols. Of the remaining symbols, $ is by far the most “useless”: up until programming it exclusively meant “dollars”, whereas even something like # meant three different things. But also, $ is so important in business that every typewriter has one. So it’s a natural pick as the “spare symbol”.

Yes this is really tenuous and I’m not happy with it, but it’s the best answer I got.

If we’re willing to stick with it, we can also use it to explain why Ken chose ^ to mean “beginning of line”. ^ isn’t used in American English, and the only reason QED wasn’t using it was that it wasn’t on the Teletype Model 35. But Ken’s keyboard did have ^, even though it wasn’t standardized at the time, so he was able to use it.

(Why did it have ^? My best guess is that ASCII-67 included it as a diacritic and keyboards were just starting to include all of the ASCII characters. The Teletype 35 brochure says “it follows ASCII”, but the keyboard doesn’t include many of the symbols; it just uses the encoding format.)

So there you have it, an explanation for the regex anchors that kinda makes sense. Remember, April Cools next week!

