Regexes in Clojure

June 03, 2014

Summary: With a few functions from the standard library, Clojure lets you do most of what you want with regular expressions with no muss.

Clojure is designed to be hosted. Instead of defining standard regular expression semantics that work on all platforms, Clojure defers to the host's semantics. On the JVM, you're using Java regexes. In ClojureScript, it's JavaScript regexes. That's the first thing to know.

Other than the semantics of the regexes themselves, the API is standardized across all platforms in the core library. And the syntax is convenient because you don't need to double-escape your special characters: inside a regex literal, a single backslash is enough.

Literal representation

Regexes can be constructed in Clojure using a literal syntax. Strings with a hash in front are interpreted as regexes.

#"regex"

On the JVM, the above line will create an instance of java.util.regex.Pattern. In ClojureScript, it will create a RegExp. Remember, the two regular expression languages are similar but different.
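To make the escaping point concrete, here is a small sketch that writes the same pattern twice: once as a literal and once as a string handed to re-pattern (covered under Other functions below). The names digits-literal and digits-from-string are made up for this illustration.

```clojure
;; In a regex literal, one backslash is enough.
(def digits-literal #"\d+")

;; In an ordinary string, the backslash itself must be escaped.
(def digits-from-string (re-pattern "\\d+"))

(re-find digits-literal "order 66")     ;; => "66"
(re-find digits-from-string "order 66") ;; => "66"
```

Both forms describe the same pattern; the literal is just easier to read and type.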

Matching (with groups)

There is a nice function that matches the whole string. It is called re-matches. Its return value takes one of three forms, depending on the outcome. If the whole string does not match, it returns nil, which is nice because nil is falsey.

=> (re-matches #"abc" "zzzabcxxx")
   nil

If the string does match, and there are no groups (parens) in the regex, then it returns the matched string.

=> (re-matches #"abc" "abc")
   "abc"

If it matches but there are groups, then it returns a vector. The first element in the vector is the entire match. The remaining elements are the group matches.

=> (re-matches #"abc(.*)" "abcxyz")
   ["abcxyz" "xyz"]

The three different return types can get tricky, but in general I do have groups, so it's either a vector or nil, which is easy to handle. You can even destructure it before you test it.

;; full-name is assumed to be bound to a string
(let [[_ first-name last-name] (re-matches #"(\w+)\s(\w+)" full-name)]
  (if first-name ;; successful match
    (println first-name last-name)
    (println "Unparsable name")))
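The same idea can be wrapped in a function. This is only a sketch: parse-name is a hypothetical helper, not something from the standard library, and it uses when-let so the nil case falls out on its own.

```clojure
(defn parse-name
  "Hypothetical helper: returns [first-name last-name], or nil when
  full-name is not exactly two words separated by one whitespace character."
  [full-name]
  ;; when-let tests the whole return value of re-matches before destructuring,
  ;; so a failed match simply yields nil.
  (when-let [[_ first-name last-name] (re-matches #"(\w+)\s(\w+)" full-name)]
    [first-name last-name]))

(parse-name "Ada Lovelace") ;; => ["Ada" "Lovelace"]
(parse-name "Cher")         ;; => nil
```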

Matching substrings

re-matches matches the whole string. But often, we want to find a match within a string. re-find returns the first match within the string. The return values are similar to re-matches.

No match returns nil

=> (re-find #"sss" "Loch Ness")
   nil

Match without groups returns matched string

=> (re-find #"s+" "dress")
   "ss"

Match with groups returns a vector

=> (re-find #"s+(.*)(s+)" "success")
   ["success" "ucces" "s"]
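Putting re-matches and re-find side by side on one input shows the difference between matching the whole string and finding a match inside it. The string sign-text is invented for this sketch.

```clojure
(def sign-text "room 101, floor 3")

;; re-matches needs the entire string to match, so this fails.
(re-matches #"\d+" sign-text)      ;; => nil

;; re-find returns the first match anywhere in the string.
(re-find #"\d+" sign-text)         ;; => "101"

;; With groups, re-find returns the whole match followed by each group.
(re-find #"(\w+) (\d+)" sign-text) ;; => ["room 101" "room" "101"]
```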

Finding all substrings that match

The last function from clojure.core I use a lot is re-seq, which returns a lazy seq of all of the matches, not just the first. The elements of the seq are whatever type re-find would have returned.

=> (re-seq #"s+" "mississippi")
   ("ss" "ss")
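When the regex has groups, each element of the seq is a vector, just like re-find would return. A minimal sketch, using a made-up key=value string, of turning those vectors into a map:

```clojure
(def pairs (re-seq #"(\w+)=(\d+)" "a=1, b=22, c=333"))
;; pairs => (["a=1" "a" "1"] ["b=22" "b" "22"] ["c=333" "c" "333"])

;; Destructure each match vector, ignoring the whole-match element.
(def settings
  (into {} (for [[_ k v] pairs] [k v])))
;; settings => {"a" "1", "b" "22", "c" "333"}
```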

Replacing regex matches within a string

Well, matching strings is cool, but often you'd like to replace a substring that matches with some other string. clojure.string/replace will replace all substring matches with a new string. Let's take a look:

=> (clojure.string/replace "mississippi" #"i.." "obb")
   "mobbobbobbi"

This function is actually quite versatile. You can refer directly to the groups in the replacement string, using $1 for the first group, $2 for the second, and so on:

=> (clojure.string/replace "mississippi" #"(i)" "$1$1")
   "miissiissiippii"

You can also replace with the value of a function applied to the match:

=> (clojure.string/replace "mississippi" #"(.)i(.)"
     (fn [[_ b a]]
       (str (clojure.string/upper-case b)
            "--"
            (clojure.string/upper-case a))))
   "M--SS--SS--Ppi"

You can replace just the first occurrence with clojure.string/replace-first.
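For comparison, here is the earlier "obb" replacement run through both functions. The names first-only and every-match exist only for this sketch.

```clojure
(require 'clojure.string)

;; Only the first "i.." match is replaced.
(def first-only (clojure.string/replace-first "mississippi" #"i.." "obb"))
;; first-only => "mobbissippi"

;; Every match is replaced.
(def every-match (clojure.string/replace "mississippi" #"i.." "obb"))
;; every-match => "mobbobbobbi"
```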

Splitting a string on a regex

Let's say you want to split a string on some character pattern, like one or more whitespace. You can use clojure.string/split:

=> (clojure.string/split "This is a string    that I am splitting." #"\s+")
   ["This" "is" "a" "string" "that" "I" "am" "splitting."]
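clojure.string/split also accepts an optional third argument that caps the number of pieces. A short sketch with an invented input; the outputs shown are what you get on the JVM.

```clojure
(require 'clojure.string)

;; With a limit of 2, the string is split at most once;
;; the remainder is left untouched in the last element.
(def two-parts (clojure.string/split "alice smith jones" #"\s+" 2))
;; two-parts => ["alice" "smith jones"]

;; Without a limit, every separator is used.
(def all-parts (clojure.string/split "alice smith jones" #"\s+"))
;; all-parts => ["alice" "smith" "jones"]
```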

Nice!

Other functions

Those are all of the functions I use routinely. There are some more, which are useful when you need them.

re-pattern

Construct a regex from a String.
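This is handy when part of the pattern is only known at runtime. A minimal sketch: whole-word is a hypothetical helper, and it assumes the word contains no regex special characters. Note the doubled backslashes, since the pattern is an ordinary string here.

```clojure
(defn whole-word
  "Hypothetical helper: builds a regex that matches word only as a whole word.
  Assumes word contains no regex metacharacters."
  [word]
  (re-pattern (str "\\b" word "\\b")))

;; The plain pattern also matches the tail of "concat".
(count (re-seq #"cat" "concat the cat"))              ;; => 2

;; The word-boundary version matches only the standalone word.
(count (re-seq (whole-word "cat") "concat the cat"))  ;; => 1
```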

re-matcher

This one is not available in ClojureScript. On the JVM, it creates a java.util.regex.Matcher, which is used for iterating over subsequent matches. This is not so useful since re-seq exists.

If you find yourself with a Matcher, you can call re-find on it to get the next match (instead of the first). You can also call re-groups on it to get the groups from the most recent match. Unless you need a Matcher for some Java API, just stick to re-seq.
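For the curious, here is a JVM-only sketch of stepping through a Matcher by hand. The input string and the name matcher-steps are invented for the example.

```clojure
;; JVM only: a Matcher is stateful, so each re-find call advances it.
(def matcher-steps
  (let [m            (re-matcher #"(\w)(\d)" "a1 b2 c3")
        first-match  (re-find m)    ; advances to the first match
        same-match   (re-groups m)  ; groups of the most recent match, no advance
        second-match (re-find m)]   ; advances to the next match
    [first-match same-match second-match]))
;; matcher-steps => [["a1" "a" "1"] ["a1" "a" "1"] ["b2" "b" "2"]]
```

The same matches come out of (re-seq #"(\w)(\d)" "a1 b2 c3") without any mutable state, which is why re-seq is usually the better choice.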

Conclusion

Well, that's regexes as I use them. They're super useful and easy to use in Clojure once you get the hang of them.

If you're interested in learning the fundamentals of Clojure, may I suggest my own LispCast Introduction to Clojure video series. It guides you through a deep experience of the language. You'll learn REPL skills, how to set up a project, and how to develop a DSL, all in a fun, interactive way.
