Arc Forum | There's nothing joyful about regular expressions. For one thing, as above, it le...

Arc Forum

1 point by partdavid 6690 days ago | link | parent

There's nothing joyful about regular expressions. For one thing, as above, it leaks logic all over your code. Secondly, it's unclear--it would be quite difficult to identify a bug in your expression. A related clarity problem is that you have restricted inputs to a subset of valid inputs, and it's hard to see how or why. Third, they are brittle and hard to instrument for diagnostics.

1 point by lojic 6689 days ago | link

I don't see how it "leaks logic all over your code". But I like to keep an open mind - what are you suggesting as an alternative for the above example?

Regarding the difficulty in identifying a bug in the expression, I tend to agree. That's why I have a lot of unit test cases for each meaningful regular expression.

Regarding restricting the inputs to a subset of valid inputs, which inputs would you like to accept that the regex rejects (for U.S. currency only)? I haven't had any complaints yet, but that's not to say it won't happen in the future.

-----

1 point by partdavid 6680 days ago | link

1) If you have capture patterns, you have code in one place dependent on the expression in another without the coupling being clear. More mildly, you are married to regular expression operators because you have direct references to your (regular-expression-defined) subprogram all over. You can't decide not to make it a regular expression; or to make it two, or make it a much clearer expression and a programmatic dress-up. The alternative is to not use regular expressions.

2) Eh, unit tests can catch the mistakes you anticipated making. Lots of other mistakes are possible. Why write a complex regular expression and page full of unit testing code for it when you could write more straightforward logic.

3) Like I said, it's not clear why you have the expressions you do.

I'm not saying regular expressions never ever have their place. In particular, they can be a convenient method to offer users to specify search and validation patterns and that kind of thing. But fixing program logic into them is a bad idea.

Now, if it's inconvenient or inefficient to express that textual extraction in some way other than regular expressions, I'm suggesting that is a failure of the language (for example, because pattern matching is weak or specifying grammars is cumbersome), not a point for recommending regular expressions.

-----

2 points by lojic 6665 days ago | link

Sorry, I just now saw this. The Arc forum makes it darn near impossible to realize someone has replied to an older item :(

I think an example of what you're talking about would be great. If you have a better way to validate textual data than regular expressions, then naturally I would want to know about it.

Here's a few regular expressions I've collected. I realize they're not perfect (e.g. the email regex), but they're good enough for my purposes presently.

    REGEX_EMAIL     = /^([-\w+.]+)@(([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})|(([-\w]+\.)+[a-zA-Z]{2,4}))$/
    REGEX_FLOAT     = /^([0-9]+|[0-9]{1,3}(,[0-9]{3})*)?(\.[0-9]*)?$/
    REGEX_INTEGER   = /^([0-9]+|[0-9]{1,3}(,[0-9]{3})*)$/
    REGEX_ISO_8601  = /^(\d{4})-(\d{2})-(\d{2})[Tt](\d{2})[-:](\d{2})[-:](\d{2})[Zz]$/
    REGEX_MONEY     = /^(\$[ ]*)?([0-9]+|[0-9]{1,3}(,[0-9]{3})*)?(\.[0-9]{0,2})?$/
    REGEX_PHONE     = /^(\((\d{3})\)|(\d{3}))[-. ]?(\d{3})[-.]?(\d{4})[ ]*([^\s\d].{0,9})?$/
    REGEX_SSN       = /^(\d{3})-?(\d{2})-?(\d{4})$/
    REGEX_ZIP_CODE  = /^(\d{5})[- ]?(\d{4})?$/

So, what would you use to accomplish the same thing without regular expressions that is as concise? The regular expressions allow an easy way to both validate user input and parse it via groups. They're declarative vs. imperative. I have these in a Ruby module with associated functions for parsing (which primarily uses the groups) etc., so they're encapsulated together.

I think you mentioned you're an Erlang programmer, so how would the non-regex Erlang code look to validate an email address corresponding to the REGEX_EMAIL expression above?

-----

1 point by partdavid 6654 days ago | link

Ah, you're right, it's a bit hard to see when folks have replied. Yes, I'm an Erlang programmer.

In response to your question, I don't accept your premise that replicating a particular regular expression is a real programming task. You say your email regular expression isn't perfect, but it's not clear to me why you chose those particular set of restrictions beyond what's defined in the RFC--so it's a little hard for me to replicate (for example, the local-part and domain of the address can have a more kinds of characters that what you have defined).

Instead, I'll offer this as a non-equivalent but interesting comparison. I've elided the module declarations (as have you), including the imports that allow some of these functions without their module qualifiers:

  email(S) ->
     [User, Domainp] = tokens(S, "@"),
     {User,
      case {address(Domainp), reverse(tokens(Domainp, "."))} of
         {{ok, Addr}, _} -> Addr;
         {_, RDomain = [Tld|_]} when length(Tld) >= 2,
                                     length(Tld) =< 4 ->
            join(reverse(RDomain), ".")
      end
     }.

I don't know how the terseness of this compares with your example, given that it includes some things that yours doesn't (a way to call it, a format for the return value rather than the capture variables). Terseness, of course, in the pg sense of code tree nodes, whatever they are. :)

The Erlang function above returns a tuple of the local-part and the domain part and throws an error if it can't match the address. If this were something I wanted to ship around to other functions or send to another machine or store in a database table or something, I would have email/1 return a fun (function object) instead.

If either one of us wanted something better than what we have (or even if we don't--it seems like coming up with The Right Thing To Do With Email Addresses is worth a bit of time to do only once) I would write a grammar. The applicable RFC 2822 more or less contains the correct one, which is only a few lines.

At the "low" end of text processing power, there are basic functional operations on strings and lists, and at the "high" end there are grammar specifiers and parser generators. In the band in between lives regular expressions, and I am not convinced that that band is very wide. I like regular expressions (and, indeed, I would like it very much if Erlang had better support for them) but for me they are a specialized tool, particularly useful (like wildcard globbing) for offering as an input method to users.

But they aren't a general solution to every kind of problem, and for that reason I don't think Arc or any other general-purpose language benefits from baking them into the basic syntax--they belong in a library.

-----