Hacker News new | past | comments | ask | show | jobs | submit login
Regex Golf (alf.nu)
285 points by subbz on Dec 20, 2013 | hide | past | favorite | 169 comments



  ^(?!(..+)(\1)+$)
Why does that work on primes? I got it by mistake when fiddling with the parenthesis locations but I was expecting to have to deal with xx separately.


Nice find. It works because it rejects "2 or more x's" repeated "2 or more times". So xx doesn't get rejected, but any multiple of that (xxxx, xxxxxx, ...) will be. The same way xxx doesn't get rejected, but any multiple of that (xxxxxx, xxxxxxxxx, ...) will be.

You've solved it using the actual definition of prime numbers, no trickery needed. Well played.

FYI, you don't need brackets around the \1, so can score 286.


More interesting than the definition of primes, it's almost the definition of multiplication that is embedded in this regex. We have two numbers(of occurrences) being multiplied:

- the first one is represented by the group (..+) it represents the number of occurrences n between 2 and +∞

- the second one is represented by (\1)+. We will repeat the first number m times, between 1 and +∞ times.

So the result of the multiplication is n*(m+1), which cannot be a prime. We just have to take the opposite with negative lookahead. It's very beautiful indeed.

See http://regex101.com/r/qN2fQ8 or http://www.regexper.com/#^%28%3F!%28..%2B%29\1%2B%24%29 to follow the above explanations.


Thanks for the web site links! Both are pretty interesting and I actually learned something from the detailed description(s) that regex101 provides.

(I learned that for (\1)+, "Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data")


Later, I realized that in a regular program / on the command line, the negative lookahead can be avoided by using the !~ (doesn't match) operator,i.e., we just check that the number is not a non-prime using a simpler regex:

  DB<63> print "Matches!" if (("x" x 31) !~ /^(..+)\1+$/)
Matches!

  DB<64> print "Matches!" if (("x" x 18) !~ /^(..+)\1+$/)
The Regex Golf site only asserts matches, i.e., it's using =~. That's why the negation using negative lookahead was needed.

(The simpler regex merely looks for non-primes by matching any number of characters which are a multiple of two numbers, n x m, i.e., those which can be factorized. n comes from (..+), m comes from \1+).


Thanks, I was going for the definition of primes but was just going a bit loopy about the negation. I didn't want to trim surplus characters until I understood it but yes I can see those brackets are unnecessary.


This is amazing.


Shouldn't the objective be to get the lowest score if it's called "golf?"


The goal in code golf is lower character count, not lower score - which is also the case here.


Therefore, the score it provides should be a running tally of how many characters you've used to make it match the scoring system of golf; which is the lowest number of strokes wins.

The scoring system for this is incremental which is the opposite of golf.

A proper scoring system with this would provide a character limit (par) for each section and the goal would be to write a shorter regex formula to complete the task. Final score would be how many characters under or over the total character limits (course par) you scored.

Seems this is more like Regex Darts or something like that.

But it's fun nonetheless.


I don't get what this is all about.

The title was (most probably) derived from Code Golf, which is a competition in coding something with as few characers as possible. Code Golf was derived from Golf, where you want to use as few turns as possible.

The score going up and not down, which is done because you get more points the more objectives you fulfill, does not change the objective of this game or what it is based on.


> The score going up and not down, which is done because you get more points the more objectives you fulfill, does not change the objective of this game or what it is based on.

Golf scoring penalizes you with more "points" by how many strokes you take.

If you are playing a Par 4 and it takes you 6 swings to get in the cup you just got _penalized_ +2

If your partner gets in the cup in 2 swings he is awarded -2

Therefor "Regex Golf" has its scoring reversed


From my quick read, Code Golf uses a similar scoring system as to the sport of golf, the lower the score the better.

This game is the opposite, meaning the higher the score the better.

This has nothing to do with the objectives of the games in question, just the scoring method.

To make the sport of golf have a similar scoring system as this game you would grant ten points for every stroke under par, deduct ten points for every stroke over par, and deduct one point for every penalty stroke.


Alternatively, it's a loose analogy and the main point of the game is to save keystrokes (with the positive scoring system added to reward partial effort; if it came down to a scored competition the leaders would anyway all probably have full credit for solving the problem).

Regex golf is to code golf as paintball golf is to normal golf. Or something.


The golf part is that you can only get the most points by writing the shortest regex possible. However, I think the game avoids just giving a score based on the number of characters. To do that, you would be required to match all on the left and none on the right (the equivalent of actually getting the golf ball in the hole). This game lets you pick up your ball whenever you want and scores you based on distance to the hole. For the golfers, think of it more like a closest-to-the-hole competition than the more common stroke play.


I have the most trouble with regex, and something about seeing my matches as I type made this incredibly useful for me!


The best online tool I've seen is http://www.debuggex.com/


I'll add http://www.regexper.com/ to the various sites everyone posted, I find that this one output the nicer graph.


http://regexpal.com/ or for a paid desktop program http://www.regexbuddy.com/, which is most excellent


Regexbuddy is great. Well worth the coin I paid way back when


I've used the previously mentioned regexpal quite often. http://www.rubular.com is another one for Ruby only.


While people are recommending online regex tools.. http://regex101.com/


Many times I would match the exact opposite opposite of what I wanted. Is there a general rule for inverting regexps? ^ and ?! don't seem general purpose.


You need to pin the match to the start and end of the string with ^ and $ respectively otherwise the negation just matches an empty string or other irrelevant string.


Just to follow my thought pattern.

I start with

(.)(.)\2\1

Now to invert I change it to

(?!(.)(.)\3\2)

As you say, the negation matches an empty string. So I put on anchors:

^(?!(.)(.)\3\2)$

But now it matches nothing. I'm not even sure what that regexp says. The entire string is a negation? Would anything match that regexp?

I see elsewhere on this page that the answer involves putting in an extra dummy character, putting that new negation-and-dummy-character in parens, and then requiring that, between the anchors, there be 0-or-more of negation-and-dummy-character.

^((?!(.)(.)\3\2).)∗$

Two questions:

1. Why are my backrefs still \3 and \2? I added another pair of parens. (I thought ?! might not count, but it counted in my second example above.

2. Why does abba no longer match? It has no matches to the negation-and-dummy character construct, which ∗ should match, right?

NB: I used ∗ as my asterisk to avoid bb-code.


  >  ^(?!(.)(.)\3\2)$
  >  But now it matches nothing.
Things like (?!..) are called assertions. I like to think of assertions as patterns that match the space between two characters. So (?!a) means that it matches the space, where the next character is not 'a'.

So in this case, your regex matches an empty string (since it doesn't match any character at all between ^ and $, only spaces.)

  >  ^((?!(.)(.)\3\2).)∗$
  >  1. Why are my backrefs still \3 and \2?
Well, you're referencing the second and third opening parenthesis.

  >  but it counted in my second example above
Maybe that's because your second example is wrong? I mean, I don't really know what you are trying to achieve in your second example.

By the way, my solution for abba is

  ^(?!.*(.)(.)\2\1)
I guess you'll will be able to figure out what this regex means by now :)


Remember

  (?!....) 
is a lookaround, so it matches 0 characters. So

  ^(?!....)$ 
will only match an empty string! (Think about it). Try

  ^(?!....$) 
instead (the $ anchor inside the lookaround)


  ^(?!.*(.)(.)\2\1)
You may also need to fill in the places where it could be anything. The above worked on abba for me.

BTW a double space indent then formats as code on HN I think.


Your approach scores higher, but it only matches the (imaginary) space before the "good" words.

I went with:

^(?!.(.)(.)\2\1).$

with the thought that if I really wanted those matches I would want the whole strings.

Fun game!


oops,

  ^(?!.*(.)(.)\2\1).*$
With proper formatting hopefully.


Yeah, I couldn't figure out how to invert the abba one (.)(.)\2\1 without help.


I figured out Ranges!

  abac|accede|adead|babe|bead|bebed|bedad|bedded|bedead|bedeaf|caba|caffa|dace|dade|daff|dead|deed|deface|faded|faff|feed
Edit: /s


Prime:

    ^x{2,3}$|^x{5}$|^x{7}$|^x{11}$|^x{13}$|^x{17}$|^x{19}$|^x{23}$|^x{29}$|^x{31}$|x{33}


josephlord's solution is much nicer:

^(?!(..+)(\1)+$)


Look at the parent.


[a-f]{4,} works as well... but I do like your solution!


[a-f]{4} is shorter.


  ^[a-f]*$
Is the same length too.


Glob (333) without cheating (replace ⁕s with asterisks, they get turned into italics):

^(\⁕?)(\w⁕)(\⁕?)(\w⁕)(\⁕?)(\w⁕) .⁕ ((.(?!\1))+|\1)\2((.(?!\3))+|\3)\4((.(?!\5))+|\5)\6$

Edit: ((.(?!\1))+|\1) is used to conditionally match .+ iff a * has been found. .(?!\1) Matches any character if it is followed by \1. When * has been found then it matches no character, when * is not found it matches every character.

Edit 2: Formatting to avoid the *s becoming italics :/


  le[^*]|co|ito|dr|^p|su|gi|nr|hw|fa|[eo]b|ide
Glob 376 although it isn't pretty and better may be possible.

Indent two spaces with a blank line above to avoid code mangling.


Glob 380

  ^([wlpb]|c[hor]|do|re|mi|\*[pifvt]|\*er)


en(?!c)|rr|il|eat|^(do|p|b|c(?!a))

Glob 386 although not the cutest way to do it.


An interesting bit on the computational complexity of solving this problem (with a slightly different scoring function):

http://cstheory.stackexchange.com/questions/1854/is-finding-...


Hrm, on number 8 "Four" using:

    (.)(.*\1){3,}
I got all but the "do not match" for "Ternstroemiaceae"

The challenge appeared to be to match words with four instances of the same letter. "Ternstroemiaceae" contains four 'e's, and thus should be in the "match" column, instead of the "don't match" column, no? Did I miss something?


Look closer, there's something different about the matches and this word.



I like the idea, but words there seems to be pretty random. I can't figure out the pattern, not even talking about writing regex... :(


I got a 3 character pattern for the first puzzle and a 4 character pattern for the second puzzle (and reading a comment here I see there is a 2 char pattern for the second puzzle).

Only used 1 special char between them (anchoring something). So the first two sets have repeated 3 character sequences, and in one of them they follow an additional pattern.

For the third sequence, 8 character solution, lots of specials.


Read a) name of the puzzle b) the little help below the score. It gives out the pattern you should (not) look for.

edit: Not in Glob. What is Glob about?


It's about implementing * as a wildcard character:

https://en.wikipedia.org/wiki/Globbing

I haven't figured out how to solve this with the parts of ERE and PCRE that I know. (I definitely don't know the entirety of PCRE.)

It's straightforward for me to write a substitution using regular expressions to create a pattern-matcher for a given glob (just anchor the ends and replace literal ? with . and literal * with .*) but here we have to do it inside a single regular expression.

I don't think there's a BRE solution if the number of stars is unbounded because I don't think this is a regular language.


Gist with my answers: https://gist.github.com/jpsim/8057500

If you look at the revisions, you'll see my 1st iteration was mostly identifying patterns, then with more and more cheating (and looking at this thread) to squeeze every point possible.



  5.  193  ^(?!.*(.)(.)\2\1)
  11. 379  ^\*(er|[fiptv])|^([blpw]|c[hor]|do|mi|re)


What are the rules? That is, are these Perl regexes, POSIX regexes, …? (Come to that, what is this site? Going up one level to alf.nu gives me a lot of suggestions for what I can do by modifying the address, but no clue of who's doing it on my behalf.)


challenge: use machine learning to find the best solutions.

They might improve on those intended by exploiting accidental regularity in the corpus - though charmingly, the golf-cost of regex length helps combat this overfitting. They might also find genuinely cleverer solutions.


Reminds me of Vim Golf.

http://vimgolf.com/


Does anybody know how to do math with regex? (triples)

And conditionals don't seem to work?


From: http://quaxio.com/triple/

    ^([0369]|[258][0369]*[147]|[147]([0369]|[147][0369]*[258])*[258]|[258][0369]*[258]([0369]|[147][0369]*[258])*[258]|[147]([0369]|[147][0369]*[258])*[147][0369]*[147]|[258][0369]*[258]([0369]|[147][0369]*[258])*[147][0369]*[147])*$


The idea is as thus:

Remember division rules? If the sum of digits is divisible by 3, then the number is divisible by 3

So, if you have 0, 3, 6, 9, it's as you can remove them, the value won't change

Remaining are the 2, 3, etc digit numbers divisible by 3


569pts: ([^31]0|31|[017]2|[03]03|[^1]4|(900|01|7)5|6|[48]7|[57]8|09)$


Originally had 568, then saw this and improved. :)

580pts: 00(0$|3|6|9|12|15)|[^0]14|.53|^3[^38]|55|43|23|9.7


Nice. A smidge better at 582:

    5[54]|2[437]|00($|[369]|1[25])|^8[17]|^3[29]|9.7


One more smidgen, behold 584:

    5[54]|2[437]|00($|[369]|1[25])|^[83][1729]|9.7


586:

    ^[378][12479]|00($|[369]|1[25])|5[45]|2[347]


589: ^[378][12479]|00($|[369]|1[25])|55|2[347]


My guess is that 147 must appear the same number of times as 258, but I'm not sure if that's even expressible in regexp.


Refined: [0369] can appear any number of times [147] and [258] must appear an equal number of times, or for any remaining: [147] must appear a multiple of 3 times [258] must appear a multiple of 3 times


No. 111 is a multiple of 3.


Sadly that wouldn't work for: 140091876 147 = 4 times 258 = 1 time


I got 187 with this

  [369](?![7238])|(?:0)12
It's sloppy and certainly not correct, but a decent score.


^[01349][064] works for 197


Ternstroemiaceae contains four es; anyone else hit that issue / know what that's not in the valid results for Four? I'm guessing there's a pun in there that I didn't get :/


Fun learning tool for regex.


Practise, maybe. Learning, probably not so much. Certainly fun, though, if you can cope with them!


Learning, kinda. I've often been motivated to learn something after being given a problem that can only be solved (or be solved much more easily) using it. Every other Excel trick I know is a result of that.


I like the concept but the word choice doesn't seem too "regular", it is more about catching all the particulars rather than finding a pattern as far as I can tell.


There are good patterns, the first regex for me was /foo/, the second /ick$/, for example. They're there.


What purpose do the "plain strings" serve?


I got Four for 196 with

    (.).*\1.\1.*\1
and Order for 156 with

    ^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$


Order for 198:

   ^[^o].....?$
Probably not what was wanted, but it works (or maybe it was to trick people onto a false path)


This right here is why test cases are always blackboxed...


Makes my similarly off-track score of 184 seem foolish:

  ^[a-g][^airn]|[fs].*y$|ot$


Nice, I earned 197 with: ^[a-m].{1,5}$

I must say Regex golf is a lot of fun.


Nice, turn it around for 199:

  ^.{5}[^e]?$


or ^[^o].{1,5}$

198 points


Balance (194)

  ^<.*>$|^<<.*>>$|^<<<.*>>>$


Balance (286)

  ^(<(<(<(<(<(<<>>)*>)*>)*>)*>)*>)*$
If we had named captures:

  ^(((?<P><)|(?<-P>>))*(?(P)(?!)))$


Balance (288) with one bad match: ^(<(<(<(..)>)>)>)$


Balance (212)

[<>]{28}


Four: 199

    (.)(.\1){3}
Order: 168

    ^a*b*c*d*e*f*g*h*i*l*m*n*o*p*r*s*t*w*y*z*$
(You just didn't remove unused letters.)


Order for 195

    ^[a-g]*[^a-g]*$


Order for 170.

  ^([ab][c-gio][^d]|c[ehlo]|d|f[il]|gh|mo)
Not as clever as you guys though!


Order for 196:

    ^[a-i]*[l-z]*$


Or just (.).\1.\1.\1 for Four (198)


So long as we are going down this route, you can shave off another character with (.)(.\1){3} (199).


Can chop out one check to save a character:

(.)...\1.\1


For Abba...

Why doesn't (.)(.)\2[^\1] work?

I thought backreferences matched the captured literal, so negating it would match? But this looks the same as (.)(.)\2\1...


You cannot negate a capture, only a character literal. A capture might only be a character but it is NOT a literal,


Gotcha. That makes perfect sense. Thank you! :)


You got it :-)


(.)(.)\2\1 ?


[deleted]


Anchors (208):

  k$
Ranges (202):

  ^[a-f]+$
Backrefs (199):

  (\w{3}).*\1


Backrefs (201):

(...).*\1


Wait, so what's with the 'point system' here? Why's shdon answer for backrefs 199 and yours 201? (and while you're at it, could you briefly explain (...).*\1 for the regex newbies?)


Regex explained:

    (...) # Match exactly 3 (the dots) characters and save them as a group (the parenthesis)
    .* # Match any character (the dot) 0 or more times (the asterisk)
    \1 # Reuse the first group


backreferencing is new to me, but why this: (...).*(...)

isn't working !! ain't I back referencing the first group which is (...) ?


It isn't working because (...) is just matching 3 characters. What your regex is saying, is basically: Match any 3 characters, then any character 0 or more times, then any 3 characters.

You're reusing the regex and not the match. Back referencing essentially means that you'll match whatever is matched the first time, rather than reusing the regex pattern.

I hope that made a little sense. Grouping, capturing, matching and backreferences is part of what makes regex tricky and very powerful. You might want to use some of the online tools, which can help explain the regex visually.


thanks for clearing that for me, I figured it out my self after giving it sometime.


"You are deducted one point per character you use, and ten if you match something you shouldn't."


I got ranges at 202 with

  [a-f]{4}
198 on plain strings with

  .\*foo[lt]?.*
Then I stopped trying.


Plain string: just 'foo' works.


fml


Indent your regexes by two spaces and precede them with a blank line, that will preserve asterisks ;)


This will earn you some more points: Anchors: k$ Ranges: [a-f]{4}


"sick" :) one for anchors. I was being quite complex with \w+(ick)$ Maybe I should read from right to left.

That ranges one was pretty obvious though. Stuck at abba.


Why publish them?

You only need 1 range.


that's twice as long as it has to be.. not that it matters.


Glob: 277, it didn't match all and not match the others, but it's reasonably high. ^([bcdlmpwr]|\*[efptv])


378:

  ^(\*(er|[fiptv])|b|c(?!a)|do|le|mi|p|re|w)


379:

  ^\*(er|[fiptv])|^([blpw]|c[hor]|do|re|mi)
and it has "do re mi" in it!


380:

  ^.[^bds].*[^e-kjotz] .* [^eiz].+[^lx]..$


390:

  ^[lwp]|fa|r[ro]|[isytd]l|c$|de


390:

    de|eat|fa|rr|ow|[rl]o|^p|[cd]$


Glob 392: ^p|c$|[wrbc][npbro]|ai|fa|il


Glob 396: ai|c$|ep|[bcnprw][bnopr]


Glob 397: ai|c$|^p|[bcnrw][bnopr]


For "6. A man, a plan", I thought it was impossible to match palindromes with regex, am I wrong?


It gives you hints below the score, for 6 it's:

  You're allowed to cheat a little, since this one is technically impossible.
No idea how you cheat...

EDIT, SPOILERS: I get 170 with ^(.?)(.)(.).?\3\2\1$


176 with ^(.)(.).*\2\1$


  ^(.)[^p].*\1$           # 177, "cheat a little"


You're pretty much correct. True regular expressions don't have the expressive power to decide whether arbitrary-length strings are or aren't palindromes.

That said, 'regular expressions', as used in most programming languages, have extensions that extend the expressive power. One such extension is the matched group & backreference, used in other commentor's answers.

From a theoretic stance, these aren't really 'regular expressions', but that's what we call them in practice.


Seems very very very broken. The regex "[a]" apparently matches "crenel"


It doesn't. The green ✔ indicates that you've completed the task successfully — that is, your regex does not match "crenel".


It's a bit confusing. Ticks on the left means it did match, ticks on the right means it didn't


I hit enter and nothing happens.


Use the yellow box, not the name box that gets your focus when you land on the page.

This confused me for a few minutes.


what's the pattern on "Abba"? I thought it was just to exclude doubled letters but I have doubles on two words on the left hand side as well (noisefully and effusive, in case the word lists are the same)


Excluding doubled consonants that have the same vowel on both sides of the pair:

  ^((?!([aeiou])([^aeiou])\3\2).)*$
The following also works for the testcases and is shorter:

  ^((?!(.)(.)\3\2).)*$


Can get 2 more points with:

  ef|^((?!(.)\2).)*$
But I reckon yours wins for having a pair of breasts in the middle


well, if you are strictly interested in matching with the fewest characters, regardless of the apparent pattern:

  ^(.)(.).*\2\1$


Your anchors at the end make it match nothing.


sorry, misreplied. This was for palindromes.


5. Abba (^unv|.u|z|ph|mi|st|vi|tan)

haha


am I being a bit slow? Why doesn't [^g-z] work on 'ranges'?


Because for example "beam" matches [^g-z]. That is, it has a letter somewhere that is not between g and z (namely e and a). I came up with ^[a-f]+$ but I'm guessing it could be shorter :)

Edit: ah, every word in left column has 4 a-f letters. So [a-f]{4} is a shorter match.


ah you're right :) I was being slow, what I thought I wrote was 'words consisting only of letters that arent g-z'


Another dumb question - why does 'beam' not match [bdf][ae] ?


It's complaining because it does match.


Oh. I told you it was a dumb question.


Me and a coworker totaled 3079 points. Anyone beat it?


Got hooked and to 3202, but ca 10-20 more points could be gained according to answers here and there: https://gist.github.com/jonathanmorley/8058871

Kudos to the author of the game, good job.

PS. http://regexcrossword.com/


Doesn't seem to do anything on an iPad...


Why is Kesha an optimal answer? (#2 k$) ;)


Spoiler alert (201 points): f[ao][no]


You can just write: foo


Oh, there's more than one step, I see.


(.+|)foo(.+|)

Lol so awesome this. I oblige. Much love!


Powers:

    ^((((((((((x)\10?)\9?)\8?)\7?)\6?)\5?)\4?)\3?)\2?)\1?$
I feel like there's gotta be a sneakier way of doing this.


Completely not sneaky at all, but does reasonably well at 71 points:

  ^(x|xx|xxxx|x{8}|x{16}|x{32}|(x{64})+)$


Mine is a bit shorter, though a bit more "meh" as well.

    ^x{32}$|^(x{2}){1,8}$|^(x{64})+$|^x$


improved, gives 80

    ^(x|(xx){1,9}|x{32}|(x{64})+)$


I tried to get rid of those redundant ^ and $ but it somehow didn't work. I probably forgot to put it all inside one group.


They are not redundant. Any string of one or more exes is going to constrain a substring of exes of length 2^n (consider n=0 for a trivial proof), so you do need those anchors!


I meant my previous post where I had them in every single possibility.


I did similarly, but managed to squeeze out a few more points:

^(((((((((xx?)\9?)\8?)\7?)\6?)\5?)\4?)\3?)\2?)\1?$

(60 points)


A non-cheating solution, 79 points.

  ^((xx?|xxxx)(\2{7})?)(\1{63})?$


I was working on this pattern:

^(x{1,2}){1,2}$

But it wedged my browser as I started adding more terms. :/


58 points:

    ^x$|^(x+)\1$


^ answers all questions.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: