Automatic Generation of Regular Expressions from Examples
Click on "Evolve!"
This web application generates regular expressions (regex) automatically by means of examples.
Each example is a pair of strings, "input string" and "string to match", where the latter must be a substring of the former (or an empty string). The input string must be less than 256 characters.
Optimizing...
Our regex generator occasionally receives spikes of many thousands visits: please be patient.
How it works
Internally, the application makes use of Genetic Programming (GP)
The examples are splitted into 2 subsets: the training set, used to drive a GP-search within a population of candidate regular expressions; and the validation set, used to choose the best regex in the current population. The search is iterated for a predefined amount of iterations.
When there are less than 200 examples, the training set (Type T) and validation set (Type V) have the same size. Otherwise, the training set is composed of 100 examples and the other examples constitute the validation set.
Constraints
- The maximum length of each string is of 256 characters
- The minimum number of examples is fixed to 8
- The maximum total (aggregate) size of the examples is limited to 5 MiB
When one of these constraints is not satisfied the application will not start the generation and show an alert.
F.A.Q.
-
I got a "Server is busy!" error.
We built this web app as a demonstrator for our paper presented at GECCO 2012. It is not designed to handle hundreds of concurrent requests, but we are working toward making the prototype able to tolerate a reasonable amount on traffic. Follow our Twitter account to be notified about improvements (but please do not expect huge scalability: we are a small group with limited resources).
-
Are you going to release the code?
We do not plan to realease the code in the short term. We have limited resources and are currently focussing on the web app.
-
The generated regexes look unoptimized. Why?
There are some optimization which could make a given regex appear better, e.g.,
[\d\d\.\d]
→[\d\.]
. Currently, our prototype does not apply any of these optimizations. Note, however, that optimizing the regexes during the evolution could be counterproductive, because it could reduce diversity and hence hamper the GP ability of exploring the space of solutions. -
Should the examples be representative?
Yes, definitely. As for any approach which synthetizes a solution given only a set of examples, the quality of the generated solution does depend on the quality of the examples. In particular, the examples should be representative of the actual use expected for the solution. Of course, this is a very intuitive and informal description of this issue.
-
Why do you use some examples for "validation" purposes ? Why not use all the examples for learning ?
Validation examples are necessary for assessing the generalization ability of solutions generated based only on learning examples. This information is an essential component of the criterion for choosing the best solution among those generated by the evolutionary search.
-
Why don't you run the evolution on the client side, using JavaScript?
Unfortunately, the JavaScript engine for regular expressions does not use possessive quantifiers. We instead prefer possessive quantifiers because they allow faster evaluations than greedy and lazy quantifiers. Since the evolutionary search performed by GP requires each regex candidate to be applied on each example, using possessive quantifiers would require a tremendous amount of time - too much to be practical.
-
What is Genetic Programming? Which parameter values do you use?
For the details about the algorithm used by this web app, please read our (short) GECCO paper. Note that some of the parameter values used in this web app are different from what presented in the paper, because they are optimized for obtaining a good regex as soon as possible, rather then obtaining a good and compact regex later.
Bartoli, Davanzo, De Lorenzo, Mauri, Medvet, Sorio, Automatic Generation of Regular Expressions from Examples with Genetic Programming, ACM Genetic and Evolutionary Computation Conference (GECCO), 2012, Philadelphia (US)
De Lorenzo, Medvet, Bartoli, Automatic String Replace by Examples, ACM Genetic and Evolutionary Computation Conference (GECCO), 2013, Amsterdam (Netherlands)—the string replace functionality described in this paper is based on an extension of the work showcased on this web app; it is currently not exposed on the web.