We have two kinds of online tools for different flavors of this problem:

Extraction

You provide:

  1. a set of input strings;
  2. for each string, the (possibly empty) substring to be extracted.

The tool attempts to infer a general data extraction pattern from the examples. It generates a regular expression that may be applied to a data stream for extracting only the portions complying with the inferred pattern.
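As a rough illustration of this workflow, the Python sketch below pairs each input string with the substring to extract, then applies a regular expression to a longer data stream. The example strings, the names `examples`, `consistent`, and `extract`, and the pattern itself are all hypothetical: the pattern is hand-written for this sketch and is not output produced by the tool.

```python
import re

# Hypothetical examples: (input string, substring to extract; "" means nothing).
examples = [
    ("Call 555-1234 now", "555-1234"),
    ("Office: 555-9876", "555-9876"),
    ("No number here", ""),
]

# Hand-written stand-in for the kind of pattern that might be inferred.
pattern = re.compile(r"\d{3}-\d{4}")


def consistent(regex, labelled):
    """Return True if the first match in each string equals its expected substring."""
    for text, expected in labelled:
        match = regex.search(text)
        found = match.group(0) if match else ""
        if found != expected:
            return False
    return True


def extract(regex, stream):
    """Return every portion of the stream that complies with the pattern."""
    return regex.findall(stream)


stream = "Fax 555-0000, mobile 555-4321, ext. 12"
print(consistent(pattern, examples))
print(extract(pattern, stream))
```

The check in `consistent` only compares the first match per string, which is enough for these single-extraction examples; inputs with several extractions per string would need a stricter comparison.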

Based on "Inference of Regular Expressions for Text Extraction from Examples", published in IEEE Transactions on Knowledge and Data Engineering (IEEE Xplore link; short preliminary version at ACM GECCO 2012).

Classification (regex golf)

You provide:

  1. a set of strings to be matched;
  2. a set of strings not to be matched.

The tool does not attempt to infer any general pattern. It generates a regular expression that classifies exactly the provided strings (overfitting to them), in the same spirit as http://regex.alf.nu/ and http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313-part2.ipynb
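The classification criterion can be sketched in a few lines of Python: a candidate regular expression is acceptable if it matches every string in the first set and none in the second. The two string sets, the function name `classifies`, and the candidate patterns are hypothetical, and the use of unanchored `re.search` is an assumption about matching semantics rather than a description of the tool.

```python
import re

# Hypothetical example sets.
to_match = ["afoot", "catfoot", "fanfoot"]
not_to_match = ["Atlas", "Iberic", "Mahran"]


def classifies(candidate, positives, negatives):
    """Return True if candidate matches all positives and no negatives.

    Matching is unanchored (re.search), which is an assumption of this sketch.
    """
    regex = re.compile(candidate)
    matches_all = all(regex.search(s) for s in positives)
    matches_none = not any(regex.search(s) for s in negatives)
    return matches_all and matches_none


print(classifies("foo", to_match, not_to_match))  # short pattern that separates the sets
print(classifies("a", to_match, not_to_match))    # too permissive: also hits "Atlas"
```

In regex golf the goal is usually the shortest expression that passes this check, so a solution like `foo` fits these particular strings without describing any broader pattern.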

Based on "Playing Regex Golf with Genetic Programming", presented at ACM GECCO 2014. Finalist at Humies 2014 - Human Competitive Results Produced by Genetic and Evolutionary Computation.