Regular Expressions/Regex for data cleaning
A regular expression, regex or regexp
A regular expression, regex or regexp is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. From Wikipedia.
A couple of references as well as the freeCodeCamp one above:
A website with library and test area for Regex coding.
w3scools.com javasccript Regex list.
So what is it good for?
Basically cleaning up and filtering your data. For a lot of the data that you gather is not in a well structured state for using. There could be a white-space at the beginning of a word in a CSV file cell, so this will not match with the identical word in another cell that does not have that white-space at the beginning. So you can use Regex to remove any white-spaces at the beginning of CSV cells. Also same issue at end of cell with white-spaces.
Another thing you can do if you have the same text but some is in UPPER CASE some in Lower Case & Some camelCase. Same word but will be read differently. So you can change everything to UPPER CASE so that you can find matches in the data. So Regex is good for finding patterns in files and also in cleaning up data.
Where I have tried to use it
I have come across them in a few different instances:
- Knime. Even the basic node of Row filter offer a regular expression filter.
- Python has a module for it.
- SQL queries use it.
- Excel can use them but has its own advanced filtering.
- RapidMiner uses it to a degree
- Orange, specifically in the Text Mining Module. In the Pre-processing Text node. In fact it was this instance that I could not use it properly that I put it on my list of things I wanted to find out more about. (see example below).
- OpenRefine uses it or a similar form
Orange Text Mining Example
In this example I wanted to filter out some odd text items and didn’t know how to do it.
The actual Regex here is : \w+, the \w is match any letter & number & underscore [a-zA-Z][0-9]_ and the +plus says look for more ( so it doesn’t just find the first one and stop). So therefore it does not look for punctuation [.,;'”;/\? etc ]but filters it out of the search. So this is handy as you do not want lots of punctuation in your wordcloud or exported data that you may need to clean out.
It can be a bit bothersome at times but when you get it right its brilliant.