blog post about regex (regular expressions)

Finally!

You've been waiting and here it is, the regex (short for regular expressions) post you have all been asking for.

Let me say this to begin with: All regex tutorials are bad. Mine will be worse.

In [1]:
import re

There you go. Can you feel the growing ultimate power of regular expressions in your vicinity as you read this? Importing alone gives me shivers.

To follow tradition, as all regex tutorials start off, I need to impart to you how powerful regexes are.

Regexes are powerful.

Did you feel the power? Give me another chance.

REGEXES ARE POWERFUL

What are regexes for all my readers who don't care and are here just to inflate my Google Analytics scores (I appreciate you)?

A regular expression is probably as irregular as it gets. It's a small made-up language for describing patterns of text, so you can find and extract whatever matches those patterns.

To display this ultimate power of finding text, I have stored an entire song by the legendary musical artist Enrique Iglesias in the variable txt.

(to see contents of txt you can scroll to the bottom of this post)

In [4]:
re.findall(r"art", txt)
Out[4]:
['art', 'art', 'art', 'art', 'art', 'art']

Told you he's an artist. This song is all about art.

What's happening here? You can pass the exact string you're looking for to re.findall(pattern, text_to_be_searched), and it will return every instance it finds in a nice little list.

That sure is useful for finding things when I know what I am looking for, but what if I wanted to see where art is appearing without really knowing where it appears? Because I don't know where it appears. Regex has a solution to this.

Ever add ellipses for the mystery....? Me too.....

In [6]:
re.findall(r"...art...",txt)
Out[6]:
['alvarte a', 'alvarte a', 'alvarte a', 'alvarte a']

Enrique is so sneaky that he's actually hidden art in some of his lyrics. How meta. This proves he's a legendary artist. But it looks like we're returned fewer instances of where art appears now, which means that our regular expression pattern (a regular expression pattern is a combination of characters and symbols that specify what we're looking for) is too picky: every dot has to match exactly one character, and a dot won't match a line break, so any art sitting too close to the end of a line gets skipped.
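Here's a rough sketch of why the dots cost us matches. The sample string is just three lines borrowed from the lyrics at the bottom of this post (sample, strict and loose are names I made up for the demo), and re.DOTALL is one possible way to loosen things up, not necessarily the best one.

```python
import re

# Three lines borrowed from the lyrics at the bottom of this post.
sample = "Que salvarte a ti mil veces\nAh déjame tocarte\nQuiero acariciarte"

# Each dot must match exactly one character, and by default a dot
# never matches a newline, so "tocarte" at the end of a line is skipped.
strict = re.findall(r"...art...", sample)
print(strict)

# With re.DOTALL the dot is allowed to match a newline as well,
# so the match can spill over onto the next line.
loose = re.findall(r"...art...", sample, flags=re.DOTALL)
print(loose)
```

The last word, acariciarte, still goes missing in both cases because only one character follows art before the text ends, and the pattern insists on three.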

I want to find all the places that Enrique hides his art.

In [16]:
re.findall(r"(\w*art.)", txt)
Out[16]:
['salvarte', 'salvarte', 'tocarte', 'acariciarte', 'salvarte', 'salvarte']

Whomever he's singing to sure needs a lot of saving. Geez.

Quick explanation on the regular expression above that actually looks like I've had a change of heart and am now searching for warts.

\w is one of those special combos of characters and symbols that hold meaning in regex. This one means: match any word character (roughly [a-zA-Z0-9_]; in Python 3 it also matches accented letters like é unless you pass re.ASCII).

* is a quantifier that matches whatever comes right before it between zero and unlimited times. I used this one because I didn't know how many word characters would come before art in each word.

. the period matches any character except for line terminators. I used this one knowing full well that, in this song, art is nearly always followed by an e.

( ... ) the parentheses define our capture group, or essentially what we want returned.
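To see each piece earn its keep, here's a minimal sketch that builds the pattern up one symbol at a time. The two-line sample comes from the lyrics below; the variable names and the "héroe" check at the end are my own additions for illustration.

```python
import re

# Two lines borrowed from the lyrics at the bottom of this post.
sample = "Ah déjame tocarte\nQuiero acariciarte"

# Just the literal letters.
literal = re.findall(r"art", sample)

# Zero or more word characters in front of "art".
with_prefix = re.findall(r"\w*art", sample)

# Plus exactly one more character after "art".
with_dot = re.findall(r"\w*art.", sample)

# When the pattern has a capture group, findall returns only the group.
grouped = re.findall(r"\w*(art).", sample)

# \w is Unicode-aware by default in Python 3; re.ASCII narrows it
# to [a-zA-Z0-9_], which splits the word at the accented letter.
unicode_word = re.findall(r"\w+", "h\u00e9roe")
ascii_word = re.findall(r"\w+", "h\u00e9roe", flags=re.ASCII)

for result in (literal, with_prefix, with_dot, grouped, unicode_word, ascii_word):
    print(result)
```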

If we were really just looking for how many times the word art appears in the song, all we would have to do is use \b on either side of art like so:

In [17]:
re.findall(r"(\bart\b)", txt)
# \b matches a "word boundary": the spot between a word character and a non-word character (a space, punctuation, or the start/end of the text)
Out[17]:
[]

True artists never explicitly talk about their art in their art.

Maybe at this point regexes don't seem that powerful and, in all honesty, it's because I truly only started learning regular expressions last week.
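Since the song gives \b nothing to find, here's a small sketch on a sentence I made up (it is not from the song) where art shows up both on its own and buried inside other words.

```python
import re

# A made-up sentence, not from the song.
sample = "The art of salvarte: art, artist, smart."

# Without boundaries, every "art" counts, even inside other words.
everywhere = re.findall(r"art", sample)

# With \b on both sides, only the standalone word survives.
# Note the second hit is followed by a comma, not a space.
whole_words = re.findall(r"\bart\b", sample)

# Forgetting the r prefix turns \b into a backspace character,
# so this version quietly finds nothing.
no_raw_string = re.findall("\bart\b", sample)

print(len(everywhere))
print(whole_words)
print(no_raw_string)
```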

To wrap up, I want to show what a regular expression might be used for in real life outside of this blog.

In [23]:
import pandas as pd
hn = pd.read_csv('hacker_news.csv')
hn_urls = hn['url'][:20]
hn_urls
Out[23]:
0                                                                               http://www.interactivedynamicvideo.com/
1                                http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/
2                                      https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429
3                                                            http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0
4       http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/
5                                                                                                                   NaN
6                                                                                                                   NaN
7                     http://firstround.com/review/shims-jigs-and-other-woodworking-concepts-to-conquer-technical-debt/
8                                                             http://www.southpolestation.com/trivia/igy1/appendix.html
9                            http://techcrunch.com/2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/
10                                            http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/
11                                                              https://medium.com/@loorinm/coding-is-over-6d653abe8da8
12                                                                                                 https://iot.seeed.cc
13                                            http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html
14                                                                                                                  NaN
15                                                                          http://beta.crowdfireapp.com/?beta=agnipath
16                                                                                                                  NaN
17    https://www.bostonglobe.com/magazine/2015/12/29/years-later-did-big-dig-deliver/tSb8PIMS4QJUETsMpA7SpI/story.html
18                                                                                                 https://www.valid.ly
19                                                                              http://apod.nasa.gov/apod/astropix.html
Name: url, dtype: object

What are these? These are URLs in a dataset whose true purpose does not matter to the point of this blog post. What we are going to do is separate these URLs into their component parts: the protocol, the domain, and the path. Don't know what these things are? Great, you're on your way to learning more stuff today.

The pattern below will separate the url into protocol, domain, and path, creating a column for each and a title of that column too.

Don't ask me how it works, I didn't figure this one out on my own.

In [25]:
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
hn_url_parts = hn_urls.str.extract(pattern, flags=re.I)
hn_url_parts.dropna() # drops the rows where the url was NaN, which makes it look neater
Out[25]:
    protocol  domain                           path
0   http      www.interactivedynamicvideo.com
1   http      www.thewire.com                  entertainment/2013/04/florida-djs-april-fools-water-joke/63798/
2   https     www.amazon.com                   Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429
3   http      www.nytimes.com                  2007/11/07/movies/07stein.html?_r=0
4   http      arstechnica.com                  business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/
7   http      firstround.com                   review/shims-jigs-and-other-woodworking-concepts-to-conquer-technical-debt/
8   http      www.southpolestation.com         trivia/igy1/appendix.html
9   http      techcrunch.com                   2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/
10  http      evonomics.com                    advertising-cannot-maintain-internet-heres-solution/
11  https     medium.com                       @loorinm/coding-is-over-6d653abe8da8
12  https     iot.seeed.cc
13  http      www.bfilipek.com                 2016/04/custom-deleters-for-c-smart-pointers.html
15  http      beta.crowdfireapp.com            ?beta=agnipath
17  https     www.bostonglobe.com              magazine/2015/12/29/years-later-did-big-dig-deliver/tSb8PIMS4QJUETsMpA7SpI/story.html
18  https     www.valid.ly
19  http      apod.nasa.gov                    apod/astropix.html

Wow neat. How can this be useful? Your guess is as good as mine, I just wanted to show a real application of regex.
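For anyone who does want to ask how it works, here's my best attempt at a sketch. It is the same pattern, spread over several lines with re.VERBOSE so each piece can carry a comment, and run on single URLs with plain re instead of pandas. The names url_pattern, full and bare are made up for this example, and the comments are my own reading of the pattern.

```python
import re

# Same pattern as above, one piece per line. In re.VERBOSE mode,
# whitespace in the pattern is ignored and # starts a comment.
url_pattern = re.compile(
    r"""
    (?P<protocol>https?)    # http, optionally followed by s
    ://                     # the literal separator
    (?P<domain>[\w.\-]+)    # one or more word characters, dots or hyphens
    /?                      # an optional slash after the domain
    (?P<path>.*)            # whatever is left, possibly nothing
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)

# (?P<name>...) is a named group; groupdict() returns them by name,
# which is also where the column titles come from in the pandas version.
full = url_pattern.match("http://apod.nasa.gov/apod/astropix.html").groupdict()
bare = url_pattern.match("https://www.valid.ly").groupdict()

print(full)
print(bare)
```

A URL with no path still matches because both the slash and the path are allowed to be empty, which is why a few rows above have a blank path instead of being dropped.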

in conclusion . . .

Regular expressions seem really neat, but in all honesty they're hard to learn and use effectively within a week. More to the point, I have not used them enough to know where they would be useful in my own workflow, i.e. I do not have any "regex intuition".

So why do this? I'm sort of an NLP-nerd wanna-be (NLP is natural language processing, used to make computers understand human-speak, text or otherwise) and regex is a language designed to find and pull apart patterns in text. I may not make regex a priority of my learning, as it is only one part of data processing, but it's definitely useful to play around with and to know how it works.

Thanks for reading/scanning/scrolling.

Below: the text from txt and sites that aided me in writing this blog post

In [27]:
print(txt)
Si una vez yo pudiera llegar
A erizar de frio tu piel
A quemar que sé yo, tu boca
Y morirme allí después
Y si entonces
Temblaras por mi
Lloraras al verme sufrir
Ay sin dudar tu vida entera dar
Como yo la doy por ti

Si pudiera ser tu héroe
Si pudiera ser tu dios
Que salvarte a ti mil veces
Puede ser mi salvación

Si supieras
La locura que llevo
Que me hiere
Y me mata por dentro
Y que más da
Mira que al final
Lo que importa es que te quiero

Si pudiera ser tu héroe
Si pudiera ser tu dios
Que salvarte a ti mil veces
Puede ser mi salvación

Ah déjame tocarte
Quiero acariciarte
Una vez más
Mira que al final
Lo que importa es que te quiero

Si pudiera ser tu héroe
Si pudiera ser tu dios
Que salvarte a ti mil veces
Puede ser mi salvación

Quiero ser tu héroe
Si pudiera ser tu dios
Porque salvarte a ti mil veces
Puede ser mi salvación
Puede ser mi salvación
Quiero ser tu héroe