You've been waiting and here it is, the regex (short for regular expressions) post you have all been asking for.
Let me say this to begin with: All regex tutorials are bad. Mine will be worse.
import re
There you go. Can you feel the growing ultimate power of regular expressions in your vicinity as you read this? Importing alone gives me shivers.
To follow tradition, as all regex tutorials start off, I need to impart to you how powerful regexes are.
Regexes are powerful.
Did you feel the power? Give me another chance.
What are regexes, for all my readers who don't care and are here just to inflate my Google Analytics scores (I appreciate you)?
A regular expression is probably as irregular as it gets. It's a small made-up language for describing patterns, which you then use to find and extract the text that matches them.
To display this ultimate power of finding text, I have stored an entire song by the legendary musical artist Enrique Iglesias in the variable txt.
(To see the contents of txt, you can scroll to the bottom of this post.)
re.findall(r"art", txt)
['art', 'art', 'art', 'art', 'art', 'art']
Told you he's an artist. This song is all about art.
What's happening here? You can pass the exact string you're looking for to re.findall(pattern, text_to_be_searched), and it returns every instance it finds in a nice little list.
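Here is a minimal sketch of that literal search. It runs on a two-line excerpt of the lyrics rather than the full txt, and the name sample is an assumption made for illustration.

```python
import re

# A short excerpt of the lyrics; `sample` is a hypothetical stand-in for `txt`.
sample = "Que salvarte a ti mil veces\nAh déjame tocarte"

# A pattern with no special characters simply matches itself.
matches = re.findall(r"art", sample)
print(matches)  # ['art', 'art']
```

re.findall always hands back a list, so an empty list means the pattern was not found anywhere.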
That sure is useful for finding things when I know what I am looking for, but what if I wanted to see where art is appearing without really knowing where it appears? Because I don't know where it appears. Regex has a solution to this.
Ever add ellipses for the mystery....? Me too.....
re.findall(r"...art...", txt)
['alvarte a', 'alvarte a', 'alvarte a', 'alvarte a']
Enrique is so sneaky that he's actually hidden art in some of his lyrics. How meta. This proves he's a legendary artist. But it looks like we're returned fewer instances of where art appears now, which means that our regular expression pattern (a regular expression pattern is a combination of characters and symbols that specify what we're looking for) is too restrictive.
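Why fewer matches? Each dot has to consume exactly one character, and a dot does not cross a line break. The sketch below shows this on a three-line excerpt. The variable sample and its line breaks are assumptions about how the lyrics are stored, not the actual txt.

```python
import re

# Hypothetical excerpt; the "\n" line breaks are an assumption.
sample = "Que salvarte a ti mil veces\nAh déjame tocarte\nQuiero acariciarte"

# Three characters, then "art", then three more characters, none of them a newline.
strict = re.findall(r"...art...", sample)
print(strict)  # ['alvarte a']

# With re.DOTALL the dot may also match "\n", so "tocarte" is picked up too.
loose = re.findall(r"...art...", sample, flags=re.DOTALL)
print(loose)  # ['alvarte a', 'tocarte\nQ']
```

"acariciarte" is missed both times because only one character follows its art before the text ends.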
I want to find all the places that Enrique hides his art.
re.findall(r"(\w*art.)", txt)
['salvarte', 'salvarte', 'tocarte', 'acariciarte', 'salvarte', 'salvarte']
Whoever he's singing to sure needs a lot of saving. Geez.
Quick explanation of the regular expression above, which actually looks like I've had a change of heart and am now searching for warts.
\w is one of those special combos of characters and symbols that hold meaning in regex. It matches any word character (roughly [a-zA-Z0-9_]; in Python 3 it also matches accented letters such as é unless you pass re.ASCII).
* is a quantifier that matches the preceding token between zero and unlimited times. I used it because I didn't know how many word characters would come before art in each word.
. the period matches any single character except line terminators. I used it knowing full well that, in this song, art is almost always followed by an e.
( ... ) the parentheses define our capture group, or essentially what we want returned.
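The pieces above can be tried out on a small excerpt. This is only a sketch: sample is a hypothetical three-line stand-in for txt, and the second half uses the word héroe from the lyrics to show how \w treats accented letters in Python 3.

```python
import re

# Hypothetical excerpt standing in for the full `txt`.
sample = "Que salvarte a ti mil veces\nAh déjame tocarte\nQuiero acariciarte"

# \w* grabs the word characters before "art"; the dot grabs one character after it.
words = re.findall(r"(\w*art.)", sample)
print(words)  # ['salvarte', 'tocarte', 'acariciarte']

# In Python 3, \w is Unicode-aware for str patterns, so "é" counts as a word character.
unicode_words = re.findall(r"\w+", "héroe")
print(unicode_words)  # ['héroe']

# re.ASCII narrows \w to [a-zA-Z0-9_], which splits the word at "é".
ascii_words = re.findall(r"\w+", "héroe", flags=re.ASCII)
print(ascii_words)  # ['h', 'roe']
```

Because the capture group wraps the whole pattern here, the result is the same with or without the parentheses. A group only changes what findall returns when it covers part of the pattern.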
If we were really just looking for how many times the word art appears on its own in the song, all we would have to do is put \b on either side of art, like so:
re.findall(r"(\bart\b)", txt) # \b matches a "word boundary": the zero-width spot between a word character and a non-word character (a space, punctuation, or the start or end of the text)
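A small sketch of the word-boundary behaviour follows. Both strings are made up for illustration: the first is one line of the lyrics, and the second is a hypothetical sentence that does contain art as a standalone word.

```python
import re

# One line from the lyrics: "art" only appears inside "salvarte".
lyric_line = "Que salvarte a ti mil veces"
standalone = re.findall(r"\bart\b", lyric_line)
print(standalone)  # []

# A hypothetical sentence where "art" also stands alone.
demo = re.findall(r"\bart\b", "art hides in salvarte, not as art itself")
print(demo)  # ['art', 'art']
```

\b consumes no characters. It only checks that one side is a word character and the other side is not, which is why the art inside salvarte is skipped.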
True artists never explicitly talk about their art in their art.
Maybe at this point regexes don't seem that powerful and, in all honesty, that's because I only started learning regular expressions last week.
To wrap up, I want to show what a regular expression might be used for in real life outside of this blog.
import re  # needed further down for the re.I flag
import pandas as pd

hn = pd.read_csv('hacker_news.csv')  # load the dataset
hn_urls = hn['url'][:20]  # keep the first 20 URLs
hn_urls
0     http://www.interactivedynamicvideo.com/
1     http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/
2     https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429
3     http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0
4     http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/
5     NaN
6     NaN
7     http://firstround.com/review/shims-jigs-and-other-woodworking-concepts-to-conquer-technical-debt/
8     http://www.southpolestation.com/trivia/igy1/appendix.html
9     http://techcrunch.com/2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/
10    http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/
11    https://medium.com/@loorinm/coding-is-over-6d653abe8da8
12    https://iot.seeed.cc
13    http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html
14    NaN
15    http://beta.crowdfireapp.com/?beta=agnipath
16    NaN
17    https://www.bostonglobe.com/magazine/2015/12/29/years-later-did-big-dig-deliver/tSb8PIMS4QJUETsMpA7SpI/story.html
18    https://www.valid.ly
19    http://apod.nasa.gov/apod/astropix.html
Name: url, dtype: object
What are these? These are URLs in a dataset whose true purpose does not matter to the point of this blog post. What we are going to do is separate these URLs into their component parts: the protocol, the domain, and the path. Don't know what these things are? Great, you're on your way to learning more stuff today.
The pattern below will separate each URL into protocol, domain, and path, creating a column for each one and naming that column after its group.
Don't ask me how it works, I didn't figure this one out on my own.
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
hn_url_parts = hn_urls.str.extract(pattern, flags=re.I)
hn_url_parts.dropna() # makes it look neater
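For anyone who does want to ask how it works, here is a hedged sketch that applies the same pattern to single URLs with plain re instead of pandas. The two URLs come from the listing above. The variable names and the comments are one reading of the pattern, not the original author's explanation.

```python
import re

pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
# (?P<protocol>https?)    "http" plus an optional "s", stored under the name "protocol"
# ://                     the literal separator
# (?P<domain>[\w\.\-]+)   one or more word characters, dots, or hyphens
# /?                      an optional slash after the domain
# (?P<path>.*)            whatever is left, possibly nothing

with_path = re.search(pattern, "http://apod.nasa.gov/apod/astropix.html", flags=re.I)
print(with_path.groupdict())
# {'protocol': 'http', 'domain': 'apod.nasa.gov', 'path': 'apod/astropix.html'}

no_path = re.search(pattern, "https://iot.seeed.cc", flags=re.I)
print(no_path.groupdict())
# {'protocol': 'https', 'domain': 'iot.seeed.cc', 'path': ''}
```

The names inside (?P<...>) are what end up as keys in groupdict() here, and as column titles when the same pattern goes through str.extract.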
Wow neat. How can this be useful? Your guess is as good as mine, I just wanted to show a real application of regex.
Regular expressions seem really neat, but in all honesty they also seem really hard to learn and use effectively within a week. I have not used them enough to know where they would be useful in my own workflow, i.e. I do not have any "regex intuition" yet.
So why do this? I'm sort of an NLP-nerd wannabe (NLP is natural language processing, used to make computers understand human-speak, text or otherwise), and regex is a pattern language designed for searching and processing text. I may not make regex a priority of my learning, since it is only one part of data processing, but it is definitely useful to play around with and to know how it works.
Thanks for reading/scanning/scrolling.
Below: the text from txt and the sites that aided me in writing this blog post
Si una vez yo pudiera llegar
A erizar de frio tu piel
A quemar que sé yo, tu boca
Y morirme allí después
Y si entonces
Temblaras por mi
Lloraras al verme sufrir
Ay sin dudar tu vida entera dar
Como yo la doy por ti
Si pudiera ser tu héroe
Si pudiera ser tu dios
Que salvarte a ti mil veces
Puede ser mi salvación
Si supieras
La locura que llevo
Que me hiere
Y me mata por dentro
Y que más da
Mira que al final
Lo que importa es que te quiero
Si pudiera ser tu héroe
Si pudiera ser tu dios
Que salvarte a ti mil veces
Puede ser mi salvación
Ah déjame tocarte
Quiero acariciarte
Una vez más
Mira que al final
Lo que importa es que te quiero
Si pudiera ser tu héroe
Si pudiera ser tu dios
Que salvarte a ti mil veces
Puede ser mi salvación
Quiero ser tu héroe
Si pudiera ser tu dios
Porque salvarte a ti mil veces
Puede ser mi salvación
Puede ser mi salvación
Quiero ser tu héroe