Finally!
You've been waiting and here it is, the regex (short for regular expressions) post you have all been asking for.
Let me say this to begin with: All regex tutorials are bad. Mine will be worse.
import re
There you go. Can you feel the growing ultimate power of regular expressions in your vicinity as you read this? Importing alone gives me shivers.
To follow tradition, as all regex tutorials start off, I need to impart to you how powerful regexes are.
Regexes are powerful.
Did you feel the power? Give me another chance.
What are regexes for all my readers who don't care and are here just to inflate my Google Analytics scores (I appreciate you)?
A regular expression is probably as irregular as it gets. It's a language completely made up to extract text by assembling patterns and routines to find that text.
To display this ultimate power of finding text, I have stored an entire song by the legendary musical artist, Enrique Iglesias, in the variable txt
(to see the contents of txt
you can scroll to the bottom of this post)
re.findall(r"art", txt)
Told you he's an artist. This song is all about art.
What's happening here? You can just input the exact string you're looking for and regex will return every instance it finds in a nice little list with the .findall(pattern, text_to_be_searched)
method.
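Here is a small sketch you can run without the song. The sample sentence is made up by me as a stand-in for txt (it is not the actual lyrics), so only the behaviour of findall carries over.

```python
import re

# Hypothetical stand-in for txt (not the real lyrics).
sample = "The artist started a party in the heart of the city."

# A plain string pattern matches that exact run of characters,
# even when it sits inside a longer word.
matches = re.findall(r"art", sample)

print(matches)       # ['art', 'art', 'art', 'art']
print(len(matches))  # 4
```

Every hit comes back as the literal text that matched, which is why the list is just the same string repeated.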
That sure is useful for finding things when I know what I am looking for, but what if I wanted to see where art
is appearing without really knowing where it appears? Because I don't know where it appears. Regex has a solution to this.
Ever add ellipses for the mystery....? Me too.....
re.findall(r"...art...",txt)
Enrique is so sneaky that he's actually hidden art in some of his lyrics. How meta. This proves he's a legendary artist. But it looks like we get back fewer instances of where art
appears now, which means that our regular expression pattern (a regular expression pattern is a combination of characters and symbols that specify what we're looking for) is too picky: every dot demands exactly one character, so a match needs three characters on each side, all on the same line, and matches can't overlap.
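To see why a pattern full of dots can return fewer hits than the bare string, here is a rough sketch on a made-up sentence (again a hypothetical stand-in, not the song).

```python
import re

# Hypothetical sample text, chosen so one "art" sits at the very start.
sample = "artist with a heart of gold"

plain = re.findall(r"art", sample)
dotted = re.findall(r"...art...", sample)

print(plain)   # ['art', 'art']
print(dotted)  # [' heart of']
```

The first art has nothing in front of it, so the three leading dots have no characters to match and that occurrence is skipped.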
I want to find all the places that Enrique hides his art.
re.findall(r"(\w*art.)", txt)
Whomever he's singing to sure needs a lot of saving. Geez.
Quick explanation on the regular expression above that actually looks like I've had a change of heart and am now searching for warts.
\w
is one of those special combos of characters and symbols that hold meaning in regex. This one means, match any word character (roughly [a-zA-Z0-9_]; in Python 3 it also matches letters and digits from other alphabets unless you pass re.ASCII)
*
this is a quantifier that matches whatever comes right before it between zero and unlimited times. I used this one because I didn't know how many word characters would come before art
in any given word in the text string.
.
the period matches any single character except a newline. I used this one knowing full well that there was usually an e
after the string art
in the words where art
shows up in this song.
( ... )
the parentheses define our capture group, or what we want to return essentially.
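Putting those pieces together, this is a minimal sketch of the same pattern on a made-up sentence (a hypothetical stand-in for txt, not the lyrics).

```python
import re

# Hypothetical sample text (not the real song).
sample = "The artist started a party in the heart of the city."

# \w*  any run of word characters before "art" (possibly none)
# art  the literal letters
# .    exactly one more character of any kind (a space counts)
grouped = re.findall(r"(\w*art.)", sample)
ungrouped = re.findall(r"\w*art.", sample)

print(grouped)  # ['arti', 'starte', 'party', 'heart ']
```

Because the parentheses wrap the whole pattern, the grouped and ungrouped versions return the same list here. Note the trailing space in the last item: the dot happily matched it.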
If we were really just looking for how many times the word art
appears in the song, all we would have to do is use \b
on either side of art
like so:
re.findall(r"(\bart\b)", txt)
# the \b looks for "word boundaries": the spot where a word character meets a non-word character (a space, punctuation, or the start or end of the text)
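A quick sketch of the difference word boundaries make, using a made-up sentence of my own (hypothetical, not from the song):

```python
import re

# Hypothetical sample text with "art" both alone and inside other words.
sample = "Modern art is hard; the artist started a party."

anywhere = re.findall(r"art", sample)        # counts art inside words too
whole_word = re.findall(r"\bart\b", sample)  # only art standing alone

print(len(anywhere))  # 4
print(whole_word)     # ['art']
```

Only the standalone word survives, because in artist, started and party the letters on at least one side of art are word characters, so there is no boundary there.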
True artists never explicitly talk about their art in their art.
Maybe at this point regexes don't seem that powerful, and in all honesty that's because I truly only started learning regular expressions last week.
To wrap up, I want to show what a regular expression might be used for in real life outside of this blog.
import pandas as pd
hn = pd.read_csv('hacker_news.csv')
hn_urls = hn['url'][:20]
hn_urls
What are these? These are urls in a dataset whose true purpose and reason do not matter to the point of this blog post. What we are going to do is separate these urls out into their component parts: the protocol, the domain, and the path. Don't know what these things are? Great, you're on your way to learning more stuff today.
The pattern below will separate the url into protocol, domain, and path, creating a column for each and a title of that column too.
Don't ask me how it works, I didn't figure this one out on my own.
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
hn_url_parts = hn_urls.str.extract(pattern, flags=re.I)
hn_url_parts.dropna() # makes it look neater
Wow neat. How can this be useful? Your guess is as good as mine, I just wanted to show a real application of regex.
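For the curious, here is my best reading of that pattern, sketched with plain re instead of pandas. The two urls are made up for illustration (they are not rows from hacker_news.csv). Each (?P<name>...) is a named capture group, which is where the column titles come from; https? means "http" with an optional "s"; [\w\.\-]+ is one or more word characters, dots, or hyphens; /? is an optional slash; and .* grabs whatever is left.

```python
import re

pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"

# Hypothetical urls for illustration only.
urls = ["https://www.example.com/blog/post?id=1", "HTTP://example.org"]

# groupdict() returns the named groups as a dict, one key per group name.
# re.I makes the match case-insensitive, so "HTTP" is accepted too.
parts = [re.search(pattern, url, flags=re.I).groupdict() for url in urls]

for p in parts:
    print(p)
# {'protocol': 'https', 'domain': 'www.example.com', 'path': 'blog/post?id=1'}
# {'protocol': 'HTTP', 'domain': 'example.org', 'path': ''}
```

The second url has no path at all, so that group comes back as an empty string rather than failing the match.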
Regular expressions seem really neat, but in all honesty they are hard to learn and use effectively within a week of picking them up. More to the point, I have not used them enough to know where they would be useful in my own workflow, i.e. I do not have any "regex intuition".
So why do this? I'm sort of an NLP-nerd wanna-be (NLP is natural language processing, used to make computers understand human-speak, text or otherwise) and regex is a little language designed for searching and pulling apart text. I may not make regex a priority of my learning, as it is only one part of data processing, but it is definitely useful to play around with and to know how it works.
Thanks for reading/scanning/scrolling.
Below: the text from txt
and sites that aided me in writing this blog post
print(txt)