In [1]:
!date
Fri Oct  4 10:12:24 CDT 2019

I've just posted something interesting elsewhere on the internet, and I may need to use the blog and site as a sort of annex to talk about it.

Here's the readme for my super awesome exploratory data project with Fortnite:

Fortnite Data Mine

Alrighty. You're here and you're hoping that this is worth your time.

Let me tell you it will be. Maybe.

Probably not, but let's face it: there is a bounty of data when it comes to video games, and Fortnite is a behemoth and no exception.

I will publicly state that too much Fortnite might not be a great thing for me personally, but if it's your job then that's that.

It is a game of high skill, teamwork, and constant decision making.

In this project I have a collection of nearly 1000 screenshots of the final game stats screen, covering solos, duos, trios, and squads played on the Nintendo Switch.

I've always been interested in game data since Super Smash Bros would give out little awards for ridiculously tracked data points like "highest smash damage while in the air."

I could not have collected this data set by myself. I have my squad mates to thank, and to honorably say: let's find another video game to play.

What's the point?

I have data that I need in another format to mess with

What I've tried.

I have entered information by hand

I have entered the information by numpad for one data entering session.

I found that the Tesseract OCR library worked less than extraordinarily, not what its name suggests. It read some data from the picture but not all. I definitely wish I had known that outcome a long time ago.

The learning opportunity that was not wasted on me came after that: I did what I knew how to do with the data set and found a bad row of data. Yay me for that.

I have gone back to testing pytesseract and using PIL.

Right now it's a gimmick, sure, but as I was writing the code to pinpoint where each letter or number was in the picture, I realized I was (read: could be) also building a cobbled-together picture-observation-organizer. I could use that data to train my own machine army to read those numbers and letters.

This is much farther down the line

Right now it's in a completely rudimentary phase.

Which is the whole point right?
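Here is a minimal sketch of what that "pinpoint, then read" step could look like with PIL and pytesseract. It is not the code from the repo: the box coordinates, the stat names, and the helper functions `crop_stats` and `read_number` are all made up for illustration, and the OCR step only works if pytesseract and the Tesseract binary are installed.

```python
from PIL import Image

# Hypothetical pixel boxes (left, upper, right, lower) for two stats on the
# final stats screen. Real coordinates would have to be measured by hand.
STAT_BOXES = {
    "placement": (100, 50, 220, 110),
    "eliminations": (100, 150, 220, 210),
}


def crop_stats(screenshot, boxes):
    """Return {stat name: cropped image} for every labelled box."""
    return {name: screenshot.crop(box) for name, box in boxes.items()}


def read_number(crop):
    """OCR a single crop as one line of text (assumes Tesseract is installed)."""
    import pytesseract  # imported lazily so the cropping works without it

    # "--psm 7" asks Tesseract to treat the image as a single text line
    return pytesseract.image_to_string(crop, config="--psm 7").strip()


# A blank stand-in image; a real run would use Image.open() on a screenshot
screenshot = Image.new("RGB", (1280, 720), "white")
crops = crop_stats(screenshot, STAT_BOXES)
print({name: crop.size for name, crop in crops.items()})
```

Saving each labelled crop next to the value that was typed in by hand is what would turn this into the picture-observation-organizer: a pile of small images with known answers.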

Check out the repo. Take a look around. Tell me what you think.


That's it. Tada! It's here

SO OFFICIAL MUCH PROFESH

I also pushed nearly 1000 Fortnite pictures to GitHub, so well done me for wasting precious processing power.

What's the point of all this hubbub and hurray, you ask?

Are you never impressed?

Data processing is super high level nowadays, so if you're here and feel out of your depth with the code I'm talking about, let me tell you, I feel the same way about the NEXT step in my journey from analyst to full-blown scientist.

This project is a total dumb down of that. The completely boiled down, weird, now-rubbery puck of a food item you started cooking and left to watch Mindhunter, and now you cannot quite remember what it was...

The initial data that I pulled from the photo was entered with me punching in stuff on my keyboard. I at least have a numpad. Then when messing with it later I found that I had a bad row of data.

When I got extra lazy about this, I went back to what I thought was the original solution and started messing with it some more, which led to messing with more stuff, and now I think I know how I can start trying to help the computer understand better how to decide something.

What I'm saying is I could take all of these pictures and dump them into Watson or something and see what sticks to the wall, but I don't think I'm there yet. This project is so uncharacteristically nerdy, and has already branched twice from what I thought it would be, that it's definitely something I want to keep building on. There are silly low-level things to build here that will give me huge insight into how the crazy big things are built.

alright that's still kind of heady come back a bit

This is what I'm saying. I'm just a kid in a backyard with a net and a notebook trying to learn about bugs after indulging way too heavily in Animal Crossing on his GameCube.

The project will embarrassingly illuminate this and the time wasted, but it turns around into something useful: even if it never really comes to complete fruition (we all know about the cow project) as something impressive, I know that from messing with it I can build effective skills.

And then some hiring manager from some third-party gaming data cryptocurrency think tank outfit wants to hire me because I accidentally wrote the API for what is essentially a gaming ticker tape, and here's the real chocolatey center.

It's all because I have the data for the time I was slacking.

Wanna see some code?

my code fiends where you at

do you like puzzles? I like puzzles.

I found this puzzle here

https://exercism.io/my/solutions/6047814092754e48bfbe49d07534ee30

In this one I have to change all of these letters from

G -> C
C -> G
T -> A
A -> U

something about transcribing DNA into RNA..

In [24]:
import random as rnd
In [47]:
nucleopeptides = ['G','C','T','A']
rnd.shuffle(nucleopeptides)
In [48]:
nucleopeptides
Out[48]:
['G', 'A', 'T', 'C']
In [49]:
nucleopeptides
Out[49]:
['G', 'A', 'T', 'C']
In [53]:
def shuffle_and_return_list():
    n = ['G', 'A', 'T', 'C']
    rnd.shuffle(n)
    return n
In [54]:
# looks pretty random. let's make a gigantic DNA strand
gigantic_dna_strand = [shuffle_and_return_list() for i in range(101)]
In [55]:
gigantic_dna_strand[:10]
Out[55]:
[['C', 'T', 'A', 'G'],
 ['G', 'T', 'C', 'A'],
 ['A', 'C', 'G', 'T'],
 ['C', 'G', 'A', 'T'],
 ['A', 'T', 'G', 'C'],
 ['T', 'A', 'G', 'C'],
 ['A', 'T', 'C', 'G'],
 ['G', 'T', 'C', 'A'],
 ['A', 'T', 'G', 'C'],
 ['A', 'C', 'G', 'T']]

The problem I just ran into for about 10 minutes was figuring out that shuffle doesn't return anything: it shuffles the list in place and returns None. I had to call the variable again to see that the collection of letters had been shuffled.
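A small sketch of that gotcha, using only the standard library. The variable names are mine. `random.sample` with `k=len(...)` is one way to get a shuffled copy back as a return value, if that is what you were expecting from `shuffle`.

```python
import random

letters = ['G', 'C', 'T', 'A']

# random.shuffle reorders the list in place and returns None
result = random.shuffle(letters)
print(result)   # None
print(letters)  # same four letters, in whatever order the shuffle produced

# random.sample returns a NEW list and leaves the original untouched
original = ['G', 'C', 'T', 'A']
shuffled_copy = random.sample(original, k=len(original))
print(original)
print(shuffled_copy)
```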

Now for haxxxxxxNG!!

In [56]:
import pandas as pd
In [66]:
dna_df = pd.DataFrame(gigantic_dna_strand)
dna_df.head()
Out[66]:
0 1 2 3
0 C T A G
1 G T C A
2 A C G T
3 C G A T
4 A T G C
In [58]:
translated_df = dna_df.copy()
translated_df.head()
Out[58]:
0 1 2 3
0 C T A G
1 G T C A
2 A C G T
3 C G A T
4 A T G C
In [61]:
dna_df[dna_df.loc[:] == 'C'] = 'crap'
dna_df.head()
Out[61]:
0 1 2 3
0 crap T A G
1 G T crap A
2 A crap G T
3 crap G A T
4 A T G crap

I'm having trouble visualizing how to translate the existing dataframe, but I think I have an idea.

In [65]:
# don't worry, I've changed dna_df back to the original letters

translated_df.loc[:] = ''
translated_df.head()
Out[65]:
0 1 2 3
0
1
2
3
4
In [69]:
translated_df[dna_df.loc[:] == 'G'] = 'C'
translated_df.head()
Out[69]:
0 1 2 3
0 C
1 C
2 C
3 C
4 C

oooo that looks like it's working

In [70]:
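# every mask is built from dna_df (the untouched original), so a letter that
# was just written into translated_df never gets translated a second time
translated_df[dna_df.loc[:] == 'G'] = 'C'
translated_df[dna_df.loc[:] == 'C'] = 'G'
translated_df[dna_df.loc[:] == 'T'] = 'A'
translated_df[dna_df.loc[:] == 'A'] = 'U'
translated_df.head()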
Out[70]:
0 1 2 3
0 G A U C
1 C A G U
2 U G C A
3 G C U A
4 U A C G
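The four masked assignments above can probably be collapsed into one dictionary lookup per cell. This is a sketch under my own naming (`DNA_TO_RNA`, `sample_df`, `sample_rna`), using `Series.map` applied column by column. Because each cell is looked up exactly once, a later replacement can't overwrite an earlier one, which is the same reason the version above writes into a blank copy instead of editing dna_df directly.

```python
import pandas as pd

# Mapping from the puzzle: G -> C, C -> G, T -> A, A -> U
DNA_TO_RNA = {'G': 'C', 'C': 'G', 'T': 'A', 'A': 'U'}

# Two rows borrowed from the head of the frame above, as a small stand-in
sample_df = pd.DataFrame([['C', 'T', 'A', 'G'],
                          ['G', 'T', 'C', 'A']])

# Series.map looks each value up in the dict; apply runs it on every column
sample_rna = sample_df.apply(lambda column: column.map(DNA_TO_RNA))
print(sample_rna)
```

The result should line up with the first two rows of the output above.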
In [81]:
# now to set up a test
dna_df.loc[:,0].value_counts()
Out[81]:
A    29
C    28
G    24
T    20
Name: 0, dtype: int64
In [84]:
translated_df.loc[:,0].value_counts()
Out[84]:
U    29
G    28
C    24
A    20
Name: 0, dtype: int64
In [88]:
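for i in range(3):
    # .iloc[i] picks the i-th largest count by position; plain [i] on a letter-indexed Series is ambiguous
    print(dna_df.loc[:,i].value_counts().iloc[i] == translated_df.loc[:,i].value_counts().iloc[i])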
True
True
True

not a great test, but the value counts of the letters should all match between the two frames.

definitely don't intuitively know how to do this. I'm sure there's an obvious and super pythonic way to do this.

In [95]:
paired_df = translated_df + dna_df
paired_df.head()
Out[95]:
0 1 2 3
0 GC AT UA CG
1 CG AT GC UA
2 UA GC CG AT
3 GC CG UA AT
4 UA AT CG GC
In [99]:
paired_df.loc[0,0][0]
Out[99]:
'G'

silly me. I didn't think of this.

In [109]:
test_df = translated_df.copy()

# blank out the copy, then translate the RNA letters back to DNA
test_df[translated_df.loc[:] != ''] = ''

test_df[translated_df.loc[:] == 'C'] = 'G'
test_df[translated_df.loc[:] == 'G'] = 'C'
test_df[translated_df.loc[:] == 'A'] = 'T'
test_df[translated_df.loc[:] == 'U'] = 'A'

# True wherever the round trip gives back the original letter
bool_df = test_df == dna_df
In [112]:
for i in range(3):
    print(bool_df[i].value_counts())
True    101
Name: 0, dtype: int64
True    101
Name: 1, dtype: int64
True    101
Name: 2, dtype: int64
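The loop above only looks at columns 0 through 2. A sketch of a check that covers every cell at once is to reduce the boolean frame with `.all().all()`. The tiny frames here are stand-ins I made up so the example runs on its own.

```python
import pandas as pd

original = pd.DataFrame([['C', 'T', 'A', 'G'],
                         ['G', 'T', 'C', 'A']])

# A correct round trip, and one with a single wrong letter in the last column
good_round_trip = pd.DataFrame([['C', 'T', 'A', 'G'],
                                ['G', 'T', 'C', 'A']])
bad_round_trip = pd.DataFrame([['C', 'T', 'A', 'G'],
                               ['G', 'T', 'C', 'T']])

# First .all() collapses each column, second .all() collapses the columns
good_matches = bool((good_round_trip == original).all().all())
bad_matches = bool((bad_round_trip == original).all().all())
print(good_matches)  # True
print(bad_matches)   # False
```

The bad frame differs only in column 3, which is exactly the column a `range(3)` loop would never look at.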

ugh. that turned into way too much.

that's how you translate dna the slow way and confirm it!
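For comparison, here is a sketch of a quicker route for a plain string instead of a dataframe, using the built-in `str.maketrans` and `str.translate`. The function name `to_rna` is my own, and it assumes the strand contains only the four letters from the puzzle.

```python
# Translation table from the puzzle: G -> C, C -> G, T -> A, A -> U
DNA_TO_RNA = str.maketrans("GCTA", "CGAU")


def to_rna(dna_strand):
    """Translate every letter of the strand in a single pass."""
    return dna_strand.translate(DNA_TO_RNA)


print(to_rna("CTAG"))  # GAUC
```

"CTAG" is the first row of the dataframe above, and "GAUC" matches the first row of the translated one.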

I'll try better next time