Comparative Analysis of Free Apps in Google Play and the Apple App Store

In this guided project, Dataquest has us roleplaying as data analysts at an app-building company. The apps our company builds are free. The mission, if we choose to accept it:

"Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users"

As I go along, I try to keep track of which questions I'm working on. The section above was Analyzing Mobile App Data (1/14), questions 1-2.

In [1]:
# helper function provided by Dataquest for this project
def explore_data(dataset, start, end, rows_and_cols=False):
    
    '''
    function that takes in dataset and explores it based on user's 
    set parameters.
    
    
    __parameters__
    
    dataset: (var to dataset) parameter to be explored
    start: (int) row number where you want the dataset to be explored from
    end: (int) row number where you want dataset to be explored to. 
    rows_and_cols: (bool) defaults to false, if made true, displays
    total rows and columns for entire dataset after exploration print
    out
    
    '''
    
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n') # adds newline when printing for prettiness
        
    if rows_and_cols:
        print('Number of rows:', len(dataset))
        print('Number of columns', len(dataset[0]))

Opening and Exploring the Data

(2/14) Questions 1-3

In [2]:
# importing csv tools and opening datasets into namespace
# question 1

from csv import reader
applestoredata = list(reader(open('AppleStore.csv',encoding='utf8')))
googstoredata = list(reader(open('googleplaystore.csv',encoding='utf8')))
In [3]:
# question 2 
# exploring data with previously built function
explore_data(applestoredata,0,5,True)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns 16

(Q3) The structure of our Apple App Store data is 7,198 rows (including the header row) and 16 columns. Columns that could assist us with our analysis are track_name (the app's name), price, rating_count_tot, user_rating, and prime_genre.

In [4]:
# q2
# number of rows and columns printed with the True parameter above this
# cell.
# exploring data with previously built function

explore_data(googstoredata,0,5,True)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns 13

(Q3) The structure of our Google Play data is 10,842 rows (including the header row) and 13 columns, three fewer columns than the Apple store data. The columns that might help our analysis mirror the ones we picked from the Apple store dataset: App, Category, Installs, Type, Price, and Genres.

In [5]:
explore_data(applestoredata,0,1)
explore_data(googstoredata,0,1)
# showing differences in categorical data
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


(Q3)

The dataset documentation page returns a 404, so I cannot post a reliable link to it for this project.

Deleting Wrong Data

(3/14) The Google Play data set has a dedicated discussion section, and we can see that one of the discussions describes an error for a certain row.

1) Read the discussion and find out what the index of the row is.

2) Print the row at that index to check whether it's indeed incorrect. Take into account the user reporting the error might or might have not removed the header row, so the index number might vary.

3) If the row has an error, remove the row using the del statement. For instance, to remove the row with the index 149 from a data set data that is stored as a list of list, you can use the code del data[149].

4) Make sure you don't run the del statement more than once, otherwise you'll delete more than one row.

5) Read the discussion section for the App Store data set, and see whether you can find any reports of wrong data.
In [6]:
# q2
print(googstoredata[0])
print(googstoredata[10473])
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

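Compared with the header, this row is missing its 'Category' value, so every later value is shifted one column to the left: '1.9' sits in the 'Category' column and '19' in the 'Rating' column. The maximum value in the 'Rating' column is 5, as stated in the dataset's documentation and in the issue reported in its discussion section, so the row is indeed wrong.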
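A shifted row like this one is also shorter than the header, which suggests a generic check. The following is a minimal sketch, not part of the Dataquest instructions; `find_malformed_rows` is a hypothetical helper and the sample rows are trimmed to three columns for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row's."""
    expected = len(dataset[0])  # assumes the first row is the header
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny illustrative sample (trimmed to three columns)
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # category missing
]

bad = find_malformed_rows(sample)
for index, row in bad:
    print(index, row)
```

Run against the full list, a check like this would flag any row with a missing or extra field, regardless of which column is affected.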

In [7]:
# q3: delete this row of data with the del statement. highly dangerous. don't try at home without proper
# adult supervision
# q4: run this cell only once; running it again would delete another row

print(googstoredata[10473])
del googstoredata[10473]
print(googstoredata[10473])
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']

(Q5) After going through the App Store dataset's discussion section, I found no reports of wrong data rows. I did notice threads about duplicate rows, a topic the next section covers.

Removing Duplicate Entries: Part One

(4/14)

1) Using a combination of narrative and code, explain the reader that the Google Play data set has duplicate entries. Print a few duplicate rows to confirm.

2) Count the number of duplicates using the technique we learned above.

3) Explain that you won't remove the duplicates randomly. Describe the criterion you're going to use to remove the duplicates.

4) We already suggested a criterion above, but you can come up with another criterion if you want. Make sure you support your criterion with at least one argument.

1) Data collection and entry sometimes capture the same record more than once, and duplicates like these distort any statistics we build from the data. The Google Play data set has this problem: Instagram, for example, appears four times.

In [8]:
for i in googstoredata:
    name = i[0]
    if name == 'Instagram':
        print(i)
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
In [9]:
# 2) using the technique from the lessons
dup_apps = []
uniq_apps = []

for app in googstoredata:
    name = app[0]
    if name in uniq_apps:
        dup_apps.append(name)
    else:
        uniq_apps.append(name)
        
print("number of duplicate apps:", len(dup_apps))
print("number of unique apps:", len(uniq_apps))
number of duplicate apps: 1181
number of unique apps: 9660

(Q3 and Q4)

Instead of removing the duplicates randomly, note that the Instagram rows printed above differ only in the number of reviews (the fourth column). A higher review count most likely means the row was collected more recently, so for each app we keep the row with the highest number of reviews.

To do this, we'll build a dictionary whose keys are app names and whose values are review counts. We'll use it to make sure that we keep, for each app, the entry with the highest number of reviews. Because dictionary keys are unique, the dictionary also behaves like a set: it holds exactly one entry per app name.

Removing Duplicate Entries: Part Two

(5/14)

Create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

In [10]:
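reviews_max = {}

for app in googstoredata:
    name = app[0]
    n_reviews = app[3]  # still a string, exactly as read from the CSV
    
    # '<' on two strings compares them character by character, not numerically
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews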

What we are doing above is this: we create an empty dictionary and loop over the Google Play rows. For each row we read the app name and its review count. If the name is already in the dictionary and the stored count is lower than the current row's count, we replace the stored count; otherwise, if the name is not yet in the dictionary, we add a new entry with the current count as its value. Note that the counts are compared as strings here, exactly as they were read from the CSV.
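Because the review counts are read from the CSV as strings, a numeric comparison is arguably safer than a text comparison ('9' sorts after '10' as text). The following is a minimal sketch under two assumptions: the header row has been excluded, and every value in the reviews column is a whole number. `max_reviews_by_name` and the 'Demo App' rows are hypothetical and only illustrate the idea.

```python
def max_reviews_by_name(rows, name_idx=0, reviews_idx=3):
    """Map each app name to its highest review count, compared numerically."""
    reviews_max = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])  # convert before comparing
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    return reviews_max

# Illustrative rows only (no header row); 'Demo App' is made up
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Demo App', 'TOOLS', '4.0', '9'],
    ['Demo App', 'TOOLS', '4.0', '10'],
]

result = max_reviews_by_name(sample)
print(result)
```

With a text comparison the 'Demo App' entry would stay at '9', since '9' < '10' is False for strings; the numeric version keeps 10.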

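To confirm that the dictionary holds exactly one entry per unique app, we can do some basic arithmetic: the number of rows minus the number of dictionary entries should equal the 1,181 duplicates counted earlier.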

In [11]:
len(googstoredata) - len(reviews_max)
Out[11]:
1181

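Next, we'll use the dictionary we just built to create a new list that contains no duplicates. The second list, goog_added, tracks the names already added, so an app whose highest review count appears in more than one row is still added only once.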

In [12]:
goog_clean = []
goog_added = []

for app in googstoredata:
    name = app[0]
    n_reviews = app[3]
    
    if (reviews_max[name] == n_reviews) and (name not in goog_added):
        goog_clean.append(app)
        goog_added.append(name)

Removing Non-English Apps: Part 1

1) Write a function that takes in a string and returns False if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns True.

-Inside the function, iterate over the input string. For each iteration check whether the number associated with the character is greater than 127. When a character is greater than 127, the function should immediately return False — the app name is probably non-English since it contains a character that doesn't belong to the set of common English characters.
-If the loop finishes running without the return statement being executed, then it means no character had a corresponding number over 127 — the app name is probably English, so the functions should return True.

2) Use your function to check whether these app names are detected as English or non-English:

'Instagram'
'爱奇艺PPS -《欢乐颂2》电视剧热播'
'Docs To Go™ Free Office Suite'
'Instachat 😜'
In [13]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

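This function counts the characters in a name whose code point (as returned by ord) is above 127, the upper limit of the ASCII range that covers common English characters. Unlike the stricter version described in the instructions above, it returns False only when a name contains more than three such characters, so names with a stray symbol or emoji, such as 'Docs To Go™ Free Office Suite' or 'Instachat 😜', still count as English.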

We will use it in two separate for loops below to filter both lists of applications.
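As a quick check, the four names from the exercise can be run through the function. The sketch below restates the function compactly so that it runs on its own; the `max_non_ascii` parameter is an addition for illustration, not part of the original function.

```python
def is_english(string, max_non_ascii=3):
    """Return False only if the string has more than max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

for name in names:
    print(name, '->', is_english(name))
```

Only the second name is rejected; the trademark symbol and the emoji each add a single non-ASCII character, which stays under the threshold.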

Removing Non-English Apps: Part Two

In [14]:
goog_engl = []
appl_engl = []

for app in goog_clean:
    name = app[0]
    if is_english(name):
        goog_engl.append(app)
        
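for app in applestoredata:
    # caution: index 0 in the App Store rows is the numeric 'id' column
    # ('track_name' is index 1), so this check passes every row
    name = app[0]
    if is_english(name):
        appl_engl.append(app)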
In [15]:
explore_data(goog_engl,2000,2005)
['Coupon Sherpa', 'SHOPPING', '4.4', '7793', '6.1M', '500,000+', 'Free', '0', 'Everyone', 'Shopping', 'February 6, 2018', '3.0.3', '4.0.3 and up']


['Gyft - Mobile Gift Card Wallet', 'SHOPPING', '4.1', '9701', '14M', '500,000+', 'Free', '0', 'Everyone', 'Shopping', 'July 25, 2018', '2.4.0', '4.4 and up']


['SavingStar - Grocery Coupons', 'SHOPPING', '4.2', '31519', '21M', '1,000,000+', 'Free', '0', 'Everyone', 'Shopping', 'June 12, 2018', '4.9.1', '4.0 and up']


['The Coupons App', 'SHOPPING', '4.5', '181990', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Shopping', 'April 27, 2018', 'Varies with device', 'Varies with device']


['Shopkick: Free Gift Cards, Shop Rewards & Deals', 'SHOPPING', '4.3', '213735', '43M', '10,000,000+', 'Free', '0', 'Everyone', 'Shopping', 'July 24, 2018', '5.3.73', '4.4 and up']


Isolating the Free Apps

In [16]:
android_final = []
ios_final = []

for app in goog_engl:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in appl_engl:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
In [17]:
print(len(android_final),len(ios_final))
8862 4056

In the code cell above, I take the lists that went through the English-name filter and check each app's price. If an app is free (a price of zero), its row is added to the final list for its platform: 8,862 Android apps and 4,056 iOS apps.
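The two data sets store the same kind of information at different positions (for example, the app name is at index 0 in the Google Play rows but at index 1 in the App Store rows, where index 0 is the id). One way to avoid hard-coded indexes is to look a column up by its header name. This is a minimal sketch; `column_index` is a hypothetical helper, and the second sample row is invented for illustration.

```python
def column_index(dataset, column_name):
    """Return the position of column_name in the header (first) row."""
    return dataset[0].index(column_name)

# Header trimmed to the first five App Store columns; the paid app is made up
apple_sample = [
    ['id', 'track_name', 'size_bytes', 'currency', 'price'],
    ['284882215', 'Facebook', '389879808', 'USD', '0.0'],
    ['0000000001', 'Hypothetical Paid App', '1024', 'USD', '2.99'],
]

name_idx = column_index(apple_sample, 'track_name')
price_idx = column_index(apple_sample, 'price')

# Skip the header row, then keep the names of apps whose price is zero
free_names = [row[name_idx] for row in apple_sample[1:] if float(row[price_idx]) == 0.0]
print(name_idx, price_idx, free_names)
```

Comparing the price as a number rather than as the text '0' or '0.0' also means one test works for both stores, provided the price column contains only plain numbers.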

Most Common Apps by Genre: Part 1

1) Give readers more context into why we want to find an app profile that fits both the App Store and Google Play. Explain our validation strategy for an app idea.

2) Inspect both data sets and identify the columns you could use to generate frequency tables to find out what are the most common genres in each market.

Since we're a company that makes both Android and Apple applications, we are looking for app profiles that attract similar traffic on both platforms. Our validation strategy is to rapidly develop an Android application and gather user research data from it. If we find that the app is sufficiently profitable, we then move forward, develop a version for the Apple platform, and add it to the iOS App Store.

To understand both datasets we need to find commonalities that the datasets share.

Most Common Apps by Genre: Part 2

Building functions to understand data between the datasets

In [18]:
def freq_table(dataset, index):
    
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

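The function above generates a frequency table: a dictionary that maps each distinct value in the chosen column to the percentage of rows that contain it.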
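The same table can be built with `collections.Counter` from the standard library. This is an alternative sketch rather than the approach used in this project; `freq_table_counter` is a hypothetical name, and the sample rows are invented.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Map each distinct value in column `index` to its percentage of rows."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())  # assumes the dataset is not empty
    return {value: count / total * 100 for value, count in counts.items()}

# Illustrative rows: (name, genre)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_counter(sample, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

Sorting on the percentage with a `key` function avoids the need to flip each pair into a (value, key) tuple before sorting.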

In [19]:
def display_table(dataset, index):
    
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

This function takes the frequency table returned by the previous function and prints its entries in descending order of percentage.

In [20]:
display_table(ios_final, -5)
Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032

Most Common Apps by Genre: Part 3

Games make up the clear majority of the free apps in our iOS list (about 55.6%), followed by Entertainment at about 8.2%; every other genre accounts for less than 5%. Note that this table counts how many apps exist in each genre, not how many users those apps have, so on its own it does not show which genres attract the most users. To estimate popularity, we look next at the average number of user ratings per genre.

In [21]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
Social Networking : 53078.195804195806
Photo & Video : 27249.892215568863
Games : 18924.68896765618
Music : 56482.02985074627
Reference : 67447.9
Health & Fitness : 19952.315789473683
Weather : 47220.93548387097
Utilities : 14010.100917431193
Travel : 20216.01785714286
Shopping : 18746.677685950413
News : 15892.724137931034
Navigation : 25972.05
Lifestyle : 8978.308510638299
Entertainment : 10822.961077844311
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Book : 8498.333333333334
Finance : 13522.261904761905
Education : 6266.333333333333
Productivity : 19053.887096774193
Business : 6367.8
Catalogs : 1779.5555555555557
Medical : 459.75
In [22]:
categories_android = freq_table(android_final,1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',','')
            n_installs = n_installs.replace('+','')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1820673.076923077
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15560965.599534342
FAMILY : 3694276.334922527
MEDICAL : 120616.48717948717
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10682301.033377837
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_MAGAZINES : 9549178.467741935
MAPS_AND_NAVIGATION : 4056941.7741935486
In [23]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',','')
    n_installs = n_installs.replace('+','')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)
Out[23]:
3603485.3884615386
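The install counts are stored as text such as '1,000,000+', and the cells above repeat the same two replace calls each time. The following sketch pulls that parsing into a helper and computes the per-category averages in a single pass over the rows. `parse_installs` and `average_installs_by_category` are hypothetical names, and the sample rows are invented and trimmed to six columns.

```python
from collections import defaultdict

def parse_installs(text):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))

def average_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Average the parsed install counts per category in one pass."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        category = row[category_idx]
        totals[category] += parse_installs(row[installs_idx])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

# Made-up rows in Google Play column order (first six columns only)
sample = [
    ['App A', 'TOOLS', '4.2', '10', '1M', '1,000+'],
    ['App B', 'TOOLS', '4.0', '20', '2M', '3,000+'],
    ['App C', 'SOCIAL', '4.5', '30', '3M', '10,000,000+'],
]

averages = average_installs_by_category(sample)
print(averages)
```

The values are lower bounds (the '+' means "at least this many"), so the averages are rough estimates rather than exact install counts.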
In [35]:
temp_list = []
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        temp_list.append((str(app[0]) + ':' + str(app[5])))

temp_list[0:20]
Out[35]:
['E-Book Read - Read Book for free:50,000+',
 'Download free book with green book:100,000+',
 'Wikipedia:10,000,000+',
 'Cool Reader:10,000,000+',
 'Free Panda Radio Music:100,000+',
 'Book store:1,000,000+',
 'FBReader: Favorite Book Reader:10,000,000+',
 'English Grammar Complete Handbook:500,000+',
 'Free Books - Spirit Fanfiction and Stories:1,000,000+',
 'Google Play Books:1,000,000,000+',
 'AlReader -any text book reader:5,000,000+',
 'Offline English Dictionary:100,000+',
 'Offline: English to Tagalog Dictionary:500,000+',
 'FamilySearch Tree:1,000,000+',
 'Cloud of Books:1,000,000+',
 'Recipes of Prophetic Medicine for free:500,000+',
 'ReadEra – free ebook reader:1,000,000+',
 'Anonymous caller detection:10,000+',
 'Ebook Reader:5,000,000+',
 'Litnet - E-books:100,000+']
In [25]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+
In [26]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
Hafizi Quran 15 lines per page : 1,000,000+
Quran for Android : 10,000,000+
Satellite AR : 1,000,000+
Oxford A-Z of English Usage : 1,000,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
English Dictionary - Offline : 10,000,000+
Bible KJV : 5,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Brilliant Quotes: Life, Love, Family & Motivation : 1,000,000+
Stats Royale for Clash Royale : 1,000,000+
Dictionary : 10,000,000+
wikiHow: how to do anything : 1,000,000+
EGW Writings : 1,000,000+
My Little Pony AR Guide : 1,000,000+
Spanish English Translator : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
JW Library : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
English Hindi Dictionary : 10,000,000+
English to Hindi Dictionary : 5,000,000+

In conclusion, this analysis begins to suggest that it would be advantageous to experiment with releasing popular books as standalone applications.
