Google Music Takeout – Track Data


I always find that working through a textbook gets quite dry, and I struggle to maintain interest when the project or subject matter is too esoteric.

So, armed with a little Python, Stack Overflow and a search engine, I decided to see what I could do with my own data.

Photo by Clem Onojeghuo on Unsplash

What have I been listening to for the last three or four years?

Google allows users access to all their past data, and as a long-time user of Google Music I wondered what the oversight on all this would be. Unlike Spotify, there doesn’t seem to be any scope for introspection into musical history or taste, which strikes me as odd. I guess Google is more about harvesting and selling the data, whereas Spotify realises that showing insight into their users’ activity and taste helps drive traffic and engagement levels.

Google Takeout Track dataset structure

unzipped takeout tracks, highlighting the weirder filenames

Seriously? A gazillion CSV files? Well… 14,646. That’s the best they can do? However, I then thought, hang on… combining multiple CSV files and cleaning up rubbish data in the dataset is the exact sort of basic data science 101 that every project has. Bring it on.

Track CSV file combination

Python to combine the .csv files, lifted from my Jupyter Notebook

import os

# change as required, obviously
directory = '/takeout_data/Google Play Music/Tracks'

csv_files = []

for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            # os.path.join handles the path separator on any OS,
            # unlike hard-coding "\\"
            csv_files.append(os.path.join(root, file))

print(len(csv_files))

Now let’s iterate over the files found in the selected folder, add these to a list and then concatenate them into a pandas DataFrame. I’m painfully aware that the snippet below isn’t fully “pythonic” and wouldn’t win any Stack Overflow plaudits. However, I am backing myself to improve this at the end, and I stated this would be a warts-and-all development blog. As such this is “get it working to make sure I am going in the right direction and this is a good idea overall” mode. If it’s working I can move on and start to make enough progress to motivate myself to keep coding on the train and before bed each night. Also, to be fair, this code will only be run once per archive, so it doesn’t need to be hugely optimised, considering the other demands on project time.

import pandas as pd

combined_csv = []
total = len(csv_files)

for i, csv_file in enumerate(csv_files):

    x = pd.read_csv(csv_file)

    # only keep tracks that have an artist and have actually been played
    if x['Artist'].notna().all() and x['Play Count'][0] > 0:
        if i % 100 == 0:
            print('100th Artist found: {}. Processing {}/{}'.format(
                x['Artist'].values, i, total))

        combined_csv.append(x)

combined_df = pd.concat(combined_csv, ignore_index=True)

Which leads to the Jupyter Notebook output:

printing the output solved my “is this working?” worries

After saving this combined .csv file I thought I would have a quick peek into the data, to check what we were actually given.
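As a sketch, that save-and-peek step looks something like the following. The file name `combined_tracks.csv` and the toy rows are my own placeholders, not the real Takeout contents:

```python
import pandas as pd

# toy stand-in for the combined_df built by the concatenation step above;
# the column names are assumptions about the Takeout track schema
combined_df = pd.DataFrame({
    'Title': ['Song A', 'Song B'],
    'Artist': ['Artist 1', 'Artist 2'],
    'Play Count': [3, 7],
})

# persist the combined data so the 14,646 Takeout CSVs never need re-parsing
combined_df.to_csv('combined_tracks.csv', index=False)

# quick peek at what we were actually given
print(combined_df.head())
print(combined_df.columns.tolist())
```

Writing the file with `index=False` avoids pandas adding a spurious unnamed index column that would need cleaning up on the next read.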

Track combination sanity check

Ok, so now I know what I’ve listened to is all in one place. I now face some issues if I want to start answering the questions I had in mind.

  • Is this data reliable, how much cleaning and transformation will be needed?
  • When did these tracks get played?
  • Where can I find the metadata for this basic track info?
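For the reliability question, a quick sketch of the kind of checks I had in mind, using toy data in place of the real combined DataFrame (column names such as `Duration (ms)` are assumptions about the Takeout schema):

```python
import pandas as pd

# toy stand-in for the combined Takeout data, with deliberate dirt in it
combined_df = pd.DataFrame({
    'Title': ['Song A', None, 'Song C'],
    'Artist': ['Artist 1', 'Artist 2', None],
    'Duration (ms)': ['215000', '180000', 'not a number'],
    'Play Count': [3, 0, 7],
})

# how many missing values per column?
print(combined_df.isna().sum())

# how many duration values won't parse as numbers?
durations = pd.to_numeric(combined_df['Duration (ms)'], errors='coerce')
print('unparseable durations:', durations.isna().sum())
```

`errors='coerce'` turns anything unparseable into `NaN`, so counting the `NaN`s gives a first estimate of how much cleaning the dataset will need.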

I created a new Jupyter notebook in order to evaluate the data. As a cursory check I thought the following would be a good test:

  • See what the song length distribution was like
  • Top Artists
  • Total Number of Songs
  • DataFrame.head() displays
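Those four checks can be sketched as below, again with toy data standing in for the real combined DataFrame (the `Duration (ms)` column name is an assumption):

```python
import pandas as pd

# toy stand-in for combined_df
combined_df = pd.DataFrame({
    'Title': ['Song A', 'Song B', 'Song C', 'Song D'],
    'Artist': ['Artist 1', 'Artist 1', 'Artist 2', 'Artist 3'],
    'Duration (ms)': [215000, 180000, 240000, 95000],
    'Play Count': [3, 7, 1, 2],
})

# total number of songs
print('tracks:', len(combined_df))

# top artists by total play count
top_artists = (combined_df.groupby('Artist')['Play Count']
               .sum()
               .sort_values(ascending=False))
print(top_artists.head())

# song length distribution, in minutes
minutes = combined_df['Duration (ms)'] / 60000
print(minutes.describe())

# and the usual head() display
print(combined_df.head())
```

`describe()` gives the count, mean, quartiles and min/max in one call, which is enough of a distribution summary for a sanity check before bothering with a histogram.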

Jupyter Notebook : TakeoutTrackAnalysisBlog


Further questions (for another blog post)

  • What are my most popular artists, albums and genres?
  • Do I have any dirty / unknown track data?
  • What are my top 10 shortest and longest tracks?

Foray Into Data Science – Part II


As mentioned in my earlier article, I am starting to get into Data Science. I might be quite late to the party in some respects, but better late than never, I guess.

https://financialraddeveloper.com/2019/09/01/initial-foray-into-data-science-with-python/

Building from my initial thoughts and ideas, I thought I would go through the process, as I experienced it, in a series of posts here. I often read highly polished “how to” articles, but I would like to include a warts-and-all journey through all the obstacles, sticking points and general quagmire of head scratching and confusion that people (maybe it is just me) experience when starting out on something new.

To be honest, even restarting / setting up this blog after a significant hiatus was a > 1hr chore I wasn’t expecting. Even then I am not sure if the WordPress.exe is worth the time or RAM it sucks in.

Equipment:

This shaped quite a bit of my decisions later on down the line so I will state what I have, so that those calls don’t look so odd later on.

The ASUS Powerbook tablet combo cost a few hundred quid back in 2015, but my wife didn’t like it, so I dusted it off after a colleague intrigued me with his much fancier Windows Surface Pro-style laptop. For the < £150 it can be purchased for now, it is a total bargain, especially with the ability to put a memory card in the tablet part and a 500 GB HDD in the keyboard. As I reckon I cannot realistically develop any meaningful Python without a half-decent keyboard (and certainly not an on-screen one), I installed PyCharm and Anaconda on the E: drive that comes up when the unit is “complete”.

One nice side effect is that I have to really consider performance and efficiency both when I am developing and in my own environment, which explains my early-doors switch away from PyCharm and into Spyder and Jupyter Notebooks. The overhead of the Java-based PyCharm IDE caused my morning train development to really slow to a crawl.

On a side note, I got a great deal on my HP Z600 Workstation, which, until my clients switch to Win 10, serves as a clone of my professional development workstation.

This will be reclaimed old stock from somewhere, but I love a bargain, and the idea of getting that much RAM and CPU for under £500, when some company probably specced that bad boy 5+ years ago for £5k, makes me very happy.
