Google Music Takeout – Track Data


I always find that working through a textbook gets quite dry, and I struggle to maintain interest when the project or subject matter is too esoteric.

So, armed with a little Python, Stack Overflow and a search engine, I decided to see what I could do with my own data.

Photo by Clem Onojeghuo on Unsplash

What have I been listening to for the last three or four years?

Google allows users access to all their past data, and as a long-time user of Google Music I wondered what overview of it all I could get. Unlike Spotify, there doesn’t seem to be any scope for introspection into your musical history or taste, which strikes me as odd. I guess Google is more about harvesting and selling the data, whereas Spotify realises that showing insight into their users’ activity and taste helps drive traffic and engagement.

Google Takeout Track dataset structure

unzipped takeout tracks, highlighting the weirder filenames

Seriously? A gazillion CSV files? Well… 14,646. That’s the best they can do? However, I then thought, hang on… combining multiple CSV files and cleaning up rubbish data is exactly the sort of basic Data Science 101 task that every project involves. Bring it on.

Track csv file combination

Python to combine the .csv files, lifted from my Jupyter Notebook

import os

# change as required, obviously
directory = '/takeout_data/Google Play Music/Tracks'

csv_files = []

# walk the takeout folder and collect the full path of every .csv file
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            csv_files.append(os.path.join(root, file))

print(len(csv_files))

Now let’s iterate over the files found in the selected folder, add each one to a list, and then concatenate them into a pandas DataFrame. I’m painfully aware that the snippet below isn’t fully “pythonic” and wouldn’t win any Stack Overflow plaudits. However, I am backing myself to improve it at the end, and I did say this would be a warts-and-all development blog. As such, this is “get it working to make sure I am going in the right direction and this is a good idea overall” mode. If it’s working I can move on and make enough progress to motivate myself to keep coding on the train and before bed each night. To be fair, this code will only be run once per archive, so it doesn’t need to be hugely optimised, considering the other demands on project time.

import pandas as pd

combined_csv = []
total = len(csv_files)

for i, csv_file in enumerate(csv_files):

    x = pd.read_csv(csv_file)

    # keep only tracks that have a named artist and at least one play
    if pd.notna(x['Artist'].iloc[0]) and x['Play Count'].iloc[0] > 0:
        if i % 100 == 0:
            print('100th artist found: {}. Processing {}/{}'.format(
                x['Artist'].iloc[0], i, total))

        combined_csv.append(x)

combined_df = pd.concat(combined_csv)

Which leads to the Jupyter Notebook output:

printing the output solved my “is this working?” worries

After saving this combined .csv file I thought I would have a quick peek into the data, to check what we were actually given.
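For reference, the saving step is a one-liner along these lines; the filename here is just an example, so call it whatever suits:

# write the combined track data out to a single CSV for later analysis
# (the filename is only an example)
combined_df.to_csv('combined_tracks.csv', index=False)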

Track combination sanity check

OK, so now everything I’ve listened to is in one place. I now face some issues if I want to start answering the questions I had in mind.

  • Is this data reliable, and how much cleaning and transformation will be needed?
  • When did these tracks get played?
  • Where can I find the metadata for this basic track info?

I created a new Jupyter notebook in order to evaluate the data. As a cursory check, I thought the following would make a good set of tests:

  • See what the song length distribution was like
  • Top Artists
  • Total Number of Songs
  • DataFrame.head() displays

Jupyter Notebook : TakeoutTrackAnalysisBlog
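In case the embedded notebook doesn’t display, here is a rough sketch of those checks against the combined data. The filename and the 'Duration (ms)' column name are assumptions based on my export, so adjust as required.

import pandas as pd
import matplotlib.pyplot as plt

# reload the combined track data saved earlier (filename is an example)
df = pd.read_csv('combined_tracks.csv')

# DataFrame.head() display of the raw rows
print(df.head())

# total number of songs
print('Total tracks: {}'.format(len(df)))

# top artists by number of tracks
print(df['Artist'].value_counts().head(10))

# song length distribution, converted from milliseconds to minutes
(df['Duration (ms)'] / 60000).plot.hist(bins=50)
plt.xlabel('Track length (minutes)')
plt.show()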


Further questions (for another blog post)

  • What are my most popular artists, albums and genres?
  • Do I have any dirty / unknown track data?
  • What are my top 10 shortest and longest tracks?

Initial foray into Data Science with Python


Professionally I have undertaken a major segue into Python from the usual Excel VBA development environment.

However, this has remained very tightly linked to the quant library and to solving specific business requirements, as before. So I face the interesting situation of nominally becoming a “Python Developer” without having the necessary interview answers or expected job skills.

What I mean is that with Excel VBA work within finance, and especially supporting the desks of wholesale banks, certain things are expected: working with Front Office designed sheets and integrating those, along with the desk requirements, into solutions using the IT-designated strategic frameworks and technology.

So while you are expected to be very experienced and capable in Excel VBA itself, it is understood that this will be backed up with a large amount of esoteric and niche software across the “tech stack”: mainly IT-developed XLAs / sheet utilities and other “best practice”, sitting on a normally mature (and highly guarded IP) layer of quant library in some other language such as C++ or Java.

So getting things done is a case of elegantly adding the solution into the spaghetti code that is already there, without reinventing the wheel and/or making it any worse than it already is. The majority of the heavy lifting on data manipulation, trade access, market data retrieval and manipulation, and valuation functionality will already be implemented by the quant / strats function.

However, the open nature of the Python distribution and usage paradigm would make it insane for a bank to reinvent or ignore NumPy, pandas, scikit-learn etc. In my opinion, not using the basic Python libraries available in Anaconda is madness.

So this leaves me in the strange situation of using Python coupled with the quant library to solve complex business problems without actually using much of the available libraries themselves.

In effect I will be a two-year Python developer (on top of 12–15 years of financial software development) without really being able to back that up, unless discussing highly proprietary Variable Annuity cases in a future project or interview.

Anyone who knows large investment / wholesale banking IT will know that picking up and running with some technology and thinking that is new (to the bank) isn’t the done thing. It is a configuration nightmare and makes a lot of senior people in both Front Office and IT quite nervous.

New is bad; things should creep in slowly. Well… Python has crept in, and our small team has a chance to utilise most of its power as we see fit. So I plan to familiarise myself with the Data Science elements via extracurricular development, hoping that keeps my CV sharp while also providing a chance to try something interesting.
