Welcome to the Spotify Rehydrator!¶
The Spotify Rehydrator was created to provide a simple way to generate full datasets of track features from user-owned Spotify data. It relies on the excellent Spotipy library and brings together a series of API calls in a convenient way that can manage data from multiple different people, as would be common in a research study. It can also be used by individuals who are curious to learn more about their own data! The idea of a rehydrator was inspired by similar work being done to enable sharing of Twitter datasets for research purposes.
Before you use the rehydrator, please make sure to read the Disclaimers to get an understanding of the limitations of the search strategy used.
User Guide¶
The Spotify Rehydrator primarily operates through the Rehydrator class. The required inputs for this class are an input folder, an output folder, and a Client ID and Client Secret from the Spotify Developer Portal. These are used for authenticating the API calls. You can then call the run() method.
Note
To request developer credentials go to Spotify’s developer portal.
You will need to ‘create an app’, which will have credentials associated with it.
Your app dashboard will give you access to your Client ID and Client Secret.
Install the Spotify Rehydrator using pip:
pip install spotifyrehydrator
Assuming you have set your Client ID and Client Secret as environment variables then this is an example of how you could run the Rehydrator:
import os
import pathlib

from spotifyrehydrator import Rehydrator

# Resolve the input and output folders relative to this script.
base_dir = pathlib.Path(__file__).parent.absolute()

Rehydrator(
    input_path=os.path.join(base_dir, "input"),
    output_path=os.path.join(base_dir, "output"),
    client_id=os.getenv("SPOTIFY_CLIENT_ID"),
    client_secret=os.getenv("SPOTIFY_CLIENT_SECRET"),
).run(return_all=True)
By default, the .run() method produces the following columns: the Spotify track ID of the returned track, the name of the artist of the returned track, and the name of the returned track. These are joined with the searched artist and track, the person ID where relevant, and the time metadata in the original .json file.
There are then three optional arguments:
* artist_info = True will return the popularity of the artist returned and a list of genres attributed to that artist, provided by the Artists API endpoint.
* audio_features = True will return a column for each of the audio features provided by the Tracks API.
* return_all = True will return both of the above.
Be aware that extra arguments involve more API calls and so may take longer.
Expected formats¶
Streaming History JSON¶
This package is designed to work with the files named StreamingHistory.json that are sent to users as part of their data package if they request their own Spotify data. The file will contain up to the past year of the user’s listening data.
This data should be in one or more files with a list of JSON objects that look like this:
{
"endTime" : "2019-01-19 17:01",
"artistName" : "An Artist",
"trackName" : "A Track Name",
"msPlayed" : 19807
}
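As a rough illustration of this format, the sketch below parses a small list of such objects with Python's standard json module and totals the listening time per artist. The sample records and the helper name total_ms_by_artist are made up for the example and are not part of the package.

```python
import json
from collections import defaultdict

# Hypothetical sample in the same shape as a StreamingHistory file.
SAMPLE = """
[
  {"endTime": "2019-01-19 17:01", "artistName": "An Artist",
   "trackName": "A Track Name", "msPlayed": 19807},
  {"endTime": "2019-01-19 17:05", "artistName": "An Artist",
   "trackName": "Another Track", "msPlayed": 200000},
  {"endTime": "2019-01-19 17:09", "artistName": "Other Artist",
   "trackName": "Third Track", "msPlayed": 1000}
]
"""


def total_ms_by_artist(records):
    """Sum msPlayed for each artistName in a list of listening events."""
    totals = defaultdict(int)
    for record in records:
        totals[record["artistName"]] += record["msPlayed"]
    return dict(totals)


records = json.loads(SAMPLE)
totals = total_ms_by_artist(records)
```

For a real file you would read the text from disk instead of using the inline sample.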
Input folder¶
The input folder should contain a series of Streaming History JSON files. If you have files belonging to multiple individuals then the package expects the unique identifier for each person to be the prefix, followed by an underscore. For example:
# input folder
person001_StreamingHistory0.json
person001_StreamingHistory1.json
person002_StreamingHistory0.json
This would result in two rehydrated files being saved to the output folder:
# output folder
person001-rehydrated.tsv
person002-rehydrated.tsv
You could also input several files without any underscores to represent individuals. These would all be combined and saved in one output file.
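The naming convention above could be sketched as follows. This is only an illustration of the grouping rule described here, not the package's own code: the helper name group_by_person is hypothetical, and splitting at the first underscore is an assumption.

```python
from collections import defaultdict
from pathlib import Path


def group_by_person(filenames):
    """Map each person prefix to its files.

    Files without an underscore share the key None, mirroring the
    'combined into one output file' behaviour described above.
    """
    groups = defaultdict(list)
    for name in sorted(filenames):
        base = Path(name).name
        # Assumption: everything before the first underscore is the person ID.
        person = base.split("_", 1)[0] if "_" in base else None
        groups[person].append(base)
    return dict(groups)


groups = group_by_person(
    [
        "person001_StreamingHistory0.json",
        "person001_StreamingHistory1.json",
        "person002_StreamingHistory0.json",
    ]
)
```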
Useful information¶
If the output directory does not exist then it will be created.
Rehydration for one individual can take 15 minutes or more depending on how many songs there are.
If a file for the next individual’s data to be rehydrated already exists in the output directory then that person will be skipped. You will need to delete or remove their file from the output folder for the rehydrator to process their data.
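Before a long run it can help to check which people would be skipped. The sketch below assumes the output naming shown earlier (a person ID followed by -rehydrated.tsv); the helper name already_rehydrated is hypothetical and not part of the package.

```python
from pathlib import Path


def already_rehydrated(output_dir, person_ids):
    """Return the person IDs that already have an output file."""
    out = Path(output_dir)
    # Assumption: output files are named '<person_id>-rehydrated.tsv'.
    return [p for p in person_ids if (out / f"{p}-rehydrated.tsv").exists()]
```

Anyone returned by this check would need their file moved or deleted before their data is processed again.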
Disclaimers¶
Not all tracks can be retrieved from the API. In our experience about 5% of tracks cannot be found on the API. These will have a value of NONE in the output files.
There is not a guaranteed match between the first returned item in a search and the track you want. Comparing msPlayed with the track length is a good way to test this since msPlayed should not exceed the track length.
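A minimal sketch of that check is shown below. It assumes each row carries the track length in a field called duration_ms, which is the name Spotify's audio features use; whether the rehydrated output uses exactly that column name is an assumption, and flag_suspect_matches is a hypothetical helper.

```python
def flag_suspect_matches(rows, tolerance_ms=0):
    """Return rows whose msPlayed exceeds the matched track's length."""
    suspects = []
    for row in rows:
        duration = row.get("duration_ms")
        if duration is None:
            # No track was matched, so there is nothing to compare.
            continue
        if row["msPlayed"] > duration + tolerance_ms:
            suspects.append(row)
    return suspects


# Made-up rows for illustration only.
rows = [
    {"trackName": "A Track Name", "msPlayed": 19807, "duration_ms": 210000},
    {"trackName": "Short Jingle", "msPlayed": 95000, "duration_ms": 30000},
    {"trackName": "Unmatched", "msPlayed": 5000, "duration_ms": None},
]
suspects = flag_suspect_matches(rows)
```

A flagged row is not proof of a wrong match, only a prompt to inspect it by hand.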
Code Documentation¶
The main module for the spotifyrehydrator package contains three dataclasses.
Track operates on a single Track instance, starting from just a name and an artist, as would be provided in self-requested data. It is possible to use Track to get information about a single Track.
Tracks contains similar logic to Track, but makes use of the batch endpoints to save on API calls. It is therefore more efficient than Track for many calls, and its I/O is primarily Pandas DataFrame objects rather than dictionaries.
Rehydrator is mainly intended to rebuild multiple datasets in instances when you have many listening histories from multiple different users with additional metadata such as datetimes. The Rehydrator is the only class which will write files.
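To give a feel for why batch endpoints save calls, the sketch below splits a list of track IDs into fixed-size groups, so 250 IDs need three requests rather than 250. The helper name batched and the batch size of 100 are assumptions for illustration; check the Spotify Web API reference for the limit of each endpoint.

```python
def batched(ids, size):
    """Yield consecutive slices of ids with at most size items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Made-up IDs; a real run would use Spotify track IDs.
track_ids = [f"id{i}" for i in range(250)]
batches = list(batched(track_ids, 100))
```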
- class utils.Rehydrator(input_path: str, output_path: str, client_id: str, client_secret: str, _person_ids: list = <function Rehydrator._person_ids>)¶
Class to iterate through input files, generate full datasets for each listening history and save the data to the output folder. Will create output folder if it does not exist.
- input_path¶
path to the directory (folder) where the input json files are stored.
- Type
str
- output_path¶
path to the directory (folder) where the output .tsv files are saved.
- Type
str
- client_id¶
Spotify API client ID Credentials
- Type
str
- client_secret¶
Spotify API client secret Credentials
- Type
str
- _person_ids¶
A list of the unique ‘people’ that files were identified for, or None.
- Type
list or None
Example
>>> Rehydrator(input_path, output_path, client_id, client_secret).run()
- rehydrate(person_id: Optional[str] = None, return_all: bool = False, audio_features: bool = False, artist_info: bool = False) → pandas.core.frame.DataFrame¶
For a single person’s set of data, use the Tracks class to get all of the track IDs and features, then join these on the full listening history data. Save out the complete data, and return it too.
- Parameters
person_id (str = None) – Unique ID for the person this set of data belongs to.
return_all (bool, default = False) – Return both audio_features and artist_info
audio_features (bool, default = False) – Return each track’s audio features.
artist_info (bool, default = False) – Return the popularity and genre list for each track’s artist
- run(return_all: bool = False, audio_features: bool = False, artist_info: bool = False) → None¶
Iterate through each person’s set of data by calling the ‘rehydrate’ method on each.
- Parameters
return_all (bool, default = False) – Return both audio_features and artist_info
audio_features (bool, default = False) – Return each track’s audio features.
artist_info (bool, default = False) – Return the popularity and genre list for each track’s artist
- class utils.Track(name: str, artist: str, client_id: str, client_secret: str)¶
A class that searches for and returns a Spotify ID and other optional information for a track, given a trackName and an artistName.
- name¶
The name of the track.
- Type
str
- artist¶
The name of the artist.
- Type
str
- client_id¶
Spotify API client ID Credentials
- Type
str
- client_secret¶
Spotify API client secret Credentials
- Type
str
Example
- get(return_all: bool = False, returned_artist: bool = False, returned_track: bool = False, artist_info: bool = False, audio_features: bool = False) → dict¶
Calls search_results() to get the spotifyID, trying to remove apostrophes and dashes if an IndexError is raised. Returns a dictionary of objects, with spotifyID and then any other objects as defined in function call.
- Parameters
return_all (bool, default = False) – Return both audio_features and artist_info
audio_features (bool, default = False) – Return each track’s audio features.
artist_info (bool, default = False) – Return the popularity and genre list for each track’s artist
- search_results(remove_char=None) → dict¶
Searches the Spotify API for the track and artist and returns the whole results object.
Takes remove_char as a char to remove from the artist and track before searching if needed - this can improve results.
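The retry behaviour described for get() and search_results() could look roughly like the sketch below. It is not the package's implementation: search_with_fallback and fake_search are hypothetical names, and the search function is a stand-in that returns a plain list instead of calling the Spotify API.

```python
def search_with_fallback(search, name, artist, chars=("'", "-")):
    """Try the raw query first, then retry with each character removed."""
    for remove_char in (None, *chars):
        if remove_char is None:
            query_name, query_artist = name, artist
        else:
            query_name = name.replace(remove_char, "")
            query_artist = artist.replace(remove_char, "")
        try:
            # Taking the first hit raises IndexError when nothing matched.
            return search(query_name, query_artist)[0]
        except IndexError:
            continue
    return None


def fake_search(name, artist):
    """Stand-in for an API search: only matches the apostrophe-free title."""
    return ["id123"] if name == "Dont Stop" else []


result = search_with_fallback(fake_search, "Don't Stop", "A Band")
```

Here the first query finds nothing, and the second, with the apostrophe removed, returns the stand-in ID.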
- class utils.Tracks(data: pandas.core.frame.DataFrame, client_id: str, client_secret: str)¶
A class that takes a dataframe of listening events with artistName and trackName, and returns these with the trackID and audio features of each track as a dataframe.
- data¶
- Type
A dataframe with two columns ‘artistName’ and ‘trackName’.
- client_id¶
- Type
Spotify API client ID Credentials
- client_secret¶
- Type
Spotify API client secret Credentials
- _sp_auth¶
- Type
Spotipy OAuth object for API calls.
Example
>>> Tracks(data, client_id, client_secret).get(return_all=True)
This will return a pd.DataFrame with feature columns filled for each unique track in the original data.
- Raises
KeyError – If the input data provided does not contain artistName and trackName columns.
- get(return_all: bool = False, audio_features: bool = False, artist_info: bool = False) → pandas.core.frame.DataFrame¶
Get the requested data for each track. Returns a dataframe of unique tracks.
- Parameters
return_all (bool, default = False) – Return both audio_features and artist_info
audio_features (bool, default = False) – Return each track’s audio features.
artist_info (bool, default = False) – Return the popularity and genre list for each track’s artist
Contributing¶
Contributions to the package are very welcome!
If you would like to add a new feature then