The dataset comes from the Guardian’s Datablog.
The 1,000-song database had only five features (Theme, Title, Artist, Year, and Spotify URL), so I tried scraping the Wikipedia infobox for each artist with Beautiful Soup. I successfully scraped one, but realized it would be cumbersome to loop through all of the artists, particularly because their names are not standardized. So I went with the old-school, 2000s method: searching the internet for each and every unique artist (there were 600+) and recording their genre, location, group/solo status, and gender. It took one weekend to complete, and I think it was well worth it, since the enriched data lets me be more flexible in my analysis.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch a single artist page and parse it.
url = "http://en.wikipedia.org/wiki/The_Beatles"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "lxml")

# The infobox is the summary table on the right-hand side of the article.
table = soup.find('table', class_='infobox vcard plainlist')

result = {}
exceptional_row_count = 0
for tr in table.find_all('tr'):
    if tr.find('th'):
        # Header cell is the field name; the data cell (if any) is its value.
        result[tr.find('th').text] = tr.find('td').text if tr.find('td') else None
    else:
        # Rows without a header cell (e.g. the image row) carry no field.
        exceptional_row_count += 1

if exceptional_row_count > 1:
    print('WARNING ExceptionalRow>1: ', table)
print(result)
{'The Beatles': None, 'Background information': None, 'Origin': 'Liverpool, England', 'Genres': '\n\n\nRock\npop\n\n\n', 'Years active': '1960–1970', 'Labels': '\n\n\nParlophone\nApple\nCapitol\n\n\n', 'Associated acts': '\n\n\nThe Quarrymen\nBilly Preston\nPlastic Ono Band\n\n\n', 'Website': 'thebeatles.com', '': None, 'Past members': '\n\nJohn Lennon\nPaul McCartney\nGeorge Harrison\nRingo Starr\n\nSee members section for others'}
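The raw output above still has padding newlines and empty header rows, so it would need tidying before it could sit next to the song data. Here is a minimal sketch of how that cleanup might look. The `clean_infobox` helper and the `sample` dictionary are hypothetical names made up for illustration; they are not part of the scrape I actually ran.

```python
def clean_infobox(raw):
    """Tidy a scraped infobox dict (hypothetical helper, for illustration).

    Multi-line values become lists; single-line values stay strings.
    """
    cleaned = {}
    for key, value in raw.items():
        if not key or value is None:
            # Skip section headers and the empty-key row: they hold no data.
            continue
        # Split on newlines and drop the blank padding entries.
        parts = [p.strip() for p in value.split("\n") if p.strip()]
        cleaned[key] = parts[0] if len(parts) == 1 else parts
    return cleaned


# A small excerpt of the scraped result shown above.
sample = {
    "The Beatles": None,
    "Origin": "Liverpool, England",
    "Genres": "\n\n\nRock\npop\n\n\n",
    "": None,
}
print(clean_infobox(sample))
# {'Origin': 'Liverpool, England', 'Genres': ['Rock', 'pop']}
```

Keeping multi-valued fields such as Genres as lists makes it easier to count or filter by individual genre later on.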