Published 4 years, 7 months ago
To kick off a new category of "dataset" blog posts, I'll be sharing a relatively small dataset I scraped from the Sonic Fandom Wiki near the end of 2018.
At the time, I wanted to start messing around with neural-net-based chatbots, and needed some data to use for training. Rather than using a pre-existing dataset, I thought it'd be more fun to scrape my own.
The users over at the Sonic Fandom Wiki have diligently transcribed the scripts of several Sonic TV shows (and recently the Sonic Movie). I thought this would make for a fun and unique dataset to train a chatbot, and I was right! I did the scraping itself with the Python Scrapy library.
The dataset is a gzip-compressed CSV file with two columns: the first contains a character name, and the second contains a quote from that character. In total, it has 34,618 lines of dialog. Unfortunately, I didn't have the crawler record the TV show that each quote originated from. Furthermore, while the quotes may look like they're in their original order, this is not always the case, so don't rely on it.
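If you want to poke at the file from Python, here's a minimal sketch using only the standard library. It assumes the file is UTF-8 encoded, which I haven't confirmed here, and `load_quotes` is just a name I made up for this example. It makes no attempt to detect a header row, so if the first row turns out to be one, drop it yourself.

```python
import csv
import gzip


def load_quotes(path):
    """Read (character, quote) pairs from a gzip-compressed, two-column CSV.

    Assumption: the file is UTF-8 encoded. Rows that don't have exactly
    two fields (blank lines, for example) are skipped.
    """
    # "rt" opens the gzip stream in text mode; newline="" lets the csv
    # module handle quoted fields that contain line breaks.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return [(row[0], row[1]) for row in csv.reader(handle) if len(row) == 2]
```

Calling `load_quotes("sonic-transcripts.csv.gz")` should give you a list of `(character, quote)` tuples in file order, which, as noted above, isn't guaranteed to be the order the lines were spoken in.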
Do note that I didn't do much preprocessing, so a particular character might go by different names depending on the TV show the quote came from.
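If the inconsistent names get in your way, one option is to count how often each name appears and then fold the variants together with a hand-written alias table. The sketch below is only an illustration: `name_counts` and `normalize_names` are hypothetical helpers, and the alias entries in the usage comment are examples I haven't checked against the dataset.

```python
from collections import Counter


def name_counts(rows):
    """Count how many quotes each character name has, to help spot variants."""
    return Counter(name for name, _ in rows)


def normalize_names(rows, aliases):
    """Replace known alias spellings with a canonical name.

    `rows` is an iterable of (character, quote) pairs and `aliases` maps an
    alias to the canonical name. Matching ignores case and surrounding
    whitespace; names without an alias entry are only stripped.
    """
    lookup = {alias.strip().lower(): canonical for alias, canonical in aliases.items()}
    return [
        (lookup.get(name.strip().lower(), name.strip()), quote)
        for name, quote in rows
    ]


# Example usage (the alias entries are illustrative, not taken from the data):
# aliases = {"Dr. Robotnik": "Eggman", "Dr. Eggman": "Eggman"}
# cleaned = normalize_names(rows, aliases)
# print(name_counts(cleaned).most_common(20))
```

Looking at the rarest names in `name_counts` first is probably the quickest way to find variants worth adding to the alias table.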
sonic-transcripts.csv.gz
#dataset