This work was the result of a two-day Nodebox workshop at the Vilnius Academy of Arts. By that time I was very familiar with Nodebox, so I decided to do something a bit more challenging than usual and picked a data-heavy subject: the scripts of hundreds of movies released over the last 50 years.
The goal was to identify visible changes in language. Which words were never used during the ’60s but became popular later? Which words used to be very common but are barely spoken now?
To answer these questions I used a lot of scrapers, downloaders and text analysis libraries. The process went as follows:
- Identified the top 10 English-language movies for each year, according to IMDb.
- Downloaded English-language subtitles for all of them.
- Stripped the “markup” from each subtitle file, leaving only dialogue.
- Analyzed all the dialogue together to identify the most common words and additional “stop words” (words so common that it doesn’t make sense to include them, like “and”).
- For the 50 most promising words, computed the relative frequency of each word in each year.
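The cleaning and counting steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used for the poster. It assumes the subtitles are in SubRip (.srt) form (a cue number, a timing line, then dialogue), and the function names, the tiny stop-word list and the sample text are all made up for the example.

```python
import re
from collections import Counter

# Assumption: SubRip-style timing lines such as "00:00:01,000 --> 00:00:03,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->")
# Inline formatting such as <i>...</i> or {\an8}.
TAG = re.compile(r"<[^>]+>|\{[^}]+\}")
WORD = re.compile(r"[a-z']+")

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"and", "the", "a", "to", "of", "i", "you", "it"}


def clean_subtitles(raw):
    """Return only the dialogue lines from SRT-style text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines.
        # Note: a dialogue line consisting only of digits is dropped too.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        line = TAG.sub("", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


def relative_frequencies(dialogue_by_year, words):
    """Map each word to {year: occurrences / non-stop-word tokens that year}."""
    result = {w: {} for w in words}
    for year, dialogue in dialogue_by_year.items():
        tokens = [t.strip("'") for t in WORD.findall(dialogue.lower())]
        tokens = [t for t in tokens if t and t not in STOP_WORDS]
        counts = Counter(tokens)
        total = len(tokens)
        for w in words:
            result[w][year] = counts[w] / total if total else 0.0
    return result


# Made-up sample data, for illustration only.
sample = """1
00:00:01,000 --> 00:00:03,000
<i>Groovy, man.</i>

2
00:00:04,000 --> 00:00:06,000
Call the computer guy.
"""
dialogue = clean_subtitles(sample)
freqs = relative_frequencies(
    {1969: "Groovy, man. Groovy!", 1999: "Call the computer guy."},
    ["groovy", "computer"],
)
```

Dividing by the year's total word count is what makes years comparable: a year with longer, talkier movies would otherwise look like it used every word more often. Whether stop words are counted in that total is a design choice; this sketch leaves them out.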
Finally, we visualized each chart as a waveform using Nodebox, creating the final poster.
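One way to turn a per-year frequency series into a waveform-like shape is to mirror it around a horizontal midline. The sketch below is an assumption about what such a shape could look like, not the Nodebox network used for the poster. It is plain Python that only computes outline points, and `waveform_outline` is a hypothetical helper name.

```python
def waveform_outline(values, width, height):
    """Return a closed outline as (x, y) points, mirrored around the midline.

    values: non-negative frequencies, one per year, in chronological order.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    peak = max(values) or 1.0  # avoid dividing by zero for an all-zero series
    mid = height / 2.0
    step = width / (len(values) - 1)
    # Upper edge: the higher the frequency, the further above the midline.
    top = [(i * step, mid - (v / peak) * mid) for i, v in enumerate(values)]
    # Lower edge: the same points reflected, walked right to left.
    bottom = [(x, height - y) for x, y in reversed(top)]
    return top + bottom


outline = waveform_outline([0.0, 1.0, 0.5], width=100, height=20)
```

The resulting points can be handed to any drawing tool as a polygon. Scaling each series by its own peak makes every word fill the same height, which shows the shape of the trend but hides how common the words are relative to each other.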
- Juste Ziliute
- Augustinas Paukste
- Python Patterns library