<p>Home: A place for sharing data science projects and discoveries. Brian Holligan (brian.matthew.holligan@gmail.com). Feed generated 2016-11-29.</p>
<h2 id="clothing-predictor-app">Clothing Predictor App</h2>
<p>Posted 2016-10-04 · <a href="https://bholligan.github.io/projects/fashion_app">permalink</a></p>
<p>For my final project at Metis I created a web app that uses convolutional neural networks to identify images containing a single person, and then predict the clothing being worn by that person.</p>
<h3 id="apphttp525318275-app"><a href="http://52.53.182.75/" title="App">App</a></h3>
<p>Feel free to test out your own selfies!</p>
<h3 id="github-repohttpsgithubcombholliganimageclass-repo"><a href="https://github.com/bholligan/image_class" title="Repo">Github Repo</a></h3>
<p>Project repo with further information.</p>
<h3 id="presentation">Presentation</h3>
<p>For a high-level overview of what the app does, check out the video below.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OYvD0ljakcc" frameborder="0" allowfullscreen=""></iframe>
<h2 id="was-twitter-trashing-rio">Was Twitter Trashing Rio?</h2>
<p>Posted 2016-09-16 · <a href="https://bholligan.github.io/projects/olympic_tweets">permalink</a></p>
<p>At Metis we learned about natural language processing (NLP) and were given an opportunity to apply NLP concepts to a topic of our choice. With the Olympics in full swing at the time and dominating my Twitter feed, I was interested in seeing what an analysis of Olympic tweets would produce. I tend to follow comedians and sports bloggers on Twitter, and prior to this project I was inundated with jokes about Rio’s unique brand of Olympic preparation, which involved unfinished venues, sewage water, athlete robberies, and dead bodies washing ashore. On top of that, the period during which I collected data was the final weekend, which included the closing ceremonies, Usain Bolt running away with another 3 golds, USA basketball finally getting their act together, and Ryan Lochte bringing his lovable brand of public urination to Brazil. My expectation was to see the negative sentiment that populated my Twitter feed reflected in a broader sample of Olympic-related tweets. I am aware that my Twitter experience is an echo chamber of people with similar beliefs, and I was intrigued to see how that compares to the greater public.</p>
<h3 id="d3-vizhttpsbholligangithubioolympicd3-click-to-check-it-out"><a href="https://bholligan.github.io/olympicd3/" title="click to check it out">D3 Viz</a></h3>
<p>The result of my project is a d3 visualization that shows the most popular topics on Twitter during the closing ceremonies of the games and their associated sentiment. Here are a few highlights:</p>
<ul>
  <li>You can clearly see when Tokyo 2020 was announced on the live broadcast and again on the NBC tape delay.</li>
  <li>It was interesting to see “sad” pick up in usage towards the end of the live broadcast.</li>
  <li>The laughing emoji stayed relatively constant, leading me to believe there were no particularly funny moments.</li>
  <li>The Brits used more positive terms than Americans when discussing their nation’s Olympic team.</li>
</ul>
<p>In summary, the pervasive negative sentiment I expected to see was not present during the closing ceremonies among tweeters using Rio hashtags. When aggregating tweets, generic subjects and feelings inevitably prevailed, but mapping their relevance over time provided insight into how the broad viewership was reacting. The sentiment analysis tool I used is a rudimentary system that struggled to provide value, especially when the topics themselves were defined by emotions such as “love” or “sad”. It’s possible that a different result could be found by incorporating tweets devoid of hashtags, as hashtag usage tends to be higher among corporate and news-affiliated users.</p>
<p>Here’s <a href="https://github.com/bholligan/olympic_NLP/blob/master/Tweet%20Text.ipynb" title="Notebook">a link</a> to my Jupyter Notebook showing how I used TF-IDF, truncated singular value decomposition, and sentiment analysis to acquire my results.</p>
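The core of that pipeline can be sketched with scikit-learn. The toy tweets below stand in for the real scraped corpus, and the parameters are illustrative, not the ones used in the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for the scraped Olympic tweets (illustrative only).
tweets = [
    "usain bolt wins gold again #rio2016",
    "so sad the olympics are over #closingceremony",
    "tokyo 2020 announced at the closing ceremony",
    "team usa basketball takes gold #rio2016",
]

# TF-IDF turns each tweet into a weighted term vector.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(tweets)

# Truncated SVD (a.k.a. latent semantic analysis) compresses the sparse
# term matrix into a small number of topic dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
topics = svd.fit_transform(tfidf)

print(tfidf.shape)   # (4, n_terms)
print(topics.shape)  # (4, 2)
```

Each row of `topics` places a tweet in the reduced topic space; sentiment scoring would then be applied per tweet alongside these topic weights.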
<p>The javascript to build the d3 visual is <a href="https://github.com/bholligan/bholligan.github.io/blob/master/olympicd3/index.html" title="Javascript">here</a>.</p>
<h2 id="does-movie-genre-correlate-to-revenue-part-2">Does Movie Genre Correlate to Revenue? (Part 2)</h2>
<p>Posted 2016-07-30 · <a href="https://bholligan.github.io/projects/hollywood2">permalink</a></p>
<h2 id="part-2">Part 2</h2>
<h3 id="inspection">Inspection</h3>
<p>Before doing any modeling, I love investigating a seaborn pair plot of the data. The output shows the pair-wise relationship between all of the features. A couple of examples can be seen in the figure below. Perusing these plots is an awesome way to sanity-check the collected data and look for any particularly interesting relationships. My objective was to predict domestic gross, and the pair plot helped set expectations for which features would be influential.</p>
<figure>
<img src="/images/pairplot.png" />
<figcaption>Example pair plots looking at adjusted domestic gross</figcaption>
</figure>
<h3 id="lasso">Lasso</h3>
<p>I filtered out all movies with missing values, then performed a 70-30 train-test split. My first-pass regression model used a Lasso estimator on normalized values of all the features. The regularization in Lasso models helps prevent overfitting and excessive complexity. I’m also a big fan of using grid search for selecting the optimal hyperparameters of any linear regression model. The model coefficients that mattered most were production budget and the ratings from IMDB and Rotten Tomatoes. The most interesting part of this exercise was learning that the Metacritic rating had essentially no value when making predictions. The figure below shows predicted domestic gross values vs. actual. A perfect model would plot every point on the straight blue line.</p>
<figure style="margin-bottom:5px !important;">
<img src="/images/lasso.png" style="margin:auto !important;
width:50% !important;
display:block !important;" />
</figure>
<figcaption style="text-align:center !important;">Model Score = 0.44</figcaption>
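A minimal sketch of that workflow, using synthetic data in place of the movie features; the alpha grid and pipeline details are assumptions, not the project's exact settings:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie feature matrix (budget, ratings, ...).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([5.0, 3.0, 0.0, 0.0, 1.0]) + rng.randn(200) * 0.1

# 70-30 train-test split, as in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Normalize features, then grid-search the Lasso regularization strength.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
grid = GridSearchCV(pipe, {"lasso__alpha": [0.001, 0.01, 0.1, 1.0]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # R^2 on the held-out 30%
```

With regularization tuned this way, uninformative features (like the Metacritic rating in the post) get coefficients driven to zero.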
<h3 id="testing-on-the-past-5-years">Testing on the Past 5 Years</h3>
<p>Since a model’s usefulness lies in accurately predicting future events, I thought it would be interesting to train on all films through 2010 and then predict revenues for films released from 2011-2016. I used the lasso regression model again, as well as random forest and gradient boosted tree approaches. An interesting outcome of the grid search methodology was discovering that the least absolute deviations (LAD) loss function performed better than the popular least squares function. Given the extreme outliers that occur in the film industry, it is not too surprising that the LAD function, which does not punish outliers as severely, would improve predictive power.</p>
<figure style="margin-bottom:5px !important;">
<img src="/images/randforest.png" style="margin:auto !important;
width:50% !important;
display:block !important;" />
</figure>
<figcaption style="text-align:center !important;">Model Score = 0.58</figcaption>
<p>The random forest method, with test results shown above, performed the best. Below is a comparison of some of the more important features in each model. All of the models relied heavily on production budget and the Rotten Tomatoes and IMDB ratings. The genre of the film had very little influence on the predictions.</p>
<figure style="margin-bottom:5px !important;">
<img src="/images/features.png" style="margin:auto !important;
width:50% !important;
display:block !important;" />
</figure>
<figcaption style="text-align:center !important;">Relative Importance of Features</figcaption>
<h3 id="notable-takeaways">Notable Takeaways</h3>
<p>The straight-up answer to my initial question is: no, genre on its own is not useful for determining how much a film is going to gross domestically. As one might expect, high-budget films that people like are what make money. Despite these unexciting conclusions, I did make a few interesting discoveries, such as 3D movies not being correlated with revenue in any model. I also found it important to be very careful about overfitting gradient boosted models when predicting on a new distribution. Moviegoing habits over the last 5 years were slightly different from those of the prior 30 years, which contributed to gradient boosting methods performing worse than anticipated. Attempting to predict the ratio of revenue to budget proved unsuccessful, mostly due to the presence of outliers and the loss of budget as a feature. Lastly, trying to make predictions within individual genres was unsuccessful, as most genres had only around 100 movies total, an insufficient number to generate meaningful conclusions.</p>
<h2 id="does-movie-genre-correlate-to-revenue-part-1">Does Movie Genre Correlate to Revenue? (Part 1)</h2>
<p>Posted 2016-07-27 · <a href="https://bholligan.github.io/projects/hollywood">permalink</a></p>
<p>This post is a detailed explanation of a project I did at Metis.</p>
<h2 id="part-1">Part 1</h2>
<h3 id="objective">Objective</h3>
<p>The goal of the project was to obtain film industry data and apply regression models to identify correlations and learn something about the industry. The skills required to execute this task are an understanding of web scraping and basic machine learning principles. When devising a problem statement, the first challenge I discovered was that the scope of the project would need to be broad. I had aspirations to look at the relative success of Pixar films, but with only 17 major releases at the time, that’s simply not enough data to build a halfway decent machine learning model. In order to give myself an ample number of data points to work with, I opted to investigate whether film genre had any correlation with Hollywood success. The question then becomes: what’s the best metric for rating success?</p>
<h3 id="defining-success">Defining Success</h3>
<p>The simplest way to determine how well a movie does is by its total gross. Domestic gross figures are readily available for almost every movie, and the major releases have foreign gross values as well. Finding further financial information is where things become difficult. Box Office Mojo, the source of most of the data used in this project, has production budgets for about half of its major films. I could not find any repositories containing marketing budgets, merchandising revenue, after-theater sales, or other information that would be useful in determining actual profits. As such, the only metrics I could predict were gross theater revenue and the ratio of that revenue to the production budget. This is not optimal, but domestic gross should be a good proxy for many of the other financial aspects of a film and should serve the purpose of this project. I deemed it essential to adjust past revenue figures to today’s dollars. I did this by sticking to movies released since 1980 and converting each revenue figure based on the average ticket price for its year. This method is slightly flawed, as a movie released on December 31st will get a different correction than one released the following day, but for the most part it should normalize revenue data over the time period examined (1980-2015).</p>
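As a sketch, the adjustment is just a ratio of average ticket prices; the prices below are illustrative placeholders, not the figures used in the project:

```python
# Hypothetical average ticket prices by year (illustrative numbers).
avg_ticket_price = {1980: 2.69, 2000: 5.39, 2015: 8.43}

def adjust_gross(gross, release_year, base_year=2015,
                 prices=avg_ticket_price):
    """Convert a nominal gross to base-year dollars via the ticket-price
    ratio: the same number of tickets sold at base-year prices."""
    return gross * prices[base_year] / prices[release_year]

# A $100M gross in 1980 becomes roughly $313M in 2015 dollars.
print(round(adjust_gross(100e6, 1980)))
```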
<h3 id="determining-features">Determining Features</h3>
<p>It was rather silly, but I found the use of the word “feature” to be non-intuitive during my first machine learning (ML) lecture. For anyone else who might be new to the field: features are ML-speak for what I had traditionally thought of as variables or inputs. They are the attributes I will be using to predict the success of Hollywood films. Feature selection is an incredibly important part of building any regression or prediction model, and I won’t elaborate on selection theory here, but the features I opted for are: genre, budget, domestic gross, MPAA rating, critical reception, runtime, and release date. These features are relatively easy to acquire, which was a necessity given the two-week time constraint on the project. Since my objective was to determine whether genre is predictive of revenue, it was critical to include a variety of other features to make sure genre was not serving as a surrogate for other, more important features.</p>
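Before categorical features like genre and MPAA rating can enter a regression, they need a numeric encoding. A common approach, shown here with hypothetical column names, is one-hot encoding via pandas:

```python
import pandas as pd

# Toy frame with the categorical features from the post (values illustrative).
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Action", "Comedy", "Action"],
    "MPAA": ["PG-13", "R", "PG"],
    "Budget": [150e6, 30e6, 90e6],
})

# get_dummies expands each categorical column into one 0/1 indicator
# column per category, leaving numeric columns untouched.
X = pd.get_dummies(movies[["Genre", "MPAA", "Budget"]],
                   columns=["Genre", "MPAA"])
print(list(X.columns))
```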
<p>One feature inclusion that I debated was critical reception. I acquired and used movie ratings from Rotten Tomatoes, IMDB and Metacritic. There are interesting conclusions to be drawn from looking at the relationship between box office gross and reviews, but obviously you do not have this information prior to a film’s release. Using this data was an admission that this project is for personal development, not for making genuine revenue predictions. As our teacher often stressed, “you’ll never be able to tell me exactly how much a movie is going to make just by saying it’s a PG-13 action flick released in the summer.” That sentiment nicely encapsulates the limitations of this project, and it took on even greater meaning when I went about assessing regression models. However, no models can be built before the data is acquired, and acquiring it proved to be a rather tricky proposition.</p>
<h3 id="collecting-data">Collecting Data</h3>
<p>“Scraping” is a term used to describe extracting data from websites. As a first-time participant in this process, I was very thankful for the prebuilt tools and libraries that made it manageable. All the data observed on a website is embedded within its HTML code. Google Chrome has an incredibly handy “Inspect” feature (found via a right-click) that lets you browse through the HTML and get a feel for how the site is organized. A nice web scraping library called Beautiful Soup has been developed for Python. Beautiful Soup creates Soup objects that can be parsed, searched, and maneuvered through conveniently to isolate and acquire information. The main source of data I used was <a href="http://www.boxofficemojo.com/" title="Mojo">Box Office Mojo</a>. This site contained all of the features I was looking to acquire, outside of critical reception. The following is the bread-and-butter of retrieving HTML from a website and converting it to a Soup object:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="c"># requests is a library for retrieving and reading HTML </span>
<span class="n">page</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">text</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="s">"lxml"</span><span class="p">)</span> <span class="c"># BeautifulSoup stores the HTML in a soup object. </span>
</code></pre>
</div>
<p>My process for iterating through films on the Mojo website was based on the fact that the site is built from url templates. All I needed to do was acquire the unique movie title id located in each movie page’s url. The process went as follows:</p>
<ol>
<li>Get the list of genres sorted by number of movies in the genre.</li>
<li>Iterate through a sorted version of each genre page to get the list of top 100 highest grossing films in the genre.</li>
<li>Iterate through each film’s page to retrieve data for that particular film.</li>
</ol>
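Step 2 above can be sketched as follows. The HTML snippet is a trimmed, hypothetical stand-in for a genre page (the real Box Office Mojo markup differs), and the stdlib parser is used here in place of lxml:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical snippet of a genre-page table of top films.
html = """
<table>
  <tr><td><a href="/movies/?id=avatar.htm">Avatar</a></td></tr>
  <tr><td><a href="/movies/?id=titanic.htm">Titanic</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull the unique movie title id out of each link's url; the id slots
# straight back into the movie-page url template.
movie_ids = [a["href"].split("id=")[1].split(".")[0]
             for a in soup.find_all("a") if "id=" in a["href"]]
print(movie_ids)  # ['avatar', 'titanic']
```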
<p>Data from each movie was stored in a dictionary, where the keys corresponded to the eventual column names of the data frame. The dictionary for each movie was appended to a list holding the dictionaries of all the movies. The biggest problem I faced when running my script was missing data leading to errors. I learned the value of setting up an error logger and using try/except clauses to identify and resolve any issues. Below is the simple error handling style I employed.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">try</span><span class="p">:</span>
<span class="c"># Attempt to assign the domestic gross value. </span>
<span class="c"># The function money_to_int cleans up the string and converts it to an integer. </span>
<span class="n">domestic_gross</span> <span class="o">=</span> <span class="n">money_to_int</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'div'</span><span class="p">,</span> <span class="n">class_</span> <span class="o">=</span> <span class="s">"mp_box_content"</span><span class="p">)</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s">'td'</span><span class="p">,</span> <span class="n">align</span> <span class="o">=</span> <span class="s">'right'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">movie_data_dict</span><span class="p">[</span><span class="s">'Domestic_Gross'</span><span class="p">]</span> <span class="o">=</span> <span class="n">domestic_gross</span> <span class="c"># Put the domestic gross into the movie data dictionary. </span>
<span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
<span class="c"># If there's an issue retrieving the domestic gross, this will allow the script to keep running and create a record of the url that caused the issue. </span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'ErrorLog.txt'</span><span class="p">,</span><span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">efile</span><span class="p">:</span>
<span class="n">efile</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">Error domestic gross. Movie name: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">mov_url</span><span class="p">))</span>
</code></pre>
</div>
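Once the loop finishes, the list of dictionaries drops straight into pandas; the keys and values below are illustrative:

```python
import pandas as pd

# Each scraped movie becomes a dict; the keys become data frame columns.
movie_data = [
    {"Title": "Movie A", "Domestic_Gross": 120_000_000, "Genre": "Action"},
    {"Title": "Movie B", "Domestic_Gross": 45_000_000, "Genre": "Comedy"},
]

# One row per dict; missing keys would simply become NaN.
df = pd.DataFrame(movie_data)
print(df.columns.tolist())  # ['Title', 'Domestic_Gross', 'Genre']
```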
<p>The last hurdle was getting timed out by Box Office Mojo. With so many of my classmates also scraping the site, I had numerous scrape attempts prematurely halted by HTTP request errors. My solution was to add a randomized delay to the python script and to limit my scraping to no more than 1000 movies at a time. By the time the script was running smoothly I was already focused on finishing the modeling, so I opted to cap my data collection at ~2200 movies from 29 different genres. The final piece of data, critical reception, was pulled from the API of the Open Movie Database. Once all of the data was collected and loaded into a Pandas dataframe, it was time to examine the relationships between features.</p>
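One way to sketch the randomized delay; the helper name and delay bounds are my own, not the script's:

```python
import random
import time

def fetch_politely(urls, fetch, min_delay=1.0, max_delay=4.0):
    """Apply fetch to each url, sleeping a random interval between
    requests so the scraper doesn't hammer the server."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(min_delay, max_delay))
    return results
```

In the scraping script, `fetch` would be the requests/BeautifulSoup routine shown earlier; randomizing the interval makes the traffic look less like a bot on a fixed timer.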
<p><a href="https://bholligan.github.io/projects/hollywood2/">Part 2</a></p>