1) Performed data pre-processing, feature engineering and data analysis on a dataset containing every line of dialogue from the popular American TV series “The Office”, to summarize character stats and relationships and their influence on the show's popularity and success (in terms of ratings and votes), presented through interactive visualizations.
2) Goal - To determine which questions the show's script and ratings data can answer, so that the makers can take better data-driven decisions when investing in a season revival/reboot. The analysis uses the pandas, plotly, d3graph and chord packages.
3) Two datasets are used - the dialogues dataset from theofficequotes.net and the IMDB ratings dataset from Kaggle.
Author: Swarnita Venkatraman
Date: 15/12/2020
df.head()  # dialogues dataset
| | id | season | episode | scene | line_text | speaker | deleted |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | All right Jim. Your quarterlies look very good... | Michael | False |
| 1 | 2 | 1 | 1 | 1 | Oh, I told you. I couldn't close it. So... | Jim | False |
| 2 | 3 | 1 | 1 | 1 | So you've come to the master for guidance? Is ... | Michael | False |
| 3 | 4 | 1 | 1 | 1 | Actually, you called me in here, but yeah. | Jim | False |
| 4 | 5 | 1 | 1 | 1 | All right. Well, let me show you how it's done. | Michael | False |
df_imdb.head()  # IMDB ratings dataset
| | Season | Title | AirDate | Rating | Num_Votes | Description | DirectedBy | WrittenBy |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Pilot | 2005-03-24 | 7.5 | 4349 | The premiere episode introduces the boss and s... | Ken Kwapis | Ricky Gervais \| Stephen Merchant and Greg Daniels |
| 1 | 1 | Diversity Day | 2005-03-29 | 8.3 | 4213 | Michael's off color remark puts a sensitivity ... | Ken Kwapis | B. J. Novak |
| 2 | 1 | Health Care | 2005-04-05 | 7.8 | 3536 | Michael leaves Dwight in charge of picking the... | Ken Whittingham | Paul Lieberstein |
| 3 | 1 | The Alliance | 2005-04-12 | 8.1 | 3428 | Just for a laugh, Jim agrees to an alliance wi... | Bryan Gordon | Michael Schur |
| 4 | 1 | Basketball | 2005-04-19 | 8.4 | 3745 | Michael and his staff challenge the warehouse ... | Greg Daniels | Greg Daniels |
1) The show had 9 seasons in total. 86 episodes (46%) had a rating above the mean/median rating of 8.2, and most of the votes are concentrated between 1700 and 3000.
2) Although season 3 (which has the highest median rating) had the highest number of scenes, it has only the third highest number of lines. This confirms that more scenes do not necessarily mean more lines in a season.
3) Season 4 had only 14 episodes because production stopped in November 2007 due to a writers’ strike while the season was being shot. Even so, three of its episodes feature in the overall top 5 lists for maximum lines and maximum scenes, probably because it was a shorter season.
4) Season 5 (with the highest number of episodes) had the highest number of lines and includes the most watched episode of the show, "Stress Relief", which aired on NBC after the broadcast of Super Bowl XLIII, making it the only episode to reach over 20 million viewers.
5) Besides season 1, season 7 had the highest number of unique directors (20) and unique writers (16) for the 24 episodes that it had.
6) Season 8's highest episode rating is the same as the median rating across all episodes, and the season has the lowest mean votes along with season 9. The season also features the episode with the fewest votes, which suggests that after Michael left, viewers lost interest and either stopped watching the show or did not want to vote.
7) 57,973 lines were spoken in total by 699 speakers, with the show's last season (season 9) having the highest number of unique speakers (184).
8) Cast members such as Rainn Wilson (Dwight), Steve Carell (Michael), John Krasinski (Jim), Ed Helms (Andy) and Brian Baumgartner (Kevin) directed episodes, while Mindy Kaling (Kelly), B. J. Novak (Ryan) and Paul Lieberstein (Toby) not only directed but were also regular writers of the show apart from acting in it. Steve Carell (Michael) also wrote a couple of episodes.
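The per-season counts above (lines, scenes, unique speakers) come down to grouped aggregation over the dialogue rows. The following is a minimal sketch in plain Python, standing in for the pandas `groupby` calls the notebook presumably uses. The sample rows are made up; only the column names (`season`, `episode`, `scene`, `speaker`) come from the dialogues table, and the sketch assumes that scene numbers restart in every episode.

```python
from collections import defaultdict

# Hypothetical sample rows: (season, episode, scene, speaker).
rows = [
    (1, 1, 1, "Michael"),
    (1, 1, 1, "Jim"),
    (1, 1, 2, "Michael"),
    (1, 2, 1, "Dwight"),
    (2, 1, 1, "Michael"),
    (2, 1, 2, "Pam"),
    (2, 1, 2, "Jim"),
]

lines_per_season = defaultdict(int)
scenes_per_season = defaultdict(set)
speakers_per_season = defaultdict(set)

for season, episode, scene, speaker in rows:
    lines_per_season[season] += 1                    # one row = one spoken line
    # Assumption: scene numbers restart per episode, so a scene is
    # identified by the (episode, scene) pair within a season.
    scenes_per_season[season].add((episode, scene))
    speakers_per_season[season].add(speaker)

season_stats = {
    s: {
        "lines": lines_per_season[s],
        "scenes": len(scenes_per_season[s]),
        "speakers": len(speakers_per_season[s]),
    }
    for s in sorted(lines_per_season)
}
print(season_stats)
```

Keeping scenes and speakers as sets means repeated lines by the same speaker, or within the same scene, are counted only once.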
• An upward trend is seen here - the more total lines, the better the ratings. Most episodes fall in the 200-320 lines range, within which most episodes of seasons 7, 8 and 9 tend to have the lower ratings.
• Episodes with more total lines than the 75th percentile across all episodes (323 lines) generally tend to have ratings >= the show's median rating of 8.2.
• A similar pattern to Total_lines is seen - an upward trend where more total scenes go with better ratings. Most episodes fall in the 30-55 scenes range, within which most episodes of seasons 7, 8 and 9 tend to have the lower ratings.
• Episodes with more total scenes than the 75th percentile across all episodes (52 scenes) generally tend to have ratings >= the show's median rating of 8.2.
• Although a few episodes with fewer scenes are highly rated, most episodes with more scenes seem to be highly rated.
• The top 5 characters by number of lines are Michael, Dwight, Jim, Pam and Andy. Michael has the highest number of lines in every season until season 7, by a large margin, while Dwight comes second in 5 seasons (2, 3, 5, 7, 9), Jim in 3 seasons (1, 4, 6) and Andy in season 8.
• Apart from the top 5 characters, all other supporting characters have a limited range of lines. Darryl's and Erin's upward curves suggest a sharp increase in their lines in the last few seasons.
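The 75th-percentile observations above rest on two steps: compute the threshold over all episodes, then check what share of the episodes above it reach the median rating. The sketch below shows both steps on made-up numbers, using the standard-library `statistics` module instead of pandas; `method="inclusive"` is chosen because it uses the same linear interpolation that pandas' `quantile` applies by default.

```python
from statistics import median, quantiles

# Hypothetical (total_lines, rating) pairs, one per episode.
episodes = [
    (180, 7.6), (220, 7.9), (250, 8.0), (270, 8.2),
    (300, 8.1), (320, 8.4), (350, 8.6), (410, 9.0),
]

line_counts = [n for n, _ in episodes]
ratings = [r for _, r in episodes]

# Quartiles of the line counts; q3 is the 75th percentile.
q1, q2, q3 = quantiles(line_counts, n=4, method="inclusive")
median_rating = median(ratings)

# Episodes above the threshold, and the share of them rated at or
# above the overall median rating.
long_episodes = [(n, r) for n, r in episodes if n > q3]
share = sum(r >= median_rating for _, r in long_episodes) / len(long_episodes)
print(q3, median_rating, share)
```

The same check applies unchanged to total scenes by swapping the first element of each pair.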
[Figure: Relation between Total lines and Rating for top 5 characters]
• The assumption is that a character with more lines in an episode is seen through most of that episode.
• An upward trend is seen for everyone except Andy - ratings tend to drop when Andy has more lines. Dwight's slope is steeper, suggesting a smaller jump in ratings compared to the other 3; even when Dwight has more lines, the ratings are still on the lower side.
• Jim and Pam have similar graphs, while for Michael no low-rated episodes appear, as most of those are from seasons 8 and 9, after he left.
• For the frequently seen character pairs, the number of common episodes in which both characters have at least one line is -
Jim-Dwight (185), Pam-Jim (181), Michael-Dwight (137) and Michael-Jim (136).
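Counting common episodes for a pair means collecting the set of speakers per episode and then counting every unordered pair once per episode. This is a small plain-Python sketch of that idea on made-up rows; it is not the notebook's own implementation, which presumably works on the pandas DataFrame.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical rows: (season, episode, speaker).
rows = [
    (1, 1, "Michael"), (1, 1, "Jim"), (1, 1, "Dwight"), (1, 1, "Jim"),
    (1, 2, "Jim"), (1, 2, "Pam"),
    (1, 3, "Dwight"), (1, 3, "Jim"), (1, 3, "Pam"),
]

# Set of speakers per episode; a set ignores repeated lines, so
# "at least one line" is all that matters.
speakers_by_episode = defaultdict(set)
for season, episode, speaker in rows:
    speakers_by_episode[(season, episode)].add(speaker)

# Count each unordered pair once per episode the two characters share.
pair_counts = Counter()
for speakers in speakers_by_episode.values():
    for pair in combinations(sorted(speakers), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Sorting the speakers before pairing gives each pair one canonical key, so ("Dwight", "Jim") and ("Jim", "Dwight") are not counted separately.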
• The top 5 characters by number of scenes are Michael, Dwight, Jim, Pam and Andy. Michael appears in the highest number of scenes in every season until season 7, after which he left the show.
• Jim and Dwight tie in season 6 despite Jim having more lines, while in season 7 Jim appears in more scenes than Dwight despite Dwight having more lines.
• All the supporting characters seem to follow a similar, limited pattern over the 9 seasons.
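A character can have more lines yet appear in fewer scenes, because lines are counted per row while scenes are counted once per distinct scene. The sketch below illustrates the difference on made-up rows, assuming that "appearing in a scene" means having at least one line in it.

```python
from collections import defaultdict

# Hypothetical rows: (season, episode, scene, speaker).
rows = [
    (7, 1, 1, "Dwight"), (7, 1, 1, "Dwight"), (7, 1, 1, "Dwight"),
    (7, 1, 1, "Jim"),
    (7, 1, 2, "Jim"),
    (7, 1, 3, "Jim"), (7, 1, 3, "Dwight"),
]

lines = defaultdict(int)
scenes = defaultdict(set)
for season, episode, scene, speaker in rows:
    lines[(season, speaker)] += 1
    # Assumption: a speaker "appears" in a scene if they have a line in it;
    # the set makes each scene count once per speaker.
    scenes[(season, speaker)].add((episode, scene))

scene_counts = {key: len(value) for key, value in scenes.items()}
print(dict(lines), scene_counts)
```

In this toy data Dwight has 4 lines across 2 scenes while Jim has 3 lines across 3 scenes, which is the same kind of mismatch described for season 7.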
[Figure: Relation between Total scenes and Rating for top 5 characters]
• The assumption is that a character who appears in more scenes in an episode is seen through most of that episode. The trends are similar to those in the total lines graphs seen previously.
• Among directors, Greg Daniels has the highest median rating, while Ken Kwapis and Paul Feig have all episode ratings above 8 (except one, the Pilot episode) along with the second highest median rating.
• While Matt Sohn's episodes have comparatively lower ratings, David Rogers, who has the lowest median rating, had just one episode that was rated high.
• Among writers, Greg Daniels has the highest median rating of 8.7, followed by Paul Lieberstein with a median rating of 8.5.
• Michael Schur and Justin Spitzer have all episodes rated above 8 (except one from season 8), while Mindy Kaling and B. J. Novak have similar ratings distributions. Aaron Shure has the lowest median rating, 7.9.
• Nothing conclusive can be said about co-written episodes, since only 33 of the 186 episodes were written by more than one writer, which makes the data too skewed.
• Director-writer pairs who have worked together on more than one episode seem to bring in slightly better ratings than pairs who have worked together on only one episode.
• For (Jeffrey Blitz, Paul Lieberstein) 2/2 episodes, for (Ken Kwapis, Greg Daniels) 2/3 episodes and for (Paul Feig, Gene Stupnitsky | Lee Eisenberg) 3/4 episodes that they worked together on have ratings above the 75th percentile of ratings (>8.6).
In contrast, for (Matt Sohn, Allison Silverman) 2/2 episodes that they worked together on have ratings below the 25th percentile of ratings (<7.8).
• The percentage shown is with respect to the total number of episodes per season. Seasons 6 and 8 experimented with the largest number of new director-writer pairs (who worked together on just one episode), which account for almost 80% of the episodes in both seasons.
• Although the number of episodes involved (19) is too small to form any conclusive observations, in seasons 6 and 8, 5 of the 7 actors directed an episode, possibly to revive the low ratings.
• Half of the episodes belong to seasons 8 and 9, and most episodes are below the median rating of 8.2. All three episodes directed by John Krasinski are rated low.
• Some iconic episodes like "Scott's Tots", "Michael's Last Dundies" and "Garage Sale" feature in the list. Two iconic episodes, "Casino Night" and "Survivor Man", were written by Steve Carell.
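The comparison between repeat and one-off director-writer pairs can be sketched by grouping ratings on the (director, writers) pair and splitting the groups by size. The names and ratings below are invented purely to illustrate the mechanics and say nothing about the real data; plain Python is used in place of the notebook's pandas code.

```python
from collections import defaultdict
from statistics import median

# Hypothetical episodes: (director, writers, rating). Co-writers are kept
# in one "|"-separated string, as in the WrittenBy column.
episodes = [
    ("Director A", "Writer X", 8.8),
    ("Director A", "Writer X", 9.0),
    ("Director B", "Writer Y | Writer Z", 8.7),
    ("Director B", "Writer Y | Writer Z", 8.5),
    ("Director C", "Writer X", 8.0),
    ("Director D", "Writer Y", 7.9),
]

# Group ratings by the exact (director, writers) combination.
ratings_by_pair = defaultdict(list)
for director, writers, rating in episodes:
    ratings_by_pair[(director, writers)].append(rating)

# Split episode ratings by whether the pair worked together more than once.
repeat = [r for rs in ratings_by_pair.values() if len(rs) > 1 for r in rs]
one_off = [r for rs in ratings_by_pair.values() if len(rs) == 1 for r in rs]
print(median(repeat), median(one_off))
```

Treating the whole `WrittenBy` string as one key means a co-writing team counts as a different "writer" from each of its members, which appears to match how the `Combined` column is built.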
df_ep.sort_values(by=['Rating','Num_Votes'],ascending=False).head(10)
| | season | episode | Title | AirDate | Rating | Num_Votes | DirectedBy | WrittenBy | year | month | day | Difference in days between air dates(season-wise) | Combined | episode_no | Total_lines | Total_scenes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 185 | 9 | 23 | Finale | 2013-05-16 | 9.8 | 9269 | Ken Kwapis | Greg Daniels | 2013 | 5 | 3 | 238.0 | (Ken Kwapis, Greg Daniels) | 186 | 522 | 116 |
| 135 | 7 | 21 | Goodbye, Michael | 2011-04-28 | 9.8 | 6909 | Paul Feig | Greg Daniels | 2011 | 4 | 3 | 217.0 | (Paul Feig, Greg Daniels) | 136 | 329 | 53 |
| 77 | 5 | 13 | Stress Relief | 2009-02-01 | 9.7 | 7058 | Jeffrey Blitz | Paul Lieberstein | 2009 | 2 | 6 | 129.0 | (Jeffrey Blitz, Paul Lieberstein) | 78 | 416 | 70 |
| 184 | 9 | 22 | A.A.R.M. | 2013-05-09 | 9.5 | 3401 | David Rogers | Brent Forrester | 2013 | 5 | 3 | 231.0 | (David Rogers, Brent Forrester) | 185 | 501 | 70 |
| 59 | 4 | 9 | Dinner Party | 2008-04-10 | 9.4 | 4824 | Paul Feig | Gene Stupnitsky \| Lee Eisenberg | 2008 | 4 | 3 | 196.0 | (Paul Feig, Gene Stupnitsky \| Lee Eisenberg) | 60 | 331 | 31 |
| 130 | 7 | 16 | Threat Level Midnight | 2011-02-17 | 9.4 | 4236 | Tucker Gates | B. J. Novak | 2011 | 2 | 3 | 147.0 | (Tucker Gates, B. J. Novak) | 131 | 72 | 26 |
| 27 | 2 | 22 | Casino Night | 2006-05-11 | 9.4 | 4195 | Ken Kwapis | Steve Carell | 2006 | 5 | 3 | 233.0 | (Ken Kwapis, Steve Carell) | 28 | 385 | 57 |
| 94 | 6 | 4 | Niagara: Part 1 & 2 | 2009-10-08 | 9.4 | 4055 | Paul Feig | Greg Daniels \| Mindy Kaling | 2009 | 10 | 3 | 21.0 | (Paul Feig, Greg Daniels \| Mindy Kaling) | 95 | 552 | 78 |
| 50 | 3 | 23 | The Job | 2007-05-17 | 9.3 | 3461 | Ken Kwapis | Paul Lieberstein \| Michael Schur | 2007 | 5 | 3 | 238.0 | (Ken Kwapis, Paul Lieberstein \| Michael Schur) | 51 | 544 | 88 |
| 64 | 4 | 14 | Goodbye, Toby | 2008-05-15 | 9.3 | 3453 | Paul Feig | Jennifer Celotta \| Paul Lieberstein | 2008 | 5 | 3 | 231.0 | (Paul Feig, Jennifer Celotta \| Paul Lieberstein) | 65 | 617 | 103 |
Peculiar features -
• The episodes come from all seasons except season 1 (a new show and style of humour, which people take time to get accustomed to) and season 8 (the complete focus on Andy's story arc and Michael's absence made it not the greatest season).
• Ken Kwapis and Paul Feig directed 3 and 4 episodes in this list respectively - they have directed the second and third highest number of episodes of the show, with all but the Pilot rated above 8.
• Greg Daniels and Paul Lieberstein each wrote or co-wrote 3 episodes - both are regular writers of the show and have the two highest median ratings.
• The director-writer pairs of the top 5 episodes in this list have worked together on at least 1 other episode of the show.
• 9/10 episodes have total lines above the 75th percentile of lines (323); the exception is "Threat Level Midnight", whose lines and scenes are recorded inconsistently in the dataset because it mainly featured clips. 8/10 episodes have total scenes above the 75th percentile of scenes (52), confirming the trends seen earlier.
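The 9/10 and 8/10 counts can be checked directly against the `Total_lines` and `Total_scenes` columns of the top-10 table. A small plain-Python check, with the values copied from that table and the two thresholds taken from the text:

```python
# (Title, Total_lines, Total_scenes) copied from the top-10 table.
top10 = [
    ("Finale", 522, 116),
    ("Goodbye, Michael", 329, 53),
    ("Stress Relief", 416, 70),
    ("A.A.R.M.", 501, 70),
    ("Dinner Party", 331, 31),
    ("Threat Level Midnight", 72, 26),
    ("Casino Night", 385, 57),
    ("Niagara: Part 1 & 2", 552, 78),
    ("The Job", 544, 88),
    ("Goodbye, Toby", 617, 103),
]

LINES_Q3 = 323   # 75th percentile of total lines, as quoted in the text
SCENES_Q3 = 52   # 75th percentile of total scenes, as quoted in the text

above_lines = [title for title, n_lines, _ in top10 if n_lines > LINES_Q3]
above_scenes = [title for title, _, n_scenes in top10 if n_scenes > SCENES_Q3]
print(len(above_lines), len(above_scenes))  # 9 8
```

The two episodes below the scenes threshold are "Dinner Party" (31 scenes) and "Threat Level Midnight" (26 scenes).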
df_ep.sort_values(by=['Rating','Num_Votes'],ascending=[True,False]).head(10)
| | season | episode | Title | AirDate | Rating | Num_Votes | DirectedBy | WrittenBy | year | month | day | Difference in days between air dates(season-wise) | Combined | episode_no | Total_lines | Total_scenes | Top5_max_lines |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157 | 8 | 19 | Get the Girl | 2012-03-15 | 6.6 | 1944 | Rainn Wilson | Charlie Grandy | 2012 | 3 | 3 | 175.0 | (Rainn Wilson, Charlie Grandy) | 158 | 261 | 41 | ([Nellie, Jim, Andy, Erin, Robert], [42, 38, 33, 32, 24]) |
| 103 | 6 | 13 | The Banker | 2010-01-21 | 6.8 | 2665 | Jeffrey Blitz | Jason Kessler | 2010 | 1 | 3 | 126.0 | (Jeffrey Blitz, Jason Kessler) | 104 | 131 | 21 | ([Michael, Eric, Toby, Dwight, Computron], [44, 26, 18, 17, 10]) |
| 146 | 8 | 8 | Gettysburg | 2011-11-17 | 6.9 | 1850 | Jeffrey Blitz | Robert Padnick | 2011 | 11 | 3 | 56.0 | (Jeffrey Blitz, Robert Padnick) | 147 | 282 | 45 | ([Andy, Dwight, Jim, Robert, Oscar], [58, 30, 25, 24, 20]) |
| 167 | 9 | 5 | Here Comes Treble | 2012-10-25 | 7.0 | 1780 | Claire Scanlon | Owen Ellickson | 2012 | 10 | 3 | 35.0 | (Claire Scanlon, Owen Ellickson) | 168 | 292 | 39 | ([Andy, Dwight, Jim, Pam, Erin], [50, 39, 35, 32, 24]) |
| 159 | 8 | 21 | Angry Andy | 2012-04-19 | 7.1 | 1857 | Claire Scanlon | Justin Spitzer | 2012 | 4 | 3 | 210.0 | (Claire Scanlon, Justin Spitzer) | 160 | 292 | 51 | ([Andy, Nellie, Ryan, Pam, Erin], [53, 36, 30, 29, 24]) |
| 158 | 8 | 20 | Welcome Party | 2012-04-12 | 7.1 | 1739 | Ed Helms | Steve Hely | 2012 | 4 | 3 | 203.0 | (Ed Helms, Steve Hely) | 159 | 307 | 25 | ([Andy, Jim, Pam, Erin, Nellie], [48, 39, 32, 25, 22]) |
| 160 | 8 | 22 | Fundraiser | 2012-04-26 | 7.1 | 1694 | David Rogers | Owen Ellickson | 2012 | 4 | 3 | 217.0 | (David Rogers, Owen Ellickson) | 161 | 239 | 38 | ([Andy, Jim, Pam, Kevin, Nellie], [41, 24, 22, 18, 17]) |
| 164 | 9 | 2 | Roy's Wedding | 2012-09-27 | 7.2 | 1730 | Matt Sohn | Allison Silverman | 2012 | 9 | 3 | 7.0 | (Matt Sohn, Allison Silverman) | 165 | 296 | 47 | ([Jim, Pam, Andy, Dwight, Erin], [39, 34, 33, 31, 31]) |
| 141 | 8 | 3 | Lotto | 2011-10-06 | 7.3 | 1858 | John Krasinski | Charlie Grandy | 2011 | 10 | 3 | 14.0 | (John Krasinski, Charlie Grandy) | 142 | 317 | 42 | ([Andy, Jim, Darryl, Dwight, Pam], [63, 52, 49, 37, 23]) |
| 177 | 9 | 15 | Couples Discount | 2013-02-07 | 7.3 | 1655 | Troy Miller | Allison Silverman | 2013 | 2 | 3 | 140.0 | (Troy Miller, Allison Silverman) | 178 | 260 | 35 | ([Andy, Dwight, Pam, Jim, Erin], [62, 32, 28, 26, 26]) |
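The call `sort_values(by=['Rating','Num_Votes'], ascending=[True, False])` orders episodes by rating from lowest to highest and breaks ties by votes from highest to lowest. The sketch below reproduces that mixed ordering in plain Python on five rows taken from the table (Title, Rating and Num_Votes only), supplied in shuffled order.

```python
# (Title, Rating, Num_Votes) for five rows of the bottom-10 table, shuffled.
episodes = [
    ("Fundraiser", 7.1, 1694),
    ("Get the Girl", 6.6, 1944),
    ("Angry Andy", 7.1, 1857),
    ("Here Comes Treble", 7.0, 1780),
    ("Welcome Party", 7.1, 1739),
]

# Ascending rating, then descending votes: negating the numeric
# tie-breaker lets one ascending sort handle both directions.
ranked = sorted(episodes, key=lambda row: (row[1], -row[2]))
for title, rating, votes in ranked:
    print(title, rating, votes)
```

The three episodes rated 7.1 come out as "Angry Andy", "Welcome Party" and "Fundraiser", the same relative order as in the table.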
Peculiar features -
• 9 of the 10 episodes belong to seasons 8 and 9 (after Michael leaves the show!); the only exception is "The Banker" from season 6.
• Jeffrey Blitz and Claire Scanlon directed 2 episodes each - all of Jeffrey Blitz's other episodes have high ratings, while Claire Scanlon has directed only these two episodes.
• Charlie Grandy, Owen Ellickson and Allison Silverman wrote 2 episodes each.
• Episodes directed by the show's actors Rainn Wilson, Ed Helms and John Krasinski feature in the list.
• For (Matt Sohn, Allison Silverman), 2/2 episodes that they worked together on have ratings below the 25th percentile of ratings (<7.8); none of the other director-writer pairs have worked together on another episode.
• None of the episodes have total lines above the 75th percentile of lines (323) or total scenes above the 75th percentile of scenes (52), confirming the trends seen earlier.
• Season 3 has the highest median rating, with the median line sitting in the upper half of the box (negative skew), indicating that most episodes have ratings above 8.6. Season 8 has the lowest median rating.
• Ratings for seasons 2 to 5 are steady and good; from season 6 onwards a general decline sets in. A slight revival in season 7 does not continue into seasons 8 and 9, which have consistently low ratings except for 3 episodes in season 9.
• A decreasing trend in votes can be seen towards the later seasons, with a sharp dip from season 7 to 8.
• A positive linear relation is seen, up to a certain extent, between ratings and votes. Votes are most densely concentrated between 2000 and 3000, for ratings between 8 and 8.8. Votes go above 6000 in only 4 instances.
• Ratings pick up after season 1 (naturally, because the audience took time to get accustomed to the style of humour), but the votes begin to decrease. The decrease in votes does not necessarily suggest a decline in viewership: peaks in rating coincide with peaks in votes for the audience-favourite episodes, which suggests those episodes simply attracted more viewers. Even though season 9 wasn’t as bad as season 8, many viewers probably did not want to watch it because of the way season 8's story arc had progressed, which would explain why it did not receive many votes (saturated votes).
• The season finale episodes end on a higher rating than the first episode of the season, except in seasons 6 and 8, which see a decline in ratings, suggesting weaker finale episodes.
• Seasons 6 and 8 experimented with the largest number of new director-writer pairs (who worked together on just one episode), accounting for almost 80% of the episodes in both seasons. This may explain why the season 6 episodes in particular, with Jim as co-manager and the Sabre storylines, seemed less engaging than the previous seasons.
• A "Regular director" is a director who has directed at least 6 episodes (the length of season 1) and a "Not regular director" is one who has directed fewer than 6 episodes throughout the show. Because of this definition, using this graph to draw conclusions about the initial seasons might not be sensible, but it can be used for analysing later seasons (especially seasons 8 and 9).
• Seasons 8 and 9 saw the highest percentage (>50%) of episodes directed by not regular directors compared to other seasons.
• The same assumptions as for "Regular director" and "Not regular director" apply to "Regular writer" and "Not regular writer", so this graph should likewise be used mainly for analysing later seasons (especially seasons 8 and 9) rather than the initial seasons.
• Seasons 8 and 9 saw the highest percentage of episodes written by not regular writers, with season 9 having more than 70% of its episodes written by not regular writers.
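The regular / not regular split is a two-pass computation: first count episodes per director over the whole show, then compute for each season the share of episodes directed by someone below the threshold. This is a plain-Python sketch with placeholder director names and made-up episode counts; only the threshold of 6 episodes comes from the text.

```python
from collections import Counter

REGULAR_MIN_EPISODES = 6  # threshold from the text: the length of season 1

# Hypothetical (season, director) pairs, one per episode.
episodes = (
    [(1, "Director A")] * 4
    + [(2, "Director A")] * 2
    + [(2, "Director B")] * 2
    + [(8, "Director A")] * 1
    + [(8, "Director C")] * 3
)

# Pass 1: episodes per director across the whole show.
episodes_per_director = Counter(director for _, director in episodes)
regular = {d for d, n in episodes_per_director.items() if n >= REGULAR_MIN_EPISODES}

# Pass 2: share of each season's episodes directed by a not regular director.
totals = Counter(season for season, _ in episodes)
non_regular = Counter(season for season, d in episodes if d not in regular)
share_non_regular = {s: non_regular[s] / totals[s] for s in sorted(totals)}
print(regular, share_non_regular)
```

The same two passes work for writers by swapping the director field for the writer field.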
Conclusion for SEASON 8:
• The entire season was focused on Andy’s character (him as Regional Manager, his family, his personal life). Although Ed Helms is a good actor, the mediocre writing reduced Andy to a caricature by trying to portray him as a replacement for Michael’s character. The writers apparently wanted to make Darryl (Craig Robinson) the manager but thought he was too smart to make the kind of decisions that lead to comical situations.
• The season also saw unnecessarily extended storylines (Gabe complaining over Erin in the background), randomly abandoned storylines (Darryl-Val, Cathy) and haphazard storylines (Florida store).
• The cast additions didn’t help either - Robert California’s and Nellie Bertram’s storylines did more damage to the ratings than expected.
Conclusion for SEASON 9:
• The focus is on Dwight's story arc (farm life, friends, manager position, relationship with Angela) and Jim's story arc (relationship stress with Pam, his own sports company in Philly). While the initial episodes did focus on Andy's character, Ed Helms was away for some episodes shooting The Hangover Part III, so his scenes were cut by half and his dialogues by 40% compared to season 8, shifting the focus onto other characters, which seems to have helped the show's ratings.
• The storylines got more real and personal, with new friendships like Nellie-Pam, Dwight-Erin and Jim-Darryl, along with a peek into the documentary crew.
• Not only did it bring back characters like Roy and Jan, along with many new characters due to the scattered storylines (Clark, Pete, Dwight's family and friends, Jim's Philly office), but regular supporting characters like Angela and Oscar also got their due, with their all-time highest lines and scenes in season 9.
For the complete code and analysis, visit my GitHub repo - https://github.com/swarnitav08/EDA-for-The-Office-Story
