Dataset at a glance and EDA Highlights of
The Office Story: That's what the data said.

Analysis and Visualization of the show's dialogues and ratings dataset¶

1) Performed data pre-processing, feature engineering and data analysis on a dataset having every dialogue from the popular American TV series “The Office” to summarize character stats, relationships and their influence on the popularity and success of the show (in terms of ratings and votes), presented through interactive visualizations.

2) Goal - To determine which questions the show's script and ratings data can answer, so that the makers can take better data-driven decisions when investing in a season revival/reboot. The analysis uses the pandas, plotly, d3graph and chord packages.

3) Two datasets are used - The dialogues dataset from theofficequotes.net and IMDB ratings dataset from Kaggle.

Author: Swarnita Venkatraman

Date: 15/12/2020

Datasets at a glance:¶

df.head()  # dialogues dataset
   id  season  episode  scene  line_text                                          speaker  deleted
0  1   1       1        1      All right Jim. Your quarterlies look very good...  Michael  False
1  2   1       1        1      Oh, I told you. I couldn't close it. So...         Jim      False
2  3   1       1        1      So you've come to the master for guidance? Is ...  Michael  False
3  4   1       1        1      Actually, you called me in here, but yeah.         Jim      False
4  5   1       1        1      All right. Well, let me show you how it's done.    Michael  False
df_imdb.head()  # IMDB ratings dataset
   Season  Title          AirDate     Rating  Num_Votes  Description                                        DirectedBy       WrittenBy
0  1       Pilot          2005-03-24  7.5     4349       The premiere episode introduces the boss and s...  Ken Kwapis       Ricky Gervais |Stephen Merchant and Greg Daniels
1  1       Diversity Day  2005-03-29  8.3     4213       Michael's off color remark puts a sensitivity ...  Ken Kwapis       B. J. Novak
2  1       Health Care    2005-04-05  7.8     3536       Michael leaves Dwight in charge of picking the...  Ken Whittingham  Paul Lieberstein
3  1       The Alliance   2005-04-12  8.1     3428       Just for a laugh, Jim agrees to an alliance wi...  Bryan Gordon     Michael Schur
4  1       Basketball     2005-04-19  8.4     3745       Michael and his staff challenge the warehouse ...  Greg Daniels     Greg Daniels

Summary of Datasets:¶

1) The show had 9 seasons in total. 86 episodes (46%) had a rating above the mean/median rating of 8.2, and most episodes received between 1,700 and 3,000 votes.


2) Although season 3 (which has the highest median rating) had the highest number of scenes, it has only the third-highest number of lines. This confirms that more scenes in a season do not necessarily mean more lines.


3) Season 4 had only 14 episodes because production stopped in November 2007, in the middle of shooting, due to a writers' strike. Even so, three of its episodes feature in both the overall Top 5 list for maximum lines and the one for maximum scenes, probably because it was a shorter season.


4) Season 5 (with the highest number of episodes) had the highest number of lines and includes the most-watched episode of the show, "Stress Relief", which aired on NBC after the broadcast of Super Bowl XLIII, making it the only episode to reach over 20 million viewers.

5) Besides season 1, season 7 had the highest number of unique directors (20) and unique writers (16) for its 24 episodes.


6) The rating of season 8's highest-rated episode is the same as the median rating across all episodes, and the season has the lowest mean votes along with season 9. The season also features the episode with the fewest votes, which suggests that after Michael left, viewers lost interest and either stopped watching the show or did not want to vote.


7) 57,973 lines were spoken in total by 699 speakers, with the show's last season (season 9) having the highest number of unique speakers (184).


8) Actors from the show such as Rainn Wilson (Dwight), Steve Carell (Michael), John Krasinski (Jim), Ed Helms (Andy) and Brian Baumgartner (Kevin) directed episodes, while Mindy Kaling (Kelly), B. J. Novak (Ryan) and Paul Lieberstein (Toby) not only directed but were also regular writers of the show apart from acting in it. Steve Carell (Michael) also wrote a couple of episodes.
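
Per-season figures like the ones above (total lines, total scenes, unique speakers) can be derived from the dialogues dataset with a few groupby calls. The following is a minimal sketch, not the notebook's actual code: it reuses the column names shown in df.head(), but the rows are made-up toy data.

```python
import pandas as pd

# Toy stand-in for the dialogues dataset (values are made up for illustration).
df = pd.DataFrame({
    "season":  [1, 1, 1, 1, 2, 2, 2],
    "episode": [1, 1, 1, 2, 1, 1, 1],
    "scene":   [1, 1, 2, 1, 1, 2, 3],
    "speaker": ["Michael", "Jim", "Pam", "Michael", "Dwight", "Jim", "Michael"],
})

# Counting distinct (season, episode, scene) triples works whether or not
# scene numbers restart in every episode.
scene_keys = df[["season", "episode", "scene"]].drop_duplicates()

season_summary = pd.DataFrame({
    "total_lines": df.groupby("season").size(),              # one row per line
    "total_scenes": scene_keys.groupby("season").size(),     # distinct scenes
    "unique_speakers": df.groupby("season")["speaker"].nunique(),
})
print(season_summary)
```

Comparing the total_lines and total_scenes columns side by side is what shows that a season can lead on scenes without leading on lines.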

EDA Highlights:¶

Ques 1 - Is there a relation between total lines and Rating?¶

• An upward trend is seen here: the more total lines, the better the ratings. Most episodes fall in the 200-320 lines range, within which most episodes of seasons 7, 8 and 9 tend to have the lower ratings.
• Episodes whose total lines exceed the 75th percentile across all episodes (323 lines) generally have ratings >= the show's median rating of 8.2.
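
The percentile check in the last bullet can be sketched as below. This is an illustrative example under assumptions: the Total_lines and Rating column names follow the episode table shown later in this write-up, and the values are invented, so the threshold here is not the 323 reported above.

```python
import pandas as pd

# Hypothetical per-episode table; the numbers are made up for illustration.
df_ep = pd.DataFrame({
    "Total_lines": [150, 220, 260, 300, 340, 400, 520, 610],
    "Rating":      [7.1, 7.8, 8.0, 8.2, 8.4, 8.6, 9.0, 9.3],
})

lines_q75 = df_ep["Total_lines"].quantile(0.75)  # 75th percentile of lines
median_rating = df_ep["Rating"].median()

# Episodes with more lines than the 75th percentile ...
long_eps = df_ep[df_ep["Total_lines"] > lines_q75]
# ... and the share of them rated at or above the median rating.
share_at_or_above = (long_eps["Rating"] >= median_rating).mean()

print(f"75th percentile of lines: {lines_q75}")
print(f"Share of long episodes at/above median rating: {share_at_or_above:.0%}")
```

The same pattern applies to Total_scenes by swapping the column name.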

Ques 2 - Is there a relation between total scenes and Rating?¶

• A pattern similar to Total_lines is seen: an upward trend, with more total scenes going with better ratings. Most episodes fall in the 30-55 scenes range, within which most episodes of seasons 7, 8 and 9 tend to have the lower ratings.
• Episodes whose total scenes exceed the 75th percentile across all episodes (52 scenes) generally have ratings >= the show's median rating of 8.2.
• Although a few episodes with fewer scenes are highly rated, most episodes with more scenes are highly rated.

Ques 3 - Is there a relation between number of lines spoken by particular characters and Rating?¶

• The top 5 characters by total lines are Michael, Dwight, Jim, Pam and Andy. Michael has the highest number of lines, by a large margin, in every season until season 7, while Dwight comes second in 5 seasons (2, 3, 5, 7, 9), Jim in 3 seasons (1, 4, 6) and Andy in season 8.
• Apart from the top 5, all supporting characters have a limited range of lines. Darryl's and Erin's upward curves suggest a sharp rise in their lines in the last few seasons.

[Figure: Relation between Total lines and Rating for top 5 characters]

• The assumption is that a character with more lines in an episode is on screen for most of that episode.
• An upward trend is seen for everyone except Andy: ratings tend to drop when Andy has more lines. Dwight's slope is steeper, suggesting a smaller jump in ratings compared to the other three, meaning that even when Dwight has more lines the ratings stay on the lower side.
• Jim and Pam have similar graphs, while Michael's graph shows no low-rated episodes, since most of those are from seasons 8 and 9, after he left.


• For the most frequently seen character pairs, the number of common episodes in which both characters have at least one line is: Jim-Dwight (185), Pam-Jim (181), Michael-Dwight (137) and Michael-Jim (136).
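
The pair counts in the last bullet come down to asking, for every episode, which speakers have at least one line, and then counting each pair of speakers once per episode. A minimal sketch of that idea follows; the rows are toy data and the approach (itertools.combinations plus a Counter) is an assumption about one reasonable way to do it, not necessarily how the notebook does it.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy dialogues data (made up); one row per spoken line.
df = pd.DataFrame({
    "season":  [1, 1, 1, 1, 1, 1, 2, 2],
    "episode": [1, 1, 1, 2, 2, 2, 1, 1],
    "speaker": ["Jim", "Dwight", "Pam", "Jim", "Pam", "Jim", "Michael", "Dwight"],
})

# Distinct speakers in each episode (a speaker counts once per episode,
# however many lines they have).
speakers_per_ep = df.groupby(["season", "episode"])["speaker"].unique()

pair_counts = Counter()
for speakers in speakers_per_ep:
    # Sorting makes ("Jim", "Pam") and ("Pam", "Jim") the same key.
    for pair in combinations(sorted(speakers), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```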

Ques 4 - Is there a relation between number of scenes that particular characters appear in and Rating?¶

• The top 5 characters by total scenes are Michael, Dwight, Jim, Pam and Andy. Michael appears in the highest number of scenes in every season until season 7, after which he left the show.
• Jim and Dwight tie in season 6 despite Jim having more lines, while in season 7 Jim appears in more scenes than Dwight despite Dwight having more lines.
• All the supporting characters follow a broadly similar, limited pattern over the 9 seasons.

[Figure: Relation between Total scenes and Rating for top 5 characters]

• The assumption is that a character who appears in more scenes in an episode is on screen for most of that episode. The trends are similar to those seen in the total lines graphs earlier.

Ques 5 - Is there a relation between particular directors/writers and Rating?¶

• Among directors, Greg Daniels has the highest median rating, while Ken Kwapis and Paul Feig have every episode rated above 8 (except one, the Pilot) along with the second-highest median rating.
• Matt Sohn's episodes have comparatively lower ratings, and David Rogers, who has the lowest median rating, had just one episode that was rated high.

• Among writers, Greg Daniels has the highest median rating of 8.7, followed by Paul Lieberstein with a median rating of 8.5.
• Michael Schur and Justin Spitzer have all episodes rated above 8 (except one from season 8), while Mindy Kaling and B. J. Novak have similar ratings distributions. Aaron Shure has the lowest median rating of 7.9.
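
Because the WrittenBy column can hold several names separated by "|", a per-writer median needs one row per (episode, writer) first. The sketch below shows one way to do that with str.split and explode. The writer names and ratings are placeholders, and it only handles the "|" separator; entries that join names with "and" (as in the Pilot row above) would need extra cleaning that is not shown here.

```python
import pandas as pd

# Placeholder names and made-up ratings, for illustration only.
df_imdb = pd.DataFrame({
    "Title": ["Ep 1", "Ep 2", "Ep 3", "Ep 4"],
    "Rating": [8.0, 9.0, 8.4, 7.6],
    "WrittenBy": ["Writer A", "Writer A | Writer B", "Writer B", "Writer C"],
})

# One row per (episode, writer): split on "|" and explode the resulting lists.
writers = (
    df_imdb.assign(writer=df_imdb["WrittenBy"].str.split("|"))
    .explode("writer")
)
writers["writer"] = writers["writer"].str.strip()  # drop spaces around names

writer_stats = (
    writers.groupby("writer")["Rating"]
    .agg(["median", "count"])
    .sort_values("median", ascending=False)
)
print(writer_stats)
```

Keeping the count next to the median helps when reading the ranking, since a writer with a single episode can top or trail the list by chance.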

Ques 5.1 - Is there a relation between the number of co-writers and Rating?¶


• Nothing conclusive can be said, since only 33 of the 186 episodes were written by more than one writer, which makes the data too skewed for firm observations.
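
A count such as 33/186 can be obtained by counting separators in WrittenBy. This is a small sketch under the assumption that co-writers are always separated by "|"; the names and ratings are placeholders.

```python
import pandas as pd

# Placeholder data, for illustration only.
df_imdb = pd.DataFrame({
    "WrittenBy": ["Writer A", "Writer A | Writer B", "Writer C", "Writer B |Writer C"],
    "Rating": [8.0, 9.0, 8.4, 7.6],
})

# Each "|" separates two names, so writers = separators + 1.
# str.count takes a regular expression, hence the escaped pipe.
df_imdb["n_writers"] = df_imdb["WrittenBy"].str.count(r"\|") + 1

by_team_size = df_imdb.groupby("n_writers")["Rating"].agg(["count", "median"])
print(by_team_size)
```

The count column makes the imbalance between solo-written and co-written episodes visible before any comparison of medians is attempted.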

Ques 5.2 - Do any particular director-writer combinations fetch better ratings?¶


• Director-writer pairs who have worked together on more than one episode seem to earn slightly better ratings than pairs who have worked together on only one episode.


• For (Jeffrey Blitz, Paul Lieberstein), 2/2 episodes; for (Ken Kwapis, Greg Daniels), 2/3 episodes; and for (Paul Feig, Gene Stupnitsky | Lee Eisenberg), 3/4 episodes that they worked together on have ratings above the 75th percentile of ratings (>8.6). For (Matt Sohn, Allison Silverman), 2/2 episodes that they worked together on have ratings below the 25th percentile of ratings (<7.8).

• The percentage shown is with respect to the total number of episodes per season. Seasons 6 and 8 experimented with the highest number of new director-writer pairs (pairs who worked together on just one episode), which account for almost 80% of the episodes in both seasons.
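
The share of episodes made by one-off director-writer pairs, per season, can be sketched as follows. The column names follow the episode table in this write-up; the names, ratings and the one_off_pair helper column are hypothetical.

```python
import pandas as pd

# Placeholder directors/writers and made-up ratings.
df_ep = pd.DataFrame({
    "season":     [1, 1, 1, 2, 2, 2],
    "DirectedBy": ["Dir A", "Dir A", "Dir B", "Dir A", "Dir C", "Dir B"],
    "WrittenBy":  ["Wri X", "Wri X", "Wri Y", "Wri X", "Wri Z", "Wri Z"],
    "Rating":     [8.0, 8.5, 7.9, 8.8, 7.5, 7.7],
})

# Number of episodes each (director, writer) pair made across the whole show,
# broadcast back onto every episode row.
pair_size = df_ep.groupby(["DirectedBy", "WrittenBy"])["Rating"].transform("size")
df_ep["one_off_pair"] = pair_size == 1

# Percentage of each season's episodes made by pairs that worked together once.
one_off_share = df_ep.groupby("season")["one_off_pair"].mean() * 100

# Median rating of repeat pairs (False) versus one-off pairs (True).
rating_by_pair_type = df_ep.groupby("one_off_pair")["Rating"].median()

print(one_off_share.round(1))
print(rating_by_pair_type)
```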

Ques 5.3 - Do actors make good directors/writers considering they know the show from a different perspective?¶


• Although the number of episodes involved (19) is too small to draw firm conclusions, in seasons 6 and 8, 5 of the 7 actors directed an episode, possibly to revive the low ratings.


• Half of the episodes belong to seasons 8 and 9, and most are below the median rating of 8.2. All three episodes directed by John Krasinski are low-rated.


• Some iconic episodes such as "Scott's Tots", "Michael's Last Dundies" and "Garage Sale" feature in the list. Two iconic episodes, "Casino Night" and "Survivor Man", were written by Steve Carell.

Ques 6 - Do the Top 10 best episodes(based on rating and votes) have any peculiar features associated with them?¶

df_ep.sort_values(by=['Rating', 'Num_Votes'], ascending=False).head(10)
     season  episode  Title                  AirDate     Rating  Num_Votes  DirectedBy     WrittenBy                            year  month  day  Difference in days between air dates(season-wise)  Combined                                          episode_no  Total_lines  Total_scenes
185  9       23       Finale                 2013-05-16  9.8     9269       Ken Kwapis     Greg Daniels                         2013  5      3    238.0                                              (Ken Kwapis, Greg Daniels)                        186         522          116
135  7       21       Goodbye, Michael       2011-04-28  9.8     6909       Paul Feig      Greg Daniels                         2011  4      3    217.0                                              (Paul Feig, Greg Daniels)                         136         329          53
77   5       13       Stress Relief          2009-02-01  9.7     7058       Jeffrey Blitz  Paul Lieberstein                     2009  2      6    129.0                                              (Jeffrey Blitz, Paul Lieberstein)                 78          416          70
184  9       22       A.A.R.M.               2013-05-09  9.5     3401       David Rogers   Brent Forrester                      2013  5      3    231.0                                              (David Rogers, Brent Forrester)                   185         501          70
59   4       9        Dinner Party           2008-04-10  9.4     4824       Paul Feig      Gene Stupnitsky | Lee Eisenberg      2008  4      3    196.0                                              (Paul Feig, Gene Stupnitsky | Lee Eisenberg)      60          331          31
130  7       16       Threat Level Midnight  2011-02-17  9.4     4236       Tucker Gates   B. J. Novak                          2011  2      3    147.0                                              (Tucker Gates, B. J. Novak)                       131         72           26
27   2       22       Casino Night           2006-05-11  9.4     4195       Ken Kwapis     Steve Carell                         2006  5      3    233.0                                              (Ken Kwapis, Steve Carell)                        28          385          57
94   6       4        Niagara: Part 1 & 2    2009-10-08  9.4     4055       Paul Feig      Greg Daniels | Mindy Kaling          2009  10     3    21.0                                               (Paul Feig, Greg Daniels | Mindy Kaling)          95          552          78
50   3       23       The Job                2007-05-17  9.3     3461       Ken Kwapis     Paul Lieberstein | Michael Schur     2007  5      3    238.0                                              (Ken Kwapis, Paul Lieberstein | Michael Schur)    51          544          88
64   4       14       Goodbye, Toby          2008-05-15  9.3     3453       Paul Feig      Jennifer Celotta | Paul Lieberstein  2008  5      3    231.0                                              (Paul Feig, Jennifer Celotta | Paul Lieberstein)  65          617          103

Peculiar features -


• The episodes come from every season except season 1 (a new show and style of humour, so people took time to get accustomed to it) and season 8 (not the greatest season, given its complete focus on Andy's story arc and Michael's absence).


• Ken Kwapis and Paul Feig directed 3 and 4 episodes in this list respectively. They have directed the second- and third-highest number of episodes of the show, with all of them (except the Pilot) rated above 8.


• 3 episodes were written/co-written by Greg Daniels and 3 by Paul Lieberstein. Both are regular writers of the show and have the two highest median ratings.


• The director-writer pairs of the top 5 episodes in this list have worked together on at least 1 other episode of the show.


• 9/10 episodes have total lines above the 75th percentile of lines (323). The exception is "Threat Level Midnight", whose lines and scenes are recorded inconsistently in the dataset because it mainly featured clips. 8/10 episodes have total scenes above the 75th percentile of scenes (52), confirming the trends seen earlier.

Ques 7 - Do the Top 10 worst episodes(based on rating and votes) have any peculiar features associated with them?¶

df_ep.sort_values(by=['Rating', 'Num_Votes'], ascending=[True, False]).head(10)
     season  episode  Title              AirDate     Rating  Num_Votes  DirectedBy      WrittenBy          year  month  day  Difference in days between air dates(season-wise)  Combined                          episode_no  Total_lines  Total_scenes  Top5_max_lines
157  8       19       Get the Girl       2012-03-15  6.6     1944       Rainn Wilson    Charlie Grandy     2012  3      3    175.0                                              (Rainn Wilson, Charlie Grandy)    158         261          41            ([Nellie, Jim, Andy, Erin, Robert], [42, 38, 33, 32, 24])
103  6       13       The Banker         2010-01-21  6.8     2665       Jeffrey Blitz   Jason Kessler      2010  1      3    126.0                                              (Jeffrey Blitz, Jason Kessler)    104         131          21            ([Michael, Eric, Toby, Dwight, Computron], [44, 26, 18, 17, 10])
146  8       8        Gettysburg         2011-11-17  6.9     1850       Jeffrey Blitz   Robert Padnick     2011  11     3    56.0                                               (Jeffrey Blitz, Robert Padnick)   147         282          45            ([Andy, Dwight, Jim, Robert, Oscar], [58, 30, 25, 24, 20])
167  9       5        Here Comes Treble  2012-10-25  7.0     1780       Claire Scanlon  Owen Ellickson     2012  10     3    35.0                                               (Claire Scanlon, Owen Ellickson)  168         292          39            ([Andy, Dwight, Jim, Pam, Erin], [50, 39, 35, 32, 24])
159  8       21       Angry Andy         2012-04-19  7.1     1857       Claire Scanlon  Justin Spitzer     2012  4      3    210.0                                              (Claire Scanlon, Justin Spitzer)  160         292          51            ([Andy, Nellie, Ryan, Pam, Erin], [53, 36, 30, 29, 24])
158  8       20       Welcome Party      2012-04-12  7.1     1739       Ed Helms        Steve Hely         2012  4      3    203.0                                              (Ed Helms, Steve Hely)            159         307          25            ([Andy, Jim, Pam, Erin, Nellie], [48, 39, 32, 25, 22])
160  8       22       Fundraiser         2012-04-26  7.1     1694       David Rogers    Owen Ellickson     2012  4      3    217.0                                              (David Rogers, Owen Ellickson)    161         239          38            ([Andy, Jim, Pam, Kevin, Nellie], [41, 24, 22, 18, 17])
164  9       2        Roy's Wedding      2012-09-27  7.2     1730       Matt Sohn       Allison Silverman  2012  9      3    7.0                                                (Matt Sohn, Allison Silverman)    165         296          47            ([Jim, Pam, Andy, Dwight, Erin], [39, 34, 33, 31, 31])
141  8       3        Lotto              2011-10-06  7.3     1858       John Krasinski  Charlie Grandy     2011  10     3    14.0                                               (John Krasinski, Charlie Grandy)  142         317          42            ([Andy, Jim, Darryl, Dwight, Pam], [63, 52, 49, 37, 23])
177  9       15       Couples Discount   2013-02-07  7.3     1655       Troy Miller     Allison Silverman  2013  2      3    140.0                                              (Troy Miller, Allison Silverman)  178         260          35            ([Andy, Dwight, Pam, Jim, Erin], [62, 32, 28, 26, 26])

Peculiar features -


• 9 of the 10 episodes belong to seasons 8 and 9 (after Michael leaves the show!); the exception is "The Banker" from season 6.


• Jeffrey Blitz and Claire Scanlon directed 2 episodes each. All of Jeffrey Blitz's other episodes have high ratings, while Claire Scanlon has directed only these two episodes.


• Charlie Grandy, Owen Ellickson and Allison Silverman wrote 2 episodes each.


• Episodes directed by the show's actors Rainn Wilson, Ed Helms and John Krasinski feature in the list.


• For (Matt Sohn, Allison Silverman), 2/2 episodes that they worked together on have ratings below the 25th percentile of ratings (<7.8); none of the other director-writer pairs have worked together on another episode.


• None of the episodes has total lines above the 75th percentile of lines (323) or total scenes above the 75th percentile of scenes (52), confirming the trends seen earlier.

Ques 8 - Does a decrease in ratings and votes over time mean fewer people watched the show?¶

Relation between Season and Rating:¶

• Season 3 has the highest median rating, with the median line sitting towards the upper half of the box (negative skew), indicating that most episodes have ratings above 8.6. Season 8 has the lowest median rating.
• Ratings were steady and good in seasons 2 to 5; a general decline begins from season 6 onwards. A slight revival in season 7 does not carry into seasons 8 and 9, which have consistently low ratings except for 3 episodes in season 9.

Relation between Season and Votes - How far has the number of votes decreased over seasons?¶

• A decreasing trend in votes can be seen as we move towards later seasons, with a sharp dip from season 7 to 8.

Relation between Votes and Rating - How far do ratings increase as the number of votes received increases?¶

• A positive linear relation between ratings and votes is seen up to a certain extent. Vote counts are densest between about 2,000 and 3,000 for ratings between 8 and 8.8. Votes go above 6,000 in only 4 instances.


• Ratings pick up after season 1 (naturally, because the audience took time to get accustomed to the style of humour) but the votes begin to decrease. The decrease in votes does not necessarily suggest a decline in viewership, as the highly rated peaks coincide with the highly voted peaks for the audience-favourite episodes, which explains why those episodes attracted more viewers. Even though season 9 wasn't as bad as season 8, many viewers probably did not want to watch it because of the way season 8's story arc had progressed, which would explain why it isn't voted high (saturated votes).


• Season finales end on a higher rating than the first episode of the season, except in seasons 6 and 8, which see a decline in ratings, suggesting weaker finale episodes.


• Seasons 6 and 8 experimented with the highest number of new director-writer pairs (pairs who worked together on just one episode), accounting for almost 80% of the episodes in both seasons. This may explain why the season 6 episodes in particular, with Jim as co-manager and the Sabre storylines, seemed less engaging than the previous seasons.
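
The premiere-versus-finale comparison mentioned above can be computed by taking the first and last rating of each season once the episodes are in air order. The sketch below uses made-up ratings; the opener, finale and change column names are introduced here for illustration only.

```python
import pandas as pd

# Made-up ratings for two short hypothetical seasons.
df_ep = pd.DataFrame({
    "season":  [1, 1, 1, 2, 2, 2],
    "episode": [1, 2, 3, 1, 2, 3],
    "Rating":  [7.5, 8.0, 8.4, 8.6, 8.1, 8.2],
})

# Sort into air order so "first" and "last" really are premiere and finale.
ordered = df_ep.sort_values(["season", "episode"])

opener_vs_finale = ordered.groupby("season")["Rating"].agg(
    opener="first", finale="last"
)
opener_vs_finale["change"] = opener_vs_finale["finale"] - opener_vs_finale["opener"]
print(opener_vs_finale)
```

A negative change marks a season whose finale rated below its premiere.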

Ques 9 - Decoding seasons 8 and 9: What went wrong in season 8, and what made season 9 slightly better? (apart from Michael's absence)¶

Who directed episodes in seasons 8 and 9?¶

• A "regular director" is defined as one who has directed at least 6 episodes (the length of season 1) and a "not regular director" as one who has directed fewer than 6 episodes throughout the show. Because of this definition, the graph may not support sensible conclusions for the initial seasons, but it can be used for analysing later seasons (especially seasons 8 and 9).
• Seasons 8 and 9 saw the highest share (>50%) of episodes directed by not regular directors compared to other seasons.
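
The regular / not regular split described above can be sketched as a small helper. The function name pct_not_regular and the toy data are hypothetical; the default threshold of 6 comes from the definition in the text, while the example call lowers it to 3 only so that the tiny toy table produces both groups. Episodes credited to more than one person are not treated specially here.

```python
import pandas as pd

def pct_not_regular(df_ep, col="DirectedBy", min_episodes=6):
    """Percent of each season's episodes credited to a 'not regular' person.

    A person is 'regular' if they are credited on at least `min_episodes`
    episodes across the whole show.
    """
    # Show-wide episode count of the credited person, aligned to each row.
    episodes_by_person = df_ep[col].map(df_ep[col].value_counts())
    not_regular = episodes_by_person < min_episodes
    return not_regular.groupby(df_ep["season"]).mean() * 100

# Placeholder directors, for illustration only.
df_ep = pd.DataFrame({
    "season":     [1, 1, 2, 2, 2, 2],
    "DirectedBy": ["Dir A", "Dir A", "Dir A", "Dir B", "Dir C", "Dir C"],
})

result = pct_not_regular(df_ep, min_episodes=3)
print(result)
```

Passing col="WrittenBy" gives the corresponding split for writers.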

Who wrote episodes in seasons 8 and 9?¶

• The same definitions apply to "regular writer" and "not regular writer" as to directors, so this graph too may not support sensible conclusions for the initial seasons but can be used for analysing later seasons (especially seasons 8 and 9).
• Seasons 8 and 9 saw the highest share of episodes written by not regular writers compared to other seasons, with more than 70% of season 9's episodes written by not regular writers.

Conclusion for SEASON 8:


• The entire season focused on Andy's character (as Regional Manager, his family, his personal life). Although Ed Helms is a good actor, the mediocre writing reduced Andy to a caricature by trying to portray him as a replacement for Michael's character. The writers apparently wanted to make Darryl (Craig Robinson) the manager but thought he was too smart to make the kind of decisions that lead to comical situations.


• The season also saw unnecessarily extended storylines (Gabe complaining about Erin in the background), randomly abandoned storylines (Darryl-Val, Cathy) and haphazard storylines (the Florida store).


• The cast additions also didn’t help - Robert California’s and Nellie Bertram’s storylines did more damage to the ratings than expected.

Conclusion for SEASON 9:


• The focus shifted to Dwight's story arc (farm life, friends, the manager position, his relationship with Angela) and Jim's story arc (relationship stress with Pam, his own sports company in Philly). While the initial episodes did focus on Andy's character, Ed Helms was away for some episodes shooting The Hangover Part III, so his scenes were cut by half and his dialogues by 40% compared with season 8. This shifted the focus onto other characters, which seems to have helped the show's ratings.


• The storylines got more real and personal, with new friendships like Nellie-Pam, Dwight-Erin and Jim-Darryl along with a peek into the documentary crew.


• The season not only brought back characters like Roy and Jan and introduced many new characters through its scattered storylines (Clark, Pete, Dwight's family and friends, Jim's Philly office), but also gave regular supporting characters like Angela and Oscar their due: both had their all-time highest lines and scenes in season 9.

For the complete code and analysis visit my github repo - https://github.com/swarnitav08/EDA-for-The-Office-Story
