Exploratory Data Analysis

Load multivalued attributes

Multivalued attributes with gross and count

  1. Input
  2. Output
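A minimal sketch of how a multivalued attribute can be expanded into per-value gross and count, using plain Python so it stays self-contained. The sample rows, the field names `genres` and `gross`, and the helper names are hypothetical and are not taken from the notebook's data.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical sample rows: each movie has a list of genres and one gross value.
movies = [
    {"title": "A", "genres": ["Action", "Adventure"], "gross": 300.0},
    {"title": "B", "genres": ["Action"], "gross": 100.0},
    {"title": "C", "genres": ["Drama"], "gross": 50.0},
]

def explode_gross(rows, column):
    """Map each distinct value of a multivalued column to the gross of every movie listing it."""
    grouped = defaultdict(list)
    for row in rows:
        for value in row[column]:
            grouped[value].append(row["gross"])
    return grouped

def summarize(grouped):
    """Return count, mean gross and median gross for each attribute value."""
    return {
        value: {"count": len(g), "mean": mean(g), "median": median(g)}
        for value, g in grouped.items()
    }

stats = summarize(explode_gross(movies, "genres"))
```

A movie with several genres contributes its full gross to each of them, so the per-genre totals do not add up to the overall total.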

Brief information about data

Heatmap of Correlation Matrix, Histogram and Scatter Matrix

Genre analysis

The bar chart shows that the following genres generate the highest gross:

  1. Animation
  2. Adventure
  3. Action
  4. Sci-fi
  5. Fantasy
  6. Family

Film-Noir stays at the lowest position

News, though with only 1 record, is in the top 6 of the median plot

We will develop a Genre Rank based on median gross, ignoring genres that have fewer than 10 releases
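One way such a median-based rank could be built is sketched below. The function name, the sample numbers, and the rank direction (rank 1 is the lowest median, so a larger rank means a higher median gross) are assumptions made for illustration; the notebook's own convention may differ.

```python
from statistics import median

def rank_by_median(gross_by_value, min_count=10):
    """Rank values by median gross, skipping values with fewer than min_count records.

    Assumption: rank 1 is the lowest median, so larger ranks mean higher gross.
    """
    medians = {
        value: median(grosses)
        for value, grosses in gross_by_value.items()
        if len(grosses) >= min_count
    }
    ordered = sorted(medians, key=medians.get)
    return {value: position for position, value in enumerate(ordered, start=1)}

# Hypothetical data: News has a high median but only one record, so it is excluded.
gross_by_genre = {
    "Animation": [400.0] * 12,
    "Drama": [80.0] * 15,
    "News": [500.0],
}
genre_rank = rank_by_median(gross_by_genre)
```

The threshold keeps a genre such as News, with a single record, from dominating the ranking.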

We rank each movie by the highest rank among the genres in its list of genres (since Genre is a multivalued attribute)

We rank each movie by the total rank of the genres in its list of genres

We rank each movie by the average rank of the genres in its list of genres (since Genre is a multivalued attribute)
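The three aggregations (highest, total, and average rank) can be sketched with one helper. The function name, the `how` parameter, and the sample rank table are hypothetical; using 0 for a genre without a rank is also an assumption.

```python
def movie_rank(values, rank_table, how="max", default=0):
    """Collapse the ranks of a movie's multivalued attribute into one number.

    how is "max", "total" or "mean"; values without a rank count as `default`.
    """
    ranks = [rank_table.get(value, default) for value in values]
    if not ranks:
        return default
    if how == "max":
        return max(ranks)
    if how == "total":
        return sum(ranks)
    if how == "mean":
        return sum(ranks) / len(ranks)
    raise ValueError(f"unknown aggregation: {how}")

# Hypothetical rank table where a larger rank means a higher median gross.
genre_rank = {"Drama": 1, "Action": 2, "Animation": 3}
genres = ["Action", "Animation"]

by_max = movie_rank(genres, genre_rank, "max")
by_total = movie_rank(genres, genre_rank, "total")
by_mean = movie_rank(genres, genre_rank, "mean")
```

The total grows with the number of genres a movie lists, while the max and the mean do not, which is one reason the three variants can correlate differently with gross.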

Now we will test the hypothesis that movies with the Adventure genre have higher gross than other movies

We will choose this one since it has a relatively high correlation of 0.366
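A hypothesis of this kind can be checked by correlating a 0/1 indicator with gross. The sketch below computes the Pearson correlation by hand on made-up numbers; it assumes, without confirmation from the notebook, that a correlation of this type is what the reported values refer to.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical movies as (genres, gross) pairs.
movies = [
    (["Adventure", "Action"], 300.0),
    (["Drama"], 50.0),
    (["Adventure"], 250.0),
    (["Comedy"], 100.0),
]
is_adventure = [1 if "Adventure" in genres else 0 for genres, _ in movies]
gross = [value for _, value in movies]
r = pearson(is_adventure, gross)
```

With a binary indicator this is the point-biserial correlation: it is positive when the flagged group has the higher mean gross.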

Release_Day analysis

Movies tend to be released on Friday

No linear relation

Hypothesis: Do movies released on Friday or Wednesday have higher gross than others?

The two are not strongly related.

Release_Month analysis

Number of releases by month

How average gross depends on Release_Month, combined with Release_Year.

We can see that gross tends to increase by year.

From the median plot of month versus gross, we can see that months 5, 6, 7, and 12 are the top months (we call them special months)

This correlation is low, so we will consider the mean instead

From the mean plot of month versus gross, we can see that months 5, 6, 7, 11, and 12 are the top months (we call them special months)

This has a higher correlation, so we will choose it
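A sketch of how a special-month flag could be derived from mean gross by month. The records, the helper names, and the choice of `k` are made up for illustration; in the analysis above the special months were read off the plot rather than picked by a fixed `k`.

```python
from collections import defaultdict

def mean_gross_by_month(records):
    """Average gross per release month from (month, gross) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for month, gross in records:
        totals[month][0] += gross
        totals[month][1] += 1
    return {month: total / count for month, (total, count) in totals.items()}

def top_months(records, k):
    """The k months with the highest mean gross."""
    means = mean_gross_by_month(records)
    return set(sorted(means, key=means.get, reverse=True)[:k])

# Hypothetical (Release_Month, gross) pairs.
records = [(6, 300.0), (6, 200.0), (1, 40.0), (12, 260.0), (9, 60.0)]
special = top_months(records, k=2)

# Binary feature: 1 if the movie was released in a special month, else 0.
is_special = [int(month in special) for month, _ in records]
```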

Budget analysis

Min value is $2,100

No budget has a value of $0 (min value is $15,000)

Relation between Budget and Gross_worldwide

Now we will test the correlation between Budget and Gross_worldwide

Given its correlation with Gross_worldwide, Budget may perform very well as a feature in our future model

Cast Analysis

Hypothesis: Does gross depend on the number of actors shown on the movie's IMDb page?

Very low correlation

This is expected, since almost all IMDb pages tend to list more than 17 cast members per page

Cast members and the average gross of the movies they appear in

Our data contains the names of 73026 cast members

38972 cast members appear in fewer than 3 movies, and some of them have a very high median value

This makes the rank unreliable when we try to develop a rank system for this type of field

These cast members are not very well known. Jason Whyte and Sean Anthony Moran are cast members who appear in only 1 movie

Now we will develop a rank system for cast (finding the top leader cast members)

They must have appeared in more than 5 movies

We call them leader cast. Now we move on to develop the rank

Apply it to the data to get CastsRank. Here we total the rank values

What if we take the mean of the ranks instead?

Both give approximately the same correlation value of 0.51

Now we will extract more features:

NumLeadActors

HasTop30Actors
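The two features could be derived from a cast rank table along these lines. Their exact definitions are not spelled out above, so this sketch assumes that NumLeadActors counts the cast members of a movie who are leader cast and that HasTop30Actors is 1 when at least one cast member is among the 30 highest-ranked leader cast. The rank table and names are hypothetical, with a larger rank taken to mean a higher gross.

```python
def cast_features(cast, cast_rank, top_n=30):
    """Derive count and flag features from a movie's cast list.

    cast_rank maps leader cast names to a rank; larger is better (assumption).
    """
    top_names = set(sorted(cast_rank, key=cast_rank.get, reverse=True)[:top_n])
    return {
        "NumLeadActors": sum(1 for name in cast if name in cast_rank),
        "HasTop30Actors": int(any(name in top_names for name in cast)),
    }

# Hypothetical rank table with 40 leader cast members: Actor 40 ranks highest.
cast_rank = {f"Actor {i}": i for i in range(1, 41)}

features_a = cast_features(["Actor 5", "Actor 40", "Unknown"], cast_rank)
features_b = cast_features(["Actor 5"], cast_rank)
```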

Crew analysis

Hypothesis: Does revenue depend on the number of crew members that appear on the movie's IMDb page?

Number of movies each crew member participates in

There are many writers and directors who participate in only a few movies

The top does not seem to change very much

Now we will develop a rank for crew

Now we will rank the movies based on crew team rank

Result correlation:

Now consider the number of crew members in the top 150 for each movie

Consider films that have crew members in the top 50

Studio Analysis

Hypothesis: Does gross depend on the number of studios that participate?

Correlation of 0.112: not very relevant

Movies with 4 or 5 studios may be outliers. We will try dropping them.

The correlation stays the same, so we don't need to delete them

Studios and the gross of the movies they work on

This average gross plot for studios is not reliable for deciding whether a studio is big, since some studios participate in only one movie

We can define big studios in 4 ways:

Now we will exclude studios with fewer than 5 releases

The top studios bar chart now looks more familiar.

Move on to rank the Studios.

Apply by taking the max rank in the list of studios for each movie

A studio that does not appear in the list is assigned a random rank between 0 and 200

Apply by taking the total rank in the list of studios for each movie

A studio that does not appear in the list is assigned a random rank between 0 and 200

Apply by taking the mean rank in the list of studios for each movie

A studio that does not appear in the list is assigned a random rank between 0 and 200

We decide to choose the total rank
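The random fallback for unlisted studios could look like the sketch below. The studio names and rank values are made up, and treating both 0 and 200 as possible values is an assumption. A seeded generator is used so the feature can be reproduced between runs.

```python
import random

def studio_ranks(studios, studio_rank, rng, low=0, high=200):
    """Look up each studio's rank; unlisted studios get a random rank in [low, high]."""
    return [
        studio_rank[name] if name in studio_rank else rng.randint(low, high)
        for name in studios
    ]

# Hypothetical rank table of studios that passed the release-count filter.
studio_rank = {"Studio A": 350, "Studio B": 280}

rng = random.Random(42)  # fixed seed keeps the random fallback reproducible
ranks = studio_ranks(["Studio A", "Unlisted Studio"], studio_rank, rng)
total_rank = sum(ranks)  # the total-rank variant chosen above
```

Without a fixed seed, the feature values for unlisted studios would change on every run, and the measured correlation with them.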

Now we will extract more features:

NumTopStudios

HasTopStudio

Production Countries Analysis

Hypothesis: Does gross depend on the number of production countries in each movie?

Very low correlation of 0.079

Gross of each country and total gross

This ranking by average gross is not reliable, since some countries have very few releases

We can see that almost all movies are released in the United States.

We will test whether movies released in the United States have higher gross than other films

The correlation seems low

Now we will exclude countries that have fewer than 100 releases

We will use this list of countries to extract a rank feature

Get rank by the maximum

By total

By mean rank in the list

Language analysis

Distribution of releases by language

Nowadays, almost all films include English. We will test whether it matters for gross if a film is spoken in English or in another language

Correlation of 0.076

Keywords analysis

Hypothesis: Does gross depend on the number of keywords?

Not relevant

Distribution of gross by keyword

These keywords have high mean and median gross, but each of them has a count of only 1

Superhero keywords are popular in the 2010s and have high average gross

Rank keywords

Get rank for keywords by maximum

Default max rank = 0

MPAA Analysis

Distribution of gross by MPAA certificate

NC-17 has the fewest releases

PG-13 and G have higher mean and median gross, while R and NC-17 have low mean and median gross. This is understandable, since R and NC-17 ratings restrict the age of the audience that can watch the film, hence the lower gross.

Conduct rank by Certificate

By Mean

We see that PG-13 has the highest mean gross.

We will test whether movies rated PG-13 have higher gross than others.

Export to CSV