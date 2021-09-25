A fun project to learn graph databases

Tired of all those JOINs in SQL? Do you get a headache every time you have to modify a schema in a relational database? If your answer to either question is yes, you should try a graph database such as Neo4j.

A graph database stores information in the form of nodes and edges. Nodes are connected by edges, and both can have properties. We can retrieve and aggregate data with queries. Because their logic and semantics are closer to how our minds model the real world, graph databases are easy to learn. Relational database users can quickly spot the similarities between SQL and Cypher, Neo4j’s query language. In addition, Neo4j comes with aggregation functions and machine learning modules that can provide quick insights into the data.

In my previous articles (Neo4j for antibiotic resistance, for diseases and for genome analyses), I showed that the Neo4j graph database can surface a lot of interesting information that is not so obvious in relational databases. But they were all about bioinformatics, which is not the simplest subject, and they focused more on the advantages of Cypher over SQL. Readers have asked me for a more general introduction to Neo4j to shorten the learning curve. So I looked for data suited to a Neo4j beginner project, especially for readers with an SQL background. Recently I came across the Bollywood movies dataset (CC BY 4.0) by P. Premkumar and considered it a good fit.

The portmanteau Bollywood is made up of Bombay and Hollywood. As the name suggests, it refers to the Mumbai-based Hindi-language film industry. Its films cover a wide range of topics, and some have even touched on sensitive political and religious issues. Many movies contain cheerful dance scenes that brighten the mood. Bollywood movies are popular all over the world. For example, the movie Dangal was so successful in China that two-thirds of its worldwide gross of $330 million came from China alone. Hollywood has been written about extensively, but Bollywood rarely receives its fair share of attention, so it is worthwhile to learn more about this film industry. Premkumar’s tabular data is small, but it contains 1,698 Bollywood films released between 2005 and 2017. So I consider it a great source of data for this introductory project. The code for this project is hosted on my GitHub repository here:

The dataset comes from https://boxofficeindia.com. Compared to IMDB.com, the data from boxofficeindia.com is very incomplete. Also be aware that the table only lists the lead actors, so it underrepresents the actor-film relationships. In addition, this project only includes actors, directors and their films. There is also a small pitfall in the raw data: the budget and revenue columns are swapped.

First, download the data from the link here. Swap the column names Budget(INR) and Revenue(INR). Then run all cells in my Python notebook prepare.ipynb. They should generate three CSV files in the data_for_neo4j folder. You can also find all the data files in my GitHub repository.
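The column swap can also be done in code rather than by hand. The following is only a minimal sketch using Python's standard library, not the code from prepare.ipynb. The sample rows and the Movie Name column are hypothetical; only Budget(INR) and Revenue(INR) come from the dataset description above.

```python
import csv
import io


def swap_column_names(header, first, second):
    """Return a copy of header with the two given column names exchanged."""
    swapped = []
    for name in header:
        if name == first:
            swapped.append(second)
        elif name == second:
            swapped.append(first)
        else:
            swapped.append(name)
    return swapped


# Hypothetical sample; the real file has more columns and rows.
raw = io.StringIO(
    "Movie Name,Budget(INR),Revenue(INR)\n"
    "Example Film,500,100\n"
)
rows = list(csv.reader(raw))

# Only the header row changes; the data rows stay exactly as they are.
rows[0] = swap_column_names(rows[0], "Budget(INR)", "Revenue(INR)")

fixed = io.StringIO()
csv.writer(fixed, lineterminator="\n").writerows(rows)
print(fixed.getvalue())
```

Renaming the headers instead of moving the values keeps every data row untouched, which makes the fix easy to verify by eye.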

Open Neo4j Desktop. Create a project called bollywood and place the three CSV files in its import folder.

Then open Neo4j Browser and enter the following commands. They will import the data into the database (See the 2. Import data into Neo4j section for detailed instructions).

Let’s take a look at the dataset first. Open Neo4j Bloom in Neo4j Desktop. In Settings, increase Node query limit to 3,600. Under Perspective -> Search phrases, add a search phrase called match all with the following query:

Return to the main window, select match all in Search graph, and run it. Bloom should generate a topological overview of the dataset (Figure 1).

Figure 1. The topological overview of the Bollywood dataset. Green: film; Red: actor; Blue: director. Image by the author.

The big picture immediately gives us an overall impression of the world of Bollywood cinema. Most of the small actor-film-director groups are islands, but one large cluster is visible. You can find out more by zooming in or using the filters:

Figure 2. A detailed part of the large cluster in the dataset. Green: film; Red: actor; Blue: director. Image by the author.

It is clear that this large cluster is held together by prolific Bollywood actors such as the Three Khans, Akshay Kumar and Emraan Hashmi, and by directors such as Vikram Bhatt and Mohit Suri.

We can also take a look at the genre distribution in the dataset:

The Neo4j documentation explains that WITH is like the pipe operator in Bash. In the above query, I calculated the total number of movies in a WITH clause and then used it in the percentage calculation. The results suggest that dramas, comedies and thrillers are the top three film genres in Bollywood.
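For readers who find the two-stage logic easier to follow outside Cypher, here is a rough Python analogue. It is not the query used in this article, and the genre labels are made up for illustration. The point is only that the total is computed first and then carried into the next stage, which is the role WITH plays.

```python
from collections import Counter

# Hypothetical genre labels, one per film; not taken from the dataset.
genres = [
    "drama", "comedy", "drama", "thriller",
    "drama", "comedy", "horror", "drama",
]

# Stage 1: the value that WITH would carry forward to the next stage.
total = len(genres)

# Stage 2: per-genre counts expressed as a percentage of that total,
# ordered from the most to the least frequent genre.
percentages = {
    genre: round(100 * count / total, 1)
    for genre, count in Counter(genres).most_common()
}

print(percentages)
```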

After the overview, we can quickly calculate statistics with Neo4j. To begin with, let’s calculate the combined revenue of Aamir Khan’s films:

This Cypher query first matched all movies with Aamir Khan as the lead actor and then summed up their revenue. This quickly gave us a total of 29,030,895,000 Indian rupees ($393,932,987). As one of the most successful and influential actors in Bollywood, Aamir Khan has starred in several of the year's highest-grossing Indian films, such as Ghajini, 3 Idiots, Rangeela, Dhoom 3, PK, and Dangal. He has also been called the king of the Chinese box office because his films have enjoyed huge success in China as well.

Next, we can look at the top earners:

Unlike the previous query, this query removed the constraint on the actor nodes. It calculated the revenue-to-budget ratios to show how profitable the films were relative to their cost. Finally, the query sorted the results by revenue in descending order.

The query revealed that Prabhas's Baahubali 2 led the list. But this film was also quite expensive to make. In comparison, Dangal showed a revenue-to-budget ratio of 5. Again, the Khans topped the list. Be aware that these results do not agree with the data from Wikipedia.
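The ratio-and-sort step described above is simple arithmetic, and a small Python sketch may make it concrete. The film titles and figures below are invented, and this is an illustration of the logic rather than the Cypher query itself.

```python
# Hypothetical films; budget and revenue are in arbitrary units.
films = [
    {"title": "Film A", "budget": 100, "revenue": 500},
    {"title": "Film B", "budget": 400, "revenue": 800},
    {"title": "Film C", "budget": 50, "revenue": 20},
]

# Add the revenue-to-budget ratio to each film, then sort by
# absolute revenue in descending order.
ranked = sorted(
    ({**film, "ratio": film["revenue"] / film["budget"]} for film in films),
    key=lambda film: film["revenue"],
    reverse=True,
)

for film in ranked:
    print(film["title"], film["revenue"], film["ratio"])
```

In this toy data, Film B earns the most in absolute terms, but Film A has the higher ratio. That is the same kind of contrast as between an expensive top earner and a cheaper film with a ratio of 5.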

We can also list the biggest box office bombs:

This quick query returned the top 10 flops in the dataset. At the top of the list was Bombay Velvet, which was criticized for its screenplay and direction. The occupancy rate on the first day was only 10-20%, and all theaters pulled the film on the third day. It burned a 748,635,000 INR hole in the pockets of its investors. Second place went to Broken Horses. The film received mostly negative reviews and was a commercial flop as well. But for me, having an Indian director alone does not qualify the film as a true Bollywood film.

Then we can also see which actors have the most thrillers or horror movies under their belt.

It appears that Emraan Hashmi and Ajay Devgn were the most active thriller actors. Mr. Hashmi has successfully established himself as a thriller star. We can also list all his movies in the dataset:

It turns out that Mr. Hashmi was also the lead actor in the last three installments of the horror film series Raaz. According to the dataset, there weren’t that many horror movies in Bollywood (52 out of 1,695, or 3%, see 2. Preview). These three horror movies alone already make Mr. Hashmi the actor with the most horror movies in our dataset.

Finally, let’s see how often some actor-director duos have worked together.

The bromance between superstar Ajay Devgn and director Rohit Shetty is well known. The 1991 film Phool Aur Kaante was their first film together, and they have since worked together on 11 films.

In Section 2, the topological overview in Neo4j Bloom showed us a large cluster centered around some of the biggest names in Bollywood. Now the question is whether we can extract this community with a query. With the help of Neo4j’s Graph Data Science Library (GDS), the answer is a resounding yes.

In my previous Neo4j for Diseases article, I showed how to use the Louvain algorithm to calculate communities. But in this project I failed to achieve optimal results with Louvain. I then found Tomaz Bratanic’s Medium article Subgraph filtering in Neo4j Graph Data Science library. In it he mentioned another algorithm, Weakly Connected Components (WCC). This algorithm finds sets of connected nodes in an undirected graph, where all nodes of the same set form a connected component. My test showed that it could quickly identify disconnected groups and effectively isolate the large cluster from Section 2.
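To give an intuition for what WCC computes, here is a small self-contained Python sketch based on a union-find structure. It is not how GDS implements the algorithm internally, which I make no claim about. The node names are hypothetical, and nodes without any edge are not covered by this sketch.

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Group nodes into components, ignoring edge direction.

    edges is an iterable of (node_a, node_b) pairs. The result is a
    list of sets, largest component first.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        while parent[node] != node:
            # Path halving: point the node at its grandparent.
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    return sorted(components.values(), key=len, reverse=True)


# Hypothetical actor-film and director-film links.
edges = [
    ("Actor A", "Film 1"), ("Director X", "Film 1"),
    ("Actor A", "Film 2"), ("Director Y", "Film 2"),
    ("Actor B", "Film 3"), ("Director Z", "Film 3"),
]
components = weakly_connected_components(edges)
for component in components:
    print(len(component), sorted(component))
```

Here Actor A links Film 1 and Film 2 into a single component of five nodes, while Film 3 and its cast form an island of three. This mirrors the pattern of one large cluster surrounded by small islands.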

To use WCC, you need to activate the GDS plugin in your project (read the instructions in section 5 of Neo4j for Diseases). First, we need to create an in-memory graph with the command:

Then we run WCC. It returns the 10 largest communities (called components in WCC):

We can see that component 0 has as many as 1,750 nodes. However, I found that some nodes are counted twice. Let’s list the nodes with the DISTINCT keyword:

The query returns 1,734 nodes instead of the original 1,750. We can even display the networks in Neo4j Browser. But first, adjust the viewing settings in Neo4j Browser (Figure 3).

Figure 3. Settings for displaying large clusters. Image by the author.

And then run the command:

The COLLECT function transforms the names into a list. We then do a normal MATCH query and filter the results with this list.
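The collect-then-filter pattern can be mimicked in a few lines of Python. This is only an analogue of the logic, not the Cypher command itself. The rows and edges are invented, and the duplicated "Actor A" row simply imitates a node that is reported twice.

```python
# Hypothetical WCC output rows: (node name, component id).
rows = [
    ("Actor A", 0), ("Film 1", 0), ("Actor A", 0),
    ("Director X", 0), ("Actor B", 1), ("Film 3", 1),
]

# Hypothetical relationships between the nodes.
edges = [
    ("Actor A", "Film 1"),
    ("Director X", "Film 1"),
    ("Actor B", "Film 3"),
]

# DISTINCT plus COLLECT: gather the unique names of component 0.
names = {name for name, component in rows if component == 0}

# The follow-up MATCH with a filter: keep only the relationships
# whose two endpoints both belong to the collected names.
subgraph = [(a, b) for a, b in edges if a in names and b in names]

print(len(names), subgraph)
```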

Figure 4. Component 0 in Bollywood data. Image by the author.

After close inspection, we can confirm that this is the large cluster that we observed in Neo4j Bloom from section 2.