In this work, we use Apache Spark to run queries over files that describe a movie dataset. Apache Spark offers two APIs for implementing queries: the low-level RDD API (MapReduce-style transformations) and Spark SQL (DataFrames).
We use a movie dataset that is a subset of the Full MovieLens Dataset. The subset consists of three text files in CSV format.
| File | Description |
|---|---|
| movies.csv | Describes the movies in the dataset. |
| movie_genres.csv | Each line holds a (movie ID, movie genre) pair, in that order. |
| ratings.csv | Describes user ratings for movies. |
For more information about the files, see the Kaggle Movie Dataset page.
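As a minimal loading sketch (the HDFS path `hdfs:///movies/` and the header/schema options are assumptions, not part of this README), the three files can be read into Spark like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("movielens-queries").getOrCreate()

# hdfs:///movies is a hypothetical location; point it at wherever
# the three CSV files were uploaded in HDFS
base = "hdfs:///movies"
movies = spark.read.csv(f"{base}/movies.csv", header=True, inferSchema=True)
movie_genres = spark.read.csv(f"{base}/movie_genres.csv", header=True, inferSchema=True)
ratings = spark.read.csv(f"{base}/ratings.csv", header=True, inferSchema=True)
```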
| Query | Description |
|---|---|
| Q1 | From 2000 onwards, find for each year the movie with the highest profit. Ignore entries with a missing release date or a zero value in revenue or budget. |
| Q2 | Find the percentage (%) of users whose average movie rating is greater than 3. |
| Q3 | For each genre, find the genre's average rating and the number of movies that belong to it. If a movie belongs to more than one genre, it is counted in each of them. |
| Q4 | For "Drama" movies, find the average movie-summary length in 5-year windows from 2000 onwards (1st window: 2000-2004, 2nd: 2005-2009, 3rd: 2010-2014, 4th: 2015-2019). |
| Q5 | For each genre, find the user with the most reviews, along with their most and least favorite movie according to their ratings. Sort the results alphabetically by genre and present them in a table with the following columns: genre, user with the most reviews, number of reviews, most favorite movie, rating of most favorite movie, least favorite movie, rating of least favorite movie. |
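To make the two APIs from the introduction concrete, here is a hedged sketch of Q2 in both styles; the column names `userId` and `rating` are assumptions about `ratings.csv`, not confirmed by this README:

```python
# --- Spark SQL version of Q2 (column names are assumed) ---
ratings.createOrReplaceTempView("ratings")
q2_sql = spark.sql("""
    SELECT 100.0 * SUM(CASE WHEN avg_rating > 3 THEN 1 ELSE 0 END) / COUNT(*) AS pct
    FROM (SELECT userId, AVG(rating) AS avg_rating
          FROM ratings GROUP BY userId) per_user
""")
q2_sql.show()

# --- RDD (MapReduce-style) version of the same query ---
per_user = (ratings.rdd
            .map(lambda row: (row["userId"], (row["rating"], 1)))  # (user, (rating, 1))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum ratings and counts
            .mapValues(lambda s: s[0] / s[1]))                     # each user's average
pct = 100.0 * per_user.filter(lambda kv: kv[1] > 3).count() / per_user.count()
print(pct)
```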
- We upload the 3 CSV files to the Hadoop Distributed File System (HDFS).
- We convert the 3 CSV files to Parquet format and save them back to HDFS (see the conversion sketch after this list).
- We implement the 5 queries and measure the execution time for the following 3 cases (a timing sketch follows the list):
  - MapReduce queries – RDD API
  - Spark SQL with CSV input
  - Spark SQL with Parquet input
- We present the execution times in a bar graph, grouped by query.
- We implement the broadcast join with the RDD API (MapReduce).
- We implement the repartition join with the RDD API (MapReduce).
- We compare the execution times of the two join implementations (both are sketched after this list).
- We execute a query with and without the Spark SQL query optimizer and present the results in a bar graph (see the last sketch below).
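A hedged sketch of the CSV-to-Parquet conversion step; the output paths and file names are assumptions:

```python
# Write each DataFrame back to HDFS as Parquet; names/paths are assumed
for name, df in [("movies", movies),
                 ("movie_genres", movie_genres),
                 ("ratings", ratings)]:
    df.write.mode("overwrite").parquet(f"{base}/{name}.parquet")
```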
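One way the per-case timings could be taken (a sketch under assumptions, not the project's actual harness; the aggregation here is a placeholder query):

```python
import time

def timed(build_df):
    """Time a query end to end; collect() forces full evaluation."""
    start = time.time()
    build_df().collect()
    return time.time() - start

# Same placeholder aggregation over the two input formats
t_csv = timed(lambda: spark.read.csv(f"{base}/ratings.csv", header=True, inferSchema=True)
                           .groupBy("userId").count())
t_parquet = timed(lambda: spark.read.parquet(f"{base}/ratings.parquet")
                               .groupBy("userId").count())
print(f"CSV: {t_csv:.2f}s, Parquet: {t_parquet:.2f}s")
```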
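Hedged sketches of the two RDD join strategies; the CSV field positions are assumptions, and header/quoting handling is omitted for brevity:

```python
sc = spark.sparkContext

# Assumed layouts: movie_genres.csv = movieId,genre
#                  ratings.csv      = userId,movieId,rating,timestamp
genres_rdd = (sc.textFile(f"{base}/movie_genres.csv")
                .map(lambda line: line.split(","))
                .map(lambda f: (f[0], f[1])))              # (movieId, genre)
ratings_rdd = (sc.textFile(f"{base}/ratings.csv")
                 .map(lambda line: line.split(","))
                 .map(lambda f: (f[1], float(f[2]))))      # (movieId, rating)

# Broadcast join: ship the small relation to every executor as a dict,
# so the large relation is joined via local lookup with no shuffle
genres_map = sc.broadcast(dict(genres_rdd.collect()))
broadcast_joined = ratings_rdd.map(
    lambda kv: (kv[0], (genres_map.value.get(kv[0]), kv[1])))

# Repartition join: tag each record with its source relation, shuffle
# by key, then pair every genre with every rating inside each group
tagged = (genres_rdd.map(lambda kv: (kv[0], ("G", kv[1])))
          .union(ratings_rdd.map(lambda kv: (kv[0], ("R", kv[1])))))

def cross_join(records):
    genres = [v for tag, v in records if tag == "G"]
    rats = [v for tag, v in records if tag == "R"]
    return [(g, r) for g in genres for r in rats]

repartition_joined = tagged.groupByKey().flatMapValues(cross_join)
```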
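For the optimizer comparison, one common approach (an assumption about how it was done here, not confirmed by this README) is to disable the optimizer's broadcast-join planning and time the same join twice:

```python
# With broadcast planning disabled, joins fall back to sort-merge;
# -1 turns the size threshold off entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# ... time the Spark SQL join query here ...

# Restore the default (~10 MB) so small tables are broadcast again
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
# ... time the same query again and compare ...
```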
For more information about the output results and the bar graphs, see the report.
Collaborator: Manos (Emmanouil) Vekrakis