Apache-Spark-Project

Description

In this laboratory exercise, created for the "Advanced Topics in Database Systems" course at NTUA, we use Apache Spark to evaluate queries over the files of a movie dataset. Apache Spark offers two APIs for implementing the queries:

  1. Resilient Distributed Dataset (RDD) API
  2. DataFrame API / Spark SQL
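
As a minimal illustration of the difference between the two styles (a sketch assuming a PySpark environment; the data and column names here are made up, not taken from the dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-demo").getOrCreate()
sc = spark.sparkContext

pairs = [("Drama", 4.0), ("Comedy", 3.5), ("Drama", 2.0)]

# RDD API: explicit map/reduce over (key, value) pairs
rdd_avg = (sc.parallelize(pairs)
             .mapValues(lambda r: (r, 1))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .mapValues(lambda s: s[0] / s[1]))
print(rdd_avg.collect())

# DataFrame API / Spark SQL: the same aggregation, declaratively
df = spark.createDataFrame(pairs, ["genre", "rating"])
df.groupBy("genre").avg("rating").show()
```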

Dataset preview

We use a movie dataset that is a subset of the Full MovieLens Dataset. The subset consists of three text files in CSV format.

File Description
movies.csv Describes the movies in the dataset.
movie_genres.csv Each line holds a (movie ID, movie genre) pair, in that order.
ratings.csv Describes user ratings for movies.

For more information about the files, visit Kaggle Movie Dataset.
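
The sketches in the rest of this README assume the files have been read from HDFS roughly as follows (the HDFS URI is a hypothetical placeholder, and the header/inferSchema settings are assumptions about the files, so adjust as needed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("movies-lab").getOrCreate()

base = "hdfs://master:9000/files"  # hypothetical HDFS location

# The files are assumed to have no header row, so Spark names the
# columns _c0, _c1, ...; inferSchema makes numeric columns numeric.
movies  = spark.read.csv(f"{base}/movies.csv",       inferSchema=True)
genres  = spark.read.csv(f"{base}/movie_genres.csv", inferSchema=True)
ratings = spark.read.csv(f"{base}/ratings.csv",      inferSchema=True)
```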

Queries

Query Description
Q1 From 2000 onwards, find for each year the movie with the largest profit. Ignore entries that are missing the release date or have a zero value for revenue or budget.
Q2 Find the percentage (%) of users whose average movie rating is greater than 3 (see the example after this table).
Q3 For each genre, find the average rating of the genre and the number of films belonging to it. If a movie belongs to more than one genre, it is counted in each of them.
Q4 For "Drama" movies, find the average movie-summary length in every 5-year window from 2000 onwards (1st window: 2000-2004, 2nd: 2005-2009, 3rd: 2010-2014, 4th: 2015-2019).
Q5 For each genre, find the user with the most reviews, together with that user's most and least favorite movies according to their ratings. The results should be sorted alphabetically by genre and presented in a table with the following columns:
• Genre
• User with the most reviews
• Number of reviews
• Most favorite movie
• Rating of most favorite movie
• Least favorite movie
• Rating of least favorite movie
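
To make the two implementation styles concrete, here is how Q2 might be expressed in each API, continuing from the loading sketch above. The column layout of ratings.csv (user ID, movie ID, rating, timestamp) is an assumption based on the usual MovieLens format:

```python
# RDD API (MapReduce style): per-user average rating, then a filtered count
user_avg = (ratings.rdd
            .map(lambda row: (row[0], (float(row[2]), 1)))   # (user_id, (rating, 1))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .mapValues(lambda s: s[0] / s[1]))               # (user_id, avg rating)
pct = 100.0 * user_avg.filter(lambda kv: kv[1] > 3).count() / user_avg.count()
print(pct)

# Spark SQL: the same computation, declaratively
ratings.createOrReplaceTempView("ratings")
spark.sql("""
    SELECT 100.0 * SUM(CASE WHEN avg_r > 3 THEN 1 ELSE 0 END) / COUNT(*) AS pct
    FROM (SELECT _c0 AS user_id, AVG(_c2) AS avg_r FROM ratings GROUP BY _c0) t
""").show()
```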

Part 1

  1. We upload the 3 CSV files to the Hadoop Distributed File System (HDFS).
  2. We convert the 3 CSV files to Parquet format and save them back to HDFS (see the first sketch after this list).
  3. We implement the 5 queries and measure the execution time for the following 3 cases (a timing sketch also follows the list):
    • MapReduce queries – RDD API
    • Spark SQL with CSV input
    • Spark SQL with Parquet input
  4. We present the execution times in a bar graph, grouped by query.
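
Step 2 (CSV to Parquet) is essentially a one-liner per file in the DataFrame API. A sketch, continuing from the loading code above (the files are uploaded beforehand with e.g. hdfs dfs -put; the paths remain hypothetical):

```python
# Read each CSV from HDFS and write it back in Parquet format
for name in ["movies", "movie_genres", "ratings"]:
    df = spark.read.csv(f"{base}/{name}.csv", inferSchema=True)
    df.write.mode("overwrite").parquet(f"{base}/{name}.parquet")
```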
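For step 3, a simple wall-clock harness is enough to compare the three execution modes. This is a generic sketch, not the repository's measurement code; the CSV-vs-Parquet count below just illustrates the pattern:

```python
import time

def timed(label, action):
    """Run a zero-argument callable and print its wall-clock duration."""
    start = time.time()
    action()  # must trigger an action (count/collect/show) so Spark actually executes
    print(f"{label}: {time.time() - start:.2f} s")

timed("ratings count / CSV",
      lambda: spark.read.csv(f"{base}/ratings.csv").count())
timed("ratings count / Parquet",
      lambda: spark.read.parquet(f"{base}/ratings.parquet").count())
```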

Part 2

  1. We implement the broadcast join in the RDD API (MapReduce), as sketched below.
  2. We implement the repartition join in the RDD API (MapReduce), also sketched below.
  3. We compare the execution times of the two implementations.
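
A minimal sketch of a broadcast (map-side) join in the RDD API, joining ratings with the small movie_genres table on movie ID. This illustrates the general technique under the column-layout assumptions above, not the repository's exact code:

```python
# Collect the small relation to the driver as {movie_id: [genres]}
# and ship a read-only copy to every executor.
small = (genres.rdd
         .map(lambda r: (r[0], r[1]))
         .groupByKey()
         .mapValues(list)
         .collectAsMap())
small_bc = spark.sparkContext.broadcast(small)

# Map-side join: the large ratings RDD is never shuffled.
joined = (ratings.rdd
          .map(lambda r: (r[1], float(r[2])))              # (movie_id, rating)
          .flatMap(lambda kv: [(kv[0], (g, kv[1]))
                               for g in small_bc.value.get(kv[0], [])]))
```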
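And the classic repartition (reduce-side) join over the same inputs: both RDDs are tagged with their relation of origin, shuffled together by key, and matched within each group. Again a sketch of the standard pattern rather than the repository's code:

```python
# Tag each record with the relation it came from
tagged_ratings = ratings.rdd.map(lambda r: (r[1], ("R", float(r[2]))))  # key: movie_id
tagged_genres  = genres.rdd.map(lambda r: (r[0], ("G", r[1])))          # key: movie_id

def join_group(values):
    """Pair every genre record with every rating record sharing the key."""
    r_side = [v for tag, v in values if tag == "R"]
    g_side = [v for tag, v in values if tag == "G"]
    return [(g, r) for g in g_side for r in r_side]

joined = (tagged_ratings.union(tagged_genres)
          .groupByKey()                # one shuffle partitions both relations by key
          .flatMapValues(join_group))
```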

Additionally, we execute a query with and without the Spark SQL query optimizer and present the results in a bar graph; one way such a comparison could be staged is sketched below.
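
Spark SQL has no single "optimizer off" switch, so the sketch below stages the comparison by disabling one concrete optimization, automatic broadcast joins, and comparing the resulting physical plans and timings. This is an assumption about the method, not necessarily what the repository does; the column names come from the hypothetical layout used in the earlier sketches:

```python
r = spark.read.parquet(f"{base}/ratings.parquet").toDF("user_id", "movie_id", "rating", "ts")
g = spark.read.parquet(f"{base}/movie_genres.parquet").toDF("movie_id", "genre")

# Default: the optimizer is free to broadcast the small genres table
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
r.join(g, "movie_id").explain()   # typically a BroadcastHashJoin

# Disable automatic broadcasting: the plan falls back to a shuffle-based join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
r.join(g, "movie_id").explain()   # typically a SortMergeJoin
```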

Results

For the output results and the bar graphs, see the report.

Collaborator: Manos (Emmanouil) Vekrakis
