In this work, we use Apache Spark to run queries over files that describe a movie dataset. Apache Spark offers two APIs for implementing queries: the low-level RDD API (MapReduce-style transformations) and Spark SQL (DataFrames).
We use a movie dataset that is a subset of the Full MovieLens Dataset. The subset consists of three text files in CSV format.
| File | Description |
|---|---|
| movies.csv | Describes the movies in the dataset. |
| movie_genres.csv | Each line holds a (movie ID, movie genre) pair, in that order. |
| ratings.csv | Describes user ratings for movies. |
For more information about the files, see the Kaggle Movie Dataset page.
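As a minimal loading sketch (the HDFS path `hdfs:///movies/` and the header/schema options are assumptions, not part of this README), the three files can be read into Spark like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("movielens-queries").getOrCreate()

# hdfs:///movies is a hypothetical location; point it at wherever
# the three CSV files were uploaded in HDFS
base = "hdfs:///movies"
movies = spark.read.csv(f"{base}/movies.csv", header=True, inferSchema=True)
movie_genres = spark.read.csv(f"{base}/movie_genres.csv", header=True, inferSchema=True)
ratings = spark.read.csv(f"{base}/ratings.csv", header=True, inferSchema=True)
```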
| Query | Description |
|---|---|
| Q1 | From 2000 onwards, find for each year the movie with the highest profit. Ignore entries with a missing release date or a zero value in revenue or budget. |
| Q2 | Find the percentage (%) of users whose average movie rating is greater than 3. |
| Q3 | For each genre, find the genre's average rating and the number of movies that belong to it. If a movie belongs to more than one genre, it is counted in each of them. |
| Q4 | For "Drama" movies, find the average movie-summary length in 5-year windows from 2000 onwards (1st window: 2000-2004, 2nd: 2005-2009, 3rd: 2010-2014, 4th: 2015-2019). |
| Q5 | For each genre, find the user with the most reviews, along with their most and least favorite movie according to their ratings. Sort the results alphabetically by genre and present them in a table with the following columns: genre, user with the most reviews, number of reviews, most favorite movie, rating of most favorite movie, least favorite movie, rating of least favorite movie. |
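To make the two APIs from the introduction concrete, here is a hedged sketch of Q2 in both styles; the column names `userId` and `rating` are assumptions about `ratings.csv`, not confirmed by this README:

```python
# --- Spark SQL version of Q2 (column names are assumed) ---
ratings.createOrReplaceTempView("ratings")
q2_sql = spark.sql("""
    SELECT 100.0 * SUM(CASE WHEN avg_rating > 3 THEN 1 ELSE 0 END) / COUNT(*) AS pct
    FROM (SELECT userId, AVG(rating) AS avg_rating
          FROM ratings GROUP BY userId) per_user
""")
q2_sql.show()

# --- RDD (MapReduce-style) version of the same query ---
per_user = (ratings.rdd
            .map(lambda row: (row["userId"], (row["rating"], 1)))  # (user, (rating, 1))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum ratings and counts
            .mapValues(lambda s: s[0] / s[1]))                     # each user's average
pct = 100.0 * per_user.filter(lambda kv: kv[1] > 3).count() / per_user.count()
print(pct)
```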
- We upload the 3 CSV files to the Hadoop Distributed File System (HDFS).
- We convert the 3 CSV files to Parquet format and save them back to HDFS (see the conversion sketch after this list).
- We implement the 5 queries and measure the execution time for the following 3 cases (a timing sketch follows the list):
  - MapReduce queries – RDD API
  - Spark SQL with CSV input
  - Spark SQL with Parquet input
- We present the execution times in a bar graph, grouped by query.
- We implement the broadcast join with the RDD API (MapReduce).
- We implement the repartition join with the RDD API (MapReduce).
- We compare the execution times of the two join implementations (both are sketched after this list).
- We execute a query with and without the Spark SQL query optimizer and present the results in a bar graph (see the last sketch below).
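A hedged sketch of the CSV-to-Parquet conversion step; the output paths and file names are assumptions:

```python
# Write each DataFrame back to HDFS as Parquet; names/paths are assumed
for name, df in [("movies", movies),
                 ("movie_genres", movie_genres),
                 ("ratings", ratings)]:
    df.write.mode("overwrite").parquet(f"{base}/{name}.parquet")
```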
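One way the per-case timings could be taken (a sketch under assumptions, not the project's actual harness; the aggregation here is a placeholder query):

```python
import time

def timed(build_df):
    """Time a query end to end; collect() forces full evaluation."""
    start = time.time()
    build_df().collect()
    return time.time() - start

# Same placeholder aggregation over the two input formats
t_csv = timed(lambda: spark.read.csv(f"{base}/ratings.csv", header=True, inferSchema=True)
                           .groupBy("userId").count())
t_parquet = timed(lambda: spark.read.parquet(f"{base}/ratings.parquet")
                               .groupBy("userId").count())
print(f"CSV: {t_csv:.2f}s, Parquet: {t_parquet:.2f}s")
```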
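Hedged sketches of the two RDD join strategies; the CSV field positions are assumptions, and header/quoting handling is omitted for brevity:

```python
sc = spark.sparkContext

# Assumed layouts: movie_genres.csv = movieId,genre
#                  ratings.csv      = userId,movieId,rating,timestamp
genres_rdd = (sc.textFile(f"{base}/movie_genres.csv")
                .map(lambda line: line.split(","))
                .map(lambda f: (f[0], f[1])))              # (movieId, genre)
ratings_rdd = (sc.textFile(f"{base}/ratings.csv")
                 .map(lambda line: line.split(","))
                 .map(lambda f: (f[1], float(f[2]))))      # (movieId, rating)

# Broadcast join: ship the small relation to every executor as a dict,
# so the large relation is joined via local lookup with no shuffle
genres_map = sc.broadcast(dict(genres_rdd.collect()))
broadcast_joined = ratings_rdd.map(
    lambda kv: (kv[0], (genres_map.value.get(kv[0]), kv[1])))

# Repartition join: tag each record with its source relation, shuffle
# by key, then pair every genre with every rating inside each group
tagged = (genres_rdd.map(lambda kv: (kv[0], ("G", kv[1])))
          .union(ratings_rdd.map(lambda kv: (kv[0], ("R", kv[1])))))

def cross_join(records):
    genres = [v for tag, v in records if tag == "G"]
    rats = [v for tag, v in records if tag == "R"]
    return [(g, r) for g in genres for r in rats]

repartition_joined = tagged.groupByKey().flatMapValues(cross_join)
```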
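For the optimizer comparison, one common approach (an assumption about how it was done here, not confirmed by this README) is to disable the optimizer's broadcast-join planning and time the same join twice:

```python
# With broadcast planning disabled, joins fall back to sort-merge;
# -1 turns the size threshold off entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# ... time the Spark SQL join query here ...

# Restore the default (~10 MB) so small tables are broadcast again
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
# ... time the same query again and compare ...
```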
For more information about the output results and the bar graphs, see the report.
Collaborator: Manos (Emmanouil) Vekrakis