Du kannst nicht mehr als 25 Themen auswählen
Themen müssen entweder mit einem Buchstaben oder einer Ziffer beginnen. Sie können Bindestriche („-“) enthalten und bis zu 35 Zeichen lang sein.
153 Zeilen
6.7 KiB
TeX
153 Zeilen
6.7 KiB
TeX
vor 2 Jahren
|
%! Author = mrmcx
|
||
|
%! Date = 14.07.2022
|
||
|
%! Template = Fabian Wenzelmann, 2016--2019
|
||
|
|
||
|
\documentclass[a4paper,
|
||
|
twoside, % to have to sided mode
|
||
|
headlines=2.1 % number of lines in the heading, increase if you want more
|
||
|
]{scrartcl}
|
||
|
\usepackage[
|
||
|
margin=2cm,
|
||
|
includefoot,
|
||
|
footskip=35pt,
|
||
|
includeheadfoot,
|
||
|
headsep=0.5cm,
|
||
|
]{geometry}
|
||
|
\usepackage[utf8]{inputenc}
|
||
|
\usepackage[english]{babel}
|
||
|
\usepackage[T1]{fontenc}
|
||
|
\usepackage{mathtools}
|
||
|
\usepackage{amssymb}
|
||
|
\usepackage{lmodern}
|
||
|
\usepackage[automark,headsepline]{scrlayer-scrpage}
|
||
|
\usepackage{enumerate}
|
||
|
\usepackage[protrusion=true,expansion=true,kerning]{microtype}
|
||
|
\usepackage{hyperref}
|
||
|
|
||
|
\newcommand{\yourname}{Simon Moser}
|
||
|
\newcommand{\lecture}{Advanced Database and Information Systems}
|
||
|
\newcommand{\project}{Project 2}
|
||
|
\author{\yourname}
|
||
|
\title{\lecture}
|
||
|
\subtitle{\project}
|
||
|
|
||
|
\pagestyle{scrheadings}
|
||
|
\setkomafont{pagehead}{\normalfont}
|
||
|
\lohead{\lecture\\\yourname}
|
||
|
\lehead{\lecture\\\yourname}
|
||
|
\rohead{\project}
|
||
|
\rehead{\project}
|
||
|
|
||
|
\begin{document}
|
||
|
\maketitle
|
||
|
\section{Problem Statement}
|
||
|
\label{sec:problem-statement}
|
||
|
This report presents different approaches on Join\cite{wikijoin} algorithms on tables.
|
||
|
When two tables are joined, all columns are combined based on one related column in each table.
|
||
|
Every possible combination of rows where those columns are equal is returned.
|
||
|
A very simple approach would be to loop through the first table and for each row loop through the second table and check for equality of the related column.
|
||
|
Since the complexity of this approach is $O(M*N)$ for table sizes $M$ and $N$ and therefore very high, the following sections explain different approaches to this problem.
|
||
|
|
||
|
\section{Algorithm Description}
|
||
|
\label{sec:algorithm-description}
|
||
|
\subsection{Hash Join}
|
||
|
\label{subsec:hash-join}
|
||
|
The Hash Join\cite{wikihash} algorithm tries to offer a faster solution by using hash tables.
|
||
|
The rough description of the algorithm is:
|
||
|
\begin{enumerate}
|
||
|
\item Add each row of the smaller table to the hash table, where compared column is used to create the hash.
|
||
|
\item For each row in the other table, lookup the hash of it's compared column
|
||
|
\item If the hash exists in the hash table, compare the two rows whether not only the hash is equal
|
||
|
\end{enumerate}
|
||
|
|
||
|
One caveat of this approach is that the speed relies on the hash table being in-memory.
|
||
|
If it exceeds the size of the memory, it will be slowed down.
|
||
|
|
||
|
This algorithm is expected to have a complexity of $O(M+N)$ because the first table has to be looped to build the hash table and the second table has to be looped to check the hash table.
|
||
|
Here, a hash table access is expected to have a complexity of $O(1)$.
|
||
|
|
||
|
\subsection{Sort Merge Join}
|
||
|
\label{subsec:sort-merge-join}
|
||
|
A Sort Merge Join\cite{wikimergesort} tries to restrict the complexity by using good sort algorithms.
|
||
|
For this approach, both tables are sorted by the compared column first.
|
||
|
Then a pointer is set on the start of each column.
|
||
|
Now for each step the related columns for each pointer are compared.
|
||
|
If they are equal, the rows are added to the result.
|
||
|
If not, the pointer in the table with the smaller value is advanced by one row.
|
||
|
|
||
|
This approach relies on two things: first, the content of the compared column has to be sortable in any way.
|
||
|
Second, since both tables need to be sorted first, an effective sort algorithm has to be used.
|
||
|
|
||
|
Merge sort of the two tables has a complexity of $O(M \log_2 M + N \log N)$.
|
||
|
This exceeds the complexity of the approach in subsection\ \ref{subsec:hash-join}.
|
||
|
Still, this algorithm can be more effective when the tables are already sorted.
|
||
|
In this case, the complexity drops to $O(M+N)$ that is required for walking through the tables.
|
||
|
|
||
|
\section{Dataset Description}
|
||
|
\label{sec:dataset-description}
|
||
|
For the analysis, the WatDiv\cite{watdiv} dataset is used.
|
||
|
It consists out of the following entity types:
|
||
|
\begin{itemize}
|
||
|
\item wsdbm:User
|
||
|
\item wsdbm:Product
|
||
|
\item wsdbm:Review
|
||
|
\end{itemize}
|
||
|
|
||
|
They have the following relations:
|
||
|
\begin{itemize}
|
||
|
\item wsdbm:User wsdbm:follows wsdbm:User
|
||
|
\item wsdbm:User wsdbm:friendOf wsdbm:User
|
||
|
\item wsdbm:User wsdbm:likes wsdbm:Product
|
||
|
\item wsdbm:Product rev:hasReview wsdbm:Review
|
||
|
\end{itemize}
|
||
|
|
||
|
\section{Experiment and Analysis}
|
||
|
\label{sec:experiment-and-analysis}
|
||
|
\subsection{Preparations}
|
||
|
\label{subsec:preparations}
|
||
|
To be able to import the dataset using the Python library rdflib\cite{rdflib}, the file contents were brought to an url format using the following bash command:
|
||
|
|
||
|
\verb$ sed -E 's/([a-zA-Z0-9]+:[^ \t]+)/<\1>/g'$
|
||
|
|
||
|
\subsection{Implementation}
|
||
|
\label{subsec:implementation}
|
||
|
The algorithms were implemented using Python.
|
||
|
It might not offer the fastest way, but as an interpreted language Python is very suitable for fast-paced development.
|
||
|
The code (also of this report) is published at: \url{https://naclador.de/mosers/ADBIS-Projekt2}
|
||
|
|
||
|
\subsection{Analysis}
|
||
|
\label{subsec:analysis}
|
||
|
Due to a lack of time, an existing bug in the sort merge join could not be found and fixed.
|
||
|
Therefore, the following results for the small dataset are not representative:
|
||
|
|
||
|
\begin{verbatim}
|
||
|
5.67576265335083s for Hash Join (11415461 items)
|
||
|
0.15003275871276855s for Sort Merge Join (1475 items)
|
||
|
6.041101694107056s for Nested Loop Join (11415461 items)
|
||
|
\end{verbatim}
|
||
|
|
||
|
As it can be seen, the number of result items for the sort merge join differs greatly.
|
||
|
|
||
|
For smaller tables, Hash Join is expected to be the fasted algorithm since the table can be kept in memory completely.
|
||
|
For the bigger table, Sort Merge Join will be faster as it doesn't have a memory limit.
|
||
|
|
||
|
\section{Conclusion}
|
||
|
\label{sec:conclusion}
|
||
|
Join Algorithms should be used by use case.
|
||
|
There is nothing like the fastest algorithm, for small tables Hash Join is very effective while Sort Merge Join is perfect for pre-sorted tables.
|
||
|
The comparison to Nested Loop Join shows that a sophisticated algorithm is usually better than the easiest solution.
|
||
|
|
||
|
\begin{thebibliography}{watdiv}
|
||
|
\bibitem{wikijoin}
|
||
|
Join (SQL), Wikipedia, \url{https://en.wikipedia.org/wiki/Join_(SQL)}
|
||
|
\bibitem{wikihash}
|
||
|
Hash join, Wikipedia, \url{https://en.wikipedia.org/wiki/Hash_join}
|
||
|
\bibitem{wikimergesort}
|
||
|
Sort-Merge-Join, Wikipedia, \url{https://en.wikipedia.org/wiki/Sort-merge_join}
|
||
|
\bibitem{watdiv}
|
||
|
Waterloo SPARQL Diversity Test Suite, University of Waterloo, \url{https://dsg.uwaterloo.ca/watdiv/}
|
||
|
\bibitem{rdflib}
|
||
|
RDFLib, \url{https://rdflib.readthedocs.io}
|
||
|
\end{thebibliography}
|
||
|
\end{document}
|