
Big Data Algorithms

The lecture is intended for students of the "Big Data Analytics" programme at the Faculty of Fundamental Problems of Technology. It takes place on Fridays from – on the Microsoft Teams platform. On this page you will find information about the grading rules, the material covered, literature, and a list of tasks.

Rules for passing the course

The course grade will be based on the credits for the exercises and the laboratories. The final grade will be determined by the formula $$ \text{final grade} = \begin{cases} \frac{E+L}{2} &: E \geq 3\\ 2.0 &: E=2.0\end{cases} ~, $$ where $E$ is the exercises grade and $L$ is the laboratory grade. Anyone who receives a 5.0 and would like to get a 5.5 will have to meet with me on Microsoft Teams.

Learning materials

Some additional sources on PageRank

Mandatory literature (you must read this)

  1. Deeper Inside PageRank, Amy N. Langville and Carl D. Meyer
  2. The Second Eigenvalue of the Google Matrix, Taher H. Haveliwala and Sepandar D. Kamvar

Mathematica code for computing and displaying PageRank:

HLG[G_, wsp_:1.0] := Block[{rG},
  (* PageRank vector with damping factor 0.85 *)
  rG = PageRankCentrality[G, 0.85];
  HighlightGraph[G,
    VertexList[G],
    (* vertex sizes proportional to PageRank, scaled by the coefficient wsp *)
    VertexSize -> Thread[VertexList[G] -> wsp*rG],
    VertexLabels -> Placed["Name", Tooltip]
  ]
]
An example of constructing a graph:
STAR = Join[
   Table[ToString[i] -> "a", {i, 0, 9}], 
   Table[ToString[i] -> ToString[Mod[i + 1, 10]], {i, 0, 9}]
   ]; 
A very simple spider (for testing only):
Spider[url_, dom_, k_] := Block[{G, IsCorrect},
  (* accept only links that stay within the domain dom *)
  IsCorrect[s_] := StringQ[s] && StringStartsQ[s, dom];
  (* follow hyperlinks starting from url, up to depth k *)
  G = NestGraph[
    Select[Import[#, "Hyperlinks"], IsCorrect[#] &] &, url, k
  ];
  Subgraph[G, Select[VertexList[G], StringQ[#] &],
   VertexLabels -> Placed["Name", Tooltip], DirectedEdges -> True]
  ]
and an example of its usage:
MGB = Spider["http://cs.pwr.edu.pl/gebala/", "http://cs.pwr.edu.pl/gebala/", 3]
$ \def\RR{\mathbb{R}} \def\QQ{\mathbb{Q}} \def\ZZ{\mathbb{Z}} \def\NN{\mathbb{N}} $

Topics discussed in the lectures

09.10.2020: Introduction

  1. The three main types of data we will be dealing with:
    • a narrow array $T[n, m]$ such that $n\cdot m$ does not fit in the computer's memory but the number of columns $m$ is small
    • a wide array $T[n, m]$ such that $n\cdot m$ does not fit in the computer's memory but the number of rows $n$ is small
    • an array $T[n, m]$ which fits in the computer's memory, but we are interested in (for example) all subsets of the set of columns $\{1,\ldots, m\}$ (note that this family of subsets can be incredibly large)
  2. Random coincidences in large data sets
    • If 10,000 people each flip a coin 10 times, then for any fixed sequence $(i_1, \ldots, i_{10}) \in \{0,1\}^{10}$ about 10 of them will obtain exactly that sequence
    • EXERCISE: Estimate the number of people in Poland who, on any fixed day, see a black cat in the morning and are fired from work later that day
  3. HELLO WORLD program for Big Data: extracting the most frequent words from a selected book
  4. Hash functions
    • Converting strings to natural numbers: $$h((a_0, \ldots, a_n)) = \sum_{i = 0}^{n} \mathrm{ord}(a_i) \cdot 256^i$$ (a minimal implementation is sketched after this list)
    • The concept of a hash function: $h: \Omega \to \{0,\ldots, n-1\}$
    • Collision: a pair $x, y \in \Omega$ such that $x \neq y$ and $h(x) = h(y)$
    • Observation: if $h: \Omega \to \{0,\ldots, n-1\}$, then there exists $a \in \{0,\ldots, n-1\}$ such that $$|h^{-1}(\{a\})| \geq \frac{|\Omega|}{n}~.$$ So: if $|\Omega| \gt n$, then collisions are inevitable.
    • Orthogonal projections of finite subsets $P$ of the space $\RR^2$ onto one-dimensional subspaces: $$\pi_\alpha(x, y) = x\cos\alpha + y \sin\alpha, \quad \alpha \in [0, \pi)~.$$ We have shown that if $P$ is a fixed finite subset of $\RR^2$ and $\alpha$ is chosen uniformly at random, then $$\Pr[\pi_\alpha \text{ is one-to-one on } P] = 1~.$$ So: with probability 1, a random projection has no collision on $P$.
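The string-hashing formula above can be transcribed directly into Python (a minimal sketch for experimenting; the function names are mine):

def h_string(s):
    # h((a_0,...,a_n)) = sum of ord(a_i) * 256^i
    return sum(ord(a) * 256**i for i, a in enumerate(s))

def h_bucket(s, n):
    # reduce to a bucket in {0,...,n-1}; once the number of distinct
    # strings exceeds n, collisions are inevitable (see the observation above)
    return h_string(s) % n

print(h_bucket("big data", 1000))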
My quite inept lecture notes: BD_BDA_01.pdf. Next time I'll try to write more clearly.

16.10.2020: Hash functions and Bloom filters

  1. Def. A family $\mathcal{H}$ of hash functions from $\Omega$ to $[n] = \{0,\ldots,n-1\}$ is a universal family of hash functions if for all $x,y\in\Omega$ such that $x \neq y$ we have $$ \Pr_{h\in\mathcal{H}}[h(x)=h(y)] \leq \frac{1}{n}~. $$ If in the above inequality $\frac1n$ is replaced by $\frac{C}{n}$, where $C$ is a small constant, then $\mathcal{H}$ is called an almost universal family of hash functions.
  2. Example: the family $\mathcal{H} = [n]^{\Omega}$ is universal
  3. Example: the family $f_a(x) = (a x \mod p) \mod n$, where $p$ is prime and $a\in\{1,\ldots,p-1\}$, is an almost universal family of hash functions with constant $C=2$ (used in the Bloom filter sketch below)
  4. Def. A family $\mathcal{H}$ of hash functions is a $k$-independent family of hash functions if for each sequence $x_1,\ldots, x_k$ of pairwise different elements of $\Omega$ and arbitrary $y_1,\ldots,y_k$ from $[n]$ we have $$ \Pr_{h\in\mathcal{H}}[h(x_1)=y_1 \land \ldots \land h(x_k) = y_k] = \frac{1}{n^k} $$
  5. Construction: for $a = (a_0,\ldots,a_{k-1}) \in \ZZ_p^k$ we put $$ h_a(x) = \sum_{i=0}^{k-1} a_i x^i \mod p~. $$ Then $\{h_a : a\in\ZZ_p^k\}$ is a $k$-independent family of hash functions from $\ZZ_p$ to $\ZZ_p$. If we compose this family with the function $\phi(x) = x \mod n$, then we obtain an "almost" $k$-independent family from $\ZZ_p$ to $[n]$.
    We can replace "almost" by "precisely" if we replace $\ZZ_p$ with a field of size $2^m$ and take $n=2^a$ for $a\lt m$!
  6. Bloom filters: read the paper A. Broder and M. Mitzenmacher, Network Applications of Bloom Filters: A Survey (a toy implementation is sketched after this list)
  7. Plot of the function $f(p) = - \ln(p)\ln(1-p)$, which was used in the analysis of Bloom filters.
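A toy Bloom filter in Python (a minimal sketch, not the construction from the survey; it draws its $k$ hash functions from the almost universal family $(ax \mod p) \mod n$ from point 3, keys are assumed to be integers smaller than the prime $P$, and all names and parameters are mine):

import random

P = 2**61 - 1  # a fixed prime, assumed larger than every key

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.bits = [False] * n_bits
        # k functions h_a(x) = ((a*x) mod P) mod n_bits
        self.coeffs = [random.randrange(1, P) for _ in range(k)]

    def _positions(self, x):
        return [(a * x % P) % self.n for a in self.coeffs]

    def add(self, x):
        for i in self._positions(x):
            self.bits[i] = True

    def query(self, x):
        # never a false negative; false positives occur with small probability
        return all(self.bits[i] for i in self._positions(x))

bf = BloomFilter(n_bits=1000, k=7)
bf.add(12345)
print(bf.query(12345), bf.query(54321))

For a filter of $m$ bits storing $s$ elements, the false positive rate is approximately minimized at $k \approx \frac{m}{s}\ln 2$, which is where the number $2^{-\ln 2}$ from the questions below comes from.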
Additional reading:
  1. Slides from Purdue University on universal families of hash functions
  2. Lecture from Washington University on independent families.
Questions
  1. Why does the concept of a (single) universal hash function make no sense?
  2. What is the meaning of $\Pr_{h\in\mathcal{H}}[\ldots]$ in formulas from this lecture?
  3. Why is the family of all functions from $\Omega$ to $[n]$ a bad choice for a universal family of hash functions?
  4. Do you know an algebraic field with $4$ elements?
  5. Show that $2$-independence implies universality.
  6. Show that $(k+1)$-independence implies $k$-independence.
  7. What is the value of $\frac{1}{2^{\ln (2)}}$ (give an approximation) and why is this number important for Bloom filters?

23.10.2020: "Map-Reduce" model

  1. Birthday Paradox: $E[B_n] = \sqrt{\frac{\pi n}{2}} + \frac23 + O(n^{-1/2}) \approx \sqrt{n}$, where $B_n$ is the number of balls thrown into $n$ urns until the first collision
  2. Coupon Collector Problem: $E[C_n] = n H_n \approx n \ln(n)$
  3. Fact: if $m \gt n (\ln(n))^2$ then the distribution of $m$ balls into $n$ urns is close to uniform, i.e., each urn contains $\approx \frac{m}{n}$ balls.
  4. Basic classes of hash functions:
    1. cryptographic: MD5 (no longer recommended), SHA-XXX
    2. MurmurHash - a very popular function for Big Data
  5. BASIC MODEL: processes of type MAPPER - hash function - processes of type REDUCER.
    Must read: MapReduce: Simplified Data Processing on Large Clusters
  6. Word-count problem in Map-Reduce (see the Python simulation after this list):
    map(L){
      for all words w from line L do
        emit((w,1))
    };
    reduce(w,L) {
      emit((w,length(L)))
    };
    
  7. Multiplication of a matrix by a vector $x$ stored in memory:
    map(i,j,a){
      emit((i,a*x[j]))
    };
    reduce(i,L){
      emit((i, sum(L)))
    }
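A minimal Python simulation of both map-reduce examples above (an illustration only: a real system distributes mappers and reducers over many machines; the helper names are mine):

from collections import defaultdict

def run_map_reduce(mapper, reducer, inputs):
    # "shuffle" phase: group all emitted (key, value) pairs by key
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return [out for key, L in groups.items() for out in reducer(key, L)]

# word count: emit (w, 1) per word; the reducer counts the list length
def wc_map(line):
    return [(w, 1) for w in line.split()]

def wc_reduce(w, L):
    return [(w, len(L))]

print(run_map_reduce(wc_map, wc_reduce, ["big data big", "data data"]))

# matrix times a vector x kept in memory; entries arrive as triples (i, j, a)
x = [1.0, 2.0, 3.0]

def mv_map(entry):
    i, j, a = entry
    return [(i, a * x[j])]

def mv_reduce(i, L):
    return [(i, sum(L))]

print(run_map_reduce(mv_map, mv_reduce, [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 4.0)]))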
    
Questions
  1. What is the expected number of empty urns after throwing $n$ balls into $n$ urns?
  2. What is the expected number of empty urns after throwing $n\ln(n)$ balls into $n$ urns?
My lecture notes: BIGAlg03.pdf.

30.10.2020: "Map-Reduce" - II

Exercises
  1. How can one calculate the third central moment in the map-reduce model?
  2. A graph $(V,E)$ is given to the mappers as a series of edges $(x,y)$. Compute in the map-reduce model the family $\{(x,z):(\exists y)((x,y)\in E \land (y,z)\in E)\}$
My lecture notes: BIGAlg04.pdf.


13.11.2020: "Map-Reduce" - IV

My lecture notes: BIGAlg05.pdf.

20.11.2020: Streaming - I

My lecture notes: BIGAlg06.pdf.

21.11.2020: Streaming - II

My lecture notes: BIGAlg06.pdf.

27.11.2020: Counting

  1. Morris Probabilistic Counter (see the simulation after this list):
    INIT: C=0;
    procedure Add(){
      // increment C with probability 2^{-C}
      if (rand() < 2^{-C}){
        C++;
      }
    }
    
  2. Thm: $E[2^{C_n}] = n+1$, where $C_n$ is the value of the counter after $n$ calls of Add. (proof: next week)
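A quick Python simulation of the Morris counter, to check the theorem empirically (the averaging over independent runs is my addition):

import random

def morris_count(n):
    # value of the counter C after n calls of Add()
    C = 0
    for _ in range(n):
        if random.random() < 2.0**(-C):
            C += 1
    return C

# empirical check of E[2^{C_n}] = n + 1
n, runs = 1000, 2000
est = sum(2.0**morris_count(n) for _ in range(runs)) / runs
print(est)  # should be close to n + 1 = 1001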
My lecture notes: BIGAlg07.pdf.

04.12.2020: HyperLogLog Algorithm

  1. Proof of the last theorem from the previous lecture (see also: Morris Probabilistic Counter).
  2. HyperLogLog: Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier, HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, 2007 Conference on Analysis of Algorithms (AofA 07), Juan-les-Pins, France, June 17-22, 2007
My lecture notes: BIGAlg08.pdf.

11.12.2020: Locality Sensitive Hash Function Family

  1. Notion of metric space.
  2. Jaccard distance: $d_J(A,B) = \frac{|A\triangle B|}{|A \cup B|}$
  3. Theorem: the Jaccard distance is a metric.
  4. Jaccard similarity:
    $$ S(A,B) = \frac{|A\cap B|}{|A \cup B|} $$
    Note that $d_J(A,B) = 1 - S(A,B)$; both definitions are illustrated below.
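Both definitions transcribe directly into Python for finite sets (just an illustration):

def jaccard_similarity(A, B):
    # S(A,B) = |intersection| / |union|
    return len(A & B) / len(A | B)

def jaccard_distance(A, B):
    # d_J(A,B) = |symmetric difference| / |union| = 1 - S(A,B)
    return len(A ^ B) / len(A | B)

A, B = {1, 2, 3}, {2, 3, 4}
print(jaccard_similarity(A, B), jaccard_distance(A, B))  # 0.5 0.5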
My lecture notes: BIGAlg09.pdf.

18.12.2020

My lecture notes: BIGAlg10.pdf.

8.01.2021: Min hash sketches and similarity between documents

My lecture notes: BIGAlg11.pdf.

15.01.2021: Frequent items in data streams

  1. Misra-Gries summaries (a Python sketch follows this list)
  2. Majority count
  3. Count-Min sketch
  4. "COVID tracing" applications
Exercises
  1. Recall the proof of Markov's inequality
  2. Derive Chebyshev's inequality from Markov's inequality.
My lecture notes: BIGAlg12.pdf.

22.01.2021:

My lecture notes: BIGAlg13.pdf.

29.01.2021: Dimension reduction

  1. Idea of dimension reduction
  2. Formulation of the Johnson-Lindenstrauss theorem
  3. Basic steps of the proof of the above theorem (a random-projection illustration follows this list)
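A minimal Python illustration of the random-projection idea behind the Johnson-Lindenstrauss theorem (a sketch with toy dimensions of my own choosing):

import random, math

def random_projection(points, k):
    # project d-dimensional points to k dimensions with a Gaussian matrix
    # scaled by 1/sqrt(k), which preserves squared lengths in expectation
    d = len(points[0])
    R = [[random.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in R] for p in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# two random points in dimension 1000, projected down to dimension 100
pts = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(2)]
proj = random_projection(pts, 100)
print(dist(pts[0], pts[1]), dist(proj[0], proj[1]))  # should be comparable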
Exercises
  1. What is $\int_{\RR} e^{-x^2}\, dx$?
  2. You should know the proof of the above result.
My lecture notes: BIGAlg14.pdf.

5.02.2021: Additional lecture I

9.02.2021: Additional lecture II