
Big Data Algorithms

The lecture is intended for students of the "Big Data Analytics" programme at the Faculty of Fundamental Problems of Technology. It takes place on Fridays from – on the Microsoft Teams platform. On this page you will find information about the grading rules, the material covered, literature, and a list of tasks.

Rules for passing the course

The course grade will be based on the credits for the exercises and the laboratories. The final grade will be determined by the formula $$ \text{final grade} = \begin{cases} \frac{E+L}{2} &: E \geq 3\\ 2.0 &: E=2.0\end{cases} ~, $$ where $E$ is the exercises grade and $L$ is the laboratory grade. Anyone who receives a 5.0 and would like to get a 5.5 will have to meet with me on Microsoft Teams.

Learning materials

Some additional sources on PageRank

Mandatory literature (you must read this)

  1. Deeper Inside PageRank, Amy N. Langville and Carl D. Meyer
  2. The Second Eigenvalue of the Google Matrix, Taher H. Haveliwala and Sepandar D. Kamvar

Mathematica code for computing and displaying PageRank:

HLG[G_, wsp_:1.0] := Block[{rG},
  (* PageRank vector with damping factor 0.85 *)
  rG = PageRankCentrality[G, 0.85];
  HighlightGraph[G,
    VertexList[G],
    (* vertex sizes proportional to PageRank, scaled by the coefficient wsp *)
    VertexSize -> Thread[VertexList[G] -> wsp*rG],
    VertexLabels -> Placed["Name", Tooltip]
  ]
]
An example of constructing a graph:
STAR = Join[
   Table[ToString[i] -> "a", {i, 0, 9}], 
   Table[ToString[i] -> ToString[Mod[i + 1, 10]], {i, 0, 9}]
   ]; 
A very simple spider (for testing only):
Spider[url_, dom_, k_] := Block[{G, IsCorrect},
  (* accept only links that stay within the domain dom *)
  IsCorrect[s_] := StringQ[s] && StringStartsQ[s, dom];
  (* follow hyperlinks starting from url, up to depth k *)
  G = NestGraph[
    Select[Import[#, "Hyperlinks"], IsCorrect[#] &] &, url, k
  ];
  Subgraph[G, Select[VertexList[G], StringQ[#] &],
   VertexLabels -> Placed["Name", Tooltip], DirectedEdges -> True]
  ]
and an example of its usage:
MGB = Spider["http://cs.pwr.edu.pl/gebala/", "http://cs.pwr.edu.pl/gebala/", 3]
$ \def\RR{\mathbb{R}} \def\QQ{\mathbb{Q}} \def\ZZ{\mathbb{Z}} \def\NN{\mathbb{N}} $

Topics discussed in the lectures

09.10.2020: Introduction

  1. The three main types of data we will be dealing with:
    • a narrow array $T[n, m]$ such that $n\cdot m$ does not fit in the computer's memory but the number of columns $m$ is small
    • a wide array $T[n, m]$ such that $n\cdot m$ does not fit in the computer's memory but the number of rows $n$ is small
    • an array $T[n, m]$ which fits in the computer's memory, but we are interested in (for example) all subsets of the set of columns $\{1,\ldots, m\}$ (note that this family of subsets can be incredibly large)
  2. Random coincidences in large data sets
    • If 10,000 people each flip a coin 10 times, then for any fixed sequence $(i_1, \ldots, i_{10}) \in \{0,1\}^{10}$ about 10 of them will obtain exactly that sequence
    • EXERCISE: Estimate the number of people in Poland who, on any fixed day, see a black cat in the morning and are fired from work later that day
  3. HELLO WORLD program for Big Data: extracting the most frequent words from a selected book
  4. Hash functions
    • Converting strings to natural numbers: $$h((a_0, \ldots, a_n)) = \sum_{i = 0}^{n} \mathrm{ord}(a_i) \cdot 256^i$$ (a minimal implementation is sketched after this list)
    • The concept of a hash function: $h: \Omega \to \{0,\ldots, n-1\}$
    • Collision: a pair $x, y \in \Omega$ such that $x \neq y$ and $h(x) = h(y)$
    • Observation: if $h: \Omega \to \{0,\ldots, n-1\}$, then there exists $a \in \{0,\ldots, n-1\}$ such that $$|h^{-1}(\{a\})| \geq \frac{|\Omega|}{n}~.$$ So: if $|\Omega| \gt n$, then collisions are inevitable.
    • Orthogonal projections of finite subsets $P$ of the space $\RR^2$ onto one-dimensional subspaces: $$\pi_\alpha(x, y) = x\cos\alpha + y \sin\alpha, \quad \alpha \in [0, \pi)~.$$ We have shown that if $P$ is a fixed finite subset of $\RR^2$ and $\alpha$ is chosen uniformly at random, then $$\Pr[\pi_\alpha \text{ is one-to-one on } P] = 1~.$$ So: with probability 1, a random projection has no collision on $P$.
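The string-hashing formula above can be transcribed directly into Python (a minimal sketch for experimenting; the function names are mine):

def h_string(s):
    # h((a_0,...,a_n)) = sum of ord(a_i) * 256^i
    return sum(ord(a) * 256**i for i, a in enumerate(s))

def h_bucket(s, n):
    # reduce to a bucket in {0,...,n-1}; once the number of distinct
    # strings exceeds n, collisions are inevitable (see the observation above)
    return h_string(s) % n

print(h_bucket("big data", 1000))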
My quite inept lecture notes: BD_BDA_01.pdf. Next time I'll try to write more clearly.

16.10.2020: Hash functions and Bloom filters

  1. Def. A family $\mathcal{H}$ of hash functions from $\Omega$ to $[n] = \{0,\ldots,n-1\}$ is a universal family of hash functions if for all $x,y\in\Omega$ such that $x \neq y$ we have $$ \Pr_{h\in\mathcal{H}}[h(x)=h(y)] \leq \frac{1}{n}~. $$ If in the above inequality $\frac1n$ is replaced by $\frac{C}{n}$, where $C$ is a small constant, then $\mathcal{H}$ is called an almost universal family of hash functions.
  2. Example: the family $\mathcal{H} = [n]^{\Omega}$ is universal
  3. Example: the family $f_a(x) = (a x \mod p) \mod n$, where $p$ is prime and $a\in\{1,\ldots,p-1\}$, is an almost universal family of hash functions with constant $C=2$ (used in the Bloom filter sketch below)
  4. Def. A family $\mathcal{H}$ of hash functions is a $k$-independent family of hash functions if for each sequence $x_1,\ldots, x_k$ of pairwise different elements of $\Omega$ and arbitrary $y_1,\ldots,y_k$ from $[n]$ we have $$ \Pr_{h\in\mathcal{H}}[h(x_1)=y_1 \land \ldots \land h(x_k) = y_k] = \frac{1}{n^k} $$
  5. Construction: for $a = (a_0,\ldots,a_{k-1}) \in \ZZ_p^k$ we put $$ h_a(x) = \sum_{i=0}^{k-1} a_i x^i \mod p~. $$ Then $\{h_a : a\in\ZZ_p^k\}$ is a $k$-independent family of hash functions from $\ZZ_p$ to $\ZZ_p$. If we compose this family with the function $\phi(x) = x \mod n$, then we obtain an "almost" $k$-independent family from $\ZZ_p$ to $[n]$.
    We can replace "almost" by "precisely" if we replace $\ZZ_p$ with a field of size $2^m$ and take $n=2^a$ for $a\lt m$!
  6. Bloom filters: read the paper A. Broder and M. Mitzenmacher, Network Applications of Bloom Filters: A Survey (a toy implementation is sketched after this list)
  7. Plot of the function $f(p) = - \ln(p)\ln(1-p)$, which was used in the analysis of Bloom filters.
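A toy Bloom filter in Python (a minimal sketch, not the construction from the survey; it draws its $k$ hash functions from the almost universal family $(ax \mod p) \mod n$ from point 3, keys are assumed to be integers smaller than the prime $P$, and all names and parameters are mine):

import random

P = 2**61 - 1  # a fixed prime, assumed larger than every key

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.bits = [False] * n_bits
        # k functions h_a(x) = ((a*x) mod P) mod n_bits
        self.coeffs = [random.randrange(1, P) for _ in range(k)]

    def _positions(self, x):
        return [(a * x % P) % self.n for a in self.coeffs]

    def add(self, x):
        for i in self._positions(x):
            self.bits[i] = True

    def query(self, x):
        # never a false negative; false positives occur with small probability
        return all(self.bits[i] for i in self._positions(x))

bf = BloomFilter(n_bits=1000, k=7)
bf.add(12345)
print(bf.query(12345), bf.query(54321))

For a filter of $m$ bits storing $s$ elements, the false positive rate is approximately minimized at $k \approx \frac{m}{s}\ln 2$, which is where the number $2^{-\ln 2}$ from the questions below comes from.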
Additional reading:
  1. Slides from Purdue University on universal families of hash functions
  2. Lecture from Washington University on independent families.
Questions
  1. Why does the concept of a (single) universal hash function make no sense?
  2. What is the meaning of $\Pr_{h\in\mathcal{H}}[\ldots]$ in formulas from this lecture?
  3. Why is the family of all functions from $\Omega$ to $[n]$ a bad choice for a universal family of hash functions?
  4. Do you know an algebraic field with $4$ elements?
  5. Show that $2$-independence implies universality.
  6. Show that $(k+1)$-independence implies $k$-independence.
  7. What is the value of $\frac{1}{2^{\ln (2)}}$ (give an approximation) and why is this number important for Bloom filters?

23.10.2020: "Map-Reduce" model

  1. Birthday Paradox: $E[B_n] = \sqrt{\frac{\pi n}{2}} + \frac23 + O(n^{-1/2}) \approx \sqrt{n}$, where $B_n$ is the number of balls thrown into $n$ urns until the first collision
  2. Coupon Collector Problem: $E[C_n] = n H_n \approx n \ln(n)$
  3. Fact: if $m \gt n (\ln(n))^2$ then the distribution of $m$ balls into $n$ urns is close to uniform, i.e., each urn contains $\approx \frac{m}{n}$ balls.
  4. Basic classes of hash functions:
    1. cryptographic: MD5 (no longer recommended), SHA-XXX
    2. MurmurHash - a very popular function for Big Data
  5. BASIC MODEL: processes of type MAPPER - hash function - processes of type REDUCER.
    Must read: MapReduce: Simplified Data Processing on Large Clusters
  6. Word-count problem in Map-Reduce (see the Python simulation after this list):
    map(L){
      for all words w from line L do
        emit((w,1))
    };
    reduce(w,L) {
      emit((w,length(L)))
    };
    
  7. Multiplication of a matrix by a vector $x$ stored in memory:
    map(i,j,a){
      emit((i,a*x[j]))
    };
    reduce(i,L){
      emit((i, sum(L)))
    }
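A minimal Python simulation of both map-reduce examples above (an illustration only: a real system distributes mappers and reducers over many machines; the helper names are mine):

from collections import defaultdict

def run_map_reduce(mapper, reducer, inputs):
    # "shuffle" phase: group all emitted (key, value) pairs by key
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return [out for key, L in groups.items() for out in reducer(key, L)]

# word count: emit (w, 1) per word; the reducer counts the list length
def wc_map(line):
    return [(w, 1) for w in line.split()]

def wc_reduce(w, L):
    return [(w, len(L))]

print(run_map_reduce(wc_map, wc_reduce, ["big data big", "data data"]))

# matrix times a vector x kept in memory; entries arrive as triples (i, j, a)
x = [1.0, 2.0, 3.0]

def mv_map(entry):
    i, j, a = entry
    return [(i, a * x[j])]

def mv_reduce(i, L):
    return [(i, sum(L))]

print(run_map_reduce(mv_map, mv_reduce, [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 4.0)]))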
    
Questions
  1. What is the expected number of empty urns after throwing $n$ balls into $n$ urns?
  2. What is the expected number of empty urns after throwing $n\ln(n)$ balls into $n$ urns?
My lecture notes: BIGAlg03.pdf.

30.10.2020: "Map-Reduce" - II

Exercises
  1. How can one calculate the third central moment in the map-reduce model?
  2. A graph $(V,E)$ is given to the mappers as a series of edges $(x,y)$. Compute in the map-reduce model the family $\{(x,z):(\exists y)((x,y)\in E \land (y,z)\in E)\}$
My lecture notes: BIGAlg04.pdf.


13.11.2020: "Map-Reduce" - IV

My lecture notes: BIGAlg05.pdf.

20.11.2020: Streaming - I

My lecture notes: BIGAlg06.pdf.

21.11.2020: Streaming - II

My lecture notes: BIGAlg06.pdf.

27.11.2020: Counting

  1. Morris Probabilistic Counter (see the simulation after this list):
    INIT: C=0;
    procedure Add(){
      // increment C with probability 2^{-C}
      if (rand() < 2^{-C}){
        C++;
      }
    }
    
  2. Thm: $E[2^{C_n}] = n+1$, where $C_n$ is the value of the counter after $n$ calls of Add. (proof: next week)
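A quick Python simulation of the Morris counter, to check the theorem empirically (the averaging over independent runs is my addition):

import random

def morris_count(n):
    # value of the counter C after n calls of Add()
    C = 0
    for _ in range(n):
        if random.random() < 2.0**(-C):
            C += 1
    return C

# empirical check of E[2^{C_n}] = n + 1
n, runs = 1000, 2000
est = sum(2.0**morris_count(n) for _ in range(runs)) / runs
print(est)  # should be close to n + 1 = 1001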
My lecture notes: BIGAlg07.pdf.

04.12.2020: HyperLogLog Algorithm

  1. Proof of the last theorem from the previous lecture (see also: Morris Probabilistic Counter).
  2. HyperLogLog: Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier, HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, 2007 Conference on Analysis of Algorithms (AofA 07), Juan-les-Pins, France, June 17-22, 2007
My lecture notes: BIGAlg08.pdf.

11.12.2020: Locality Sensitive Hash Function Family

  1. Notion of metric space.
  2. Jaccard distance: $d_J(A,B) = \frac{|A\triangle B|}{|A \cup B|}$
  3. Theorem: the Jaccard distance is a metric.
  4. Jaccard similarity:
    $$ S(A,B) = \frac{|A\cap B|}{|A \cup B|} $$
    Note that $d_J(A,B) = 1 - S(A,B)$; both definitions are illustrated below.
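Both definitions transcribe directly into Python for finite sets (just an illustration):

def jaccard_similarity(A, B):
    # S(A,B) = |intersection| / |union|
    return len(A & B) / len(A | B)

def jaccard_distance(A, B):
    # d_J(A,B) = |symmetric difference| / |union| = 1 - S(A,B)
    return len(A ^ B) / len(A | B)

A, B = {1, 2, 3}, {2, 3, 4}
print(jaccard_similarity(A, B), jaccard_distance(A, B))  # 0.5 0.5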
My lecture notes: BIGAlg09.pdf.

18.12.2020

My lecture notes: BIGAlg10.pdf.

8.01.2021: Min hash sketches and similarity between documents

My lecture notes: BIGAlg11.pdf.

15.01.2021: Frequent items in data streams

  1. Misra-Gries summaries (a Python sketch follows this list)
  2. Majority count
  3. Count-Min sketch
  4. "COVID tracing" applications
Exercises
  1. Recall the proof of Markov's inequality
  2. Derive Chebyshev's inequality from Markov's inequality.
My lecture notes: BIGAlg12.pdf.

22.01.2021:

My lecture notes: BIGAlg13.pdf.

29.01.2021: Dimension reduction

  1. Idea of dimension reduction
  2. Formulation of the Johnson-Lindenstrauss theorem
  3. Basic steps of the proof of the above theorem (a random-projection illustration follows this list)
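A minimal Python illustration of the random-projection idea behind the Johnson-Lindenstrauss theorem (a sketch with toy dimensions of my own choosing):

import random, math

def random_projection(points, k):
    # project d-dimensional points to k dimensions with a Gaussian matrix
    # scaled by 1/sqrt(k), which preserves squared lengths in expectation
    d = len(points[0])
    R = [[random.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in R] for p in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# two random points in dimension 1000, projected down to dimension 100
pts = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(2)]
proj = random_projection(pts, 100)
print(dist(pts[0], pts[1]), dist(proj[0], proj[1]))  # should be comparable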
Exercises
  1. What is $\int_{\RR} e^{-x^2}\, dx$?
  2. You should know the proof of the above result.
My lecture notes: BIGAlg14.pdf.

5.02.2021: Additional lecture I

9.02.2021: Additional lecture II