
Wednesday, January 14, 2015

Bernoulli Distribution

The Bernoulli distribution models the likelihood of two outcomes: True or False, Heads or Tails, 1 or 0, Up or Down, Success or Failure, etc. The distribution takes one parameter, $p$, with $0 \leq p \leq 1$, which represents the likelihood of success. Since there are only two outcomes in the sample space, $k \in \{0,1\}$, the likelihood of failure is represented by $q=1-p$. This way, the probabilities of all possible outcomes sum to one, $p+q=1$, and hence it is a legitimate probability distribution.
Formally, if $X$ is a random variable modeled by a Bernoulli distribution with parameter $p$, we say that $X \sim \mathrm{Bern}(p)$, $P(X=1)=p$, and $P(X=0)=1-p=q$.
A Bernoulli experiment or trial involves sampling a Bernoulli distribution once. For example, the flipping of a coin is a Bernoulli trial. If the coin is fair, then $p=0.5$, and by the definition, $q=1-p=0.5$; that is, the chance of heads is the same as the chance of tails.
The $pmf$ is
$$
p(k;p)=
\begin{cases}
p & \text{if } k=1, \\
1-p=q & \text{if } k=0.
\end{cases}
$$
The way to read $p(k;p)$ is "the probability of $k$, given the parameter $p$." If you look at the Bernoulli distribution as the special case of the Binomial distribution with $n=1$, then $p(k;p)=\binom{1}{k}p^{k}q^{1-k}=p^{k}q^{1-k}$ for $k \in \{0,1\}$, since $\binom{1}{k}=1$ for both values of $k$.
The expected value of $X$ is $E(X)=p$ and the variance is $Var(X)=p(1-p)=pq$.
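Both follow directly from the definition: since $X$ takes only the values $0$ and $1$, we have $X^2=X$, so
$$E(X)=1 \cdot p + 0 \cdot q = p, \qquad Var(X)=E(X^2)-E(X)^2=p-p^2=pq.$$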
import numpy
import pandas as pd
import matplotlib.pyplot as plt

def bernoulli_trial(p):
    # numpy.random.random() is uniform on [0, 1), so it falls
    # below p with probability exactly p
    if numpy.random.random() < p:
        return 1
    return 0

p = 0.1
N = 20
trials = [bernoulli_trial(p) for i in range(N)]
print(trials)
This results in the following list:
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Changing $p$ to $0.5$ gives something like:
[0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
Plot like so:
plt.scatter(range(len(trials)), trials, linewidth=0, alpha=0.5, s=50)
plt.title('Bernoulli Trials')
plt.xlabel('trial number')
plt.ylabel('trial outcome')
plt.show()
This data is called a random sample. We can use maximum likelihood estimation to estimate the value of $p$ if we don't know it.
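For a Bernoulli sample the maximum likelihood estimate turns out to be just the sample mean. Given independent trials $x_1, \ldots, x_N$, the log-likelihood is
$$\ell(p)=\sum_{i=1}^{N}\left[x_i \log p + (1-x_i)\log(1-p)\right],$$
and setting $d\ell/dp=0$ gives
$$\hat{p}=\frac{1}{N}\sum_{i=1}^{N}x_i,$$
which is why taking the mean of each sample works in the code below.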
The next piece of code does 500 Bernoulli trials for 9 different values of $p$: $p \in \{0.1, 0.2, ..., 0.9 \}$. We will pretend we don't know $p$ for each of these 9 samples.  
plt.clf()
N = 500
data = []
# one column of 500 trials for each value of p
for p in [float(i)/10.0 for i in range(1, 10)]:
    trials = [bernoulli_trial(p) for i in range(N)]
    data.append(trials)
data = numpy.array(data).T
df = pd.DataFrame(data, columns=[float(i)/10.0 for i in range(1, 10)])
# the column means are the maximum likelihood estimates of each p
means = df.mean()
means.plot(kind='bar')
plt.xlabel('actual $p$')
plt.ylabel('observed $p$')
plt.show()


As you can see, the observed $p$ closely approximates the actual $p$.

Tuesday, January 13, 2015

How to set up a scientific virtual environment for Python

Virtual environments are a way to partition your dependencies into different directories. So, for example, if program A needs one version of pandas and program B needs another, you can put the two programs in separate directories, create a virtual environment in each, and install the packages specific to each program in their respective places without conflict. Note that a typical Python installation takes about 282 MB of disk space.
This guide applies to Mac OS X and should also work on Linux. First, make sure pip is installed; Python 2.7.9 comes with pip preinstalled. See the pip documentation for installation instructions. You can check your version of pip with the following command:
$ pip -V
Second, install the virtual environment package using pip:
$ pip install virtualenv
This installs the virtualenv tool itself; each environment you create with it will get its own Python interpreter and site-packages directory.
Now make a directory in which you want to set up the virtual environment:
$ mkdir myproject
$ cd myproject
$ virtualenv vrtenv
You can specify a python interpreter using the -p option. The following will create a virtual environment with the version in /usr/bin/python2.7.
$ virtualenv -p /usr/bin/python2.7 vrtenv
If you want Python 3.3, it must already be installed on your system; then point to its interpreter the same way:
$ virtualenv -p /usr/bin/python3.3 vrtenv
vrtenv is the directory where the new Python environment resides. You can name this folder anything, but vrtenv is intuitive enough.
This is how you activate the virtual environment:
$ source vrtenv/bin/activate
(vrtenv)$ 
Notice the change in the prompt: the name of the virtual environment appears in parentheses on the left. Now proceed to install your Python packages:
(vrtenv)$ pip install numpy
(vrtenv)$ pip install scipy
It can be tedious to manually install everything you need. Luckily, pip provides the functionality to install packages from a file. Begin by collecting your current packages:
$ pip freeze > requirements.txt
Here is an example of what might be in requirements.txt:
backports.ssl-match-hostname==3.4.0.2
certifi==14.5.14
decorator==3.4.0
gnureadline==6.3.3
ipython==2.3.1
Jinja2==2.7.3
MarkupSafe==0.23
matplotlib==1.4.2
mock==1.0.1
networkx==1.9.1
nose==1.3.4
numpy==1.9.1
pandas==0.15.2
Pygments==2.0.2
pyparsing==2.0.3
python-dateutil==2.4.0
pytz==2014.10
pyzmq==14.4.1
requests==2.5.1
scikit-learn==0.15.1
scipy==0.14.1
six==1.9.0
tornado==4.0.2
twitter==1.15.0
Once you have frozen and redirected your old package index, just copy requirements.txt to your new directory, myproject, and call
$ pip install -r requirements.txt
All that remains is to copy pre-existing code into myproject or to create new code.
Nota Bene: before installing everything from requirements.txt, go through it and remove packages that are superfluous to your new project. If you accidentally remove a package you actually need, that's okay: when you run your Python script, the program will raise an ImportError for anything missing. Use pip to install whatever is missing, and then update your requirements.txt file using pip freeze again.
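For example, if a script fails with an ImportError for networkx, install it and refresh the file:
(vrtenv)$ pip install networkx
(vrtenv)$ pip freeze > requirements.txt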
Using the requirements.txt above will give you all you need to run networkx for social network analysis, sklearn for machine learning, pandas for data analysis, and ipython notebooks for fun.
Finally, when you are done working in your new environment, deactivate it:
$ deactivate

Sunday, January 11, 2015

Selection Sort

This is the Selection Sort algorithm. It works best on small lists, $2 \leq k \leq 7$. It is in-place and is a comparison sort. It works by splitting the list into two sublists: the left sublist is sorted and the right sublist is unsorted. In each step (each index of the original list), select the smallest item from the right sublist, place it at the beginning (left-most position) of the right sublist, and then move the boundary one position to the right. The boundary is the initial index of each step; that is, the boundary is 0 in the first step, 1 in the second step, and so on.
def selection_sort(lst):
    for i in xrange(len(lst)):
        # find the index of the smallest element in the unsorted (right) sublist
        m = i
        for j in xrange(i + 1, len(lst)):
            if lst[j] < lst[m]:
                m = j
        # move it to the boundary
        lst[i], lst[m] = lst[m], lst[i]
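A quick sanity check on an arbitrary small list:
lst = [29, 10, 14, 37, 13]
selection_sort(lst)
print(lst)  # [10, 13, 14, 29, 37]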
The Selection sort implementation above provides best case $\Omega(n^2)$ and worst case $O(n^2)$, giving $\Theta(n^2)$ in time complexity.

Sunday, January 4, 2015

Bryden et al. Word usage mirrors community structure in the online social network Twitter. EPJ Data Science 2013, 2:3.

This is my summary post. The intent of these summary posts is to provide a brief overview of the methods used and the results. 

Summary of
Word usage mirrors community structure in the online social network Twitter. 
Bryden et al. EPJ Data Science 2013, 2:3.


Methods

Snowball sampling
The Map equation
Modularity
Z-score to rank words within a community (a sketch follows this list)
Euclidean distance with a bootstrap to determine word usage difference significance between communities. 
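As a rough illustration of the z-score step (my own reconstruction under assumed inputs, not the paper's exact formulation), a word can be scored by how far its in-community frequency deviates from the network-wide baseline:
import numpy as np

def word_zscore(count_in_community, community_total, count_overall, overall_total):
    # frequency of the word inside the community vs. the global baseline
    p_c = count_in_community / float(community_total)
    p_0 = count_overall / float(overall_total)
    # one-sample proportion z-score: large positive values mean the community
    # uses the word far more often than the network as a whole
    se = np.sqrt(p_0 * (1.0 - p_0) / community_total)
    return (p_c - p_0) / se

# hypothetical counts: 120 uses out of 50,000 words in the community,
# 300 uses out of 5,000,000 words network-wide
print(word_zscore(120, 50000, 300, 5000000))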


Results

Word frequencies of individuals differ by community and, by extension, across hierarchies of communities. A consequence of this pattern is that communities can be characterized by the word usage of their members; in turn, the words of an individual member can be used to predict their community. Also, the language patterns (word endings, letter-pair frequency, etc.) within a community are significantly different from those outside of the community. This makes sense: a shared language facilitates communication and interaction. The authors go on to say that words can act as cultural markers, which other individuals and groups use to decide whether or not to form a new association. This can be partially explained through homophily and peer effects.


Saturday, January 3, 2015

Bubble Sort

This is the Bubble Sort algorithm. It is one of the least efficient in time complexity, but its effervescent name keeps people coming back.
The algorithm, which falls under the "comparison sort" paradigm, works by repeatedly stepping through the array, comparing adjacent elements, and swapping elements that are out of order. The first pass through the array starts by comparing the first and second values, then the second and third, then the third and fourth, ..., then the next-to-last and last, and results in the largest element at the end of the array; i.e., the array is now sorted at its last position. The second pass again starts at the front but stops one comparison earlier; since the largest element was placed in its correct position by the first pass, it no longer needs to be checked.
def bubble_sort(lst):
    for i in range(len(lst) - 1, 0, -1):
        # each pass bubbles the largest remaining element to position i
        for j in range(i):
            if lst[j] > lst[j+1]:
                lst[j], lst[j+1] = lst[j+1], lst[j]
The naive Bubble sort implementation above provides best case $\Omega(n^2)$ and worst case $O(n^2)$, giving $\Theta(n^2)$ in time complexity.
Optimized versions of Bubble sort, like the one below, provide best case $\Omega(n)$ and worst case $O(n^2)$. The $\Omega(n)$ best case occurs when the list is already in sorted order. A boolean flag tracks whether a pass made any swaps, and the loop runs more or less the same as in the naive algorithm, with the exception that if, in any pass, no two elements are swapped, then the list is in sorted order and the $while$ loop terminates.
def bubble_sort2(lst):
    swapped = True
    n = len(lst)
    while swapped:
        swapped = False
        # compare each adjacent pair in the unsorted prefix
        for i in range(n - 1):
            if lst[i] > lst[i+1]:
                lst[i], lst[i+1] = lst[i+1], lst[i]
                swapped = True
        # the largest element of this pass is now at index n-1
        n -= 1
The following version of Bubble sort is more efficient than the first two. It takes advantage of the fact that the index of the last swap is the position beyond which the array is sorted, because if it were not sorted there, there would have been a swap. If the array is already in sorted order, the algorithm will make one pass over the $n-1$ pairs of adjacent elements and then terminate, giving best case $\Omega(n)$; the worst case is $O(n^2)$ when the list is reverse sorted.
def bubble_sort3(lst):
    n = len(lst)
    while n > 0:
        new_n = 0
        for i in range(1, n):
            if lst[i-1] > lst[i]:
                lst[i-1], lst[i] = lst[i], lst[i-1]
                # remember the position of the last swap
                new_n = i
        # everything at or beyond new_n is already sorted
        n = new_n

Friday, January 2, 2015

Insertion Sort

Here are two ways to do Insertion Sort in Python. Both maintain a sorted sublist at the front of the list and insert each new element into its correct position: the first swaps the element down one position at a time, while the second shifts larger elements up and writes the value once.
def insertion_sort(lst):
    for i in xrange(1, len(lst)):
        # swap lst[i] down until it reaches its sorted position
        j = i
        while j > 0 and lst[j] < lst[j-1]:
            lst[j], lst[j-1] = lst[j-1], lst[j]
            j -= 1

def insertion_sort2(lst):
    for i in xrange(1, len(lst)):
        # shift larger elements up, then drop val into the gap
        val = lst[i]
        j = i - 1
        while j >= 0 and val < lst[j]:
            lst[j+1] = lst[j]
            j -= 1
        lst[j+1] = val


Insertion sort provides best case $\Omega(n)$ and worst case $O(n^2)$.