Wednesday, February 18, 2015
Summary of
Twitter users change word usage according to conversation-partner social identity
Tamburrini et al. Social Networks 40 (2015) 84-89. Full text here (unfortunately, it is not yet free).
Methods
Snowball sampling for determining which Twitter users' data to download, capturing messages with an '@' sign, discarding retweets.
Modularity maximization algorithm for graph partitioning (community detection)
Text similarity measures: Euclidean and Jaccard (a toy sketch follows this list)
Dataset of messages split into inter- and intra-community groups, then balanced by randomly discarding those from the larger set until both sets had the same cardinality.
Bootstrap sampling with set union
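To make the similarity measures concrete, here is a toy sketch on made-up word counts; the paper's exact normalization and vocabulary handling are not reproduced here, so treat those choices, and all names and numbers below, as assumptions for illustration only.

import numpy as np

# Hypothetical word counts for two users/communities (not from the paper).
freq_a = {'lol': 5, 'mate': 2, 'cheers': 1}
freq_b = {'lol': 3, 'dude': 4, 'cheers': 2}

vocab = sorted(set(freq_a) | set(freq_b))
va = np.array([freq_a.get(w, 0) for w in vocab], dtype=float)
vb = np.array([freq_b.get(w, 0) for w in vocab], dtype=float)

# Euclidean distance between the normalized frequency vectors.
euclidean = np.linalg.norm(va / va.sum() - vb / vb.sum())
# Jaccard similarity of the two vocabularies.
jaccard = len(set(freq_a) & set(freq_b)) / float(len(set(freq_a) | set(freq_b)))
print(euclidean, jaccard)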
Results
The basic result is demonstrating that users change their word usage according to their audience and the community of that audience; specifically, the linguistic patterns used for inter-community messages differ from those of intra-community messages. This may seem intuitive or obvious to anyone who associates with multiple social circles (e.g., I talk differently to my grandparents than to my friends, and ditto for co-workers). Further, they show that the degree of isolation of a community is proportional to the magnitude of linguistic difference between speakers.
The paper addresses the issue of social identity, and the results add evidence to two well-known social-psychological theories, Social Accommodation and Convergence. The former states that a speaker will alter their word usage to suit that of the audience, while the latter states that more isolated communities will differ more from the rest of the population.
Wednesday, February 11, 2015
Trends in Wikipedia "Request for Adminship" Elections - Part 1
This is the first post I'll make about a nice Wikipedia data set I just got my hands on. This post will focus solely on the characteristics of voting and election averages over time.
An election win means that an editor becomes a Wikipedia page administrator, giving them more editorial power. In order for an election to commence, a candidate must be nominated. The nodes in the data set, for each election, are labeled as 'nominator', 'candidate', or 'voter.' The three kinds of votes a voter can cast are 1, 0, and -1, indicating yea, neutral, and nay, respectively.
There are 2,794 elections, 2,391 candidates, 7,194 nodes, and 110,087 edges. 44.6 % of elections resulted in a win, and 55.4 % resulted in a loss. It is interesting to note that, in every election which resulted in a loss for the candidate, the nominator was labeled 'UNKNOWN.' I don't know whether that label existed before the election started, so that no one knew who the nominator was, or whether the label was changed from a unique identifier to UNKNOWN after the election outcome, presumably to mitigate the shame of the nominator at having nominated a losing candidate.
The first plot shows the average vote score a candidate receives during an election. The first three subplots are broken down by the first vote the candidate received. For example, the subplot in the upper left shows the moving average of votes, given that the candidate's first vote was positive (+1). The upper middle subplot shows the average vote for candidates whose first vote was negative. The upper right subplot shows candidates whose first vote was neutral. The bottom subplot is the aggregate. In each subplot, a line represents a single election. The y coordinate at each point is the average of all votes cast up to and including that vote. Tap or zoom in on the images to see more clearly.
We can see that elections whose initial vote was positive (blue) tend to stay positive thereafter, while those whose initial vote was negative (red) or neutral (green) tend to average out near zero. If the first vote is positive, there is a 61 % chance of winning the election. If the first vote is negative, there is only a 39 % chance. If the first vote is neutral, the chances of winning and losing are 0.8 % and 5 %, respectively.
Now let's talk about average scores, since that's what the plot shows. Since the votes can only take the values +1, 0, -1, the average must be between +1 and -1. If the first vote is positive, the average vote at the end of the election is 0.6; if the first vote is negative, the ending average is -0.65; if the first vote is neutral, the average vote is -0.12. Interestingly, if the first vote is positive, the chance of losing the election is about 30 %; however, if the first vote is negative, the chance of winning the election is only 0.8 %. The key takeaway here is to get a first vote that is positive if you want to win.
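Here is a minimal sketch of how such a running average could be computed, assuming the votes of a single election are stored in order as a list of +1 / 0 / -1 values (the variable names and the vote sequence are hypothetical, not from the data set):

import numpy as np

votes = np.array([1, 1, 0, -1, 1, 1])                          # hypothetical vote sequence for one election
running_avg = np.cumsum(votes) / np.arange(1.0, len(votes) + 1) # average of all votes up to each point
print(running_avg)                                              # the y values of that election's line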
It is also worth mentioning that the average length (number of votes) in a winning election is 57, while the average length in a losing election is 27. This difference is significant. It may be that people will not continue to vote if they perceive an imminent loss for the candidate.
Now that we've seen data on elections, let's look at candidates over all elections. In the plot below, each line represents a candidate across all elections for which they ran.
This looks fairly similar to the previous plot. That is because most people only run once. In fact, the average number of elections in which a candidate participates is just the number of elections divided by the number of candidates: 2,794 / 2,391 = 1.168. However, if a candidate wins their first election, their average number of campaigns jumps to 3.23.
Now let's look at the average voter.
Those who voted positive tend to vote more, as indicated by the length of the line (how far to the right it stretches). Since voters whose first vote is positive account for 76.8 % of all voters, the lines in the bottom subplot (grey) tend to terminate with a positive average. Note that the x-axis is log scaled. If a voter cast a positive first vote, their average vote is 0.79; if their first vote is neutral, their average is 0.21; and if their first vote is negative, their average vote is -0.43. This one sample, the first vote, says a lot.
Finally, let's look at a moving average of election outcomes per candidate.
Here we see that just under half of candidates won their first election. Those who won tend to run fewer times, with a max of 3 campaigns, while those who lost their first election ran more times, with a max of 5 campaigns. Note that the x axis is linear and goes from 0 to 4, giving 4-0+1=5 distinct values. The main takeaway here is that once people are elected, they are content and don't tend to run again. However, those who lose have everything to gain, and tend to run more in order to win. In this plot, as in the previous ones, the thickness of the line conveys the proportion of people who follow that path. Notice the thickness of the red line between x = 0 and 1, where the y value increases from -1 to 0. This means that these candidates lost their first election (the average of a single -1 is -1) and won their second (the average of -1 and +1 is 0), at which point they stopped running. Ditto for the red line between x = 1 and 2, and between 2 and 3. Again, the bottom subplot is an aggregate of the top 2 subplots.
Upcoming posts will also include average election size (how many people are voting), who is nominating candidates, visualizations, time series, and more.
Wednesday, January 14, 2015
Bernoulli Distribution
The Bernoulli Distribution models the likelihood of two outcomes: either True or False, Heads or Tails, 1 or 0, Up or Down, Success or Failure, etc. The distribution takes one parameter, $p$, with $0 \leq p \leq 1$, which represents the likelihood of success. Since there are only two outcomes in the sample, $k \in \{0,1\}$, the likelihood of failure is represented by $q=1-p$. This way, the probability of all possible outcomes sums to one: $p+q=1$, and hence it is a legitimate probability distribution.
Formally, if $X$ is a random variable modeled by a Bernoulli distribution with parameter $p$, we say that $X \sim Bern(p)$, $P(X=1)=p$ and that $P(X=0)=1-p=q$.
A Bernoulli experiment or trial involves sampling a Bernoulli distribution once. For example, the flipping of a coin is a Bernoulli trial. If the coin is fair, then $p=0.5$, and by the definition, $q=1-p=0.5$; that is, the chance of heads is the same as the chance of tails.
The pmf is
$p(k;p)=
\begin{cases}
p & \text{if } k=1, \\
1-p=q & \text{if } k=0.
\end{cases}$
The way to read $p(k;p)$ is "the probability of $k$, given the parameter $p$." If you look at the Bernoulli distribution as a special case of the Binomial distribution with $n=1$, then $p(k;p)=\binom{1}{k}p^{k}q^{1-k}=p^{k}q^{1-k}$ for $k \in \{0,1\}$, since $\binom{1}{k}=1$ for both values of $k$.
The expected value of $X$ is $E(X)=p$ and the variance is $Var(X)=p(1-p)=pq$.
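These formulas can be checked numerically; the snippet below is an illustrative aside (not part of the original post) that uses scipy.stats, with $p=0.3$ chosen arbitrarily.

from scipy import stats

p = 0.3
dist = stats.bernoulli(p)
print(dist.pmf(1), dist.pmf(0))      # 0.3 and 0.7, i.e., p and q
print(dist.mean(), p)                # both 0.3
print(dist.var(), p * (1 - p))       # both 0.21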
The following function performs a single Bernoulli trial using numpy's uniform random number generator:
import numpy
import pandas as pd
import matplotlib.pyplot as plt

def bernoulli_trial(p):
    # Return 1 (success) with probability p, otherwise return 0 (failure).
    if numpy.random.random() < p:
        return 1
    return 0

p = 0.1
N = 20
trials = [bernoulli_trial(p) for i in range(N)]
print(trials)
This results in the following list:
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Changing $p$ to $0.5$ gives something like:
[0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
Plot like so:
plt.scatter(range(len(trials)), trials, linewidth=0, alpha=0.5, s=50)
plt.title('Bernoulli Trials')
plt.xlabel('trial number')
plt.ylabel('trial outcome')
plt.show()
This data is called a random sample. We can use maximum likelihood estimation to estimate the value of $p$ if we don't know it; for a Bernoulli sample, the maximum likelihood estimate is simply the sample mean, $\hat{p}=\frac{1}{N}\sum_{i=1}^{N}x_{i}$. The next piece of code does 500 Bernoulli trials for 9 different values of $p$: $p \in \{0.1, 0.2, ..., 0.9 \}$. We will pretend we don't know $p$ for each of these 9 samples.
plt.clf()
N = 500
data = []
for p in [float(i)/10.0 for i in range(1, 10)]:
    # Draw 500 trials for each value of p.
    trials = [bernoulli_trial(p) for i in range(N)]
    data.append(trials)
data = numpy.array(data).T
df = pd.DataFrame(data, columns=[float(i)/10.0 for i in range(1, 10)])
means = df.mean()   # the sample mean of each column is the MLE of that column's p
means.plot(kind='bar')
plt.xlabel('actual $p$')
plt.ylabel('observed $p$')
plt.show()

As you can see, the observed $p$ closely approximates the actual $p$.
Tuesday, January 13, 2015
How to set up a scientific virtual environment for python
Virtual environments are a way to partition your dependencies into different directories. So, for example, if program A needs pandas 0.14 and program B needs pandas 0.15, you can put programs A and B in separate directories, create a virtual environment in each folder, and install the packages specific to each program in their respective places without conflict. Note that a typical python installation takes about 282 MB of disk space.
This guide applies to Mac OS X and maybe Linux. First, make sure pip is installed. Python 2.7.9 comes with pip preinstalled. See this page for installation. You can check your version of pip with the following command
$ pip -V
Second, install the virtual environment package using pip:
$ pip install virtualenv
This may take a while. You are actually installing python.
Now make a directory in which you want to set up the virtual environment:
$ mkdir myproject
$ cd myproject
$ virtualenv vrtenv
You can specify a python interpreter using the -p option. The following will create a virtual environment with the version in /usr/bin/python2.7:
$ virtualenv -p /usr/bin/python2.7 vrtenv
If you want python 3.3, it must already be installed on your system; call it the same way:
$ virtualenv -p /usr/bin/python3.3 vrtenv
vrtenv is the directory where the new python environment (interpreter, packages, and scripts) resides. You can name this folder anything, but vrtenv is intuitive enough.
This is how you start a virtual environment.
$ source vrtenv/bin/activate
(vrtenv)$
Notice the change in the prompt. The name of the virtual environment should appear in parentheses on the left. Now proceed to install your python packages:
$ pip install numpy
$ pip install scipy
It can be tedious to manually install everything you need. Luckily, pip provides the functionality to install packages from a file. Begin by collecting your current packages:
$ pip freeze > requirements.txt
Here is an example of what might be in requirements.txt:
backports.ssl-match-hostname==3.4.0.2
certifi==14.5.14
decorator==3.4.0
gnureadline==6.3.3
ipython==2.3.1
Jinja2==2.7.3
MarkupSafe==0.23
matplotlib==1.4.2
mock==1.0.1
networkx==1.9.1
nose==1.3.4
numpy==1.9.1
pandas==0.15.2
Pygments==2.0.2
pyparsing==2.0.3
python-dateutil==2.4.0
pytz==2014.10
pyzmq==14.4.1
requests==2.5.1
scikit-learn==0.15.1
scipy==0.14.1
six==1.9.0
tornado==4.0.2
twitter==1.15.0
Once you have frozen and redirected your old package index, just copy requirements.txt to your new directory, myproject, and call
$ pip install -r requirements.txt
All that remains is to copy pre-existing code into myproject or to create new code.
Nota Bene: before installing everything from requirements.txt, go through it and try to remove packages that are superfluous to your new directory. If you accidentally remove a package or two, that's okay, because when you try to run your python script, if you forgot anything, the program will raise an ImportError. Use pip to install whatever is missing, and then update your requirements.txt file using pip freeze again.
Using the requirements.txt above will give you all you need to run networkx for social network analysis, sklearn for machine learning, pandas for data analysis, and ipython notebooks for fun.
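As a sanity check (this snippet is not part of the original guide), you can confirm from within Python which interpreter and environment are active:

import sys
print(sys.executable)   # should point inside myproject/vrtenv/bin/
print(sys.prefix)       # should be the vrtenv directory, not the system python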
Finally, when you are done working in your new environment, deactivate it:
$ deactivate
Sunday, January 11, 2015
Selection Sort
This is the Selection Sort algorithm. It works best on small lists, $2 \leq k \leq 7$. It is in-place and is a comparison sort. It works by splitting the list into two sublists: the left sublist is sorted and the right sublist is unsorted. In each step (each index of the original list), select the smallest item from the right sublist, place it at the beginning (left-most position) of the right sublist, and then move the boundary one position to the right. The boundary is the initial index of each step; that is, the boundary is 0 in the first step, 1 in the second step, and so on.
def selection_sort(lst):
    for i in range(len(lst)):
        # Find the index of the smallest element in the unsorted right sublist.
        m = i
        for j in range(i, len(lst)):
            if lst[j] < lst[m]:
                m = j
        # Swap it into position i, extending the sorted left sublist by one.
        lst[i], lst[m] = lst[m], lst[i]
The Selection sort implementation above provides best case $\Omega(n^2)$ and worst case $O(n^2)$, giving $\Theta(n^2)$ in time complexity.
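A quick check of the implementation (not from the original post):

lst = [29, 10, 14, 37, 13]
selection_sort(lst)
print(lst)                                    # [10, 13, 14, 29, 37]
print(lst == sorted([29, 10, 14, 37, 13]))    # True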
Sunday, January 4, 2015
Bryden et al. Word usage mirrors community structure in the online social network Twitter. EPJ Data Science 2013, 2:3.
This is my summary post. The intent of these summary posts is to provide a brief overview of the methods used and the results.
Summary of
Word usage mirrors community structure in the online social network Twitter.
Bryden et al. EPJ Data Science 2013, 2:3. full text here
Methods
Snowball sampling
The Map equation
Modularity
Z-score to rank words within a community (a toy sketch follows this list)
Euclidean distance with a bootstrap to determine word usage difference significance between communities.
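Here is a generic z-score sketch for ranking words within a community; the paper's exact statistic is not reproduced, so the normalization below (mean and standard deviation across communities) and all names and numbers are assumptions for illustration only.

import numpy as np

# Hypothetical relative word frequencies: rows = communities, columns = words.
words = ['football', 'gig', 'thesis']
freqs = np.array([[0.08, 0.01, 0.00],
                  [0.01, 0.07, 0.01],
                  [0.00, 0.01, 0.06]])

mu = freqs.mean(axis=0)
sigma = freqs.std(axis=0)
z = (freqs[0] - mu) / sigma                         # z-scores of each word in community 0
ranking = [words[i] for i in np.argsort(z)[::-1]]   # most over-represented words first
print(list(zip(words, z.round(2))), ranking)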
Results
Word frequencies of individuals differ by community, and by extension, by hierarchies of communities. A consequence of this pattern is that communities can be characterized by the word usage of their members; subsequently, the words of an individual member can be used to predict their community. Also, the language patterns (word endings, letter-pair frequency, etc.) within a community are significantly different from those outside of the community. This makes sense: a shared language facilitates communication and interaction. They go on to say that words can act as cultural markers, which other individuals and groups use to decide whether or not to make a new association. This can be partially explained through homophily and peer effects.

Saturday, January 3, 2015
Bubble Sort
This is the Bubble Sort algorithm. It is one of the least efficient in time complexity, but its effervescent name keeps people coming back.
The algorithm, which falls under the "comparison sort" paradigm, works by repeatedly stepping through the array, comparing adjacent elements, and swapping elements that are out of order. The first pass through the array starts by comparing the first and second values, then the second and third, then the third and fourth, ..., then the next-to-last and last, and results in the largest element landing at the end of the array; i.e., the last element is in its final, sorted position. The second pass repeats the comparisons but stops one position earlier; since the last element was placed correctly in the first pass, it no longer needs to be checked.
def bubble_sort(lst):
    for i in range(len(lst) - 1, 0, -1):
        # During this pass, the largest element among lst[0..i] bubbles to index i.
        for j in range(i):
            if lst[j] > lst[j+1]:
                lst[j], lst[j+1] = lst[j+1], lst[j]
The naive Bubble sort implementation above provides best case $\Omega(n^2)$ and worst case $O(n^2)$, giving $\Theta(n^2)$ in time complexity.
Optimized versions of Bubble sort, like the one below, provide best case $\Omega(n)$ and worst case $O(n^2)$. The $\Omega(n)$ best case occurs when the list is already in sorted order. A boolean flag records whether any swap occurred during a pass; the loop runs more or less the same as in the naive algorithm, except that if a pass completes with no swaps, the list is in sorted order and the $while$ loop terminates.
def bubble_sort2(lst):
    n = len(lst)
    swapped = True
    while swapped:
        swapped = False
        for i in range(n - 1):
            if lst[i] > lst[i+1]:
                # Adjacent elements are out of order; swap them and flag the pass.
                lst[i], lst[i+1] = lst[i+1], lst[i]
                swapped = True
        n -= 1
The following version of Bubble sort is more efficient than the first two. It takes advantage of the fact that the index of the last swap is the position beyond which the array is sorted, because if it were not sorted beyond that point, there would have been a swap. If the array is already in sorted order, the algorithm will make one pass over the $n-1$ pairs of adjacent elements and then terminate, giving best case $\Omega(n)$, and worst case $O(n^2)$ if the list is reverse sorted.
def bubble_sort3(lst):
    n = len(lst)
    while n > 0:
        new_n = 0
        for i in range(1, n):
            if lst[i-1] > lst[i]:
                lst[i-1], lst[i] = lst[i], lst[i-1]
                new_n = i
        n = new_n
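A quick sanity check that the three versions agree (this snippet is not from the original post):

import random

data = [random.randint(0, 100) for _ in range(20)]
a, b, c = list(data), list(data), list(data)
bubble_sort(a)
bubble_sort2(b)
bubble_sort3(c)
print(a == b == c == sorted(data))   # True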
Friday, January 2, 2015
Insertion Sort
Here are two ways to do Insertion Sort in python.
Insertion sort provides best case $\Omega(n)$ and worst case $O(n^2)$.
def insertion_sort(lst):
    for i in range(1, len(lst)):
        # Swap lst[i] leftward until it sits in place within the sorted prefix.
        j = i
        while j > 0 and lst[j] < lst[j-1]:
            lst[j], lst[j-1] = lst[j-1], lst[j]
            j -= 1

def insertion_sort2(lst):
    for i in range(1, len(lst)):
        # Shift larger elements one slot right, then drop lst[i] into the gap.
        val = lst[i]
        j = i - 1
        while j >= 0 and val < lst[j]:
            lst[j+1] = lst[j]
            j -= 1
        lst[j+1] = val
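A quick check that both versions produce the same result (not from the original post):

data = [5, 2, 4, 6, 1, 3]
a, b = list(data), list(data)
insertion_sort(a)
insertion_sort2(b)
print(a, b)   # both [1, 2, 3, 4, 5, 6]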