The Binarium: February 2015

Summary of
Twitter users change word usage according to conversation-partner social identity
Tamburrini et al. Soc Net. 40 (2015) 84-89. Full text here (unfortunately it is not free, yet).

Methods
Snowball sampling for determining which Twitter users' data to download, capturing messages with an '@' sign, discarding retweets.
Modularity maximization algorithm for graph partitioning (community detection)
Text similarity measures: Euclidean distance and Jaccard similarity
Dataset of messages split into inter- and intra-community groups, then balanced by randomly discarding those from the larger set until both sets had the same cardinality.
Bootstrap sampling with set union
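
To make the two text similarity measures in the list above concrete, here is a minimal sketch in Python. The tokenization (lowercasing and splitting on whitespace), the normalization to relative word frequencies, and the helper names `word_counts`, `jaccard_similarity`, and `euclidean_distance` are my own assumptions for illustration; the paper's exact preprocessing may well differ.

```python
import math
from collections import Counter


def word_counts(text):
    """Count words after lowercasing and whitespace splitting (assumed tokenization)."""
    return Counter(text.lower().split())


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by size of the combined vocabulary."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty vocabularies are treated as identical
    return len(set_a & set_b) / len(union)


def euclidean_distance(a, b):
    """Euclidean distance between relative word-frequency vectors (assumed normalization)."""
    total_a = sum(a.values()) or 1
    total_b = sum(b.values()) or 1
    vocab = set(a) | set(b)
    # Counter returns 0 for words that are missing from one side.
    return math.sqrt(sum((a[w] / total_a - b[w] / total_b) ** 2 for w in vocab))


# Hypothetical messages, one intra-community and one inter-community.
intra = word_counts("see you at the pub tonight")
inter = word_counts("see you at the meeting tomorrow")
print(jaccard_similarity(intra, inter))
print(euclidean_distance(intra, inter))
```

Note that Jaccard only looks at which words appear, while the Euclidean measure is sensitive to how often they appear, so the two can disagree on longer texts.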

Results
The basic result is that users change their word usage according to their audience and the community that audience belongs to; specifically, the linguistic patterns used for inter-community messages differ from those of intra-community messages. This may seem intuitive or obvious to anyone who associates with multiple social circles: for example, I talk differently to my grandparents than I do to my friends, and ditto for co-workers. Further, they show that the degree of isolation of a community is proportional to the magnitude of linguistic difference between speakers.
The paper addresses the issue of social identity, and the results add evidence to the two well-known social-psychological theories of Social Accommodation and Convergence. The former states that a speaker will alter their word usage to suit that of the audience, while the latter states that more isolated communities will differ more from the rest of the population.

This is the first post I'll make about a nice Wikipedia data set I just got my hands on. This post will focus solely on the characteristics of voting and election averages over time.

An election win means that an editor becomes a Wikipedia page administrator, giving them more editorial power. In order for an election to commence, a candidate must be nominated. The nodes in the data set, for each election, are labeled as 'nominator', 'candidate', or 'voter.' The three kinds of votes a voter can cast are 1, 0, and -1, indicating yea, neutral, and nay, respectively.

There are 2,794 elections, 2,391 candidates, 7,194 nodes, and 110,087 edges. 44.6 % of elections resulted in a win, and 55.4 % resulted in a loss. It is interesting to note that, in every election which resulted in a loss for the candidate, the nominator was labeled 'UNKNOWN.' I don't know if that label existed before the election started, meaning no one knew who the nominator was, or if the label was changed from a unique identifier to UNKNOWN after the election outcome, presumably to mitigate the shame of the nominator at having nominated a losing candidate.

The first plot shows the average vote score a candidate receives during an election. The first three subplots are broken down by the first vote the candidate received. For example, the subplot in the upper left shows the moving average of votes, given that the candidate's first vote was positive (+1). The upper middle subplot shows the average vote for candidates whose first vote was negative. The upper right subplot shows candidates whose first vote was neutral. The bottom subplot is the aggregate. In each subplot, a line represents a single election. The y coordinate at each point is the average value of all votes up to and including that vote. Tap or zoom in on the images to see more clearly.
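
For readers who want to reproduce the y coordinate of a line, here is a minimal sketch of the running average. It assumes one election's votes are available as a list of +1, 0, and -1 values in the order they were cast; the function name `running_average` and the sample votes are hypothetical.

```python
from itertools import accumulate


def running_average(votes):
    """Average of all votes up to and including each position."""
    # accumulate() yields the cumulative sum; dividing by the number of
    # votes seen so far gives the running mean.
    return [total / n for n, total in enumerate(accumulate(votes), start=1)]


# Hypothetical election: two yeas, then a nay, then a neutral vote.
print(running_average([1, 1, -1, 0]))
```

The first element always equals the first vote, which is why every line in a subplot starts at the same y value.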

We can see that those whose initial vote was positive (blue) tend to receive positive votes thereafter, while those whose initial vote was negative (red) or neutral (green) tend to average out near zero. If the first vote is positive, there is a 61 % chance of winning the election. If the first vote is negative, there is only a 39 % chance. If the first vote is neutral, the chances of winning and losing are 0.8 % and 5 %, respectively.

Now let's talk about average scores, since that's what the plot shows. Since the votes can only take the values +1, 0, -1, the average must be between +1 and -1. If the first vote is positive, the average vote at the end of the election is 0.6; if the first vote is negative, the ending average is -0.65; if the first vote is neutral, the average vote is -0.12. Interestingly, if the first vote is positive, the chance of losing the election is about 30 %; however, if the first vote is negative, the chance of winning the election is only 0.8 %. The key takeaway here is to get a first vote that is positive if you want to win.
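
Here is a rough sketch of how win rates conditioned on the first vote could be tallied. The input layout (a list of `(votes, won)` pairs), the function name `win_rate_by_first_vote`, and the toy data are all assumptions for illustration; they are not the actual data set or its numbers.

```python
from collections import defaultdict


def win_rate_by_first_vote(elections):
    """Map each first-vote value (+1, 0, -1) to the fraction of elections won."""
    counts = defaultdict(lambda: [0, 0])  # first vote -> [wins, total]
    for votes, won in elections:
        if not votes:
            continue  # skip elections with no recorded votes
        counts[votes[0]][0] += int(won)
        counts[votes[0]][1] += 1
    return {first: wins / total for first, (wins, total) in counts.items()}


# Toy data, made up for this example only.
toy_elections = [
    ([1, 1, -1], True),
    ([1, -1, -1], False),
    ([-1, -1], False),
    ([0, 1], True),
]
print(win_rate_by_first_vote(toy_elections))
```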

It is also worth mentioning that the average length (number of votes) in a winning election is 57, while the average length in a losing election is 27. This difference is significant. It may be that people will not continue to vote if they perceive an imminent loss for the candidate.

Now that we've seen data on elections, let's look at candidates over all elections. In the plot below, each line represents a candidate across all elections for which they ran.

This looks fairly similar to the previous one. That is because most people only run once. In fact, the average number of elections in which a candidate participates is just the number of elections divided by the number of candidates: 2,794 / 2,391 ≈ 1.169. However, if a candidate wins their first election, their average number of campaigns jumps to 3.23.
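
As a quick check of that ratio, and a sketch of how the same average could be computed from raw records, consider the snippet below. It assumes the data can be reduced to a list with one candidate identifier per election; `mean_campaigns` and the toy identifiers are hypothetical names.

```python
from collections import Counter


def mean_campaigns(candidate_per_election):
    """Average number of elections per distinct candidate."""
    runs = Counter(candidate_per_election)
    return sum(runs.values()) / len(runs)


# Aggregate figures quoted in the post.
print(round(2794 / 2391, 3))

# Toy list: candidate "a" ran twice, "b" and "c" once each.
print(mean_campaigns(["a", "b", "a", "c"]))
```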

Now let's look at the average voter.

Those whose first vote was positive tend to vote more, as indicated by the length of the line (how far to the right the line stretches). Since voters whose first vote is positive account for 76.8 % of all voters, the lines in the bottom subplot (grey) tend to terminate with a positive average. Note the x-axis is log scaled. If a voter cast a positive first vote, their average vote is 0.79; if their first vote is neutral, their average is 0.21; and if their first vote is negative, their average vote is -0.43. This one sample, the first vote, says a lot.

Finally, let's look at a moving average of election outcomes per candidate.

Here we see that just under half of candidates won their first election. Those who won also tend to run fewer times, with a max of 3 campaigns, while those who lost their first election ran more times, with a max of 5 campaigns. Note the x axis is linear and goes from 0 to 4, giving 4-0+1=5 distinct values. The main takeaway here is that once people are elected, they are content and don't tend to run again. However, those who lose have everything to gain, and tend to run more in order to win. In this plot, as in the previous ones, the thickness of the line conveys the proportion of people who follow that path. Notice the thickness of the red line between x = 0 and 1, where the y value increases from -1 to 0. This means that those candidates lost their first election (average of -1 is -1) and won their second (average of -1 and +1 is 0), at which point they stopped running. Ditto for the red line between x = 1 and 2, and between 2 and 3. Again, the bottom subplot is an aggregate of the top 2 subplots.
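
The loss-then-win path described above can be traced with a small sketch. It assumes each candidate's outcomes are coded as +1 for a win and -1 for a loss, in campaign order, and it approximates line thickness as the share of candidates who follow an identical path; the function names and toy data are hypothetical, and the plots may have computed thickness differently.

```python
from collections import Counter
from itertools import accumulate


def outcome_path(outcomes):
    """Running average of a candidate's election outcomes, as a hashable tuple."""
    return tuple(total / n for n, total in enumerate(accumulate(outcomes), start=1))


def path_proportions(candidates):
    """Share of candidates following each distinct running-average path."""
    paths = Counter(outcome_path(outcomes) for outcomes in candidates)
    total = sum(paths.values())
    return {path: count / total for path, count in paths.items()}


# Toy data: two candidates lose then win, one wins outright, one loses once.
toy_candidates = [[-1, 1], [-1, 1], [1], [-1]]
print(outcome_path([-1, 1]))
print(path_proportions(toy_candidates))
```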

Upcoming posts will also include average election size (how many people are voting), who is nominating candidates, visualizations, time series, and more.

Wednesday, February 18, 2015

Tamburrini et al. Twitter users change word usage according to conversation-partner social identity. Soc Net. 40 (2015) 84-89

Wednesday, February 11, 2015

Trends in Wikipedia "Request for Adminship" Elections - Part 1

kjh