James Holland Jones
August 24, 2024
I decided to do a cocktail-napkin calculation this morning, as one does. Since Elon Musk bought the social-media platform formerly known as Twitter in October of 2022, there has been a clear shift to the right that many commentators have noted. Earlier this week, on the second day of the Democratic National Convention, Musk posted a Twitter poll about users’ preferences between the two presidential candidates, Harris and Trump. His poll, which collected nearly 5.85 million votes, revealed a 73% to 27% preference for Trump. Musk noted that the poll was “super unscientific,” but the vibe of the whole thing was that he was presenting the truth that the mainstream media is afraid to print. This has led some to worry that this is part of the ground game for claiming a rigged election if Trump loses.
Most polls indicate that the contest is, in fact, very tight, with the two candidates in a nearly even race, though the most recent polls suggest a small advantage for Harris. This implies that Musk’s poll differs from the overall consensus by around 20-25 percentage points.
As with many others, I wondered if maybe this poll is telling us more about the composition of Twitter than it is about the US presidential election. There is good evidence that the platform has shifted substantially to the right. Musk’s own posts have also shifted considerably toward right-wing topics.
In 2018, the Harvard statistician Xiao-Li Meng wrote a paper in which he coined the term the Big Data Paradox: "the more the data, the surer we fool ourselves." Appropriately enough for our current problem, Meng was trying to understand why the polls seemed to be so wrong in the 2016 US presidential election. Most polls had Trump losing and, of course, that didn't actually happen.
That was about the time that our collective infatuation with big data was also taking off. The thinking, which you don't have to look hard to find, is that with enough data we don't have to worry about tedious things like sampling design or measurement error. Generally there is some vague gesturing toward the Law of Large Numbers or the like.
Suppose we have a sample of size $n$ drawn from a finite population of size $N$.
Meng derived an identity that shows that there are three (and only three) factors that determine the estimation error for some quantity of interest. These are:
Data quantity represents both the size of your statistical sample ($n$) and the size of the population you want to learn about ($N$). It enters Meng's identity through the factor $\sqrt{(N-n)/n}$, which depends on $n$ and $N$ only through the sampling fraction $n/N$.
The difficulty of the inference problem is given by the standard deviation of your quantity of interest ($\sigma_G$): the more variable the underlying quantity, the harder its mean is to estimate.
The last quantity, data quality, is probably the most interesting. It is the correlation between the outcome ($G$) and the indicator of whether a particular unit is recorded in the sample ($R$), which we write $\rho_{G,R}$. When being recorded is unrelated to the outcome, this correlation is essentially zero; when the people who show up in the data differ systematically from those who don't, it is not.
Consider the sample mean, which we can write as

$$\bar{G}_n = \frac{\sum_{i=1}^{N} R_i G_i}{\sum_{i=1}^{N} R_i}.$$
Here $R_i$ is an indicator that takes the value 1 when unit $i$ is recorded in the sample and 0 otherwise, so that $\sum_{i=1}^{N} R_i = n$.
When we gather a traditional design-based or model-based sample (for more on the distinction, check out Steve Thompson's great book, Sampling), the $R_i$ are under the researcher's control. Under simple random sampling, the correlation between $G$ and $R$ is zero in expectation and of order $1/\sqrt{N}$ in any realized sample, so the error shrinks as the data grow.
We can put this all together to show that the estimation error (i.e., the difference between our sample estimate and the population value) is given by the following remarkably simple formula:

$$\bar{G}_n - \bar{G}_N = \rho_{G,R} \times \sqrt{\frac{N-n}{n}} \times \sigma_G.$$

If $\rho_{G,R} \neq 0$, the error no longer shrinks toward zero the way it does under probability sampling, while our nominal confidence intervals keep narrowing as $n$ grows. The bigger the sample, the more precisely we estimate the wrong thing: the more the data, the surer we fool ourselves.
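Meng's identity is exact, not an approximation, which is easy to check with a quick simulation. This is my own sketch, with a made-up population and selection mechanism, using finite-population (divide-by-$N$) moments throughout:

```r
# Verify Meng's identity on a simulated finite population in which
# selection into the sample is correlated with the outcome.
set.seed(1)
N <- 1e5
G <- rnorm(N)                          # outcome for each population unit
R <- as.integer(runif(N) < plogis(G))  # recording biased toward large G
n <- sum(R)

# Left-hand side: the realized estimation error
lhs <- mean(G[R == 1]) - mean(G)

# Right-hand side: data quality x data quantity x problem difficulty
sigma_G <- sqrt(mean((G - mean(G))^2))
sigma_R <- sqrt(mean((R - mean(R))^2))
rho_GR  <- mean((G - mean(G)) * (R - mean(R))) / (sigma_G * sigma_R)
rhs <- rho_GR * sqrt((N - n) / n) * sigma_G

all.equal(lhs, rhs)  # TRUE: the identity holds exactly
```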
We can apply this decomposition to Elon Musk’s poll.
I used the following values: 154.6 million people voted in the presidential election in 2020. Turnout has been trending up, so we can use an estimate of $N = 160$ million voters for 2024. Musk's poll recorded $n = 5.85$ million votes, and its 73% for Trump differs from the roughly 50% of the polling consensus by $0.23$. I assumed a problem difficulty of $\sigma_G = 0.05$. Solving Meng's identity for the data-quality term yields a correlation of $\rho_{G,R} \approx 0.9$.
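Plugging these numbers into the identity and solving for the data-quality correlation:

```r
n <- 5.85e6    # votes in Musk's poll
N <- 160e6     # estimated number of 2024 voters
d <- 0.23      # gap between the poll's 73% and the ~50% consensus
sigma <- 0.05  # assumed problem difficulty, sigma_G
d / (sqrt((N - n) / n) * sigma)
# [1] 0.8961155
```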
That seems like an absurdly high correlation. I was convinced that it had to be wrong, but if the input numbers are even approximately right, the correlation has to be very high to produce such a big gap between the polling consensus of a very close race and Musk's Twitter poll, with its margin of approximately +50 for Trump. Now, of course, this assumes that the polls are the better reflection of reality, and it was the failure of the polls in 2016 that led to Meng's paper in the first place. I am neither a political scientist nor a survey researcher, so I won't comment on the quality of the current polls. But seriously, does anyone really believe that Trump has a large lead over Harris in our current social and political climate? I certainly hope we've learned from our previous experience that big data do not guarantee correct inference, and they certainly do not obviate the need for proper sampling!
We can try various values of $\sigma_G$ to see how sensitive the implied correlation is to that assumption:

```r
# Solve Meng's identity for rho_GR given the error d, the sample and
# population sizes, and the problem difficulty sigma
calc_meng <- function(n = 5.85e6, N = 160e6, d = 0.23, sigma) {
  d / (sqrt((N - n) / n) * sigma)
}

sig <- seq(0.05, 0.2, length.out = 100)
plot(sig, calc_meng(sigma = sig), type = "l", lwd = 2, col = "blue4",
     xlab = expression(sigma[G]),
     ylab = expression(rho[GR]))
```
But let’s be clear here: the value of the correlation is so high because the difference between Musk’s poll and the general polling consensus is so huge. Note also that this difference is even greater if Harris is actually leading as suggested by recent national polls.
Meng (2018) found that a minuscule correlation of around $-0.005$ was enough to account for the 2016 polling errors: with that defect, a sample of 2.3 million responses (1% of the eligible voting population) carries no more information than a simple random sample of about 400 people. So what are we to make of a correlation near 0.9?
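Meng's framework also yields an effective sample size, the size of the simple random sample with the same mean squared error: $n_{\mathrm{eff}} \approx n / \big((N-n)\,\rho_{G,R}^2\big)$. Here is a quick sketch of what that implies; the first line uses Meng's illustrative 2016 numbers as I recall them, and the second uses my assumed inputs from above, so take the outputs with appropriate salt:

```r
# Effective sample size implied by a data-defect correlation rho:
# the SRS size whose mean squared error matches the biased sample's.
n_eff <- function(n, N, rho) n / ((N - n) * rho^2)

n_eff(n = 2.3e6, N = 231e6, rho = 0.005)  # 2016 illustration: ~400
n_eff(n = 5.85e6, N = 160e6, rho = 0.9)   # Musk's poll: less than 1 person
```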
Honestly, I don’t know. It certainly suggests that people who follow Musk (and therefore were more likely to see the poll) are strongly biased to the right. Does that mean that Twitter as a whole is? Maybe. The next question, of course, is how representative Musk’s 193M followers are of the platform as a whole. But the algorithmic amplification of the site’s most-followed user means that even people who don’t follow Musk probably saw the post. I guess the bare minimum take-away here is to be very skeptical of Twitter polls that are meant to shed light on features of the general population.
It does seem clear that the platform formerly known as Twitter has become a hotbed of misinformation. Here is a helpful timeline of the changes in Twitter and Musk’s posting habits. It’s sad to watch. There was a time there when Scientific Twitter was an amazing place for communicating results and connecting to other scientists and science writers. I miss it.