Using Facebook Data, Algorithm Predicts Personality Better Than Friends
sciencehabit writes: A new study of Facebook data shows that machines are now better at sussing out our true personalities than our friends. One of the standard methods for assessing personality is to analyze people's answers to a 100-item questionnaire with a statistical technique called factor analysis. There are five main factors that divide people by personality—openness, conscientiousness, extraversion, agreeableness, and neuroticism—which is why personality researchers call this test the Big Five. People can accurately predict how their friends will answer the Big Five questions. ... Compared with humans predicting their friends' personalities by filling out the Big Five questionnaire, the computer's prediction based on Facebook likes was almost 15% more accurate on average, the team reports online today in PNAS (abstract). Only people's spouses were better than the computer at judging personality.
Why? Why, with everything that everyone knows about Facebook, all the privacy violations, all the obvious signs that they really don't give a rat's ass about the users, just the money that users' data can earn them, would anyone still be using Facebook? Is it willful ignorance? Or is it deep denial? Now, we find out: Facebook can be, and is being, used to profile people. Come on, is this what you all really want?
Disregard Facebook. Take your life back.
Are YOU using the TOOL, or is the TOOL using YOU? Think about it!
The comment that the algorithm does better at predicting personality than a person's friends will depend very strongly on how you define a friend. I have a very large number of Facebook friends about whom I know almost nothing, so I am not at all surprised that an algorithm will do better.
I am a Statistician. One false move and you are a Statistic
I hope nobody will ever be able to use my reddit comments to predict my personality, ever!!!
This is not a signature!
I have a Facebook account because of family, but I make maybe one post a year there and I never like anything whatsoever. What does such an algorithm tell about me? I mean, it sounds to me like the algorithm is already biased towards a certain kind of person from the get-go if it only applies to socially outgoing people who enjoy "liking" stuff on Facebook.
The names of the factors are guesses. Factor analysis looks at the covariance matrix of the items and finds sub-matrices of the total matrix that meaningfully covary. Each of those sub-matrices is called a factor, or latent variable, which is measured by the common covariation among the questions. The number of latent factors found in a questionnaire is typically derived both from theory (we made a questionnaire intended to measure these 6 different things) and from empirical evidence (typically Horn's parallel analysis or the Kaiser criterion, which simply means retaining as many factors as there are eigenvalues of the correlation matrix greater than one). The factors are named for whatever commonality best fit the items first measured, along with external criteria such as predicting other theoretically related constructs. The Big 5 are an enormously well-studied problem space, and the stability and pervasiveness of these concepts have been well documented and linked to specific gene expressions, developmental trends, etc.
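For anyone who wants to see the two retention rules mentioned above in action, here is a minimal sketch using NumPy on simulated data. Everything in it is an assumption made for illustration: the function names, the two-factor toy questionnaire, the loadings of 0.8, and the 95th-percentile cutoff for parallel analysis. It is not the procedure used in the PNAS study.

```python
import numpy as np

def correlation_eigenvalues(data):
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def kaiser_count(data):
    """Kaiser criterion: number of eigenvalues greater than one."""
    return int(np.sum(correlation_eigenvalues(data) > 1.0))

def parallel_analysis_count(data, n_sims=200, percentile=95, seed=0):
    """Horn's parallel analysis: keep the leading factors whose eigenvalue
    exceeds the chosen percentile of eigenvalues obtained from random
    normal data of the same shape (cutoff is an assumed convention)."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    sims = np.array([
        correlation_eigenvalues(rng.standard_normal((n, p)))
        for _ in range(n_sims)
    ])
    threshold = np.percentile(sims, percentile, axis=0)
    observed = correlation_eigenvalues(data)
    count = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break  # stop at the first factor that random data could explain
        count += 1
    return count

def simulate_items(n_people=500, n_factors=2, items_per_factor=5, seed=1):
    """Hypothetical questionnaire: each item loads 0.8 on exactly one factor."""
    rng = np.random.default_rng(seed)
    n_items = n_factors * items_per_factor
    factors = rng.standard_normal((n_people, n_factors))
    loadings = np.zeros((n_factors, n_items))
    for f in range(n_factors):
        loadings[f, f * items_per_factor:(f + 1) * items_per_factor] = 0.8
    noise = 0.6 * rng.standard_normal((n_people, n_items))
    return factors @ loadings + noise

if __name__ == "__main__":
    items = simulate_items()
    print("Kaiser criterion:", kaiser_count(items))
    print("Parallel analysis:", parallel_analysis_count(items))
```

On this toy data both rules should agree on the two simulated factors. On real questionnaires they often disagree, which is one reason theory is used alongside them.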
Haven't you failed to read the article before claiming that it is wrong?
For those playing along at home, Fig.1 from the actual article explicitly refutes the AC's claim.
Every day a little gladder.
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
Except that the Big Five aren't orthogonal, which means they are fairly useless as a personality theory.
Nothing is going to be explicitly orthogonal, and forcing them to be doesn't make the conceptual issue you seem to have any better or worse (n.b., orthogonal connotes a lack of meaningful correlation between the factors. What the parent is complaining about is that each of the latent factors is meaningfully correlated with the other four to different extents). First, we are of course talking about an exploratory factor analysis (EFA) approach (haven't read the article but the 10-fold CV referenced above makes sense), and partly about the distinction between principal components and factor analysis. The Big 5 model itself has been tested using SEM and confirmatory factor analysis, and the structure of five interrelated but non-redundant latent factors has been validated repeatedly. Second, remember that EFA solved using maximum likelihood can be used to assess the null hypothesis that no more factors are necessary to produce acceptable fit within the sample. Thus (although this, from a statistical fishing perspective, would be bad) we can actually sequentially find the minimum number of factors necessary to reproduce a correlation matrix that is not significantly different from the original sample's. Therefore, with multiple independent studies (and k-fold CV like this study did) we can say that five is pretty well empirically demonstrated.
Now, the distinction between PCA and EFA. PCA is a technique explicitly designed to remove redundant covariation between items, and as such, the more dimensions you allow to represent the data, the better your overall fit. If you have nine items, nine principal components will capture 100% of the total nine-item variance. However, it may be that 1 PC captures 65% of the variance, 2 capture 90%, and the remaining 7 PCs make up the remaining 10%. EFA works with correlations, and as such the most variance that can be reproduced is not 100%, but instead something analogous to the signal-to-noise ratio in engineering. It's a technique designed to identify and structure signals within noisy data, and therefore by default it doesn't assume everything being input is actually pure signal. Again, we're not measuring one thing chopped into 5 bits (or two, three, etc.) but 5 different things that have been repeatedly found to best fit the data when tested simultaneously, therefore controlling for each other. That means that the structure found represents statistically distinct (though correlated) latents.
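As a side illustration of the PCA point above (nine components always account for all of the variance in nine items), here is a small NumPy sketch. The data are made up: nine hypothetical items driven by one strong and one weaker common source plus noise, with weights chosen arbitrarily, so the percentages it prints are not meant to match the 65%/90% figures quoted above.

```python
import numpy as np

def cumulative_variance_explained(data):
    """Cumulative share of total variance captured by the first k principal
    components, computed from the eigenvalues of the covariance matrix."""
    cov = np.cov(data, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    return np.cumsum(eigvals) / eigvals.sum()

# Hypothetical data: 1000 respondents, nine items (all weights are assumptions).
rng = np.random.default_rng(0)
n_people = 1000
strong = rng.standard_normal((n_people, 1))
weak = rng.standard_normal((n_people, 1))
strong_weights = np.ones((1, 9))
weak_weights = 0.6 * np.array([[1, -1, 1, -1, 1, -1, 1, -1, 1]])
noise = 0.4 * rng.standard_normal((n_people, 9))
items = strong @ strong_weights + weak @ weak_weights + noise

cum = cumulative_variance_explained(items)
for k, share in enumerate(cum, start=1):
    print(f"first {k} PC(s): {share:.1%} of total variance")
```

The cumulative share never decreases as components are added and reaches 100% (up to floating-point rounding) at the ninth component. That is the sense in which PCA fit always improves with more dimensions, whereas EFA only tries to reproduce the shared part of the variance.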
However, that is not to say that the five latent factors do not share commonality that is meaningful (although when you run these procedures, a correlation of .3-.5 is generally pretty high, meaning at most roughly 9-25% shared variance, i.e., an r² of .09 to .25, between factors). If interested, and you have a sample and the required number of parameters, you can build hierarchical factor models, in which common latents underlie multiple lower-level latents, which then underlie the observed item responses. Alternatively, you could even say that there is just one personality latent, let's call it `everything', and that only one latent underlies (it helps if you think of latents as causes of the observed variables/items) all 100 or whatever personality items, like in this study. There is a specific rotation procedure, the bifactor/Schmid–Leiman factor rotation.
What this will do is examine global model fit: the question of whether the regression slopes from the observed items to the common covariances meaningfully reproduce the sample's covariances, i.e., whether the data here empirically validate the correlational pattern we would expect if only one informational construct were represented (measured) in the data. Next (actually simultaneously), it will estimate whether, controlling for that one believed general latent factor, there are still meaningful latents estimable from the data. So, it's asking: are there still statistically significant relationships between items, once we've rem
Actually, I'm surprised that the algorithm doesn't outperform spouses as well.
Do any of your friends tirelessly catalog, index, analyze and correlate every chuckle or offhand comment you make within their earshot? Do you continue to talk freely in front of them, knowing they're doing it? If so, they can probably outperform this algorithm.
The real fun will come from correlating the physiological signals coming in from fitness bands, eye-trackers, and eventually EEG pickups. Your soul will be laid barer than lunar regolith.