How the NSA Identified Satoshi Nakamoto (medium.com)
An anonymous reader shares a report: The 'creator' of Bitcoin, Satoshi Nakamoto, is the world's most elusive billionaire. Very few people outside of the Department of Homeland Security know Satoshi's real name. In fact, DHS will not publicly confirm that even THEY know the billionaire's identity. Satoshi has taken great care to keep his identity secret, employing the latest encryption and obfuscation methods in his communications. Despite these efforts (according to my source at the DHS), Satoshi Nakamoto gave investigators the only tool they needed to find him -- his own words. Using stylometry, one is able to compare texts to determine authorship of a particular work. Throughout the years Satoshi wrote thousands of posts and emails, most of which are publicly available. According to my source, the NSA was able to use the 'writer invariant' method of stylometry to compare Satoshi's 'known' writings with trillions of writing samples from people across the globe. By taking Satoshi's texts and finding the 50 most common words, the NSA was able to break down his text into 5,000-word chunks and analyse each to find the frequency of those 50 words. This would result in a unique 50-number identifier for each chunk. The NSA then placed each of these identifiers into a 50-dimensional space and flattened them into a plane using principal components analysis. The result is a 'fingerprint' for anything written by Satoshi that could easily be compared to any other writing. The NSA then took bulk emails and texts collected from their mass surveillance efforts. First through PRISM and then through MUSCULAR, the NSA was able to place trillions of writings from more than a billion people in the same plane as Satoshi's writings to find his true identity. The effort took less than a month and resulted in a positive match.
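For anyone who wants to see what the steps in the summary amount to, here is a minimal sketch in pure Python. It is not the NSA's code or any known implementation: the function names, the tokenizer regex, and the use of power iteration for the principal components are all assumptions made for illustration. The defaults (50 words, 5,000-word chunks, 2 output dimensions) are the numbers quoted in the summary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(tokens, n):
    """Return the n most common words in the token list."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def chunk_vectors(tokens, vocab, chunk_size):
    """Build one relative-frequency vector per full chunk of chunk_size tokens."""
    vectors = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        vectors.append([counts[word] / chunk_size for word in vocab])
    return vectors


def principal_axes(vectors, k=2, iterations=200):
    """Estimate the top-k principal axes by power iteration with deflation."""
    dim = len(vectors[0])
    means = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    centered = [[v[j] - means[j] for j in range(dim)] for v in vectors]
    cov = [[sum(row[i] * row[j] for row in centered) / len(centered)
            for j in range(dim)] for i in range(dim)]
    axes = []
    for a in range(k):
        vec = [1.0 / (i + 1 + a) for i in range(dim)]  # arbitrary fixed start
        for _ in range(iterations):
            vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            for axis in axes:  # remove directions that were already found
                dot = sum(x * y for x, y in zip(vec, axis))
                vec = [x - dot * y for x, y in zip(vec, axis)]
            norm = math.sqrt(sum(x * x for x in vec))
            if norm == 0:  # no variance left in any remaining direction
                break
            vec = [x / norm for x in vec]
        axes.append(vec)
    return means, axes


def project(vectors, means, axes):
    """Project each centered vector onto the principal axes."""
    return [[sum((v[j] - means[j]) * axis[j] for j in range(len(axis)))
             for axis in axes] for v in vectors]


def fingerprint(text, n_words=50, chunk_size=5000):
    """Return the vocabulary and one 2-D point per chunk of the text."""
    tokens = tokenize(text)
    vocab = top_words(tokens, n_words)
    vectors = chunk_vectors(tokens, vocab, chunk_size)
    if not vectors:
        return vocab, []
    means, axes = principal_axes(vectors)
    return vocab, project(vectors, means, axes)
```

To compare two authors you would presumably fit the axes on the combined set of chunks and then check whether the points from each author overlap in the plane; this sketch fits them on a single text only.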
In a 50-dimensional space, "trillions" is still going to be fairly widely spread. Assuming your axes all go from 0 to 1 and that's it, and you avoid fractions, you've still got 2^50 nodes, which is on the order of a quadrillion, or about 1000 nodes per text block.
Sure, there's likely to be clustering, but it's not quite as inevitable as you're assuming from just the number of data sets.
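The back-of-envelope numbers in the comment above can be checked in a few lines. This is only the arithmetic, with "trillions" taken as exactly one trillion (an assumption; the summary gives no precise figure) and `corners_per_text` being a made-up helper name.

```python
def corners_per_text(dimensions, n_texts):
    """Number of 0/1 corner points in the space divided by the number of texts."""
    corners = 2 ** dimensions  # each axis is restricted to the values 0 or 1
    return corners, corners / n_texts


# Assumption: "trillions" is taken as 10**12 text blocks.
corners, ratio = corners_per_text(50, 10 ** 12)
print(corners)       # 1125899906842624, i.e. about 1.1 quadrillion
print(round(ratio))  # 1126, i.e. roughly a thousand corners per text block
```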
Sadia Afroz is the main public-sector researcher on this topic (stylometric machine learning).
In 2013 she gave a relevant introduction on stylometric analysis to track anonymous users in the underground; there is also a corresponding video on darknet user tracking through stylometry.
A while ago she commented, "Please do not ask me to deanonymize Satoshi," and gave her reasons.
I know it's out of fashion to read TFA, but you could have just scrolled right to the end:
"But why? Why go to so much trouble to identify Satoshi? My source tells me that the Obama administration was concerned that Satoshi was an agent of Russia or China—that Bitcoin might be weaponized against us in the future. Knowing the source would help the administration understand their motives. As far as I can tell Satoshi hasn’t violated any laws and I have no idea if the NSA determined he was an agent of Russia or China or just a Japanese crypto hacker."
Oh and also, this report is literally just a self-sourced blog post.
"Sources: Many readers have asked that I provide third party citations to ‘prove’ the NSA identified Satoshi using stylometry. Unfortunately, I cannot as I haven’t read this anywhere else—hence the reason I wrote this post. I’m not trying to convince the reader of anything, instead my goal is to share the information I received and make the reader aware of the possibility that the NSA can easily determine the authorship of any email through the use of their various sources, methods, and resources."