Troy Hunt: Inside the "3 Billion People" National Public Data Breach
I decided to write this post because there's no concise way to explain the nuances of what's being described as one of the largest data breaches ever. Usually, it's easy to articulate a data breach: a service people provide their information to had someone snag it through an act of unauthorised access and publish a discrete corpus of information that can be attributed back to that source. But in the case of National Public Data, we're talking about a data aggregator most people had never heard of, where a threat actor has published various partial sets of data with no clear way to attribute it back to the source. And they're already the subject of a class action, to add yet another variable into the mix. I've been collating information related to this incident over the last couple of months, so let me talk about what's known about the incident, what data is circulating and what remains a bit of a mystery.

Let's start with the easy bit: who is National Public Data (NPD)? They're what we refer to as a "data aggregator", that is, they provide services based on the large volumes of personal information they hold. From the front page of their website:

There are many legally operating data aggregators out there, and there are many that end up with their data in Have I Been Pwned (HIBP). For example, Master Deeds, Exactis and Adapt, to name but a few. In April, we started seeing news of National Public Data and billions of breached records, with one of the first references coming from the Dark Web Intelligence account:

"USDoD Allegedly Breached National Public Data Database, Selling 2.9 Billion Records https://t.co/emQIZ0lgsn pic.twitter.com/Tt8UNppPSu"

Back then, the breach was attributed to "USDoD", a name to remember as you'll see that throughout this post. The embedded image is the first reference of the 2.9B number we've subsequently seen flashed all over the press, and it's right there alongside the request of $3.5M for the data. Clearly, there is a financial motive involved here, so keep that in mind as we dig
further into the story. That image also refers to 200GB of compressed data that expands out to 4TB when uncompressed, but that's not what initially caught my eye. Instead, something quite obvious in the embedded image doesn't add up: if this data is "the entire population of USA, CA and UK" (which is 450M people in total), what's the 2.9B number we keep seeing? Because that doesn't reconcile with reports about "nearly 3 billion people" with social security numbers exposed. Further, SSNs are a rather American construct, with Canada having SINs (Social Insurance Numbers) and the UK having, well, NI (National Insurance) numbers are probably the closest equivalent. This is the constant theme you'll read about in this post: stuff just being a bit "off". But hyperbole is often a theme with incidents like this, so let's take the headlines with a grain of salt and see what the data tells us.

I was first sent data allegedly sourced from NPD in early June. The corpus I received reconciled with what vx-underground reported on around the same time (note their reference to the 8th of April, which also lines up with the previous tweet):

"April 8th, 2024, a Threat Actor operating under the moniker "USDoD" placed a large database up for sale on Breached titled: "National Public Data". They claimed it contained 2,900,000,000 records on United States citizens. They put the data up for sale for $3,500,000. National…"

In their message, they refer to having received data totalling 277.1GB uncompressed, which aligns with the sum total of the 2 files I received:

They also mentioned the data contains first and last names, addresses and SSNs, all of which appear in the first file above (among other fields):

These first rows also line up precisely with the post Dark Web Intelligence included in the earlier tweet. And in case you're looking at it and thinking "that's the same SSN repeated across multiple rows with different names", those records are all the same people, just with the names represented in different orders and with different addresses, all in the same
city. In other words, those 6 rows only represent one person, which got me thinking about the ratio of "rows" to "distinct numbers". Curious, I took 100M samples and found that only 31% of the rows had unique SSNs, so extrapolating that out, 2.9B would be more like 899M. This is something to always be conscious of when you read headline numbers: "2.9B" doesn't necessarily mean 2.9B people, it often means rows of data. Speaking of which, those 2 files contain 1,698,302,004 and 997,379,506 rows respectively, for a combined total of 2.696B. Is this where the headline number comes from? Perhaps; it's close, and it's also precisely the same as Bleeping Computer reported a few days ago.

At this point in the story, there's no question that there is legitimate data in there. From the aforementioned Bleeping Computer story:

And in vx-underground's tweet, they mention that:

A quick tangential observation in the same tweet:

Which is what you'd expect from a legally operating data aggregator service. It's a minor point, but it does support the claim that the data came from NPD.

Important: None of the data discussed so far contains email addresses. That doesn't necessarily make it any less impactful for those involved, but it's an important point I'll come back to later as it relates to HIBP.

So, this data appeared in limited circulation as early as 3 months ago. It contains a huge amount of personal information (even if it isn't "2.9B people"), and then to make matters worse, it was posted publicly last week:

"National Public Data, a service by Jerico Pictures Inc., suffered databreach. Hacker "Fenice" leaked 2.9b records with personal details, including full names, addresses, SSNs in plain text. https://t.co/fXY3SXEiKe"

Who knows who "Fenice" is and what role they play, but clearly multiple parties had access to this data well in advance of last week. I've reviewed what they posted, and it aligns with what I was sent 2 months ago, which is bad. But on the flip side, at least it has allowed services designed to protect data breach victims to get notices out to
them:

"Twice this week I was alerted my SSN was found on the web thanks to a data breach at National Public Data. Cool. Thanks guys. pic.twitter.com/FAlfNmXUqm"

Inevitably, breaches of this nature result in legal action which, as I mentioned in the opening paragraph, began a couple of weeks ago. It looks like a tip-off from a data protection service was enough for someone to bring a case against NPD:

Up until this point, pretty much everything lines up, but for one thing: where is the 4TB of data? And this is where it gets messy, as we're now into the territory of "partial" data. For example, this corpus from last month was posted to a popular hacking forum:

"National Public Database Allegedly Partially Leaked. It is stated that nearly 80 GB of sensitive data from the National Public Data is available. The post contains different credits for the leakage, and the alleged breach was credited to a threat actor "Sxul" and stressed that it… https://t.co/v8uq0o88NS pic.twitter.com/a6dn3MvYkf"

That's 80GB, and whilst it's not clear whether that's the size of the compressed or extracted archive, either way it's still a long way short of the full alleged 4TB. Do take note of the file name in the embedded image though, "peopledata935660398959524741.csv", as this will come up again later on.

Earlier this month, a 27-part corpus of data alleged to have come from NPD was posted to Telegram, this image representing the first 10 parts at 4GB each:

The compressed archive files totalled 104GB and contained what feels like a fairly random collection of data:

Many of these files are archives themselves, with many of those then containing yet more archives. I went through and recursively extracted everything, which resulted in a total corpus of 642GB of uncompressed data across more than 1k files. If this is "partial", what was the story with the 80GB "partial" from last month? Who knows, but in those files above were 134M unique email addresses.

Just to take stock of where we're at, we've got the first set of SSN data, which is legitimate and
contains no email addresses, yet is allegedly only a small part of the total NPD corpus. Then we've got this second set of data, which is larger and has tens of millions of email addresses, yet is pretty random in appearance. The burning question I was trying to answer is: is it legit?

The problem with verifying breaches sourced from data aggregators is that nobody willingly, knowingly provides their data to them, so I can't do my usual trick of just asking impacted HIBP subscribers if they'd used NPD before. Usually, I also can't just look at a data aggregator breach and find pointers that tie it back to the company in question due to references in the data mentioning their service. In part, that's because this data is just so damn generic. Take the earlier screenshot with the SSN data: how many different places have your first and last name, address, SSN, etc.? Attributing a source when there's only generic data to go by is extremely difficult.

The kludge of different file types and naming conventions in the image above worried me. Is this actually all from NPD? Usually, you'd see some sort of continuity, for example a heap of .json files with similar names or a swathe of .sql files with each one representing a dumped table. The presence of "peopledata935660398959524741.csv" ties this corpus together with the one from the earlier tweet, but then there's stuff like "Accuitty1012022.zip"; could that refer to Acuity (single "c", single "t"), which I wrote about in November? HIBP isn't returning hits for email addresses in that folder against the Acuity data I loaded last year, so no, it's a different corpus. But that archive alone ended up having over 250GB of data with almost 100M unique email addresses, so it forms a substantial part of the overall corpus of data.

The 3,608,086KB "criminalexport.csv.zip" file caught my eye, in part because criminal record checks are a key component of NPD's services, but also because it was only a few months ago we saw another breach containing 70M rows from a US criminal database. And see who that
breach was attributed to? USDoD, the same party whose name is all over the NPD breach. I did actually receive that data but filed it away and didn't load it into HIBP as there were no email addresses in it. I wonder if the data from that story lines up with the file in the image above? Let's check the archives:

Different file name, but hey, it's a 3,608,086KB file! Given the NPD breach initially occurred in April and the criminal data hit the news in May, it's entirely possible the latter was obtained from the former, but I couldn't find any mention of this correlation anywhere. (Side note: this is a perfect example of why I retain breaches in offline storage after processing, because they're so often helpful when assessing the origin and legitimacy of new breaches.)

Continuing the search for oddities, I decided to see if I myself was in there. On many occasions now, I've loaded a breach, started the notification process running, walked away from the PC, then received an email from myself about being in the breach. I'm continually surprised by the places I find myself in, including this one:

Dammit! It's an email address of mine, yet clearly none of the other data is mine. Not my name, not my address, and the obfuscated numbers definitely aren't familiar to me (I don't believe they're SSNs or other sensitive identifiers, but because I can't be sure, I've obfuscated them). I suspect one of those numbers is a serialised date of birth, but of the total 28 rows with my email address on them, the two unique DoBs put me as being born in either 1936 or 1967. Both are a long way from the truth.

A cursory review of the other data in this corpus revealed a wide array of different personal attributes. One file contained information such as height, weight, eye colour and ethnicity. The "uk.txt" file in the image above merely contained a business directory with public information. I could have dug deeper, but by now there was no point. There's clearly some degree of invalid data in here, there's definitely data we've seen appear separately as
a discrete breach, and there are many different versions of "partial" NPD data (although the 27-part archive discussed here is the largest I saw and the one I was most consistently directed to by other people). The more I searched, the more bits and pieces attributed back to NPD I found:

If I were to take a guess, there are two likely explanations for what we're seeing:

Both of these are purely speculative though, and the only parties that know the truth are the anonymous threat actors passing the data around and the data aggregator that's now being sued in a class action, so yeah, we're not going to see any reliable clarification any time soon. Instead, we're left with 134M email addresses in public circulation and no clear origin or accountability. I sat on the fence about what to do with this data for days, not sure whether I should load it and, if I did, whether I should write about it. Eventually, I decided it deserved a place in HIBP as an unverified breach, and per the opening sentence, this blog post was the only way I could properly explain the nuances of what I discovered. This way, impacted people will know if their data is floating around in this corpus, and if they find this information unactionable, then they can do precisely what they would have done had I not loaded it: nothing.

Lastly, I want to re-emphasise a point I made earlier on: there were no email addresses in the social security number files. If you find yourself in this data breach via HIBP, there's no evidence your SSN was leaked, and if you're in the same boat as me, the data next to your record may not even be correct. And no, I don't have a mechanism to load additional attributes beyond email address into HIBP, nor point people in the direction of the source data (some of you will have received a reminder about why I don't do that just a few days ago). And I'm definitely not equipped to be your personal lookup service, manually trawling through the data and pulling out individual records for you! So, treat this as informational only, an
intriguing story that doesn't require any further action.

Hi, I'm Troy Hunt. I write this blog, create courses for Pluralsight and am a Microsoft Regional Director and MVP who travels the world speaking at events and training technology professionals.
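As a footnote to the rows-versus-people point made earlier in the post, here is a minimal Python sketch of that arithmetic. The row counts, the 31% figure and the 2.9B headline number are taken from the post; the function names (`unique_ratio`, `estimate_distinct`) and the sample values are made up for illustration, and this is not the tooling actually used for the analysis. It is a rough, linear extrapolation: the share of distinct values in a sample tends to be at least as high as in the full set, so the result is best read as a ballpark figure.

```python
from typing import Iterable

# Row counts of the two files, as quoted in the post.
FILE_ROW_COUNTS = (1_698_302_004, 997_379_506)

# Headline record count and the sampled share of rows with a unique SSN,
# both as quoted in the post.
HEADLINE_ROWS = 2_900_000_000
SAMPLED_UNIQUE_SHARE = 0.31


def unique_ratio(values: Iterable[str]) -> float:
    """Return distinct values divided by total rows (0.0 for no rows)."""
    total = 0
    seen = set()
    for value in values:
        total += 1
        seen.add(value)
    return len(seen) / total if total else 0.0


def estimate_distinct(total_rows: int, ratio: float) -> int:
    """Scale a sampled unique ratio up to a full row count (linear guess)."""
    return round(total_rows * ratio)


# Combined rows across both files: 2,695,681,510, i.e. roughly 2.696B.
total_rows = sum(FILE_ROW_COUNTS)

# 2.9B rows at 31% unique works out to 899,000,000.
headline_estimate = estimate_distinct(HEADLINE_ROWS, SAMPLED_UNIQUE_SHARE)

# Hypothetical sample: six rows sharing one (made-up) identifier,
# mirroring the "6 rows, one person" observation in the post.
sample = ["000-00-0001"] * 6
sample_ratio = unique_ratio(sample)

print(f"{total_rows:,} rows in total")
print(f"{headline_estimate:,} estimated distinct SSNs")
print(f"{sample_ratio:.3f} unique ratio in the six-row sample")
```

On a real file, `unique_ratio` would be fed the SSN column of a sample rather than a hard-coded list, and holding every value in a set only works while the sample fits in memory.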
This work is licensed under a Creative Commons Attribution 4.0 International License. In other words, share generously but provide attribution.

Opinions expressed here are my own and may not reflect those of others. Unless I'm quoting someone, they're just my own views.