Chinese start-up leaked 400GB of scraped data exposing 200+ million Facebook, Instagram and LinkedIn users

High-flying and rapidly growing Chinese social media management company Socialarks has suffered a huge data leak, exposing over 400GB of personal data, including that of several high-profile celebrities and social media influencers.

The company’s unsecured ElasticSearch database contained personally identifiable information (PII) from at least 214 million social media users around the world, drawn from both popular consumer platforms such as Facebook and Instagram and professional networks such as LinkedIn.

The Elastic instance was discovered as part of Safety Detectives’ cybersecurity mission of discovering online vulnerabilities that could potentially pose risks to the general public. Once the owner of the data is identified, our team then informs the affected parties as soon as possible to mitigate the risk of any cybersecurity breaches and server leaks.

In Socialarks’ case, our team found the ElasticSearch server to be publicly exposed without password protection or encryption, during routine IP-address checks on potentially unsecured databases.

The lack of security apparatus on the company’s server meant that anyone in possession of the server IP-address could have accessed a database containing millions of people’s private information.
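For administrators who want to confirm that a server they own is not exposed in this way, a minimal sketch in Python follows. It assumes the Elasticsearch instance listens on its default HTTP port 9200 and that, once security is enabled, unauthenticated requests are rejected with a 401 or 403 status. The function names and the example address are hypothetical and chosen for illustration; this is not our team’s tooling.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Interpret the HTTP status returned to a request sent WITHOUT credentials."""
    if status == 200:
        return "open"        # the server answered without asking for credentials
    if status in (401, 403):
        return "protected"   # authentication or authorisation is required
    return "unknown"         # anything else needs a manual look


def check_endpoint(url, timeout=5.0):
    """Send one unauthenticated GET to `url` and classify the outcome."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still useful.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"


if __name__ == "__main__":
    # Hypothetical address: replace with a server you own.
    print(check_endpoint("http://127.0.0.1:9200/"))
```

A result of "open" means the endpoint answered an anonymous request, which is the condition described above. Run such a check only against servers you own or are authorised to test.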

According to Anurag Sen, head of the Safety Detectives cybersecurity team, the affected database contained a “huge trove” of sensitive personal information to the tune of 408GB and more than 318 million records in total.

Given the sheer size of the data leak, it has been extremely challenging for our team to establish the full extent of the potential damage caused.

Our research team was able to determine that the entirety of the leaked data was “scraped” from social media platforms, which is both unethical and a violation of Facebook’s, Instagram’s and LinkedIn’s terms of service.

Moreover, it is important to note that Socialarks suffered a similar data breach in August 2020 leading to data from 150 million LinkedIn, Facebook and Instagram users being exposed.

Almost as a carbon-copy, August’s database breach revealed reams of personal data from 66 million LinkedIn users, 11.6 million Instagram accounts and 81.5 million Facebook accounts.

From the leaked data we discovered, it was possible to determine people’s full names, country of residence, place of work, position, subscriber data and contact information, as well as direct links to their profiles.

Who are Socialarks?
Headquartered in both Shenzhen and Xiamen, Socialarks is a sprawling company with more than 10 regional branches spread across China, including populous hotspots such as Beijing, Shanghai, Shenzhen, Guangzhou, Ningbo and Suzhou.

The company was founded in 2014 by Jinbin Sun, who still serves as its CEO, as an “efficient foreign trade transaction solution” through Shenzhen Benniao Social Technology Co., Ltd.

According to Socialarks, the company is a “cross-border social media management company dedicated to solving the current problems of brand building, marketing, marketing, social customer management in China’s foreign trade industry”.

The company uses a data management platform (DMP) to carry out automated and precise marketing for various enterprises across China and has deployed both iOS and Android applications in recent years.

What was leaked?
Socialarks’ server contained scraped profiles of more than 214 million social media users, obtained from Facebook, Instagram and LinkedIn.

The database contained more than 408GB of data and more than 318 million records.

Without any protection whatsoever, our research team discovered the following:

11,651,162 Instagram user profiles
66,117,839 LinkedIn user profiles
81,551,567 Facebook user profiles
a further 55,300,000 Facebook profiles which were summarily deleted within a few hours after our team first discovered the server and its vulnerability.
Surprisingly, the numbers of affected profiles found by our team match those reported in the August data leak. However, there were significant differences, such as the size of the database, the companies hosting the servers and the number of indices.
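The profile counts listed above can be added up to check them against the headline figure of at least 214 million affected users. The short Python sketch below does only that arithmetic; the variable and function names are ours, chosen for illustration.

```python
# Profile counts exactly as listed in this report.
PROFILE_COUNTS = {
    "Instagram": 11_651_162,
    "LinkedIn": 66_117_839,
    "Facebook": 81_551_567,
    "Facebook (deleted shortly after discovery)": 55_300_000,
}


def total_profiles(counts):
    """Sum the per-platform profile counts."""
    return sum(counts.values())


total = total_profiles(PROFILE_COUNTS)
without_deleted = total - PROFILE_COUNTS["Facebook (deleted shortly after discovery)"]

print(f"{total:,}")            # 214,620,568
print(f"{without_deleted:,}")  # 159,320,568
```

The full total is consistent with the figure of at least 214 million users, while the sum without the deleted Facebook profiles is the combined count of the three figures that also appear in the August leak.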

The affected server, hosted by Tencent, was segmented into indices in order to store data obtained from each social media source. Our team discovered records from 3 major social media platforms: Instagram, Facebook and LinkedIn.

Instagram data
The Instagram index contained various popular personalities and online celebrities.

Our team discovered several high-profile influencers in the exposed database, including prominent food bloggers, celebrities and other social media influencers.

Celebrity Instagram profile including phone number and email address.

Every record contained public data scraped from influencer Instagram accounts, including their biographies, profile pictures, follower totals, location settings as well as personal information such as contact details in the form of email addresses and phone numbers.

The Instagram records exposed the following details:

Full name
Phone numbers for 6+ million users
Email addresses for all 11+ million users
Profile link
Username
Profile picture
Profile description
Average comment count
Number of followers and following count
Country of location
Specific locality in some cases
Frequently used hashtags
Facebook data
As mentioned above, the leak exposed 81.5 million Facebook user profiles with over 40 million exposed phone numbers and a further 32 million email address entries. Notably, most of the phone numbers our team discovered originated from pages and not individuals.

The Facebook records exposed the following details:

Full name
‘About’ text
Email addresses
Phone numbers
Country of location
Like, Follow and Rating count
Messenger ID
Facebook link with profile pictures
Website link
Profile description
LinkedIn data
Finally, our team discovered 66.1 million LinkedIn user profiles with as many as 31 million leaked email addresses (not disclosed in the profile but obtained through other, as yet unknown, sources).

The LinkedIn records exposed the following details:

Full name
Email addresses
Job profile including job title and seniority level
LinkedIn profile link
User tags
Domain name
Connected social media account login names e.g., Twitter
Company name and revenue margin
Database search showing 66 million LinkedIn profile results including personal information such as job title, name and email address.

The chart below shows a sample breakdown of user-profiles, sorted by country, from a sample of 42 million records.

Unexplained presence of Instagram and LinkedIn personal data
Socialarks’ database contained scraped data including personal information, although the user data was only partially complete.

However, according to our findings, Socialarks’ database stored personal data for Instagram and LinkedIn users such as private phone numbers and email addresses for users that did not divulge such information publicly on their accounts. How Socialarks could possibly have access to such data in the first place remains unknown.

Also, the fact that such a large, active, and data-rich database was left completely unsecured (probably for a second time) is astonishing.

It remains unclear how the company managed to obtain private data from multiple secure sources.

Instagram profile showing email and phone number despite information not being provided to Instagram.

It is also worth noting that Socialarks is based in China and was founded with private venture capital in 2014, while the vulnerable server is located in Hong Kong.

Number of records leaked: 318+ million
Number of affected users: Approx. 214 million
Size of data breach: 408 gigabytes
Server location: Hong Kong (hosted by Tencent)
Company location: Xiamen, Fujian, China
Our cybersecurity team discovered the server vulnerability on 12 December 2020 and contacted Socialarks as soon as they were confirmed as the server owners on 14 December 2020. The company did not respond to our correspondence but the server was secured on the same day.

Data breach impact
Data scraping is a means of extracting information, which may include personal data, from a website.

Aided by the rapid sprawl of seamlessly connected online services and platforms, data scraping has become commonplace online, given the value of the information being obtained and the fact that the practice is legal if authorised by the user as part of agreeing to terms of use.

Most data scraping is completely innocuous and is carried out by web developers, business intelligence analysts and honest businesses such as travel booking sites, as well as for online market research. However, even if such data is obtained legally, storing it without adequate cybersecurity can lead to large leaks affecting millions of people.

Importantly, data scraping is deemed to be a legal practice if it is done ethically. In Socialarks’ case, private information was obtained from multiple sources and supplemented with scraped data. Moreover, the company’s server was left completely unsecured.
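As an illustration only: one common convention for well-behaved scrapers is to honour a site’s robots.txt file before fetching a page. The sketch below uses Python’s standard urllib.robotparser module. The robots.txt content, the bot name and the URLs are hypothetical, and passing this check does not by itself make scraping lawful or compliant with a platform’s terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/public/page"))      # True
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/profile"))  # False
```

A scraper that skips this kind of check, or ignores its result, is one sign that data is not being collected in the spirit the site owner intended.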

When private information including phone numbers, email addresses and birth information is extracted and/or leaked, criminals are empowered to commit heinous acts including identity theft and financial fraud.

Considering that data scrapes are conducted by automated bots, it can mean millions of innocent users can have their information collected, stored (and potentially leaked) within a short period, without even being aware of it.

Social media platforms like Facebook allow users to access third-party websites by using their existing Facebook login information. However, if security protocols are not properly instituted, hackers can deploy so-called “scraper bots” to extract private information.

In some cases, scraping is carried out with the specific goal of extracting personal information for criminal purposes. Potential ramifications of exposing personal information include identity theft and financial fraud conducted across other platforms including online banking.

Contact information can be used to target people with tailored scams, including personalised emails that contain other personal information about the target, thereby gaining their trust and setting the stage for a deeper intrusion into their privacy.

Users can also be targeted with links that lead to phishing pages or to the installation of malware.

Personal information such as first and last name, physical and email address and mobile phone number can be weaponized by nefarious hackers to launch “mass attacks”.

Moreover, large tranches of data such as the one leaked by Socialarks can then be sold or provided to other malicious parties, thereby making the potential ramifications of data scraping even more wide-ranging and severe.

Celebrities and high-profile figures face some of the highest risks from cybersecurity flaws and leaked data.

If personal information belonging to celebrities is leaked, it opens the door to a range of criminal activity including blackmail, extortion and stalking. High-profile individuals can attract audiences in their millions, and with a higher degree of interest comes a higher degree of cybersecurity or real-life risk.

Preventing Data Exposure
How can you prevent your personal information from being exposed in a data leak and ensure that you are not a victim of attacks – cyber or real-world – if it is leaked?

Be cautious of what information you give out and to whom
Check that the website you are on is secure (look for https and/or a closed lock)
Only give out what you feel confident cannot be used against you (avoid government ID numbers, personal preferences that may cause you trouble if made public, etc.)
Create secure passwords by combining letters, numbers, and symbols
Do not click links in emails unless you are sure that the sender is legitimately who they represent themselves to be
Double-check any social media accounts (even ones you no longer use) to ensure that your posts and personal details are visible only to people you trust
Avoid using