Data scraping: A (mostly) legal way to harvest your information

When you post information or images of yourself publicly online, there is always the risk that someone will record that information and use it in some way. But the practice of data scraping, in which large amounts of public information are collected in an automated fashion, has made this possibility almost a guarantee.

What is data scraping?

With data scraping, machines are used to record information that was meant for human eyes. This happens most commonly in the form of web scraping, where an algorithm copies data from a web page while posing as a human. 
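To make this concrete, here is a minimal sketch of what a scraper does once it has a page in hand. It uses only Python's standard library and works on a hard-coded HTML string, so nothing is downloaded. The page markup, the class names ("route", "price"), and the FareScraper helper are all made up for illustration; a real scraper would have to be written against whatever markup the target site actually uses.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a page written for human visitors.
# A real scraper would first download this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <h1>Flight deals</h1>
  <ul>
    <li class="fare"><span class="route">NYC-LON</span> <span class="price">$420</span></li>
    <li class="fare"><span class="route">NYC-PAR</span> <span class="price">$455</span></li>
  </ul>
</body></html>
"""


class FareScraper(HTMLParser):
    """Collects the text of <span class="route"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self._field = None   # which field the parser is currently inside
        self._current = {}   # fields gathered for the current list item
        self.fares = []      # one dict per <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("route", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Text outside the spans we care about is ignored.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.fares.append(self._current)
            self._current = {}


def scrape_fares(html):
    """Turn human-readable HTML into structured records."""
    parser = FareScraper()
    parser.feed(html)
    parser.close()
    return parser.fares


if __name__ == "__main__":
    for fare in scrape_fares(SAMPLE_PAGE):
        print(fare["route"], fare["price"])
```

The point of the sketch is the shape of the work: the page was laid out for a person to read, and the scraper turns it back into rows of data.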

Web scrapers are commonly used by companies to keep tabs on their competitors’ websites, scanning for new updates, inventory changes, and price fluctuations. Travel sites scrape data from different airline and hotel websites to show users price comparisons. Some retailers also scrape Twitter and review sites like Yelp for sales leads.

But more recently, data scraping has been used to copy publicly available information about individuals on social media en masse. While this information was never a secret to begin with, attackers using data scraping have been able to create large, organized collections of the data for sale.

Data scraping vs. web crawling vs. hacking

Search engines like Google use web crawlers to discover and record pages on the internet so people can search for them. It’s a symbiotic relationship between web crawlers and websites: Google wants to know what content websites have to offer its users, and website owners (usually) want those users to be able to find them easily.
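One common way this cooperation is expressed, though not something this article covers, is a robots.txt file: a plain-text list of paths a site asks automated visitors to stay away from. The sketch below shows how a well-behaved crawler might consult those rules using Python's standard urllib.robotparser module. The robots.txt content, the "ExampleBot" name, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this
# from the site's /robots.txt URL before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /checkout/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set the crawler can query."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed(rules, url, user_agent="ExampleBot"):
    """Return True if the site's rules permit this bot to fetch the URL."""
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_rules(ROBOTS_TXT)
    for url in (
        "https://example.com/products/1",
        "https://example.com/account/settings",
    ):
        print(url, "allowed" if allowed(rules, url) else "disallowed")
```

Note that these rules are a request, not a lock. A crawler that wants to stay on good terms with a site honors them; nothing technically stops a scraper from ignoring them.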

Data scrapers, meanwhile, can be thought of as parasites. They are not customers and provide no value back to the website. Deployed on a massive scale, they can overload web servers and slow down websites for legitimate users. Ever had to solve a CAPTCHA to “prove you’re not a robot”? It’s partly to prevent data scraping.

It’s not that websites don’t want any other machines touching their data. Many websites provide APIs, or application programming interfaces, software that lets legitimate apps and their algorithms access databases without clogging up the pipes for customers. But when a program doesn’t use an API and instead attempts to parse data off a public-facing web page, that’s data scraping.
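The difference is easier to see side by side. In this rough sketch, the same price is read once from a structured response of the kind an API might return, and once by picking it out of page markup. Both payloads, the "price_usd" field name, and the two helper functions are assumptions made up for the example, not any real site's format.

```python
import json
import re

# Hypothetical API response: structured data intended for programs.
API_RESPONSE = '{"item": "Headphones", "price_usd": 59.99, "in_stock": true}'

# Hypothetical web page fragment: markup intended for human readers.
WEB_PAGE = (
    '<div class="product"><h2>Headphones</h2>'
    "<p>Price: <b>$59.99</b> (in stock)</p></div>"
)


def price_from_api(body):
    """Read the price from a documented, machine-readable field."""
    return json.loads(body)["price_usd"]


def price_from_page(html):
    """Scrape the price by pattern-matching the first dollar amount."""
    match = re.search(r"\$(\d+(?:\.\d{2})?)", html)
    if match is None:
        # Scrapers break whenever the page layout or wording changes.
        raise ValueError("price not found; the page layout may have changed")
    return float(match.group(1))


if __name__ == "__main__":
    print(price_from_api(API_RESPONSE))
    print(price_from_page(WEB_PAGE))
```

The API version depends on a field the site chose to expose. The scraping version depends on guessing at the page's layout, which is why it is both more fragile and harder for the site to control.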

Left unchecked, data scraping can be a huge problem for companies and their customers, on a scale that’s beginning to rival that of more traditional hacks and data breaches.

There are also nuances when it comes to the difference between hacking and data scraping. Hacking is analogous to theft: An attacker gains access to data that was protected somehow, usually by a password.

Data scraping is morally fuzzier. The data in question was technically out in the open already. For example, airlines already make their airfares public to help potential customers, so if a competitor’s bot wants the same info, is it really “stealing”?

Is data scraping legal?

Web scraping is legal, in theory. Let’s say you are copying and pasting text from a free resource like Wikipedia and decide to write an automated script to make your job easier. This is perfectly legal and doesn’t hurt anyone.

Many websites, however, have terms of service that explicitly prohibit data scraping, but the consequences of violating them can vary dramatically. If the scraping was small in scale, you may simply lose access to their service. But you may also face legal action, especially if the scraping was large-scale enough to impact their bottom line.

This is what happened when eBay sued Bidder’s Edge, a service that aggregated auction data scraped from eBay and generated approximately 100,000 extra server requests per day. eBay argued that Bidder’s Edge had committed “trespass to chattels” by interfering with its business, and the case ended in an undisclosed settlement in eBay’s favor.

Other companies have since fought data scrapers in court, notably Craigslist (v. Padmapper), QVC (v. Resultly), and LinkedIn (v. hiQ), with mixed results for the websites involved.

Data scraping hurts individual privacy

Until recently, scraping was mainly a problem for businesses. But when it comes to social media—where “the product is you”—data scraping can be a real problem for personal privacy.

Earlier this year, personal data from more than 533 million Facebook users, including phone numbers, email addresses, and full names, appeared on a hacking forum. Unlike other major data breaches, this data hadn’t been “hacked,” per se. Until 2019, it was publicly available through a loophole in Facebook’s contact import feature and was simply scraped.

Perhaps the most controversial application of data scraping comes from a company called Clearview AI. A joint venture from an Australian tech developer and an American politician, Clearview uses facial recognition technology to provide police departments with access to a database of over 3 billion photos of faces scraped from social media. Input a photo of a suspect’s face, and the output is every available post containing that face. 

Police say Clearview’s product is extremely effective at catching criminals, especially those who don’t appear in official law enforcement databases. Stagnant cases have been solved in mere minutes because the suspect happened to appear in the background of a friend’s recent photo on Facebook.

Clearview claims its database of more than 3 billion photos is fair game because each one was publicly available on the internet at the time it was scraped. If you don’t want your photos to appear in its database, the company says, simply set your sharing settings to “private.”

But, of course, that won’t retroactively delete your photos that have already been scraped. It also doesn’t help people whose faces simply appear in the background of another user’s photo. And with millions of people posting photos to social media every second, that’s getting harder and harder to avoid.

Apart from limiting the photos and personal details you put out there, there is little you can do to prevent information about you that is already online from being scraped.

The future of data scraping

For now, regulations haven’t caught up with the practice of data scraping, but there are signs of legal pushback. Australian authorities recently ordered Clearview to remove photos of Australians from its databases. Clearview claims the order lacks jurisdiction because Clearview doesn’t “do business” in Australia. But with a database of billions of human faces, laws based on physical borders are difficult to enforce.

Will traditional legislation be enough to rein in the effects of data scraping on personal privacy? It’s an open question, and unfortunately one that is likely to be tested again and again.
