Data scraping

What is data scraping?

Data scraping refers to the practice of using automated tools to gather large amounts of data from websites and other digital platforms. The collected datasets are often used to support market research, machine learning (ML) applications, and more.

How does data scraping work?

Scraper bots send automated requests to the platforms they seek to gather information from. Most use lightweight tools that communicate directly with websites, but some use automated browser tools that load and render web pages much like a human user.

From there, they parse data received from the target and locate whatever information is being sought. This is then extracted and stored, either as the original code (usually HTML) or adapted into a structured format (like a spreadsheet).

[Figure: A diagram showing the steps involved in data scraping.]
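The parse, extract, and store steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the HTML snippet, the `product`, `name`, and `price` class names, and the helper names `ProductParser`, `scrape_products`, and `to_csv` are all hypothetical, and the page is a hard-coded string rather than a live response. A real scraper would first request the page from the target.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would receive this in an HTTP response.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per product found so far
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self.rows.append({})  # start a new record
        elif tag == "span" and attrs.get("class") in ("name", "price") and self.rows:
            self._field = attrs["class"]

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_products(html):
    """Parse the HTML and return the located records as a list of dicts."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Store the extracted records in a structured, spreadsheet-friendly format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(scrape_products(SAMPLE_HTML)))
```

The sketch uses only Python's standard library. Scrapers that render pages in an automated browser follow the same parse-and-extract pattern, just with a different way of obtaining the HTML.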

Data scraping uses

Data obtained through scraping can be used to:

  • Enable rapid market intelligence: Organizations may use data scraping techniques to monitor publicly available information such as product availability and pricing.
  • Power AI development: Data scrapers can efficiently create the massive and diverse datasets that ML systems require for training.
  • Generate leads: Sales or marketing teams may gather dispersed but publicly available information to build lists of possible customers.

Misuse of data scraping

Data scraping is sometimes used illegally or unethically by cybercriminals to harvest personal information. For example, threat actors may scrape forums and community platforms for contact information, which they then use to target spear phishing or spam campaigns.

Legal, ethical, and privacy considerations

Note: This information is for general educational purposes and does not constitute legal advice.

Organizations considering scraping should assess applicable laws and contractual restrictions before proceeding. While responsible scrapers respect the law, user privacy, and targeted services’ terms of service, scraping can result in the unintentional collection of sensitive data.

The following ethical and privacy considerations apply:

  • Informed consent: Users may object to their data being harvested and used for other purposes, even if they’ve posted it publicly.
  • Personal privacy: Harvested data may be used in profiling and can facilitate phishing attacks.
  • Legal exposure: Scraping activities could potentially violate laws such as the General Data Protection Regulation (GDPR) by gathering personal data, whether intentionally or not.
  • Copyright infringement: Scraped content may include copyrighted material or proprietary databases, which may be protected under intellectual property laws.
  • Resource burden: High-volume requests from scrapers strain the infrastructure of targeted sites, which is why many platforms ban the practice in their terms of service.

FAQ

What is the difference between web scraping and data scraping?

Though the terms are often used interchangeably, "web scraping" and "data scraping" can mean different things. The distinction isn’t universally agreed upon, but data scraping generally refers to the automated collection of any kind of digital information, while web scraping is often used more narrowly for data extracted from websites, typically from their HTML.

Is data scraping legal?

The legality of data scraping depends on jurisdiction, the type of data collected, how it is collected, and how it is used. Scraping publicly accessible information may be lawful in some jurisdictions, but it can still violate a website’s terms of service or applicable data protection laws. It’s best to seek legal advice if considering data scraping.

How can personal data be protected from scraping?

Personal data is less likely to be scraped when public visibility on social media and websites is limited and sensitive details (like phone numbers or email addresses) aren’t posted publicly. Data that’s already online can be removed from personal profiles, and data deletion requests or data removal services can help take down information on third-party sites.

What are best practices for ethical data scraping?

Ethical data scraping follows websites’ rules and applicable laws, and it respects personal privacy. Developers should follow robots.txt files and site terms (where applicable), use rate limiting to avoid overloading servers, and exclude personal identifiers whenever possible. Where a target offers an official API, using it to gather data is often a better option than scraping. Organizations or individuals considering scraping should consult legal counsel to ensure compliance with applicable laws and contractual obligations.
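As a rough illustration of these practices, the sketch below checks robots.txt rules, pauses between requests, and strips email addresses from collected text. It rests on several assumptions: the robots.txt content, the bot name `example-research-bot`, and the helper names are hypothetical, `fetch` stands in for whatever function actually downloads a page, and the email pattern is deliberately simple and would not catch every personal identifier.

```python
import re
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the target site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-research-bot"  # hypothetical bot name

# Simplistic pattern for illustration only; it will not match every email address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls):
    """Keep only the URLs that robots.txt permits this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(USER_AGENT, url)]


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[redacted]", text)


def polite_fetch_all(rules, urls, fetch, default_delay=1.0, sleep=time.sleep):
    """Fetch permitted URLs one at a time, pausing between requests.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the page text.
    """
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    results = []
    for index, url in enumerate(allowed_urls(rules, urls)):
        if index > 0:
            sleep(delay)  # rate limiting: wait before every request after the first
        results.append(redact_emails(fetch(url)))
    return results
```

Passing `fetch` and `sleep` in as parameters keeps the sketch free of network access and makes the pacing logic easy to test. Honoring robots.txt and a crawl delay does not by itself make scraping lawful or compliant with a site's terms of service.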