Scraping Longitudinal Website Data

Welcome to the main website for the paper “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data”, by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.

This website consists of two main parts: our codebase, which offers a general-purpose Python setup for scraping historical websites using the Wayback Machine, and our open-access CompuCrawl database, which was built using the four-step approach described in the paper. The database contains websites of North American firms in the Compustat database between 1996 and 2020, covering 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages.
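To give a rough sense of what querying the Wayback Machine for longitudinal data can look like, here is a minimal sketch. It is not taken from our codebase: it assumes the Wayback Machine's public CDX server endpoint and its JSON output (a header row followed by one row per capture), and the function names `build_cdx_query` and `snapshots_from_cdx` are hypothetical names chosen for illustration.

```python
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine (assumed here; not part of the paper's codebase).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, first_year=1996, last_year=2020):
    """Build a CDX query URL asking for roughly one successful capture per year."""
    params = [
        ("url", domain),
        ("output", "json"),
        ("from", str(first_year)),
        ("to", str(last_year)),
        ("filter", "statuscode:200"),  # keep only captures that returned HTTP 200
        ("collapse", "timestamp:4"),   # collapse on the first 4 timestamp digits (the year)
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshots_from_cdx(rows):
    """Turn CDX JSON rows (header row first) into year/snapshot-URL records."""
    if not rows:
        return []
    header, *records = rows
    snapshots = []
    for record in records:
        entry = dict(zip(header, record))
        snapshots.append({
            "year": int(entry["timestamp"][:4]),
            # Archived copies are served under /web/<timestamp>/<original URL>.
            "snapshot_url": "https://web.archive.org/web/"
                            f"{entry['timestamp']}/{entry['original']}",
        })
    return snapshots
```

The query URL can be fetched with any HTTP client and the decoded JSON passed to `snapshots_from_cdx`. Keeping URL construction and parsing separate from the network call makes both easy to test offline.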

For frequently asked questions about the codebase and database, please refer to our FAQ.

Authors

Richard F.J. Haans. Associate Professor. Rotterdam School of Management, Erasmus University Rotterdam, The Netherlands. Email: haans@rsm.nl

Marc J. Mertens. Research Associate and PhD Candidate. University of Mannheim, Germany. Email: marc.mertens@uni-mannheim.de