Codebase
This section contains the latest version of our codebase for scraping longitudinal website data using the Wayback Machine. For frequently asked questions about the code and database, please refer to the FAQ.
Setup instructions
- Before running any code, please read through the entire documentation, provided here.
- Ensure that the working directory is the directory containing the scripts; the scripts include checks to verify this.
- Create four empty folders named ‘HTML’, ‘TXT_uncleaned’, ‘TXT_cleaned’ and ‘TXT_combined’.
- Run the ‘01 Collect frontpages.py’ script to collect the frontpages.
- Run the ‘02 Collect further pages.py’ script to collect additional pages one click deeper into the domain.
- Archive the files in the ‘HTML’ folder to a .zip file called ‘HTML.zip’.
- Run the ‘03 Convert HTML to plaintext.py’ script to process the files in ‘HTML.zip’.
- Optional: Run the ‘04 GPT application.py’ script to classify the collected pages, supporting further selection.
- Archive the files in the ‘TXT_uncleaned’ folder to a .zip file called ‘TXT_uncleaned.zip’.
- Run the ‘05 Clean and select.py’ script to clean and select the texts, placing them in the ‘TXT_cleaned’ folder. This script also combines the selected pages into a single plaintext file per firm/year in the ‘TXT_combined’ folder.
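The folder setup required before running the scripts can also be done from Python; a minimal sketch using only the standard library (folder names taken from the steps above):

```python
import os

# The four folders the pipeline expects, as named in the setup steps.
FOLDERS = ["HTML", "TXT_uncleaned", "TXT_cleaned", "TXT_combined"]

def create_pipeline_folders(base_dir="."):
    """Create the empty folders the scripts expect, if they do not exist yet."""
    for name in FOLDERS:
        os.makedirs(os.path.join(base_dir, name), exist_ok=True)
```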
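The collection scripts (steps 01 and 02) rely on the Wayback Machine; their internals are not shown here, but archived captures of a domain are typically listed through the public Wayback CDX API. The sketch below only builds such a query URL; the exact parameters our scripts use may differ:

```python
from urllib.parse import urlencode

# Standard Wayback Machine CDX API endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain, year):
    """Build a CDX API URL listing archived captures of a domain
    within a given year (illustrative parameter choices)."""
    params = {
        "url": domain,
        "from": f"{year}0101",       # first day of the year
        "to": f"{year}1231",         # last day of the year
        "output": "json",
        "filter": "statuscode:200",  # keep only successful captures
        "collapse": "digest",        # drop consecutive duplicate captures
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```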
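The archiving steps (zipping ‘HTML’ and ‘TXT_uncleaned’) can be done manually or scripted; a one-function sketch with the standard library:

```python
import shutil

def archive_folder(folder):
    """Zip the folder's contents into '<folder>.zip' next to it.
    Returns the path of the created archive."""
    return shutil.make_archive(folder, "zip", root_dir=folder)
```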
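Step 03 converts HTML to plaintext; the script's actual parser is not shown here, but the core operation can be sketched with the standard-library HTML parser (a real conversion would handle more edge cases):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML document, one node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```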
From there, you can use the texts in the ‘TXT_cleaned’ and ‘TXT_combined’ folders for further analysis. The generated metadata files can be used to summarize the data or serve as input for additional analyses.
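The combining performed by step 05 (one plaintext file per firm/year in ‘TXT_combined’) can be sketched as follows. The filename scheme ‘<firm>_<year>_<page>.txt’ is purely an assumption for illustration; the actual script drives selection from the metadata files:

```python
import os
from collections import defaultdict

def combine_cleaned_texts(cleaned_dir, combined_dir):
    """Concatenate cleaned page texts into one file per firm/year.
    Assumes filenames like '<firm>_<year>_<page>.txt' (illustrative)."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(cleaned_dir)):
        if not name.endswith(".txt"):
            continue
        firm, year, _ = name.split("_", 2)
        groups[(firm, year)].append(name)
    os.makedirs(combined_dir, exist_ok=True)
    for (firm, year), names in groups.items():
        out_path = os.path.join(combined_dir, f"{firm}_{year}.txt")
        with open(out_path, "w", encoding="utf-8") as out:
            for name in names:
                with open(os.path.join(cleaned_dir, name), encoding="utf-8") as f:
                    out.write(f.read().strip() + "\n\n")
```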