CompuCrawl database
This page contains the complete code and data for the CompuCrawl database, which includes archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020. The database encompasses 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.
Final database files
If you are interested in the final database of texts, the following files are available:
- “HTML.zip”: Contains the full set of HTML pages.
- “TXT_uncleaned.zip”: Contains the converted but not yet cleaned text files.
- “TXT_cleaned.zip”: Contains the selected and cleaned text files.
- “TXT_combined.zip”: Contains the combined and cleaned text files at the GVKEY/year level.
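For downstream analysis, the GVKEY/year-level texts in “TXT_combined.zip” can be read directly from the archive. The sketch below assumes member names encode GVKEY and year as `<gvkey>_<year>.txt`; the archive's actual naming scheme may differ, so adjust the pattern as needed. It is demonstrated on a small in-memory archive rather than the real file.

```python
import io
import re
import zipfile

def load_combined_texts(zip_source):
    """Yield (gvkey, year, text) tuples from a TXT_combined-style archive.

    Assumes member names follow '<gvkey>_<year>.txt'; change the pattern
    if the real archive uses a different scheme.
    """
    pattern = re.compile(r"(?P<gvkey>\d+)_(?P<year>\d{4})\.txt$")
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            m = pattern.search(name)
            if m:
                text = zf.read(name).decode("utf-8", errors="replace")
                yield m["gvkey"], int(m["year"]), text

# Demonstrate with a small in-memory archive laid out as assumed above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("001234_2019.txt", "We sell widgets.")
    zf.writestr("001234_2020.txt", "We sell sustainable widgets.")

records = list(load_combined_texts(buf))
print(records[0])  # ('001234', 2019, 'We sell widgets.')
```

Reading straight from the zip avoids extracting more than a million files to disk when only a subset of firm/years is needed.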
Full workflow and file descriptions
The full set of files, in order of use, is as follows:
- Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
- 01 Collect frontpages.py: Python script that scrapes the front pages of the listed URLs and generates a list of URLs one page deeper in each domain.
- URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
- 02 Collect further pages.py: Python script that scrapes the URLs located one page deeper in each domain.
- scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
- HTML.zip: Archived version of the set of individual HTML files.
- 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
- TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
- input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
- 04 GPT application.py: Python script using OpenAI’s API to classify selected pages according to their HTML title and URL.
- categorization_applied.csv: Output file containing the classification of the selected pages.
- exclusion_list.xlsx: Excel file with three sheets: ‘gvkeys’ lists the GVKEYs of duplicate observations to be excluded, ‘pages’ lists the page IDs of pages to be removed, and ‘sentences’ lists (sub-)sentences to be removed.
- 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
- metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
- TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
- TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
- 06 Topic model.R: R script that loads the combined text data from the folder extracted from “TXT_combined.zip”, applies further cleaning, and estimates a 125-topic model.
- TM_125.RData: RData file containing the results of the 125-topic model.
- loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
- 125_topprob.xlsx: Overview of the top-loading terms for the 125-topic model.
- 07 Word2Vec train and align.py: Python script that loads the plaintext files from the “TXT_cleaned.zip” archive, trains a series of Word2Vec models, and aligns them to allow comparison of word embeddings across time periods.
- Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
- 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms “sustainability” and “profitability” over time.
- 99 Scrape further levels down.py: Python script that generates a list of still-unscraped URLs found on the pages that are themselves one level deeper than the front page.
- URLs_2_deeper.csv: CSV file containing those unscraped URLs, i.e., links found on the pages one level deeper than the front page.
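To illustrate the conversion performed by “03 Convert HTML to plaintext.py”, a minimal stdlib-only extractor is sketched below. The actual script's parsing rules (its HTML library, which tags it strips, how it normalizes whitespace) are not specified here, so treat this as an assumed approximation rather than the published implementation.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)

def html_to_text(html):
    """Convert an HTML document to newline-separated visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><h1>About Us</h1><p>We make widgets.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # "About Us" and "We make widgets." on two lines
```

Script and style contents are dropped because they never render as page text; everything else is kept, which mirrors why a separate cleaning step (script 05) is still needed afterwards.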