This repository contains the code, data, and outputs associated with the paper "Large Language Models: An Applied Econometric Framework" by Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan.
This repository is organized to facilitate the replication of results presented in the paper. The subdirectories are structured based on task type (prediction or estimation) and application (financial news headlines or Congressional legislation):
- prediction_headlines/: Code and data for prediction tasks involving financial news headlines.
- prediction_legislation/: Code and data for prediction tasks involving Congressional legislation.
- estimation_headlines/: Code and data for estimation tasks involving financial news headlines.
- estimation_legislation/: Code and data for estimation tasks involving Congressional legislation.
- figures/: Code and outputs for generating figures presented in the paper.
- tables/: Code and outputs for generating tables presented in the paper.
To query LLMs, you need to add your API_KEY to a .env file:
- Visit the OpenAI API Keys page.
- Generate a new API key and copy it.
- Insert your key into the following commands and run them to create a .env file:
  cd path/to/LanguageModel_Labels
  echo 'API_KEY="paste your key here"' > .env
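As a quick check that the .env file was written correctly, the key can be read back from Python. This is a minimal sketch using only the standard library; the helper names `read_env_file` and `load_api_key` are hypothetical, and the repository's own scripts may load the key differently (for example, through a dotenv package).

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY="value" lines from a .env file into a dict."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_api_key(env_path=".env"):
    """Return API_KEY from the .env file, or raise if it is missing."""
    api_key = read_env_file(env_path).get("API_KEY", "")
    # Treat the untouched placeholder from the echo command as "not set".
    if not api_key or api_key == "paste your key here":
        raise RuntimeError(f"API_KEY is not set in {env_path}")
    return api_key
```

Calling `load_api_key()` from the repository root raises an error if the placeholder text was never replaced, which catches the most common setup mistake before any LLM query is sent.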
Create a Conda environment with the required dependencies for Python and R:
cd path/to/LanguageModel_Labels
conda update conda
conda config --set channel_priority strict
conda env create -f conda_llm_env.yaml
conda activate llm_env
The tasks based on financial news headlines include restricted data provided by Wharton Research Data Services (WRDS). To access this data:
- Go to Beta Suite by WRDS and log in with your credentials.
- Configure the parameters as follows:
- Click the Submit Form button.
- Once the query status shows Success, click Download .csv Output.
- Rename the downloaded file as CAPM_returns.csv.
- Place it in the directory: ./estimation_headlines/data/raw
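The rename-and-place steps above can also be scripted. The following is a minimal sketch, assuming the download is an ordinary file on disk; the function name `place_capm_returns` and its arguments are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def place_capm_returns(downloaded_csv, repo_root="."):
    """Copy a WRDS download to estimation_headlines/data/raw/CAPM_returns.csv."""
    source = Path(downloaded_csv)
    if not source.is_file():
        raise FileNotFoundError(f"Downloaded file not found: {source}")
    # Target directory named in the instructions above.
    target_dir = Path(repo_root) / "estimation_headlines" / "data" / "raw"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "CAPM_returns.csv"
    # Copy rather than move, so the original download is left untouched.
    shutil.copyfile(source, target)
    return target
```

For example, `place_capm_returns("wrds_output.csv")` (with `wrds_output.csv` standing in for whatever name the download received) would be run from the repository root.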
You can replicate the results presented in the paper using one of the following methods:
- Full Replication: Run the run_all.sh shell script to execute all steps in sequence:
  bash run_all.sh
- Partial Replication: Run individual scripts for specific steps. Each subdirectory contains a detailed README.md file with instructions for task-specific replication.
If you use this repository, please cite the paper:
@article{ludwig2024largelanguagemodelsapplied,
title={Large Language Models: An Applied Econometric Framework},
author={Jens Ludwig and Sendhil Mullainathan and Ashesh Rambachan},
year={2024},
journal={arXiv preprint arXiv:2412.07031},
url={https://arxiv.org/abs/2412.07031},
}
Wharton Research Data Services (WRDS) was used in preparing this paper. This service and the data available thereon constitute valuable intellectual property and trade secrets of WRDS and/or its third-party suppliers.
- Adler, E. Scott, and John Wilkerson. 2020. Congressional Bills Project, NSF 00880066 and 00880061. http://congressionalbills.org/download.html (accessed July 5, 2024).
- Aenlle, Miguel. 2020. Daily Financial News for 6000+ Stocks. https://www.kaggle.com/datasets/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests (accessed August 1, 2024).
- Beta Suite by WRDS. 2024. Provided by Wharton Research Data Services. https://wrds-www.wharton.upenn.edu/pages/grid-items/beta-suite-wrds (accessed August 1, 2024).
- Egami, Naoki, Musashi Hinck, Brandon M. Stewart, and Hanying Wei. 2023. "Using imperfect surrogates for downstream inference: design-based supervised learning for social science applications of large language models". Advances in Neural Information Processing Systems, Vol. 36. Replication code available at: https://osf.io/gjt87/.
- French, Kenneth R. 2024. Fama-French Research Data Factors (Daily). https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_daily_CSV.zip (accessed September 23, 2024).
- Jones, Bryan D., Frank R. Baumgartner, Sean M. Theriault, Derek A. Epp, Cheyenne Lee, and Miranda E. Sullivan. 2023. Policy Agendas Project: Codebook. https://minio.la.utexas.edu/compagendas/codebookfiles/Codebook_PAP_2019.pdf (accessed July 5, 2024).
- Wilkerson, John, E. Scott Adler, Bryan D. Jones, Frank R. Baumgartner, Guy Freedman, Sean M. Theriault, Alison Craig, Derek A. Epp, Cheyenne Lee, and Miranda E. Sullivan. 2023. Policy Agendas Project: Congressional Bills. https://minio.la.utexas.edu/compagendas/datasetfiles/US-Legislative-congressional_bills_19.3_3_3.csv (accessed July 5, 2024).