BEACON (Business Establishment Automated Classification of NAICS) is a machine learning tool developed to help respondents self-designate their 6-digit NAICS (North American Industry Classification System) code on the Economic Census (EC). BEACON’s methodology is based on machine learning, natural language processing, and information retrieval.
The EC is conducted every five years, in years ending in "2" or "7". The survey represents approximately eight million establishments, covering most industries and all geographic areas of the United States.
NAICS is a hierarchical 6-digit coding structure. The first two digits represent the economic sector, and each additional non-zero digit adds industry detail. The U.S. Census Bureau classifies establishments by NAICS industry based on the primary business activity of the establishment. NAICS is used throughout the survey life cycle: sample selection, data collection, analytical review, and publication.
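As a small illustration of the hierarchy, the sketch below splits a 6-digit code into its standard NAICS levels by taking prefixes. The function name `naics_levels` is hypothetical and is not part of beacon.py; the code `445131` is used only as an example value.

```python
def naics_levels(code: str) -> dict:
    """Split a 6-digit NAICS code into its hierarchical levels (illustrative helper)."""
    if not (code.isdigit() and len(code) == 6):
        raise ValueError("expected a 6-digit NAICS code")
    return {
        "sector": code[:2],           # first two digits
        "subsector": code[:3],        # three digits
        "industry_group": code[:4],   # four digits
        "naics_industry": code[:5],   # five digits
        "national_industry": code,    # full six digits
    }


levels = naics_levels("445131")
print(levels["sector"], levels["industry_group"])
```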
On the EC, respondents are asked to describe their business. There are prelisted descriptions, each corresponding to a suggested NAICS code, but the respondent can also type in a description. Clerical analysis of this write-in text is a resource-intensive process.
The general idea of BEACON is that the respondent inputs a business description and BEACON returns a ranked list of 6-digit NAICS codes with matching industry descriptions.
- The respondent provides a write-in description.
- The text is sent to the BEACON API.
- The API returns the most relevant NAICS codes to the respondent.
The goals are to help respondents properly self-designate their NAICS code, send respondents down the correct EC questionnaire path, and reduce the clerical work associated with write-ins.
Returning suggested codes to the respondent makes the questionnaire more dynamic. Overall, BEACON reduces the clerical work associated with analyzing NAICS write-ins.
The first step is text cleaning: convert the text to lowercase, account for numbers and punctuation, remove common "stop" words, stem words to reduce the number of word variations, and correct common misspellings.
Example:
- Input text: This is a convenience store.
- Clean text: conveni store
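The cleaning steps can be sketched roughly as follows. This is a minimal illustration, not the code in beacon.py: the stop word list, the misspelling map, and the suffix-stripping `toy_stem` function are all hypothetical stand-ins (a real pipeline would more likely use an established stemmer), chosen only so that the example above is reproduced.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "this"}
MISSPELLINGS = {"convience": "convenience", "resturant": "restaurant"}
SUFFIXES = ("ence", "ing", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text: str) -> str:
    text = text.lower()                     # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # replace digits and punctuation with spaces
    words = [MISSPELLINGS.get(w, w) for w in text.split()]  # fix known misspellings
    words = [w for w in words if w not in STOP_WORDS]       # drop stop words
    return " ".join(toy_stem(w) for w in words)             # stem what remains


print(clean_text("This is a convenience store."))  # conveni store
```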
Underlying BEACON is a dictionary of text that occurs frequently in the cleaned training data. It consists of words, word combinations, and full-length (exact) descriptions. These pieces of text serve as the model features. Each feature has an associated NAICS distribution and a purity weight that measures how concentrated, or pure, that distribution is.
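To make the idea of a purity weight concrete, the sketch below computes a NAICS distribution for a feature from training labels and then scores its concentration with a sum of squared proportions. That measure is an assumption chosen for simplicity and is not necessarily the purity weight BEACON uses; the function names and training labels are likewise hypothetical.

```python
from collections import Counter


def naics_distribution(codes):
    """Proportion of training observations in each NAICS code for one feature."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def purity(distribution):
    """Assumed concentration measure: sum of squared proportions.

    Equals 1.0 when a feature always maps to one code and shrinks
    as the feature spreads across more codes.
    """
    return sum(p * p for p in distribution.values())


# Hypothetical NAICS labels observed with two different features.
pure = naics_distribution(["445131", "445131", "445131", "445131"])
mixed = naics_distribution(["445131", "445131", "722511", "811111"])

print(purity(pure), purity(mixed))
```

Under this assumed measure, a feature seen with only one NAICS code gets the maximum weight, while a feature spread over several codes contributes less.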
Information retrieval models look at how words, word combinations, and entire descriptions are distributed across NAICS codes. Each feature type (word, word combination, and entire description) has its scores calculated from the NAICS distributions and purity weights of its features. The scores for the individual types are averaged, yielding relevance scores that range in value between 0 and 100. The relevance scores reflect how confident BEACON is that a given NAICS code is correct.
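One way to picture this scoring is sketched below: within each feature type, NAICS distributions are combined as a purity-weighted average, and the per-type scores are then averaged and scaled to 0 to 100. The exact score functions in BEACON may differ; the function names, weights, and distributions here are illustrative assumptions only.

```python
def weighted_scores(features):
    """Purity-weighted average of NAICS distributions for one feature type.

    `features` is a list of (distribution, purity_weight) pairs.
    """
    total_weight = sum(weight for _, weight in features)
    scores = {}
    for distribution, weight in features:
        for code, p in distribution.items():
            scores[code] = scores.get(code, 0.0) + weight * p / total_weight
    return scores


def relevance_scores(score_sets):
    """Average per-type scores, scale to 0-100, and rank codes from best to worst."""
    codes = set().union(*score_sets)
    n = len(score_sets)
    combined = {c: 100 * sum(s.get(c, 0.0) for s in score_sets) / n for c in codes}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical matched features for the clean text "conveni store".
word_scores = weighted_scores([
    ({"445131": 1.0}, 1.0),                 # "conveni": concentrated in one code
    ({"445131": 0.5, "722511": 0.5}, 0.5),  # "store": spread over two codes
])
exact_scores = weighted_scores([({"445131": 1.0}, 1.0)])  # exact description match

ranked = relevance_scores([word_scores, exact_scores])
for code, score in ranked:
    print(code, round(score, 1))
```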
This section serves as a guide to the repository contents. The following files are at the root level:
| File | Description |
|---|---|
| create_example_data.py | Program for creating example datasets using public NAICS files |
| create_example_data_output.txt | Output of create_example_data.py |
| beacon.py | Codebase for implementing a simplified version of BEACON |
| beacon_example.py | Program for illustrating the use of beacon.py |
| beacon_example_output.txt | Output of beacon_example.py |
The following files can be found in the presentations folder:
| File | Description |
|---|---|
| eurostat_BEACON_Whitehead_Pfeiff.pdf | 2024 Eurostat industry coding webinar presentation on BEACON |
| 2023-FCSM-BEACON-Model-Stacking.pdf | 2023 FCSM presentation on BEACON and applying model stacking |
| 2022-FCSM-Wiley-Whitehead.pdf | 2022 FCSM presentation on BEACON and a related model, SINCT |
| JSM_Dumbacher_Whitehead.pdf | 2022 JSM presentation on BEACON |
We appreciate any feedback you would like to provide us; please post any questions that you may have in the GitHub issues section.
Any opinions and conclusions expressed herein are those of the authors and do not reflect the views of the U.S. Census Bureau. No estimates—numerical or otherwise—based on internal Census Bureau information are included.
U.S. Census Bureau code is provided on an 'as is' basis, and the user assumes responsibility for its use. The Census Bureau has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any claims against the Census Bureau stemming from the use of its GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Census Bureau. The Census Bureau seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by the Census Bureau or the United States Government.
For more information, please see BEACON conference presentations and papers attached to the repository. If you have any questions or comments, please reach out to the BEACON team:
- Brian Dumbacher (@brian-dumbacher-census)
- Daniel Whitehead (@DanWhiteheadCensus)
- Sarah Pfeiff (@sdpfeiff)
Please use the information in CITATION.bib to cite the journal article with which this repository is associated:
Dumbacher, B., Whitehead, D., Jeong, J., and Pfeiff, S. (2025). BEACON: A Tool for Industry Self-Classification in the Economic Census. Journal of Data Science, 23(2): 429-448. https://doi.org/10.6339/25-JDS1180
@ARTICLE{DuWhJePf2025,
AUTHOR = {Brian Dumbacher and Daniel Whitehead and Jiseok Jeong and Sarah Pfeiff},
TITLE = {BEACON: A Tool for Industry Self-Classification in the Economic Census},
JOURNAL = {J. Data Sci.},
FJOURNAL = {Journal of Data Science},
YEAR = {2025},
VOLUME = {23},
NUMBER = {2},
PAGES = {429-448},
ISSN = {1680-743X},
DOI = {10.6339/25-JDS1180},
SICI = {1680-743X(2025)23:2<429:BATFIS>2.0.CO;2-2},
}
- Dumbacher, B., Whitehead, D., Jeong, J., and Pfeiff, S. (2025). BEACON: A Tool for Industry Self-Classification in the Economic Census. Journal of Data Science, 23(2): 429–448. https://doi.org/10.6339/25-JDS1180
- Dumbacher, B. and Whitehead, D. (2024). Industry Self-Classification in the Economic Census. U.S. Census Bureau ADEP Working Paper Series, ADEP-WP-2024-04. https://www2.census.gov/library/working-papers/2024/econ/industry-self-classification-economic-census.pdf
- Dumbacher, B. and Whitehead, D. (2024). Ranked short text classification using co-occurrence features and score functions. U.S. Census Bureau ADEP Working Paper Series, ADEP-WP-2024-06. https://www2.census.gov/library/working-papers/2024/econ/ranked-short-text-classification-using-co-occurrence-features-and-score-functions.pdf
- U.S. Census Bureau. (2024). Economic Census. Online; accessed 5 August 2024. https://www.census.gov/programs-surveys/economic-census.html
- U.S. Census Bureau. (2024). North American Industry Classification System. Online; accessed 5 August 2024. https://www.census.gov/naics/
- Whitehead, D. and Dumbacher, B. (2024). Ensemble Modeling Techniques for NAICS Classification in the Economic Census. U.S. Census Bureau ADEP Working Paper Series, ADEP-WP-2024-03. https://www2.census.gov/library/working-papers/2024/econ/ensemble-modeling-techniques-for-naics-classification-economic-census.pdf
- Wiley, E. and Whitehead, D. (2024). Implementing Interactive Classification Tools in the 2022 Economic Census. U.S. Census Bureau ADEP Working Paper Series, ADEP-WP-2024-05. https://www2.census.gov/library/working-papers/2024/econ/implementing-interactive-classification-tools-2022-economic-census.pdf