This repository contains a modular, API-driven Python pipeline designed for bibliometric analysis and systematic literature reviews. It was developed to automate the collection, processing, and semantic analysis of scientific publications from the Scopus database.
The pipeline replaces manual literature search methods with a reproducible, code-based approach, allowing researchers to:
- Systematically Collect Data: Retrieve comprehensive metadata for thousands of papers across defined journals.
- Analyze Content: Perform full-text and abstract keyword scanning with context extraction.
- Map Social Networks: Construct and analyze co-authorship and affiliation networks.
- Visualize Trends: Generate high-quality figures showing publication growth and technology adoption over time.
- Model Topics: Apply Latent Dirichlet Allocation (LDA) to discover hidden thematic structures and correlate them with specific domain terminology.
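For readers unfamiliar with the topic-modeling step, the sketch below shows a generic LDA workflow on abstracts using `gensim` and `nltk` (both project dependencies). It is illustrative only: the sample abstracts are stand-ins for the real Scopus data, and the actual notebook may use different preprocessing and parameters.

```python
import re

import nltk
from gensim import corpora, models
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

# Stand-in abstracts; the pipeline uses abstracts harvested from Scopus.
abstracts = [
    "Machine learning models improve flood forecasting in urban catchments.",
    "A digital twin framework for real-time water distribution monitoring.",
]

# Basic preprocessing: lowercase, keep alphabetic tokens, drop stop words.
texts = [
    [tok for tok in re.findall(r"[a-z]+", abstract.lower()) if tok not in stop_words]
    for abstract in abstracts
]

# Build the dictionary and bag-of-words corpus, then fit LDA.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42, passes=10)

for topic_id, words in lda.print_topics(num_words=5):
    print(f"Topic {topic_id}: {words}")
```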
The analysis is divided into five sequential Jupyter Notebooks located in scripts/bibliometric_analysis/:
- `1_data_collection_and_metadata.ipynb`: Interacts with the Elsevier Scopus API to harvest metadata (DOIs, titles, abstracts, authors) for the specified journals since 2010.
- `2_full_text_processing.ipynb`: Performs advanced text mining on the retrieved records, identifying occurrences of technical keywords and extracting their surrounding textual context (+/- 75 characters).
- `3_author_network_analysis.ipynb`: Constructs social graphs of authors and affiliations, calculating network centrality metrics to identify key influencers.
- `4_trend_visualization.ipynb`: Produces publication timelines and "Technology Trend" line charts tracking the frequency of specific keywords over the years.
- `5_advanced_topic_modeling.ipynb`: Uses unsupervised learning (LDA) to infer latent topics from abstracts and generates heatmaps correlating these topics with explicit technology flags.
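As an illustration of the context-extraction step in notebook 2, the sketch below pulls +/- 75 characters around each keyword match. The function name, sample text, and return format are hypothetical, not the notebook's actual API.

```python
import re
from typing import List

def extract_contexts(text: str, keyword: str, window: int = 75) -> List[str]:
    """Return each occurrence of `keyword` with +/- `window` characters of context."""
    contexts = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        contexts.append(text[start:end])
    return contexts

# Hypothetical abstract used only to demonstrate the windowing behavior.
abstract = (
    "We couple a machine learning surrogate with a hydraulic solver to speed up "
    "calibration; the machine learning component is trained on historical events."
)
print(extract_contexts(abstract, "machine learning"))
```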
Access to the Scopus API is required.
- Obtain an API Key from the Elsevier Developer Portal.
- If you are accessing from within a subscribed institution, you may also need an Institution Token.
- These keys should be set as environment variables or configured in `config.py`:
  - `ELSEVIER_API_KEY`
  - `INST_TOKEN` (optional)
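As a rough sketch of how these credentials are typically passed to Elsevier's Scopus Search API: the endpoint and the `X-ELS-APIKey` / `X-ELS-Insttoken` headers follow Elsevier's public API convention, but the journal query below is only a placeholder and the notebooks may structure their requests differently.

```python
import os
import requests

# Credentials are read from the environment variables described above.
API_KEY = os.environ["ELSEVIER_API_KEY"]
INST_TOKEN = os.environ.get("INST_TOKEN")  # optional institution token

headers = {"X-ELS-APIKey": API_KEY, "Accept": "application/json"}
if INST_TOKEN:
    headers["X-ELS-Insttoken"] = INST_TOKEN

# Placeholder query for one journal since 2010; the notebooks build their
# queries from journals.csv instead.
params = {
    "query": 'SRCTITLE("Journal of Hydroinformatics") AND PUBYEAR > 2009',
    "count": 25,
}
response = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    headers=headers,
    params=params,
    timeout=30,
)
response.raise_for_status()
entries = response.json()["search-results"]["entry"]
print(f"Retrieved {len(entries)} records")
```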
The pipeline requires Python 3.8+ and the following libraries:
```bash
pip install pandas requests networkx seaborn matplotlib gensim nltk numpy
```

The pipeline relies on two configuration files located in the `data/` directory:
- `data/journals.csv`: A CSV file with a column named `journal` containing the exact names of the journals to query.
- `data/technical_buzzwords.csv`: A CSV file mapping broad categories to specific search terms. Required columns: `buzzword` (category) and `terms` (comma-separated list of keywords).
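As a quick sanity check of these input files (a sketch only: the column names `journal`, `buzzword`, and `terms` come from the descriptions above, while expanding them into a keyword dictionary is just one plausible way to consume them):

```python
import pandas as pd

# Load the journal list and the buzzword-to-terms mapping described above.
journals = pd.read_csv("data/journals.csv")
buzzwords = pd.read_csv("data/technical_buzzwords.csv")

print(f"{len(journals)} journals to query:")
print(journals["journal"].tolist())

# Expand the comma-separated `terms` column into a {category: [keywords]} dict.
keyword_map = {
    row["buzzword"]: [t.strip().lower() for t in str(row["terms"]).split(",")]
    for _, row in buzzwords.iterrows()
}
for category, terms in keyword_map.items():
    print(f"{category}: {len(terms)} search terms")
```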
For reference and reproducibility, the data/ directory also contains:
- `data/raw/`: Contains previously scraped metadata (`metadata_accessible.csv`, `papers_metadata.csv`).
- `data/processed/legacy/`: Contains intermediate result dumps (`LDA_BuzzwordCount.xlsx`, `keyword_specialtopic_dump.csv`, etc.).
- `archive/`: Contains refactored legacy notebooks:
  - `buzzword_feature_engineering_legacy.ipynb`: Early feature encoding logic.
  - `exploratory_visualizations_legacy.ipynb`: Historical EDA and plots.
- Configure: Ensure your API keys are set and your input CSVs (`journals.csv`, `technical_buzzwords.csv`) are correctly populated in the `data/` folder.
- Run Sequentially: Execute the notebooks in order (1 through 5).
  - Step 1 creates `data/raw/papers_metadata.csv`.
  - Step 2 creates `data/processed/text_analysis.csv`.
  - Subsequent steps rely on these generated files.
- View Results:
  - Figures: High-resolution plots are saved to `results/figures/`.
  - Network Data: Gephi-compatible graph files (`.gexf`) are saved to `results/`.
  - Processed Data: Intermediate CSVs are available in `data/processed/` for further manual inspection.
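Because the exported graph files are standard `.gexf`, they can also be re-opened outside Gephi. A minimal sketch follows; the file name `coauthorship_network.gexf` is a placeholder for whichever graph file Step 3 actually writes to `results/`.

```python
import networkx as nx
import pandas as pd

# Re-load an exported co-authorship graph for further analysis.
# NOTE: the file name below is a placeholder for the .gexf file written by notebook 3.
G = nx.read_gexf("results/coauthorship_network.gexf")
print(f"{G.number_of_nodes()} authors, {G.number_of_edges()} co-authorship links")

# Rank authors by degree centrality to spot key influencers.
centrality = nx.degree_centrality(G)
top_authors = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for author, score in top_authors:
    print(f"{author}: {score:.3f}")

# The intermediate keyword/context table from Step 2 can be inspected directly.
text_analysis = pd.read_csv("data/processed/text_analysis.csv")
print(text_analysis.head())
```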
This work has been developed by the Hydroinformatics Lab at Tulane University.