This repository contains a modular, API-driven Python pipeline designed for bibliometric analysis and systematic literature reviews. It was developed to automate the collection, processing, and semantic analysis of scientific publications from the Scopus database.
The pipeline replaces manual literature search methods with a reproducible, code-based approach, allowing researchers to:
- Systematically Collect Data: Retrieve comprehensive metadata for thousands of papers across defined journals.
- Analyze Content: Perform full-text and abstract keyword scanning with context extraction.
- Map Social Networks: Construct and analyze co-authorship and affiliation networks.
- Visualize Trends: Generate high-quality figures showing publication growth and technology adoption over time.
- Model Topics: Apply Latent Dirichlet Allocation (LDA) to discover hidden thematic structures and correlate them with specific domain terminology.
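For readers unfamiliar with the topic-modeling step, the sketch below shows a generic LDA workflow on abstracts using `gensim` and `nltk` (both project dependencies). It is illustrative only: the sample abstracts are stand-ins for the real Scopus data, and the actual notebook may use different preprocessing and parameters.

```python
import re

import nltk
from gensim import corpora, models
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

# Stand-in abstracts; the pipeline uses abstracts harvested from Scopus.
abstracts = [
    "Machine learning models improve flood forecasting in urban catchments.",
    "A digital twin framework for real-time water distribution monitoring.",
]

# Basic preprocessing: lowercase, keep alphabetic tokens, drop stop words.
texts = [
    [tok for tok in re.findall(r"[a-z]+", abstract.lower()) if tok not in stop_words]
    for abstract in abstracts
]

# Build the dictionary and bag-of-words corpus, then fit LDA.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42, passes=10)

for topic_id, words in lda.print_topics(num_words=5):
    print(f"Topic {topic_id}: {words}")
```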
The analysis is divided into five sequential Jupyter Notebooks located in scripts/bibliometric_analysis/:
- `1_data_collection_and_metadata.ipynb`: Interacts with the Elsevier Scopus API to harvest metadata (DOIs, titles, abstracts, authors) for the specified journals since 2010.
- `2_full_text_processing.ipynb`: Performs advanced text mining on the retrieved records, identifying occurrences of technical keywords and extracting their surrounding textual context (+/- 75 characters).
- `3_author_network_analysis.ipynb`: Constructs social graphs of authors and affiliations, calculating network centrality metrics to identify key influencers.
- `4_trend_visualization.ipynb`: Produces publication timelines and "Technology Trend" line charts tracking the frequency of specific keywords over the years.
- `5_advanced_topic_modeling.ipynb`: Uses unsupervised learning (LDA) to infer latent topics from abstracts and generates heatmaps correlating these topics with explicit technology flags.
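As an illustration of the context-extraction step in notebook 2, the sketch below pulls +/- 75 characters around each keyword match. The function name, sample text, and return format are hypothetical, not the notebook's actual API.

```python
import re
from typing import List

def extract_contexts(text: str, keyword: str, window: int = 75) -> List[str]:
    """Return each occurrence of `keyword` with +/- `window` characters of context."""
    contexts = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        contexts.append(text[start:end])
    return contexts

# Hypothetical abstract used only to demonstrate the windowing behavior.
abstract = (
    "We couple a machine learning surrogate with a hydraulic solver to speed up "
    "calibration; the machine learning component is trained on historical events."
)
print(extract_contexts(abstract, "machine learning"))
```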
Access to the Scopus API is required.
- Obtain an API Key from the Elsevier Developer Portal.
- If you are accessing from within a subscribed institution, you may also need an Institution Token.
- These keys should be set as environment variables or configured in `config.py`:
  - `ELSEVIER_API_KEY`
  - `INST_TOKEN` (optional)
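As a rough sketch of how these credentials are typically passed to Elsevier's Scopus Search API: the endpoint and the `X-ELS-APIKey` / `X-ELS-Insttoken` headers follow Elsevier's public API convention, but the journal query below is only a placeholder and the notebooks may structure their requests differently.

```python
import os
import requests

# Credentials are read from the environment variables described above.
API_KEY = os.environ["ELSEVIER_API_KEY"]
INST_TOKEN = os.environ.get("INST_TOKEN")  # optional institution token

headers = {"X-ELS-APIKey": API_KEY, "Accept": "application/json"}
if INST_TOKEN:
    headers["X-ELS-Insttoken"] = INST_TOKEN

# Placeholder query for one journal since 2010; the notebooks build their
# queries from journals.csv instead.
params = {
    "query": 'SRCTITLE("Journal of Hydroinformatics") AND PUBYEAR > 2009',
    "count": 25,
}
response = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    headers=headers,
    params=params,
    timeout=30,
)
response.raise_for_status()
entries = response.json()["search-results"]["entry"]
print(f"Retrieved {len(entries)} records")
```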
The pipeline requires Python 3.8+ and the following libraries:
```bash
pip install pandas requests networkx seaborn matplotlib gensim nltk numpy
```

The pipeline relies on two configuration files located in the `data/` directory:
- `data/journals.csv`: A CSV file with a column named `journal` containing the exact names of the journals to query.
- `data/technical_buzzwords.csv`: A CSV file mapping broad categories to specific search terms. Required columns: `buzzword` (category) and `terms` (comma-separated list of keywords).
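As a quick sanity check of these input files (a sketch only: the column names `journal`, `buzzword`, and `terms` come from the descriptions above, while expanding them into a keyword dictionary is just one plausible way to consume them):

```python
import pandas as pd

# Load the journal list and the buzzword-to-terms mapping described above.
journals = pd.read_csv("data/journals.csv")
buzzwords = pd.read_csv("data/technical_buzzwords.csv")

print(f"{len(journals)} journals to query:")
print(journals["journal"].tolist())

# Expand the comma-separated `terms` column into a {category: [keywords]} dict.
keyword_map = {
    row["buzzword"]: [t.strip().lower() for t in str(row["terms"]).split(",")]
    for _, row in buzzwords.iterrows()
}
for category, terms in keyword_map.items():
    print(f"{category}: {len(terms)} search terms")
```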
For reference and reproducibility, the data/ directory also contains:
- `data/raw/`: Contains previously scraped metadata (`metadata_accessible.csv`, `papers_metadata.csv`).
- `data/processed/legacy/`: Contains intermediate result dumps (`LDA_BuzzwordCount.xlsx`, `keyword_specialtopic_dump.csv`, etc.).
- `archive/`: Contains refactored legacy notebooks:
  - `buzzword_feature_engineering_legacy.ipynb`: Early feature encoding logic.
  - `exploratory_visualizations_legacy.ipynb`: Historical EDA and plots.
- Configure: Ensure your API keys are set and your input CSVs (`journals.csv`, `technical_buzzwords.csv`) are correctly populated in the `data/` folder.
- Run Sequentially: Execute the notebooks in order (1 through 5).
  - Step 1 creates `data/raw/papers_metadata.csv`.
  - Step 2 creates `data/processed/text_analysis.csv`.
  - Subsequent steps rely on these generated files.
- View Results:
  - Figures: High-resolution plots are saved to `results/figures/`.
  - Network Data: Gephi-compatible graph files (`.gexf`) are saved to `results/`.
  - Processed Data: Intermediate CSVs are available in `data/processed/` for further manual inspection.
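Because the exported graph files are standard `.gexf`, they can also be re-opened outside Gephi. A minimal sketch follows; the file name `coauthorship_network.gexf` is a placeholder for whichever graph file Step 3 actually writes to `results/`.

```python
import networkx as nx
import pandas as pd

# Re-load an exported co-authorship graph for further analysis.
# NOTE: the file name below is a placeholder for the .gexf file written by notebook 3.
G = nx.read_gexf("results/coauthorship_network.gexf")
print(f"{G.number_of_nodes()} authors, {G.number_of_edges()} co-authorship links")

# Rank authors by degree centrality to spot key influencers.
centrality = nx.degree_centrality(G)
top_authors = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for author, score in top_authors:
    print(f"{author}: {score:.3f}")

# The intermediate keyword/context table from Step 2 can be inspected directly.
text_analysis = pd.read_csv("data/processed/text_analysis.csv")
print(text_analysis.head())
```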
This work has been developed by the Hydroinformatics Lab at Tulane University.