
Bibliometric Analysis Pipeline

Overview

This repository contains a modular, API-driven Python pipeline designed for bibliometric analysis and systematic literature reviews. It was developed to automate the collection, processing, and semantic analysis of scientific publications from the Scopus database.

The pipeline replaces manual literature search methods with a reproducible, code-based approach, allowing researchers to:

  1. Systematically Collect Data: Retrieve comprehensive metadata for thousands of papers across defined journals.
  2. Analyze Content: Perform full-text and abstract keyword scanning with context extraction.
  3. Map Social Networks: Construct and analyze co-authorship and affiliation networks.
  4. Visualize Trends: Generate high-quality figures showing publication growth and technology adoption over time.
  5. Model Topics: Apply Latent Dirichlet Allocation (LDA) to discover hidden thematic structures and correlate them with specific domain terminology.

Repository Structure

The analysis is divided into five sequential Jupyter Notebooks located in scripts/bibliometric_analysis/:

  • 1_data_collection_and_metadata.ipynb: Interacts with the Elsevier Scopus API to harvest metadata (DOIs, titles, abstracts, authors) for the specified journals, covering publications from 2010 onward.
  • 2_full_text_processing.ipynb: Performs advanced text mining on the retrieved records, identifying occurrences of technical keywords and extracting their surrounding textual context (±75 characters); a sketch of this step follows the list.
  • 3_author_network_analysis.ipynb: Constructs social graphs of authors and affiliations, calculating network centrality metrics to identify influential authors and institutions.
  • 4_trend_visualization.ipynb: Produces publication timelines and "Technology Trend" line charts tracking the frequency of specific keywords over years.
  • 5_advanced_topic_modeling.ipynb: Uses unsupervised learning (LDA) to infer latent topics from abstracts and generates heatmaps correlating these topics with explicit technology flags; a gensim-based sketch also follows the list.
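
The context-extraction step in notebook 2 can be pictured with the minimal sketch below, assuming plain regex matching over abstract or full-text strings; the function name and sample text are illustrative, not taken from the notebook.

# Minimal sketch of the ±75-character context extraction in notebook 2.
# The function name and sample input are illustrative.
import re

def extract_contexts(text, keyword, window=75):
    # Return each occurrence of `keyword` with `window` characters of
    # context on either side.
    contexts = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        contexts.append(text[start:end])
    return contexts

sample = "Recent advances in machine learning have improved flood forecasting."
print(extract_contexts(sample, "machine learning"))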
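
Likewise, the LDA step in notebook 5 can be sketched with gensim. The tokenization, stop-word removal, and parameter choices below are assumptions for illustration, not the notebook's exact pipeline.

# Hedged sketch of LDA topic inference with gensim. Preprocessing and
# parameters are assumptions; the two sample abstracts are illustrative.
import nltk
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

abstracts = [
    "Deep learning improves flood forecasting accuracy.",
    "Sensor networks enable real-time water quality monitoring.",
]

# Tokenize, strip trailing punctuation, and drop stop words
texts = [
    [w.strip(".,") for w in doc.lower().split() if w.strip(".,") not in stop_words]
    for doc in abstracts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)
for topic_id, words in lda.print_topics():
    print(topic_id, words)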

Prerequisites

1. Elsevier API Key

Access to the Scopus API is required.

  • Obtain an API Key from the Elsevier Developer Portal.
  • If you are accessing the API from outside a subscribed institution's network, you may also need an Institution Token.
  • These keys should be set as environment variables or configured in config.py:
    • ELSEVIER_API_KEY
    • INST_TOKEN (Optional)
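
As a minimal sketch, the credentials might be read from the environment and attached to a Scopus Search API request as follows. X-ELS-APIKey and X-ELS-Insttoken are Elsevier's documented header names; the example query is purely illustrative.

# Sketch: read credentials from the environment and query the Scopus
# Search API. The journal name and query string are illustrative only.
import os
import requests

api_key = os.environ["ELSEVIER_API_KEY"]
inst_token = os.environ.get("INST_TOKEN")  # optional

headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
if inst_token:
    headers["X-ELS-Insttoken"] = inst_token

resp = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    headers=headers,
    params={"query": 'SRCTITLE("Journal of Hydrology") AND PUBYEAR > 2009'},
)
resp.raise_for_status()
print(resp.json()["search-results"]["opensearch:totalResults"])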

2. Python Environment

The pipeline requires Python 3.8+ and the following libraries:

pip install pandas requests networkx seaborn matplotlib gensim nltk numpy

3. Input Data

The pipeline relies on two configuration files located in the data/ directory:

  • data/journals.csv: A CSV file with a column named journal containing the exact names of the journals to query.
  • data/technical_buzzwords.csv: A CSV file mapping broad categories to specific search terms. Required columns: buzzword (the category name) and terms (a comma-separated list of keywords). A loading sketch follows this list.
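
A minimal loading sketch, assuming the column layout described above (how the notebooks actually parse the terms column may differ):

# Sketch: load the two configuration files with pandas. The dictionary
# built from `terms` is an assumption about how keywords are consumed.
import pandas as pd

journals = pd.read_csv("data/journals.csv")["journal"].tolist()

buzzwords = pd.read_csv("data/technical_buzzwords.csv")
# Expand the comma-separated `terms` column into {category: [keywords]}
keyword_map = {
    row["buzzword"]: [t.strip() for t in str(row["terms"]).split(",")]
    for _, row in buzzwords.iterrows()
}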

4. Historical Data & Archives

For reference and reproducibility, the data/ directory also contains:

  • data/raw/: Contains previously scraped metadata (metadata_accessible.csv, papers_metadata.csv).
  • data/processed/legacy/: Contains intermediate result dumps (LDA_BuzzwordCount.xlsx, keyword_specialtopic_dump.csv, etc.).
  • archive/: Contains refactored legacy notebooks:
    • buzzword_feature_engineering_legacy.ipynb: Early feature encoding logic.
    • exploratory_visualizations_legacy.ipynb: Historical EDA and plots.

Usage Instructions

  1. Configure: Ensure your API keys are set and your input CSVs (journals.csv, technical_buzzwords.csv) are correctly populated in the data/ folder.
  2. Run Sequentially: Execute the notebooks in order (1 through 5).
    • Step 1 creates data/raw/papers_metadata.csv.
    • Step 2 creates data/processed/text_analysis.csv.
    • Subsequent steps rely on these generated files.
  3. View Results:
    • Figures: High-resolution plots are saved to results/figures/.
    • Network Data: Gephi-compatible graph files (.gexf) are saved to results/ (see the export sketch after this list).
    • Processed Data: Intermediate CSVs are available in data/processed/ for further manual inspection.
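
To illustrate the network outputs, the sketch below builds a small co-authorship graph, ranks authors by degree centrality, and writes a Gephi-compatible .gexf file. The author lists and output filename are illustrative, and notebook 3 may compute additional centrality metrics.

# Sketch: co-authorship graph with networkx, exported as .gexf for Gephi.
# The paper/author data and filename below are illustrative.
import os
from itertools import combinations
import networkx as nx

papers = [
    ["A. Author", "B. Author", "C. Author"],
    ["B. Author", "D. Author"],
]

G = nx.Graph()
for authors in papers:
    for a, b in combinations(authors, 2):
        # Weight each edge by the number of co-authored papers
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

# Rank authors by degree centrality
ranked = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])
print(ranked[:3])

os.makedirs("results", exist_ok=True)
nx.write_gexf(G, "results/coauthorship.gexf")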

Acknowledgements

This work was developed by the Hydroinformatics Lab at the University of Iowa.
