An Implementation of MassQL for Streamlining Common Mass Spectrometry Analysis Workflows
MassQLab applies a series of queries (written in the language of MassQL) to a directory containing mass spectrometry data in mzML format. Results are tabulated and saved as images and csv files.
MassQL: https://github.com/mwang87/MassQueryLanguage
See Setup Guide or Quickstart for Initial Setup Instructions
MassQLab currently supports multiple usage modes, depending on your workflow and preference:
py src\MassQLab_console.py— Executes the full analysis pipeline from the command line [^]py src\MassQLab_GUI.py— Launches the graphical interface [^^]
notebooks\MassQLab_notebook.ipynb— Walks through the workflow with manual controlnotebooks\MassQLab_console.ipynb— Executes the full analysis pipeline within a notebook [^]notebooks\MassQLab_GUI.ipynb— Launches the GUI from within a notebook [^^]
- A full desktop interface is planned to support non-technical users and exploratory workflows, with minimal code and no need for Python or environment setup.
[^] Requires massqlab_config.json to be configured upfront
[^^] If GUI hangs or if closed while in progress, may continue running in background and may require force close via task manager
Each entry point loads configurations populated from the massqlab_config.json file located in the project root.
Some entry points (e.g., MassQLab_console.py, MassQLab_console.ipynb) run the full workflow based only on this configuration.
Other entry points (e.g., MassQLab_notebook.ipynb, MassQLab_GUI.py, MassQLab_GUI.ipynb) permit manual modification of configuration at runtime.
{
"data_directory": "path/to/data",
"queryfile": "path/to/queryfile",
"output_directory": false,
"analysis": true,
"cache_setting": true,
"convert_raw": false,
"msconvertexe": false
}-
data_directory
Path to the folder containing mass spectrometry files (.mzMLor raw format).
MassQLab currently processes all files in this directory. -
queryfile
Path to the query file (JSON, CSV, or XLSX) that defines the MassQL queries to apply to each file in the data directory.
For optional fields and advanced configuration, see the Config File Advanced section.
The query file (queryfile) defines the MassQL queries to be executed, along with a user-defined name for each query.
Example query files are provided in the root of the MassQLab repository:
MassQL_Queries.jsonMassQL_Queries.csv(xlsx also permitted with same schema)
Each query in the file must include the following:
-
name:
A unique name for each query (per MS1/MS2 type). Avoid very long names to prevent output issues. -
query:
The MassQL query string itself.
For optional fields and advanced query configuration, see the Query File Advanced section.
See https://mwang87.github.io/MassQueryLanguage_Documentation/ for full documentation
-
Get MS1 scans matching m/z 207.1418 with a tolerance of 2.5 ppm, retention time between 1.0 and 1.2 minutes, and filter for 207.1418 peak intensity
QUERY scaninfo(MS1DATA) FILTER MS1MZ=207.1418:TOLERANCEPPM=2.5 AND RTMIN=1.0 AND RTMAX=1.2 -
Get MS1 scans where an MS2 scan with a product ion is present
QUERY scaninfo(MS1DATA) WHERE MS2PROD=226.18
-
Get MS2 scans with a precursor ion matching m/z 429.3765, retention time between 9.0 and 9.5 minutes, and return total intensity of each scan
QUERY scaninfo(MS2DATA) WHERE MS2PREC=429.3765:TOLERANCEPPM=2.5 AND RTMIN=9.0 AND RTMAX=9.5 -
Get MS2 scans with a precursor ion matching m/z 429.3765, retention time between 9.0 and 9.5 minutes, and return intensity of the peak with m/z 85.0281
QUERY scaninfo(MS2DATA) WHERE MS2PREC=429.3765:TOLERANCEPPM=2.5 AND RTMIN=9.0 AND RTMAX=9.5 FILTER MS2PROD=85.0281:TOLERANCEPPM=10 -
Get MS2 scans where both a product ion and a neutral loss are present
QUERY scaninfo(MS2DATA) WHERE MS2NL=176.0321 AND MS2PROD=85.02915 -
Get MS2 scans where a product ion matches an arithmetic expression
QUERY scaninfo(MS2DATA) WHERE MS2PROD=144+formula(CH2)
From the project root, ensure your virtual environment is activated and all requirements are installed (see Setup Guide). Then run one of the following:
py src\MassQLab_console.py # Launches the console workflowpy src\MassQLab_GUI.py # Launches the GUI (if available)From the project root, ensure your virtual environment is activated and all requirements are installed. Then launch Jupyter Lab:
jupyter labNavigate to the notebooks/ directory and open the desired notebook to begin working with MassQLab.
Results are saved in MassQLab_Output inside the defined output_directory (data_directory is used by default), organized by timestamp.
MS1 Outputs
ms1_raw_df.csv: Raw query results + metadatams1_analysis_df.csv: Peak fitting analysisms1_RT_analysis_df.csv: RT stats vs. query RTms1_traces.pdf: All Gaussian fits per query/filems1_summary_traces.pdf: Summary of Gaussian fits per queryms1_summary_traces_inverse.pdf: Summary of Gaussian fits per filems1_summary_areas.pdf: Peak areas per queryms1_summary_areas_inverse.pdf: Peak areas per filems1_consolidated_traces.png: Consolidated image of all ms1_traces with Gaussian fit
MS2 Outputs
ms2_raw_df.csv: Raw query results + metadatams2_analysis_df.csv: Peak picking analysisms2_plots.pdf: Scan intensities (lines connect shared MS1 scans)ms2_summary_plots.pdf: Top scan per queryms2_cluster_plots.pdf: Summary of MS2 analysis clustered by energy level if applicablems2_cluster_plots_group.pdf: Summary of MS2 analysis by query group (if defined)
-
output_directory
Saves output to custom directory. Othewrise saved within data_directory. -
analysis
Whether to run downstream analysis on results returned from the MassQL queries (trueorfalse). -
cache_setting
Iftrue, saves an intermediate Feather file to disk and uses it for faster re-querying. -
convert_raw
Iftrue, attempts to convert raw files to.mzML.
Note: This feature is experimental. It is recommended to convert files using ProteoWizard msconvert before using MassQLab. -
msconvertexe
Path to themsconvert.exeexecutable used for raw-to-mzML conversion (only used ifconvert_rawistrue). MSConvert
This section describes parameters that may be used in the queryfile in addition to "name" and "query"
Any additional parameters in queryfile will also be carried over into excel/csv outputs for ad-hoc usage
-
"abundance":
- an expected or anticipated abundance of the peak area (ms1) or intensity (MS2)
- abundance validity will be measured with a 10% (ms1) or 20% (ms2) threshold of this value
- abundances not within this threshold will be flagged as invalid and will not be considered in downstream analysis
- customizing threshold will be added in future release
-
"group":
- A designation for queries that have some relationship
- Use the same argument for the "group" parameter in multiple queries to associate the queries with each other
- Use case 1: Assign acyl-carnitine query and each acyl-carnite subfragment query a "group" argument of "acyl-carnitine"
- In "ms2_cluster_plots_group" result, each group will be consolidated and visualized together
- Use case 2: Assign all phosphatidylcholine lipids a "group" argument of "PC"
- Function will be expanded in future release to automate quantifying total group (ie. PC) content
-
"KEGG":
- Argument does not currently carry any special meaning, but this and any other parameter in the query file will be carried over into excel/csv outputs
- The query file is a good place to define unique compound identifiers for ad-hoc usage
- metabolites: KEGG, InChI string/key, CASRN, SMILES, IUPAC, PubChem ID
- proteins: PDB ID, Gene name, UniProt, Ensembl, RefSeq
-
MS1 queries will return a dataframe of total intensity of MS1 scan vs retention time for each MS1 scan that matches query
- Peak will be fit with a gaussian and area will be determined from gaussian.
- Peak must meet detection criteria to be considered valid
- peak prominence > Intensity_Max / 10
- peak height > Intensity_Average x 1.1
- peak center > (RTMIN - (RTMAX - RTMIN))
- peak center < (RTMAX + (RTMAX - RTMIN))
- fwhm < (RTMAX - RTMIN)
-
MS2 queries will return a dataframe of total intensity of MS2 scan vs retention time for each MS2 scan that matches query
- Downstream analysis will split MS2 query results for each file based on any collision energy grouping as defined in scan metadata
- ie. HCD and CID will be split
- ie. Collision energy 20 vs 30 vs 40 will be split
- An MS2 group will be each combination, ie. HCD_20, CID_30, etc.
- When more than one valid scan is returned from the MassQL query for an MS2 group, the highest intensity scan will be used for downstream analysis
- Downstream analysis will split MS2 query results for each file based on any collision energy grouping as defined in scan metadata
Peaks are first detected from y-data vs x-data using scipy.signal.find_peaks() with the following thresholds:
- Height: greater than
mean(ydata) * 1.1 - Distance: at least 5 observations apart
- Prominence: at least
max(ydata) / 10
Detected peaks are then filtered based on their x-axis value within a target retention time window:
(rtmin - range_rt) ≤ peak_x ≤ (rtmax + range_rt)
Where:
range_rt = rtmax - rtmin
Peaks are further refined by excluding those with excessive width:
FWHM < range_rt
The remaining peaks are ranked by their height, and the most prominent one is selected for detailed analysis.
The Full Width at Half Maximum (FWHM) is calculated and used to approximate the peak as a Gaussian function.
σ = fwhm / (2 * sqrt(2 * log(2)))
peak_area = height × σ × sqrt(2π)
peak_area_alt: Integrates the total signal using trapezoidal integration, without relying on peak picking.peak_intensity: Records the maximum y-value (intensity) from the data, also independent of peak picking.
mzML files can be created from raw files on the fly if MSConvert (part of ProteoWizard software) is installed separately
https://proteowizard.sourceforge.io/download.html
You will need to identify the path to "msconvert.exe" and select it in the "msconvertexe" field in the MassQLab GUI window. You will also need to check the "convert_raw" button.
On my system, the path to mscovert.exe looks like this:
"C:\Users\username\AppData\Local\Apps\ProteoWizard 3.0.23118.b2ed96f 64-bit\msconvert.exe"