Start new project
MetaPhage comes with a utility script to initialize a new project. The utility, newProject.py, is located in the bin subdirectory.
newProject.py creates a project configuration file to be fed to Nextflow. The configuration file acts as an analysis protocol and makes it easier to generate reproducible results.
Help screen
usage: newProject.py [-h] [-s SAVE] -i READSDIR [-o OUTPUT_DIR]
[-d DATABASE_DIR] [-m METADATA] [-p PROJECT]
[-v MAIN_VARIABLE] [-a ALPHA_DIV_1] [-A ALPHA_DIV_2]
[-b BETA_DIV] [-V VIOLIN] [-S SUM_VIOL_VAR] [-H HEATMAP]
[--single-end] [-l] [--img IMG] [--tmp TEMPDIR]
[--work WORKINGDIR] [--verbose]
Generate a Metaphage configuration file for a new project.
optional arguments:
-h, --help show this help message and exit
Main arguments:
-s SAVE, --save SAVE Configuration file output [default: None]
-i READSDIR, --reads-dir READSDIR
Directory containing the reads.
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory [default: ./MetaPhage]
-d DATABASE_DIR, --database-dir DATABASE_DIR
Database directory
-m METADATA, --metadata-file METADATA
Metadata file.
-p PROJECT, --project PROJECT
Project name
Metadata arguments:
-v MAIN_VARIABLE, --main-variable MAIN_VARIABLE
Default variable of the metadata table for comparisons
-a ALPHA_DIV_1, --alpha-div-1 ALPHA_DIV_1
Variable for alpha diversity (otherwise -v)
-A ALPHA_DIV_2, --alpha-div-2 ALPHA_DIV_2
Secondary variable for alpha diversity (otherwise -v)
-b BETA_DIV, --beta-div BETA_DIV
Variable for alpha diversity (otherwise -v)
-V VIOLIN, --violin VIOLIN
Variable for violin plots (otherwise -v)
-S SUM_VIOL_VAR, --sum-viol-var SUM_VIOL_VAR
Variable for total violin plots (otherwise -v)
-H HEATMAP, --heatmap HEATMAP
Variable for heatmap (otherwise -v)
Metadata arguments:
--single-end Single end reads (by default is inferred)
-l, --local-run Configure for local execution
--img IMG Singularity image [default: None]
--tmp TEMPDIR Temporary directory [default: /tmp]
--work WORKINGDIR Nextflow work directory [default: /tmp]
--verbose Enable verbose output
Main parameters
Input files
-i, --reads-dir DIRECTORY: path to the input reads.
-m, --metadata-file FILE: CSV file with the metadata. Will contain the columns used for diversity analyses.
Other paths
-d, --database-dir DIRECTORY: path to the downloaded databases. By default will select the ./db subdirectory of the MetaPhage installation directory.
-o, --output-dir DIRECTORY: path to the output directory (default: ./MetaPhage).
-s, --save FILE: configuration file created by the script (default: stdout).
--tmp TEMPDIR: temporary directory (default: /tmp).
--work WORKINGDIR: Nextflow work directory (default: /tmp).
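As an illustration of how the input and path options above combine, the following minimal sketch assembles a command line for newProject.py. The paths (reads/, metadata.csv, project.conf) are hypothetical placeholders, and calling the script through "python bin/newProject.py" is an assumption about how it is launched from the MetaPhage installation directory.

```python
import shlex

# Hypothetical paths: replace them with the locations used in your project.
command = [
    "python", "bin/newProject.py",      # assumed way to launch the utility
    "--reads-dir", "reads/",            # directory containing the input reads
    "--metadata-file", "metadata.csv",  # CSV file with the metadata
    "--output-dir", "./MetaPhage",      # documented default output directory
    "--save", "project.conf",           # write the configuration to a file
]

# Print a shell-ready command line that can be copied into a terminal.
print(shlex.join(command))
```

Building the command as a list keeps each option next to its value and lets shlex.join handle quoting if a path contains spaces.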
Metadata variables
To produce plots and diversity analyses we can specify the variables (column headers in the metadata file). To perform a primary analysis, it is possible to simply specify the main variable that will be used for all the plots:
-v, --main-variable NAME: metadata variable to be used in all the plots. Defaults to Treatment; if the default is used and the metadata file has no "Treatment" column, an error is thrown.
Other variables:
-a, --alpha-div-1 ALPHA_DIV_1: variable for alpha diversity (otherwise -v).
-A, --alpha-div-2 ALPHA_DIV_2: secondary variable for alpha diversity (otherwise -v).
-b, --beta-div BETA_DIV: variable for beta diversity (otherwise -v).
-V, --violin VIOLIN: variable for violin plots (otherwise -v).
-S, --sum-viol-var SUM_VIOL_VAR: variable for total violin plots (otherwise -v).
-H, --heatmap HEATMAP: variable for heatmap (otherwise -v).
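The "otherwise -v" fallback can be pictured with a short sketch. This is not the code used by newProject.py: the function resolve_variables and the option keys are hypothetical names chosen for illustration, and only the fallback rule itself comes from the option descriptions above.

```python
from typing import Dict, Optional

# Hypothetical keys, one per plot-specific option listed above.
PLOT_OPTIONS = ("alpha_div_1", "alpha_div_2", "beta_div",
                "violin", "sum_viol_var", "heatmap")


def resolve_variables(main_variable: str = "Treatment",
                      **overrides: Optional[str]) -> Dict[str, str]:
    """Return one metadata column per plot, falling back to the main variable."""
    unknown = set(overrides) - set(PLOT_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown plot option(s): {sorted(unknown)}")
    # An option that is absent (or empty) inherits the main variable.
    return {name: overrides.get(name) or main_variable for name in PLOT_OPTIONS}


# Only the secondary alpha diversity variable is overridden here.
resolved = resolve_variables("Infant_delivery_type",
                             alpha_div_2="Infant_formula_type")
for option, column in resolved.items():
    print(f"{option}: {column}")
```

In this example every plot uses Infant_delivery_type except the secondary alpha diversity variable.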
Assumptions
The program scans the reads directory to determine the number of samples and whether they are single end or paired end (based on the presence of "_1"/"_R1" and "_2"/"_R2" tags). The list of samples is then compared with the "Sample" column in the metadata file.
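The scan described above could work along the lines of the following sketch. It is an assumption-laden illustration rather than the script's actual logic: the function scan_reads, the regular expression, and the accepted extensions (.fastq, .fq, optionally gzipped) are all hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

# Assumed layout: <sample>[_1|_2|_R1|_R2].<fastq|fq>[.gz]
READ_FILE = re.compile(r"^(?P<sample>.+?)(?:_(?P<tag>R?[12]))?\.(?:fastq|fq)(?:\.gz)?$")


def scan_reads(filenames: List[str]) -> Tuple[List[str], bool]:
    """Return (sorted sample names, single_end) inferred from the file names."""
    tags_by_sample: Dict[str, Set[str]] = {}
    for name in filenames:
        match = READ_FILE.match(name)
        if match is None:
            continue  # not a reads file under the assumed naming scheme
        tag = (match.group("tag") or "").lstrip("R")  # "R1" and "1" are equivalent
        tags_by_sample.setdefault(match.group("sample"), set()).add(tag)
    # The dataset counts as paired end only if every sample has both mates.
    paired = bool(tags_by_sample) and all(
        tags == {"1", "2"} for tags in tags_by_sample.values()
    )
    return sorted(tags_by_sample), not paired


samples, single_end = scan_reads(
    ["A_1.fastq.gz", "A_2.fastq.gz", "B_R1.fastq.gz", "B_R2.fastq.gz"]
)
print(samples, single_end)
```

The sample names returned by such a scan are what would then be compared with the "Sample" column of the metadata file.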
The "Metadata" variables are checked: if any of them is not present in the metadata file, the script throws an error listing the detected columns to simplify troubleshooting.
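A check of this kind can be sketched with the standard csv module. The function check_metadata and its error message are hypothetical; the only behaviour taken from the text is that a missing variable raises an error listing the detected columns, and that samples are compared with the "Sample" column.

```python
import csv
import io
from typing import Iterable, List, TextIO


def check_metadata(handle: TextIO, samples: Iterable[str],
                   variables: Iterable[str]) -> List[str]:
    """Validate the metadata columns; return samples absent from the table."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    # The "Sample" column is required in addition to the requested variables.
    missing = [name for name in ["Sample", *variables] if name not in columns]
    if missing:
        raise ValueError(
            f"Column(s) {missing} not found; detected columns: {columns}"
        )
    listed = {row["Sample"] for row in reader}
    return sorted(set(samples) - listed)


# Hypothetical metadata table with two samples and one variable.
table = io.StringIO("Sample,Treatment\nA,control\nB,treated\n")
print(check_metadata(table, ["A", "B", "C"], ["Treatment"]))  # ['C']
```

Listing the detected columns in the error message makes a misspelled header or a wrong delimiter easy to spot.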
If you plan to run the pipeline on a single node or virtual machine, run the script with -l (--local-run): in this case the total RAM and the number of cores available are also written to the configuration file.
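How the script measures these resources is not documented here. As a rough sketch using only the Python standard library, the values could be obtained as follows; the function local_resources is hypothetical, and the RAM query relies on os.sysconf names that are available on Linux but not on every platform.

```python
import os
from typing import Optional, Tuple


def local_resources() -> Tuple[int, Optional[int]]:
    """Return (number of cores, total RAM in bytes or None if unknown)."""
    cores = os.cpu_count() or 1
    try:
        # Page size times number of physical pages gives the total memory.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        ram_bytes = None  # os.sysconf or these names are unavailable here
    return cores, ram_bytes


cores, ram_bytes = local_resources()
print(f"cores: {cores}")
if ram_bytes is not None:
    print(f"RAM: {ram_bytes / 1024 ** 3:.1f} GiB")
```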
Example output
params {
config_profile_name = 'MetaPhage project'
config_profile_description = 'MetaPhage analysis configuration'
// INPUT PATHS
readPath = "/home/ubuntu/volume/metaphage-test/input-change"
fqpattern = "_{1,2}.fastq.gz"
metaPath = "/home/ubuntu/volume/metaphage-test/input-change/metadata"
dbPath = "/qib/platforms/Informatics/telatin/git/MetaPhage/db"
// OUTPUT/WORKING PATHS
outdir = "/home/ubuntu/volume/metaphage-test/MetaPhage"
temp_dir = "/tmp"
// METADATA
metadata = true
virome_dataset = true
singleEnd = false
sum_viol_var = "Infant_delivery_type"
heatmap_var = "Infant_delivery_type"
alpha_var1 = "Infant_delivery_type"
alpha_var2 = "Infant_formula_type"
beta_var = "Infant_delivery_type"
violin_var = "Infant_delivery_type"
}
How to use it
nextflow run main.nf -c project.conf [other options]
The project.conf file is the output of newProject.py and contains the information required to locate the files and generate the plots. It does not specify how to run the pipeline (using a scheduler…) or where to find the required programs.
If you run the pipeline locally, you can do so inside the appropriate conda environment, which removes the need to specify where the programs are. Alternatively, you can use Singularity or Docker. If you pre-downloaded the Singularity image (for example for off-line execution) you can add -with-singularity PATH_TO_IMAGE.
Similarly, you can pre-download the Docker image and then add -with-docker andreatelatin/metaphage:1.0.
See also: