This is a tutorial on how to run MIRA 3.4 on a small genome. Note that Mira 3.9 is the better program to use and it uses manifest files to specify the run parmeters. The input data is from the file:
galaxy_filter_by_quality_on_data_16_solexa.fastq. It's about 1.6 GB in size and has TODO sequences.
For detailed information read the following manuals: Definitive Guide To MIRA_3.4.pdf or the later version Definitive Guide To MIRA_3.9.pdf For a brief starter see "Chapter 4: Short usage introduction to MIRA3" in either of the above documents.
Prepare Input Files
For this example we will use the file: galaxy_filter_by_quality_on_data_16_solexa.fastq . This file is 1.6 GB in size and contains 6,511,619 sequences (26,046,476 lines). You can see how to do a line count here using the UNIX word count (wc) utility. By using the
-l option it will count lines instead of words.
$ wc -l galaxy_filter_by_quality_on_data_16_solexa.fastq $ 26046476 galaxy_filter_by_quality_on_data_16_solexa.fastq
However this will take too long to run as a small test on the cluster so what we can do is to reduce its size by a quarter. Fasta format files have exactly a multiple of 4 lines so we can divide the file by 4 using the UNIX split utility like this:
split -l 6511619 galaxy_filter_by_quality_on_data_16_solexa.fastq z_
(26,046,476 lines / 4 = 6,511,619 lines so we tell wc above to split the file and put 6511619 lines in each file. It will create 4 files z_aa, z_ab, z_ac & z_ad.)
Let's use the small file "z_aa"
mv z_aa galaxy_small.fastq
Summary: lines in file:
galaxy_filter_by_quality_on_data_26M_solexa.fastq 26,046,476 galaxy_6M_lines_solexa.fastq 6,511,619 OK speed, takes a few minutes. galaxy_1M_lines_solexa.fastq 1,627,904 Too quick
Input Files Required and Output Expected
Your run script: galaxy.sh Input data: galaxy_small.fastq (~ 400 MB) galaxy.manifest <-- only required for version 3.9 Output files / directories produced: galaxy.e156 <-- error file (stderr), the number will be different galaxy.o156 <-- output file (stdout), the number will be different galaxy_test_assembly/ directory galaxy_test_d_chkpt/ galaxy_test_d_info/ galaxy_test_d_results/ galaxy_test_d_tmp/
This is what the script run_galaxy.sh might look like: click to show
Run miramem to estimate your memory usage. Just start the program and answer the questions.
Submit your job to the queuing system. Don't forget to adjust the PBS parameters in your submission script based on the Miramen guesses above.
$ qsub run_galaxy.sh
Check the queue to see it is there and its status.
$ qstat Job id Name User Time Use S Queue ---------------- ---------------- ---------------- -------- - ----- 211.hpcnode1 scaffold.build. 10053650 570:36:5 R workq
235.hpcnode1 Histomonas.map. 10053650 0 Q workq
236.hpcnode1 galaxy.sh mlake 0 Q workq
The S column above indicates the job's state:
Q Job is queued.
R Job is running.
E Job is exiting after having run.
F Job is finished.
H Job is held.
S Job is suspended.
Here we see that its status is "Q" meaning that its in the queue, scheduled to be run. If you run qstat again at some point you will see its status will have changed to "R" when it's running.
Also note its "Job ID". You can use this to remove a job from the queue using
qdel job_id, e.g.
$ qdel 236
After the Run
Keep checking the qstat command and your working directory to see if your program has finished. If the program aborts or is killed for some reason then information may be in the error file.
When the run is finished you will have these files:
galaxy.sh.e275 <-- small, only a few lines galaxy.sh.o275 <-- largish, about 100 MB
and these directories:
$ ls test_assembly/ test_d_chkpt test_d_info test_d_results test_d_tmp $
Here is the
tcmalloc: large alloc 1510699008 bytes == 0x2160000 @ tcmalloc: large alloc 1698308096 bytes == 0x1804d1000 @ tcmalloc: large alloc 2337718272 bytes == 0x1e5d73000 @ tcmalloc: large alloc 1758498816 bytes == 0x1e5d73000 @ tcmalloc: large alloc 1930358784 bytes == 0x1e5d73000 @ tcmalloc: large alloc 3509243904 bytes == 0x271f5f000 @ real 104m 39.479s user 126m 23.940s sys 145m 42.007s
Next Generation Sequencing (NGS)/De novo RNA assembly at Wikibooks.
Assembly of 454 data with MIRA3 This is for MIRA version 3.4.0. It was written by Bastien Chevreux, who wrote MIRA.
Sheffield University, Molecular Ecology Lab Various Perl programs for Sequence manipulation, blast, screening, alignment, assembly, file conversion etc.
Location for some data: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_1.fastq.gz