MIRA Version 3.4 Howto

This is a tutorial on how to run MIRA 3.4 on a small genome. Note that Mira 3.9 is the better program to use and it uses manifest files to specify the run parmeters. The input data is from the file: galaxy_filter_by_quality_on_data_16_solexa.fastq. It's about 1.6 GB in size and has TODO sequences.

For detailed information read the following manuals: Definitive Guide To MIRA_3.4.pdf or the later version Definitive Guide To MIRA_3.9.pdf For a brief starter see "Chapter 4: Short usage introduction to MIRA3" in either of the above documents.

Prepare Input Files

For this example we will use the file:  galaxy_filter_by_quality_on_data_16_solexa.fastq . This file is 1.6 GB in size and contains 6,511,619 sequences (26,046,476 lines). You can see how to do a line count here using the UNIX word count (wc) utility. By using the -l option it will count lines instead of words.

$ wc -l galaxy_filter_by_quality_on_data_16_solexa.fastq 
$ 26046476  galaxy_filter_by_quality_on_data_16_solexa.fastq

However this will take too long to run as a small test on the cluster so what we can do is to reduce its size by a quarter. Fasta format files have exactly a multiple of 4 lines so we can divide the file by 4 using the UNIX split utility like this:

split -l 6511619 galaxy_filter_by_quality_on_data_16_solexa.fastq z_

(26,046,476 lines / 4 = 6,511,619 lines so we tell wc above to split the file and put 6511619 lines in each file. It will create 4 files z_aa, z_ab, z_ac & z_ad.)

Let's use the small file "z_aa" mv z_aa galaxy_small.fastq

Summary: lines in file:

galaxy_filter_by_quality_on_data_26M_solexa.fastq  26,046,476  
galaxy_6M_lines_solexa.fastq                        6,511,619  OK speed, takes a few minutes.
galaxy_1M_lines_solexa.fastq                        1,627,904  Too quick

Input Files Required and Output Expected

Your run script:
    galaxy.sh

Input data:
    galaxy_small.fastq   (~ 400 MB)
    galaxy.manifest <-- only required for version 3.9 

Output files / directories produced:
    galaxy.e156    <-- error file (stderr), the number will be different
    galaxy.o156    <-- output file (stdout), the number will be different
    galaxy_test_assembly/ directory 
        galaxy_test_d_chkpt/
        galaxy_test_d_info/
        galaxy_test_d_results/
        galaxy_test_d_tmp/

This is what the script run_galaxy.sh might look like: click to show

Procedure

  1. Run miramem to estimate your memory usage. Just start the program and answer the questions.

    $ /opt/mira-3.9/bin/miramem

  2. Submit your job to the queuing system. Don't forget to adjust the PBS parameters in your submission script based on the Miramen guesses above.

    $ qsub run_galaxy.sh

  3. Check the queue to see it is there and its status.

    $ qstat Job id Name User Time Use S Queue ---------------- ---------------- ---------------- -------- - ----- 211.hpcnode1 scaffold.build. 10053650 570:36:5 R workq
    235.hpcnode1 Histomonas.map. 10053650 0 Q workq
    236.hpcnode1 galaxy.sh mlake 0 Q workq
    $

The S column above indicates the job's state:
Q Job is queued.
R Job is running.
E Job is exiting after having run.
F Job is finished.
H Job is held.
S Job is suspended.

Here we see that its status is "Q" meaning that its in the queue, scheduled to be run. If you run qstat again at some point you will see its status will have changed to "R" when it's running.

Also note its "Job ID". You can use this to remove a job from the queue using qdel job_id, e.g.

$ qdel 236

After the Run

Keep checking the qstat command and your working directory to see if your program has finished. If the program aborts or is killed for some reason then information may be in the error file.

When the run is finished you will have these files:

galaxy.sh.e275 <-- small, only a few lines
galaxy.sh.o275 <-- largish, about 100 MB

and these directories:

$ ls test_assembly/
  test_d_chkpt  
  test_d_info  
  test_d_results  
  test_d_tmp
$ 

Here is the galaxy.sh.e275 file:

tcmalloc: large alloc 1510699008 bytes == 0x2160000 @ 
tcmalloc: large alloc 1698308096 bytes == 0x1804d1000 @ 
tcmalloc: large alloc 2337718272 bytes == 0x1e5d73000 @ 
tcmalloc: large alloc 1758498816 bytes == 0x1e5d73000 @ 
tcmalloc: large alloc 1930358784 bytes == 0x1e5d73000 @ 
tcmalloc: large alloc 3509243904 bytes == 0x271f5f000 @ 

real    104m 39.479s
user    126m 23.940s
sys     145m 42.007s

Useful References

Next Generation Sequencing (NGS)/De novo RNA assembly at Wikibooks.

Assembly of 454 data with MIRA3 This is for MIRA version 3.4.0. It was written by Bastien Chevreux, who wrote MIRA.

Sheffield University, Molecular Ecology Lab Various Perl programs for Sequence manipulation, blast, screening, alignment, assembly, file conversion etc.

Location for some data: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_1.fastq.gz