Here are some more advanced things you can do using bash once you start feeling more comfortable.
man bash_command
Although I give examples of commands below with a short description, this only scratches the surface of what they can do. If you have questions about a command, you should always check its man pages, which provide the full documentation. If there are still questions after viewing the man pages, there is always Google and Stack Overflow.
ls | grep -c 'string'
Lists all files in the current directory, then counts how many contain a particular string (the -c flag makes grep return a count rather than the matching lines).
ls *string | sort
Lists all the files in a directory ending with the specified string and then sorts them alphabetically. I typically use this to generate slurm array file lists for processing fastqs (add > file_name to save the list to a file). I also use cat *string > file_name to concatenate fastq files for genome/transcriptome assemblies.
history
Returns your most recently issued commands (the number kept is controlled by the HISTSIZE shell variable).
history | grep 'string'
I usually combine history with grep to look for specific commands that I previously executed.
rm file_name
Deletes a file.
cat file_name | grep -Anumber 'string'
Searches a file for a string and returns each matching line AND the given number of lines after it (e.g., grep -A5 prints the five lines that follow each match).
for FILE in *file_ending; do mv -v "$FILE" "${FILE//string_to_replace/new_string}"; done
Loops through the files in the current directory ending with file_ending and renames each one, replacing string_to_replace with new_string.
cat list_of_commands.txt | xargs -I cmd --max-procs=2 bash -c cmd &
This one-liner runs a list of commands in parallel, two at a time (--max-procs=2). The list of commands can be generated by using the for loop above to write commands to a file, one per line (see the sketch below). Also note that GNU Parallel can accomplish the same thing.
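For example, a minimal sketch that writes one command per fastq file (gzip here is just a stand-in for whatever you actually want to run on each file) and then executes them two at a time:
for FILE in *.fastq; do echo "gzip $FILE" >> list_of_commands.txt; done
cat list_of_commands.txt | xargs -I cmd --max-procs=2 bash -c cmd &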
ls *.fa | xargs -I file -n 1 -P 2 /Applications/muscle3.8.31_i86darwin64 -in file -out file.aln
Similar to the one-liner above, this one takes all of the fasta files in a directory and estimates alignments two at a time (i.e., ‘-P 2’). Note you don’t have to add ‘$’ to the variable ‘file’ coming through the pipe. In addition, xargs can handle appending strings with no additional command (e.g., ‘-out file.aln’ just adds ‘.aln’ to the end of the file name).
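Downloading data from the NCBI Sequence Read Archive (SRA) takes three steps: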
1) Get the web address to the .sra file
2) Download the file to a local directory on your personal computer or HPC:
curl /web/address/to/sra/file/name_of_sra_file --output /path/where/you/want/.sra/file/to/live/name_of_sra_file
3) Extract .fastq files from .sra file:
/path/to/binary/fastq-dump.2.3.3-3 --split-files name_of_sra_file
The --split-files flag writes paired-end reads to separate _1 and _2 fastq files.
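Below is an example SGE submission script that runs an array of 100 BEAST 2 jobs, one xml input file per task: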
#!/bin/bash
# ---------------------------
#$ -S /bin/bash
# job name
#$ -N FF1b_beast2
# The next line says to run an array of 100 jobs
#$ -t 1-100
# email
#$ -M add email address here
# -m be
#
# output file
#$ -o /home/beast/out-$JOB_ID-$TASK_ID
# error file
#$ -e /home/beast/err-$JOB_ID-$TASK_ID
#
cd /home/beast/
#$ -cwd
module load beast/2.1.2
# The next line pulls the line matching the task id from a text file of beast xml file names (one per line) to use as input
INFILE=$(sed -n "$SGE_TASK_ID"p my_file_list.txt)
# Generate the output file name by using search and replace in the input file name
OUTFILE=${INFILE/.xml/.result}
#Run Beast
beast "$INFILE" > "$OUTFILE"
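Save the script to a file (e.g., beast_array.sh; the name is arbitrary) and submit it to the scheduler with qsub beast_array.sh.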
I am starting to get in the habit of using jupyter notebooks to work on code, but also to render html presentations. Doing it this way saves a TON of time and limits having to use PowerPoint (ugh!). Also, if you keep your presentations in a git repo, they are always available. The main driver behind this is Reveal.js, the open-source library for rendering html presentations. I only know little bits of html, but jupyter notebooks have a built-in function to convert a notebook to an html reveal presentation. Here are the main steps:
jupyter-nbconvert --to slides <jupyter_notebook_name> --reveal-prefix=<path/to/reveal.js/dir>
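A minimal sketch, assuming a notebook named talk.ipynb and a local clone of reveal.js (both names are placeholders):
git clone https://github.com/hakimel/reveal.js.git
jupyter-nbconvert --to slides talk.ipynb --reveal-prefix=reveal.js
The conversion writes talk.slides.html, which you can open in a browser.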
Sometimes I have to download large HiSeq runs (GBs to TBs) from a remote server. Below are the commands I issue in the terminal using GNU Screen and scp. In the past I have also used Globus, a paid service with a GUI interface (and command line) that many universities subscribe to.
1) Log into the remote server (e.g., via ssh).
2) Start a named screen session: screen -S <session name>
3) Start the transfer: scp -P<port#> <wanted_file> <usr_name>@<local_ip_address>:/local/directory/path
4) Detach the screen session by pressing Ctrl+a, then d. The transfer keeps running after you detach and log out.
5) To re-attach, log back into the server (step 1) and type screen -r <session name>
Occasionally I need to delete VERY large directories in my file system (local or HPC). I usually use rm to remove files (e.g., rm <file_name>) and directories (e.g., rm -rf <directory>), but it is exceptionally slow for large directories. I recently found a workaround on Stack Exchange that significantly speeds up the process:
mkdir <empty_directory>
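The workaround mirrors this new, empty directory over the one you want gone; rsync's --delete flag removes everything in the destination that is not in the (empty) source, which is much faster than rm -rf for huge directories. A sketch, with both directory names as placeholders:
rsync -a --delete <empty_directory>/ <directory_to_delete>/
rmdir <empty_directory> <directory_to_delete>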
I use git with Bitbucket to version control my scripts. These commands assume you have created a git repo on Bitbucket and initialized it locally on your personal computer (git init).
1) Make sure the local repo is current by executing git pull from the local repo directory. You can also check whether your local repo is up-to-date by executing git status.
2) Stage the file you changed: git add file_name
3) Commit the change: git commit -m "info_about_the_commit" (the message should convey the status of the committed script).
4) Push the commit to Bitbucket: git push
Below are some python snippets I use all the time.
Read file by line
with open('file_name.txt') as f:
    for line in f:
        print(line)  # do something with the line
Read file by line and look for lines that start with ‘>’
with open('file_name.txt') as f:
    for line in f:
        if line.startswith('>'):
            print(line)  # do something with the header line
Read tab-delimited file by line; split each line by tabs; add column 2 element to a list
column2_list = []  # empty list to store column 2 elements
with open('file_name.txt') as f:
    for line in f:
        elements = line.rstrip('\n').split('\t')
        wanted_element = elements[1]  # python uses zero-based counting, so column 2 is index 1
        column2_list.append(wanted_element)
Loop through fasta files in a directory with a file ending of “.fa”
import glob
for fasta_file in glob.glob("*.fa"):
    print(fasta_file)  # do something with the fasta file
An example of spawning multiple processes across computer cores using the python multiprocessing module. The example below aligns all the fastas in a directory, spawning a number of processes equal to the number of cores in the computer.
import multiprocessing
import os
import time

def get_files(dir_path, file_ending):
    '''get the files in a directory with a particular ending and put them in a list'''
    file_list = [f for f in os.listdir(dir_path) if f.endswith(file_ending)]
    return file_list

def run_muscle(x):
    '''runs an alignment through muscle'''
    outfile = x[:-3] + '_aligned.fa'
    command = '/Applications/muscle3.8.31_i86darwin64 -in %s -out %s' % (x, outfile)
    os.system(command)

def start_process():
    print('Starting', multiprocessing.current_process().name)

start = time.time()
# Get the fasta files in the directory (assumes the script is run from that directory)
fastas = get_files('/Users/owen/Documents/python/tmp', '.fa')
# Make the Pool of workers based on my number of cores
pool_size = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=pool_size, initializer=start_process)
pool.map(run_muscle, fastas)
pool.close()
pool.join()
print('It took', time.time() - start, 'seconds')
Retrieves the taxonomic ranks for a list of NCBI taxon ids and writes each lineage as a single csv line. The taxon ids need to be in a separate file, one per line. The script can be executed by typing:
python script_name file_with_taxon_ids.txt outfile_name
import sys
import csv
import time
from Bio import Entrez

infile = sys.argv[1]
output_csv = sys.argv[2]

def get_tax_data(taxid):
    """once we have the taxid, we can fetch the record"""
    search = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    return Entrez.read(search)

start = time.time()
bacteria_taxonomy = open(output_csv, 'w')
# the ranks to extract; the same list doubles as the csv header
ranks = ['superkingdom', 'kingdom', 'phylum', 'class', 'order', 'suborder', 'family', 'subfamily', 'genus', 'subgenus', 'species', 'no rank']
w = csv.writer(bacteria_taxonomy)
w.writerow(ranks)
Entrez.email = "you@email.com"  # enter your email address here
if not Entrez.email:
    print("you must add your email address")
    sys.exit(2)
lines = open(infile).read().splitlines()
counter = 0
for line in lines:
    print(counter, line)
    data = get_tax_data(str(line))
    lineage = {d['Rank']: d['ScientificName'] for d in data[0]['LineageEx'] if d['Rank'] in ranks}
    row = []
    for r in ranks:
        if r in lineage:
            row.append(lineage[r])
        else:
            row.append('NA')
    w.writerow(row)
    time.sleep(2)  # be polite to NCBI's servers
    counter += 1
bacteria_taxonomy.close()
print('It took', time.time() - start, 'seconds')
For an introduction to string slicing and lists, go here.