Just a collection of useful one-liners, often a bit too long to remember.
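Extract a column by its header name. Pretty useful for big tables (source):
awk -F'\t' -v c='colname' 'NR==1{for (i=1; i<=NF; i++) if ($i==c){p=i; break}; next} {print $p}' myfile.tsv
Apparently this has issues if you want to retrieve the column in the last position: it returns the whole line. That is what awk does whenever no header matches: p stays unset, so $p is the same as $0.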
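One possible cause of the last-column problem (an assumption, the source does not confirm it) is a file with Windows line endings: the last header then carries a trailing carriage return and never equals the requested name. A minimal sketch that strips the carriage return and stops with an error when the name is not found; the sample table and the column name coverage are made up for illustration:

```shell
# Sample table with Windows line endings (made-up data).
tmp=$(mktemp -d)
printf 'id\tlength\tcoverage\r\ng1\t900\t12.5\r\ng2\t450\t3.1\r\n' > "$tmp/myfile.tsv"

col_values=$(awk -F'\t' -v c='coverage' '
  { sub(/\r$/, "") }                      # drop a trailing carriage return, if present
  NR == 1 {
    for (i = 1; i <= NF; i++) if ($i == c) { p = i; break }
    if (!p) { print "column not found: " c > "/dev/stderr"; exit 1 }
    next
  }
  { print $p }
' "$tmp/myfile.tsv")

echo "$col_values"
rm -rf "$tmp"
```

With the sample table this prints the two coverage values, one per line.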
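Alternative (source):
awk -F'\t' -v colname='colname' '{if(NR==1) for(i=1;i<=NF;i++) { if($i~colname) { colnum=i;break} } else print $colnum}' myfile.tsv
Note that $i~colname is a regular-expression match, so the first header that contains colname is selected, not necessarily an exact match.

Extract a subset of sequences from a fasta file:
awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' ids.txt myseqs.fasta
Where ids.txt is a list of the names of the sequences to extract from myseqs.fasta.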
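A small worked example of the extraction, using made-up sequences and IDs. Each line of ids.txt has to equal the full header text after the > character, because the lookup key is the whole second field:

```shell
tmp=$(mktemp -d)
# Three toy records; seq1 spans two sequence lines.
printf '>seq1\nACGT\nGGCC\n>seq2\nTTTT\n>seq3\nCCCC\n' > "$tmp/myseqs.fasta"
printf 'seq1\nseq3\n' > "$tmp/ids.txt"

# First file: store each ID. Second file: on header lines set the flag f,
# then print every line while f is true.
picked=$(awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' \
  "$tmp/ids.txt" "$tmp/myseqs.fasta")

echo "$picked"
rm -rf "$tmp"
```

This keeps seq1 (both sequence lines) and seq3, and drops seq2.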
Remove fasta records that have a header but no sequence:
awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} $2 {print ">"$0}' myseqs.fasta
Useful before doing any work with the amino acid gene calls exported by anvi-get-sequences-for-gene-calls, as that export keeps the headers of non-protein-coding genes, which have no sequence.
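To see what the filter does, here is a sketch on a made-up file in which gene2 has a header but no sequence. Each record is the text between two > characters, and a record is printed only when its second line (the first sequence line) is non-empty:

```shell
tmp=$(mktemp -d)
# gene2 mimics a non-coding gene call: header only, no sequence.
printf '>gene1\nMKV\nLLA\n>gene2\n>gene3\nMST\n' > "$tmp/myseqs.fasta"

kept=$(awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} $2 {print ">"$0}' "$tmp/myseqs.fasta")

echo "$kept"
rm -rf "$tmp"
```

The output contains gene1 and gene3 with their sequences; the empty gene2 record is gone.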
This will add up the values of column 2, giving the total sum for each unique value in column 1 (source).
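To illustrate this grouped sum on a tiny made-up table (file contents are assumptions), here is a sketch of the same idea. It sets OFS once in a BEGIN block and pipes the result through sort, because awk does not guarantee the order of a for (i in a) loop:

```shell
tmp=$(mktemp -d)
printf 'geneA\t2\ngeneB\t5\ngeneA\t3\ngeneB\t1.5\n' > "$tmp/myfile.tsv"

# a[key] accumulates column 2 for each distinct value of column 1.
sums=$(awk -F'\t' 'BEGIN {OFS="\t"} {a[$1] += $2} END {for (i in a) print i, a[i]}' \
  "$tmp/myfile.tsv" | sort)

echo "$sums"
rm -rf "$tmp"
```

Here geneA sums to 5 and geneB to 6.5.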
awk -F "\t" '{a[$1] += $2; OFS="\t"} END {for (i in a) print i, a[i]}' myfile.tsv

Trim fasta headers at a given character:
awk 'BEGIN{OFS=FS=" "}{if(/^>/){NF--}}{print $1}' myseqs.fasta
Change the OFS=FS=" " to the character marking where you want to delete from.
Convenient for cleaning fasta headers for phylo work.
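One thing to watch with the awk version: decrementing NF on a header that has no delimiter leaves it with zero fields, so an empty header line may be printed (this is how gawk behaves; other awk implementations may differ). A minimal sketch of a different technique, cutting the header with index and substr instead of NF, so headers without the delimiter pass through unchanged. The variable name d and the sample records are assumptions for illustration:

```shell
tmp=$(mktemp -d)
# seq1 has a description after a space; seq2 has no space at all.
printf '>seq1 Escherichia coli\nACGT\n>seq2\nGGCC\n' > "$tmp/myseqs.fasta"

trimmed=$(awk -v d=' ' '
  /^>/ { pos = index($0, d); if (pos) $0 = substr($0, 1, pos - 1) }  # cut header at first d
  { print }
' "$tmp/myseqs.fasta")

echo "$trimmed"
rm -rf "$tmp"
```

Change the value of d to trim at another character, for example a pipe.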
This is another alternative that should work in most cases:
cut -f1 -d " " myseqs.fasta

Sort a table on several columns:
sort -k1,1n -k 2,2 -k5,5gr myfile.tsv
Sort first by the number in column 1, then by standard (alphabetic) order of column 2, and finally do a "general numeric sort" in reverse order on column 5 (i.e. it recognises scientific notation and sorts high to low).
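A small worked example of the three-key sort on made-up data. Two additions here are my own assumptions rather than part of the original command: an explicit tab separator via -t, and LC_ALL=C so the alphabetic key sorts the same way on every system. The g flag is available in GNU sort; other sort implementations may not support it:

```shell
tmp=$(mktemp -d)
printf '2\tb\tx\ty\t1e3\n1\tb\tx\ty\t5\n1\ta\tx\ty\t2e-1\n1\ta\tx\ty\t3e2\n10\ta\tx\ty\t7\n' \
  > "$tmp/myfile.tsv"

tab=$(printf '\t')
# Key 1: numeric. Key 2: alphabetic. Key 5: general numeric, reversed.
sorted=$(LC_ALL=C sort -t "$tab" -k1,1n -k 2,2 -k5,5gr "$tmp/myfile.tsv")

echo "$sorted"
rm -rf "$tmp"
```

Within the rows where column 1 is 1 and column 2 is a, 3e2 (300) comes before 2e-1 (0.2), which a plain numeric sort would not handle as intended.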
Count the total number of bases in a FastQ file:
cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
paste joins every 4 lines into one line with 4 tab-separated columns.
The second line in each FastQ record is the actual sequence (we grab them with cut).
tr is used to remove end of line (\n) characters, otherwise wc counts them.
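The pipeline can be checked on a tiny made-up FastQ file. The sketch below also computes the same total with awk as a cross-check; that second command is my own alternative, not from the source. Both assume strict 4-line records, i.e. sequences that are not wrapped over several lines:

```shell
tmp=$(mktemp -d)
# Two reads: 4 bases and 5 bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n' > "$tmp/test.fastq"

# Original pipeline; the final tr strips padding that some wc versions add.
total_bases=$(cat "$tmp/test.fastq" | paste - - - - | cut -f 2 | tr -d '\n' | wc -c | tr -d ' ')

# Cross-check: add up the length of every 2nd line of each 4-line record.
awk_bases=$(awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }' "$tmp/test.fastq")

echo "$total_bases $awk_bases"
rm -rf "$tmp"
```

Both counts come out as 9 for this file.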