Jim Cheung

Tuesday, November 07, 2017

Rich Hickey's talk: Effective Programs - 10 Years of Clojure (video, transcript, discussions on hackernews) sparked lots of discussion about types

and I think this is also related:


a couple of new tools for clojure:

and again, Clojure transducers from the ground up: using them in practice.


some good reads from the past couple of weeks:


after reading Learn to use Awk with hundreds of examples and Why Learn AWK?,

I'm reading The AWK Programming Language (download pdf from archive.org) again


started coding in python, because all my teammates are python people

emacs surely has great python support; install the elpy and company-jedi packages

then I can eval the buffer with C-c C-c, just like cider (more on elpy)

it also supports virtualenv: once you've created a virtualenv folder in your project, use M-x pyvenv-activate to select it and your repl will then work within that env.


a snippet for catching python exceptions with a traceback:

import logging

logging.basicConfig(filename='/tmp/out.log', level=logging.ERROR)

try:
    run_my_stuff()
except Exception:  # a bare except would also swallow KeyboardInterrupt/SystemExit
    logging.exception('Got exception on main handler')

I'm learning to write good python code as well:

Monday, November 20, 2017

an awk script to extract slow queries:

awk '/Query_time: [2-9]{2,}/{a=1;print "";print NR;print l;print;next}/Query_time/{a=0}/Id:/{l=$0;next}a>0' mysql-slowquery.log
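the same idea sketched in python — a rough equivalent, using a numeric threshold instead of the digit-pattern regex (the line-format details are my assumptions, not from the awk one-liner):

```python
import re

def slow_queries(lines, threshold=10.0):
    """Collect query blocks whose Query_time meets the threshold.

    Mirrors the awk one-liner's flag-based approach: a '# Query_time:'
    header opens a block, and the next comment header closes it.
    """
    blocks, current = [], None
    for line in lines:
        m = re.search(r'Query_time: (\d+(?:\.\d+)?)', line)
        if m:
            if current:
                blocks.append(current)
            current = [line] if float(m.group(1)) >= threshold else None
        elif line.startswith('#'):
            if current:
                blocks.append(current)
            current = None
        elif current is not None:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks
```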


some notes when learning postgresql:

switch to postgres user:

> sudo -iu postgres

By default, the Postgres authentication system also assumes that there is a database with the same name as the role being used to log in, and that the role has access to it.

include a sql file: # \i /path/to/file.sql

describe tables: # \d table_name

import from csv file:

# copy table_name from '/path/to/csv_file' CSV HEADER;

to use crosstab, first enable tablefunc:

# CREATE extension tablefunc;

The SELECT query fed to crosstab must return 3 columns:

  1. a row identifier
  2. a category
  3. a value

In practice the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered

use the crosstab form with a second parameter, a query that returns the complete list of categories.
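the same reshaping, done in pandas for comparison (made-up data; DataFrame.pivot plays the role of crosstab here, turning the (identifier, category, value) rows into a wide table):

```python
import pandas as pd

# rows in the (identifier, category, value) shape that crosstab expects
long_df = pd.DataFrame({
    'id':       ['a', 'a', 'b', 'b'],
    'category': ['x', 'y', 'x', 'y'],
    'value':    [1, 2, 3, 4],
})

# pivot turns each category into its own column, one row per identifier
wide = long_df.pivot(index='id', columns='category', values='value')
```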

export to csv

COPY (select id,name from tablename) TO 'filepath/aa.csv' DELIMITER ',' CSV HEADER;

reference:


I also started learning jupyter (ipython, pandas, etc.)

start: jupyter notebook

shortcuts:

Cell shortcuts:

Shortcut	Action
Shift-Enter	run cell
Ctrl-Enter	run cell in-place
Alt-Enter	run cell, insert below
Ctrl-m x	cut cell
Ctrl-m c	copy cell
Ctrl-m v	paste cell
Ctrl-m d	delete cell
Ctrl-m z	undo last cell deletion
Ctrl-m -	split cell
Ctrl-m a	insert cell above
Ctrl-m b	insert cell below
Ctrl-m o	toggle output
Ctrl-m O	toggle output scroll
Ctrl-m l	toggle line numbers
Ctrl-m s	save notebook
Ctrl-m j	move cell down
Ctrl-m k	move cell up
Ctrl-m y	code cell
Ctrl-m m	markdown cell
Ctrl-m t	raw cell
Ctrl-m 1-6	heading 1-6 cell
Ctrl-m p	select previous
Ctrl-m n	select next
Ctrl-m i	interrupt kernel
Ctrl-m .	restart kernel
Ctrl-m h	show keyboard shortcuts

for db passwords, use dotenv:

  1. pip3 install python-dotenv
  2. %load_ext dotenv
  3. %dotenv
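outside jupyter, python-dotenv's load_dotenv() does the same job; a minimal stdlib-only sketch of the idea (simplified parser, and the load_env name is mine):

```python
import os

def load_env(path):
    """Read KEY=VALUE lines from a .env-style file into os.environ.

    A simplified sketch of what python-dotenv does: skips blank lines
    and comments, and doesn't overwrite variables that are already set.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, _, value = line.partition('=')
            os.environ.setdefault(key.strip(), value.strip())
```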

pandas

10 Minutes to pandas

axis:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
                |         |
                | axis=0  |
                ↓         ↓
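to make the diagram concrete, with a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# axis=0 collapses down the rows: one result per column
col_sums = df.sum(axis=0)

# axis=1 collapses across the columns: one result per row
row_sums = df.sum(axis=1)
```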

Thursday, November 23, 2017

continue playing with jupyter and pandas:

jupyter

magics

%who  list var names
%whos list var names with data
%who_ls Return a sorted list of all interactive variables
%edit This will open an editor
%hist This presents a list of the last several input command lines.
%rerun  Re-run previous input
%save [options] filename n1-n2 n3-n4 ... n5 .. n6 ...
%notebook filename: Export and convert IPython notebooks.
%dhist history of visited directories
use %cd -<n> to go to directory number <n>

access cell history:

_ previous cell output
__  next previous
___ next next previous
_13  output of cell 13
_i13  content of input cell 13
In[13]  input of cell 13
Out[13] output of cell 13
In[13:18] range
exec(In[13])  execute cell 13

Once you run %rehashx, everything on your $PATH is loaded as IPython aliases

SList

String lists (IPython.utils.text.SList) are a handy way to process output from system commands.

They are produced by var = !cmd syntax.

SList has .p, .n, .l, .s, .grep() and .fields() available

The .s, .n, .p properties

SList inherits from the normal python list, so every list method is available:
lines.append('hey')

save session

%save -a [filename] 1 10-16 20

%save -a [filename] ~0

pandas

connect mysql:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('mysql://user:pass@host:port/db')
f = pd.read_sql_query('SELECT * FROM mytable', engine, index_col = 'ID')

glob files:

import os
from glob import glob

data_dir = '/data'
data_files = glob(os.path.join(data_dir, '*.gz'))

read gzip files, first 2 lines:

import gzip

with gzip.open('/tmp/foo.gz') as f:
    head = [next(f) for x in range(2)]
    
field_names = head[1].decode('ascii').split()  # split into a list of column names (assuming whitespace-separated)

combine multiple dataframes:

import pandas as pd

df = pd.concat([pd.read_table(f, compression='gzip', skiprows=2, names=field_names) for f in data_files])

also reading Python for Data Analysis, 2nd Edition


an old article (2015) showed up on hackernews: A CEO's Guide to Emacs

as usual, there are some useful emacs packages in the discussions:

I tried dumb-jump; it's good, it just works.

avy looks the same as ace-jump, which I'm using.

Wednesday, November 29, 2017

some notes on the GNU parallel command:

from https://github.com/LangilleLab/microbiome_helper/wiki/Quick-Introduction-to-GNU-Parallel

You can put options you want to use every time in a global configuration file here: /etc/parallel/config

the above example can use a pipe:

ls test_seq*.fas | parallel --eta -j 2 --load 80% --noswap 'blastp -db pdb_blast_db_example/pdb_seqres.txt -query {} -out blastp_outfiles2/{.}.out -evalue 0.0001 -word_size 7 -outfmt "6 std stitle staxids sscinames" -max_target_seqs 10 -num_threads 1'  

or call commands from a file:

for f in test_seq*.fas; do  
  out=${f/.fas/.out};  
  echo "blastp -db pdb_blast_db_example/pdb_seqres.txt -query $f -out blastp_outfiles3/$out -evalue 0.0001 -word_size 7 -outfmt \"6 std stitle staxids sscinames\" -max_target_seqs 10 -num_threads 1" >> blastp_cmds.txt    
done   

cat blastp_cmds.txt | parallel --eta -j 2 --load 80% --noswap '{}'  

In this case since the whole command is being input we can just refer to the input itself with {}.

from https://www.biostars.org/p/63816/:

It is often faster to write a command using GNU Parallel than to write a for loop:

for i in *gz; do 
  zcat $i > $(basename $i .gz).unpacked
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

Blast on multiple machines

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

Running composed commands

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna


TIL, python 3's print() function can write to file:

with open("test", 'w') as f:
    print('No new line,', file=f, end='')
    print('Has new line', file=f)


played a little bit with sipwitch; there's so little information about it on the web, but luckily I found a detailed setup guide in Raspberry Pi for Secret Agents


I don't remember whether I read this before: Data Science at the Command Line

I'll go over the list again.


Peter Norvig's pytudes: Python programs to practice or demonstrate skills.

TIL github supports .ipynb files, very good for sharing documents mixed with code.
