Tuesday, November 07, 2017

Rich Hickey's talk Effective Programs - 10 Years of Clojure (video, transcript, discussions on Hacker News) has sparked a lot of discussion about types

and I think this is also related:

a couple of new tools for Clojure:

and again, Clojure transducers from the ground up: using them in practice.

some good reads from the past couple of weeks:

after reading Learn to use Awk with hundreds of examples and Why Learn AWK?

I'm reading The AWK Programming Language (download pdf from archive.org) again

started coding in Python, because all my teammates are Python people

of course Emacs has great Python support: install the elpy and company-jedi packages

then I can eval the buffer with C-c C-c, just like CIDER (more on elpy)

elpy also supports virtualenv: once you've created a virtualenv folder in your project, use M-x pyvenv-activate to select it, and your REPL will then work within that env.

a snippet for catching Python exceptions with the full traceback:

import logging

logging.basicConfig(filename='/tmp/out.log', level=logging.ERROR)

try:
    main()
except Exception:
    # logs the message plus the full traceback
    logging.exception('Got exception on main handler')

I'm learning to write good Python code as well:

Monday, November 20, 2017

an awk script to extract slow queries from the MySQL slow query log:

awk '/Query_time: [2-9]{2,}/{a=1;print "";print NR;print l;print;next}/Query_time/{a=0}/Id:/{l=$0;next}a>0' mysql-slowquery.log
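The same extraction can be sketched in plain Python (a rough sketch, not a drop-in replacement: `slow_blocks` is a hypothetical helper, and `min_secs=22` approximates the `[2-9]{2,}` pattern, which only matches times written with at least two 2-9 digits):

```python
import re

def slow_blocks(log_text, min_secs=22.0):
    """Collect slow-query blocks whose Query_time is at least min_secs.

    Mirrors the awk one-liner: remember the last '# Id:' line, then keep
    everything from a slow '# Query_time' line until the next query's header.
    """
    blocks, current, keep = [], [], False
    last_id = ''
    for line in log_text.splitlines():
        m = re.search(r'Query_time: ([0-9.]+)', line)
        if m:
            if keep and current:          # flush the previous slow block
                blocks.append('\n'.join(current))
            keep = float(m.group(1)) >= min_secs
            current = [last_id, line] if keep else []
            continue
        if 'Id:' in line:                 # like l=$0;next in the awk version
            last_id = line
            continue
        if keep:
            current.append(line)
    if keep and current:
        blocks.append('\n'.join(current))
    return blocks
```
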

some notes when learning postgresql:

switch to postgres user:

> sudo -iu postgres

  • > createuser --interactive
  • > dropuser jim

By default, the Postgres authentication system also assumes that there will be a database with the same name as the role being used to log in, which that role has access to.

  • > createdb jim
  • > dropdb jim

include sql file # \i /path/to/file.sql

describe tables

  • # \d
  • # \dt (tables only, without sequences)
  • # \d+ table_name (like desc table_name)

import from csv file:

# copy table_name from '/path/to/csv_file' CSV HEADER;
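Outside psql, the same CSV-with-header import can be sketched with pandas (illustrative only: the CSV content is made up, and an in-memory SQLite connection stands in for the real Postgres one):

```python
import sqlite3
from io import StringIO

import pandas as pd

# Hypothetical CSV content standing in for '/path/to/csv_file'
csv_data = StringIO("id,name\n1,alice\n2,bob\n")

df = pd.read_csv(csv_data)           # the HEADER row becomes the column names
conn = sqlite3.connect(':memory:')   # stand-in for a Postgres connection
df.to_sql('table_name', conn, index=False, if_exists='append')

print(conn.execute('SELECT count(*) FROM table_name').fetchone()[0])
```
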

to use crosstab, first enable tablefunc:

# CREATE extension tablefunc;

The SELECT query fed to crosstab must return 3 columns:

  1. the row identifier
  2. the category
  3. the value

In practice, the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered.

To handle missing categories, use the crosstab form with a second parameter: a query returning the complete list of categories.
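The same identifier/category/value pivot can be sketched in pandas, which is a handy way to sanity-check crosstab output (the data here is made up):

```python
import pandas as pd

# rows of (identifier, category, value), like the SELECT fed to crosstab
rows = pd.DataFrame({
    'id':    [1, 1, 2, 2],
    'cat':   ['a', 'b', 'a', 'b'],
    'value': [10, 20, 30, 40],
})

# pivot: one row per identifier, one column per category
wide = rows.pivot(index='id', columns='cat', values='value')
print(wide)
```
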

export to csv

COPY (select id,name from tablename) TO 'filepath/aa.csv' DELIMITER ',' CSV HEADER;


I'm also starting to learn Jupyter (IPython, pandas, etc.)

start: jupyter notebook

  • edit mode: green border
  • command mode: blue border


  • Esc: switch to command mode
  • [Shift+Enter]: run that cell (and select the cell below it; or, use [Control+Enter] if you want to stay in that cell); this works even if you are in Edit Mode
  • x: delete that cell
  • a: create a new cell above
  • b: create a new cell below
  • m: make that cell a markdown cell
  • l: add line numbers (useful when working with others)
  • y: make that cell a code cell (which is the default for a new cell)
  • [Enter]: enter Edit Mode (turning the cell green and putting the cursor inside)
  • ↑ ↓: use the up and down arrows to move through the cells
  • [Number]: type a number (e.g., 1, 2, 3, …) to add a heading of that size (then use Enter to get into Edit Mode)

Cell shortcuts:

Shortcut	Action
Shift-Enter	run cell
Ctrl-Enter	run cell in-place
Alt-Enter	run cell, insert below
Ctrl-m x	cut cell
Ctrl-m c	copy cell
Ctrl-m v	paste cell
Ctrl-m d	delete cell
Ctrl-m z	undo last cell deletion
Ctrl-m -	split cell
Ctrl-m a	insert cell above
Ctrl-m b	insert cell below
Ctrl-m o	toggle output
Ctrl-m O	toggle output scroll
Ctrl-m l	toggle line numbers
Ctrl-m s	save notebook
Ctrl-m j	move cell down
Ctrl-m k	move cell up
Ctrl-m y	code cell
Ctrl-m m	markdown cell
Ctrl-m t	raw cell
Ctrl-m 1-6	heading 1-6 cell
Ctrl-m p	select previous
Ctrl-m n	select next
Ctrl-m i	interrupt kernel
Ctrl-m .	restart kernel
Ctrl-m h	show keyboard shortcuts

for DB passwords, use dotenv:

  1. pip3 install python-dotenv
  2. %load_ext dotenv
  3. %dotenv


10 Minutes to pandas


|     |    A     |    B    |
|  0  | 0.626386 | 1.52325 |   ----axis=1----->
          |          |
          |  axis=0  |
          ↓          ↓
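To make the diagram concrete, with toy numbers: axis=0 collapses down the rows (one result per column), axis=1 collapses across the columns (one result per row).

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

col_sums = df.sum(axis=0)  # collapse down the rows: one number per column
row_sums = df.sum(axis=1)  # collapse across the columns: one number per row

print(col_sums['A'], col_sums['B'])  # 3 30
print(row_sums[0], row_sums[1])      # 11 22
```
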

Thursday, November 23, 2017

continuing to play with Jupyter and pandas:



%who  list var names
%whos list var names with data
%who_ls Return a sorted list of all interactive variables
%edit This will open an editor
%hist This presents a list of the last several input command lines.
%rerun  Re-run previous input
%save [options] filename n1-n2 n3-n4 ... n5 .. n6 ...
%notebook filename: Export and convert IPython notebooks.
%dhist history of visited directories
use %cd -<n> to go to directory number <n>

access cell history:

_ previous cell output
__  next previous
___ next next previous
_13  output of cell 13
_i13  content of input cell 13
In[13]  input of cell 13
Out[13] output of cell 13
In[13:18] range
exec(In[13])  execute cell 13

Once you run %rehashx, everything on your $PATH is loaded as IPython aliases


String lists (IPython.utils.text.SList) are a handy way to process output from system commands.

They are produced by the var = !cmd syntax.

SList has .p, .n, .l, .s, .grep(), and .fields() available:

  • lines = !ls -l
  • lines.grep('embed',prune=1)
  • lines.fields(8,0) (fields 8 and 0, zero-indexed)
  • lines.fields(*range(1,8))
  • lines.grep(r'^\?').fields(1)

The .s, .n, .p properties

  • The .s property returns one string where lines are joined by a single space
  • The .n property returns one string where lines are separated by newlines
  • If the items in the string list are file names, .p can be used to get a list of "path" objects for convenient file manipulation.

SLists inherit from normal Python lists, so every list method is available: lines.append('hey')

save session

%save -a [filename] 1 10-16 20

%save -a [filename] ~0


connect to MySQL:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('mysql://user:pass@host:port/db')
f = pd.read_sql_query('SELECT * FROM mytable', engine, index_col='ID')

glob files:

import os
from glob import glob

data_dir = '/data'
data_files = glob(os.path.join(data_dir, '*.gz'))

read gzip files, first 2 lines:

import gzip

with gzip.open('/tmp/foo.gz') as f:
    head = [next(f) for x in range(2)]
field_names = head[1].decode('ascii').split()  # header line -> list of column names (assuming whitespace-separated)

combine multiple dataframes:

import pandas as pd

df = pd.concat([pd.read_table(f, compression='gzip', skiprows=2, names=field_names) for f in data_files])

also reading Python for Data Analysis, 2nd Edition

an old article (2015) shows up on hackernews: A CEO's Guide to Emacs

as usual, you can find some useful Emacs packages in the discussions:

I tried dumb-jump, good, it just works.

avy looks much the same as ace-jump, which is what I'm using.

Wednesday, November 29, 2017

some notes on parallel command:

from https://github.com/LangilleLab/microbiome_helper/wiki/Quick-Introduction-to-GNU-Parallel

  • The file name: {}
  • The file name with the extension removed: {.}
  • To indicate that everything that follows should be read in from the command line: :::
parallel --eta -j 2 --load 80% --noswap 'blastp -db pdb_blast_db_example/pdb_seqres.txt -query {} -out blastp_outfiles/{.}.out -evalue 0.0001 -word_size 7 -outfmt "6 std stitle staxids sscinames" -max_target_seqs 10 -num_threads 1' ::: test_seq*.fas
  • --eta: Shows the estimated time remaining to run all jobs.
  • -j 2 (or --jobs 2): The number of commands to run at the same time
  • --load 80%: The maximum CPU load at which new jobs will not be started
  • --noswap: New jobs won't be started if there is both swap-in and swap-out activity

You can put options you want to use every time in a global configuration file here: /etc/parallel/config

above example can use pipe:

ls test_seq*.fas | parallel --eta -j 2 --load 80% --noswap 'blastp -db pdb_blast_db_example/pdb_seqres.txt -query {} -out blastp_outfiles2/{.}.out -evalue 0.0001 -word_size 7 -outfmt "6 std stitle staxids sscinames" -max_target_seqs 10 -num_threads 1'  

or call command in a file:

for f in test_seq*.fas; do  
  out=${f%.*}.out   # same {.}.out naming as the parallel examples above
  echo "blastp -db pdb_blast_db_example/pdb_seqres.txt -query $f -out blastp_outfiles3/$out -evalue 0.0001 -word_size 7 -outfmt \"6 std stitle staxids sscinames\" -max_target_seqs 10 -num_threads 1" >> blastp_cmds.txt    
done

cat blastp_cmds.txt | parallel --eta -j 2 --load 80% --noswap '{}'  

In this case since the whole command is being input we can just refer to the input itself with {}.

from https://www.biostars.org/p/63816/:

It is often faster to write a command using GNU Parallel than making a for loop:

for i in *.gz; do 
  zcat $i > $(basename $i .gz).unpacked
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz
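The same fan-out pattern can be sketched in Python for machines without GNU Parallel (a standard-library-only sketch; the demo files are made up, and `unpack` is a hypothetical helper):

```python
import gzip
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def unpack(path: Path) -> Path:
    """zcat file.gz > file.unpacked, like the parallel one-liner."""
    out = path.with_suffix('.unpacked')   # the {.}.unpacked naming
    with gzip.open(path, 'rb') as src, open(out, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return out

# demo: create a couple of .gz files in a temp dir, then unpack concurrently
tmp = Path(tempfile.mkdtemp())
for name in ('a', 'b'):
    with gzip.open(tmp / f'{name}.gz', 'wb') as f:
        f.write(name.encode() * 3)

with ThreadPoolExecutor(max_workers=2) as pool:   # like -j 2
    outputs = list(pool.map(unpack, sorted(tmp.glob('*.gz'))))

print([p.name for p in outputs])
```
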

Blast on multiple machines

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

Running composed commands

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

TIL: Python 3's print() function can write to a file:

with open("test", 'w') as f:
    print('No new line,', file=f, end='')
    print('Has new line', file=f)

played a little bit with sipwitch; there's so little information on the web, but luckily I found a detailed setup guide in Raspberry Pi for Secret Agents

I don't remember whether I read this before: Data Science at the Command Line

I'll go over the list again.

Peter Norvig's pytudes: Python programs to practice or demonstrate skills.

TIL GitHub supports rendering .ipynb files, which is very good for sharing documents mixed with code.
