Jim Cheung

Friday, December 01, 2017

A nice summary for aws re:invent 2017: AWS re:Invent - A Musical Review of the 2017 Keynote

EKS and Fargate are going to kill ECS (even though fargate is part of ecs)

S3 Select and Glacier Select is super nice for large objects.

and most important to me: aws lambda to support golang

jupyter notebook is now my default development environment, I can run code, shell and data wrangling on it.

setup is simple, and no need to create project, add dependencies ...

I didn't try other kernels, ipython kernel is good enough.

my workflow is write the code in jupyter, export to python script, run it with gnu parallel

my code (with pandas) is usually slow and memory demand, but it still works ...

just found a nice addon for jupyter:

and this is super useful: 28 Jupyter Notebook tips, tricks and shortcuts

multicursor support, srsly? (under linux hold ctrl to add cursors)

one must have package is rmcgibbo/jupyter-emacskeys: Emacs keybindings for Jupyter notebook

however, ctrl-n, ctrl-w seems belongs to browser and they're not working.

install Menu Wizard addon to disable those keys, now emacs keybindings works fine.

reading Mastering Pandas for Finance and Learning pandas

pandas is very useful, but sometimes quite difficult to understand.

Saturday, December 03, 2017

to read proggit:

import pandas as pd
from pandas.io.json import json_normalize
import json
from urllib import request

u = 'https://www.reddit.com/r/programming.json'

r = request.urlopen(u)

j = json.loads(r.read().decode())

cols = ['data.title', 'data.url', 'data.permalink', 'data.score', 'data.num_comments']

df = json_normalize(j['data']['children'])[cols]
df.columns = [x.replace('data.', '') for x in cols]

df[['title', 'score', 'num_comments']].sort_values(['score', 'num_comments'], ascending=False)

to read hackernews:

import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

u1 = 'https://news.ycombinator.com/'

r = requests.get(u1)
soup = BeautifulSoup(r.content, 'lxml')

titles = soup.find_all('a', {'class': 'storylink'})

titles_df = pd.DataFrame([{'title': x.text, 'url': x.get('href')} for x in titles])

def extract_subs(s):
    return {'score': re.sub('\D', '', s.find('span', {'class': 'score'}).text),
            'permalink': s.find('span', {'class': 'age'}).find('a').get('href'),
            'comments': re.sub('\D', '', s.find_all('a')[-1].text)}

subs = soup.find_all('td', {'class': 'subtext'})
subs_df = pd.DataFrame([extract_subs(s) for s in subs])

df = titles_df.merge(subs_df, left_index=True, right_index=True)

df.sort_values(['score', 'comments'], ascending=False)

Friday, December 08, 2017

a very useful extension for jupyter: minrk/nbextension-scratchpad: scratchpad extension for Jupyter notebook

if you're using virtualenv, you may want to add --sys-prefix option

$ jupyter nbextension install --sys-prefix nbextension-scratchpad

I also modified main.js, changed toggle hotkey from ctrl-b to ctrl-o, since I'm using emacs keybindings.

this extension can let you inspect data without adding/removing cells on the notebook, very helpful.

I also tried different kernels for jupyter:



$ git clone git@github.com:roryk/clojupyter.git
$ cd clojupyter

add dependencies to project.clj:

[huri "0.10.0-SNAPSHOT"]
[incanter "1.5.7"]
$ make install

node.js (clojurescript)


$ git clone git@github.com:notablemind/jupyter-nodejs.git
$ cd jupyter-nodejs
$ npm install
$ node install.js
$ npm run build
$ npm run build-ext
%load_ext clojurescript http://himera-emh.herokuapp.com
(range 10)

but they can not compare to ipython kernel.

and played with some development platforms:

gorilla repl

Gorilla REPL

add to project.clj

:plugins [[lein-gorilla "0.4.0"]]

run lein gorilla to start


$ pip install spyder
$ pip install pyqt5



jupyter is still better than them.

read a little bit Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark

since 1.3, spark also has dataframe type, it's different thing in a distributed environment, but good to know it exists.

Apache Zeppelin is another web-based notebook, from what I read from web, it might be a better option for SQL (and of course spark)

$ ./bin/install-interpreter.sh --list
$ ./bin/install-interpreter.sh --name jdbc,postgresql

start the notebook by $ ./bin/zeppelin-daemon.sh

go to http://localhost:8080/, update interpreter properties for mysql:

also add artifact:


for redshift, following this page: Generic JDBC Interpreter

it has Dynamic Form, one important feature for reporting, it even has a report layout and scheduler (cron) ...

Wednesday, December 13, 2017


by default anonymous user is enabled, and the web ui is bound to, better to add some security

setup users is quite simple, just follow Apache Shiro Authentication for Apache Zeppelin

for data source, for example mysql username and password, can store in credentials: Data Source Authorization in Apache Zeppelin

then delete user and password settings from interpreter, interpreter will fetch those info by matching Interpreter Group].[Interpreter Name]

so now I'm using jupyter notebook for python and shell, zeppelin for SQL


I use pandas for almost everything now: csv, json, excel, sql ...

some notes for SQL:

utf8 for mysql connection

local_mysql = engine.create_engine("mysql://ubuntu@")

escape % character

from sqlalchemy import text

sql = text("""
SELECT DATE_FORMAT(create_date, '%Y-%m-%d') FROM blog

redshift -> mysql

df = pd.read_sql_query(read_query, redshift_conn, chunksize=1000)
for chunk in df:
    chunk.to_sql('output_table_name', mysql_conn, index=False, if_exists='append')

mysql -> csv

df = pd.read_sql_query(read_query, mysql_conn, chunksize=2000)
for chunk in df:
    chunk.to_csv(output_file, mode='a', index=False, header=(not os.path.exists(output_file)))

notes for excel:

append to excel file:

row = 1
for chunk in df:
    chunk.to_excel(output_file, startrow=row, index=False, header=(not os.path.exists(output_file)))
    row = row + len(chunk)

more and more data tasks, we're looking for some task management tools.

my colleague is setting up Apache Airflow, while I'm looking at Luigi

there're many comparison on airflow and luigi, usually airflow is the preferred one.

I personally like luigi, much simpler than airflow, it must be difficult to maintain an airflow server.

to write a new task is easy, extend the luigi.Task class, implement few functions:

an example:

import luigi

class HelloWorld(luigi.Task):
    def requires(self):
        return None

    def output(self):
        return luigi.LocalTarget('helloworld.txt')

    def run(self):
        with self.output().open('w') as outfile:
            outfile.write('Hello World!\n')

if __name__ == '__main__':

to trigger the task, run one of these:

when using the later command, if __name__ == '__main__': is not required.

by default, luigi tries to connect a central scheduler, --local-scheduler is required when there's no central scheduler running.

to start a local central scheduler is easy, just run $ luigid

access the web-ui by visiting http://localhost:8082/

one thing to note is, task history is not enabled by default, enable it by update the configuration file

luigi tasks are idempotent, once output() Target exists, task won't run again. so to rerun a task, just delete the output.

one useful feature of luigi is RangeDaily tool:

import luigi
import time

class TestRange(luigi.Task):
    d = luigi.DateParameter()
    _complete = False

    def run(self):
        print("date is ", self.d)
        self._complete = True

    def complete(self):
        return self._complete

trigger by:

$ PYTHONPATH=. luigi --module range_test RangeDaily --of RangeTest --local-scheduler --start 2017-12-01

it will schedule n tasks starting from 2017-12-01, very useful for collecting daily metrics from different kind of sources.

another trick is print elapsed time for each task:

from https://gist.github.com/samuell/93cc7eb6803fa2790042:

class TimeTaskMixin():
    def print_execution_time(self, processing_time):
        print('### PROCESSING TIME ###: ' + str(processing_time))
class TaskA(luigi.Task, TimeTaskMixin):

I learned python for about a month now, virtualenv, ipython, jupyter, pandas and luigi really made me happy.

I borrowed The Python 3 Standard Library by Example from local library, there're lots of good stuffs inside the standard library ...

saw my colleague performed production db update by using one of the Percona Toolkit: pt-online-schema-change


time \
    pt-online-schema-change \
        --ask-pass \
        --user ${DB_USER} \
        -h ${DB_HOST} \
        --progress time,10 \
        --max-lag 5s \
        --execute D=${DB_DATABASE},t=${DB_TABLE} \
        --nocheck-replication-filters \
        --alter "${ALTER_QUERY}"

Thursday, December 28, 2017

I think there will be no more update for this year.

as usual summary of this year's reading, happy new year!

| 2017-01 | Ansible for DevOps                                                                 |
|         | Gawk: Effective AWK Programming                                                    |
|         | Learning ClojureScript                                                             |
| 2017-02 | Learning ClojureScript                                                             |
|         | Programming Elixir 1.2                                                             |
|         | Infrastructure as Code                                                             |
| 2017-03 | Text Processing with Ruby                                                          |
|         | Clojure Programming Cookbook                                                       |
|         | Reactive Web Applications                                                          |
|         | Read Read-Eval-Print-╬╗ove                                                          |
|         | ClojureScript Unraveled                                                            |
|         | The Joy of Clojure 2nd Edtion                                                      |
|         | Professional Clojure                                                               |
| 2017-06 | Reactive Programming with RxJava                                                   |
|         | Understanding ECMAScript 6                                                         |
|         | MongoDB in Action, Second Edition                                                  |
|         | Clojure High Performance Programming                                               |
| 2017-07 | Effective MySQL Replication Techniques in Depth                                    |
|         | Go in Action                                                                       |
|         | Go Web Programming                                                                 |
| 2017-08 | Working with Linux – Quick Hacks for the Command Line                        |
|         | Linux Phrasebook (2nd Edition)                                                     |
| 2017-09 | The Language of SQL (2nd Edition)                                                  |
|         | Designing Data-Intensive Applications                                              |
|         | Python Crash Course                                                                |
|         | Fluent Python                                                                      |
|         | Effective Python: 59 Specific Ways to Write Better Python                          |
|         | Fast Data Architectures for Streaming Applications                                 |
| 2017-10 | Introduction to Machine Learning with Python                                       |
|         | Python Data Science Handbook                                                       |
|         | Docker Management Design Patterns                                                  |
| 2017-11 | The AWK Programming Language                                                       |
|         | Python for Data Analysis, 2nd Edition                                              |
| 2017-12 | Mastering Pandas for Finance                                                       |
|         | Learning pandas                                                                    |
|         | Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark |
Blog Archive