Friday, December 01, 2017

A nice summary of AWS re:Invent 2017: AWS re:Invent - A Musical Review of the 2017 Keynote

EKS and Fargate are going to kill ECS (even though Fargate is part of ECS).

S3 Select and Glacier Select are super nice for large objects.
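A rough idea of what that looks like with boto3; this is only a sketch, and the bucket, key and CSV layout below are made-up placeholders:

import boto3

s3 = boto3.client('s3')

# ask S3 to filter the object server-side and stream back only the matching rows
resp = s3.select_object_content(
    Bucket='my-bucket',                 # hypothetical bucket
    Key='logs/2017-12-01.csv',          # hypothetical key
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s WHERE s._3 = 'ERROR'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'NONE'}},
    OutputSerialization={'CSV': {}},
)

for event in resp['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode())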

And most important to me: AWS Lambda is going to support Golang.


Jupyter Notebook is now my default development environment; I can run code, shell commands and data wrangling in it.
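For example, IPython's ! escape lets shell commands live in the same notebook as the Python code (the paths here are just placeholders):

# run a shell command from a cell and capture its output as a Python list
files = !ls data/*.csv

# then go straight back to pandas on the same results
import pandas as pd
df = pd.concat(pd.read_csv(f) for f in files)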

Setup is simple, and there's no need to create a project, add dependencies ...

I didn't try other kernels; the IPython kernel is good enough.

My workflow is: write the code in Jupyter, export it to a Python script, and run it with GNU Parallel.

My code (with pandas) is usually slow and memory-hungry, but it still works ...


just found a nice addon for jupyter:

And this is super useful: 28 Jupyter Notebook tips, tricks and shortcuts

Multicursor support, seriously? (Under Linux, hold Ctrl to add cursors.)

One must-have package is rmcgibbo/jupyter-emacskeys: Emacs keybindings for Jupyter notebook.

However, Ctrl-n and Ctrl-w belong to the browser, so they don't work.

Install the Menu Wizard addon to disable those keys; now the Emacs keybindings work fine.


Reading Mastering Pandas for Finance and Learning pandas.

pandas is very useful, but sometimes quite difficult to understand.

Sunday, December 03, 2017

to read proggit:

import pandas as pd
from pandas.io.json import json_normalize
import json
from urllib import request

u = 'https://www.reddit.com/r/programming.json'

r = request.urlopen(u)

j = json.loads(r.read().decode())

cols = ['data.title', 'data.url', 'data.permalink', 'data.score', 'data.num_comments']

df = json_normalize(j['data']['children'])[cols]
df.columns = [x.replace('data.', '') for x in cols]

df[['title', 'score', 'num_comments']].sort_values(['score', 'num_comments'], ascending=False)


to read hackernews:

import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

u1 = 'https://news.ycombinator.com/'

r = requests.get(u1)
soup = BeautifulSoup(r.content, 'lxml')

titles = soup.find_all('a', {'class': 'storylink'})

titles_df = pd.DataFrame([{'title': x.text, 'url': x.get('href')} for x in titles])

def extract_subs(s):
    # strip non-digits and convert to int so the score/comments columns sort numerically
    return {'score': int(re.sub(r'\D', '', s.find('span', {'class': 'score'}).text) or 0),
            'permalink': s.find('span', {'class': 'age'}).find('a').get('href'),
            'comments': int(re.sub(r'\D', '', s.find_all('a')[-1].text) or 0)}


subs = soup.find_all('td', {'class': 'subtext'})
subs_df = pd.DataFrame([extract_subs(s) for s in subs])
subs_df.head()


df = titles_df.merge(subs_df, left_index=True, right_index=True)

df.sort_values(['score', 'comments'], ascending=False)

Friday, December 08, 2017

A very useful extension for Jupyter: minrk/nbextension-scratchpad: scratchpad extension for Jupyter notebook

If you're using virtualenv, you may want to add the --sys-prefix option:

$ jupyter nbextension install --sys-prefix nbextension-scratchpad

I also modified main.js and changed the toggle hotkey from Ctrl-b to Ctrl-o, since I'm using Emacs keybindings.

This extension lets you inspect data without adding/removing cells in the notebook; very helpful.


I also tried different kernels for Jupyter:

clojure

https://github.com/roryk/clojupyter

$ git clone git@github.com:roryk/clojupyter.git
$ cd clojupyter

add dependencies to project.clj:

[huri "0.10.0-SNAPSHOT"]
[incanter "1.5.7"]
$ LEIN_SNAPSHOTS_IN_RELEASE=1 make
$ make install

node.js (clojurescript)

https://github.com/notablemind/jupyter-nodejs

$ git clone git@github.com:notablemind/jupyter-nodejs.git
$ cd jupyter-nodejs
$ npm install
$ node install.js
$ npm run build
$ npm run build-ext
then in a notebook:

%load_ext clojurescript http://himera-emh.herokuapp.com
%%clojurescript
(range 10)

But they can't compare to the IPython kernel.


And I played with some development platforms:

gorilla repl

Gorilla REPL

add to project.clj

:plugins [[lein-gorilla "0.4.0"]]

run lein gorilla to start

spyder

$ pip install spyder
$ pip install pyqt5

Rodeo

Rodeo

Jupyter is still better than all of them.


Read a little bit of Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark.

Since 1.3, Spark also has a DataFrame type; it's a different thing in a distributed environment, but good to know it exists.
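A minimal PySpark sketch just to show the shape of the API (the data and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# a Spark DataFrame looks a lot like a pandas one, but it's partitioned across the cluster
sdf = spark.createDataFrame([('alice', 3), ('bob', 5)], ['name', 'score'])
sdf.groupBy('name').sum('score').show()

# small results can be pulled back into a regular pandas DataFrame
pdf = sdf.toPandas()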


Apache Zeppelin is another web-based notebook; from what I read on the web, it might be a better option for SQL (and of course Spark).

$ ./bin/install-interpreter.sh --list
$ ./bin/install-interpreter.sh --name jdbc,postgresql

Start the notebook with $ ./bin/zeppelin-daemon.sh start

Go to http://localhost:8080/ and update the interpreter properties for MySQL:

  • default.driver: com.mysql.jdbc.Driver
  • default.url: jdbc:mysql://localhost:3306/
  • default.user: root

also add artifact:

mysql:mysql-connector-java:6.0.6

For Redshift, follow this page: Generic JDBC Interpreter

It has Dynamic Form, an important feature for reporting; it even has a report layout and a scheduler (cron) ...

Wednesday, December 13, 2017

Zeppelin

By default the anonymous user is enabled and the web UI is bound to 0.0.0.0, so it's better to add some security.

Setting up users is quite simple; just follow Apache Shiro Authentication for Apache Zeppelin.

Data source credentials, for example the MySQL username and password, can be stored in Credentials: Data Source Authorization in Apache Zeppelin.

Then delete the user and password settings from the interpreter; the interpreter will fetch that info by matching [Interpreter Group].[Interpreter Name].

So now I'm using Jupyter Notebook for Python and shell, and Zeppelin for SQL.


pandas

I use pandas for almost everything now: CSV, JSON, Excel, SQL ...

some notes for SQL:

utf8 for mysql connection

from sqlalchemy import create_engine
local_mysql = create_engine("mysql://ubuntu@127.0.0.1:3306/ubuntu?charset=utf8")

escape the % character (wrap the query with text() so % isn't treated as a parameter marker):

from sqlalchemy import text

sql = text("""
SELECT DATE_FORMAT(create_date, '%Y-%m-%d') FROM blog
""")

redshift -> mysql

df = pd.read_sql_query(read_query, redshift_conn, chunksize=1000)
for chunk in df:
    chunk.to_sql('output_table_name', mysql_conn, index=False, if_exists='append')

mysql -> csv

df = pd.read_sql_query(read_query, mysql_conn, chunksize=2000)
for chunk in df:
    chunk.to_csv(output_file, mode='a', index=False, header=(not os.path.exists(output_file)))

notes for excel:

append to excel file:

writer = pd.ExcelWriter(output_file)
row = 0
for chunk in df:
    # write the header only for the first chunk, later chunks are appended below it
    chunk.to_excel(writer, startrow=row, index=False, header=(row == 0))
    row += len(chunk) + (1 if row == 0 else 0)   # +1 accounts for the header row
writer.save()


With more and more data tasks, we're looking for task management tools.

My colleague is setting up Apache Airflow, while I'm looking at Luigi.

There are many comparisons of Airflow and Luigi; usually Airflow is the preferred one.

I personally like Luigi; it's much simpler than Airflow, and it must be difficult to maintain an Airflow server.

Writing a new task is easy: extend the luigi.Task class and implement a few methods:

  • requires(): return the Task(s) the current task depends on (dependencies).
  • output(): return a Target, preferably just one. Luigi checks the Target to determine whether the current task is completed (see the complete() method).
  • run(): the actual code that is run.

an example:

import luigi

class HelloWorld(luigi.Task):
    def requires(self):
        return None

    def output(self):
        return luigi.LocalTarget('helloworld.txt')

    def run(self):
        with self.output().open('w') as outfile:
            outfile.write('Hello World!\n')

if __name__ == '__main__':
    luigi.run()

to trigger the task, run one of these:

  • $ python hello.py HelloWorld --local-scheduler
  • $ PYTHONPATH=. luigi --module hello HelloWorld --local-scheduler

When using the latter command, the if __name__ == '__main__': block is not required.

By default, luigi tries to connect to a central scheduler; --local-scheduler is required when there's no central scheduler running.

Starting a local central scheduler is easy; just run $ luigid

Access the web UI by visiting http://localhost:8082/

One thing to note: task history is not enabled by default; enable it by updating the configuration file.
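A minimal sketch of that configuration (luigi.cfg), assuming a local sqlite file for the history database:

[scheduler]
record_task_history = True

[task_history]
db_connection = sqlite:///luigi-task-hist.db

Restart luigid after changing it.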

Luigi tasks are idempotent: once the output() Target exists, the task won't run again. So to rerun a task, just delete the output.
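For example, a downstream task only needs to return the upstream task from requires(), and it can read that task's output via input(); a minimal sketch building on the HelloWorld task above:

class ShoutedHelloWorld(luigi.Task):
    def requires(self):
        # luigi runs HelloWorld first if helloworld.txt doesn't exist yet
        return HelloWorld()

    def output(self):
        return luigi.LocalTarget('shouted.txt')

    def run(self):
        # self.input() is the Target returned by HelloWorld.output()
        with self.input().open('r') as infile, self.output().open('w') as outfile:
            outfile.write(infile.read().upper())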


One useful feature of luigi is the RangeDaily tool:

import luigi
import time

class TestRange(luigi.Task):
    d = luigi.DateParameter()
    _complete = False

    def run(self):
        print("date is ", self.d)
        time.sleep(1)
        self._complete = True

    def complete(self):
        return self._complete

trigger by:

$ PYTHONPATH=. luigi --module range_test RangeDaily --of TestRange --local-scheduler --start 2017-12-01

It will schedule one task per day starting from 2017-12-01; very useful for collecting daily metrics from different kinds of sources.

Another trick is to print the elapsed time for each task:

from https://gist.github.com/samuell/93cc7eb6803fa2790042:

class TimeTaskMixin():
    @luigi.Task.event_handler(luigi.Event.PROCESSING_TIME)
    def print_execution_time(self, processing_time):
        print('### PROCESSING TIME ###: ' + str(processing_time))
        
class TaskA(luigi.Task, TimeTaskMixin):
    ...


I've been learning Python for about a month now; virtualenv, IPython, Jupyter, pandas and Luigi really make me happy.

I borrowed The Python 3 Standard Library by Example from the local library; there's lots of good stuff inside the standard library ...


Saw my colleague perform a production DB update using one of the Percona Toolkit tools: pt-online-schema-change

DB_HOST=''
DB_USER=''
DB_DATABASE=''
DB_TABLE=''
ALTER_QUERY='ADD INDEX ...'

time \
    pt-online-schema-change \
        --ask-pass \
        --user ${DB_USER} \
        -h ${DB_HOST} \
        --progress time,10 \
        --max-lag 5s \
        --execute D=${DB_DATABASE},t=${DB_TABLE} \
        --nocheck-replication-filters \
        --alter "${ALTER_QUERY}"
