Friday, December 01, 2017
A nice summary for aws re:invent 2017: AWS re:Invent - A Musical Review of the 2017 Keynote
EKS and Fargate are going to kill ECS (even though fargate is part of ecs)
S3 Select and Glacier Select is super nice for large objects.
and most important to me: aws lambda to support golang
jupyter notebook is now my default development environment, I can run code, shell and data wrangling on it.
setup is simple, and no need to create project, add dependencies ...
I didn't try other kernels, ipython kernel is good enough.
my workflow is write the code in jupyter, export to python script, run it with gnu parallel
my code (with pandas) is usually slow and memory demand, but it still works ...
just found a nice addon for jupyter:
- spotify/luigi: Luigi is a Python module that helps you build complex pipelines of batch jobs.
- Building Data Science Pipelines with Luigi and Jupyter Notebooks
and this is super useful: 28 Jupyter Notebook tips, tricks and shortcuts
multicursor support, srsly? (under linux hold ctrl
to add cursors)
one must have package is rmcgibbo/jupyter-emacskeys: Emacs keybindings for Jupyter notebook
however, ctrl-n
, ctrl-w
seems belongs to browser and they're not working.
install Menu Wizard addon to disable those keys, now emacs keybindings works fine.
reading Mastering Pandas for Finance and Learning pandas
pandas is very useful, but sometimes quite difficult to understand.
Saturday, December 03, 2017
to read proggit:
import pandas as pd
from pandas.io.json import json_normalize
import json
from urllib import request
u = 'https://www.reddit.com/r/programming.json'
r = request.urlopen(u)
j = json.loads(r.read().decode())
cols = ['data.title', 'data.url', 'data.permalink', 'data.score', 'data.num_comments']
df = json_normalize(j['data']['children'])[cols]
df.columns = [x.replace('data.', '') for x in cols]
df[['title', 'score', 'num_comments']].sort_values(['score', 'num_comments'], ascending=False)
to read hackernews:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
u1 = 'https://news.ycombinator.com/'
r = requests.get(u1)
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.find_all('a', {'class': 'storylink'})
titles_df = pd.DataFrame([{'title': x.text, 'url': x.get('href')} for x in titles])
def extract_subs(s):
return {'score': re.sub('\D', '', s.find('span', {'class': 'score'}).text),
'permalink': s.find('span', {'class': 'age'}).find('a').get('href'),
'comments': re.sub('\D', '', s.find_all('a')[-1].text)}
subs = soup.find_all('td', {'class': 'subtext'})
subs_df = pd.DataFrame([extract_subs(s) for s in subs])
subs_df.head()
df = titles_df.merge(subs_df, left_index=True, right_index=True)
df.sort_values(['score', 'comments'], ascending=False)
Friday, December 08, 2017
a very useful extension for jupyter: minrk/nbextension-scratchpad: scratchpad extension for Jupyter notebook
if you're using virtualenv
, you may want to add --sys-prefix
option
$ jupyter nbextension install --sys-prefix nbextension-scratchpad
I also modified main.js
, changed toggle hotkey from ctrl-b
to ctrl-o
, since I'm using emacs keybindings.
this extension can let you inspect data without adding/removing cells on the notebook, very helpful.
I also tried different kernels for jupyter:
clojure
https://github.com/roryk/clojupyter
$ git clone git@github.com:roryk/clojupyter.git
$ cd clojupyter
add dependencies to project.clj
:
[huri "0.10.0-SNAPSHOT"]
[incanter "1.5.7"]
$ LEIN_SNAPSHOTS_IN_RELEASE=1 make
$ make install
node.js (clojurescript)
https://github.com/notablemind/jupyter-nodejs
$ git clone git@github.com:notablemind/jupyter-nodejs.git
$ cd jupyter-nodejs
$ npm install
$ node install.js
$ npm run build
$ npm run build-ext
%load_ext clojurescript http://himera-emh.herokuapp.com
%%clojurescript
(range 10)
but they can not compare to ipython kernel.
and played with some development platforms:
gorilla repl
add to project.clj
:plugins [[lein-gorilla "0.4.0"]]
run lein gorilla
to start
spyder
$ pip install spyder
$ pip install pyqt5
Rodeo
jupyter is still better than them.
read a little bit Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark
since 1.3
, spark also has dataframe type, it's different thing in a distributed environment, but good to know it exists.
Apache Zeppelin is another web-based notebook, from what I read from web, it might be a better option for SQL (and of course spark)
$ ./bin/install-interpreter.sh --list
$ ./bin/install-interpreter.sh --name jdbc,postgresql
start the notebook by $ ./bin/zeppelin-daemon.sh
go to http://localhost:8080/
, update interpreter
properties for mysql:
default.driver
:com.mysql.jdbc.Driver
default.url
:jdbc:mysql://localhost:3306/
default.user
:root
also add artifact:
mysql:mysql-connector-java:6.0.6
for redshift, following this page: Generic JDBC Interpreter
it has Dynamic Form, one important feature for reporting, it even has a report layout and scheduler (cron) ...
Wednesday, December 13, 2017
Zeppelin
by default anonymous user is enabled, and the web ui is bound to 0.0.0.0
, better to add some security
setup users is quite simple, just follow Apache Shiro Authentication for Apache Zeppelin
for data source, for example mysql username and password, can store in credentials: Data Source Authorization in Apache Zeppelin
then delete user and password settings from interpreter, interpreter will fetch those info by matching Interpreter Group].[Interpreter Name]
so now I'm using jupyter notebook for python and shell, zeppelin for SQL
pandas
I use pandas for almost everything now: csv, json, excel, sql ...
some notes for SQL:
utf8 for mysql connection
local_mysql = engine.create_engine("mysql://ubuntu@127.0.0.1:3306/ubuntu?charset=utf8")
escape % character
from sqlalchemy import text
sql = text("""
SELECT DATE_FORMAT(create_date, '%Y-%m-%d') FROM blog
""")
redshift -> mysql
df = pd.read_sql_query(read_query, redshift_conn, chunksize=1000)
for chunk in df:
chunk.to_sql('output_table_name', mysql_conn, index=False, if_exists='append')
mysql -> csv
df = pd.read_sql_query(read_query, mysql_conn, chunksize=2000)
for chunk in df:
chunk.to_csv(output_file, mode='a', index=False, header=(not os.path.exists(output_file)))
notes for excel:
append to excel file:
row = 1
for chunk in df:
chunk.to_excel(output_file, startrow=row, index=False, header=(not os.path.exists(output_file)))
row = row + len(chunk)
more and more data tasks, we're looking for some task management tools.
my colleague is setting up Apache Airflow, while I'm looking at Luigi
there're many comparison on airflow and luigi, usually airflow is the preferred one.
I personally like luigi, much simpler than airflow, it must be difficult to maintain an airflow server.
to write a new task is easy, extend the luigi.Task
class, implement few functions:
requires()
: returnTask
, which are current task depends on. (dependencies)output()
: returnTarget
, better return just one. luigi checks theTarget
to determine whether current task is completed. (check thecomplete()
method)run()
: actual code that is run
an example:
import luigi
class HelloWorld(luigi.Task):
def requires(self):
return None
def output(self):
return luigi.LocalTarget('helloworld.txt')
def run(self):
with self.output().open('w') as outfile:
outfile.write('Hello World!\n')
if __name__ == '__main__':
luigi.run()
to trigger the task, run one of these:
$ python hello.py HelloWorld --local-scheduler
$ PYTHONPATH=. luigi --module hello HelloWorld --local-scheduler
when using the later command, if __name__ == '__main__':
is not required.
by default, luigi tries to connect a central scheduler, --local-scheduler
is required when there's no central scheduler running.
to start a local central scheduler is easy, just run $ luigid
access the web-ui by visiting http://localhost:8082/
one thing to note is, task history is not enabled by default, enable it by update the configuration file
luigi tasks are idempotent, once output()
Target
exists, task won't run again. so to rerun a task, just delete the output.
one useful feature of luigi is RangeDaily
tool:
import luigi
import time
class TestRange(luigi.Task):
d = luigi.DateParameter()
_complete = False
def run(self):
print("date is ", self.d)
time.sleep(1)
self._complete = True
def complete(self):
return self._complete
trigger by:
$ PYTHONPATH=. luigi --module range_test RangeDaily --of RangeTest --local-scheduler --start 2017-12-01
it will schedule n
tasks starting from 2017-12-01
, very useful for collecting daily metrics from different kind of sources.
another trick is print elapsed time for each task:
from https://gist.github.com/samuell/93cc7eb6803fa2790042:
class TimeTaskMixin():
@luigi.Task.event_handler(luigi.Event.PROCESSING_TIME)
def print_execution_time(self, processing_time):
print('### PROCESSING TIME ###: ' + str(processing_time))
class TaskA(luigi.Task, TimeTaskMixin):
...
I learned python for about a month now, virtualenv, ipython, jupyter, pandas and luigi really made me happy.
I borrowed The Python 3 Standard Library by Example from local library, there're lots of good stuffs inside the standard library ...
saw my colleague performed production db update by using one of the Percona Toolkit: pt-online-schema-change
DB_HOST=''
DB_USER=''
DB_DATABASE=''
DB_TABLE=''
ALTER_QUERY='ADD INDEX ...'
time \
pt-online-schema-change \
--ask-pass \
--user ${DB_USER} \
-h ${DB_HOST} \
--progress time,10 \
--max-lag 5s \
--execute D=${DB_DATABASE},t=${DB_TABLE} \
--nocheck-replication-filters \
--alter "${ALTER_QUERY}"
Thursday, December 28, 2017
I think there will be no more update for this year.
as usual summary of this year's reading, happy new year!
|---------+------------------------------------------------------------------------------------|
| 2017-01 | Ansible for DevOps |
| | Gawk: Effective AWK Programming |
| | Learning ClojureScript |
|---------+------------------------------------------------------------------------------------|
| 2017-02 | Learning ClojureScript |
| | Programming Elixir 1.2 |
| | Infrastructure as Code |
|---------+------------------------------------------------------------------------------------|
| 2017-03 | Text Processing with Ruby |
| | Clojure Programming Cookbook |
| | Reactive Web Applications |
| | Read Read-Eval-Print-λove |
| | ClojureScript Unraveled |
| | The Joy of Clojure 2nd Edtion |
| | Professional Clojure |
|---------+------------------------------------------------------------------------------------|
| 2017-06 | Reactive Programming with RxJava |
| | Understanding ECMAScript 6 |
| | MongoDB in Action, Second Edition |
| | Clojure High Performance Programming |
|---------+------------------------------------------------------------------------------------|
| 2017-07 | Effective MySQL Replication Techniques in Depth |
| | Go in Action |
| | Go Web Programming |
|---------+------------------------------------------------------------------------------------|
| 2017-08 | Working with Linux – Quick Hacks for the Command Line |
| | Linux Phrasebook (2nd Edition) |
|---------+------------------------------------------------------------------------------------|
| 2017-09 | The Language of SQL (2nd Edition) |
| | Designing Data-Intensive Applications |
| | Python Crash Course |
| | Fluent Python |
| | Effective Python: 59 Specific Ways to Write Better Python |
| | Fast Data Architectures for Streaming Applications |
|---------+------------------------------------------------------------------------------------|
| 2017-10 | Introduction to Machine Learning with Python |
| | Python Data Science Handbook |
| | Docker Management Design Patterns |
|---------+------------------------------------------------------------------------------------|
| 2017-11 | The AWK Programming Language |
| | Python for Data Analysis, 2nd Edition |
|---------+------------------------------------------------------------------------------------|
| 2017-12 | Mastering Pandas for Finance |
| | Learning pandas |
| | Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark |
|---------+------------------------------------------------------------------------------------|
Blog Archive
- Newer Entries
- 2018 January
- 2018 February
- 2018 March
- 2018 April
- 2018 May
- 2018 June
- 2018 July
- 2018 August
- 2018 September
- 2018 October
- 2018 November
- 2018 December
- 2019 January
- 2019 February
- 2019 March
- 2019 April
- 2019 May
- 2019 July
- 2019 October
- 2019 November
- 2019 December
- 2020 August
- 2020 September
- 2020 October
- 2020 November
- 2020 December
- 2021 January
- 2021 February
- 2021 March
- 2021 April
- 2021 May
- 2021 June
- 2021 August
- 2021 September
- 2021 December
- 2022 March
- 2022 April
- 2022 May
- 2022 June
- 2022 July
- 2022 August
- 2022 September
- 2022 October
- 2022 November
- 2022 December
- 2023 January
- 2023 February
- 2023 March
- 2023 April
- 2023 July
- 2023 August
- 2023 September
- 2023 October
- 2023 November
- 2023 December
- 2024 January
- 2024 February
- 2024 March
- 2024 April
- 2024 May
- 2024 June
- 2024 August
- 2024 September
- 2024 October
- 2024 November
- Older Entries
- 2017 November
- 2017 October
- 2017 September
- 2017 August
- 2017 July
- 2017 June
- 2017 May
- 2017 April
- 2017 March
- 2017 February
- 2017 January
- 2016 December
- 2016 November
- 2016 October
- 2016 September
- 2016 August
- 2016 July
- 2016 June
- 2016 May
- 2016 April
- 2016 March
- 2016 February
- 2016 January
- 2015 December
- 2015 November
- 2015 October
- 2015 September
- 2015 August
- 2015 July
- 2015 June
- 2015 May
- 2015 April
- 2015 March
- 2015 February
- 2015 January
- 2014 December
- 2014 November
- 2014 October
- 2014 September
- 2014 August
- 2014 March
- 2014 February
- 2014 January
- 2013 December
- 2013 October
- 2013 July
- 2013 June
- 2013 May
- 2013 March
- 2013 February
- 2013 January
- 2012 December
- 2012 November
- 2012 October
- 2012 September
- 2012 August