Jim Cheung

Thursday, January 04, 2018

some notes on luigi:

luigi does not come with a feature to trigger jobs, even if you're running luigid

rundeck or jenkins could help.


I use rundeck and run tasks with the local scheduler.

to retry tasks when using the local scheduler:

you'll need to combine --scheduler-retry-count=2 --scheduler-retry-delay=10 --worker-keep-alive --local-scheduler
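
for example, the full invocation might look like this (assuming a module named task1, as in the later examples):

luigi --module=task1 Task1 --scheduler-retry-count=2 --scheduler-retry-delay=10 --worker-keep-alive --local-scheduler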


by default all exit codes are 0 unless you define these in your luigi.cfg: (see here)

[retcode]
already_running=10
missing_data=20
not_run=25
task_failed=30
scheduling_error=35
unhandled_exception=40

when a task fails, rundeck marks it as successful since it receives exit code 0

I'll just add --retcode-task-failed=1 to solve the problem.


to avoid adding too many arguments, you can use a luigi.cfg or pass them to luigi.run()
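
for example, a luigi.cfg along these lines should cover the flags above (I'm assuming the option names mirror the CLI flags; double-check against the luigi docs):

[scheduler]
# assumed equivalents of --scheduler-retry-count / --scheduler-retry-delay
retry_count=2
retry_delay=10

[worker]
# assumed equivalent of --worker-keep-alive
keep_alive=true

[retcode]
task_failed=1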

LUIGI_CONFIG_PATH can be used for a per-task config file:

LUIGI_CONFIG_PATH=task1.cfg luigi --module=task1 Task1 --local-scheduler

or add this to the end of the file:

if __name__ == '__main__':
    run_params = ['Task1', 
        '--scheduler-retry-count=2', 
        '--scheduler-retry-delay=10', 
        '--worker-keep-alive', 
        '--local-scheduler']
    
    luigi.run(run_params)

then you can execute the python script directly, which runs with the pre-defined options

or run with normal luigi --module=... for a different set of options

note: in order to use the --retcode-xxx arguments, you need to run with luigi.retcodes.run_with_retcodes(run_params)
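
a minimal sketch of that variant, reusing the same kind of run_params (the --retcode-task-failed flag is the one from above):

import luigi.retcodes

if __name__ == '__main__':
    run_params = ['Task1',
        '--retcode-task-failed=1',
        '--local-scheduler']

    # unlike luigi.run(), this exits the process with the configured return code
    luigi.retcodes.run_with_retcodes(run_params)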


for some table insert tasks, I implement a complete() method to check whether the data was already inserted: (see here)

    def complete(self):
        ... select count(*) from table where last_insert_date='xxx' ...
        return bool

luigi.contrib has database Targets; however, they just check whether the table exists.
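
a fuller sketch of that kind of complete() check, assuming a MySQL table and PyMySQL (the table, column, and connection details here are made up):

import luigi
import pymysql

class InsertTask(luigi.Task):
    date = luigi.DateParameter()

    def complete(self):
        # hypothetical connection details and table/column names
        conn = pymysql.connect(host='db-host', user='user', password='...', db='mydb')
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "select count(*) from my_table where last_insert_date = %s",
                    (self.date,))
                (count,) = cur.fetchone()
            # non-zero count means the data was already inserted
            return count > 0
        finally:
            conn.close()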


with pandas, I had a unicode error on dataframe.to_csv()

finally it was solved by adding ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 to the Dockerfile

Saturday, January 06, 2018

to access a database via a bastion machine, use an ssh tunnel:

ssh -N -L 3366:database-host:3306 bastion-host

or add to .ssh/config:

Host bastion-host
  Hostname bastion-ip
  Localforward 3366 database-host:3306

then simply use ssh -N bastion-host to set up the tunnel
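
with the tunnel up, point the client at the local port, e.g. (assuming the database is MySQL, given port 3306; the user name is a placeholder):

mysql -h 127.0.0.1 -P 3366 -u dbuser -p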



some clojure stuff to try out:

Tuesday, January 09, 2018

Rich Hickey wrote about Git Deps for Clojure; I haven't had time to read the Deps and CLI Guide yet.

it's definitely the implementation of the ideas from his Spec-ulation talk


the biggest news recently is Meltdown and Spectre:


Papers I've Read in 2017

the first, and the writer's favorite, paper is Out of the Tar Pit / Ben Moseley, Peter Marks / 2006 (PDF); I was actually reading it as well. David Nolen mentioned it in his talk: A Practical Functional Relational Architecture

it was also reviewed by the morning paper: Out of the Tar Pit


I also watched a few interesting talks:

I've always wanted to try creating a k8s cluster on lxd


need to test out an old code base; it makes requests to an external api for data

the first thing I want to do is set up a proxy to cache the external api responses, and also be able to monitor requests

since those are POST requests, it may be a little bit difficult for varnish

nginx is simpler:

http {
    client_max_body_size 50k;

    proxy_cache_path /tmp/cache levels=1:2 keys_zone=apiCache:1m inactive=4h;
 
    server {
        listen 8080;
        server_name localhost;

        location / {
            try_files $uri @cache_backend;
        }
 
        location @cache_backend {
            proxy_pass http://external-api-host;
            proxy_cache apiCache;
            # allow caching of POST responses (only GET and HEAD are cached by default)
            proxy_cache_methods POST;
            # include the request body in the key so identical payloads share a cache entry
            proxy_cache_key "$request_uri|$request_body";
            proxy_buffers 8 32k;
            proxy_buffer_size 64k;
            proxy_cache_valid 4h;
            proxy_cache_use_stale updating;
            # expose HIT/MISS so cached responses are easy to spot
            add_header X-Cached $upstream_cache_status;
        }
    }
}
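
to check whether a response came from the cache, look at the X-Cached header added by the config above (the endpoint and body here are placeholders); the first request should report MISS and identical follow-up requests HIT:

curl -s -D - -o /dev/null -X POST -d '{"q":"test"}' http://localhost:8080/some/endpoint | grep X-Cached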

I found that setting up a proxy between services is quite useful; I may write a simple go program to do it.

Sunday, January 28, 2018

didn't update for a long time, so here's a summary of the last few weeks:

Dask, like pandas + luigi, looks interesting but I don't have time to try it yet.


slackhq/go-audit: go-audit is an alternative to the auditd daemon that ships with many distros. I didn't know anything about auditd before; it may be useful under some circumstances.


I'm quite interested in Druid, a high-performance, column-oriented, distributed data store; it's often compared with cassandra and hbase.

Druid has experimental built-in sql support, using Apache Calcite

I think sql support is a very important feature.


after discussing how to scale websockets with my colleagues, I found Nchan - flexible pubsub for the modern web and gorilla/websocket: A WebSocket implementation for Go.

but I found something more fun: ameo: Redis compatible GET, SET, DEL and PUBLISH/SUBSCRIBE on riak_core with a WebSocket API


I used Capistrano at my old company, so I never really had a chance to use Fabric. But now basically everything is python, so I need to try it out and compare it with ansible


go has a standard library for pprof: pprof - The Go Programming Language

but uber has a better one: uber/go-torch: Stochastic flame graph profiler for Go programs


in this article: Stream & Go: News Feeds for Over 300 Million End Users

they use rocksdb and raft to replace cassandra; the solution is not open sourced, but it sounds interesting to me.


opentracing is a way to monitor microservices; jaegertracing/jaeger: CNCF Jaeger, a Distributed Tracing System is a system that supports opentracing, also by uber

about monitoring, statsd is something I looked into. I heard of another one, Diamond, written in python, which is also worth trying.

etsy/logster: Parse log files, generate metrics for Graphite and Ganglia; a log parser written in python, another winner.


awslabs/goformation: GoFormation is a Go library for working with CloudFormation templates.

I still don't understand cloudformation; I'd rather use awscli or the api instead.

awslabs has a lot of good stuff, like:

awslabs/amazon-sagemaker-examples: Example notebooks that show how to apply machine learning and deep learning in Amazon SageMaker

our team is doing similar tasks, so I have to say it's very practical.


reading Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing


so aws is Announcing Go Support for AWS Lambda

and an article about it: Speed and Stability: Why Go is a Great Fit for Lambda


some bookmarked links from my twitter:


finally, after 9 months, my /dev/tty keycap set arrived:

(photos of the /dev/tty keycap set)
