Jim Cheung

Thursday, January 04, 2018

some notes on luigi:

luigi does not come with a feature to trigger jobs, even if you're running luigid

rundeck or jenkins could help.


I use rundeck and run tasks with the local scheduler.

to retry tasks when using the local scheduler:

you'll need to combine --scheduler-retry-count=2 --scheduler-retry-delay=10 --worker-keep-alive --local-scheduler
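
for example, the full invocation might look like this (assuming a module named task1, as in the later examples):

luigi --module=task1 Task1 --scheduler-retry-count=2 --scheduler-retry-delay=10 --worker-keep-alive --local-scheduler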


by default all exit codes are 0 unless you define these in your luigi.cfg: (see here)

[retcode]
already_running=10
missing_data=20
not_run=25
task_failed=30
scheduling_error=35
unhandled_exception=40

when a task fails, rundeck marks it as successful since it receives exit code 0

I'll just add --retcode-task-failed=1 to solve the problem.


to avoid adding too many arguments, you can use a luigi.cfg or pass them to luigi.run()
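
for example, a luigi.cfg along these lines should cover the flags above (I'm assuming the option names mirror the CLI flags; double-check against the luigi docs):

[scheduler]
# assumed equivalents of --scheduler-retry-count / --scheduler-retry-delay
retry_count=2
retry_delay=10

[worker]
# assumed equivalent of --worker-keep-alive
keep_alive=true

[retcode]
task_failed=1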

LUIGI_CONFIG_PATH can be used for a per-task config file:

LUIGI_CONFIG_PATH=task1.cfg luigi --module=task1 Task1 --local-scheduler

or add this to the end of the file:

if __name__ == '__main__':
    run_params = ['Task1', 
        '--scheduler-retry-count=2', 
        '--scheduler-retry-delay=10', 
        '--worker-keep-alive', 
        '--local-scheduler']
    
    luigi.run(run_params)

then you can execute the python script directly, which runs with the pre-defined options

or run with normal luigi --module=... for a different set of options

note: in order to use the --retcode-xxx arguments, you need to run with luigi.retcodes.run_with_retcodes(run_params)
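
a minimal sketch of that variant, reusing the same kind of run_params (the --retcode-task-failed flag is the one from above):

import luigi.retcodes

if __name__ == '__main__':
    run_params = ['Task1',
        '--retcode-task-failed=1',
        '--local-scheduler']

    # unlike luigi.run(), this exits the process with the configured return code
    luigi.retcodes.run_with_retcodes(run_params)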


for some table insert tasks, I implement a complete() method to check whether the data was already inserted: (see here)

    def complete(self):
        ... select count(*) from table where last_insert_date='xxx' ...
        return bool

luigi.contrib has database Targets; however, they just check whether the table exists.
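
a fuller sketch of that kind of complete() check, assuming a MySQL table and PyMySQL (the table, column, and connection details here are made up):

import luigi
import pymysql

class InsertTask(luigi.Task):
    date = luigi.DateParameter()

    def complete(self):
        # hypothetical connection details and table/column names
        conn = pymysql.connect(host='db-host', user='user', password='...', db='mydb')
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "select count(*) from my_table where last_insert_date = %s",
                    (self.date,))
                (count,) = cur.fetchone()
            # non-zero count means the data was already inserted
            return count > 0
        finally:
            conn.close()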


with pandas, I had a unicode error on dataframe.to_csv()

finally it was solved by adding ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 to the Dockerfile

Saturday, January 06, 2018

to access a database via a bastion machine, use an ssh tunnel:

ssh -N -L 3366:database-host:3306 bastion-host

or add to .ssh/config:

Host bastion-host
  Hostname bastion-ip
  Localforward 3366 database-host:3306

then simply use ssh -N bastion-host to set up the tunnel
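
with the tunnel up, point the client at the local port, e.g. (assuming the database is MySQL, given port 3306; the user name is a placeholder):

mysql -h 127.0.0.1 -P 3366 -u dbuser -p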



some clojure stuff to try out:

Tuesday, January 09, 2018

Rich Hickey wrote about Git Deps for Clojure; I haven't had time to read the Deps and CLI Guide yet.

it's definitely the implementation of the ideas from his Spec-ulation talk


the biggest news recently is Meltdown and Spectre:


Papers I've Read in 2017

the first, and the writer's favorite, paper is Out of the Tar Pit / Ben Moseley, Peter Marks / 2006 (PDF); I was actually reading it as well. David Nolen mentioned it in his talk: A Practical Functional Relational Architecture

it was also reviewed by the morning paper: Out of the Tar Pit


I also watched a few interesting talks:

I've always wanted to try creating a k8s cluster on lxd


need to test out an old code base; it makes requests to an external api for data

the first thing I want to do is set up a proxy to cache the external api responses, and also be able to monitor requests

since those are POST requests, it may be a little bit difficult for varnish

nginx is simpler:

http {
    client_max_body_size 50k;

    proxy_cache_path /tmp/cache levels=1:2 keys_zone=apiCache:1m inactive=4h;
 
    server {
        listen 8080;
        server_name localhost;

        location / {
            try_files $uri @cache_backend;
        }
 
        location @cache_backend {
            proxy_pass http://external-api-host;
            proxy_cache apiCache;
            # allow caching of POST responses (only GET and HEAD are cached by default)
            proxy_cache_methods POST;
            # include the request body in the key so identical payloads share a cache entry
            proxy_cache_key "$request_uri|$request_body";
            proxy_buffers 8 32k;
            proxy_buffer_size 64k;
            proxy_cache_valid 4h;
            proxy_cache_use_stale updating;
            # expose HIT/MISS so cached responses are easy to spot
            add_header X-Cached $upstream_cache_status;
        }
    }
}
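
to check whether a response came from the cache, look at the X-Cached header added by the config above (the endpoint and body here are placeholders); the first request should report MISS and identical follow-up requests HIT:

curl -s -D - -o /dev/null -X POST -d '{"q":"test"}' http://localhost:8080/some/endpoint | grep X-Cached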

I found that setting up a proxy between services is quite useful; I may write a simple go program to do it.

Sunday, January 28, 2018

didn't update for a long time, so here's a summary of the last few weeks:

Dask, like pandas + luigi, looks interesting but I don't have time to try it yet.


slackhq/go-audit: go-audit is an alternative to the auditd daemon that ships with many distros. I didn't know anything about auditd before; it may be useful under some circumstances.


I'm quite interested in Druid, a high-performance, column-oriented, distributed data store; it's often compared with cassandra and hbase.

Druid has experimental built-in sql support, using Apache Calcite

I think sql support is a very important feature.


after discussing how to scale websockets with my colleagues, I found Nchan - flexible pubsub for the modern web and gorilla/websocket: A WebSocket implementation for Go.

but I found something more fun: ameo: Redis compatible GET, SET, DEL and PUBLISH/SUBSCRIBE on riak_core with a WebSocket API


I used Capistrano at my old company, so I never really had a chance to use Fabric. But now basically everything is python, so I need to try it out and compare it with ansible


go has a standard library for pprof: pprof - The Go Programming Language

but uber has a better one: uber/go-torch: Stochastic flame graph profiler for Go programs


in this article: Stream & Go: News Feeds for Over 300 Million End Users

they use rocksdb and raft to replace cassandra; the solution is not open sourced, but it sounds interesting to me.


opentracing is a way to monitor microservices; jaegertracing/jaeger: CNCF Jaeger, a Distributed Tracing System is a system that supports opentracing, also by uber

about monitoring, statsd is something I looked into. I heard of another one, Diamond, written in python, which is also worth trying.

etsy/logster: Parse log files, generate metrics for Graphite and Ganglia; a log parser written in python, another winner.


awslabs/goformation: GoFormation is a Go library for working with CloudFormation templates.

I still don't understand cloudformation; I'd rather use awscli or the api instead.

awslabs has a lot of good stuff, like:

awslabs/amazon-sagemaker-examples: Example notebooks that show how to apply machine learning and deep learning in Amazon SageMaker

our team is doing similar tasks, so I have to say it's very practical.


reading Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing


so aws is Announcing Go Support for AWS Lambda

and an article about it: Speed and Stability: Why Go is a Great Fit for Lambda


some bookmarked links from my twitter:


finally, after 9 months, my /dev/tty keycap set arrived:

(photos of the /dev/tty keycap set)
