Thursday, January 04, 2018
some notes on luigi:
luigi does not come with a feature to trigger jobs, even if you're using luigid; rundeck or jenkins could help.
I use rundeck and run tasks with the local scheduler.
to retry tasks when using the local scheduler, you'll need to combine --scheduler-retry-count=2 --scheduler-retry-delay=10 --worker-keep-alive --local-scheduler
- retry count can be a per-task policy
- the default retry delay is 15 minutes, which is too long for the local scheduler
- with the local scheduler, the worker needs to be kept alive in order to retry tasks; without --worker-keep-alive, the whole run ends once a task fails
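the same settings can live in luigi.cfg instead of the command line (section and option names per luigi's configuration docs; the values here just mirror the flags above):

```ini
[scheduler]
retry_count = 2
retry_delay = 10

[worker]
keep_alive = true
```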
by default all exit codes are 0, unless you define these in your luigi.cfg: (see here)
[retcode]
already_running=10
missing_data=20
not_run=25
task_failed=30
scheduling_error=35
unhandled_exception=40
when a task fails, rundeck marks the run as successful since it receives exit code 0; I'll just add --retcode-task-failed=1 to solve the problem.
to avoid passing too many arguments, you can use a luigi.cfg or pass them to luigi.run()
LUIGI_CONFIG_PATH can be used for a per-task config file:
LUIGI_CONFIG_PATH=task1.cfg luigi --module=task1 Task1 --local-scheduler
or add this to the end of the file:
if __name__ == '__main__':
    run_params = ['Task1',
                  '--scheduler-retry-count=2',
                  '--scheduler-retry-delay=10',
                  '--worker-keep-alive',
                  '--local-scheduler']
    luigi.run(run_params)
then you can execute the python script directly, which runs with the pre-defined options, or run with the normal luigi --module=... for a different set of options
note: in order to use the --retcode-xxx arguments, you need to run with luigi.retcodes.run_with_retcodes(run_params)
for some table insert tasks, I'll implement a complete() method to check whether the data was inserted before: (see here)
def complete(self):
    # ... select count(*) from table where last_insert_date='xxx' ...
    # return True when the count is non-zero
    return bool
luigi.contrib has database Targets; however, they just check whether the table exists.
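the idea, stripped of luigi and sketched against sqlite3 (table and column names are hypothetical; in a real luigi.Task, complete() has the same shape and luigi calls it before scheduling run()):

```python
import sqlite3


class InsertDailyRows:
    """Sketch of a task whose complete() checks whether the day's rows
    already landed, instead of only checking that the table exists."""

    def __init__(self, conn, insert_date):
        self.conn = conn
        self.insert_date = insert_date

    def complete(self):
        # hypothetical table/column: events(last_insert_date, ...)
        cur = self.conn.execute(
            "SELECT count(*) FROM events WHERE last_insert_date = ?",
            (self.insert_date,))
        return cur.fetchone()[0] > 0
```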
pandas: I had a unicode error on dataframe.to_csv(); it was finally solved by adding ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 to the Dockerfile
Saturday, January 06, 2018
to access database via a bastion machine, use ssh tunnel:
ssh -N -L 3366:database-host:3306 bastion-host
or add to .ssh/config:
Host bastion-host
    Hostname bastion-ip
    LocalForward 3366 database-host:3306
then simply use ssh -N bastion-host to set up the tunnel
some clojure stuff to try out:
- onyx-platform/onyx: Distributed, masterless, high performance, fault tolerant data processing
- Netflix/PigPen: Map-Reduce for Clojure
- nathanmarz/specter: Clojure(Script)'s missing piece
- juxt/tick: Time as a value.
- framed-data/overseer: Overseer is a library for building and running data pipelines in Clojure.
- gorillalabs/sparkling: A Clojure library for Apache Spark: fast, fully-featured, and developer friendly
- didiercrunch/lein-jupyter: A Leiningen plugin to integrate clojure with jupyter notebook
$ lein jupyter install-kernel
$ lein jupyter notebook
Tuesday, January 09, 2018
Rich Hickey wrote about Git Deps for Clojure; I haven't had time to read the Deps and CLI Guide yet.
it's definitely the implementation of his Spec-ulation talk
the biggest news recently: Meltdown and Spectre:
- Notes from the Intelpocalypse
- Why Raspberry Pi isn't vulnerable to Spectre or Meltdown
- Spectre & Meltdown: tapping into the CPU's subconscious thoughts
the first and the writer's favorite paper, Out of the Tar Pit / Ben Moseley, Peter Marks / 2006 (PDF); I was actually reading it as well. David Nolen mentioned it in his talk: A Practical Functional Relational Architecture, and it was reviewed by the morning paper: Out of the Tar Pit too
I also watched a few interesting talks:
- Beyond Eventual Consistency (a nice summary on distributed databases)
- Ask Your Proxy, It Knows Everything (so it's called service mesh?)
- Cluster-in-a-Box: Deploying Kubernetes on lxd (I like lxd)
I've always wanted to try creating a k8s cluster on lxd
I need to test out an old code base that requests an external api for data.
the first thing I want to do is set up a proxy to cache the external api responses, and also be able to monitor requests.
since those are POST requests, it may be a bit difficult for varnish; nginx is simpler:
http {
    client_max_body_size 50k;
    proxy_cache_path /tmp/cache levels=1:2 keys_zone=apiCache:1m inactive=4h;
    server {
        listen 8080;
        server_name localhost;
        location / {
            try_files $uri @cache_backend;
        }
        location @cache_backend {
            proxy_pass http://external-api-host;
            proxy_cache apiCache;
            proxy_cache_methods POST;
            proxy_cache_key "$request_uri|$request_body";
            proxy_buffers 8 32k;
            proxy_buffer_size 64k;
            proxy_cache_valid 4h;
            proxy_cache_use_stale updating;
            add_header X-Cached $upstream_cache_status;
        }
    }
}
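the important trick above is putting the request body into the cache key, since for POST requests the body, not the URL, distinguishes calls. the key nginx builds is roughly this (a sketch, with md5 standing in for nginx's internal hashing):

```python
import hashlib


def cache_key(request_uri: str, request_body: bytes) -> str:
    # mirrors proxy_cache_key "$request_uri|$request_body"
    raw = request_uri.encode() + b"|" + request_body
    return hashlib.md5(raw).hexdigest()
```

two POSTs to the same URI with different bodies get different cache entries; identical requests hit the cache.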
I found that setting up a proxy between services is quite useful; I may write a simple go program to do it.
Sunday, January 28, 2018
didn't update for a long time, so here's a summary from the last few weeks:
Dask looks like pandas + luigi; interesting, but I haven't had time to try it yet.
slackhq/go-audit: go-audit is an alternative to the auditd daemon that ships with many distros. I didn't know anything about auditd before; it may be useful under some circumstances.
I'm quite interested in Druid, a high-performance, column-oriented, distributed data store; it's often compared with cassandra and hbase.
Druid has experimental built-in sql support, using Apache Calcite.
I think sql support is a very important feature.
after discussing how to scale websockets with my colleagues, I found Nchan - flexible pubsub for the modern web and gorilla/websocket: A WebSocket implementation for Go.
but I found something more fun:
- ameo: Redis compatible GET,SET,DEL and PUBLISH/SUBSCRIBE on riak_core with WebSocket API
I used Capistrano at my old company, so I never really had a chance to use Fabric. but now basically everything is python, so I need to try it out and compare it with ansible
go's standard library includes pprof: pprof - The Go Programming Language
but uber has a better one: uber/go-torch: Stochastic flame graph profiler for Go programs
in this article: Stream & Go: News Feeds for Over 300 Million End Users,
they use rocksdb and raft to replace cassandra. the solution is not open sourced, but it sounds interesting to me.
opentracing is a way to monitor microservices; jaegertracing/jaeger: CNCF Jaeger, a Distributed Tracing System supports opentracing, and is also by uber
about monitoring, statsd is something I've looked into. I heard of another one, Diamond, written in python, also worth trying.
etsy/logster: Parse log files, generate metrics for Graphite and Ganglia, a log parser written in python; another winner.
awslabs/goformation: GoFormation is a Go library for working with CloudFormation templates.
I still don't understand cloudformation; I'd rather use awscli or the api instead.
awslabs has many good projects; our team is doing similar tasks, so I have to say it's very practical.
reading Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
so aws is Announcing Go Support for AWS Lambda,
and here's an article about it: Speed and Stability: Why Go is a Great Fit for Lambda
some bookmarked links from my twitter:
- ProxySQL Firewalling, always fun to put something in between mysql.
- It’s About Time For Time Series Databases; more important, I think, is the discussion on hackernews
- How to solve 90% of NLP problems: a step-by-step guide, NLP is something I have to solve.
- Building a Distributed Log from Scratch, Part 5
- Go Go, Go! Stream Processing for Go, they have a similar one before: Go Python, Go! Stream Processing for Python
- REST is the new SOAP
- a new podcast: Command Line Heroes
- Scaling Kubernetes to 2,500 Nodes
- Scaling SQLite to 4M QPS on a Single Server (EC2 vs Bare Metal)
- Twirp: a sweet new RPC framework for Go. I also think grpc is not simple enough, so I may give this a try.
- Using a Yubikey for GPG and SSH
- Modern Big Data Pipelines over Kubernetes. the reason I don't really want to invest in large-scale luigi / airflow is that I think kubernetes will kill them all; anything else would be a waste of time.
- The Modern Dev Team. I agree; what microservices / kubernetes are trying to solve is something erlang has been working on for a long time.
finally, after 9 months, my /dev/tty keycap set arrived: