sample app

app.js:

const http = require('http');
const os = require('os');

console.log("Kubia server starting...");

var handler = function(request, response) {
  console.log("Received request from " + request.connection.remoteAddress);
  response.writeHead(200);
  response.end("You've hit " + os.hostname() + "\n");
};

var www = http.createServer(handler);
www.listen(8080);

Dockerfile:

FROM node:7
ADD app.js /app.js
ENTRYPOINT ["node", "app.js"]

introduction

$ kubectl cluster-info

bash completion

$ source <(kubectl completion bash)

create alias

alias k=kubectl

$ source <(kubectl completion bash | sed s/kubectl/k/g)
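
To make the alias and completion stick across shell sessions, you can append the same commands to ~/.bashrc (a minimal sketch for bash; adjust for your shell):

$ echo 'alias k=kubectl' >> ~/.bashrc
$ echo 'source <(kubectl completion bash | sed s/kubectl/k/g)' >> ~/.bashrc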

The simplest way to deploy your app is to use the kubectl run command, which will create all the necessary components without having to deal with JSON or YAML.

$ kubectl run kubia --image=luksa/kubia --port=8080 --generator=run/v1

To make the pod accessible from the outside, you'll expose it through a Service object. You'll create a special service of type LoadBalancer.

When you create a LoadBalancer-type service, an external load balancer is provisioned, and you can connect to the pod through the load balancer's public IP.

$ kubectl expose rc kubia --type=LoadBalancer --name kubia-http

$ kubectl get svc

We’re using the abbreviation rc instead of replicationcontroller (po for pods, svc for services).

scale up

$ kubectl scale rc kubia --replicas=3

$ kubectl get rc

$ kubectl get pods
$ kubectl get pods -o wide

$ kubectl describe pod kubia-hczji

dashboard

$ kubectl cluster-info | grep dashboard

$ gcloud container clusters describe kubia | grep -E "(username|password):"

$ minikube dashboard

pods

All pods in a Kubernetes cluster reside in a single flat, shared network-address space, which means every pod can access every other pod at the other pod's IP address.

No NAT (Network Address Translation) gateways exist between them.

Deciding when to use multiple containers in a pod

  • Do they need to be run together or can they run on different hosts?
  • Do they represent a single whole or are they independent components?
  • Must they be scaled together or individually?

pod definition

Three important sections are found in almost all Kubernetes resources:

  • metadata includes the name, namespace, labels, and other information about the pod.
  • spec contains the actual description of the pod’s contents, such as the pod’s containers, volumes, and other data.
  • status contains the current information about the running pod, such as what condition the pod is in, the description and status of each container, and the pod’s internal IP and other basic info.

more info:

$ kubectl explain pods
$ kubectl explain pod.spec

A basic pod manifest kubia-manual.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: kubia-manual
spec:
  containers:
  - image: luksa/kubia
    name: kubia
    ports:
    - containerPort: 8080
      protocol: TCP

$ kubectl create -f kubia-manual.yaml

$ kubectl get po kubia-manual -o yaml
$ kubectl get pods

$ docker logs <container id>
$ kubectl logs kubia-manual
$ kubectl logs kubia-manual -c kubia

$ kubectl port-forward kubia-manual 8888:8080

$ curl localhost:8888

labels

metadata:
  name: kubia-manual-v2
  labels:
    creation_method: manual
    env: prod

$ kubectl create -f kubia-manual-with-labels.yaml
$ kubectl get po --show-labels
$ kubectl get po -L creation_method,env

add new label:

$ kubectl label po kubia-manual creation_method=manual

update label:

$ kubectl label po kubia-manual-v2 env=debug --overwrite

listing pods using a label selector

$ kubectl get po -l creation_method=manual

$ kubectl get po -l env
$ kubectl get po -l '!env'
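
A few more selector forms kubectl accepts (for illustration; the env values here are hypothetical):

$ kubectl get po -l creation_method!=manual
$ kubectl get po -l 'env in (prod,devel)'
$ kubectl get po -l 'env notin (prod,devel)'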

label nodes

$ kubectl label node gke-kubia-85f6-node-0rrx gpu=true

$ kubectl get nodes -l gpu=true

schedule pods to specific node

apiVersion: v1
kind: Pod
metadata:
  name: kubia-gpu
spec:
  nodeSelector:
    gpu: "true"
  containers:
  - image: luksa/kubia
    name: kubia

namespaces

$ kubectl get ns

$ kubectl get po --namespace kube-system

create namespace

apiVersion: v1
kind: Namespace
metadata:
  name: custom-namespace

$ kubectl create -f custom-namespace.yaml

# or 

$ kubectl create namespace custom-namespace

create pod under namespace:

$ kubectl create -f kubia-manual.yaml -n custom-namespace

deleting pod

$ kubectl delete po kubia-gpu

# delete by label

$ kubectl delete po -l creation_method=manual

$ kubectl delete po -l rel=canary

# delete by namespace

$ kubectl delete ns custom-namespace

# delete all under current namespace

$ kubectl delete po --all

Replication and other controllers

liveness probes

Kubernetes can probe a container using one of the three mechanisms:

  • An HTTP GET probe performs an HTTP GET request on the container’s IP address, a port and path you specify.
  • A TCP Socket probe tries to open a TCP connection to the specified port of the container.
  • An Exec probe executes an arbitrary command inside the container and checks the command’s exit status code.

http get probe:

apiVersion: v1
kind: Pod
metadata:
  name: kubia-liveness
spec:
  containers:
  - image: luksa/kubia-unhealthy
    name: kubia
    livenessProbe:
      httpGet:
        path: /
        port: 8080

$ kubectl get po kubia-liveness

$ kubectl logs mypod --previous

$ kubectl describe po kubia-liveness

delay probe

livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 15

If you don’t set the initial delay, the prober will start probing the container as soon as it starts.

Even if you set the failure threshold to 1, Kubernetes retries the probe several times before considering it a single failed attempt. Implementing your own retry loop inside the probe is therefore wasted effort.

If you’re running a Java app in your container, be sure to use an HTTP GET liveness probe instead of an Exec probe, where you spin up a whole new JVM to get the liveness information. The same goes for any JVM-based or similar applications, whose start-up procedure requires considerable computational resources.

ReplicationController

A ReplicationController has three essential parts:

  • A label selector, which determines what pods are in the ReplicationController’s scope
  • A replica count, which specifies the desired number of pods that should be running
  • A pod template, which is used when creating new pod replicas
apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia
spec:
  replicas: 3
  selector:
    app: kubia
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
        ports:
        - containerPort: 8080

$ kubectl create -f kubia-rc.yaml

$ kubectl get pods

$ kubectl get rc

$ kubectl describe rc kubia

Moving pods in and out of the scope of a ReplicationController

If you change a pod’s labels so they no longer match a ReplicationController’s label selector, the pod becomes like any other manually created pod. It’s no longer managed by anything.

$ kubectl label pod kubia-dmdck type=special
$ kubectl label pod kubia-dmdck app=foo --overwrite

Changing the pod template

$ kubectl edit rc kubia

You can tell kubectl to use a text editor of your choice by setting the KUBE_EDITOR environment variable.
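
For example (the editor path is only an illustration):

$ export KUBE_EDITOR="/usr/bin/nano"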

Horizontally scaling pods

$ kubectl scale rc kubia --replicas=10

or:

$ kubectl edit rc kubia

spec:
  replicas: 3
  selector:
    app: kubia

delete rc but keep its pods running:

$ kubectl delete rc kubia --cascade=false

ReplicaSet

ReplicaSet is a new generation of ReplicationController and replaces it completely.

A ReplicaSet behaves exactly like a ReplicationController, but it has more expressive pod selectors.

Whereas a ReplicationController’s label selector only allows matching pods that include a certain label, a ReplicaSet’s selector also allows matching pods that lack a certain label or pods that include a certain label key, regardless of its value.

apiVersion: apps/v1beta2
kind: ReplicaSet
metadata:
  name: kubia
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kubia
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia

ReplicaSets aren't part of the v1 API, but belong to the apps API group and version v1beta2.

$ kubectl get rs

$ kubectl describe rs

matchExpressions:

selector:
  matchExpressions:
  - key: app
    operator: In
    values:
    - kubia

Each expression must contain a key, an operator, and possibly (depending on the operator) a list of values.

You’ll see four valid operators:

  • In—Label’s value must match one of the specified values.
  • NotIn—Label’s value must not match any of the specified values.
  • Exists—Pod must include a label with the specified key (the value isn’t important). When using this operator, you shouldn’t specify the values field.
  • DoesNotExist—Pod must not include a label with the specified key. The values property must not be specified.
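
For illustration, a selector sketch using the Exists operator (note there is no values field):

selector:
  matchExpressions:
  - key: app
    operator: Exists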

Running exactly one pod on each node with DaemonSets

A DaemonSet makes sure it creates as many pods as there are nodes and deploys each one on its own node.

If a node goes down, the DaemonSet doesn’t cause the pod to be created elsewhere. But when a new node is added to the cluster, the DaemonSet immediately deploys a new pod instance to it.

apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  name: ssd-monitor
spec:
  selector:
    matchLabels:
      app: ssd-monitor
  template:
    metadata:
      labels:
        app: ssd-monitor
    spec:
      nodeSelector:
        disk: ssd
      containers:
      - name: main
        image: luksa/ssd-monitor

$ kubectl create -f ssd-monitor-daemonset.yaml

$ kubectl get ds

$ kubectl label node minikube disk=ssd

Running pods that perform a single completable task

Job resource

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    metadata:
      labels:
        app: batch-job
    spec:
      restartPolicy: OnFailure
      containers:
      - name: main
        image: luksa/batch-job

Job pods can’t use the default policy, because they’re not meant to run indefinitely. Therefore, you need to explicitly set the restart policy to either OnFailure or Never.

$ kubectl get jobs

# after job is done
$ kubectl get po -a
$ kubectl logs batch-job-28qf4

Running multiple pod instances in a Job

run five pods sequentially

apiVersion: batch/v1
kind: Job
metadata:
  name: multi-completion-batch-job
spec:
  completions: 5
  template:
    <template is the same as above>

running job pods in parallel:

apiVersion: batch/v1
kind: Job
metadata:
  name: multi-completion-batch-job
spec:
  completions: 5
  parallelism: 2
  template:
    <same as above>

Scaling a Job

You can even change a Job’s parallelism property while the Job is running.

$ kubectl scale job multi-completion-batch-job --replicas 3

limiting the time allowed for a Job pod to complete

A pod’s time can be limited by setting the activeDeadlineSeconds property in the pod spec.

You can configure how many times a Job can be retried before it is marked as failed by specifying the spec.backoffLimit field in the Job manifest. If you don’t explicitly specify it, it defaults to 6.
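
A sketch combining both fields (the Job name and the values are arbitrary examples): backoffLimit sits in the Job's spec, while activeDeadlineSeconds goes into the pod spec inside the template:

apiVersion: batch/v1
kind: Job
metadata:
  name: time-limited-batch-job
spec:
  backoffLimit: 6
  template:
    spec:
      activeDeadlineSeconds: 30
      restartPolicy: OnFailure
      containers:
      - name: main
        image: luksa/batch-job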

scheduling Jobs to run periodically or once in the future

Creating a CronJob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: batch-job-every-fifteen-minutes
spec:
  schedule: "0,15,30,45 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: periodic-batch-job
        spec:
          restartPolicy: OnFailure
          containers:
          - name: main
            image: luksa/batch-job

The Job or pod may be created and run relatively late after the scheduled time.

If you have a hard requirement that the job must not start too far past its scheduled time, you can specify a deadline with the startingDeadlineSeconds field.

apiVersion: batch/v1beta1
kind: CronJob
spec:
  schedule: "0,15,30,45 * * * *"
  startingDeadlineSeconds: 15
  ...

Services

apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia

$ kubectl get svc
NAME         CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
kubernetes   10.111.240.1     <none>        443/TCP   30d
kubia        10.111.249.153   <none>        80/TCP    6m

$ kubectl exec kubia-7nog1 -- curl -s http://10.111.249.153

If you want all requests made by a certain client to be redirected to the same pod every time, you can set the service's sessionAffinity property to ClientIP (instead of None, which is the default).

apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  ...

Exposing multiple ports in the same service

apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  ports:
  - name: http
    port: 80
    targetPort: 8080
  - name: https
    port: 443
    targetPort: 8443
  selector:
    app: kubia

When creating a service with multiple ports, you must specify a name for each port.

Discovering services through environment variables

When a pod is started, Kubernetes initializes a set of environment variables pointing to each service that exists at that moment.

$ kubectl exec kubia-3inly env
...
KUBIA_SERVICE_HOST=10.111.249.153
KUBIA_SERVICE_PORT=80
...

Discovering services through DNS

Whether a pod uses the internal DNS server or not is configurable through the dnsPolicy property in each pod’s spec.
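
A minimal sketch of setting it explicitly (ClusterFirst is the usual default; Default and None are other accepted values; the pod name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: dns-example
spec:
  dnsPolicy: ClusterFirst
  containers:
  - image: luksa/kubia
    name: kubia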

root@kubia-3inly:/# curl http://kubia.default.svc.cluster.local
You've hit kubia-5asi2

root@kubia-3inly:/# curl http://kubia.default
You've hit kubia-3inly

root@kubia-3inly:/# curl http://kubia
You've hit kubia-8awf3

You can omit the namespace and the svc.cluster.local suffix because of how the DNS resolver inside each pod’s container is configured:

root@kubia-3inly:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local ...

Endpoints resource

$ kubectl get endpoints kubia

Manually configuring service endpoints

apiVersion: v1
kind: Service
metadata:
  name: external-service
spec:
  ports:
  - port: 80

Creating an Endpoints resource for a service without a selector

apiVersion: v1
kind: Endpoints
metadata:
  name: external-service
subsets:
  - addresses:
    - ip: 11.11.11.11
    - ip: 22.22.22.22
    ports:
    - port: 80

Creating an ExternalName service

apiVersion: v1
kind: Service
metadata:
  name: external-service
spec:
  type: ExternalName
  externalName: someapi.somecompany.com
  ports:
  - port: 80

After the service is created, pods can connect to the external service through the external-service.default.svc.cluster.local domain name (or even external-service) instead of using the service’s actual FQDN.

ExternalName services are implemented solely at the DNS level—a simple CNAME DNS record is created for the service. Therefore, clients connecting to the service will connect to the external service directly, bypassing the service proxy completely. For this reason, these types of services don’t even get a cluster IP.

Exposing services to external clients

You have a few ways to make a service accessible externally:

  • Setting the service type to NodePort—each cluster node opens a port on the node itself (hence the name) and redirects traffic received on that port to the underlying service
  • Setting the service type to LoadBalancer, an extension of the NodePort type—This makes the service accessible through a dedicated load balancer, provisioned from the cloud infrastructure Kubernetes is running on. The load balancer redirects traffic to the node port across all the nodes. Clients connect to the service through the load balancer’s IP.
  • Creating an Ingress resource, a radically different mechanism for exposing multiple services through a single IP address—It operates at the HTTP level (network layer 7) and can thus offer more features than layer 4 services can.

NodePort service

apiVersion: v1
kind: Service
metadata:
  name: kubia-nodeport
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30123
  selector:
    app: kubia

$ kubectl get svc kubia-nodeport

Using JSONPath to get the IPs of all your nodes

$ kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}'

Exposing a service through an external load balancer

If Kubernetes is running in an environment that doesn’t support LoadBalancer services, the load balancer will not be provisioned, but the service will still behave like a NodePort service.

apiVersion: v1
kind: Service
metadata:
  name: kubia-loadbalancer
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia

$ kubectl get svc kubia-loadbalancer

The browser uses keep-alive connections and sends all its requests through a single connection, whereas curl opens a new connection every time.

Services work at the connection level, so when a connection to a service is first opened, a random pod is selected and then all network packets belonging to that connection are all sent to that single pod.

Even if session affinity is set to None, users will always hit the same pod (until the connection is closed).

preventing unnecessary network hops

configuring the service to redirect external traffic only to pods running on the node that received the connection

spec:
  externalTrafficPolicy: Local
  ...

If a service definition includes this setting and an external connection is opened through the service’s node port, the service proxy will choose a locally running pod.

If no local pods exist, the connection will hang. You therefore need to ensure the load balancer forwards connections only to nodes that have at least one such pod.

Using this setting also has other drawbacks. Normally, connections are spread evenly across all the pods, but with this setting that's no longer the case.

It also affects the preservation of the client's IP: because there's no additional hop between the node receiving the connection and the node hosting the target pod, SNAT isn't performed and the client's IP is preserved.

Exposing services externally through an Ingress resource

One important reason to use an Ingress is that each LoadBalancer service requires its own load balancer with its own public IP address, whereas an Ingress only requires one, even when providing access to dozens of services.

For Ingress resources to work, an Ingress controller needs to be running in the cluster.

Enabling the Ingress add-on in Minikube

$ minikube addons list

$ minikube addons enable ingress

$ kubectl get po --all-namespaces

creating an Ingress resource

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kubia
spec:
  rules:
  - host: kubia.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: kubia-nodeport
          servicePort: 80

Ingress controllers on cloud providers (in GKE, for example) require the Ingress to point to a NodePort service. But that’s not a requirement of Kubernetes itself.

$ kubectl get ingresses

Exposing multiple services through the same Ingress

You can map multiple paths on the same host to different services

...
  - host: kubia.example.com
    http:
      paths:
      - path: /kubia
        backend:
          serviceName: kubia
          servicePort: 80
      - path: /foo
        backend:
          serviceName: bar
          servicePort: 80

Similarly, you can use an Ingress to map to different services based on the host in the HTTP request instead of (only) the path

spec:
  rules:
  - host: foo.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: foo
          servicePort: 80
  - host: bar.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: bar
          servicePort: 80

Configuring Ingress to handle TLS traffic

When a client opens a TLS connection to an Ingress controller, the controller terminates the TLS connection.

The application running in the pod doesn’t need to support TLS.

create the private key and certificate:

$ openssl genrsa -out tls.key 2048
$ openssl req -new -x509 -key tls.key -out tls.cert -days 360 -subj /CN=kubia.example.com

$ kubectl create secret tls tls-secret --cert=tls.cert --key=tls.key

Instead of signing the certificate yourself, you can get it signed by creating a CertificateSigningRequest (CSR) resource and having it approved.

$ kubectl certificate approve <name of the CSR>

The private key and the certificate are now stored in the Secret called tls-secret

ingress manifest:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kubia
spec:
  tls:
  - hosts:
    - kubia.example.com
    secretName: tls-secret
  rules:
  - host: kubia.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: kubia-nodeport
          servicePort: 80

Instead of deleting the Ingress and re-creating it from the new file, you can invoke kubectl apply -f kubia-ingress-tls.yaml, which updates the Ingress resource with what’s specified in the file.
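
The command referenced above:

$ kubectl apply -f kubia-ingress-tls.yaml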

$ curl -k -v https://kubia.example.com/kubia

Although Ingresses currently support only L7 (HTTP/HTTPS) load balancing, support for L4 load balancing is also planned.

Signaling when a pod is ready to accept connections

readiness probe is invoked periodically and determines whether the specific pod should receive client requests or not.

Types of readiness probes

Like liveness probes, three types of readiness probes exist:

  • An Exec probe, where a process is executed. The container’s status is determined by the process’ exit status code.
  • An HTTP GET probe, which sends an HTTP GET request to the container and the HTTP status code of the response determines whether the container is ready or not.
  • A TCP Socket probe, which opens a TCP connection to a specified port of the container. If the connection is established, the container is considered ready.

When a container is started, Kubernetes can be configured to wait for a configurable amount of time to pass before performing the first readiness check. After that, it invokes the probe periodically and acts based on the result of the readiness probe. If a pod reports that it’s not ready, it’s removed from the service.

Unlike liveness probes, if a container fails the readiness check, it won’t be killed or restarted. This is an important distinction between liveness and readiness probes.

Liveness probes keep pods healthy by killing off unhealthy containers and replacing them with new, healthy ones,

whereas readiness probes make sure that only pods that are ready to serve requests receive them.

Adding a readiness probe to the pod template

$ kubectl edit rc kubia
apiVersion: v1
kind: ReplicationController
...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
        readinessProbe:
          exec:
            command:
            - ls
            - /var/ready
...

$ kubectl get po

$ kubectl exec kubia-2r1qb -- touch /var/ready

The readiness probe is checked periodically—every 10 seconds by default.

Understanding what real-world readiness probes should do

Manually removing pods from services should be performed by either deleting the pod or changing the pod’s labels instead of manually flipping a switch in the probe.

If you want to add or remove a pod from a service manually, add enabled=true as a label to your pod and to the label selector of your service. Remove the label when you want to remove the pod from the service.
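
A sketch of that approach (the enabled label and the pod name are illustrative, not anything Kubernetes requires):

$ kubectl label po kubia-manual enabled=true

spec:
  selector:
    app: kubia
    enabled: "true"

# remove the pod from the service again by removing the label
$ kubectl label po kubia-manual enabled-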

You should always define a readiness probe, even if it’s as simple as sending an HTTP request to the base URL.
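
A minimal readiness probe along those lines (path and timings are placeholders):

readinessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10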

Don’t include pod shutdown logic into your readiness probes, because Kubernetes removes the pod from all services as soon as you delete the pod.

Creating a headless service

Setting the clusterIP field in a service spec to None makes the service headless, as Kubernetes won't assign it a cluster IP through which clients could connect to the pods backing it.

apiVersion: v1
kind: Service
metadata:
  name: kubia-headless
spec:
  clusterIP: None
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia

$ kubectl exec <pod name> -- touch /var/ready
$ kubectl run dnsutils --image=tutum/dnsutils --generator=run-pod/v1 --command -- sleep infinity

$ kubectl exec dnsutils nslookup kubia-headless