Preface

I am continuing to learn Docker from the book Docker Pro by Huawei’s Docker group.

Docker is a combination of a variety of older techniques, such as cgroups (2006), namespaces and aufs (1993).

Architecture

[Figure: Docker architecture]

Libcontainer is the actual Docker engine. It uses clone() to create a container, pivot_root() to enter the container and cgroupfs to manage resources.

Docker components:

[Figure: Docker components]

A container shares the same kernel as the host OS.

Construction of Container

container = cgroup + namespace + rootfs + engine

How can we construct a container? Let’s learn about the process step by step.

First, create a new process that has its own namespaces, with the help of clone:

pid = clone(func, stack, flags, clone_arg);
/*
flags: CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWUTS | ...
*/

Second, write the PID of the process we just created into the cgroup filesystem, so that it can be controlled by cgroups:

echo $pid > /sys/fs/cgroup/cpu/tasks
echo $pid > /sys/fs/cgroup/cpuset/tasks
echo $pid > /sys/fs/cgroup/blkio/tasks
echo $pid > /sys/fs/cgroup/memory/tasks
echo $pid > /sys/fs/cgroup/devices/tasks
echo $pid > /sys/fs/cgroup/freezer/tasks

Third, let the process run func, which invokes pivot_root to switch itself into a new rootfs:

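int func(void *arg)
{
    // ...
    // pivot_root(new_root, put_old): new_root becomes "/",
    // and the old root is moved to put_old
    pivot_root("path_of_rootfs/", path);
    // ...
    execl("/bin/bash", "/bin/bash", (char *)NULL);
    return 1;   // reached only if exec fails
}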

In the end, func gives us a shell. Now let’s dive into the details.

Cgroup

Cgroups (control groups) provide QoS for resources such as CPU, memory, block I/O and network bandwidth. They consist of several sub-systems:

  • devices
  • cpuset
  • cpu,cpuacct
  • blkio (block I/O)
  • freezer
  • memory
  • net_cls, net_prio
  • hugetlb
  • perf_event

Like procfs or sysfs, cgroup is exposed as a virtual filesystem and must be mounted before you use it, though this has usually been done already by the time Linux finishes booting.

E.g. mount a sub-system

mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

E.g. files of cpuset sub-system

> ls
cgroup.clone_children  cpuset.memory_pressure
cgroup.procs           cpuset.memory_pressure_enabled
cgroup.sane_behavior   cpuset.memory_spread_page
cpuset.cpu_exclusive   cpuset.memory_spread_slab
cpuset.cpus            cpuset.mems
cpuset.effective_cpus  cpuset.sched_load_balance
cpuset.effective_mems  cpuset.sched_relax_domain_level
cpuset.mem_exclusive   notify_on_release
cpuset.mem_hardwall    release_agent
cpuset.memory_migrate  tasks

The tasks file lists all the processes controlled by this cgroup.

E.g. create a cgroup

mkdir /sys/fs/cgroup/cpuset/child

E.g. configure cgroup

echo 0 > /sys/fs/cgroup/cpuset/child/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/child/cpuset.mems

Then all the processes controlled by this cgroup can run only on CPU 0 and allocate memory only from memory node 0.

E.g. add process to cgroup

echo $$ > /sys/fs/cgroup/cpuset/child/tasks

Child processes created afterwards are automatically added to this cgroup as well.

Namespace

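Namespaces isolate resources, so that processes in different namespaces can use the same resource (the same hostname, PID or port, for example) at the same time without interfering with each other. There are 6 namespaces (newer kernels also provide a cgroup namespace):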

  • IPC (isolates System V IPC and POSIX message queues)
  • Network
  • Mount
  • PID
  • UTS (isolates hostname and domain name)
  • User (isolates user IDs and group IDs)

We can configure namespaces with the APIs clone, setns and unshare:

#include <sched.h>

// clone, __clone2 - create a child process
// new process into new namespace
int clone(int (*fn)(void *), void *child_stack,
         int flags, void *arg, ...
         /* pid_t *ptid, void *newtls, pid_t *ctid */ );

// unshare - disassociate parts of the process execution context
// existing process into new namespace
int unshare(int flags);

// setns - reassociate thread with a namespace
// existing process into existing namespace
int setns(int fd, int nstype);

But how does a process specify which namespace to join? It opens one of the virtual files under /proc/PID/ns of a process that is already in that namespace and passes the file descriptor to setns:

ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Nov 16 21:49 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Nov 16 21:49 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Nov 16 21:49 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Nov 16 21:49 net -> 'net:[4026531992]'
lrwxrwxrwx 1 root root 0 Nov 16 21:49 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Nov 16 21:49 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Nov 16 21:49 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Nov 16 21:49 uts -> 'uts:[4026531838]'

The docker exec command is based on setns.

Let’s get our hands dirty and give it a try!

UTS

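#define _GNU_SOURCE             /* needed for clone() and CLONE_NEWUTS */
#include <sys/utsname.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char stack[STACK_SIZE];
static char *const child_args[] = {"/bin/bash", NULL};

static int child(void *arg)
{
        /* Runs in the new UTS namespace; replace the child with a shell. */
        execv("/bin/bash", child_args);
        perror("execv");        /* reached only if execv fails */
        return 1;
}

int main(int argc, char **argv)
{
        pid_t pid;
        /* Pass the top of the stack, since it grows downwards. */
        pid = clone(child, stack + STACK_SIZE, SIGCHLD | CLONE_NEWUTS, NULL);
        if (pid == -1) {
                perror("clone");        /* CLONE_NEWUTS usually requires root */
                return 1;
        }

        waitpid(pid, NULL, 0);

        return 0;
}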

[Screenshot: running the UTS example]

IPC

We can change the macro CLONE_NEWUTS to CLONE_NEWIPC and test it:

[Screenshot: running the example with CLONE_NEWIPC]

Network

Likewise, change the flag to CLONE_NEWNET:

[Screenshot: running the example with CLONE_NEWNET]

PID

Likewise, change the flag to CLONE_NEWPID:

[Screenshot: running the example with CLONE_NEWPID]

User

Likewise, change the flag to CLONE_NEWUSER:

[Screenshot: running the example with CLONE_NEWUSER]

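That’s good. But why is the UID 65534? That is the overflow ID the kernel reports for a user that has no mapping in the new user namespace. Can we change it to 0, just as we see in a Docker container? We can, by writing a mapping to /proc/PID/uid_map. Following the article "Docker基础技术:Linux Namespace(下)" (Docker Fundamentals: Linux Namespace, Part 2), we modify the code as follows.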

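#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>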

#define STACK_SIZE (1024 * 1024)

static char container_stack[STACK_SIZE];
char* const container_args[] = {
    "/bin/bash",
    NULL
};

int pipefd[2];

void set_map(char* file, int inside_id, int outside_id, int len) {
    FILE* mapfd = fopen(file, "w");
    if (NULL == mapfd) {
        perror("open file error");
        return;
    }
    fprintf(mapfd, "%d %d %d", inside_id, outside_id, len);
    fclose(mapfd);
}

void set_uid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/uid_map", pid);
    set_map(file, inside_id, outside_id, len);
}

void set_gid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/gid_map", pid);
    set_map(file, inside_id, outside_id, len);
}

int container_main(void* arg)
{

    printf("Container [%5d] - inside the container!\n", getpid());

    printf("Container: eUID = %ld;  eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

    /* wait for parent-process */
    char ch;
    close(pipefd[1]);
    read(pipefd[0], &ch, 1);

    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main()
{
    const int gid=getgid(), uid=getuid();

    printf("Parent: eUID = %ld;  eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

    pipe(pipefd);

    printf("Parent [%5d] - start a container!\n", getpid());

    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUSER | SIGCHLD, NULL);

    printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);

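    /* Map UID/GID 0 inside the namespace to the caller's IDs outside.
       Without root, Linux 3.19+ also requires writing "deny" to
       /proc/PID/setgroups before gid_map can be written. */
    set_uid_map(container_pid, 0, uid, 1);
    set_gid_map(container_pid, 0, gid, 1);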

    printf("Parent [%5d] - user/group mapping done!\n", getpid());

    /* inform sub-process */
    close(pipefd[1]);

    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

[Screenshot: running the user namespace example with UID/GID mapping]

If you combine all of these pieces, you get something very similar to a Docker container!