Why clone twice
如前一篇博文介绍,在nsenter包中,实际进行了两次进程clone,分别有parent, child, init三个进程进行相应的交互处理后,最后留下init运行go runtime。
这三个进程的关系是 parent –> child –> init,注意箭头只是clone关系,因为在clone时,clone flags参数为CLONE_PARENT | SIGCHLD
static  int  clone_parent(jmp_buf *env, int jmpval)  __attribute__  ((noinline));
static  int  clone_parent(jmp_buf *env, int jmpval)
{
    struct  clone_t ca  =  {
        .env =  env,
        .jmpval =  jmpval,
    };
    return  clone(child_func,  ca.stack_ptr,  CLONE_PARENT  |  SIGCHLD,  &ca);
}
所以这三个进程实际是具有相同ppid。
这里之所以要clone两次,一次是因为CLONE_NEWPID后,child进程才会进入该namespace,描述见下,这就是第二次clone生成init进程的原因, 当在child进程中设置namespace后,child进程的pid namespace并不起作用,需要在clone出init进程,使其与child进程同namespace,但是pid namespace生效。
**CLONE_NEWPID** (since Linux 2.6.24)
              If **CLONE_NEWPID** is set, then create the process in a new PID
              namespace.  If this flag is not set, then (as with [fork(2)](http://man7.org/linux/man-pages/man2/fork.2.html))
              the process is created in the same PID namespace as the
              calling process.  This flag is intended for the implementation
              of containers.
另一个clone的原因是,由于内核原因,一方面不能将USER namespace与其它namespace一起挂载,那样会导致namespace的所属不清楚的问题,另一方面对于rootless container应为没有CAP_SYS_ADMIN权限而无法挂载其它namespace(见下说明),所以首先需要先挂载user namespace,而有些操作系统挂载了user namespace后如果不做uid/gid map的话,后面操作也会报错,所以需要在挂载user namespace后先完成uid/gid map。 而一旦先挂载了user namespace,那么配置必须要由原来的namespace来做(见下说明2),于是这里必须得有一次clone,也就是parent clone出child进程。
       Starting in Linux 3.8, unprivileged processes can create user
       namespaces, and the other types of namespaces can be created with
       just the **CAP_SYS_ADMIN** capability in the caller's user namespace.
       If **CLONE_NEWUSER** is specified along with other **CLONE_NEW*** flags in a
       single [clone(2)](http://man7.org/linux/man-pages/man2/clone.2.html) or [unshare(2)](http://man7.org/linux/man-pages/man2/unshare.2.html) call, the user namespace is guaranteed
       to be created first, giving the child ([clone(2)](http://man7.org/linux/man-pages/man2/clone.2.html)) or caller
       ([unshare(2)](http://man7.org/linux/man-pages/man2/unshare.2.html)) privileges over the remaining namespaces created by the
       call.  Thus, it is possible for an unprivileged caller to specify
       this combination of flags
      The _uid_map_ file exposes the mapping of user IDs from the user
       namespace of the process _pid_ to the user namespace of the process
       that opened _uid_map_ (but see a qualification to this point below).
       In other words, processes that are in different user namespaces will
       potentially see different values when reading from a particular
       _uid_map_ file, depending on the user ID mappings for the user
       namespaces of the reading processes.
我们知道把user namespace与其它namespace分开挂载的话,将会有很多种方案:
- 先clone user-ns,再clone others-ns
- 先ushare user-ns,再clone others-ns
- 先clone user-ns, 再unshare other-ns
第一种方案,必须要开启dump clone flags,这对于rootless container来讲将没法满足。 第二种方案,unshare user-ns后,原来的进程由于进入了新namespace,将没有权限设置多个uid/gid map(uid, gid两个)。 所以这里runc采用先clone在unshare的方式. 然而事实并不是就调用一下clone然后再unshare那么简单,因为要考虑使用已有namespace的问题。最后的逻辑是先clone,然后挂载已有namespace(见下说明),接着进入user namesapce,然后再unshare,注释如下:
//挂载已有的namespaces
if (config.namespaces)
    join_namespaces(config.namespaces);
//先new user namespace
if (config.cloneflags &  CLONE_NEWUSER)  {
    if (unshare(CLONE_NEWUSER)  < 0)
        bail("failed to unshare user namespace");
    config.cloneflags &=  ~CLONE_NEWUSER;
if (config.namespaces)  {
    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0)  < 0)
        bail("failed  to  set  process  as  dumpable");
}
//调用parent 配置uid/gid映射
s  =  SYNC_USERMAP_PLS;
if (write(syncfd,  &s, sizeof(s))  != sizeof(s))
    bail("failed  to  sync  with  parent:  write(SYNC_USERMAP_PLS)");
/* ...  wait  for  mapping  ... */
if (read(syncfd,  &s, sizeof(s))  != sizeof(s))
    bail("failed to sync with parent: read(SYNC_USERMAP_ACK)");
.........
//unshare namespaces
if (unshare(config.cloneflags &  ~CLONE_NEWCGROUP)  < 0)
    bail("failed  to  unshare  namespaces");
说明: 这里join_namespaces是加入runc配置的已有namespace,这个已有namespace是在bundle的config.json中的linux»namespaces配置, 如果这里配置了path,则是使用已有namespace,没有配置则是进行cloneflags进行ushare。这部分处理逻辑在container初始化bootstrapdata时进行了判断https://github.com/opencontainers/runc/blob/v1.0.0-rc8/libcontainer/container_linux.go#L500
"namespaces": [
                        {
                                "type": "pid"
                        },
                        {
                                "type": "network"
                        },
                        {
                                "type": "ipc"
                        },
                        {
                                "type": "uts"
                        },
                        {
                                "type": "mount"
                        }
                ],