Learning Linux Kernel internals
This project's sole purpose is to help me learn about the Linux kernel.
The kernel does not have access to libc, but many similar functions are available from inside the kernel. For example, there is printk.
The kernel stack is small and of fixed size, which is configurable using a compile-time option.
Processes
A process is represented as a struct named task_struct, which contains all the information that the kernel needs about the process (like the process's address space, open files, pending signals, the state of the process, etc.).
Virtual Address Space
Each process has its own virtual address space, and from the process's point of view it is the only process that exists.
This address space looks something like this:
+-------------------------+ 0xffffffff
1GB| |
| Kernel space |
| |
|-------------------------| 0xc0000000 [TASK_SIZE](https://github.com/torvalds/linux/blob/4a3033ef6e6bb4c566bd1d556de69b494d76976c/arch/arm/include/asm/memory.h#L31)
| User space |
|-------------------------|
| Stack segment |
| ↓ |
|-------------------------| esp (extended stack pointer)
| |
|-------------------------|
| Memory Mapped Segment |
| ↓ |
|-------------------------|
3GB| |
| |
| |
| |
| | program break
|-------------------------| [brk](https://github.com/torvalds/linux/blob/b07f636fca1c8fbba124b0082487c0b3890a0e0c/include/linux/mm_types.h#L451)
| ↑ |
| Heap segment |
| |
|-------------------------| [start brk](https://github.com/torvalds/linux/blob/b07f636fca1c8fbba124b0082487c0b3890a0e0c/include/linux/mm_types.h#L451)
| BSS segment |
| |
|-------------------------| [end data](https://github.com/torvalds/linux/blob/b07f636fca1c8fbba124b0082487c0b3890a0e0c/include/linux/mm_types.h#L450)
| Data segment |
| | [start data](https://github.com/torvalds/linux/blob/b07f636fca1c8fbba124b0082487c0b3890a0e0c/include/linux/mm_types.h#L450)
|-------------------------| [end code](https://github.com/torvalds/linux/blob/b07f636fca1c8fbba124b0082487c0b3890a0e0c/include/linux/mm_types.h#L450)
| Text segment | 0x08048000
| |
+-------------------------+ 0
Each process will have a virtual address space that goes from 0 to TASK_SIZE.
The rest, from TASK_SIZE to 2³² or 2⁶⁴, is reserved for the kernel and is the
same for each process. So, while the kernel space is the same for each process,
the user address space will be different.
The Memory Management Unit (MMU), which is a hardware component, manages virtual addresses by mapping them to physical addresses, and also provides protection by checking privileges.
Page table
31 22 21 12 11 0
+---------------------------+ +-------------------+
+--| Directory | Page | Offset | ----------+ | Page frame #1 | 4Kb (4096 bytes)
| +---------------------------+ | +-------------------+
| | | | Page frame #2 | 4Kb (4096 bytes)
| | | +-------------------+
| +---------------+| +-------------+ | | Page frame #3 | 4Kb (4096 bytes)
| | Page Directory|| | Page Table | | +-------------------+
| +---------------+| +-------------+ | | Page frame #4 | 4Kb (4096 bytes)
+->| Entry (PDE) |-->| +Page index |-------------> +-------------------+
| +---------------+ +-------------+
|
+---+
|cr3|
+---+
The physical address of the Page Directory is stored in control register cr3.
So a virtual address consists of three parts: a page directory index, a page
table index, and a page frame offset.
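The split described above can be sketched in C. This is a minimal illustration that assumes the classic 32-bit two-level layout from the diagram (10 bits of directory index, 10 bits of table index, 12 bits of offset); the struct and function names are made up for this example and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of a 32-bit virtual address
 * under two-level paging with 4 KB pages. */
struct vaddr_parts {
    uint32_t dir_index;   /* bits 31..22: index into the page directory */
    uint32_t table_index; /* bits 21..12: index into the page table */
    uint32_t offset;      /* bits 11..0:  byte offset inside the page frame */
};

/* Split a virtual address into its three fields. */
struct vaddr_parts split_vaddr(uint32_t vaddr)
{
    struct vaddr_parts p;
    p.dir_index   = (vaddr >> 22) & 0x3ff;
    p.table_index = (vaddr >> 12) & 0x3ff;
    p.offset      = vaddr & 0xfff;
    return p;
}

/* Reassemble the address from its fields (the inverse of split_vaddr). */
uint32_t join_vaddr(struct vaddr_parts p)
{
    assert(p.dir_index < 1024 && p.table_index < 1024 && p.offset < 4096);
    return (p.dir_index << 22) | (p.table_index << 12) | p.offset;
}
```

For the text segment address 0x08048000 from the diagram further up, this gives directory index 32, table index 72, and offset 0.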
The page tables are stored in main memory and must be initialized by the kernel before enabling the paging unit.
The unit the MMU operates on is a page. The size can vary, but let's say it is
4 KB. A page frame is the physical page.
Physical Memory
+---------------------------+
4KB | PageFrame1: page content |
+---------------------------+
4KB | PageFrame2: page content |
+---------------------------+
So the MMU will always read/store units of page size, which go into the page frame in physical memory.
Just to be clear on one thing here: when we allocate memory with mmap what we get is a reservation of virtual memory; no physical memory has been allocated for this virtual memory yet.
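One way to observe this from user space is mincore(2), which reports which pages of a mapping are resident in RAM. The sketch below is only an illustration: the function name and the page count are made up, and the exact numbers can depend on the kernel and on any sandbox the program runs in.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum { NPAGES = 16 };

/* Map NPAGES anonymous pages, optionally write to every page, and return
 * how many of them mincore() reports as resident (-1 on error). */
long resident_pages_demo(int touch)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = NPAGES * page;
    unsigned char vec[NPAGES];
    long resident = 0;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (touch)
        memset(p, 1, len);      /* the first write to each page faults it in */

    if (mincore(p, len, vec) != 0) {
        munmap(p, len);
        return -1;
    }
    for (int i = 0; i < NPAGES; i++)
        resident += vec[i] & 1; /* bit 0 set means the page is resident */

    munmap(p, len);
    assert(resident <= NPAGES);
    return resident;
}
```

Calling resident_pages_demo(0) typically reports no resident pages on Linux, while resident_pages_demo(1) reports all of them, because writing to the pages is what triggers the allocation of page frames.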
Think about when a process gets created: memory will be mapped into the process's virtual address space. Take the code segment, where each entry is addressable using a virtual address which is mapped to a physical address. A different process could have the same virtual address but it would not be mapped to the same physical address.
A process is represented by a task_struct (see details in the Processes section).
One of its fields is named mm and points to a mm_struct.
The kernel space is the same for each process, but user processes cannot read or write the data in the kernel space, nor execute code there.
The reuse of the stack region tends to keep stack memory in the cpu caches which improves performance.
Memory Management Unit (MMU)
The memory management unit is a physical component, as I understand it most often on the CPU itself.
Linked lists
Normally if a struct is to become part of a linked list we would store a next/prev pointer, for example
struct something {
int nr;
struct something* next;
struct something* prev;
};
But the way this is done in the kernel is to embed a linked list instead:
struct list_head {
struct list_head* next;
struct list_head* prev;
};
struct something {
int nr;
struct list_head list;
};
There is a list.c example that does not use any internal kernel headers but hopefully gives a "feel" for how this works.
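As a rough user-space sketch of the idea (not the kernel's actual list.h), the key trick is to get from the embedded list_head back to the struct that contains it. The container_of macro below is a simplified stand-in for the kernel macro of the same name, and list_init, sum_nr, and demo_list are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Simplified stand-in for the kernel's container_of: subtract the offset of
 * the member to get a pointer to the enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct something {
    int nr;
    struct list_head list;
};

/* An empty list is a head that points at itself. */
void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry just before head, which is the tail of the circular list. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Walk the list and sum the nr field of every containing struct. */
int sum_nr(struct list_head *head)
{
    int sum = 0;
    for (struct list_head *pos = head->next; pos != head; pos = pos->next)
        sum += container_of(pos, struct something, list)->nr;
    return sum;
}

int demo_list(void)
{
    struct list_head head;
    struct something a = { .nr = 1 }, b = { .nr = 2 }, c = { .nr = 3 };

    list_init(&head);
    list_add_tail(&a.list, &head);
    list_add_tail(&b.list, &head);
    list_add_tail(&c.list, &head);

    assert(container_of(&b.list, struct something, list) == &b);
    return sum_nr(&head);
}
```

The benefit of embedding is that the list code only ever deals with struct list_head, so the same functions work for any struct that embeds one.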
Docker images for kernel development
$ docker run --privileged -ti -v$PWD:/root -w/root centos /bin/bash
$ yum install -y gcc kernel-devel
Device Drivers
TODO: add from notes.
Networking
This section will take a closer look at how a packet moves through the system. We will start by looking at an incoming TCP/IP v4 packet.
During the boot process at some point inet_init is called. There is the following line:
fs_initcall(inet_init);
fs_initcall is a macro which looks like this:
#define fs_initcall(fn) __define_initcall(fn, 5)
#define __define_initcall(fn, id) ___define_initcall(fn, id, .initcall##id)
#define ___define_initcall(fn, id, __sec) \
static initcall_t __initcall_##fn##id __used \
__attribute__((__section__(#__sec ".init"))) = fn;
So the preprocessor would expand this into something like:
static initcall_t __initcall_inet_init5 __used __attribute__((__section__(".initcall5" ".init"))) = inet_init;
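The same pattern can be tried in user space. This is a sketch under a few assumptions: it relies on GCC/Clang attributes and on GNU ld (or a compatible linker) generating __start_ and __stop_ symbols for a section whose name is a valid C identifier. The section name my_initcalls and the macro my_initcall are invented for this example; the kernel instead walks its initcall sections using symbols defined in its own linker script.

```c
#include <assert.h>

typedef int (*initcall_t)(void);

/* Place a pointer to fn in the "my_initcalls" section. "used" stops the
 * compiler from discarding the otherwise unreferenced variable. */
#define my_initcall(fn) \
    static initcall_t initcall_##fn \
        __attribute__((used, section("my_initcalls"))) = fn

static int calls_made;

static int first_init(void)  { calls_made++; return 0; }
static int second_init(void) { calls_made++; return 0; }

my_initcall(first_init);
my_initcall(second_init);

/* Provided by the linker: the start and end of the my_initcalls section. */
extern initcall_t __start_my_initcalls[];
extern initcall_t __stop_my_initcalls[];

/* Call every registered function and return how many were called. */
int run_initcalls(void)
{
    int count = 0;
    for (initcall_t *call = __start_my_initcalls;
         call < __stop_my_initcalls; call++) {
        int rc = (*call)();
        assert(rc == 0);
        count++;
    }
    return count;
}
```

Nothing in the code names first_init or second_init when calling them; the loop finds them only because their addresses were collected in the section, which is the point of the initcall mechanism.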
During linking the GNU linker will use a linker script, which is a text file with
commands that describe how the sections in the input object files should be
mapped to the output file. There is a default linker script if you don't specify
one, and it can be viewed using ld --verbose.
$ docker run --privileged -ti -v$PWD:/root/ -w/root/ gcc /bin/bash
We are going to compile and assemble but not link:
$ gcc -c linkerscript.c
And then we will link using the following command:
$ ld -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o -lc linkerscript.o /usr/lib/x86_64-linux-gnu/crtn.o
crt1.o, crti.o, and crtn.o are object files that make up the C Run Time (CRT).
crt1.o provides the _start symbol that ld uses as the entry point, and it is also
responsible for calling main(), and later for calling exit().
crti.o (c runtime init) contains the prologue for the .init section.
crtn.o contains the epilogue for the .fini section.
inet_init does things like register protocol handlers; for example it calls
dev_add_pack(&ip_packet_type), where ip_packet_type looks like this:
static struct packet_type ip_packet_type __read_mostly = {
.type = cpu_to_be16(ETH_P_IP),
.func = ip_rcv,
.list_func = ip_list_rcv,
};
Notice that .func is being set to ip_rcv. This is the handler for all IPv4
packets.
+--------------------------------------------------------------------------+
| Network Driver |
+--------------------------------------------------------------------------+
|
↓
+-----------------------+
| ip_rcv() |
+-----------------------+
|
↓
+-----------------------+
| NF_INET_PRE_ROUTING |
| raw->ct->mangle->dnat |
+-----------------------+
|
↓
+-----------------------+ +------------------+
| ip_rcv_finish() |--->| Routing Subsystem|
+-----------------------+ +------------------+
|
↓
+------------------+
|ip_local_deliver()|
+------------------+
|
↓
+------------------+
|NF_INET_LOCAL_IN |
|mangle->filter-> |
|security->snat |
+------------------+
|
↓
+-------------------------+
|ip_local_deliver_finish()|
+-------------------------+
|
↓
+--------------------------------------------------------------------------+
| Transport Layer |
+--------------------------------------------------------------------------+
NF stands for Netfilter, which is the subsystem behind iptables, so these are
callouts/hooks for various stages in the processing of packets.
mangle is for modifying packet attributes (like ttl for example).
ct above stands for connection tracking (conntrack/CT) and is not a chain, but iptables is a stateful firewall and it tracks the state of the connection. The states can be
NEW,
ESTABLISHED,
RELATED,
INVALID,
UNTRACKED,
DNAT (a packet whose dest address was changed by rules in the nat table),
SNAT (similar to DNAT but for src address).
There are 5 hooks: PRE_ROUTING, INPUT, FORWARD, OUTPUT, and POST_ROUTING.
Rules can be added to all of these hooks, and the rules are organized using chains.
Let's take a look at the first one so that we understand how these work. ip_rcv calls NF_HOOK as the last thing it does:
return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
net, NULL, skb, dev, NULL,
ip_rcv_finish);
static inline int NF_HOOK(uint8_t pf,
unsigned int hook,
struct net *net,
struct sock *sk,
struct sk_buff *skb,
struct net_device *in,
struct net_device *out,
int (*okfn)(struct net *, struct sock *, struct sk_buff *));
An incoming (ingress) packet arrives on the network interface card (NIC):
+------------------+--------------+---------------+--------+----------------+
| Ethernet header | IP Header | TCP Header | Data | Frame check sum|
+------------------+--------------+---------------+--------+----------------+
| destination mac | length | src port |
| source mac | IP type (TCP)| dest port |
| type (IP) | checksum | checksum |
| source IP |
| dest IP |
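To make the layout above more concrete, here is a small user-space sketch that pulls a few of those fields out of a raw Ethernet frame. It assumes an untagged Ethernet II frame (14-byte header) carrying IPv4; the struct and function names are made up for this example, and this is not how the kernel itself parses packets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the IPv4 header fields shown in the diagram. */
struct ipv4_summary {
    uint8_t  version;
    uint8_t  header_len;  /* in bytes */
    uint16_t total_len;
    uint8_t  protocol;    /* 6 means TCP */
    uint32_t src;         /* host byte order */
    uint32_t dst;         /* host byte order */
};

static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the frame is too short or not IPv4. */
int parse_ipv4_frame(const uint8_t *frame, size_t len, struct ipv4_summary *out)
{
    const size_t eth_len = 14;              /* dest mac, source mac, type */
    if (len < eth_len + 20)
        return -1;
    if (read_be16(frame + 12) != 0x0800)    /* EtherType for IPv4 */
        return -1;

    const uint8_t *ip = frame + eth_len;
    out->version = ip[0] >> 4;
    out->header_len = (uint8_t)((ip[0] & 0x0f) * 4);
    if (out->version != 4 || out->header_len < 20 ||
        len < eth_len + out->header_len)
        return -1;

    out->total_len = read_be16(ip + 2);
    out->protocol  = ip[9];
    out->src       = read_be32(ip + 12);
    out->dst       = read_be32(ip + 16);
    return 0;
}

/* Build a minimal fake frame (10.0.0.1 -> 10.0.0.2, TCP) and parse it. */
int demo_parse(void)
{
    uint8_t frame[34] = {0};
    struct ipv4_summary s;

    frame[12] = 0x08; frame[13] = 0x00;     /* type: IPv4 */
    frame[14] = 0x45;                       /* version 4, header length 20 */
    frame[17] = 20;                         /* total length */
    frame[23] = 6;                          /* protocol: TCP */
    frame[26] = 10; frame[29] = 1;          /* source IP 10.0.0.1 */
    frame[30] = 10; frame[33] = 2;          /* dest IP 10.0.0.2 */

    int rc = parse_ipv4_frame(frame, sizeof frame, &s);
    assert(rc == 0 && s.src == 0x0a000001 && s.dst == 0x0a000002);
    return s.protocol;
}
```

All multi-byte fields are in network byte order (big endian) on the wire, which is why the helpers assemble the values byte by byte.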
The NIC will check if we accept the destination mac and verify the frame check sum. If these checks are successful the packet will be stored in a memory location that was allocated by the device driver for the NIC. After this the NIC will trigger an interrupt.
The device driver's top half will acknowledge the interrupt, schedule
the bottom half, and then return.
The device driver's bottom half will retrieve the packet from the buffer where
it was stored and allocate a new socket buffer (SKB), which is a struct
named sk_buff and can be found in include/linux/skbuff.h.
Note that when you see something named xmit just read it as transmit.
Raw sockets
These sockets give access to the packet as seen by the NIC, that is, a packet that has not been handled by the other network layers (L2, L3, and L4).
$ docker run --privileged -ti -v$PWD:/root/ -w/root/ gcc /bin/bash
$ gcc -o raw-socket raw-socket.c
$ ./raw-socket
Execute a new process (container) in the same namespace:
$ docker exec -ti d80c81eead6a /bin/bash
$ curl www.google.com
And you will see the information printed in the other terminal.
We can get information about the listening socket using netstat (needs to be
installed using apt-get update && apt-get install net-tools):
root@d80c81eead6a:~# netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
raw 0 0 0.0.0.0:tcp 0.0.0.0:* 7
Memory layout
$ gcc -c simple.c
$ size simple.o
text data bss dec hex filename
74 0 0 74 4a simple.o
root@c641a3216288:~# objdump -h simple.o
simple.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000012 0000000000000000 0000000000000000 00000040 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .data 00000000 0000000000000000 0000000000000000 00000052 2**0
CONTENTS, ALLOC, LOAD, DATA
2 .bss 00000000 0000000000000000 0000000000000000 00000052 2**0
ALLOC
3 .comment 00000012 0000000000000000 0000000000000000 00000052 2**0
CONTENTS, READONLY
4 .note.GNU-stack 00000000 0000000000000000 0000000000000000 00000064 2**0
CONTENTS, READONLY
5 .eh_frame 00000038 0000000000000000 0000000000000000 00000068 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
Virtual Memory Address (VMA) is the address the section will have when the output
object file is executed. It is zero now because we have not linked it into an
executable yet.
Load Memory Address (LMA) is the address into which the section will be loaded.
This is most often the same but can be different in some situations.
.eh_frame is an exception frame and contains one or more Call Frame Information
(CFI) records. This is used for stack unwinding and other things.
Now, if we link this into an executable we can compare:
$ objdump -h simple
mmap (sys/mman.h)
mmap is a call that creates a new mapping in the virtual address space of the calling process. mmap.c is an example of the usage of this function call.
void *mmap(void *addr, size_t length, int prot, int flags,
int fd, off_t offset);
addr can be NULL, in which case the kernel will choose the page-aligned address
where this mapping will be created. If not NULL it is taken as a hint as to
where to place this mapping. length specifies the length of the mapping.
prot specifies the memory protection for the mapping and is either PROT_NONE or the bitwise OR of one or more of the other values:
PROT_EXEC may be executed
PROT_READ may be read from
PROT_WRITE may be written to
PROT_NONE may not be accessed
So PROT_NONE strikes me as a little strange, as what use is a mapping if it
cannot be accessed?
These mappings can be useful to reserve a region of the address space and later
use it for smaller virtual mappings. These smaller regions could be handed out using
the flag MAP_FIXED with an address that is part of the larger region (at least
I think this is what it's for).
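A minimal sketch of that reserve-then-use idea, assuming Linux and an anonymous mapping; the function name, the region size, and the choice of the third page are arbitrary and only for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve 16 pages with PROT_NONE, then make one page inside the
 * reservation usable with MAP_FIXED. Returns the byte read back from that
 * page, or -1 on error. */
int reserve_then_use(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t region_len = 16 * page;

    /* Address space only: touching this region now would raise SIGSEGV. */
    char *region = mmap(NULL, region_len, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    /* Replace the third page of the reservation with a readable and
     * writable mapping at exactly that address. */
    char *wanted = region + 2 * page;
    char *chunk = mmap(wanted, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (chunk == MAP_FAILED) {
        munmap(region, region_len);
        return -1;
    }
    assert(chunk == wanted);    /* MAP_FIXED places it where we asked */

    chunk[0] = 42;
    int value = chunk[0];

    munmap(region, region_len);
    return value;
}
```

Because the address range is already owned by the process, MAP_FIXED cannot clobber an unrelated mapping here. An alternative that avoids MAP_FIXED is to call mprotect on the sub-range to change its protection.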
flags indicates whether updates to this mapping are visible to other processes
that have a mapping to the same region.
MAP_SHARED
Updates to the mapping are visible to other processes that map the same region.
MAP_SHARED_VALIDATE
Same as MAP_SHARED but will validate the passed-in flags and fail with an error of EOPNOTSUPP if an unknown flag is passed in.
MAP_PRIVATE
Updates to the mapping are not visible to other processes mapping the same file. Only applicable to file mappings?
MAP_ANONYMOUS
The mapping is not backed by any file and its contents are initialized to zero.
With this value the fd argument is ignored, but some implementations require
fd to be -1 so it is safest to use -1.
MAP_NORESERVE
Does not reserve swap space for this mapping. If there is no physical memory available, writing will cause a SIGSEGV.
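The difference between MAP_SHARED and MAP_PRIVATE can be observed with an anonymous mapping and fork. This is a small sketch, with a made-up function name, in which a child process writes to the mapping and the parent checks what it sees afterwards.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create an anonymous mapping with the given sharing flag (MAP_SHARED or
 * MAP_PRIVATE), let a child write 1234 into it, and return the value the
 * parent sees afterwards (-1 on error). */
int child_write_seen_by_parent(int sharing_flag)
{
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     sharing_flag | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    assert(*cell == 0);         /* anonymous mappings start zero-filled */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {             /* child: write and exit immediately */
        *cell = 1234;
        _exit(0);
    }

    waitpid(pid, NULL, 0);      /* parent: wait until the child is done */
    int seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

With MAP_SHARED the parent sees the child's write. With MAP_PRIVATE the child gets its own copy-on-write page, so the parent still sees 0, which suggests that MAP_PRIVATE is meaningful for anonymous mappings as well as for file mappings.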
Program startup
While our C programs have a main function that is considered the entry point,
the real entry point is specified by the linker, either via the -e flag or
perhaps in the linker script.
libc
$ gcc -o simple simple.c
Just to be clear about one thing, this will be a dynamically linked executable
since we did not specify the -static flag.
$ ldd simple
linux-vdso.so.1 (0x00007fff46456000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe7097dc000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe7099a9000)
vdso is a virtual library (notice that it is not associated with a file) that
is automatically mapped into the virtual address space of a process by the kernel.
This is a virtual dynamic shared object (vdso) and is a small library that the
kernel maps into the virtual address space of all user processes. The motivation
for this is that there are some system calls that are used very often, enough
to cause a performance issue with having to switch into kernel mode. An example
of a frequently called function is gettimeofday, which can be called directly from
user code and also is called from the C library. This library can be found using
the auxiliary vector, which is a mechanism to transfer some kernel-level info
to the user process. This info is passed by binary loaders. The ELF loader parses
the ELF file and maps the various segments into the process's virtual address space,
sets up the entry point, and initializes the process stack.
When we run ./simple how does the kernel know how to handle this?
In my case I'm using the bash shell, which is also just a program running on the
system. bash does some initial setup and then enters a read loop where it waits
for commands and executes them as they are entered. This will eventually call
execve(command, args, env):
int execve(const char *filename, char *const argv [], char *const envp[]);
We can find the implementation of execve in fs/exec.c. The v at the end of exec
stands for argv, and the e stands for the envp arguments.
We can use strace to see this for our example:
$ strace ./simple non-used-arg
execve("./simple", ["./simple", "non-used-arg"], 0x7ffe36115288 /* 10 vars */) = 0
So to answer the question, it is the bash shell that calls execve. For some reason
that was not clear to me before.
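What a shell does for each command can be sketched as fork followed by execve in the child. This is a simplified illustration (real shells also deal with PATH lookup, job control, redirections, and more); run_program and demo_exec are names invented for this example, and the demo assumes /bin/sh exists.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Fork, replace the child with the program at path, and return the
 * child's exit status (-1 on error or abnormal termination). */
int run_program(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: on success execve never returns, because the process
         * image (including this code) has been replaced. */
        execve(path, argv, environ);
        _exit(127);             /* only reached if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Run "sh -c 'exit 7'" and return its exit status. */
int demo_exec(void)
{
    char *const argv[] = { "sh", "-c", "exit 7", NULL };
    return run_program("/bin/sh", argv);
}
```

The fork is needed because execve replaces the calling process; without it the shell itself would be replaced by the command.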
Take a look at load_elf_binary for details on the loading. This function will
inspect the ELF program headers and look for an INTERP header, which in our case is:
readelf -l simple
Elf file type is EXEC (Executable file)
Entry point 0x401020
There are 11 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x0000000000000268 0x0000000000000268 R 0x8
INTERP 0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
When the interpreter is run it will do the table relocations, run the .init sections
of the loaded shared objects, and then transfer control to the entry point of the
executable. More details of the linker and these tables can be found below.
Keep in mind that the execve call will replace the current/calling process's virtual
address space, so once everything has been set up, the next instruction pointed to
by rip will be executed.
libc is the C library and ld-linux-x86-64 is the dynamic linker.
$ objdump -f simple
simple: file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0000000000401020
So we disassemble and see what exists at 0x0000000000401020:
$ objdump -d simple
Disassembly of section .text:
0000000000401020 <_start>:
401020: 31 ed xor %ebp,%ebp
401022: 49 89 d1 mov %rdx,%r9
401025: 5e pop %rsi
401026: 48 89 e2 mov %rsp,%rdx
401029: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
40102d: 50 push %rax
40102e: 54 push %rsp
40102f: 49 c7 c0 80 11 40 00 mov $0x401180,%r8
401036: 48 c7 c1 20 11 40 00 mov $0x401120,%rcx
40103d: 48 c7 c7 02 11 40 00 mov $0x401102,%rdi
401044: ff 15 a6 2f 00 00 callq *0x2fa6(%rip) # 403ff0 <__libc_start_main@GLIBC_2.2.5>
40104a: f4 hlt
40104b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Notice that our start address is for the _start label (and not our main function).
So where is _start defined?
It can be found in ./glibc/sysdeps/x86_64/start.S:
%rdx Contains a function pointer to be registered with `atexit'.
This is how the dynamic linker arranges to have DT_FINI
functions called for shared libraries that have been loaded
before this code runs.
%rsp The stack contains the arguments and environment:
0(%rsp) argc
LP_SIZE(%rsp) argv[0]
...
(LP_SIZE*argc)(%rsp) NULL
(LP_SIZE*(argc+1))(%rsp) envp[0]
...
NULL
ENTRY (_start)
...
call *__libc_start_main@GOTPCREL(%rip)
The ENTRY macro defines the global _start symbol. The default linker script names
_start as the entry point for the program, which is the same thing as passing the
entry point to the linker using -e _start.
The first instruction, xor %ebp, %ebp, is just clearing the %ebp register (setting
it to zero):
101
^101
----
000
Now before we look into this, just recall that on x86_64 the registers used for passing parameters are the following:
1: rdi
2: rsi
3: rdx
4: rcx
5: r8
6: r9
Also remember that objdump by default outputs assembly in AT&T syntax, so the first
operand in the instructions above is the source and the second is the destination.
So we are moving the current value in rdx into r9, which we know can be used as argument (nr 6) of a function call. This should be the shared library termination function (if there is one):
401022: 49 89 d1 mov %rdx,%r9
Next we are popping the top-most value off the stack, which is argc, and
saving it in register rsi (which is the second argument of __libc_start_main):
401025: 5e pop %rsi
Next, since we popped argc off the stack, the stack pointer now points at argv,
and this is stored in register rdx, the third argument to __libc_start_main:
401026: 48 89 e2 mov %rsp,%rdx
The next instruction is aligning the stack on a 16-byte boundary:
401029: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
40102d: 50 push %rax
Next we push the value of the stackpointer onto the stack:
40102e: 54 push %rsp
The next instruction moves the value 0x401180 into r8 (the fifth argument, fini).
This is __libc_csu_fini:
40102f: 49 c7 c0 80 11 40 00 mov $0x401180,%r8
0000000000401180 <__libc_csu_fini>:
401180: c3 retq
Next, we move the value 0x401120 into rcx, which is the fourth argument, init:
401036: 48 c7 c1 20 11 40 00 mov $0x401120,%rcx
0000000000401120 <__libc_csu_init>:
Next we are moving the value 0x401102 into rdi (the first argument, main):
40103d: 48 c7 c7 02 11 40 00 mov $0x401102,%rdi
0000000000401102 <main>:
So all of that was setting up the arguments to call __libc_start_main, which has a signature of:
int __libc_start_main(int (*main) (int, char **, char **),
int argc,
char **ubp_av,
void (*init) (void),
void (*fini) (void),
void (*rtld_fini) (void),
void *stack_end);
The actual call looks like this:
401044: ff 15 a6 2f 00 00 callq *0x2fa6(%rip) # 403ff0 <__libc_start_main@GLIBC_2.2.5>
Notice the operand 0x2fa6(%rip). The parentheses mean that %rip is used as a base
register and 0x2fa6 is a displacement added to it, so the operand refers to the
memory at %rip + 0x2fa6. Since %rip holds the address of the next instruction
(0x40104a), that address is 0x403ff0, which matches the comment printed by objdump.
This type of addressing relative to the instruction pointer was not possible on
32-bit x86 if I understand things correctly; there one would have to call a label
and pop the return address off the stack to get hold of the current instruction pointer.
The * means that this is an indirect call: the target address is read from that
memory location instead of being encoded in the instruction as a relative offset.
TODO: double check the above as I'm a little unsure about this.
__libc_start_main can be found in glibc/csu/libc-start.c
# define LIBC_START_MAIN __libc_start_main
STATIC int
LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
int argc, char **argv,
#ifdef LIBC_START_MAIN_AUXVEC_ARG
ElfW(auxv_t) *auxvec,
#endif
__typeof (main) init,
void (*fini) (void),
void (*rtld_fini) (void), void *stack_end) {
/* Result of the 'main' function. */
int result;
...
/* Store the lowest stack address. This is done in ld.so if this is
the code for the DSO. */
__libc_stack_end = stack_end;
/* Set up the stack checker's canary. */
uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
THREAD_SET_STACK_GUARD (stack_chk_guard);
# else
__stack_chk_guard = stack_chk_guard;
# endif
...
/* Register the destructor of the dynamic linker if there is any. */
if (__glibc_likely (rtld_fini != NULL))
__cxa_atexit ((void (*) (void *)) rtld_fini, NULL, NULL);
...
/* Register the destructor of the program, if any. */
if (fini)
__cxa_atexit ((void (*) (void *)) fini, NULL, NULL);
...
if (init)
(*init) (argc, argv, __environ MAIN_AUXVEC_PARAM);
...
#ifdef HAVE_CLEANUP_JMP_BUF
/* Memory for the cancellation buffer. */
struct pthread_unwind_buf unwind_buf;
int not_first_call;
not_first_call = setjmp ((struct __jmp_buf_tag *) unwind_buf.cancel_jmp_buf);
if (__glibc_likely (! not_first_call))
{
struct pthread *self = THREAD_SELF;
/* Store old info. */
unwind_buf.priv.data.prev = THREAD_GETMEM (self, cleanup_jmp_buf);
unwind_buf.priv.data.cleanup = THREAD_GETMEM (self, cleanup);
/* Store the new cleanup handler info. */
THREAD_SETMEM (self, cleanup_jmp_buf, &unwind_buf);
/* Run the program. */
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
}
else
{
/* Remove the thread-local data. */
# ifdef SHARED
PTHFCT_CALL (ptr__nptl_deallocate_tsd, ());
# else
extern void __nptl_deallocate_tsd (void) __attribute ((weak));
__nptl_deallocate_tsd ();
# endif
/* One less thread. Decrement the counter. If it is zero we
terminate the entire process. */
result = 0;
# ifdef SHARED
unsigned int *ptr = __libc_pthread_functions.ptr_nthreads;
# ifdef PTR_DEMANGLE
PTR_DEMANGLE (ptr);
# endif
# else
extern unsigned int __nptl_nthreads __attribute ((weak));
unsigned int *const ptr = &__nptl_nthreads;
# endif
if (! atomic_decrement_and_test (ptr))
/* Not much left to do but to exit the thread, not the process. */
__exit_thread ();
}
#else // HAVE_CLEANUP_JMP_BUF
/* Nothing fancy, just call the function. */
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
#endif
exit (result);
There are a number of things that are of interest here. We can see that the
dynamic library destructor function and fini are registered using __cxa_atexit.
And we can see that init is called directly, which makes sense.
Also notice that setjmp is used to set up a point that longjmp can return to,
which is a way to unwind the stack to this point and allow for clean up to take
place. The first time setjmp is called it will return 0 and enter the first if
code block and run the main function. And if longjmp was called the else clause
will be taken, the clean up performed, and __exit_thread() called. This example
might help to clarify setjmp/longjmp: longjmp.c.
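The control flow described above can be reduced to a few lines. This is a minimal sketch of the setjmp/longjmp pattern only, with invented function names, and it does not reproduce what glibc does with its cancellation buffer.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf cleanup_point;

/* Recurse a few frames deep, then jump straight back to the setjmp call
 * site without returning through the intermediate frames. */
static void worker(int depth)
{
    if (depth == 0)
        longjmp(cleanup_point, 1);
    worker(depth - 1);
}

/* Returns 1 if control came back via longjmp, 0 otherwise. */
int run_with_cleanup(void)
{
    if (setjmp(cleanup_point) == 0) {
        /* First pass: setjmp returned 0, so run the "program". */
        worker(3);
        assert(!"not reached: worker always calls longjmp");
        return 0;
    }

    /* Second pass: longjmp made setjmp return a non-zero value, so this
     * is where the clean up would happen. */
    return 1;
}
```

The same call to setjmp returns twice: once directly with 0, and once more with the value passed to longjmp.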
$ readelf -W --sections simple
There are 27 section headers, starting at offset 0x3850:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .interp PROGBITS 00000000004002a8 0002a8 00001c 00 A 0 0 1
[ 2] .note.ABI-tag NOTE 00000000004002c4 0002c4 000020 00 A 0 0 4
[ 3] .hash HASH 00000000004002e8 0002e8 000018 04 A 5 0 8
[ 4] .gnu.hash GNU_HASH 0000000000400300 000300 00001c 00 A 5 0 8
[ 5] .dynsym DYNSYM 0000000000400320 000320 000048 18 A 6 1 8
[ 6] .dynstr STRTAB 0000000000400368 000368 000038 00 A 0 0 1
[ 7] .gnu.version VERSYM 00000000004003a0 0003a0 000006 02 A 5 0 2
[ 8] .gnu.version_r VERNEED 00000000004003a8 0003a8 000020 00 A 6 1 8
[ 9] .rela.dyn RELA 00000000004003c8 0003c8 000030 18 A 5 0 8
[10] .init PROGBITS 0000000000401000 001000 000017 00 AX 0 0 4
[11] .text PROGBITS 0000000000401020 001020 000161 00 AX 0 0 16
[12] .fini PROGBITS 0000000000401184 001184 000009 00 AX 0 0 4
[13] .rodata PROGBITS 0000000000402000 002000 000004 04 AM 0 0 4
[14] .eh_frame_hdr PROGBITS 0000000000402004 002004 000034 00 A 0 0 4
[15] .eh_frame PROGBITS 0000000000402038 002038 0000d8 00 A 0 0 8
[16] .init_array INIT_ARRAY 0000000000403e40 002e40 000008 08 WA 0 0 8
[17] .fini_array FINI_ARRAY 0000000000403e48 002e48 000008 08 WA 0 0 8
[18] .dynamic DYNAMIC 0000000000403e50 002e50 0001a0 10 WA 6 0 8
[19] .got PROGBITS 0000000000403ff0 002ff0 000010 08 WA 0 0 8
[20] .got.plt PROGBITS 0000000000404000 003000 000018 08 WA 0 0 8
0x2fa6 + %rip is 403ff0 and we can find this in the .got section of the headers.
$ readelf -W -r simple
Relocation section '.rela.dyn' at offset 0x3c8 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000403ff0 000100000006 R_X86_64_GLOB_DAT 0000000000000000 __libc_start_main@GLIBC_2.2.5 + 0
R_X86_64_GLOB_DAT tells the dynamic linker to find the value of symbol __libc_start_main@GLIBC_2.2.5 and put that value into address 000000403ff0, which is the address that will be used in the callq operation.
objdump -R simple
simple: file format elf64-x86-64
DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
0000000000403ff0 R_X86_64_GLOB_DAT __libc_start_main@GLIBC_2.2.5
0000000000403ff8 R_X86_64_GLOB_DAT __gmon_start__
When an ELF executable is run the kernel will read the ELF image into the user's
virtual address space. The kernel will look for a section called .interp:
$ readelf -l simple
Program Headers:
Type Offset Virtual Address Physical Address File Size Mem Size Flags Align
INTERP 0x00000000000002a8 0x00000000004002a8 0x00000000004002a8 0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
...
TODO: take a closer look at https://github.com/torvalds/linux/blob/master/fs/binfmt_elf.c and see how this works.
You can actually run this program directly.
$ /lib64/ld-linux-x86-64.so.2 --list /lib/x86_64-linux-gnu/libc.so.6
/lib64/ld-linux-x86-64.so.2 (0x00007f2a4e02d000)
linux-vdso.so.1 (0x00007fff57d4d000)
The kernel invokes this somehow and it will load the shared libraries passed to it if needed (that is, if they were not already available in memory). The linker will then perform the relocations for the executable we want to run. There is a linux_binfmt struct which contains a function to load libraries:
static struct linux_binfmt elf_format = {
.module = THIS_MODULE,
.load_binary = load_elf_binary,
.load_shlib = load_elf_library,
.core_dump = elf_core_dump,
.min_coredump = ELF_EXEC_PAGESIZE,
};
Relocations happen for data and for functions and there is a level of indirection here. One reason for the indirection (perhaps there are others as well) is that we don't want to make the code segment writable; if it were writable it could not be shared by other executables, meaning that each of them would have to include its own copy of the code segment in its virtual address space. Instead, the code refers to entries in the data section (which is writable). These mappings are called tables and there is one for functions named Procedure Linkage Table (PLT) and one for variables/data named Global Offset Table (GOT).
After the linker has completed (the tables have been updated) it allows any
loaded shared object to optionally run some initialization code. This code is what
the .init section is for. Likewise, when the library is unloaded termination
code can be run, and this is found in the .fini section.
After the .init section has been run the linker gives control back to the
image being loaded.
Notice that
$ objdump -T -d simple
simple: file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.2.5 __libc_start_main
0000000000000000 w D *UND* 0000000000000000 __gmon_start__
So, I'm still trying to figure out how the following line works:
401044: ff 15 a6 2f 00 00 callq *0x2fa6(%rip) # 403ff0 <__libc_start_main@GLIBC_2.2.5>
We are calling a function in the libc library, which is a dynamically linked library, so this would have to be resolved by the linker. But as mentioned the code section is not writable, so we use a table that will be "patched" at load/runtime by the linker. And this is a function call, so I would expect this to involve the Procedure Linkage Table, although the address 403ff0 used here is in the .got section.
Disassembly of section .init:
0000000000401000 <_init>:
401000: 48 83 ec 08 sub $0x8,%rsp
401004: 48 8b 05 ed 2f 00 00 mov 0x2fed(%rip),%rax # 403ff8 <__gmon_start__>
40100b: 48 85 c0 test %rax,%rax
40100e: 74 02 je 401012 <_init+0x12>
401010: ff d0 callq *%rax
401012: 48 83 c4 08 add $0x8,%rsp
401016: c3 retq
If I'm reading this correctly we are loading the value stored at 0x2fed(%rip) (the entry at 403ff8) into rax. This should be the address of a function named __gmon_start__ if enabled/specified/exists. We then test if rax is zero (test is used instead of cmp because it is shorter, I think), and if zero we jump to 401012 <_init+0x12>, otherwise we call __gmon_start__.
Disassembly of section .text:
0000000000401020 <_start>:
401020: 31 ed xor %ebp,%ebp
401022: 49 89 d1 mov %rdx,%r9
bytes
2⁰ = 1
2¹ = 2
2² = 4
2³ = 8
2⁴ = 16
2⁵ = 32
2⁶ = 64
2⁷ = 128
2⁸ = 256
2⁹ = 512
2¹⁰ = 1024
Compiling the kernel
You can start by copying an existing configuration:
$ cp -v /boot/config-$(uname -r) .config
'/boot/config-4.18.0-80.1.2.el8_0.x86_64' -> '.config'
You might need to install the following:
$ sudo yum group install "Development Tools"
$ yum install ncurses-devel bison flex elfutils-libelf-devel openssl-devel
Make configuration changes:
$ make menuconfig
Building
$ make -j8
Readelf
To print a section in hex:
$ readelf -x ".gcc_except_table" objectfile
You can use the `-W`/`--wide` option to show output that does not wrap.
Call Frame Information (cfi)
This is a GNU AS extension to manage call frames.
$ gcc -S -o simple.s -g simple.c
Since we are using gcc, the GNU assembler will be used, so the output is in its format.
.file "simple.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 8.2.1 20180905 (Red Hat 8.2.1-3)"
.section .note.GNU-stack,"",@progbits
So we can see the
A local label is any symbol beginning with a certain local label prefix.
For ELF systems the prefix is `.L`. We can see above that we have local labels
named `.LFB0` and `.LFE0`.
`.cfi_startproc` is used at the beginning of each function that should have
an entry in `.eh_frame`.
pushq %rbp
.cfi_def_cfa_offset 16
The call frame is identified by an address on the stack. We refer to this address as the Canonical Frame Address or CFA. Note that we pushed
Memory
When we call malloc, brk, sbrk, or mmap we are only reserving virtual memory and not physical RAM. The physical RAM will be used when a read/write occurs using a virtual address. This virtual address is passed to the MMU, and a page fault will occur as there is no mapping from the virtual address to a physical address. This case is handled by the kernel's page fault handler, which allocates a physical page and maps it.
$ lldb -- ./mmap
(lldb) br s -n main
(lldb) r
(lldb) platform shell ps -o pid,user,vsz,rss,comm,args 213047
PID USER VSZ RSS COMMAND COMMAND
213047 danielb+ 2196 776 mmap /home/danielbevenius/work/linux/learning-linux-kernel/mmap
(lldb) platform shell pmap 213047
213047: /home/danielbevenius/work/linux/learning-linux-kernel/mmap
0000000000400000 4K r---- mmap
0000000000401000 4K r-x-- mmap
0000000000402000 4K r---- mmap
0000000000403000 4K r---- mmap
0000000000404000 4K rw--- mmap
00007ffff7dde000 148K r---- libc-2.30.so
00007ffff7e03000 1340K r-x-- libc-2.30.so
00007ffff7f52000 296K r---- libc-2.30.so
00007ffff7f9c000 4K ----- libc-2.30.so
00007ffff7f9d000 12K r---- libc-2.30.so
00007ffff7fa0000 12K rw--- libc-2.30.so
00007ffff7fa3000 24K rw--- [ anon ]
00007ffff7fcb000 16K r---- [ anon ]
00007ffff7fcf000 8K r-x-- [ anon ]
00007ffff7fd1000 8K r---- ld-2.30.so
00007ffff7fd3000 128K r-x-- ld-2.30.so
00007ffff7ff3000 32K r---- ld-2.30.so
00007ffff7ffc000 4K r---- ld-2.30.so
00007ffff7ffd000 4K rw--- ld-2.30.so
00007ffff7ffe000 4K rw--- [ anon ]
00007ffffffdd000 136K rw--- [ stack ]
ffffffffff600000 4K r-x-- [ anon ]
total 2200K
And after calling mmap:
(lldb) platform shell pmap 213115
213115: /home/danielbevenius/work/linux/learning-linux-kernel/mmap
0000000000400000 4K r---- mmap
0000000000401000 4K r-x-- mmap
0000000000402000 4K r---- mmap
0000000000403000 4K r---- mmap
0000000000404000 4K rw--- mmap
0000000000405000 132K rw--- [ anon ]
00007ffff7dde000 148K r---- libc-2.30.so
00007ffff7e03000 1340K r-x-- libc-2.30.so
00007ffff7f52000 296K r---- libc-2.30.so
00007ffff7f9c000 4K ----- libc-2.30.so
00007ffff7f9d000 12K r---- libc-2.30.so
00007ffff7fa0000 12K rw--- libc-2.30.so
00007ffff7fa3000 24K rw--- [ anon ]
00007ffff7fcb000 16K r---- [ anon ]
00007ffff7fcf000 8K r-x-- [ anon ]
00007ffff7fd1000 8K r---- ld-2.30.so
00007ffff7fd3000 128K r-x-- ld-2.30.so
00007ffff7ff3000 32K r---- ld-2.30.so
00007ffff7ffb000 4K rw--- [ anon ]
00007ffff7ffc000 4K r---- ld-2.30.so
00007ffff7ffd000 4K rw--- ld-2.30.so
00007ffff7ffe000 4K rw--- [ anon ]
00007ffffffdd000 136K rw--- [ stack ]
ffffffffff600000 4K r-x-- [ anon ]
total 2336K
And notice that the resident set size (physical RAM) has not grown:
(lldb) platform shell ps -o pid,user,vsz,rss,comm,args 129715
PID USER VSZ RSS COMMAND COMMAND
129715 danielb+ 2332 764 mmap /home/danielbevenius/work/linux/learning-linux-kernel/mmap
But after we write to this memory map the resident size will have grown:
(lldb) platform shell ps -o pid,user,vsz,rss,comm,args 213399
PID USER VSZ RSS COMMAND COMMAND
213399 danielb+ 2856 568 mmap /home/danielbevenius/work/linux/learning-linux-kernel/mmap
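The same experiment can be done from inside the program instead of using ps and pmap. This is a rough sketch, assuming Linux with /proc mounted; the function names are my own. It reads the resident page count from `/proc/self/statm` right after `mmap` and again after writing to every page.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Resident set size of this process in pages, taken from the second field
 * of /proc/self/statm. Returns -1 on error. Linux specific. */
long resident_pages(void)
{
    long size = 0, resident = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);
    return resident;
}

/* Map len bytes of anonymous memory, then write to every page.
 * Stores the RSS (in pages) seen after mmap and after the writes.
 * Returns 0 on success, -1 if mmap failed. */
int mmap_then_touch(size_t len, long *rss_after_map, long *rss_after_touch)
{
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *rss_after_map = resident_pages();   /* only virtual memory so far */
    memset(p, 1, len);                   /* first write to a page faults it in */
    *rss_after_touch = resident_pages(); /* pages are now backed by RAM */
    munmap(p, len);
    return 0;
}
```

Calling `mmap_then_touch(16 * 1024 * 1024, &a, &b)` should leave `b` clearly larger than `a`, while the virtual size grew already at the `mmap` call.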
Let's start by taking a look at a C program that is compiled and linked before looking at a C++ example:
$ gcc -g -o simplec simple.c --verbose
/usr/libexec/gcc/x86_64-redhat-linux/9/collect2
-plugin /usr/libexec/gcc/x86_64-redhat-linux/9/liblto_plugin.so
-plugin-opt=/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
-plugin-opt=-fresolution=/tmp/cc2Kmxls.res
-plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s
-plugin-opt=-pass-through=-lc
-plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s
--build-id
--no-add-needed
--eh-frame-hdr
--hash-style=gnu
-m elf_x86_64
-dynamic-linker /lib64/ld-linux-x86-64.so.2
-o simplec
/usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
/usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crti.o
/usr/lib/gcc/x86_64-redhat-linux/9/crtbegin.o
-L/usr/lib/gcc/x86_64-redhat-linux/9
-L/usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64
-L/lib/../lib64
-L/usr/lib/../lib64
-L/usr/lib/gcc/x86_64-redhat-linux/9/../../..
/tmp/cc99b3Mr.o
-lgcc
--push-state
--as-needed
-lgcc_s
--pop-state
-lc
-lgcc
--push-state
--as-needed
-lgcc_s
--pop-state
/usr/lib/gcc/x86_64-redhat-linux/9/crtend.o
/usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crtn.o
Let's take a closer look at `crt1.o`.
First, what symbols are defined in this file:
$ nm --defined-only /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
0000000000000035 t .annobin__dl_relocate_static_pie.end
0000000000000030 t .annobin__dl_relocate_static_pie.start
000000000000002f t .annobin_init.c
000000000000002f t .annobin_init.c_end
0000000000000000 t .annobin_init.c_end.exit
0000000000000000 t .annobin_init.c_end.hot
0000000000000000 t .annobin_init.c_end.startup
0000000000000000 t .annobin_init.c_end.unlikely
0000000000000000 t .annobin_init.c.exit
0000000000000000 t .annobin_init.c.hot
0000000000000000 t .annobin_init.c.startup
0000000000000000 t .annobin_init.c.unlikely
0000000000000030 t .annobin_static_reloc.c
0000000000000035 t .annobin_static_reloc.c_end
0000000000000000 t .annobin_static_reloc.c_end.exit
0000000000000000 t .annobin_static_reloc.c_end.hot
0000000000000000 t .annobin_static_reloc.c_end.startup
0000000000000000 t .annobin_static_reloc.c_end.unlikely
0000000000000000 t .annobin_static_reloc.c.exit
0000000000000000 t .annobin_static_reloc.c.hot
0000000000000000 t .annobin_static_reloc.c.startup
0000000000000000 t .annobin_static_reloc.c.unlikely
0000000000000000 D __data_start
0000000000000000 W data_start
0000000000000030 T _dl_relocate_static_pie
0000000000000000 R _IO_stdin_used
0000000000000000 T _start
0000000000000000 n .text.exit.group
0000000000000000 n .text.exit.group
0000000000000000 n .text.hot.group
0000000000000000 n .text.hot.group
0000000000000000 n .text.startup.group
0000000000000000 n .text.startup.group
0000000000000000 n .text.unlikely.group
0000000000000000 n .text.unlikely.group
All the symbols of type `t`/`T` are in the text section.
`__data_start` is in the initialized data section.
`data_start` is a weak symbol.
`_IO_stdin_used` is in the read-only data section.
The ones of type `n` are debugging symbols.
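To connect these letters to source code, here is a small made-up C file (all names are hypothetical). The letters in the comments are what I would expect GNU nm to print for an unoptimized x86-64 ELF object built with `gcc -c`; treat them as a guide rather than a guarantee.

```c
/* symbols.c: build with `gcc -c symbols.c`, then run `nm symbols.o`. */

int initialized_counter = 42;        /* D: global in the initialized data section */
const int read_only_limit = 7;       /* R: global in the read-only data section */

/* W: weak function, a strong definition elsewhere would take precedence */
__attribute__((weak)) int weak_default(void)
{
    return 1;
}

/* t: local (static) symbol in the text section */
static int local_helper(int v)
{
    return v + read_only_limit;
}

/* T: global symbol in the text section */
int exported_sum(void)
{
    return local_helper(initialized_counter) + weak_default();
}
```

Lower case letters are local symbols and upper case letters are global ones, which is why `local_helper` shows up as `t` and `exported_sum` as `T`.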
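So what symbols are referenced in crt1.o but not defined in it (that is, they would be declared with the C `extern` keyword)? Using `--extern-only` lists the external (global) symbols, and the undefined ones are marked `U`: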
$ nm --extern-only /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
0000000000000000 D __data_start
0000000000000000 W data_start
0000000000000030 T _dl_relocate_static_pie
U _GLOBAL_OFFSET_TABLE_
0000000000000000 R _IO_stdin_used
U __libc_csu_fini
U __libc_csu_init
U __libc_start_main
U main
0000000000000000 T _start
Notice that several of these are undefined (`U`), and especially note that
`__libc_csu_fini`, `__libc_csu_init`, `__libc_start_main`, and `main` are here.
So, we can see that `_start` is defined in crt1.o and if we dump the content
we find:
$ objdump -drwC /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
/usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_start>:
0: f3 0f 1e fa endbr64
4: 31 ed xor %ebp,%ebp
6: 49 89 d1 mov %rdx,%r9
9: 5e pop %rsi
a: 48 89 e2 mov %rsp,%rdx
d: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
11: 50 push %rax
12: 54 push %rsp
13: 4c 8b 05 00 00 00 00 mov 0x0(%rip),%r8 # 1a <_start+0x1a> 16: R_X86_64_REX_GOTPCRELX __libc_csu_fini-0x4
1a: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # 21 <_start+0x21> 1d: R_X86_64_REX_GOTPCRELX __libc_csu_init-0x4
21: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 28 <_start+0x28> 24: R_X86_64_REX_GOTPCRELX main-0x4
28: ff 15 00 00 00 00 callq *0x0(%rip) # 2e <_start+0x2e> 2a: R_X86_64_GOTPCRELX __libc_start_main-0x4
2e: f4 hlt
000000000000002f <.annobin_init.c>:
2f: 90 nop
0000000000000030 <_dl_relocate_static_pie>:
30: f3 0f 1e fa endbr64
34: c3 retq
Notice that there are a number of values that need to be relocated by the link editor when it combines this object file into the executable. If we take a look at the entries in the relocation table we find:
$ readelf -r /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
Relocation section '.rela.text' at offset 0x1df8 contains 4 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000016 00590000002a R_X86_64_REX_GOTP 0000000000000000 __libc_csu_fini - 4
00000000001d 005c0000002a R_X86_64_REX_GOTP 0000000000000000 __libc_csu_init - 4
000000000024 005d0000002a R_X86_64_REX_GOTP 0000000000000000 main - 4
00000000002a 006100000029 R_X86_64_GOTPCREL 0000000000000000 __libc_start_main - 4
Offset 0x16 falls inside the instruction at 0x13:
13: 4c 8b 05 00 00 00 00 mov 0x0(%rip),%r8 # 1a <_start+0x1a>
000000000016 00590000002a R_X86_64_REX_GOTP 0000000000000000 __libc_csu_fini - 4
So this is an instruction for the link editor to replace the four bytes at
offset 0x16 with the value that it gets by applying an R_X86_64_REX_GOTPCRELX
relocation. The syntax `0x0(%rip)` looks a little strange, but what it is saying
is: use the address in the instruction pointer register plus a displacement.
The displacement is still zero here because it has not been filled in yet; it
is stored in the four bytes that follow the opcode bytes:
13: 4c 8b 05 00 00 00 00
↑
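`4c 8b 05` are the REX prefix, the mov opcode, and the ModRM byte that selects the destination register and RIP-relative addressing.
So when the code has been linked, this becomes a plain move of the resolved address: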
Disassembly of section .text:
0000000000401020 <_start>:
401020: f3 0f 1e fa endbr64
401024: 31 ed xor %ebp,%ebp
401026: 49 89 d1 mov %rdx,%r9
401029: 5e pop %rsi
40102a: 48 89 e2 mov %rsp,%rdx
40102d: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
401031: 50 push %rax
401032: 54 push %rsp
401033: 49 c7 c0 90 11 40 00 mov $0x401190,%r8
40103a: 48 c7 c1 20 11 40 00 mov $0x401120,%rcx
401041: 48 c7 c7 06 11 40 00 mov $0x401106,%rdi
401048: ff 15 a2 2f 00 00 callq *0x2fa2(%rip) # 403ff0 <__libc_start_main@GLIBC_2.2.5>
40104e: f4 hlt
Note that `0x401190` in little endian is `901140`, which can be written
as `90 11 40` and matches `49 c7 c0 90 11 40 00`.
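The byte order can be checked with a few lines of C. This is only an illustration with names of my own choosing: it stores a 32-bit value least significant byte first, which is how the immediate appears in the instruction bytes.

```c
#include <stdint.h>
#include <stdio.h>

/* Store a 32-bit value as four bytes, least significant byte first. */
void encode_le32(uint32_t value, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(value >> (8 * i));
}

/* Print the four immediate bytes in the order objdump lists them. */
void print_imm_bytes(uint32_t value)
{
    uint8_t b[4];
    encode_le32(value, b);
    printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
}
```

For `0x401190` the bytes come out as `90 11 40 00`, which are the last four bytes of `49 c7 c0 90 11 40 00`.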
And the same goes for `__libc_csu_init` and `main`. But note that that
Consider the Info column, `00590000002a`:
$ readelf -r /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
Relocation section '.rela.text' at offset 0x1df8 contains 4 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000016 00590000002a R_X86_64_REX_GOTP 0000000000000000 __libc_csu_fini - 4
...
Now if we take the `Info` value and split it in two, the upper 32 bits are an
index into the symbol table and the lower 32 bits are the type of relocation.
So if we take `0059`, which is `89` in decimal, and look up that value in the
symbol table:
```console
$ readelf -s /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
Symbol table '.symtab' contains 99 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
...
89: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND __libc_csu_fini
```
So it makes sense that this is `__libc_csu_fini`.
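The split of the Info value is just a shift and a mask. The sketch below uses the same arithmetic as the `ELF64_R_SYM` and `ELF64_R_TYPE` macros in `<elf.h>`, but under names of my own so that it has no dependencies.

```c
#include <stdint.h>

/* Same arithmetic as ELF64_R_SYM / ELF64_R_TYPE from <elf.h>,
 * redefined under illustrative names. */
#define R_INFO_SYM(info)  ((uint32_t)((info) >> 32))
#define R_INFO_TYPE(info) ((uint32_t)((info) & 0xffffffffu))

/* Upper 32 bits: index into the symbol table. */
uint32_t reloc_symbol_index(uint64_t info)
{
    return R_INFO_SYM(info);
}

/* Lower 32 bits: relocation type. */
uint32_t reloc_type(uint64_t info)
{
    return R_INFO_TYPE(info);
}
```

For `0x00590000002a` this gives symbol index 89 and type 42 (0x2a), and on x86-64 type 42 is `R_X86_64_REX_GOTPCRELX`.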
crti.o
This is the second object file that is specified in the command earlier. The source for this can be found in ~/work/gcc/glibc/sysdeps/x86_64/crti.S:
#ifndef PREINIT_FUNCTION
# define PREINIT_FUNCTION __gmon_start__
#endif
#ifndef PREINIT_FUNCTION_WEAK
# define PREINIT_FUNCTION_WEAK 1
#endif
#if PREINIT_FUNCTION_WEAK
weak_extern (PREINIT_FUNCTION)
#else
.hidden PREINIT_FUNCTION
#endif
.section .init,"ax",@progbits
.p2align 2
.globl _init
.hidden _init
.type _init, @function
_init:
_CET_ENDBR
/* Maintain 16-byte stack alignment for called functions. */
subq $8, %rsp
#if PREINIT_FUNCTION_WEAK
movq PREINIT_FUNCTION@GOTPCREL(%rip), %rax
testq %rax, %rax
je .Lno_weak_fn
call *%rax
.Lno_weak_fn:
#else
call PREINIT_FUNCTION
#endif
.section .fini,"ax",@progbits
.p2align 2
.globl _fini
.hidden _fini
.type _fini, @function
_fini:
_CET_ENDBR
subq $8, %rsp
Let's start by looking at the symbols:
$ nm --defined-only /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crti.o
0000000000000000 T _fini
0000000000000000 T _init
So we can see that `_fini` and `_init` are defined.
$ nm --extern-only /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crti.o
0000000000000000 T _fini
U _GLOBAL_OFFSET_TABLE_
w __gmon_start__
0000000000000000 T _init
And we can take a look at the objdump:
$ objdump -d /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crti.o
/usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crti.o: file format elf64-x86-64
Disassembly of section .init:
0000000000000000 <_init>:
0: f3 0f 1e fa endbr64
4: 48 83 ec 08 sub $0x8,%rsp
8: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # f <_init+0xf>
f: 48 85 c0 test %rax,%rax
12: 74 02 je 16 <_init+0x16>
14: ff d0 callq *%rax
Disassembly of section .fini:
0000000000000000 <_fini>:
0: f3 0f 1e fa endbr64
4: 48 83 ec 08 sub $0x8,%rsp
register_tm_clones
This is about Transactional Memory (TM), and it is called from `__libc_csu_init`.
$ sudo dnf install libitm
I also had to create a symbolic link to get the example working:
$ sudo ln -s /lib64/libitm.so.1.0.0 /lib64/libitm.so
Next we compile the tm.c example using:
$ gcc --verbose -L/usr/lib64 -o tm -fgnu-tm tm.c -Wl,-verbose
If we inspect the objdump of `tm` we find:
$ objdump -d tm
tm: file format elf64-x86-64
Disassembly of section .plt:
0000000000401030 <_ITM_deregisterTMCloneTable@plt>:
401030: ff 25 e2 2f 00 00 jmpq *0x2fe2(%rip) # 404018 <_ITM_deregisterTMCloneTable@LIBITM_1.0>
401036: 68 00 00 00 00 pushq $0x0
40103b: e9 e0 ff ff ff jmpq 401020 <.plt>
0000000000401040 <_ITM_registerTMCloneTable@plt>:
401040: ff 25 da 2f 00 00 jmpq *0x2fda(%rip) # 404020 <_ITM_registerTMCloneTable@LIBITM_1.0>
401046: 68 01 00 00 00 pushq $0x1
40104b: e9 d0 ff ff ff jmpq 401020 <.plt>
Disassembly of section .text:
00000000004010d0 <register_tm_clones>:
4010d0: be 40 40 40 00 mov $0x404040,%esi
4010d5: 48 81 ee 30 40 40 00 sub $0x404030,%rsi
4010dc: 48 89 f0 mov %rsi,%rax
4010df: 48 c1 ee 3f shr $0x3f,%rsi
4010e3: 48 c1 f8 03 sar $0x3,%rax
4010e7: 48 01 c6 add %rax,%rsi
4010ea: 48 d1 fe sar %rsi
4010ed: 74 11 je 401100 <register_tm_clones+0x30>
4010ef: b8 40 10 40 00 mov $0x401040,%eax
4010f4: 48 85 c0 test %rax,%rax
4010f7: 74 07 je 401100 <register_tm_clones+0x30>
4010f9: bf 30 40 40 00 mov $0x404030,%edi
4010fe: ff e0 jmpq *%rax
401100: c3 retq
401101: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
401108: 00 00 00 00
40110c: 0f 1f 40 00 nopl 0x0(%rax)
frame_dummy
This function can be found in `/work/gcc/gcc/libgcc/crtstuff.c` and looks like
this:
static void __attribute__((used)) frame_dummy (void)
{
#ifdef USE_EH_FRAME_REGISTRY
static struct object object;
#ifdef CRT_GET_RFIB_DATA
void *tbase, *dbase;
tbase = 0;
CRT_GET_RFIB_DATA (dbase);
if (__register_frame_info_bases)
__register_frame_info_bases (__EH_FRAME_BEGIN__, &object, tbase, dbase);
#else
if (__register_frame_info)
__register_frame_info (__EH_FRAME_BEGIN__, &object);
#endif /* CRT_GET_RFIB_DATA */
#endif /* USE_EH_FRAME_REGISTRY */
#if USE_TM_CLONE_REGISTRY
register_tm_clones ();
#endif /* USE_TM_CLONE_REGISTRY */
}
The `used` attribute tells the compiler to emit the function even when it might
otherwise discard it, for example if it is not called anywhere.
Let's set a breakpoint and see what is happening in this function.
$ lldb -- ./simplec
(lldb) br s -n frame_dummy
(lldb) r
(lldb) disassemble -n frame_dummy
simplec`frame_dummy:
-> 0x401100 <+0>: endbr64
0x401104 <+4>: jmp 0x401090 ; register_tm_clones
So we can see that `USE_EH_FRAME_REGISTRY` was not set and the only
thing that is happening is that `register_tm_clones` is getting called.
eh_frame
The `.eh_frame` section is used by languages that support exceptions, like C++, and describes how to set registers to restore the previous call frame at runtime.
$ g++ -g -o eh_frame eh_frame.cc
$ ./eh_frame
$ echo $?
2
0000000000401176 <main>:
401176: 55 push %rbp
401177: 48 89 e5 mov %rsp,%rbp
40117a: 53 push %rbx
40117b: 48 83 ec 28 sub $0x28,%rsp
40117f: 89 7d dc mov %edi,-0x24(%rbp)
401182: 48 89 75 d0 mov %rsi,-0x30(%rbp)
401186: bf 04 00 00 00 mov $0x4,%edi
40118b: e8 b0 fe ff ff callq 401040 <__cxa_allocate_exception@plt>
401190: c7 00 02 00 00 00 movl $0x2,(%rax)
401196: ba 00 00 00 00 mov $0x0,%edx
40119b: be e0 3d 40 00 mov $0x403de0,%esi
4011a0: 48 89 c7 mov %rax,%rdi
4011a3: e8 c8 fe ff ff callq 401070 <__cxa_throw@plt>
4011a8: 48 83 fa 01 cmp $0x1,%rdx
4011ac: 74 08 je 4011b6 <main+0x40>
4011ae: 48 89 c7 mov %rax,%rdi
4011b1: e8 ca fe ff ff callq 401080 <_Unwind_Resume@plt>
4011b6: 48 89 c7 mov %rax,%rdi
4011b9: e8 72 fe ff ff callq 401030 <__cxa_begin_catch@plt>
4011be: 8b 00 mov (%rax),%eax
4011c0: 89 45 ec mov %eax,-0x14(%rbp)
4011c3: 8b 5d ec mov -0x14(%rbp),%ebx
4011c6: e8 85 fe ff ff callq 401050 <__cxa_end_catch@plt>
4011cb: 89 d8 mov %ebx,%eax
4011cd: 48 83 c4 28 add $0x28,%rsp
4011d1: 5b pop %rbx
4011d2: 5d pop %rbp
4011d3: c3 retq
4011d4: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4011db: 00 00 00
4011de: 66 90 xchg %ax,%ax
When we use `throw`, first an exception is allocated to be thrown:
40118b: e8 b0 fe ff ff callq 401040 <__cxa_allocate_exception@plt>
This is followed by `__cxa_throw@plt`, which starts the exception handling:
4011a3: e8 c8 fe ff ff callq 401070 <__cxa_throw@plt>
The eh_frame contains Call Frame Information (CFI) and this is the information required to be generated by the compiler to enable stack unwinding.
Disabling -fno-unwind-tables
REL vs RELA
There are two different structures for relocations, one with two members, and
one with an extra `addend` member:
typedef struct {
Elf32_Addr r_offset;
Elf32_Word r_info;
} Elf32_Rel;
typedef struct {
Elf32_Addr r_offset;
Elf32_Word r_info;
Elf32_Sword r_addend;
} Elf32_Rela;
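To see what the addend is for, the sketch below redoes the calculation for the `__libc_start_main - 4` entry from crt1.o shown earlier. The struct is an illustrative stand-in for the 64-bit `Elf64_Rela` (I am assuming the usual offset, info, addend layout), and the function name is made up. For a PC-relative GOT relocation the value written is the address of the GOT slot plus the addend minus the address of the bytes being patched.

```c
#include <stdint.h>

/* Illustrative stand-in for Elf64_Rela from <elf.h>. */
struct rela_entry {
    uint64_t r_offset;  /* where in the section to patch */
    uint64_t r_info;    /* symbol index and relocation type */
    int64_t  r_addend;  /* constant carried in the entry itself */
};

/* Value for a PC-relative GOT relocation: got_slot + addend - place,
 * where place is the final address of the four bytes being patched. */
int32_t pcrel_got_value(uint64_t got_slot, const struct rela_entry *r,
                        uint64_t section_base)
{
    uint64_t place = section_base + r->r_offset;
    return (int32_t)(got_slot + (uint64_t)r->r_addend - place);
}
```

With `_start` placed at 0x401020, the GOT slot at 0x403ff0, offset 0x2a and addend -4, the result is 0x2fa2, which is the displacement in `callq *0x2fa2(%rip)`. The -4 is there because the CPU adds the displacement to the address of the next instruction, which is 4 bytes past the start of the patched field. With a REL entry there is no `r_addend` member, so the addend has to be read from the bytes at the patched location instead.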
DWARF section
DWARF sections are used for debugging information. These sections can be viewed using readelf or objdump.
Let's take a look at the section headers related to debugging:
$ objdump -wh dwarf
dwarf: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn Flags
...
26 .debug_aranges 00000030 0000000000000000 0000000000000000 000040f8 2**0 CONTENTS, READONLY, DEBUGGING
27 .debug_info 00000380 0000000000000000 0000000000000000 00004128 2**0 CONTENTS, READONLY, DEBUGGING
28 .debug_abbrev 00000137 0000000000000000 0000000000000000 000044a8 2**0 CONTENTS, READONLY, DEBUGGING
29 .debug_line 00000119 0000000000000000 0000000000000000 000045df 2**0 CONTENTS, READONLY, DEBUGGING
30 .debug_str 0000028d 0000000000000000 0000000000000000 000046f8 2**0 CONTENTS, READONLY, DEBUGGING
So these sections would be something that the debugger looks at for example.
`.debug_aranges` is a lookup table of addresses to compilation units.
$ objdump --dwarf=info dwarf
<1><329>: Abbrev Number: 18 (DW_TAG_subprogram)
<32a> DW_AT_external : 1
<32a> DW_AT_name : (indirect string, offset: 0x18): something
<32e> DW_AT_decl_file : 1
<32f> DW_AT_decl_line : 3
<330> DW_AT_decl_column : 6
<331> DW_AT_prototyped : 1
<331> DW_AT_low_pc : 0x401126
<339> DW_AT_high_pc : 0x41
<341> DW_AT_frame_base : 1 byte block: 9c (DW_OP_call_frame_cfa)
<343> DW_AT_GNU_all_tail_call_sites: 1
<2><343>: Abbrev Number: 19 (DW_TAG_formal_parameter)
<344> DW_AT_name : x
<346> DW_AT_decl_file : 1
<347> DW_AT_decl_line : 3
<348> DW_AT_decl_column : 20
<349> DW_AT_type : <0x65>
<34d> DW_AT_location : 2 byte block: 91 5c (DW_OP_fbreg: -36)
<2><350>: Abbrev Number: 20 (DW_TAG_variable)
<351> DW_AT_name : (indirect string, offset: 0x10f): local
<355> DW_AT_decl_file : 1
<356> DW_AT_decl_line : 4
<357> DW_AT_decl_column : 7
<358> DW_AT_type : <0x65>
<35c> DW_AT_location : 2 byte block: 91 68 (DW_OP_fbreg: -24)
Notice that there is a `DW_TAG_subprogram` for the `something` function, and
the `DW_AT_low_pc` is `0x401126`, which is the start address of the `something`
function.
The leading numbers in angle brackets give the nesting level, so `something` is
at level 1 and the parameter `x` and the local variable are nested inside it,
hence the 2.
$ objdump --disassemble=something dwarf
dwarf: file format elf64-x86-64
Disassembly of section .init:
Disassembly of section .plt:
Disassembly of section .text:
0000000000401126 <something>:
401126: 55 push %rbp
401127: 48 89 e5 mov %rsp,%rbp
40112a: 48 83 ec 20 sub $0x20,%rsp
40112e: 89 7d ec mov %edi,-0x14(%rbp)
401131: 8b 45 ec mov -0x14(%rbp),%eax
401134: 83 c0 0a add $0xa,%eax
401137: 89 45 f8 mov %eax,-0x8(%rbp)
40113a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
401141: eb 18 jmp 40115b <something+0x35>
401143: 8b 45 fc mov -0x4(%rbp),%eax
401146: 89 c6 mov %eax,%esi
401148: bf 10 20 40 00 mov $0x402010,%edi
40114d: b8 00 00 00 00 mov $0x0,%eax
401152: e8 d9 fe ff ff callq 401030 <printf@plt>
401157: 83 45 fc 01 addl $0x1,-0x4(%rbp)
40115b: 8b 45 fc mov -0x4(%rbp),%eax
40115e: 3b 45 f8 cmp -0x8(%rbp),%eax
401161: 7c e0 jl 401143 <something+0x1d>
401163: 90 nop
401164: 90 nop
401165: c9 leaveq
401166: c3 retq
And also notice that the int parameter to `something` is described by a
`DW_TAG_formal_parameter` entry and that `DW_AT_decl_line` specifies the line in
the source code file.
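The `DW_AT_location` blocks can also be matched against the disassembly. In `91 5c`, `91` is `DW_OP_fbreg` and `5c` is its operand encoded as a signed LEB128 number. The sketch below decodes that operand and converts it to an offset from `%rbp`. It assumes that the frame base is the CFA (as `DW_OP_call_frame_cfa` says) and that the CFA is `%rbp + 16` once the `push %rbp; mov %rsp,%rbp` prologue has run; the function names are my own.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one signed LEB128 number. Returns the value and stores the
 * number of bytes consumed in *used. */
int64_t decode_sleb128(const uint8_t *p, size_t *used)
{
    uint64_t result = 0;
    unsigned shift = 0;
    size_t i = 0;
    uint8_t byte;

    do {
        byte = p[i++];
        if (shift < 64)
            result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);

    /* Sign-extend when the sign bit (0x40) of the last byte is set. */
    if (shift < 64 && (byte & 0x40))
        result |= ~(uint64_t)0 << shift;

    *used = i;
    return (int64_t)result;
}

/* Assumption: CFA = %rbp + 16 (saved %rbp plus return address). */
int64_t fbreg_to_rbp_offset(int64_t fbreg)
{
    return fbreg + 16;
}
```

`5c` decodes to -36, and -36 + 16 = -20 = -0x14, which matches `mov %edi,-0x14(%rbp)` for the parameter `x`. In the same way `68` decodes to -24, giving -0x8, and `-0x8(%rbp)` is where the local variable is stored.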
$ readelf -w dwarf
Contents of the .eh_frame section:
00000000 0000000000000014 00000000 CIE
Version: 1
Augmentation: "zR"
Code alignment factor: 1
Data alignment factor: -8
Return address column: 16
Augmentation data: 1b
DW_CFA_def_cfa: r7 (rsp) ofs 8
DW_CFA_offset: r16 (rip) at cfa-8
DW_CFA_nop
DW_CFA_nop
Linker script
Before we dig into and step through the startup process we need to consider what the linker does with our object code. If we only inspect the object file using objdump we don't see the complete linked object which we see when it is loaded into the debugger.
Details about how linker scripts work can be found here.
We can pass `-verbose` to the linker to see the linker script it uses:
$ g++ -O0 -g -o ctor ctor.cc -Wl,-verbose
Linker script
using internal linker script:
==================================================
/* Script for -z combreloc -z separate-code: combine and sort reloc sections with separate code segment */
/* Copyright (C) 2014-2019 Free Software Foundation, Inc.
Copying and distribution of this script, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved. */
OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)
SEARCH_DIR("=/usr/x86_64-redhat-linux/lib64");
SEARCH_DIR("=/usr/lib64");
SEARCH_DIR("=/usr/local/lib64");
SEARCH_DIR("=/lib64");
SEARCH_DIR("=/usr/x86_64-redhat-linux/lib");
SEARCH_DIR("=/usr/local/lib");
SEARCH_DIR("=/lib");
SEARCH_DIR("=/usr/lib");
SECTIONS
{
PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x400000)); . = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;
.interp : { *(.interp) }
.note.gnu.build-id : { *(.note.gnu.build-id) }
.hash : { *(.hash) }
.gnu.hash : { *(.gnu.hash) }
.dynsym : { *(.dynsym) }
.dynstr : { *(.dynstr) }
.gnu.version : { *(.gnu.version) }
.gnu.version_d : { *(.gnu.version_d) }
.gnu.version_r : { *(.gnu.version_r) }
.rela.dyn :
{
*(.rela.init)
*(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
*(.rela.fini)
*(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*)
*(.rela.data .rela.data.* .rela.gnu.linkonce.d.*)
*(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*)
*(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*)
*(.rela.ctors)
*(.rela.dtors)
*(.rela.got)
*(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*)
*(.rela.ldata .rela.ldata.* .rela.gnu.linkonce.l.*)
*(.rela.lbss .rela.lbss.* .rela.gnu.linkonce.lb.*)
*(.rela.lrodata .rela.lrodata.* .rela.gnu.linkonce.lr.*)
*(.rela.ifunc)
}
.rela.plt :
{
*(.rela.plt)
PROVIDE_HIDDEN (__rela_iplt_start = .);
*(.rela.iplt)
PROVIDE_HIDDEN (__rela_iplt_end = .);
}
. = ALIGN(CONSTANT (MAXPAGESIZE));
.init :
{
KEEP (*(SORT_NONE(.init)))
}
.plt : { *(.plt) *(.iplt) }
.plt.got : { *(.plt.got) }
.plt.sec : { *(.plt.sec) }
.text :
{
*(.text.unlikely .text.*_unlikely .text.unlikely.*)
*(.text.exit .text.exit.*)
*(.text.startup .text.startup.*)
*(.text.hot .text.hot.*)
*(.text .stub .text.* .gnu.linkonce.t.*)
/* .gnu.warning sections are handled specially by elf32.em. */
*(.gnu.warning)
}
.fini :
{
KEEP (*(SORT_NONE(.fini)))
}
PROVIDE (__etext = .);
PROVIDE (_etext = .);
PROVIDE (etext = .);
. = ALIGN(CONSTANT (MAXPAGESIZE));
/* Adjust the address for the rodata segment. We want to adjust up to
the same address within the page on the next page up. */
. = SEGMENT_START("rodata-segment", ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)));
.rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
.rodata1 : { *(.rodata1) }
.eh_frame_hdr : { *(.eh_frame_hdr) *(.eh_frame_entry .eh_frame_entry.*) }
.eh_frame : ONLY_IF_RO { KEEP (*(.eh_frame)) *(.eh_frame.*) }
.gcc_except_table : ONLY_IF_RO { *(.gcc_except_table .gcc_except_table.*) }
.gnu_extab : ONLY_IF_RO { *(.gnu_extab*) }
/* These sections are generated by the Sun/Oracle C++ compiler. */
.exception_ranges : ONLY_IF_RO { *(.exception_ranges*) }
/* Adjust the address for the data segment. We want to adjust up to
the same address within the page on the next page up. */
. = DATA_SEGMENT_ALIGN (CONSTANT (MAXPAGESIZE), CONSTANT (COMMONPAGESIZE));
/* Exception handling */
.eh_frame : ONLY_IF_RW { KEEP (*(.eh_frame)) *(.eh_frame.*) }
.gnu_extab : ONLY_IF_RW { *(.gnu_extab) }
.gcc_except_table : ONLY_IF_RW { *(.gcc_except_table .gcc_except_table.*) }
.exception_ranges : ONLY_IF_RW { *(.exception_ranges*) }
/* Thread Local Storage sections */
.tdata :
{
PROVIDE_HIDDEN (__tdata_start = .);
*(.tdata .tdata.* .gnu.linkonce.td.*)
}
.tbss : { *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) }
.preinit_array :
{
PROVIDE_HIDDEN (__preinit_array_start = .);
KEEP (*(.preinit_array))
PROVIDE_HIDDEN (__preinit_array_end = .);
}
.init_array :
{
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
KEEP (*(.init_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .ctors))
PROVIDE_HIDDEN (__init_array_end = .);
}
.fini_array :
{
PROVIDE_HIDDEN (__fini_array_start = .);
KEEP (*(SORT_BY_INIT_PRIORITY(.fini_array.*) SORT_BY_INIT_PRIORITY(.dtors.*)))
KEEP (*(.fini_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .dtors))
PROVIDE_HIDDEN (__fini_array_end = .);
}
.ctors :
{
/* gcc uses crtbegin.o to find the start of
the constructors, so we make sure it is
first. Because this is a wildcard, it
doesn't matter if the user does not
actually link against crtbegin.o; the
linker won't look for a file to match a
wildcard. The wildcard also means that it
doesn't matter which directory crtbegin.o
is in. */
KEEP (*crtbegin.o(.ctors))
KEEP (*crtbegin?.o(.ctors))
/* We don't want to include the .ctor section from
the crtend.o file until after the sorted ctors.
The .ctor section from the crtend file contains the
end of ctors marker and it must be last */
KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .ctors))
KEEP (*(SORT(.ctors.*)))
KEEP (*(.ctors))
}
.dtors :
{
KEEP (*crtbegin.o(.dtors))
KEEP (*crtbegin?.o(.dtors))
KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .dtors))
KEEP (*(SORT(.dtors.*)))
KEEP (*(.dtors))
}
.jcr : { KEEP (*(.jcr)) }
.data.rel.ro : { *(.data.rel.ro.local* .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro .data.rel.ro.* .gnu.linkonce.d.rel.ro.*) }
.dynamic : { *(.dynamic) }
.got : { *(.got) *(.igot) }
. = DATA_SEGMENT_RELRO_END (SIZEOF (.got.plt) >= 24 ? 24 : 0, .);
.got.plt : { *(.got.plt) *(.igot.plt) }
.data :
{
*(.data .data.* .gnu.linkonce.d.*)
SORT(CONSTRUCTORS)
}
.data1 : { *(.data1) }
_edata = .; PROVIDE (edata = .);
. = .;
__bss_start = .;
.bss :
{
*(.dynbss)
*(.bss .bss.* .gnu.linkonce.b.*)
*(COMMON)
/* Align here to ensure that the .bss section occupies space up to
_end. Align after .bss to ensure correct alignment even if the
.bss section disappears because there are no input sections.
FIXME: Why do we need it? When there is no .bss section, we do not
pad the .data section. */
. = ALIGN(. != 0 ? 64 / 8 : 1);
}
.lbss :
{
*(.dynlbss)
*(.lbss .lbss.* .gnu.linkonce.lb.*)
*(LARGE_COMMON)
}
. = ALIGN(64 / 8);
. = SEGMENT_START("ldata-segment", .);
.lrodata ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)) :
{
*(.lrodata .lrodata.* .gnu.linkonce.lr.*)
}
.ldata ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)) :
{
*(.ldata .ldata.* .gnu.linkonce.l.*)
. = ALIGN(. != 0 ? 64 / 8 : 1);
}
. = ALIGN(64 / 8);
_end = .; PROVIDE (end = .);
. = DATA_SEGMENT_END (.);
/* Stabs debugging sections. */
.stab 0 : { *(.stab) }
.stabstr 0 : { *(.stabstr) }
.stab.excl 0 : { *(.stab.excl) }
.stab.exclstr 0 : { *(.stab.exclstr) }
.stab.index 0 : { *(.stab.index) }
.stab.indexstr 0 : { *(.stab.indexstr) }
.comment 0 : { *(.comment) }
.gnu.build.attributes : { *(.gnu.build.attributes .gnu.build.attributes.*) }
/* DWARF debug sections.
Symbols in the DWARF debugging sections are relative to the beginning
of the section so we begin them at 0. */
/* DWARF 1 */
.debug 0 : { *(.debug) }
.line 0 : { *(.line) }
/* GNU DWARF 1 extensions */
.debug_srcinfo 0 : { *(.debug_srcinfo) }
.debug_sfnames 0 : { *(.debug_sfnames) }
/* DWARF 1.1 and DWARF 2 */
.debug_aranges 0 : { *(.debug_aranges) }
.debug_pubnames 0 : { *(.debug_pubnames) }
/* DWARF 2 */
.debug_info 0 : { *(.debug_info .gnu.linkonce.wi.*) }
.debug_abbrev 0 : { *(.debug_abbrev) }
.debug_line 0 : { *(.debug_line .debug_line.* .debug_line_end) }
.debug_frame 0 : { *(.debug_frame) }
.debug_str 0 : { *(.debug_str) }
.debug_loc 0 : { *(.debug_loc) }
.debug_macinfo 0 : { *(.debug_macinfo) }
/* SGI/MIPS DWARF 2 extensions */
.debug_weaknames 0 : { *(.debug_weaknames) }
.debug_funcnames 0 : { *(.debug_funcnames) }
.debug_typenames 0 : { *(.debug_typenames) }
.debug_varnames 0 : { *(.debug_varnames) }
/* DWARF 3 */
.debug_pubtypes 0 : { *(.debug_pubtypes) }
.debug_ranges 0 : { *(.debug_ranges) }
/* DWARF Extension. */
.debug_macro 0 : { *(.debug_macro) }
.debug_addr 0 : { *(.debug_addr) }
.gnu.attributes 0 : { KEEP (*(.gnu.attributes)) }
/DISCARD/ : { *(.note.GNU-stack) *(.gnu_debuglink) *(.gnu.lto_*) }
}
The script contains an ENTRY command that specifies the first instruction to execute:
ENTRY(_start)
This is what would be overridden if the `-e new_entry` option was specified.
Take a look at the `.init_array` section:
.init_array :
{
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
KEEP (*(.init_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .ctors))
PROVIDE_HIDDEN (__init_array_end = .);
}
First, .init_array names a section that should be created in the
output object file. Next, a new symbol named __init_array_start is defined
and assigned the current address using the special location counter '.'.
So this is really just an assignment, __init_array_start = ., which is wrapped
in the PROVIDE_HIDDEN command, which makes the symbol hidden so that it is not
exported. PROVIDE means that the symbol will only be defined if it is
referenced but not defined by any of the input object files.
The KEEP command will prevent the section from being garbage collected at
link time if --gc-sections is specified.
Normally, the linker will place files and sections matched by wildcards in the
order in which they are seen during the link.
SORT_BY_INIT_PRIORITY will sort sections into ascending numerical order of the
GCC init_priority attribute encoded in the section name before placing them in
the output file. In .init_array.NNNNN and .fini_array.NNNNN, NNNNN is the
init_priority. In .ctors.NNNNN and .dtors.NNNNN, NNNNN is 65535 minus the
init_priority. So all the .init_array.* sections from all the input object
files will be added to this .init_array section, as well as all .ctors.*
sections.
And we also add all .init_array sections, and all .ctors sections except those
from the *crtbegin.o, *crtbegin?.o, *crtend.o, or *crtend?.o object files
(an EXCLUDE_FILE inside the section list only applies to the section name that
follows it).
And finally we add another symbol named __init_array_end and set its address
to the current address. So __init_array_start will mark the start of these
included sections and __init_array_end the end.
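The effect of SORT_BY_INIT_PRIORITY can be observed from C using the constructor attribute. The following is a minimal sketch, assuming GCC or Clang on Linux with a linker script like the one above; the names first, second, last, init_order, and init_count are made up for this illustration. Constructors with a priority end up in .init_array.NNNNN sections, and those without one in the plain .init_array section:

```c
#include <stddef.h>

/* Records the order in which the constructors below ran (hypothetical names) */
int init_order[3];
size_t init_count;

/* Priority 202: placed in .init_array.00202, so it sorts after priority 201 */
__attribute__((constructor(202))) static void second(void)
{
    init_order[init_count++] = 202;
}

/* Priority 201: placed in .init_array.00201 (priorities 0-100 are reserved) */
__attribute__((constructor(201))) static void first(void)
{
    init_order[init_count++] = 201;
}

/* No priority: placed in plain .init_array, after the sorted sections */
__attribute__((constructor)) static void last(void)
{
    init_order[init_count++] = -1;
}
```

By the time main is entered, init_order should hold 201, 202, -1, regardless of the order in which the functions appear in the source file.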
Take the following:
int x = 10;
This will create an entry in the symbol table which holds the address of an int-sized block of memory where the value 10 is stored. When this symbol is referenced, the symbol table is used at link time to find the address of the symbol's memory block, and the generated code then reads the value from that address.
execve
Is a system call that loads a new program into a process's memory and replaces the calling program.
#include <unistd.h>
int execve(const char* pathname, char* const argv[], char* const envp[]);
This call never returns if successful, since the current process image is
replaced with the new program. On failure it returns -1.
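Since execve replaces the calling program, it is normally called in a child process created with fork. The following is a minimal sketch of that pattern; exec_and_wait is a hypothetical helper name, and the choice of 127 as the "exec failed" status is just the shell convention, not something execve itself mandates:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` in a child process and return its exit
 * status, 127 if execve failed, or -1 on fork/wait errors. */
int exec_and_wait(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execve(path, argv, envp);
        /* Only reached if execve failed, since it does not return on success */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, on a system where /bin/sh exists, calling this with the argument vector {"sh", "-c", "exit 7", NULL} would be expected to return 7.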
[execve](https://github.com/torvalds/linux/blob/575966e080270b7574175da35f7f7dd5ecd89ff4/fs/exec.c#L1955):
SYSCALL_DEFINE3(execve,
const char __user *, filename,
const char __user *const __user *, argv,
const char __user *const __user *, envp) {
return do_execve(getname(filename), argv, envp);
}
int do_execve(struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp) {
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
}
static int do_execveat_common(int fd, struct filename *filename,
struct user_arg_ptr argv,
struct user_arg_ptr envp,
int flags) {
return __do_execve_file(fd, filename, argv, envp, flags, NULL);
}
Now __do_execve_file contains the bulk of the work as far as I can tell:
static int __do_execve_file(int fd, struct filename *filename,
struct user_arg_ptr argv,
struct user_arg_ptr envp,
int flags, struct file *file)
{
retval = bprm_mm_init(bprm);
if (retval)
goto out_unmark;
retval = prepare_arg_pages(bprm, argv, envp);
if (retval < 0)
goto out;
retval = prepare_binprm(bprm);
if (retval < 0)
goto out;
retval = copy_strings_kernel(1, &bprm->filename, bprm);
if (retval < 0)
goto out;
bprm->exec = bprm->p;
retval = copy_strings(bprm->envc, envp, bprm);
if (retval < 0)
goto out;
retval = copy_strings(bprm->argc, argv, bprm);
if (retval < 0)
goto out;
would_dump(bprm, bprm->file);
retval = exec_binprm(bprm);
}
static int exec_binprm(struct linux_binprm *bprm)
{
pid_t old_pid, old_vpid;
int ret;
/* Need to fetch pid before load_binary changes it */
old_pid = current->pid;
rcu_read_lock();
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
rcu_read_unlock();
ret = search_binary_handler(bprm);
if (ret >= 0) {
audit_bprm(bprm);
trace_sched_process_exec(current, old_pid, bprm);
ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
proc_exec_connector(current);
}
return ret;
}
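Note that proc_exec_connector does not call _start; it only notifies listeners
about the exec event. As far as I can tell it is search_binary_handler that
finds the binary format handler (load_elf_binary for ELF), which sets up the new
stack and instruction pointer so that execution continues at the entry point,
_start, when returning to user space.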
_start
Now, in our case execve has set up the stack with argc, argv, envp, etc.
When stopping in main and displaying the backtrace in lldb we get:
(lldb) bt
* thread #1, name = 'ctor', stop reason = breakpoint 1.1
* frame #0: 0x0000000000401131 ctor`main(argc=1, argv=0x00007fffffffd1d8) at ctor.cc:14:1
frame #1: 0x00007ffff7e051a3 libc.so.6`.annobin_libc_start.c + 243
frame #2: 0x000000000040106e ctor`.annobin_init.c.hot + 46
In gdb we get:
(gdb) set backtrace past-main on
(gdb) bt
#0 main (argc=1, argv=0x7ffd6e0f1978) at ctor.cc:14
#1 0x00007fe0ceb441a3 in __libc_start_main () from /lib64/libc.so.6
#2 0x000000000040106e in _start ()
Notice the names are different but the addresses are the same and you can also verify that the assembly code is the same for these functions. I'm not sure why this is but it's worth mentioning.
ctor`.annobin_init.c.hot/_start() is the first frame because the execve process was replaced with this one.
Notice that if we set a breakpoint in _start in lldb it will be as:
(lldb) br s -n _start
Breakpoint 1: where = ctor`.annobin_init.c.hot, address = 0x0000000000401040
_start can be found in the glibc source tree; on my local machine it's in ~/work/gcc/glibc/sysdeps/x86_64/start.S
(lldb) disassemble
ctor`.annobin_init.c.hot:
-> 0x401040 <+0>: endbr64
0x401044 <+4>: xor ebp, ebp
0x401046 <+6>: mov r9, rdx
0x401049 <+9>: pop rsi
0x40104a <+10>: mov rdx, rsp
0x40104d <+13>: and rsp, -0x10
0x401051 <+17>: push rax
0x401052 <+18>: push rsp
0x401053 <+19>: mov r8, 0x401220
0x40105a <+26>: mov rcx, 0x4011b0
0x401061 <+33>: mov rdi, 0x401126
0x401068 <+40>: call qword ptr [rip + 0x2f82]
0x40106e <+46>: hlt
First we have the `endbr64` instruction, which is part of Intel's Control-flow Enforcement Technology (CET) and marks a valid target for indirect branches.
Next we have:
```console
-> 0x401044 <+4>: xor ebp, ebp
```
This is clearing ebp (the stack base pointer), which the ABI suggests doing to mark the outermost frame.
-> 0x401046 <+6>: mov r9, rdx
This is moving the value in rdx into r9. So what is in rdx?
(lldb) register read rdx
rdx = 0x00007ffff7fe2100 ld-2.30.so`.annobin_dl_fini.c
-> 0x401049 <+9>: pop rsi
This instruction is popping the topmost value from the stack and storing it in rsi:
(lldb) register read rsi
rsi = 0x0000000000000001
This is argc.
-> 0x40104a <+10>: mov rdx, rsp
So we are moving the value in rsp into rdx.
(lldb) register read rdx
rdx = 0x00007fffffffd1d8
(lldb) memory read -f x -s 8 -c 1 0x00007fffffffd1d8
(lldb) memory read -f s 0x00007fffffffd5ab
0x7fffffffd5ab: "/home/danielbevenius/work/assembly/learning-assembly/ctor"
So we can see that rdx was holding char** argv.
-> 0x40104d <+13>: and rsp, -0x10
This is aligning the stack on a 16-byte boundary.
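The same masking can be written in C. This is a small sketch (align_down_16 is a made-up name) showing that and-ing with -0x10 clears the low four bits, which rounds the address down to a multiple of 16:

```c
#include <stdint.h>

/* Round addr down to a multiple of 16, like `and rsp, -0x10` does for rsp */
uint64_t align_down_16(uint64_t addr)
{
    /* -0x10 is 0xfffffffffffffff0 in two's complement, so the low 4 bits are cleared */
    return addr & ~(uint64_t)0xf;
}
```

With the rsp value from this session, 0x00007fffffffd1d8, the result would be 0x00007fffffffd1d0.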
-> 0x401051 <+17>: push rax
This will copy the value in rax onto the stack:
(lldb) register read rax
rax = 0x00007ffff7ffdfa0 ld-2.30.so`__GI__dl_starting_up
-> 0x401052 <+18>: push rsp
This will push the current value of the stack pointer onto the stack.
-> 0x401053 <+19>: mov r8, 0x401220
Next we will move the value 0x401220 into r8:
(lldb) disassemble -s 0x401220
ctor`__libc_csu_fini:
0x401220 <+0>: endbr64
0x401224 <+4>: ret
0x401225: add byte ptr [rax], al
0x401227: add bl, dh
So we are placing the memory address of __libc_csu_fini into r8.
-> 0x40105a <+26>: mov rcx, 0x4011b0
And next we move the address of __libc_csu_init into rcx:
(lldb) disassemble -s 0x4011b0
ctor`__libc_csu_init:
0x4011b0 <+0>: endbr64
0x4011b4 <+4>: push r15
0x4011b6 <+6>: lea r15, [rip + 0x2c4b] ; __frame_dummy_init_array_entry
0x4011bd <+13>: push r14
0x4011bf <+15>: mov r14, rdx
0x4011c2 <+18>: push r13
0x4011c4 <+20>: mov r13, rsi
0x4011c7 <+23>: push r12
0x4011c9 <+25>: mov r12d, edi
0x4011cc <+28>: push rbp
-> 0x401061 <+33>: mov rdi, 0x401126
And this is placing the address of main into rdi:
(lldb) disassemble -s 0x401126
ctor`main:
0x401126 <+0>: push rbp
0x401127 <+1>: mov rbp, rsp
0x40112a <+4>: mov dword ptr [rbp - 0x4], edi
0x40112d <+7>: mov qword ptr [rbp - 0x10], rsi
0x401131 <+11>: mov eax, 0x0
0x401136 <+16>: pop rbp
0x401137 <+17>: ret
-> 0x401068 <+40>: call qword ptr [rip + 0x2f82]
This will call .annobin_libc_start.c/__libc_start_main:
libc.so.6`.annobin_libc_start.c:
-> 0x7ffff7e050b0 <+0>: endbr64
0x7ffff7e050b4 <+4>: push r14
0x7ffff7e050b6 <+6>: xor eax, eax
0x7ffff7e050b8 <+8>: push r13
So let's take a look at this function in a new section and take some notes before continuing the debugging session.
.annobin_libc_start.c/__libc_start_main
Can be found in /work/gcc/glibc/csu/libc-start.c. And recall that _start has placed all the parameters in the correct registers.
STATIC int
LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
int argc, char **argv,
ElfW(auxv_t) *auxvec,
__typeof (main) init,
void (*fini) (void),
void (*rtld_fini) (void), void *stack_end)
{
...
if (init)
(*init) (argc, argv, __environ MAIN_AUXVEC_PARAM);
Now, init is passed in as an argument and has the same signature as main.
It gets linked into the executable, and the source can be found in
/work/gcc/glibc/csu/elf-init.c in the function __libc_csu_init.
extern void _init (void);
extern void _fini (void);
extern void (*__preinit_array_start []) (int, char **, char **) attribute_hidden;
extern void (*__preinit_array_end []) (int, char **, char **) attribute_hidden;
extern void (*__init_array_start []) (int, char **, char **) attribute_hidden;
extern void (*__init_array_end []) (int, char **, char **) attribute_hidden;
extern void (*__fini_array_start []) (void) attribute_hidden;
extern void (*__fini_array_end []) (void) attribute_hidden;
void __libc_csu_init (int argc, char **argv, char **envp) {
_init ();
const size_t size = __init_array_end - __init_array_start;
for (size_t i = 0; i < size; i++)
(*__init_array_start [i]) (argc, argv, envp);
}
Remember that __init_array_end and __init_array_start were added to the
object file by the linker script (see details above).
Notice that _init is an external function which returns void and does not
take any arguments. I think _init can be different for dynamically linked
and statically linked programs.
After that the number (size) of functions specified in the .init_array section
is computed and each of them is called in turn.
Debugging session continued:
(lldb) f
frame #0: 0x00007ffff7e050b0 libc.so.6`.annobin_libc_start.c
libc.so.6`.annobin_libc_start.c:
-> 0x7ffff7e050b0 <+0>: endbr64
0x7ffff7e050b4 <+4>: push r14
0x7ffff7e050b6 <+6>: xor eax, eax
0x7ffff7e050b8 <+8>: push r13
annobin
There is a project named Annobin which is about adding extra information to binary files. This information is held in an ELF notes section and is created by a plugin to GCC.
+----------------+ +--------+ +--------+
| pre-init-array |<----> | Loader | -----> | _start |
+----------------+ +--------+ +--------+
_start -> __libc_start_main -> main -> exit(exit_value) -> run_exit_handlers -> _exit()
.init
When a program starts the system will execute the code in this section before calling the main program entry point.
An example of this can be seen in init.c:
$ lldb -- init
(lldb) br s -n some_constructor
(lldb) bt 10
* thread #1, name = 'init', stop reason = breakpoint 1.1
* frame #0: 0x000000000040112a init`some_constructor at init.c:4:3
frame #1: 0x00000000004011ad init`__libc_csu_init + 77
frame #2: 0x00007ffff7e0512e libc.so.6`.annobin_libc_start.c + 126
frame #3: 0x000000000040106e init`.annobin_init.c.hot + 46
.fini
When the program exits normally the system will execute the code in this section.
deregister_tm_clones
00000000000060f0 <deregister_tm_clones>:
60f0: 48 8d 3d 39 3f 04 00 lea 0x43f39(%rip),%rdi # 4a030 <__TMC_END__>
60f7: 48 8d 05 32 3f 04 00 lea 0x43f32(%rip),%rax # 4a030 <__TMC_END__>
60fe: 48 39 f8 cmp %rdi,%rax
6101: 74 15 je 6118 <deregister_tm_clones+0x28>
6103: 48 8b 05 06 39 04 00 mov 0x43906(%rip),%rax # 49a10 <_ITM_deregisterTMCloneTable>
610a: 48 85 c0 test %rax,%rax
610d: 74 09 je 6118 <deregister_tm_clones+0x28>
610f: ff e0 jmpq *%rax
6111: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
6118: c3 retq
6119: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
register_tm_clones
0000000000006120 <register_tm_clones>:
6120: 48 8d 3d 09 3f 04 00 lea 0x43f09(%rip),%rdi # 4a030 <__TMC_END__>
6127: 48 8d 35 02 3f 04 00 lea 0x43f02(%rip),%rsi # 4a030 <__TMC_END__>
612e: 48 29 fe sub %rdi,%rsi
6131: 48 89 f0 mov %rsi,%rax
6134: 48 c1 ee 3f shr $0x3f,%rsi
6138: 48 c1 f8 03 sar $0x3,%rax
613c: 48 01 c6 add %rax,%rsi
613f: 48 d1 fe sar %rsi
6142: 74 14 je 6158 <register_tm_clones+0x38>
6144: 48 8b 05 f5 3d 04 00 mov 0x43df5(%rip),%rax # 49f40 <_ITM_registerTMCloneTable>
614b: 48 85 c0 test %rax,%rax
614e: 74 08 je 6158 <register_tm_clones+0x38>
6150: ff e0 jmpq *%rax
6152: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
6158: c3 retq
6159: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
__do_global_dtors_aux
The example used below is simplec.cc.
lldb -- ./simplec++
(lldb) br s -n __do_global_dtors_aux
(lldb) r
(lldb) f
frame #0: 0x0000000000401110 simplec++`__do_global_dtors_aux
simplec++`__do_global_dtors_aux:
-> 0x401110 <+0>: endbr64
0x401114 <+4>: cmp byte ptr [rip + 0x2f25], 0x0 ; simplec++.PT_LOAD[3] + 615
0x40111b <+11>: jne 0x401130 ; <+32>
0x40111d <+13>: push rbp
(lldb) bt
* thread #1, name = 'simplec++', stop reason = breakpoint 1.1
* frame #0: 0x0000000000401110 simplec++`__do_global_dtors_aux
frame #1: 0x00007ffff7fe230b ld-2.30.so`.annobin_dl_fini.c + 523
frame #2: 0x00007ffff7ac3e87 libc.so.6`.annobin_exit.c + 247
frame #3: 0x00007ffff7ac4040 libc.so.6`__GI_exit + 32
frame #4: 0x00007ffff7aac1aa libc.so.6`.annobin_libc_start.c + 250
frame #5: 0x000000000040108e simplec++`.annobin_init.c.hot + 46
So what this section enables is to clean up global data.
C++ constructors
This section contains notes about constructors and destructors that are
run for global instances in C++ programs.
These constructors are called before main is called:
$ lldb -- ctor
(lldb) br s -n Something
(lldb) bt 10
* thread #1, name = 'ctor', stop reason = breakpoint 1.1
* frame #0: 0x0000000000401194 ctor`Something::Something(this=0x0000000000404025) at ctor.cc:5:3
frame #1: 0x000000000040115f ctor`::__static_initialization_and_destruction_0(__initialize_p=1, __priority=65535) at ctor.cc:10:11
frame #2: 0x0000000000401189 ctor`::_GLOBAL__sub_I_s() at ctor.cc:14:1
frame #3: 0x00000000004011fd ctor`__libc_csu_init + 77
frame #4: 0x00007ffff7aac12e libc.so.6`.annobin_libc_start.c + 126
frame #5: 0x000000000040106e ctor`.annobin_init.c.hot + 46
annobin
When I link using gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) I'm seeing
symbols whose names start with .annobin. As I understand it these come from
the Annobin plugin project. I think these are included at link time
from crt1.o:
$ nm /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/crt1.o
0000000000000035 t .annobin__dl_relocate_static_pie.end
0000000000000030 t .annobin__dl_relocate_static_pie.start
000000000000002f t .annobin_init.c
000000000000002f t .annobin_init.c_end
0000000000000000 t .annobin_init.c_end.exit
0000000000000000 t .annobin_init.c_end.hot
0000000000000000 t .annobin_init.c_end.startup
0000000000000000 t .annobin_init.c_end.unlikely
0000000000000000 t .annobin_init.c.exit
0000000000000000 t .annobin_init.c.hot
0000000000000000 t .annobin_init.c.startup
0000000000000000 t .annobin_init.c.unlikely
0000000000000030 t .annobin_static_reloc.c
0000000000000035 t .annobin_static_reloc.c_end
0000000000000000 t .annobin_static_reloc.c_end.exit
0000000000000000 t .annobin_static_reloc.c_end.hot
0000000000000000 t .annobin_static_reloc.c_end.startup
0000000000000000 t .annobin_static_reloc.c_end.unlikely
0000000000000000 t .annobin_static_reloc.c.exit
0000000000000000 t .annobin_static_reloc.c.hot
0000000000000000 t .annobin_static_reloc.c.startup
0000000000000000 t .annobin_static_reloc.c.unlikely
What I find strange is that if I debug with lldb and set a breakpoint in _start, or just disassemble _start, lldb will show:
(lldb) disassemble --name _start
ctor`.annobin_init.c.hot:
0x401040 <+0>: endbr64
0x401044 <+4>: xor ebp, ebp
0x401046 <+6>: mov r9, rdx
0x401049 <+9>: pop rsi
0x40104a <+10>: mov rdx, rsp
0x40104d <+13>: and rsp, -0x10
0x401051 <+17>: push rax
0x401052 <+18>: push rsp
0x401053 <+19>: mov r8, 0x401220
0x40105a <+26>: mov rcx, 0x4011b0
0x401061 <+33>: mov rdi, 0x401126
0x401068 <+40>: call qword ptr [rip + 0x2f82]
0x40106e <+46>: hlt
(lldb) disassemble --name .annobin_init.c.hot
ctor`.annobin_init.c.hot:
0x401040 <+0>: endbr64
0x401044 <+4>: xor ebp, ebp
0x401046 <+6>: mov r9, rdx
0x401049 <+9>: pop rsi
0x40104a <+10>: mov rdx, rsp
0x40104d <+13>: and rsp, -0x10
0x401051 <+17>: push rax
0x401052 <+18>: push rsp
0x401053 <+19>: mov r8, 0x401220
0x40105a <+26>: mov rcx, 0x4011b0
0x401061 <+33>: mov rdi, 0x401126
0x401068 <+40>: call qword ptr [rip + 0x2f82]
0x40106e <+46>: hlt
$ readelf --syms ctor | grep _start
86: 0000000000401040 47 FUNC GLOBAL DEFAULT 13 _start
$ readelf --syms ctor | grep annobin_init.c.hot
36: 0000000000401040 0 NOTYPE LOCAL HIDDEN 13 .annobin_init.c.hot
Notice that these two symbols point to the same address.
ELF
Sections:
$ readelf -W -S ctor
$ readelf -W -t ctor
There are 36 section headers, starting at offset 0x5628:
Section Headers:
[Nr] Name Type Address Off Size ES Lk Inf Al Flags
[ 0] NULL NULL 0000000000000000 000000 000000 00 0 0 0 [0000000000000000]:
[ 1] .interp PROGBITS 00000000004002a8 0002a8 00001c 00 0 0 1 [0000000000000002]: ALLOC
[ 2] .note.gnu.build-id NOTE 00000000004002c4 0002c4 000024 00 0 0 4 [0000000000000002]: ALLOC
[ 3] .note.ABI-tag NOTE 00000000004002e8 0002e8 000020 00 0 0 4 [0000000000000002]: ALLOC
[ 4] .gnu.hash GNU_HASH 0000000000400308 000308 00001c 00 5 0 8 [0000000000000002]: ALLOC
[ 5] .dynsym DYNSYM 0000000000400328 000328 000060 18 6 1 8 [0000000000000002]: ALLOC
[ 6] .dynstr STRTAB 0000000000400388 000388 00006c 00 0 0 1 [0000000000000002]: ALLOC
[ 7] .gnu.version VERSYM 00000000004003f4 0003f4 000008 02 5 0 2 [0000000000000002]: ALLOC
[ 8] .gnu.version_r VERNEED 0000000000400400 000400 000020 00 6 1 8 [0000000000000002]: ALLOC
[ 9] .rela.dyn RELA 0000000000400420 000420 000030 18 5 0 8 [0000000000000002]: ALLOC
[10] .rela.plt RELA 0000000000400450 000450 000018 18 5 22 8 [0000000000000042]: ALLOC, INFO LINK
[11] .init PROGBITS 0000000000401000 001000 00001b 00 0 0 4 [0000000000000006]: ALLOC, EXEC
[12] .plt PROGBITS 0000000000401020 001020 000020 10 0 0 16 [0000000000000006]: ALLOC, EXEC
[13] .text PROGBITS 0000000000401040 001040 0001e5 00 0 0 16 [0000000000000006]: ALLOC, EXEC
[14] .fini PROGBITS 0000000000401228 001228 00000d 00 0 0 4 [0000000000000006]: ALLOC, EXEC
[15] .rodata PROGBITS 0000000000402000 002000 000010 00 0 0 8 [0000000000000002]: ALLOC
[16] .eh_frame_hdr PROGBITS 0000000000402010 002010 00005c 00 0 0 4 [0000000000000002]: ALLOC
[17] .eh_frame PROGBITS 0000000000402070 002070 000168 00 0 0 8 [0000000000000002]: ALLOC
[18] .init_array INIT_ARRAY 0000000000403dd8 002dd8 000010 08 0 0 8 [0000000000000003]: WRITE, ALLOC
[19] .fini_array FINI_ARRAY 0000000000403de8 002de8 000008 08 0 0 8 [0000000000000003]: WRITE, ALLOC
[20] .dynamic DYNAMIC 0000000000403df0 002df0 000200 10 6 0 8 [0000000000000003]: WRITE, ALLOC
[21] .got PROGBITS 0000000000403ff0 002ff0 000010 08 0 0 8 [0000000000000003]: WRITE, ALLOC
[22] .got.plt PROGBITS 0000000000404000 003000 000020 08 0 0 8 [0000000000000003]: WRITE, ALLOC
[23] .data PROGBITS 0000000000404020 003020 000004 00 0 0 1 [0000000000000003]: WRITE, ALLOC
[24] .bss NOBITS 0000000000404024 003024 000004 00 0 0 1 [0000000000000003]: WRITE, ALLOC
[25] .comment PROGBITS 0000000000000000 003024 000058 01 0 0 1 [0000000000000030]: MERGE, STRINGS
[26] .gnu.build.attributes 0000000000406028 00307c 00107c 00 0 0 4 [0000000000000000]:
[27] .debug_aranges PROGBITS 0000000000000000 0040f8 000050 00 0 0 1 [0000000000000000]:
[28] .debug_info PROGBITS 0000000000000000 004148 0001d7 00 0 0 1 [0000000000000000]:
[29] .debug_abbrev PROGBITS 0000000000000000 00431f 00014b 00 0 0 1 [0000000000000000]:
[30] .debug_line PROGBITS 0000000000000000 00446a 000079 00 0 0 1 [0000000000000000]:
[31] .debug_str PROGBITS 0000000000000000 0044e3 000164 01 0 0 1 [0000000000000030]: MERGE, STRINGS
[32] .debug_ranges PROGBITS 0000000000000000 004647 000040 00 0 0 1 [0000000000000000]:
[33] .symtab SYMTAB 0000000000000000 004688 000930 18 34 75 8 [0000000000000000]:
[34] .strtab STRTAB 0000000000000000 004fb8 000502 00 0 0 1 [0000000000000000]:
[35] .shstrtab STRTAB 0000000000000000 0054ba 000167 00 0 0 1 [0000000000000000]:
Notice that the Lk values are links to other section numbers (Nr). So we can
see that .symtab links to 34, which is .strtab.
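The Lk column corresponds to the sh_link field of the section header. The following is a minimal sketch, assuming a 64-bit ELF file and the Linux <elf.h> header, that reads the section headers and checks that every symbol table section links to a string table section. The function name is made up for this illustration:

```c
#include <elf.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the number of symbol table sections (.symtab/.dynsym) whose sh_link
 * (the Lk column) points at a string table section, or -1 on error or if a
 * symbol table links to something else. Assumes a 64-bit ELF file. */
int count_symtabs_linked_to_strtab(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    Elf64_Ehdr eh;
    Elf64_Shdr *sh = NULL;
    int result = -1;

    /* Read and sanity check the ELF header */
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_shentsize != sizeof(Elf64_Shdr) || eh.e_shnum == 0)
        goto out;

    /* Read the whole section header table */
    sh = malloc((size_t)eh.e_shnum * sizeof *sh);
    if (!sh || fseek(f, (long)eh.e_shoff, SEEK_SET) != 0 ||
        fread(sh, sizeof *sh, eh.e_shnum, f) != eh.e_shnum)
        goto out;

    result = 0;
    for (size_t i = 0; i < eh.e_shnum; i++) {
        if (sh[i].sh_type != SHT_SYMTAB && sh[i].sh_type != SHT_DYNSYM)
            continue;
        Elf64_Word link = sh[i].sh_link;
        if (link >= eh.e_shnum || sh[link].sh_type != SHT_STRTAB) {
            result = -1;
            break;
        }
        result++;
    }

out:
    free(sh);
    fclose(f);
    return result;
}
```

For the ctor binary above this would be expected to count two sections: .dynsym (linking to .dynstr) and .symtab (linking to .strtab).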
Could it be that when we set a breakpoint in lldb and specify _start, that works and the address used is 0000000000401040, but when lldb later breaks it uses the first name in .symtab with that address?
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[13] .text PROGBITS 0000000000401040 001040 0001e5 00 AX 0 0 16
Symbol table '.symtab' contains 98 entries:
Num: Value Size Type Bind Vis Ndx Name
13: 0000000000401040 0 SECTION LOCAL DEFAULT 13
36: 0000000000401040 0 NOTYPE LOCAL HIDDEN 13 .annobin_init.c.hot
86: 0000000000401040 47 FUNC GLOBAL DEFAULT 13 _start
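That several symbol names can share one address is easy to reproduce. This is not how annobin creates its symbols; it is only a sketch, assuming GCC or Clang on an ELF target, that uses the alias attribute to bind a second name to the same address. Both function names are made up:

```c
/* A normal function definition */
int start_like(void)
{
    return 42;
}

/* A second symbol name bound to the same address as start_like */
int start_alias(void) __attribute__((alias("start_like")));
```

Running readelf --syms on the resulting object file should list both names with the same value, and a debugger then has to pick one of them when it symbolizes that address.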
symbol table
0000000000401106 <main>:
401106: 55 push %rbp
401107: 48 89 e5 mov %rsp,%rbp
40110a: 89 7d fc mov %edi,-0x4(%rbp)
40110d: 48 89 75 f0 mov %rsi,-0x10(%rbp)
401111: 48 c7 c0 20 40 40 00 mov $0x404020,%rax
Notice the value of this move which is 404020
which we can find in the
symbol table.
$ readelf -s offset
22: 000000000040401c 0 SECTION LOCAL DEFAULT 22
...
68: 0000000000404020 4 OBJECT GLOBAL DEFAULT 22 something
...
So the entry something has Ndx 22, which refers to section 22, the .bss
section. Notice that the section symbol for 22 has the address
000000000040401c, which is where .bss starts:
$ objdump -d -j .bss offset
offset: file format elf64-x86-64
Disassembly of section .bss:
000000000040401c <__bss_start>:
40401c: 00 00 add %al,(%rax)
...
0000000000404020 <something>:
...
Just to be clear this file was linked into an executable so the linker would do this work. If we only compile the file we will see the following in the .text section:
$ gcc -fPIC -o offset -c offset.c
$ objdump -d -j .text offset
offset: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 7d fc mov %edi,-0x4(%rbp)
7: 48 89 75 f0 mov %rsi,-0x10(%rbp)
b: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 12 <main+0x12>
12: c7 00 12 00 00 00 movl $0x12,(%rax)
18: b8 00 00 00 00 mov $0x0,%eax
1d: 5d pop %rbp
1e: c3 retq
So there should be an entry in the relocation table for this:
$ readelf -r offset
Relocation section '.rela.text' at offset 0x200 contains 1 entry:
Offset Info Type Sym. Value Sym. Name + Addend
00000000000e 00080000002a R_X86_64_REX_GOTP 0000000000000004 something - 4
Relocation section '.rela.eh_frame' at offset 0x218 contains 1 entry:
Offset Info Type Sym. Value Sym. Name + Addend
000000000020 000200000002 R_X86_64_PC32 0000000000000000 .text + 0
And we can inspect the symbol table for this object file using:
$ readelf -s offset
Symbol table '.symtab' contains 11 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS offset.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 1
3: 0000000000000000 0 SECTION LOCAL DEFAULT 3
4: 0000000000000000 0 SECTION LOCAL DEFAULT 4
5: 0000000000000000 0 SECTION LOCAL DEFAULT 6
6: 0000000000000000 0 SECTION LOCAL DEFAULT 7
7: 0000000000000000 0 SECTION LOCAL DEFAULT 5
8: 0000000000000004 4 OBJECT GLOBAL DEFAULT COM something
9: 0000000000000000 31 FUNC GLOBAL DEFAULT 1 main
10: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _GLOBAL_OFFSET_TABLE_
Call Frame Information (CFI)
These are assembler directives that get generated to support debugging and exception handling by storing information about the base/stack pointer to enable stack unwinding. One might think that this would be possible without this information by just following the stack base pointer that is stored and restored, but there is no guarantee that functions do this, and one might want to use rbp for something else. So this enables stack unwinding without depending on the function prologue/epilogue.
For example:
$ gcc -g -S simple.c
.LFB0:
.file 1 "simple.c"
.loc 1 1 33
.cfi_startproc
Where do these come from?
As far as I can tell these originate from ../gcc/gcc/dwarf2out.c
static int maybe_emit_file (struct dwarf_file_data * fd) {
...
if (output_asm_line_debug_info ()){
fprintf (asm_out_file, "\t.file %u ", fd->emitted_number);
output_quoted_string (asm_out_file, remap_debug_filename (fd->filename));
fputc ('\n', asm_out_file);
}
}
For example:
.loc 1 1 33
would be outputted by the following function:
static void dwarf2out_source_line (unsigned int line, unsigned int column,
const char *filename,
int discriminator, bool is_stmt)
...
if (output_asm_line_debug_info ())
{
fputs ("\t.loc ", asm_out_file);
fprint_ul (asm_out_file, file_num);
putc (' ', asm_out_file);
fprint_ul (asm_out_file, line);
putc (' ', asm_out_file);
fprint_ul (asm_out_file, column);
.cfi_startproc is specified for each function that should have an entry in
.eh_frame (for frame unwinding) and should be closed with a .cfi_endproc. So
this would be in the generated assembly file even without the -g flag.
This will create a Call Frame Information (CFI) table for this function.
pushq %rbp
.cfi_def_cfa_offset 16
What I think this is doing is adjusting the offset used to compute the canonical frame address (CFA): since we have pushed rbp onto the stack, rsp has moved down by 8 bytes, so the CFA is now at rsp + 16 instead of rsp + 8.
Next we have
.cfi_offset 6, -16
The first argument is a register number 6, which is rbp and what this is doing is noting that rbp is being saved on the stack and how to find it.
The following are register names to register numbers:
| Register name | Number | Abbreviation |
|---|---|---|
| General Purpose Register RAX | 0 | %rax |
| General Purpose Register RDX | 1 | %rdx |
| General Purpose Register RCX | 2 | %rcx |
| General Purpose Register RBX | 3 | %rbx |
| General Purpose Register RSI | 4 | %rsi |
| General Purpose Register RDI | 5 | %rdi |
| Frame Pointer Register RBP | 6 | %rbp |
| Stack Pointer Register RSP | 7 | %rsp |
| Extended Integer Registers 8-15 | 8-15 | %r8–%r15 |
So what information is generated for these directives in the object file created?
Well, we can inspect the CFI using objdump:
$ objdump -W simplec
simplec: file format elf64-x86-64
Contents of the .eh_frame section:
00000000 0000000000000014 00000000 CIE
Version: 1
Augmentation: "zR"
Code alignment factor: 1
Data alignment factor: -8
Return address column: 16
Augmentation data: 1b
DW_CFA_def_cfa: r7 (rsp) ofs 8
DW_CFA_offset: r16 (rip) at cfa-8
DW_CFA_nop
DW_CFA_nop
00000040 000000000000001c 00000044 FDE cie=00000000 pc=0000000000401106..0000000000401118
DW_CFA_advance_loc: 1 to 0000000000401107
DW_CFA_def_cfa_offset: 16
DW_CFA_offset: r6 (rbp) at cfa-16
DW_CFA_advance_loc: 3 to 000000000040110a
DW_CFA_def_cfa_register: r6 (rbp)
DW_CFA_advance_loc: 13 to 0000000000401117
DW_CFA_def_cfa: r7 (rsp) ofs 8
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
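FDE stands for Frame Description Entry. The range that this entry covers is given by the pc=0000000000401106..0000000000401118 field, and each DW_CFA_advance_loc moves the current location forward within that range. Notice that the addresses of the DW_CFA_advance_loc entries are the addresses right after the instructions that modify the base or stack pointer: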
0000000000401106 <main>:
401106: 55 push %rbp
401107: 48 89 e5 mov %rsp,%rbp
40110a: 89 7d fc mov %edi,-0x4(%rbp)
40110d: 48 89 75 f0 mov %rsi,-0x10(%rbp)
401111: b8 00 00 00 00 mov $0x0,%eax
401116: 5d pop %rbp
401117: c3 retq
401118: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40111f: 00
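To make the FDE easier to follow, its instructions can be written out as a table of rows, where each row says how to compute the CFA from a given pc onwards. The following is a hand-written sketch of that table for main, not how an unwinder actually stores it; struct cfa_row, cfa_lookup, and cfa_value are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

/* From `pc` onwards: CFA = value of register `reg` + `offset`.
 * Register numbers use the DWARF x86-64 mapping (6 = rbp, 7 = rsp). */
struct cfa_row {
    uint64_t pc;
    int reg;
    int offset;
};

/* Range covered by the FDE: pc=0x401106..0x401118 (end is exclusive) */
#define FDE_START 0x401106u
#define FDE_END   0x401118u

/* Rows derived by hand from the CIE and FDE output above */
static const struct cfa_row main_rows[] = {
    { 0x401106, 7, 8 },  /* CIE: DW_CFA_def_cfa: r7 (rsp) ofs 8 */
    { 0x401107, 7, 16 }, /* after push %rbp: DW_CFA_def_cfa_offset: 16 */
    { 0x40110a, 6, 16 }, /* after mov %rsp,%rbp: DW_CFA_def_cfa_register: r6 (rbp) */
    { 0x401117, 7, 8 },  /* after pop %rbp: DW_CFA_def_cfa: r7 (rsp) ofs 8 */
};

/* Find the last row whose pc is <= the given pc, or NULL outside the FDE */
const struct cfa_row *cfa_lookup(uint64_t pc)
{
    const struct cfa_row *found = NULL;
    if (pc < FDE_START || pc >= FDE_END)
        return NULL;
    for (size_t i = 0; i < sizeof main_rows / sizeof main_rows[0]; i++) {
        if (main_rows[i].pc <= pc)
            found = &main_rows[i];
    }
    return found;
}

/* Compute the CFA for a row given the current rsp and rbp values */
uint64_t cfa_value(const struct cfa_row *row, uint64_t rsp, uint64_t rbp)
{
    return (row->reg == 7 ? rsp : rbp) + (uint64_t)row->offset;
}
```

Whichever instruction in main we stop at, the computed CFA stays the same: rsp + 8 on entry, rsp + 16 after the push (rsp is 8 lower), and rbp + 16 once rbp has been set.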
Common Information Entry (CIE).
Just a note about the label .LFB0, where L is just a prefix and FB is
Function Begin, followed by a number. There can also be Function End (.LFE0).
Canonical Frame Address (CFA)
This is the value of the stack pointer (rsp) at the call site, just before the call instruction pushed the return address. In a function that uses a frame pointer this corresponds to rbp + 16 once the function prologue has run.
IRQ (Interrupt Requests)
There are two types of these requests, long and short ones.
Auxiliary vector
An example can be found in auxv.c.
$ env LD_SHOW_AUXV=1 ./init
AT_SYSINFO_EHDR: 0x7ffff7fcf000
AT_HWCAP: bfebfbff
AT_PAGESZ: 4096
AT_CLKTCK: 100
AT_PHDR: 0x400040
AT_PHENT: 56
AT_PHNUM: 11
AT_BASE: 0x7ffff7fd1000
AT_FLAGS: 0x0
AT_ENTRY: 0x401040
AT_UID: 1000
AT_EUID: 1000
AT_GID: 1000
AT_EGID: 1000
AT_SECURE: 0
AT_RANDOM: 0x7fffffffd619
AT_HWCAP2: 0x0
AT_EXECFN: ./init
AT_PLATFORM: x86_64
some_constructor
main
The same information can also be found in /proc/pid/auxv.
AT_PHDR
Is the location of the program header.
AT_ENTRY
Is the entry point address for this executable.
AT_SECURE
Recall that the real user id is the uid of the user that started the process.
$ sudo setcap cap_setuid=ep ./auxv
$ getcap ./auxv
./auxv = cap_setuid+ep
Here e=effective and p=permitted.
$ env LD_SHOW_AUXV=1 ./auxv
uid: 1000
gid: 1000
AT_SECURE: 1
Notice that we don't get the auxiliary vector printed this time, since the dynamic linker ignores the LD_SHOW_AUXV environment variable in secure-execution mode.
$ env -i NODE_EXTRA=bajja ./auxv
AT_SECURE: 1
env vars:
NODE_EXTRA=bajja
uid: 1000
gid: 1000
So that was with the setuid capability set, but will setting any capability set AT_SECURE?
$ env -i NODE_EXTRA=bajja ./auxv
AT_SECURE: 0
uid: 1000
euid: 1000
gid: 1000
gid: 1000
env vars:
NODE_EXTRA=bajja
$ sudo setcap cap_net_bind_service+ep ./auxv
$ env -i NODE_EXTRA=bajja ./auxv
AT_SECURE: 1
uid: 1000
euid: 1000
gid: 1000
gid: 1000
not allowed to show env vars
$ sudo chown root:root auxv
[sudo] password for danielbevenius:
$ ls -l auxv
-rwxrwxr-x. 1 root root 25192 Mar 12 08:40 auxv
$ sudo chmod u+s auxv
$ ls -l auxv
-rwsrwxr-x. 1 root root 25192 Mar 12 08:40 auxv
$ env -i NODE_EXTRA=bajja ./auxv
AT_SECURE: 1
uid: 1000
euid: 0
gid: 1000
gid: 1000
not allowed to show env vars
LD tokens
There are a few tokens that the dynamic linker (ld.so) will expand. For example, $ORIGIN will expand to the directory where the compiled application lives.
LD secure execution mode
A binary is said to execute in secure-execution mode if AT_SECURE is set.
Real user id
The logged in user
$ id -ru
1000
$ id -run
danielbevenius
$ logname
danielbevenius
Effective user id
If we switch users then we can check the current user id after the switch
using whoami, which shows the same as the command id -un.
So this would be the user id reported after using the substitute user and group
command su.
$ su -
Password:
[root@localhost ~]# id -un
root
[root@localhost ~]# whoami
root
[root@localhost ~]# logname
danielbevenius
Notice that logname will always show the login name, which here corresponds to the real user id.
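From inside a program the two ids are available through getuid and geteuid. A small sketch (the function name is made up) that reports whether they differ, which is the case when running a setuid binary as another user:

```c
#include <sys/types.h>
#include <unistd.h>

/* Nonzero if the real and effective user ids differ, e.g. for a setuid binary */
int runs_with_changed_euid(void)
{
    uid_t real = getuid();       /* the user that started the process */
    uid_t effective = geteuid(); /* the user id used for permission checks */
    return real != effective;
}
```

For the setuid-root auxv binary shown earlier (uid 1000, euid 0) this would return nonzero.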
Capabilities
Capabilities were introduced to give more fine-grained control to processes
that need higher permissions without having to be setuid.
Setuid can be set on a binary using chmod u+s, which makes the owner of the
file the effective user when a process executes it. This is an all-or-nothing
thing: we get all the permissions or none.
Remember that when a new process is to be created fork is called,
which will make a copy of the current process. During forking
the capabilities will be copied. Normally execve will then be called, which will
replace the copied process image (the address space) with the image read from the
executable object file. If the binary has the setuid bit set and is owned by
root, all permitted and effective capabilities are enabled.
+------------------------------------+
| forked process |
| Effective set [0000000000000000] |
| Permitted set [0000000000000000] |
| Inherited set [0000000000000000] |-----+
| Ambient set [0000000000000000] |-----|---+
+------------------------------------+ | |
execve | |
↓ | |
+------------------------------------+ | |
| forked process | | |
| Effective set [0000000000000000]←+ | | |
| Permitted set [0000000000000000]←+ | | |
| Inherited set [0000000000000000] | |<----+ |
| Ambient set [0000000000000000]→+ |<--------+
+------------------------------------+
The Effective set
is the set that is checked by the kernel to allow or deny
system calls.
Each set is printed as 16 hex digits, which is 8 bytes or 64 bits. These are used as a bit pattern, and there are macros available for the values (which can be found in /usr/include/linux/capability.h). For example, CAP_NET_BIND_SERVICE is defined as:
#define CAP_NET_BIND_SERVICE 10
Now if one would like to check if a set contains one of the macros it would be easy to think that it is just a matter of using & to find out. But this is not the case, since the macro is a bit index and not a bit mask, and one has to use CAP_TO_MASK(CAP_NET_BIND_SERVICE) to get the correct value before using AND. See example in cap.c.
$ make cap
$ sudo setcap cap_net_broadcast,cap_net_bind_service+p ./cap
$ getcap ./cap
./cap = cap_net_bind_service,cap_net_broadcast+p
$ ./cap
Effective set: 0000000000000c00
Permitted set: 0000000000000c00
Inherited set: 0000000000000000
CAP_TO_MASK(CAP_NET_BIND_SERVICE): 0000000000000400
Has CAP_NET_BIND_SERVICE: 0000000000000400
One thing to note here is that while the kernel checks the Effective set, an
executable would normally only be given a permitted capability, that is using
+p and not +ep. The executable itself must be capabilities aware and raise
the capability it needs before executing a syscall. For example, to bind a
socket to a privileged port (below 1024) it would raise CAP_NET_BIND_SERVICE in
its effective set, which only works if that capability is in the permitted set.
After the call the program clears it from the effective set again.
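The raise and drop steps described above could look roughly like this. This sketch uses the raw capget and capset system calls so that it has no library dependency; cap.c may well do this differently, for example through libcap. The function names read_caps and set_effective are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/capability.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read the calling thread's capability sets; returns 0 on success. */
int read_caps(struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3])
{
    struct __user_cap_header_struct hdr;

    memset(&hdr, 0, sizeof(hdr));
    hdr.version = _LINUX_CAPABILITY_VERSION_3;
    hdr.pid = 0; /* 0 means the calling thread */
    return (int)syscall(SYS_capget, &hdr, data);
}

/* Raise (on != 0) or drop (on == 0) one capability in the effective set.
 * Raising fails unless the capability is in the permitted set. */
int set_effective(int cap, int on)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    assert(cap >= 0 && cap <= CAP_LAST_CAP);
    if (read_caps(data) != 0)
        return -1;
    if (on)
        data[CAP_TO_INDEX(cap)].effective |= CAP_TO_MASK(cap);
    else
        data[CAP_TO_INDEX(cap)].effective &= ~CAP_TO_MASK(cap);

    memset(&hdr, 0, sizeof(hdr));
    hdr.version = _LINUX_CAPABILITY_VERSION_3;
    return (int)syscall(SYS_capset, &hdr, data);
}
```

A capabilities aware program would call set_effective(CAP_NET_BIND_SERVICE, 1) before bind, check the return value, and call set_effective(CAP_NET_BIND_SERVICE, 0) afterwards.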
Remove capabilities:
$ sudo setcap -r /path/to/file
If you have a hex value and want to find the capabilities it represents, you
can use capsh:
$ capsh --decode=0000000000000400
0x0000000000000400=cap_net_bind_service
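Decoding a mask by hand comes down to walking its bits. The sketch below is not how capsh is implemented; decode_caps is a hypothetical helper that only returns capability numbers, which can then be looked up in linux/capability.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the capability numbers set in mask into out (at most max entries)
 * and return how many bits were set in total. */
size_t decode_caps(uint64_t mask, int *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL || max == 0);
    for (int cap = 0; cap < 64; cap++) {
        if (mask & ((uint64_t)1 << cap)) {
            if (n < max)
                out[n] = cap;
            n++;
        }
    }
    return n;
}
```

For 0x400 this yields the single number 10, which is CAP_NET_BIND_SERVICE.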
Where are capabilities actually checked in the kernel? (TODO)
long sys_execve(const char __user *name,
                const char __user *const __user *argv,
                const char __user *const __user *envp, struct pt_regs *regs)
{
        long error;
        char *filename;

        /* Copy the path from user space into a kernel buffer. */
        filename = getname(name);
        error = PTR_ERR(filename);
        if (IS_ERR(filename))
                return error;
        /* do_execve does the actual work of loading the new image. */
        error = do_execve(filename, argv, envp, regs);
#ifdef CONFIG_X86_32
        if (error == 0) {
                /* Make sure we don't return using sysenter.. */
                set_thread_flag(TIF_IRET);
        }
#endif
        /* Release the kernel copy of the path. */
        putname(filename);
        return error;
}
Install libcap:
$ sudo yum install libcap-devel
This can be dynamically linked with an executable using -lcap, but this might
not always be desirable as it requires that the system has this shared library.
Another option is to use a static library, libcap.a:
$ sudo dnf install libcap-static
In this case we can specify that this library should be linked statically and not dynamically like the rest (libc.so etc):
${CC} -o $@ $< -Wl,-Bstatic -lcap -Wl,-Bdynamic
Note the order here: the source, $<, comes before the libraries. If the library came first, the linker would not include its symbols, because at that point nothing has referenced them yet.