Determining C memory layout

A while ago I had to write a Python program that generated C structures as C expects to find them in memory. The challenge was that the target system had a different alignment, endianness and word size than the computer running the Python scripts. This matters especially when developing for multiple targets, such as i386 and AMD64, different ARM platforms and MIPS. In this article I refer to Python, but the result can be transferred to any other programming language.

Goals

The primary goal of this effort was to represent C structures as Python objects in order to set and get fields. To access the data structures remotely I implemented a special program that I am not going to cover here. The mapping should be automatic and correct. The C header files already exist, and changing them is not an option. Also, some structures are marked as packed using #pragma directives.

For example, I have the following C structure:

typedef struct atype_t {
    int i;
    char c;
    short s;
} atype_t;

On many 32-bit machines, an int is 32 bits long, a char 8 bits and a short 16 bits. However, these are just assumptions and do not hold in general: the C standard only guarantees minimum widths, and the actual sizes, like those of long and of pointers, depend on the target platform and its compiler.

On the Python side I want to be able to use the following code:

atype = atype_t()
atype.i = 1
atype.c = 4
atype.s = 8
byteArray = atype.bytes()

For completeness, I also want to convert a given byte array into a Python structure object.
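To make the goal concrete, here is a minimal sketch of such an object built on Python's standard struct module. The FIELDS table, the SIZE constant and the from_bytes helper are hypothetical names, and the offsets are hard-coded under the assumption of a little-endian target with a 4-byte int and a 2-byte short. Obtaining these numbers automatically is what the rest of the article is about.

```python
import struct


class atype_t:
    """Hand-written stand-in for an automatically generated structure class."""

    # (field name, byte offset, struct format). Assumed values for a
    # little-endian target with 4-byte int, 1-byte char and 2-byte short.
    FIELDS = [("i", 0, "<i"), ("c", 4, "<b"), ("s", 6, "<h")]
    SIZE = 8  # assumed total size, including one padding byte

    def __init__(self):
        for name, _offset, _fmt in self.FIELDS:
            setattr(self, name, 0)

    def bytes(self):
        """Serialize all fields into the assumed in-memory representation."""
        buf = bytearray(self.SIZE)  # padding bytes stay zero
        for name, offset, fmt in self.FIELDS:
            struct.pack_into(fmt, buf, offset, getattr(self, name))
        return bytes(buf)

    @classmethod
    def from_bytes(cls, data):
        """Build an object from a byte array with the same layout."""
        obj = cls()
        for name, offset, fmt in cls.FIELDS:
            (value,) = struct.unpack_from(fmt, data, offset)
            setattr(obj, name, value)
        return obj


atype = atype_t()
atype.i = 1
atype.c = 4
atype.s = 8
byteArray = atype.bytes()
print(byteArray.hex())  # 0100000004000800
```

Keeping the layout in a plain table means that only the table has to change when a different target is used.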

Approaches

As this seems to be a common problem, I evaluated different options to solve it.

Parsing C

The first one was to parse the C header files for typedef/struct elements in order to extract the relevant information. However, parsing C is non-trivial as it first requires running the C preprocessor. Only the preprocessed output is plain C code that C parsers can handle. For Python there exists pycparser, which aims to provide a complete AST for C. While this approach works in theory, it has several drawbacks.

Information about #defines is lost.
Alignment on the target machine can only be guessed. It would be possible to implement an algorithm that generates an alignment for structures but there is no guarantee that the compiler produces the same alignment.
Overhead: I am only interested in the structures, not in the function AST.
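The alignment-guessing algorithm mentioned in the second drawback could look roughly like the following sketch. It assumes "natural" alignment, where every scalar member is aligned to its own size and the structure is padded to a multiple of its largest member. The function name guess_layout is made up for this illustration, and a real compiler, ABI or #pragma pack may well deviate from the result.

```python
def guess_layout(members):
    """Guess offsets for (name, size) pairs using natural alignment.

    Assumption: each scalar member is aligned to its own size and the
    structure is padded to a multiple of its largest member.
    """
    offsets = {}
    offset = 0
    max_align = 1
    for name, size in members:
        align = size
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total


# int i; char c; short s; with assumed sizes 4, 1 and 2
print(guess_layout([("i", 4), ("c", 1), ("s", 2)]))
# ({'i': 0, 'c': 4, 's': 6}, 8)
```

The sizes passed in are themselves guesses, which is exactly the weakness of this approach.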

Generate bindings

There exist several tools to generate bindings for C. However, they are targeted at the machine they run on; there is no general way to set the endianness, word size and alignment of a different target. An example is ctypes, which can swap the byte order and pack structures but takes the sizes of its basic types from the host; many others are listed on the Python wiki.
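A short ctypes sketch illustrates the limitation. The class name AType is made up for this example. The sizes and offsets it prints are those of the C ABI of the machine running Python, so they describe the host rather than a separate target.

```python
import ctypes


class AType(ctypes.Structure):
    # Field types are resolved against the C ABI of the machine running
    # Python, not against the target the C code is compiled for.
    _fields_ = [
        ("i", ctypes.c_uint),
        ("s", ctypes.c_ushort),
        ("c", ctypes.c_char),
    ]


print("size:", ctypes.sizeof(AType))
for name, _ctype in AType._fields_:
    field = getattr(AType, name)
    print(name, "offset", field.offset, "size", field.size)
```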

Interfacing to the compiler

The only program that really knows the structure layout in memory is the compiler, because it decides on that layout. Hence, I started investigating whether there was a way to extract this information from the compiler.

The missing piece: .debug_info section

C compilers can store additional information in dedicated sections of the compiled output. For ELF files, there is an optional .debug_info section (it may have a different name depending on the compiler) that contains debugging data. GCC writes it in the DWARF debugging format, which is where the DW_ prefixes in the dump further down come from; other compilers may use their own layout. It is used to help debugging a program, for example to determine which variable is at which memory location. It also stores debug information about the C structures, which is what I am looking for.

Different compilers come with different tools to access the debug section content. For GCC, there is objdump from GNU binutils; for IAR, there is ielfdump. Both can print the debug section in a human-readable form. However, the format of this textual output is not documented and requires the programmer to reverse engineer it. (For GCC the tools are open source, which is not the case for the IAR C compiler.)

Example program

Throughout the article, we will be using the following sample program saved as struct.c:

#include <stdio.h>
 
typedef struct atype_t {
    unsigned int i;
    unsigned short s;
    char c;
} atype_t;
 
int main(char **argv, int argl) {
    atype_t atype;
    int i;
    atype.i = 0x12345678;
    atype.s = 0x90AB;
    atype.c = 0xCD;
 
    const unsigned char *c = (const unsigned char*) &atype;
    for (i = 0; i < sizeof(atype); i++) {
        printf("%02x ", c[i]);
    }
    printf("\n");
    return 0;
}

Running it on my machine produces the following output:

78 56 34 12 ab 90 cd 00

The least significant byte of i, 0x78, comes first, so my machine is obviously a little-endian machine.
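The same bytes can be reproduced on the Python side with the standard struct module, where the byte order is chosen explicitly instead of being inherited from the host. This is only a sketch: the format string "<IHBx" hard-codes the layout of this one structure, with "x" standing for the trailing padding byte.

```python
import struct
import sys

value = 0x12345678

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print(" ".join(f"{b:02x}" for b in little))  # 78 56 34 12
print(" ".join(f"{b:02x}" for b in big))     # 12 34 56 78

# The whole structure: unsigned int, unsigned short, one byte, one pad byte.
whole = struct.pack("<IHBx", 0x12345678, 0x90AB, 0xCD)
print(" ".join(f"{b:02x}" for b in whole))   # 78 56 34 12 ab 90 cd 00

print(sys.byteorder)  # byte order of the machine running Python
```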

Obtaining the debug section data

In order to examine the debug section, the program first needs to be compiled with debug symbols. When using GCC, this can be done by adding the -g option, for example gcc -g -o struct struct.c. This will cause GCC to record all kinds of information in different ELF sections, all prefixed by .debug_.

Next, we use the objdump utility to extract the debug section and print its content. Assuming that the program has been compiled to the binary struct, the following command reveals the debug section contents:

objdump --dwarf=info struct

Here is the output on my machine:

struct:     file format elf64-x86-64

Contents of the .debug_info section:

  Compilation Unit @ offset 0x0:
   Length:        0x12c (32-bit)
   Version:       2
   Abbrev Offset: 0
   Pointer Size:  8
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <c>   DW_AT_producer    : (indirect string, offset: 0x1e): GNU C 4.7.1	
    <10>   DW_AT_language    : 1	(ANSI C)
    <11>   DW_AT_name        : (indirect string, offset: 0x7d): struct.c	
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0x60): /home/moritz/dev/struct_test	
    <19>   DW_AT_low_pc      : 0x40055c	
    <21>   DW_AT_high_pc     : 0x4005ce	
    <29>   DW_AT_stmt_list   : 0x0	
 <1><2d>: Abbrev Number: 2 (DW_TAG_base_type)
    <2e>   DW_AT_byte_size   : 8	
    <2f>   DW_AT_encoding    : 7	(unsigned)
    <30>   DW_AT_name        : (indirect string, offset: 0x2a): long unsigned int	
 <1><34>: Abbrev Number: 2 (DW_TAG_base_type)
    <35>   DW_AT_byte_size   : 1	
    <36>   DW_AT_encoding    : 8	(unsigned char)
    <37>   DW_AT_name        : (indirect string, offset: 0x44): unsigned char	
 <1><3b>: Abbrev Number: 2 (DW_TAG_base_type)
    <3c>   DW_AT_byte_size   : 2	
    <3d>   DW_AT_encoding    : 7	(unsigned)
    <3e>   DW_AT_name        : (indirect string, offset: 0x0): short unsigned int	
 <1><42>: Abbrev Number: 2 (DW_TAG_base_type)
    <43>   DW_AT_byte_size   : 4	
    <44>   DW_AT_encoding    : 7	(unsigned)
    <45>   DW_AT_name        : (indirect string, offset: 0x2f): unsigned int	
 <1><49>: Abbrev Number: 2 (DW_TAG_base_type)
    <4a>   DW_AT_byte_size   : 1	
    <4b>   DW_AT_encoding    : 6	(signed char)
    <4c>   DW_AT_name        : (indirect string, offset: 0x46): signed char	
 <1><50>: Abbrev Number: 2 (DW_TAG_base_type)
    <51>   DW_AT_byte_size   : 2	
    <52>   DW_AT_encoding    : 5	(signed)
    <53>   DW_AT_name        : (indirect string, offset: 0x86): short int	
 <1><57>: Abbrev Number: 3 (DW_TAG_base_type)
    <58>   DW_AT_byte_size   : 4	
    <59>   DW_AT_encoding    : 5	(signed)
    <5a>   DW_AT_name        : int	
 <1><5e>: Abbrev Number: 2 (DW_TAG_base_type)
    <5f>   DW_AT_byte_size   : 8	
    <60>   DW_AT_encoding    : 5	(signed)
    <61>   DW_AT_name        : (indirect string, offset: 0x57): long int	
 <1><65>: Abbrev Number: 2 (DW_TAG_base_type)
    <66>   DW_AT_byte_size   : 8	
    <67>   DW_AT_encoding    : 7	(unsigned)
    <68>   DW_AT_name        : (indirect string, offset: 0x90): sizetype	
 <1><6c>: Abbrev Number: 4 (DW_TAG_pointer_type)
    <6d>   DW_AT_byte_size   : 8	
    <6e>   DW_AT_type        : <0x72>	
 <1><72>: Abbrev Number: 2 (DW_TAG_base_type)
    <73>   DW_AT_byte_size   : 1	
    <74>   DW_AT_encoding    : 6	(signed char)
    <75>   DW_AT_name        : (indirect string, offset: 0x4d): char	
 <1><79>: Abbrev Number: 5 (DW_TAG_structure_type)
    <7a>   DW_AT_name        : (indirect string, offset: 0x3c): atype_t	
    <7e>   DW_AT_byte_size   : 8	
    <7f>   DW_AT_decl_file   : 1	
    <80>   DW_AT_decl_line   : 3	
    <81>   DW_AT_sibling     : <0xaa>	
 <2><85>: Abbrev Number: 6 (DW_TAG_member)
    <86>   DW_AT_name        : i	
    <88>   DW_AT_decl_file   : 1	
    <89>   DW_AT_decl_line   : 4	
    <8a>   DW_AT_type        : <0x42>	
    <8e>   DW_AT_data_member_location: 2 byte block: 23 0 	(DW_OP_plus_uconst: 0)
 <2><91>: Abbrev Number: 6 (DW_TAG_member)
    <92>   DW_AT_name        : s	
    <94>   DW_AT_decl_file   : 1	
    <95>   DW_AT_decl_line   : 5	
    <96>   DW_AT_type        : <0x3b>	
    <9a>   DW_AT_data_member_location: 2 byte block: 23 4 	(DW_OP_plus_uconst: 4)
 <2><9d>: Abbrev Number: 6 (DW_TAG_member)
    <9e>   DW_AT_name        : c	
    <a0>   DW_AT_decl_file   : 1	
    <a1>   DW_AT_decl_line   : 6	
    <a2>   DW_AT_type        : <0x72>	
    <a6>   DW_AT_data_member_location: 2 byte block: 23 6 	(DW_OP_plus_uconst: 6)
 <1><aa>: Abbrev Number: 7 (DW_TAG_typedef)
    <ab>   DW_AT_name        : (indirect string, offset: 0x3c): atype_t	
    <af>   DW_AT_decl_file   : 1	
    <b0>   DW_AT_decl_line   : 7	
    <b1>   DW_AT_type        : <0x79>	
 <1><b5>: Abbrev Number: 8 (DW_TAG_subprogram)
    <b6>   DW_AT_external    : 1	
    <b7>   DW_AT_name        : (indirect string, offset: 0x52): main	
    <bb>   DW_AT_decl_file   : 1	
    <bc>   DW_AT_decl_line   : 9	
    <bd>   DW_AT_prototyped  : 1	
    <be>   DW_AT_type        : <0x57>	
    <c2>   DW_AT_low_pc      : 0x40055c	
    <ca>   DW_AT_high_pc     : 0x4005ce	
    <d2>   DW_AT_frame_base  : 0x0	(location list)
    <d6>   DW_AT_GNU_all_tail_call_sites: 1	
    <d7>   DW_AT_sibling     : <0x11e>	
 <2><db>: Abbrev Number: 9 (DW_TAG_formal_parameter)
    <dc>   DW_AT_name        : (indirect string, offset: 0x99): argv	
    <e0>   DW_AT_decl_file   : 1	
    <e1>   DW_AT_decl_line   : 9	
    <e2>   DW_AT_type        : <0x11e>	
    <e6>   DW_AT_location    : 2 byte block: 91 48 	(DW_OP_fbreg: -56)
 <2><e9>: Abbrev Number: 9 (DW_TAG_formal_parameter)
    <ea>   DW_AT_name        : (indirect string, offset: 0x19): argl	
    <ee>   DW_AT_decl_file   : 1	
    <ef>   DW_AT_decl_line   : 9	
    <f0>   DW_AT_type        : <0x57>	
    <f4>   DW_AT_location    : 2 byte block: 91 44 	(DW_OP_fbreg: -60)
 <2><f7>: Abbrev Number: 10 (DW_TAG_variable)
    <f8>   DW_AT_name        : (indirect string, offset: 0x13): atype	
    <fc>   DW_AT_decl_file   : 1	
    <fd>   DW_AT_decl_line   : 10	
    <fe>   DW_AT_type        : <0xaa>	
    <102>   DW_AT_location    : 2 byte block: 91 50 	(DW_OP_fbreg: -48)
 <2><105>: Abbrev Number: 11 (DW_TAG_variable)
    <106>   DW_AT_name        : i	
    <108>   DW_AT_decl_file   : 1	
    <109>   DW_AT_decl_line   : 11	
    <10a>   DW_AT_type        : <0x57>	
    <10e>   DW_AT_location    : 2 byte block: 91 6c 	(DW_OP_fbreg: -20)
 <2><111>: Abbrev Number: 11 (DW_TAG_variable)
    <112>   DW_AT_name        : c	
    <114>   DW_AT_decl_file   : 1	
    <115>   DW_AT_decl_line   : 16	
    <116>   DW_AT_type        : <0x124>	
    <11a>   DW_AT_location    : 2 byte block: 91 60 	(DW_OP_fbreg: -32)
 <1><11e>: Abbrev Number: 4 (DW_TAG_pointer_type)
    <11f>   DW_AT_byte_size   : 8	
    <120>   DW_AT_type        : <0x6c>	
 <1><124>: Abbrev Number: 4 (DW_TAG_pointer_type)
    <125>   DW_AT_byte_size   : 8	
    <126>   DW_AT_type        : <0x12a>	
 <1><12a>: Abbrev Number: 12 (DW_TAG_const_type)
    <12b>   DW_AT_type        : <0x34>

This is a lot of output. Let's explain the different parts.

Each line represents either the start of a new entity or adds an attribute to one. All entities and attributes have a unique address, encoded in hex. Each entity has a type, which determines the attributes it has. The entity line consists of a nesting level and the unique address, followed by the entity's type ID and a human-readable name of the type. Attribute definitions start with the unique address followed by the name of the attribute and a colon. After the colon, the actual value of the attribute follows. It can be a number, a string or a pointer to another entity. A pointer is encoded as <0x34>, which references the entity at address 0x34. This way, the debug info tree can be traversed.
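The line format described here can be turned into a small parser. The following is a sketch under the assumption that the dump looks exactly like the output shown above. The regular expressions are derived from that output rather than from a documented format, and the names parse_dump, ENTITY_RE, ATTR_RE and INDIRECT_RE are made up for this example.

```python
import re

# Patterns for the two kinds of lines in the dump. They are derived from
# one particular objdump output, not from a documented format.
ENTITY_RE = re.compile(
    r"^\s*<(\d+)><([0-9a-f]+)>: Abbrev Number: \d+ \((DW_TAG_\w+)\)")
ATTR_RE = re.compile(r"^\s*<[0-9a-f]+>\s+(DW_AT_\w+)\s*:\s*(.*?)\s*$")
INDIRECT_RE = re.compile(r"^\(indirect string, offset: 0x[0-9a-f]+\): (.*)$")


def parse_dump(text):
    """Return {address: entity}; an entity is a dict with the keys
    'level', 'tag' and 'attrs' (attribute name -> raw string value)."""
    entities = {}
    current = None
    for line in text.splitlines():
        match = ENTITY_RE.match(line)
        if match:
            level, address, tag = match.groups()
            current = {"level": int(level), "tag": tag, "attrs": {}}
            entities[int(address, 16)] = current
            continue
        match = ATTR_RE.match(line)
        if match and current is not None:
            name, value = match.groups()
            indirect = INDIRECT_RE.match(value)
            # Strip the "(indirect string, ...)" prefix from string values.
            current["attrs"][name] = indirect.group(1) if indirect else value
    return entities


sample = """
 <1><3b>: Abbrev Number: 2 (DW_TAG_base_type)
    <3c>   DW_AT_byte_size   : 2
    <3e>   DW_AT_name        : (indirect string, offset: 0x0): short unsigned int
 <1><79>: Abbrev Number: 5 (DW_TAG_structure_type)
    <7a>   DW_AT_name        : (indirect string, offset: 0x3c): atype_t
    <7e>   DW_AT_byte_size   : 8
"""
entities = parse_dump(sample)
print(entities[0x79]["attrs"]["DW_AT_name"])  # atype_t
```

Attribute values are kept as raw strings on purpose, so that interpreting them (numbers, pointers, location expressions) stays a separate step.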

Finding the size information

First, it tells us what format the file has. In this case it is an elf64-x86-64. We know that pointers are 64 bits long on this platform.

Next, it shows the contents of the debug info section. There can be several compilation units; here it is just one. It tells us that pointers indeed have a length of 8 bytes. The length of this compilation unit is 0x12c bytes.

The first compilation unit is marked with <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit). It means that at nesting level 0, address 0xb in the section is an entity of type 1, which is a DW_TAG_compile_unit. It has several attributes, which are shown on the next few lines. None of them is interesting for us.

The first children of the compile unit are DW_TAG_base_type entities. Each one tells us how long the base type is in bytes (DW_AT_byte_size), what encoding it has (DW_AT_encoding) and what its name is (DW_AT_name).
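For the Python side, the byte size and encoding of a base type are enough to pick a format character for the standard struct module. The sketch below covers only the integer encodings that appear in the dump (5 = signed, 6 = signed char, 7 = unsigned, 8 = unsigned char). The helper name struct_format is an assumption of this example, and floating-point and other encodings are deliberately left out.

```python
import struct

# DW_AT_encoding values as they appear in the dump.
SIGNED = {5, 6}      # signed, signed char
UNSIGNED = {7, 8}    # unsigned, unsigned char

# Format characters of the struct module in standard-size mode ("<" or ">").
BY_SIZE = {1: "b", 2: "h", 4: "i", 8: "q"}


def struct_format(encoding, byte_size):
    """Map an integer base type to a struct format character."""
    char = BY_SIZE[byte_size]
    if encoding in UNSIGNED:
        return char.upper()
    if encoding in SIGNED:
        return char
    raise ValueError(f"unsupported encoding {encoding}")


fmt = struct_format(7, 4)                # "unsigned int" from the dump
print(fmt, struct.calcsize("<" + fmt))   # I 4
```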

Skipping to address 0x79, it gets more interesting: this is the definition of our structure.

 <1><79>: Abbrev Number: 5 (DW_TAG_structure_type)

It shows that our structure occupies 8 bytes in memory, which we can verify using the output of our sample program. Nested under the structure type, we find several DW_TAG_member children which represent the structure members. Most interesting is DW_AT_data_member_location, which tells us at which offset inside the structure a member is located. The member i is at offset 0, s is at 4 and c at 6. This leads us to the following in-memory structure:

i3 i2 i1 i0 s1 s0 c0 00

Here, iN stands for the individual bytes of the integer i; the same holds for s and c. Which of these bytes is the most significant one depends on the endianness of the target. There is one byte of padding at the end. The size of each member type can be determined by following the DW_AT_type pointer inside the debug info section.
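Putting the pieces together, the member offsets and sizes can be resolved and used to build the byte array. In the sketch below, the dictionaries base_types and members are a hand-copied excerpt of the dump in a hypothetical parsed form, and member_layout is a made-up helper. It assumes the DW_OP_plus_uconst form of the location shown in the dump and a little-endian target.

```python
import re

# Hand-copied excerpt of the dump: address -> base type attributes.
base_types = {
    0x42: {"name": "unsigned int", "byte_size": 4},
    0x3B: {"name": "short unsigned int", "byte_size": 2},
    0x72: {"name": "char", "byte_size": 1},
}
# The DW_TAG_member children of atype_t with their raw attribute strings.
members = [
    {"name": "i", "type": "<0x42>",
     "location": "2 byte block: 23 0 \t(DW_OP_plus_uconst: 0)"},
    {"name": "s", "type": "<0x3b>",
     "location": "2 byte block: 23 4 \t(DW_OP_plus_uconst: 4)"},
    {"name": "c", "type": "<0x72>",
     "location": "2 byte block: 23 6 \t(DW_OP_plus_uconst: 6)"},
]


def member_layout(member):
    """Return (name, offset, size) for one structure member."""
    match = re.search(r"DW_OP_plus_uconst: (\d+)", member["location"])
    offset = int(match.group(1))
    type_ref = int(member["type"].strip("<>"), 16)  # "<0x42>" -> 0x42
    return member["name"], offset, base_types[type_ref]["byte_size"]


layout = [member_layout(m) for m in members]
print(layout)  # [('i', 0, 4), ('s', 4, 2), ('c', 6, 1)]

buf = bytearray(8)  # DW_AT_byte_size of the structure; padding stays zero
values = {"i": 0x12345678, "s": 0x90AB, "c": 0xCD}
for name, offset, size in layout:
    # Little-endian is an assumption; the dump itself does not state it.
    buf[offset:offset + size] = values[name].to_bytes(size, "little")
print(" ".join(f"{b:02x}" for b in buf))  # 78 56 34 12 ab 90 cd 00
```

The result matches the output of the sample program, which is a useful sanity check for the extracted layout.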

Problems

This approach has the downside that it depends on the exact textual format of the debug dump, which may change between tool versions. Also, the dump does not specify the endianness of the types.
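One possible workaround for the missing endianness, sketched here as a suggestion rather than as part of the original tool chain: the ELF file header itself records the word size class and the byte order in its fifth and sixth bytes (EI_CLASS and EI_DATA). The function name elf_class_and_byte_order is made up for this example.

```python
def elf_class_and_byte_order(header):
    """Read word size class and byte order from the start of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}[header[4]]              # EI_CLASS
    order = {1: "little", 2: "big"}[header[5]]    # EI_DATA
    return bits, order


# Usage with a real file, e.g. the binary "struct" compiled earlier:
#     with open("struct", "rb") as f:
#         print(elf_class_and_byte_order(f.read(6)))

# Self-contained demonstration on a hand-made header prefix:
print(elf_class_and_byte_order(b"\x7fELF\x02\x01"))  # (64, 'little')
```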

IAR

The same can be done with the tools that come with the IAR C/C++ compiler. The ELF dump utility is called ielfdump.

wiki.antiguru.de
