Wednesday, May 16, 2012

ARM Linux booting

ARM Linux kernel boot flow:

(* arch/arm/boot/bootp/init.S - some people say this program runs first. I am not yet sure under what circumstances it runs; it does not run on my platform.)



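The fastest way to understand how a program is structured is to start with its directory layout and Makefiles, and the kernel is no exception. The ARM-specific boot code lives in arch/arm/boot.

Broadly speaking, Linux kernel images come in two kinds, uncompressed and compressed; the uncompressed format is rarely used.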

Common Linux kernel image formats
Format     Description
zImage     zipped (compressed) image
xipImage   execute-in-place image
uImage     U-Boot wrapped image

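On ARM platforms, zImage is the most common format. The Makefile in the boot directory shows that the zImage code lives in arch/arm/boot/compressed, and the Makefile in the compressed directory shows that zImage is built mainly from head.o + misc.o + piggy.o plus some platform-specific code. The more important files are: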

File      Description
head.o    kernel entry point
misc.o    decompression helper routines
piggy.o   the compressed kernel image
*.o       other platform-specific code



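According to Documentation/arm/booting.txt,
the boot loader must satisfy the following conditions before calling the kernel:

- Quiesce all DMA-capable devices so that memory is not corrupted by bogus network packets or disk data. This will save you many hours of debugging.

- CPU register settings
r0 = 0
r1 = machine type ID
r2 = physical address of the tagged list in RAM

- CPU mode
All interrupts must be disabled (IRQs and FIQs)
The CPU must be in SVC mode (Angel is a special exception)

- Caches, MMUs
The MMU must be off
The instruction cache (I-cache) may be on or off
The data cache (D-cache) must be off

- The boot loader is expected to call the kernel image by jumping directly to its first instruction.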
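
The entry conditions above can be written down as a small check. This is only a sketch, not kernel code: the function name check_entry_state and its parameters are hypothetical, the Angel exception is not modelled, and the bit positions (cpsr I = bit 7, F = bit 6, mode = bits [4:0]; control register M = bit 0, C = bit 2) are the standard ARM ones.

```python
# Hypothetical helper: validates the state a boot loader hands to the kernel.
SVC_MODE = 0b10011
CPSR_I = 1 << 7   # IRQ disable bit
CPSR_F = 1 << 6   # FIQ disable bit
CTRL_M = 1 << 0   # MMU enable bit in the cp15 control register
CTRL_C = 1 << 2   # D-cache enable bit in the cp15 control register


def check_entry_state(r0, cpsr, control):
    """Return the list of violated boot conditions (empty means OK)."""
    problems = []
    if r0 != 0:
        problems.append("r0 must be 0")
    if cpsr & 0x1f != SVC_MODE:
        problems.append("CPU must be in SVC mode")
    if cpsr & (CPSR_I | CPSR_F) != (CPSR_I | CPSR_F):
        problems.append("IRQs and FIQs must be disabled")
    if control & CTRL_M:
        problems.append("MMU must be off")
    if control & CTRL_C:
        problems.append("D-cache must be off")
    return problems
```

A cpsr of 0xd3 (SVC mode, I and F set) with only the I-cache bit (bit 12) set in the control register passes, since the I-cache may be on or off.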



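The files run in the following order, and roughly do the following:
  • arch/arm/boot/compressed/head.S
    1. After the jump from the boot loader, the first code executed is at the start: label.
    2. To keep the ability to boot without a boot loader, it executes nop eight times to reserve space for the ARM interrupt vector table.
    3. Save the values of r1 and r2 in r7 and r8.
    4. Disable FIQ and IRQ.
    5. Run the platform-specific code that is linked in at this point, such as "head-xscale.S".
    6. Compute the offset delta in memory and fix up each section.
    7. Set up the C runtime environment.
    8. Set up and enable the MMU.
    9. Check the compressed image size.
    10. Set up the arguments in r0 to r3 and call decompress_kernel in arch/arm/boot/compressed/misc.c.
  • arch/arm/boot/compressed/misc.c
    1. This file mainly provides the decompress_kernel function for arch/arm/boot/compressed/head.S to call.
    2. It declares functions that print a character and a string; decompress_kernel uses them to print progress strings such as the decompress kernel message.
    3. It calls the gunzip function in lib/inflate.c to decompress the kernel image.
    4. It returns to arch/arm/boot/compressed/head.S.
  • arch/arm/kernel/head.S
      ---> checks the processor & machine ID
  • arch/arm/kernel/head-common.S
      ---> error-reporting routines for a bad machine ID (when low-level debug is enabled)
  • init/main.c
      ---> start_kernel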


Detailed trace:
arch/arm/boot/compressed/head.S

  .section ".start", #alloc, #execinstr
/*
 * sort out different calling conventions
 */
  .align
start:
  .type start,#function
  .rept 8
  mov r0, r0
  .endr

  b 1f
  .word 0x016f2818  @ Magic numbers to help the loader
  .word start   @ absolute load/run zImage address
  .word _edata   @ zImage end address
1:  mov r7, r1   @ save architecture ID
  mov r8, r2   @ save atags pointer


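The comment says this sorts out different calling conventions, hence the .align for alignment.
Reference: http://wiki.debian.org/ArmEabiPort#Struct_packing_and_alignment

The code starts by executing mov r0, r0 (i.e. nop) eight times (.rept = repeat, .endr = end of repeat).
There are two explanations for this: one is that ARM11 has an 8-stage pipeline, the other is that it reserves space for the interrupt vector table.
Personally I think the space is reserved for the interrupt vector table so that the Linux kernel keeps the ability to boot directly, because flushing the pipeline does not take eight nops.

The next instruction is b 1f (f = forward), which jumps forward to the label "1:".

Before the label "1:" there are three .word entries: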

zImage header information
Offset in zImage   Content      Description
0x24               0x016F2818   magic number that identifies an ARM Linux zImage
0x28               start        zImage start address (usually 0)
0x2C               _edata       zImage end address (usually the file size)

Reference: http://www.simtec.co.uk/products/SWLINUX/files/booting_article.html

Starting at label "1:", the values of r1 (architecture ID) and r2 (atags pointer) are saved in r7 and r8.
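
The header layout above can be read back with a few lines of Python. This is a minimal sketch under stated assumptions: the image is little-endian, the function name parse_zimage_header is made up for illustration, and the byte string named fake is a hand-built stand-in rather than a real kernel.

```python
import struct

ZIMAGE_MAGIC = 0x016F2818


def parse_zimage_header(image):
    """Return (start, end) read from the three words after the branch."""
    # 8 nops * 4 bytes = 0x20, "b 1f" sits at 0x20, so the words begin at 0x24.
    magic, start, end = struct.unpack_from("<III", image, 0x24)
    if magic != ZIMAGE_MAGIC:
        raise ValueError("not an ARM Linux zImage")
    return start, end


# Hand-built header for illustration only (assumed little-endian).
NOP = 0xE1A00000                        # encoding of "mov r0, r0"
fake = struct.pack("<8I", *([NOP] * 8))
fake += struct.pack("<I", 0xEA000002)   # "b 1f": branches over the three words
fake += struct.pack("<III", ZIMAGE_MAGIC, 0, 0x123456)
```

Calling parse_zimage_header(fake) gives back the start and end words that were packed in, and a buffer with a different word at 0x24 is rejected.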

#ifndef __ARM_ARCH_2__
  /*
   * Booting from Angel - need to enter SVC mode and disable
   * FIQs/IRQs (numeric definitions from angel arm.h source).
   * We only do this if we were in user mode on entry.
   */
  mrs r2, cpsr  @ get current mode
  tst r2, #3   @ not user?
  bne not_angel
  mov r0, #0x17  @ angel_SWIreason_EnterSVC
  swi 0x123456  @ angel_SWI_ARM
not_angel:
  mrs r2, cpsr  @ turn off interrupts to
  orr r2, r2, #0xc0  @ prevent angel from running
  msr cpsr_c, r2
#else
  teqp pc, #0x0c000003  @ turn off interrupts
#endif

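This code enters SVC mode if we were booted from Angel in user mode, then disables all interrupts (IRQ and FIQ).

For more about Angel, see:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0066d/Babdcdih.html

From the ARM Architecture Reference Manual:

The 32-bit cpsr register: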

Name     Desc              Bits
cpsr_f   flags field       [31:24]
cpsr_s   status field      [23:16]
cpsr_x   extension field   [15:8]
cpsr_c   control field     [7:0]



cpsr bit field
Bit  31 30 29 28 27 26-8      7 6 5 4  3  2  1  0
     N  Z  C  V  Q  DNM(RAZ)  I F T M4 M3 M2 M1 M0



32-bit mode settings
M[4:0]    Mode
0b10000   User
0b10001   FIQ
0b10010   IRQ
0b10011   Supervisor (SVC)
0b10111   Abort
0b11011   Undefined
0b11111   System



The 26-bit pc register:

Bit 31 30 29 28 27 26 25------------2 1  0
    N  Z  C  V  I  F  program counter M1 M0

26-bit mode settings:
M[1:0] Mode
0b00 User
0b01 FIQ
0b10 IRQ
0b11 Supervisor

N Negative
Z Zero
C Carry
V Overflow
I IRQ
F FIQ
T Thumb
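
To make the 32-bit tables concrete, here is a small sketch that decodes a cpsr value and mirrors the two bit operations in the assembly above ("tst r2, #3" and "orr r2, r2, #0xc0"). The function names are hypothetical; only the bit positions come from the tables.

```python
MODES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ",
    0b10011: "Supervisor (SVC)", 0b10111: "Abort",
    0b11011: "Undefined", 0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit cpsr value into the fields shown in the tables."""
    return {
        "N": (cpsr >> 31) & 1, "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1, "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1, "F": (cpsr >> 6) & 1, "T": (cpsr >> 5) & 1,
        "mode": MODES.get(cpsr & 0x1f, "unknown"),
    }


def entered_in_user_mode(cpsr):
    # Mirrors "tst r2, #3": of the listed modes only User has both low bits clear.
    return cpsr & 3 == 0


def mask_interrupts(cpsr):
    # Mirrors "orr r2, r2, #0xc0": set the I and F bits.
    return cpsr | 0xc0
```

For example, masking interrupts on 0x13 (SVC mode, interrupts enabled) yields 0xd3, which decodes to I = 1, F = 1 and Supervisor mode.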

  /*
   * Note that some cache flushing and other stuff may
   * be needed here - is there an Angel SWI call for this?
   */

  /*
   * some architecture specific code can be inserted
   * by the linker here, but it should preserve r7, r8, and r9.
   */

  .text
  adr r0, LC0
  ldmia r0, {r1, r2, r3, r4, r5, r6, ip, sp}
  subs r0, r0, r1  @ calculate the delta offset

      @ if delta is zero, we are
  beq not_relocated  @ running at the address we
      @ were linked at.

  /*
   * We're running at a different address.  We need to fix
   * up various pointers:
   *   r5 - zImage base address
   *   r6 - GOT start
   *   ip - GOT end
   */
  add r5, r5, r0
  add r6, r6, r0
  add ip, ip, r0

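The comment notes that some platform-specific code (such as head-xscale.S) is linked in here, and that it must preserve r7, r8 and r9.
This code mainly computes the delta offset and relocates the zImage base address and the GOT (global offset table) addresses.

r0: LC0 runtime address
r1: LC0 linked address
r0 - r1: delta offset

adr is a pseudo-instruction: the assembler works out the pc-relative offset for you and replaces it with an add or sub instruction.
Here it yields the run-time address of LC0, and ldmia then loads the words stored there into the registers.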
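
The delta arithmetic is easy to try out with made-up numbers. In this sketch the function names and every address are hypothetical; the only thing taken from the code is the subtraction followed by 32-bit additions.

```python
MASK32 = 0xFFFFFFFF


def relocation_delta(lc0_runtime, lc0_linked):
    """r0 = r0 - r1, wrapped to 32 bits like the subs instruction."""
    return (lc0_runtime - lc0_linked) & MASK32


def fix_up(value, delta):
    """Mirror of 'add rX, rX, r0', as applied to r5, r6 and ip."""
    return (value + delta) & MASK32


# Assumed example: image linked at 0 but loaded at 0x80800000,
# with LC0 sitting 0x130 bytes into the image.
delta = relocation_delta(0x80800130, 0x00000130)
zimage_base = fix_up(0x00000000, delta)
```

A delta of zero means the image runs where it was linked, which is the not_relocated case. The wrap to 32 bits also makes a "negative" delta work, since adding 0xFFFFF000 is the same as subtracting 0x1000.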

LC0 is defined as follows:

  .type LC0, #object
LC0:  .word LC0   @ r1
  .word __bss_start  @ r2
  .word _end   @ r3
  .word zreladdr  @ r4
  .word _start   @ r5
  .word _got_start  @ r6
  .word _got_end  @ ip
  .word user_stack+4096  @ sp
LC1:  .word reloc_end - reloc_start
  .size LC0, . - LC0

r0: LC0 runtime address
r1: LC0 linked address
r2: BSS start
r3: BSS end
r4: memory physical address
r5: kernel start address
r6: GOT start
ip: GOT end
sp: stack pointer

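The stack pointer is set to user_stack + 4096 because the stack grows downward from high addresses to low ones, so sp starts at the very end of the stack area.
The last line of the file reads:
user_stack: .space 4096
The other registers are used later; r4, for example, is used when setting up the MMU.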

#ifndef CONFIG_ZBOOT_ROM
  /*
   * If we're running fully PIC === CONFIG_ZBOOT_ROM = n,
   * we need to fix up pointers into the BSS region.
   *   r2 - BSS start
   *   r3 - BSS end
   *   sp - stack pointer
   */
  add r2, r2, r0
  add r3, r3, r0
  add sp, sp, r0

  /*
   * Relocate all entries in the GOT table.
   */
1:  ldr r1, [r6, #0]  @ relocate entries in the GOT
  add r1, r1, r0  @ table.  This fixes up the
  str r1, [r6], #4  @ C references.
  cmp r6, ip
  blo 1b
#else

  /*
   * Relocate entries in the GOT table.  We only relocate
   * the entries that are outside the (relocated) BSS region.
   */
1:  ldr r1, [r6, #0]  @ relocate entries in the GOT
  cmp r1, r2   @ entry < bss_start ||
  cmphs r3, r1   @ _end < entry
  addlo r1, r1, r0  @ table.  This fixes up the
  str r1, [r6], #4  @ C references.
  cmp r6, ip
  blo 1b
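#endif

I do not know what ZBOOT is, but it is clear that this part relocates the BSS pointers and the GOT.

The GOT (Global Offset Table) holds pointers to objects: the linker cannot know where an object will sit in memory at run time, so the program has to look up the object's address through the GOT.
GOT reference: http://bottomupcs.sourceforge.net/csbu/x3824.htm

The final blo is the same as bcc, so this code is equivalent to if (r6 < ip) goto 1: (b = backward, i.e. jump back to the previous label 1).
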
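
Both GOT loops can be modelled in a few lines. This is a sketch only: relocate_got is a hypothetical name, the GOT is a plain Python list, and the addresses in the example are invented. Without a BSS range it behaves like the first loop (every entry is adjusted); with a BSS range it behaves like the second loop (only entries outside bss_start..end are adjusted).

```python
MASK32 = 0xFFFFFFFF


def relocate_got(got, delta, bss_start=None, bss_end=None):
    """Return the GOT entries after relocation by delta."""
    relocated = []
    for entry in got:
        # Second loop: "cmp r1, r2 / cmphs r3, r1 / addlo" adds delta only
        # when entry < bss_start or bss_end < entry.
        outside_bss = (bss_start is None
                       or entry < bss_start
                       or bss_end < entry)
        if outside_bss:
            entry = (entry + delta) & MASK32
        relocated.append(entry)
    return relocated
```

With a BSS range of 0x400 to 0x800, the entries 0x100 and 0x900 move by delta while 0x500 stays put.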
not_relocated: mov r0, #0
1:  str r0, [r2], #4  @ clear bss
  str r0, [r2], #4
  str r0, [r2], #4
  str r0, [r2], #4
  cmp r2, r3
  blo 1b

This part is simple: it sets the whole BSS section to 0, because the C language requires BSS (uninitialized data) to start out as 0.
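
One detail of the loop is worth seeing in action: it stores four words per pass and only compares afterwards, so it can write up to three words past the end when the BSS size is not a multiple of 16 bytes. The sketch below models memory as a dict of word addresses; clear_bss and the addresses are hypothetical.

```python
def clear_bss(memory, bss_start, bss_end):
    """Zero words from bss_start upward, four per pass, like the asm loop.

    memory maps word-aligned addresses to 32-bit values.
    Returns the address at which the loop stopped.
    """
    addr = bss_start
    while True:
        for _ in range(4):        # four "str r0, [r2], #4"
            memory[addr] = 0
            addr += 4
        if addr >= bss_end:       # "cmp r2, r3" then "blo 1b"
            return addr


# Assumed example: BSS from 0x100 to 0x124, surrounded by junk.
mem = {a: 0xDEADBEEF for a in range(0x100, 0x140, 4)}
stopped_at = clear_bss(mem, 0x100, 0x124)
```

In this example the loop stops at 0x130, so the words at 0x124, 0x128 and 0x12c are zeroed even though they lie past the end.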

  /*
   * The C runtime environment should now be setup
   * sufficiently.  Turn the cache on, set up some
   * pointers, and start decompressing.
   */
  bl cache_on

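By now the C runtime environment should be set up sufficiently. The code turns on the cache, sets up some pointers, and starts decompressing.
There is only one instruction here, bl cache_on; bl saves the return address (the address of the next instruction) in lr so that the callee can return.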

/*
 * Turn on the cache.  We need to setup some page tables so that we
 * can have both the I and D caches on.
 *
 * We place the page tables 16k down from the kernel execution address,
 * and we hope that nothing else is using it.  If we're using it, we
 * will go pop!
 *
 * On entry,
 *  r4 = kernel execution address
 *  r6 = processor ID
 *  r7 = architecture number
 *  r8 = atags pointer
 *  r9 = run-time address of "start"  (???)
 * On exit,
 *  r1, r2, r3, r9, r10, r12 corrupted
 * This routine must preserve:
 *  r4, r5, r6, r7, r8
 */
  .align 5
cache_on: mov r3, #8   @ cache_on function
  b call_cache_fn

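For the sake of the table that follows, .align 5 aligns the code to a 32-byte boundary; r3 = #8 is the offset of the cache_on instruction within a table entry.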

/*
 * Here follow the relocatable cache support functions for the
 * various processors.  This is a generic hook for locating an
 * entry and jumping to an instruction at the specified offset
 * from the start of the block.  Please note this is all position
 * independent code.
 *
 *  r1  = corrupted
 *  r2  = corrupted
 *  r3  = block offset
 *  r6  = corrupted
 *  r12 = corrupted
 */

call_cache_fn: adr r12, proc_types
#ifdef CONFIG_CPU_CP15
  mrc p15, 0, r6, c0, c0 @ get processor ID
#else
  ldr r6, =CONFIG_PROCESSOR_ID
#endif
1:  ldr r1, [r12, #0]  @ get value
  ldr r2, [r12, #4]  @ get mask
  eor r1, r1, r6  @ (real ^ match)
  tst r1, r2   @       & mask
  addeq pc, r12, r3  @ call cache function
  add r12, r12, #4*5
  b 1b

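(1) Get the CPU ID.
(2) Walk the table starting at proc_types and compare each entry against the CPU ID to find the matching routines.
(3) Run the matching routine by writing to the pc register directly.

r12: address of the current proc_types entry
r1: CPU ID match value from the table (then real ^ match)
r2: CPU ID mask
r3: #8 (r12 + #8 is the cache on instruction)
r6: the real CPU ID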

/*
 * Table for cache operations.  This is basically:
 *   - CPU ID match
 *   - CPU ID mask
 *   - 'cache on' method instruction
 *   - 'cache off' method instruction
 *   - 'cache flush' method instruction
 *
 * We match an entry using: ((real_id ^ match) & mask) == 0
 *
 * Writethrough caches generally only need 'on' and 'off'
 * methods.  Writeback caches _must_ have the flush method
 * defined.
 */
  .type proc_types,#object
proc_types:

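The other table entries are omitted; only the one that is actually taken is listed. The PXA3xx CPU ID is 0x690568xx, which belongs to the ARMv5TE family.
Note: see Table 9 in section 2.16.6.2 of Datasheet Volume I.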

  .word 0x00050000  @ ARMv5TE
  .word 0x000f0000
  b __armv4_mmu_cache_on
  b __armv4_mmu_cache_off
  b __armv4_mmu_cache_flush

So __armv4_mmu_cache_on is what runs next.
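
The lookup can be sketched as follows. The table here holds only the ARMv5TE entry quoted above, the function name find_cache_on is made up, and 0x69056800 is just one possible value of 0x690568xx. Unlike the assembly, which keeps walking the table, the sketch returns None when nothing matches.

```python
# (match, mask, name of the 'cache on' routine); only the entry quoted above.
PROC_TYPES = [
    (0x00050000, 0x000F0000, "__armv4_mmu_cache_on"),
]
ENTRY_SIZE = 4 * 5    # five words per entry, as in "add r12, r12, #4*5"
CACHE_ON_OFFSET = 8   # value of r3: skips the match and mask words


def find_cache_on(cpu_id):
    """Return (byte offset from proc_types, routine name) or None."""
    for index, (match, mask, name) in enumerate(PROC_TYPES):
        if (cpu_id ^ match) & mask == 0:      # ((real ^ match) & mask) == 0
            return index * ENTRY_SIZE + CACHE_ON_OFFSET, name
    return None
```

For the PXA3xx ID, bits [19:16] are 5, so (0x69056800 ^ 0x00050000) & 0x000f0000 is 0 and the entry matches.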

__armv4_mmu_cache_on:
  mov r12, lr
  bl __setup_mmu
  mov r0, #0
  mcr p15, 0, r0, c7, c10, 4 @ drain write buffer
  mcr p15, 0, r0, c8, c7, 0 @ flush I,D TLBs
  mrc p15, 0, r0, c1, c0, 0 @ read control reg
  orr r0, r0, #0x5000  @ I-cache enable, RR cache replacement
  orr r0, r0, #0x0030
  bl __common_mmu_cache_on
  mov r0, #0
  mcr p15, 0, r0, c8, c7, 0 @ flush I,D TLBs
  mov pc, r12

lr is saved in r12, then bl jumps to __setup_mmu, so let's look at __setup_mmu first:

__setup_mmu: sub r3, r4, #16384  @ Page directory size
  bic r3, r3, #0xff  @ Align the pointer
  bic r3, r3, #0x3f00
/*
 * Initialise the page tables, turning on the cacheable and bufferable
 * bits for the RAM area only.
 */
  mov r0, r3
  mov r9, r0, lsr #18
  mov r9, r9, lsl #18  @ start of RAM
  add r10, r9, #0x10000000 @ a reasonable RAM size
  mov r1, #0x12
  orr r1, r1, #3 << 10
  add r2, r3, #16384
1:  cmp r1, r9   @ if virt > start of RAM
  orrhs r1, r1, #0x0c  @ set cacheable, bufferable
  cmp r1, r10   @ if virt > end of RAM
  bichs r1, r1, #0x0c  @ clear cacheable, bufferable
  str r1, [r0], #4  @ 1:1 mapping
  add r1, r1, #1048576
  teq r0, r2
  bne 1b

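This loop fills in the translation table for the whole 4G address space; each word (4 bytes) describes one 1M memory section,
so the translation table size works out as:

total size (4G) / section size (1M) = 4K entries, 4K entries * 4 bytes per word = 16K bytes

The ARM MMU requires the translation table base address to be aligned on a 16K boundary, which is why the low 0x3fff bits are cleared.
0xff and 0x3f00 are cleared in two steps because an ARM data-processing immediate is an 8-bit value rotated by an even number of bits, so 0x3fff cannot be encoded in a single bic.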
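
The size calculation and the two-step alignment can be checked numerically. In this sketch the names table_base and is_arm_immediate are hypothetical, and the zreladdr values are examples. The second function tests the ARM rule that a data-processing immediate must be an 8-bit value rotated right by an even amount, which is the reason 0x3fff has to be split into 0xff and 0x3f00.

```python
MASK32 = 0xFFFFFFFF
SECTION_SIZE = 1 << 20                      # 1M per first-level section
ADDRESS_SPACE = 1 << 32                     # 4G
ENTRIES = ADDRESS_SPACE // SECTION_SIZE     # number of descriptors
TABLE_BYTES = ENTRIES * 4                   # one 4-byte word per descriptor


def table_base(zreladdr):
    """Mirror of the first three instructions of __setup_mmu."""
    r3 = (zreladdr - 16384) & MASK32        # sub r3, r4, #16384
    r3 &= ~0xFF                             # bic r3, r3, #0xff
    r3 &= ~0x3F00                           # bic r3, r3, #0x3f00
    return r3 & MASK32


def is_arm_immediate(value):
    """True if value is an 8-bit constant rotated right by an even amount."""
    for rot in range(0, 32, 2):
        # Rotating left by rot undoes a rotate right by rot.
        rotated = ((value << rot) | (value >> (32 - rot))) & MASK32
        if rotated < 0x100:
            return True
    return False
```

With zreladdr = 0x80008000 the table base comes out as 0x80004000, and is_arm_immediate accepts 0xff and 0x3f00 but rejects 0x3fff.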

Remember LC0 from earlier? r4 has not been changed since, so r4 is zreladdr. The comment calls it the kernel execution address, which is simply the physical memory address.
So where is zreladdr decided? In the Makefiles: following the Makefiles in arch/arm/boot and arch/arm/boot/compressed,
you eventually find the actual setting in arch/arm/mach-xxx/Makefile.boot.

Meaning of the registers used:

r0: translation table current (variable)
r1: table descriptor; 0x12 means the type is section.
r2: translation table end
r3: translation table start (constant)
r4: zreladdr (equal to r2 when zreladdr is 16K-aligned)
r9: physical memory base address
r10: reasonable memory end address (size: 256M)


In my case physical memory starts at 0x80000000, so the first-level section memory layout looks like this:

    (R9) 0x80000000 +------------------+ \  Start of RAM
                    |Reserved          |  \
    (R3) 0x80004000 +------------------+\  \
                    |Translation Table |  +0x4000 (16K)
    (R2) 0x80008000 +------------------+/    \
                    |                  |      \
                    |                  |    +0x10000000 (256M reasonable memory)
                   ----------------------     /
                        Cut Short            /
                   ----------------------   /
                    |                  |   /
                    |                  |  /
   (R10) 0x90000000 +------------------+ /

For the rest, see the MMU chapter of the ARM reference manual, which explains it in detail.
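
The fill loop of __setup_mmu can be replayed in Python to see which sections end up cacheable. This is a sketch that follows the instructions one by one; build_section_table is a hypothetical name and 0x80008000 is the zreladdr implied by the diagram above.

```python
MASK32 = 0xFFFFFFFF


def build_section_table(zreladdr):
    """Return the 4096 first-level descriptors written by the loop."""
    r3 = (zreladdr - 16384) & ~0x3FFF & MASK32   # table start, 16K aligned
    r9 = (r3 >> 18) << 18                         # start of RAM
    r10 = (r9 + 0x10000000) & MASK32              # start of RAM + 256M
    r1 = 0x12 | (3 << 10)                         # section descriptor, AP = 0b11
    table = []
    for _ in range(4096):
        if r1 >= r9:
            r1 |= 0x0C        # orrhs: set cacheable, bufferable
        if r1 >= r10:
            r1 &= ~0x0C       # bichs: clear cacheable, bufferable
        table.append(r1)      # str r1, [r0], #4
        r1 = (r1 + 0x100000) & MASK32   # next 1M section
    return table


table = build_section_table(0x80008000)
```

Entry i always maps virtual section i to physical section i (the 1:1 mapping), and only the 256 entries from 0x800 to 0x8ff carry the 0x0c cacheable/bufferable bits.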

/*
 * If ever we are running from Flash, then we surely want the cache
 * to be enabled also for our execution instance...  We map 4MB of it
 * so there is no map overlap problem for up to 1 MB compressed kernel.
 * If the execution is in RAM then we would only be duplicating the above.
 */
  mov r1, #0x1e
  orr r1, r1, #3 << 10
  mov r2, pc, lsr #20
  orr r1, r1, r2, lsl #20
  add r0, r3, r2, lsl #2
  str r1, [r0], #4
  add r1, r1, #1048576
  str r1, [r0], #4
  add r1, r1, #1048576
  str r1, [r0], #4
  add r1, r1, #1048576
  str r1, [r0]
  mov pc, lr
ENDPROC(__setup_mmu)
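
This code uses pc to find where it is currently executing, then marks the 4M starting at that location as cached/buffered. This lets the kernel use the cache when it runs from flash. Most kernels now run from RAM, of course, in which case this merely rewrites four of the descriptors that the loop above already set.
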
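
The four stores can be sketched the same way. Here map_current_code is a hypothetical name, the table is a plain list of 4096 words, and the pc value is invented; the sketch assumes pc is not within the last 3M of the address space.

```python
def map_current_code(table, pc):
    """Overwrite four descriptors starting at the 1M section containing pc."""
    r2 = pc >> 20                         # mov r2, pc, lsr #20
    r1 = 0x1E | (3 << 10) | (r2 << 20)    # cacheable, bufferable section
    for i in range(4):
        table[r2 + i] = r1 & 0xFFFFFFFF   # str r1, [r0], #4
        r1 += 0x100000                    # add r1, r1, #1048576
    return table


# Assumed example: code running at 0x80808008 in an otherwise empty table.
flash_table = map_current_code([0] * 4096, 0x80808008)
```

In the example, entries 0x808 to 0x80b receive descriptors 0x80800c1e to 0x80b00c1e and nothing else is touched.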
__common_mmu_cache_on:
#ifndef DEBUG
  orr r0, r0, #0x000d  @ Write buffer, mmu
#endif
  mov r1, #-1
  mcr p15, 0, r3, c2, c0, 0 @ load page table pointer
  mcr p15, 0, r1, c3, c0, 0 @ load domain access control
  b 1f
  .align 5   @ cache line aligned
1:  mcr p15, 0, r0, c1, c0, 0 @ load control register
  mrc p15, 0, r0, c1, c0, 0 @ and read it back to
  sub pc, lr, r0, lsr #32 @ properly flush pipeline
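
This part just configures the MMU through the coprocessor (p15); for the p15 details, again see the MMU chapter of the ARM reference manual.
* Control first returns to __armv4_mmu_cache_on, which then returns to the code right after bl cache_on:
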
  mov r1, sp   @ malloc space above stack
  add r2, sp, #0x10000 @ 64k max

/*
 * Check to see if we will overwrite ourselves.
 *   r4 = final kernel address
 *   r5 = start of this image
 *   r2 = end of malloc space (and therefore this image)
 * We basically want:
 *   r4 >= r2 -> OK
 *   r4 + image length <= r5 -> OK
 */
  cmp r4, r2
  bhs wont_overwrite
  sub r3, sp, r5  @ > compressed kernel size
  add r0, r4, r3, lsl #2 @ allow for 4x expansion
  cmp r0, r5
  bls wont_overwrite

  mov r5, r2   @ decompress after malloc space
  mov r0, r5
  mov r3, r7
  bl decompress_kernel

  add r0, r0, #127 + 128 @ alignment + stack
  bic r0, r0, #127  @ align the kernel length
/*
 * r0     = decompressed kernel length
 * r1-r3  = unused
 * r4     = kernel execution address
 * r5     = decompressed kernel start
 * r6     = processor ID
 * r7     = architecture ID
 * r8     = atags pointer
 * r9-r14 = corrupted
 */
  add r1, r5, r0  @ end of decompressed kernel
  adr r2, reloc_start
  ldr r3, LC1
  add r3, r2, r3
1:  ldmia r2!, {r9 - r14}  @ copy relocation code
  stmia r1!, {r9 - r14}
  ldmia r2!, {r9 - r14}
  stmia r1!, {r9 - r14}
  cmp r2, r3
  blo 1b
  add sp, r1, #128  @ relocate the stack

  bl cache_clean_flush
  add pc, r5, r0  @ call relocation code

/*
 * We're not in danger of overwriting ourselves.  Do this the simple way.
 *
 * r4     = kernel execution address
 * r7     = architecture ID
 */
wont_overwrite: mov r0, r4
  mov r3, r7
  bl decompress_kernel
  b call_kernel

Once the cache is on, execution returns to line 226; what follows is decompress_kernel and then call_kernel.

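
The overwrite check that picks between the two paths is plain arithmetic. The sketch below follows the comparisons in the assembly; the function name wont_overwrite is borrowed from the label, the addresses are made up, and 32-bit wrap-around is ignored for simplicity.

```python
def wont_overwrite(r4, r5, sp):
    """True if decompressing straight to r4 cannot clobber the running image.

    r4 = final kernel address, r5 = start of this image, sp = stack pointer.
    """
    r2 = sp + 0x10000        # end of the 64k malloc space above the stack
    if r4 >= r2:             # bhs: the kernel lands above this image
        return True
    r3 = sp - r5             # upper bound on the compressed kernel size
    r0 = r4 + (r3 << 2)      # allow for 4x expansion
    return r0 <= r5          # bls: the kernel ends below this image


# Assumed example: a 2M image loaded at 0x80800000, kernel goes to 0x80008000.
needs_relocation = not wont_overwrite(0x80008000, 0x80800000, 0x80A00000)
```

In the example the estimated end of the decompressed kernel (0x80808000) lies inside the running image, so the code takes the longer path that decompresses above the malloc space and then relocates.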
call_kernel: bl cache_clean_flush
  bl cache_off
  mov r0, #0   @ must be zero
  mov r1, r7   @ restore architecture number
  mov r2, r8   @ restore atags pointer
  mov pc, r4   @ call kernel
