ARM processor organization

Size: px

Start display at page:

Download "ARM processor organization"

Suzan Davidson
5 years ago
Views:

1 ARM processor organization P. Bakowski

2 ARM register bank The register bank,, which stores the processor state. r00 r01 r14 r15 P. Bakowski 2

3 ARM register bank It has two read ports and one write port which can each be used to access any register, plus an additional read port and an additional write port that give special access to r15, the program counter. r00 r01 r14 r15 P. Bakowski 3

4 ARM register bank It has two read ports and one write port which can each be used to access any register, plus an additional read port and an additional write port that give special access to r15, the program counter. r00 r01 r14 r15 P. Bakowski 4

5 ARM register bank It has two read ports and one write port which can each be used to access any register, plus an additional read port and an additional write port that give special access to r15, the program counter. r00 r01 r14 r15 P. Bakowski 5

6 ARM register bank It has two read ports and one write port which can each be used to access any register, plus an additional read port and an additional write port that give special access to r15, the program counter. r00 r01 r14 r15 P. Bakowski 6

7 ARM barrel shifter The barrel shifter,, which can shift or rotate one operand by any number of bits. number of bits P. Bakowski 7

8 ARM ALU The ALU, which performs the arithmetic and logic functions required by the instruction set. operands functions P. Bakowski 8

9 ARM 3-stage 3 pipeline The address register and incrementer,, which select and hold addresses and generate sequential addresses when required. address register incrementer P. Bakowski 9

10 ARM 3-stage 3 pipeline The data registers, which hold data passing to and from memory. data instructions data out register data in register to/from memory d[31:0] P. Bakowski 10

11 ARM 3-stage 3 pipeline The instruction decoder and associated control logic. control path control signals data path instructions data in register P. Bakowski 11

12 Three stage pipeline : ARM 1,2,3 FETCH; ; the instruction is fetched from memory and placed in the instruction pipeline. to instruction register address register data in register memory fetch clock cycle P. Bakowski 12

13 Three stage pipeline : ARM 1,2,3 DECODE; ; the instruction is decoded and the datapath control signals prepared for the next cycle. instruction register control path control signals fetch decode fetch P. Bakowski 13

14 Three stage pipeline : ARM 1,2,3 EXECUTE; ; the instruction controls the datapath; ; the register bank is read, an operand shifted the ALU result generated and written back into a destination register. control signals data path fetch decode fetch execute decode P. Bakowski 14

15 Three stage pipeline : ARM 1,2,3 instruction throughput : 1 instruction per clock cycle instruction latency : 3 clock cycles fetch decode execute fetch decode execute clock cycle fetch decode execute P. Bakowski 15

16 Three stage pipeline : ARM 1,2,3 instruction throughput : 1 instruction per clock cycle instruction latency : 3 clock cycles fetch decode fetch execute decode fetch execute decode execute P. Bakowski 16

17 ARM 1,2,3 architecture a[31:0] instruction register control path control signals Bbus data in register address register incrementer incrementer PC register bank M BS Abus ALUbus ALU M multiplier, BS barrel shifter data out register d[31:0] P. Bakowski 17

18 Multi cycle execution fetch decode STR fetch STR store instruction needs two execution cycles: address calculation cycle and data transfer cycle P. Bakowski 18

19 Multi cycle execution fetch decode STR execute decode fetch address fetch address calculation cycle and data transfer cycle P. Bakowski 19

20 Multi cycle execution 4 clock cycles fetch decode execute STR decode address transfer fetch decode execute fetch decode address calculation cycle and data transfer cycle P. Bakowski 20

21 Multi cycle execution 4 clock cycles fetch decode execute STR decode address transfer fetch decode execute fetch decode address calculation cycle and data transfer cycle Attention: only one memory transfer per clock cycle P. Bakowski 21

22 Processor performance The time T,, required to execute a given program is given by: T=(N inst *CPI)/f clk N inst - number of instructions in the program CPI - clock cycles per instruction fclk - clock frequency P. Bakowski 22

23 Processor performance The time T,, required to execute a given program is given by: T=(N inst *CPI)/f clk N inst - number of instructions in the program CPI - clock cycles per instruction fclk - clock frequency P. Bakowski 23

24 Processor performance The time T,, required to execute a given program is given by: T=(N inst *CPI)/f clk N inst - number of instructions in the program CPI - clock cycles per instruction (throughput) fclk - clock frequency P. Bakowski 24

25 Processor performance The time T,, required to execute a given program is given by: T=(N inst *CPI)/f clk N inst - number of instructions in the program CPI - clock cycles per instruction fclk - clock frequency P. Bakowski 25

26 Processor performance Since N insi is constant for a given program there are only two ways to increase performance: increase the clock rate, clk f. reduce the average number of clock cycles per instruction, CPI. P. Bakowski 26

27 Clock rate increase Increase the clock rate, clk f. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased. stage1 stage2 clock cycle 3 stages stage3 P. Bakowski 27

28 Clock rate increase Increase the clock rate, clk f. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased. stage1 stage2 clock cycle 3 stages stage3 stage1 stage2 stage3 clock cycle 5 stages stage4 stage5 P. Bakowski 28

29 Clock rate increase Note that the clock rate to be used depends heavily on the implementation technology. stage1 stage1 stage2 stage2 clock cycle 3 stages stage3 clock cycle 3 stages stage3 new implementation technology P. Bakowski 29

30 More hardware resources Reduce the average number of clock cycles per instruction, CPI. This requires the introduction of more parallelism that means more hardware resources to be used in a given clock cycle. D/I M DM IM data read or instruction fetch data read and instruction fetch P. Bakowski 30

31 5-stage pipeline organization Higher performance ARM cores employ a 5-stage 5 pipeline and have separate instruction and data memories. Breaking instruction execution down into five stages rather than three reduces the maximum work which must be completed in a clock cycle, and hence allows a higher clock frequency to be used. The separate instruction and data memories seen as separate caches connected to a unified instruction and data main memory allow a significant reduction in the core's CPI. P. Bakowski 31

32 5-stage pipeline organization Higher performance ARM cores employ a 5-stage 5 pipeline and have separate instruction and data memories. Breaking instruction execution down into five stages rather than three reduces the maximum work which must be completed in a clock cycle, and hence allows a higher clock frequency to be used. The separate instruction and data memories seen as separate caches connected to a unified instruction and data main memory allow a significant reduction in the core's CPI. P. Bakowski 32

33 5-stage pipeline organization Higher performance ARM cores employ a 5-stage 5 pipeline and have separate instruction and data memories. Breaking instruction execution down into five stages rather than three reduces the maximum work which must be completed in a clock cycle, and hence allows a higher clock frequency to be used. The separate instruction and data memories seen as separate caches connected to a unified instruction and data main memory allow a significant reduction in the core's CPI. P. Bakowski 33

34 incrementer incrementer Fetch stage fetch decode execute buffer write FETCH - the instruction is fetched from memory and placed in the instruction cache. next PC I cache to decoder P. Bakowski 34

35 Decode stage fetch decode execute buffer write DECODE - the instruction is decoded and register operands read from the register file. decode I - decode P. Bakowski 35

36 Decode stage fetch decode execute buffer write There are three operand read ports in the register file, so most instructions can obtain all their operands in one cycle. decode I - decode register file P. Bakowski 36

37 Execute stage fetch decode execute buffer write EXECUTE - an operand is shifted and the ALU result generated. register file M BS ALU ALUbus +4 P. Bakowski 37

38 Execute stage fetch decode execute buffer write EXECUTE - an operand is shifted and the ALU result generated. register file M BS ALU ALUbus +4 P. Bakowski 38

39 Execute stage fetch decode execute buffer write If the instruction is a load or store the memory address is computed in the ALU. register file M BS ALU ALUbus +4 P. Bakowski 39

40 Buffer stage fetch decode execute buffer write BUFFER data - data memory is accessed if required. Otherwise the ALU result is simply buffered for one clock cycle to give the same pipeline flow for all instructions. byte replication +4 D cache rotation/sign extension P. Bakowski 40

41 Write back stage fetch decode execute buffer write WRITE-back; the results generated by the instruction are written back to the register file, including any data loaded from memory. ALU ALUbus register file D cache rotation/sign extension P. Bakowski 41

42 Data forwarding In the 5-stage 5 pipeline instruction execution is spread across three pipeline stages,, the only way to resolve data dependencies without stalling the pipeline is to introduce forwarding paths. fetch decode execute buffer write P. Bakowski 42

43 Data forwarding to register file fetch decode execute buffer write fetch decode execute buffer write Data dependencies arise when an instruction needs to use the result of one of its predecessors before that result has returned to the register file. P. Bakowski 43

44 Data forwarding Forwarding paths (by-pass) allow the intermediate results to be passed between stages as soon as they are available, in the 5-stage 5 ARM pipeline each of the three source operands can be forwarded from any of three intermediate result registers to register file fetch decode execute buffer write by-pass paths fetch decode execute buffer write P. Bakowski 44

45 PC organization - compatibility The programming behavior of the PC implemented through r15 is based on the operational characteristics of the 3-stage 3 ARM pipeline. Basically the 5-stage 5 pipeline reads the instruction operands one stage earlier and that is incompatible with 3-stage design. P. Bakowski 45

46 PC organization - compatibility The programming behavior of the PC implemented through r15 is based on the operational characteristics of the 3-stage 3 ARM pipeline. Basically the 5-stage 5 pipeline reads the instruction operands one stage earlier and that is incompatible with 3-stage design. P. Bakowski 46

47 PC organisation - solution This problem is resolved by the incrementation of the PC value from the fetch stage in the decode stage, bypassing the pipeline register between the two stages. PC+4 for the next instruction is equal to PC+8 for the current instruction (4 bytes farther), so the correct r15 value is obtained without additional hardware. P. Bakowski 47

48 PC organisation - solution PC+4 for the next instruction is equal to PC+8 for the current instruction (4 bytes farther), so the correct r15 value is obtained without additional hardware. incrementer incrementer I cache next PC+4 to decoder register file next PC+8 r15 P. Bakowski 48

49 ARM programming model The Instruction Set Architecture (ISA) defines the operations that the programmer can use to change the state of the system incorporating the processor. This state usually comprises the values of the data items in the visible registers and the memory. Each instruction performs a defined transformation from the state before the instruction is executed to the state after it has completed. P. Bakowski 49

50 ARM programming model The Instruction Set Architecture (ISA) defines the operations that the programmer can use to change the state of the system incorporating the processor. This state usually comprises the values of the data items in the visible registers and the memory. Each instruction performs a defined transformation from the state before the instruction is executed to the state after it has completed. P. Bakowski 50

51 ARM programming model The Instruction Set Architecture (ISA) defines the operations that the programmer can use to change the state of the system incorporating the processor. This state usually comprises the values of the data items in the visible registers and the memory. Each instruction performs a defined transformation from the state before the instruction is executed to the state after it has completed. P. Bakowski 51

52 ARM memory subsystem ARM memory may be viewed as a linear array of bytes numbered from zero up to linear array of bytes 0 P. Bakowski 52

53 ARM memory subsystem Data items may be 8-bit 8 bytes,, 16-bit half-words or 32-bit words bytes words P. Bakowski 53

54 ARM memory subsystem Words are always aligned on 4-byte boundaries (the two least significant address bits are zero) ) and half- words are aligned on even byte boundaries byte number word address P. Bakowski 54

55 ARM memory subsystem Words are always aligned on 4-byte boundaries (the two least significant address bits are zero) ) and half- words are aligned on even byte boundaries little endian organization P. Bakowski 55

56 ARM load-store architecture The processing instruction (add, subtract, and so on) take the values from the registers and always place the results into a register. register file M BS ALU ALUbus +4 P. Bakowski 56

57 ARM load-store architecture The processing instruction (add, subtract, and so on) take the values from the registers and always place the results into a register. register file M BS ALU ALUbus +4 P. Bakowski 57

58 ARM load-store architecture The only instructions which apply to memory state are ones which copy memory values into register (load instructions) or copy register values into memory (store instructions). register file D -cache memory P. Bakowski 58

59 ARM load-store architecture The only instructions which apply to memory state are ones which copy memory values into register (load instructions) or copy register values into memory (store instructions). register file D -cache memory P. Bakowski 59

60 ARM instructions In general the ARM instructions fall into one of the following three categories: data processing instructions data transfer instructions control flow instructions P. Bakowski 60

61 ARM instructions In general the ARM instructions fall into one of the following three categories: data processing instructions data transfer instructions control flow instructions P. Bakowski 61

62 ARM instructions In general the ARM instructions fall into one of the following three categories: data processing instructions data transfer instructions control flow instructions P. Bakowski 62

63 ARM instructions In general the ARM instructions fall into one of the following three categories: data processing instructions data transfer instructions control flow instructions P. Bakowski 63

64 ARM data processing Data processing instructions: these use and change only register values; P. Bakowski 64

65 ARM data processing For example, an instruction can add two registers and place the result in a register. register file M BS ALU ALUbus +4 P. Bakowski 65

66 ARM data transfer Data transfer instructions copy memory values into registers (load( instructions) or copy register values into memory (store( instructions); An additional form, useful only in systems code, exchanges a memory value with a register value. P. Bakowski 66

67 ARM data transfer Data transfer instructions copy memory values into registers (load instructions) or copy register values into memory (store instructions); An additional form, useful only in systems code, exchanges a memory value with a register value. P. Bakowski 67

68 ARM data transfer Data transfer instructions copy memory values into registers (load instructions) or copy register values into memory (store instructions); An additional form, useful only in systems code, exchanges a memory value with a register value. e.g. test and set instruction P. Bakowski 68

69 ARM control flow Control flow instructions cause execution to switch to a different address,, either permanently (branch instructions) ) or saving a return address to resume the original sequence (branch and link instructions) or trapping into system code (supervisor calls). P. Bakowski 69

70 ARM control flow Control flow instructions cause execution to switch to a different address, either permanently (branch instructions) or saving a return address to resume the original sequence (branch and link instructions) or trapping into system code (supervisor calls). link address P. Bakowski 70

71 ARM control flow Control flow instructions cause execution to switch to a different address, either permanently (branch instructions) or saving a return address to resume the original sequence (branch and link instructions) or trapping into system code (supervisor calls). link address system code P. Bakowski 71

72 ARM supervisor mode The ARM processor supports a protected supervisor mode.. The protection mechanism ensures that user code cannot gain supervisor privileges without appropriate checks being carried out to ensure that the code is not attempting illegal operations. I/O driver illegal operation? user code P. Bakowski 72

73 ARM supervisor mode The upshot of this for the user-level programmer is that system-level functions can only be accessed through specified supervisor call. system/supervisor call I/O driver system code user code P. Bakowski 73

74 ARM I/O programming The ARM handles I/0 (input/output) peripherals (such as disk controllers, network interfaces, and so on) as memory-mapped mapped devices with interrupt support. memory-mapped mapped devices P. Bakowski 74

75 ARM I/O programming The internal registers in these devices appear as addressable locations within the ARM's memory map and may be read and written using the same (load( load- store) ) instructions as any other memory locations. store memory locations load P. Bakowski 75

76 ARM I/O interruptions Peripherals may attract the processor's attention by making an interrupt request using either the normal interrupt (IRQ)) or the fast interrupt (FIQ) input. to CPU - FIQ to CPU - IRQ P. Bakowski 76

77 ARM I/O interruptions Both interrupt inputs are level-sensitive and maskable. Normally most interrupt sources share the IRQ input, with just one or two time-critical sources connected to the higher-priority FIQ input. IRQ FIQ P. Bakowski 77

78 ARM I/O interruptions Some systems may include direct memory access (DMA) hardware external to the processor to handle high-bandwidth traffic. system bus DMA traffic P. Bakowski 78

79 ARM exceptions The ARM architecture supports a range of : interrupts traps supervisor calls all grouped under the general heading of exceptions. P. Bakowski 79

80 ARM exceptions The ARM architecture supports a range of : interrupts traps supervisor calls all grouped under the general heading of exceptions. P. Bakowski 80

81 ARM exceptions The ARM architecture supports a range of : interrupts traps supervisor calls all grouped under the general heading of exceptions. P. Bakowski 81

82 ARM exceptions The ARM architecture supports a range of : interrupts traps supervisor calls all grouped under the general heading of exceptions. P. Bakowski 82

83 ARM exceptions The ARM architecture supports a range of : interrupts traps supervisor calls all grouped under the general heading of exceptions. P. Bakowski 83

84 ARM exceptions The general way of exception handling is the same in all cases: the current state is saved by copying the PC into rl4_exc and the CPSR into SPSR_exc (where exc stands for the exception type); the processor operating mode is changed to the appropriate exception mode; the PC is forced to a value between and 1C 16, the particular value depending on the type of exception. P. Bakowski 84

85 ARM exceptions The general way of exception handling is the same in all cases: the current state is saved by copying the PC into rl4_exc and the CPSR into SPSR_exc (where exc stands for the exception type); the processor operating mode is changed to the appropriate exception mode; the PC is forced to a value between and 1C 16, the particular value depending on the type of exception. P. Bakowski 85

86 ARM exceptions The general way of exception handling is the same in all cases: the current state is saved by copying the PC into rl4_exc and the CPSR into SPSR_exc (where exc stands for the exception type); the processor operating mode is changed to the appropriate exception mode; the PC is forced to a value between and 1C 16, the particular value depending on the type of exception. P. Bakowski 86

87 ARM exceptions The general way of exception handling is the same in all cases: the current state is saved by copying the PC into rl4_exc and the CPSR into SPSR_exc (where exc stands for the exception type); the processor operating mode is changed to the appropriate exception mode; the PC is forced to a value between and 1C 16, the particular value depending on the type of exception. P. Bakowski 87

88 ARM instruction execution a[31:0] instruction register control path control signals Bbus data in register address register incrementer incrementer PC register bank M BS Abus ALUbus ALU M multiplier, BS barrel shifter data out register d[31:0] P. Bakowski 88

89 incrementer incrementer Data processing instruction address register address register data out register data out register data in register data in register register register operations d[31:0] Bbus PC register bank M BS Abus ALUbus ALU P. Bakowski 89 a[31:0]

90 incrementer incrementer Data processing instruction address register address register data out register data out register ir register ir register data in register data in register register immediate operations d[31:0] Bbus PC register bank M BS Abus ALUbus ALU P. Bakowski 90 a[31:0]

91 incrementer incrementer Store instruction address register address register data out register data out register data in register data in register d[31:0] ir register ir register compute address operation Bbus PC register bank M ALU P. Bakowski 91 BS Abus ALUbus new address a[31:0]

92 incrementer incrementer Store instruction address register address register data out register data out register ir register ir register data in register data in register store data auto-index d[31:0] Bbus PC register bank M BS Abus ALUbus ALU auto-index P. Bakowski 92 a[31:0]

93 incrementer incrementer Branch instruction address register address register data out register data out register ir register ir register data in register data in register compute branch address d[31:0] Bbus PC register bank M BS Abus ALUbus ALU branch address P. Bakowski 93 a[31:0]

94 incrementer incrementer Branch instruction address register address register data out register data out register ir register ir register data in register data in register store return address d[31:0] Bbus PC register bank R14 M BS Abus ALUbus ALU return address P. Bakowski 94 a[31:0]

95 Summary ARM register bank ARM barrel shifter and ALU ARM 3-stage 3 and 5-stage 5 pipelines ARM programming model ARM instructions P. Bakowski 95

96 Summary ARM register bank ARM barrel shifter and ALU ARM 3-stage 3 and 5-stage 5 pipelines ARM programming model ARM instructions P. Bakowski 96

97 Summary ARM register bank ARM barrel shifter and ALU ARM 3-stage and 5-stage pipelines ARM programming model ARM instructions P. Bakowski 97

98 Summary ARM register bank ARM barrel shifter and ALU ARM 3-stage 3 and 5-stage 5 pipelines ARM programming model ARM instructions P. Bakowski 98

99 Summary ARM register bank ARM barrel shifter and ALU ARM 3-stage 3 and 5-stage 5 pipelines ARM programming model ARM instructions P. Bakowski 99

Samsung S3C4510B. Hsung-Pin Chang Department of Computer Science National Chung Hsing University

Samsung S3C4510B. Hsung-Pin Chang Department of Computer Science National Chung Hsing University Samsung S3C4510B Hsung-Pin Chang Department of Computer Science National Chung Hsing University S3C4510B A 16/32-bit RISC microcontroller is a cost-effective, highperformance microcontroller 16/32-bit