[Go from Beginner to Mastery] The difference between for and for range

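About the author:

Gao Ke worked on grid computing at IBM Platform Computing and on game server development at Taomee and NetEase. The author has extensive development experience with C++, Go and other languages, with databases such as MySQL, MongoDB and Redis, and with design patterns and network library development, as well as architecture design and development of tactics, turn-based and MOBA browser and mobile games. (Thank you for following.)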
————————————————

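What is the difference between for and for range?

A plain for loop can traverse an array or slice, a map whose keys are consecutive increasing integers, and a string.

for range can handle all of those traversals, plus cases a plain for loop cannot express directly: iterating over a map with string keys while getting both key and value, and iterating over a channel.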
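The two for range-only cases mentioned above can be sketched as follows. This is a minimal illustration; the helper names sortedPairs and drain are made up for this example and are not part of any library.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedPairs (hypothetical helper) ranges over a map with string keys and
// collects "key=value" strings. Map iteration order is unspecified, so the
// result is sorted before it is returned.
func sortedPairs(m map[string]int) []string {
	pairs := make([]string, 0, len(m))
	for k, v := range m {
		pairs = append(pairs, fmt.Sprintf("%s=%d", k, v))
	}
	sort.Strings(pairs)
	return pairs
}

// drain (hypothetical helper) ranges over a channel until it is closed
// and returns everything it received.
func drain(ch <-chan int) []int {
	var got []int
	for v := range ch {
		got = append(got, v)
	}
	return got
}

func main() {
	fmt.Println(sortedPairs(map[string]int{"b": 2, "a": 1}))

	ch := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		ch <- i
	}
	close(ch) // without close, the range loop in drain would block forever
	fmt.Println(drain(ch))
}
```

Note that the channel loop only terminates because the channel is closed; a plain three-clause for loop has no equivalent stop condition for a channel.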

So are there any other differences? Let's use a few code snippets to show that the differences go beyond the points above.
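One such difference shows up with strings: indexing with a plain for loop steps through bytes, while for range decodes UTF-8 and steps through runes. The sketch below illustrates this; the function names byteCount and runeStarts are hypothetical, chosen only for this example.

```go
package main

import "fmt"

// byteCount walks a string with a classic for loop: one step per byte.
func byteCount(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		n++
	}
	return n
}

// runeStarts walks a string with for range: one step per decoded rune.
// The index is the byte offset at which each rune starts.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "go语言" // two ASCII bytes followed by two 3-byte runes
	fmt.Println(byteCount(s))
	fmt.Println(runeStarts(s))
}
```

For this input the byte loop takes 8 steps, while the range loop takes 4 steps starting at byte offsets 0, 1, 2 and 5.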

Test code

Let's run some tests on slices and arrays with three loop forms: for range i, for range v, and for i:

package main_test

import "testing"

var intsSlice = []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100}
var intsArray = [...]int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100}

func BenchmarkForRangeI_Slice(b *testing.B) {
 sum := 0
 for n := 0; n < b.N; n++ {
  for i := range intsSlice {
   sum += intsSlice[i]
  }
 }
}

func BenchmarkForRangeV_Slice(b *testing.B) {
 sum := 0
 for n := 0; n < b.N; n++ {
  for _, v := range intsSlice {
   sum += v
  }
 }
}

func BenchmarkForI_Slice(b *testing.B) {
 sum := 0
 for n := 0; n < b.N; n++ {
  for i := 0; i < len(intsSlice); i++ {
   sum += intsSlice[i]
  }
 }
}

func BenchmarkForRangeI_Array(b *testing.B) {
 sum := 0
 for n := 0; n < b.N; n++ {
  for i := range intsArray {
   sum += intsArray[i]
  }
 }
}

func BenchmarkForRangeV_Array(b *testing.B) {
 sum := 0
 for n := 0; n < b.N; n++ {
  for _, v := range intsArray {
   sum += v
  }
 }
}

func BenchmarkForI_Array(b *testing.B) {
 sum := 0
 for n := 0; n < b.N; n++ {
  for i := 0; i < len(intsArray); i++ {
   sum += intsArray[i]
  }
 }
}

The results:

go test -bench=. for_test.go -benchtime 100000000x
goos: windows
goarch: amd64
cpu: 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz
BenchmarkForRangeI_Slice-12 100000000 33.87 ns/op
BenchmarkForRangeV_Slice-12 100000000 33.91 ns/op
BenchmarkForI_Slice-12 100000000 40.68 ns/op
BenchmarkForRangeI_Array-12 100000000 28.47 ns/op
BenchmarkForRangeV_Array-12 100000000 28.57 ns/op
BenchmarkForI_Array-12 100000000 28.40 ns/op
PASS
ok command-line-arguments 19.439s

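As we can see, for slices the for i loop is somewhat slower than the for range loops, while for arrays there is no difference... but why?

First, let's look at the slice structure as defined in the Go runtime source on github.com: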

type slice struct {
 array unsafe.Pointer    // pointer to the backing array, at address (slice + 0)
 len    int               // slice length, at address (slice + 8)
 cap    int               // slice capacity, at address (slice + 16)
}

Disassembly

Next, let's dig into the disassembly by running go tool objdump and try to work out what the Go compiler did for us:

for loop over a slice:

sum := 0
for i := 0; i < len(intsSlice); i++ {
    sum += intsSlice[i]
}

Disassembly:

0x48dd34 XORL AX, AX
0x48dd36 XORL CX, CX
0x48dd38 JMP 0x48dd48 # jump to the 5-th instruction of the loop
######################## loop start ##########################
0x48dd3a LEAQ 0x1(AX), BX # store AX (index counter) + 1 in BX
0x48dd3e MOVQ 0(DX)(AX*8), DX # store quadword (8 bytes) from DX (data pointer) + AX (index counter) * 8 address to DX
0x48dd42 ADDQ DX, CX # add DX value to CX (our sum accumulator)
0x48dd45 MOVQ BX, AX # set BX (previously incremented AX by 1) value to AX (index counter)
0x48dd48 MOVQ main.intsSlice(SB), DX # store slice data pointer in DX (from static address)
0x48dd4f CMPQ AX, main.intsSlice+8(SB) # compare to slice data size (static address)
0x48dd56 JG 0x48dd3a # jump back to start if slice size is greater than AX (index counter)
######################## loop end ##########################

for range loop over a slice:

sum := 0 
for i := range intsSlice { 
    sum += intsSlice[i] 
}

And the disassembly:

0x48dd34 MOVQ main.intsSlice(SB), CX # store slice data pointer in CX (from static address)
0x48dd3b MOVQ main.intsSlice+8(SB), DX # store slice data size in DX (from static address)
0x48dd42 XORL AX, AX
0x48dd44 XORL BX, BX
0x48dd46 JMP 0x48dd56 # jump to the 5-th instruction of the loop
######################## loop start ##########################
0x48dd48 LEAQ 0x1(AX), SI # store AX (index counter) + 1 in SI
0x48dd4c MOVQ 0(CX)(AX*8), DI # store quadword (8 bytes) from CX (data pointer) + AX (index counter) * 8 address to DI
0x48dd50 ADDQ DI, BX # add DI value to BX (our sum accumulator)
0x48dd53 MOVQ SI, AX # move SI (previously incremented AX by 1) value to AX (index counter)
0x48dd56 CMPQ DX, AX # compare DX (slice data size) to AX (index counter)
0x48dd59 JL 0x48dd48 # jump back to start if AX (index counter) is less than DX (slice size)
######################## loop end ##########################

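So the main difference is that in the for loop case, the slice data pointer is loaded from the static address of the slice structure on every iteration and stored in a general-purpose register. The compare instruction also reads the slice length through the static address of the slice structure.

In the for range case, both the slice data pointer and the length are loaded into general-purpose registers once, before the loop. That saves one instruction per iteration, and the slice length no longer has to be read from RAM or the CPU cache on every iteration.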
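The same "load once" idea can be applied to a classic for loop by copying the slice header into a local variable first. The sketch below is an assumption about how one might write it, not something measured in this article: sumGlobal and sumLocal are hypothetical names, the short intsSlice is a stand-in for the benchmark data, and whether the generated code actually matches the for range version depends on the compiler version, so it should be benchmarked and disassembled before drawing conclusions.

```go
package main

import "fmt"

// intsSlice stands in for the package-level slice used in the benchmark.
var intsSlice = []int{1, 2, 3, 4, 5}

// sumGlobal indexes the package-level slice directly,
// like BenchmarkForI_Slice does.
func sumGlobal() int {
	sum := 0
	for i := 0; i < len(intsSlice); i++ {
		sum += intsSlice[i]
	}
	return sum
}

// sumLocal copies the slice header (pointer, len, cap) into a local
// variable first, so the loop works on a pointer and length that no
// other code can change while it runs.
func sumLocal() int {
	s := intsSlice
	sum := 0
	for i := 0; i < len(s); i++ {
		sum += s[i]
	}
	return sum
}

func main() {
	fmt.Println(sumGlobal(), sumLocal())
}
```

Both functions return the same sum; the only difference is where the loop reads the pointer and length from.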

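So in this benchmark the for range loop is faster than the for i loop over a slice, and it is also "safer": if the slice changes its length and data address while the loop is running (for example from another goroutine), we still access the old "valid" data. Of course, we should not rely on this behavior and should eliminate any race conditions from the code ;)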
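The "old data" behavior can be observed without any goroutines, because for range evaluates the slice once before the loop starts, while a classic for loop re-evaluates len on every iteration. A minimal sketch, with hypothetical function names countRange and countClassic:

```go
package main

import "fmt"

// countRange appends to the slice inside a for range loop. The range
// expression was evaluated once up front, so the loop still runs only
// over the original three elements.
func countRange() int {
	s := []int{1, 2, 3}
	n := 0
	for range s {
		s = append(s, 0)
		n++
	}
	return n
}

// countClassic appends inside a classic for loop (capped at length 6 so
// it terminates). len(s) is re-read on every iteration, so the loop sees
// the slice grow.
func countClassic() int {
	s := []int{1, 2, 3}
	n := 0
	for i := 0; i < len(s); i++ {
		if len(s) < 6 {
			s = append(s, 0)
		}
		n++
	}
	return n
}

func main() {
	fmt.Println(countRange(), countClassic())
}
```

Here countRange performs 3 iterations and countClassic performs 6, which mirrors what the disassembly shows: one loop works from values captured before the loop, the other reloads them each time.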

Now let's look at the for loop over an array:

sum := 0
for i := 0; i < len(intsArray); i++ {
    sum += intsArray[i]
}

And the disassembly:

0x48dd34 XORL AX, AX
0x48dd36 XORL CX, CX
0x48dd38 JMP 0x48dd4f
######################## loop start ##########################
0x48dd3a LEAQ 0x1(AX), DX
0x48dd3e LEAQ main.intsArray(SB), BX # store the address of array in BX
0x48dd45 MOVQ 0(BX)(AX*8), SI
0x48dd49 ADDQ SI, CX
0x48dd4c MOVQ DX, AX
0x48dd4f CMPQ $0x64, AX # here the array size is pre determined at compile time
0x48dd53 JL 0x48dd3a
######################## loop end ##########################

for range loop over an array:

sum := 0
for i := range intsArray {
    sum += intsArray[i]
}

And the disassembly:

0x48dd34 XORL AX, AX
0x48dd36 XORL CX, CX
0x48dd38 JMP 0x48dd4f
######################## loop start ##########################
0x48dd3a LEAQ 0x1(AX), DX
0x48dd3e LEAQ main.intsArray(SB), BX
0x48dd45 MOVQ 0(BX)(AX*8), SI
0x48dd49 ADDQ SI, CX
0x48dd4c MOVQ DX, AX
0x48dd4f CMPQ $0x64, AX
0x48dd53 JL 0x48dd3a
######################## loop end ##########################

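They are exactly the same. In both cases the address of the array is recomputed on every iteration (the LEAQ instruction) and stored in the BX register, which does not look very efficient.

But what about this one: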

sum := 0 
for _, v := range intsArray { 
    sum += v 
}

After disassembly:

0x48dd49 LEAQ 0x28(SP), DI # 0x28(SP) is the address where our array will be located on the stack
0x48dd4e LEAQ main.intsArray(SB), SI
0x48dd55 NOPW 0(AX)(AX*1)
0x48dd5e NOPW
0x48dd60 MOVQ BP, -0x10(SP)
0x48dd65 LEAQ -0x10(SP), BP
0x48dd6a CALL 0x45e8a4 # runtime.duffcopy call
0x48dd6f MOVQ 0(BP), BP
0x48dd73 XORL AX, AX
0x48dd75 XORL CX, CX
0x48dd77 JMP 0x48dd84
######################## loop start ##########################
0x48dd79 MOVQ 0x28(SP)(AX*8), DX # so now we are accessing our data copy on the stack
0x48dd7e INCQ AX
0x48dd81 ADDQ DX, CX
0x48dd84 CMPQ $0x64, AX
0x48dd88 JL 0x48dd79
######################## loop end ##########################

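For some reason, Go decided to copy the array onto the stack... really? That's cheating.

I tried increasing the array size to 1000, but the compiler still thinks copying everything onto the stack is the better choice :)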

0x48dd51 LEAQ 0x28(SP), DI # 0x28(SP) is the address where our array will be located on the stack
0x48dd56 LEAQ main.intsArray(SB), SI # store array address in SI (from static address)
0x48dd5d MOVL $0x3e8, CX # store array length (1000) in CX
0x48dd62 REP; MOVSQ DS:0(SI), ES:0(DI) # Move quadword from SI to DI, repeat CX times
0x48dd65 XORL AX, AX
0x48dd67 XORL CX, CX
0x48dd69 JMP 0x48dd76
######################## loop start ##########################
0x48dd6b MOVQ 0x28(SP)(AX*8), DX
0x48dd70 INCQ AX
0x48dd73 ADDQ DX, CX
0x48dd76 CMPQ $0x3e8, AX
0x48dd7c JL 0x48dd6b
######################## loop end ##########################

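That said, the main difference between the two forms is still where each can be used, as listed at the beginning: a plain for loop covers arrays, slices, strings, and maps with consecutive integer keys, while for range additionally covers maps with string keys (yielding both key and value) and channels.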

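My best guess is that, because of Go's concurrent nature, the compiler decides to copy all the data onto the stack up front (since every value gets copied anyway) in order to preserve its integrity for the whole for range loop and gain some performance. So only a benchmark can tell the full truth about the performance of our algorithm ;)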
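Whatever the compiler's motivation, the copy is also observable at the language level: arrays are values in Go, and ranging over an array with a value variable iterates over a copy made before the loop, whereas ranging over a pointer to the array reads the live data. The sketch below demonstrates this; the function names rangeByValue and rangeByPointer are hypothetical.

```go
package main

import "fmt"

// rangeByValue ranges over the array itself. The loop iterates over a
// copy taken before the first iteration, so the write to arr[2] inside
// the loop is not seen by v.
func rangeByValue() [3]int {
	arr := [3]int{1, 2, 3}
	var seen [3]int
	for i, v := range arr {
		arr[2] = 100
		seen[i] = v
	}
	return seen
}

// rangeByPointer ranges over a pointer to the array. No copy is made,
// so the write to arr[2] is visible when the loop reaches index 2.
func rangeByPointer() [3]int {
	arr := [3]int{1, 2, 3}
	var seen [3]int
	for i, v := range &arr {
		arr[2] = 100
		seen[i] = v
	}
	return seen
}

func main() {
	fmt.Println(rangeByValue())   // values from the copy
	fmt.Println(rangeByPointer()) // values from the live array
}
```

The first function returns [1 2 3] and the second returns [1 2 100]. If the copy is unwanted for a large array, ranging over &arr or arr[:] is one way to avoid it, though whether that is faster in a given program is something only a benchmark can show.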

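OK, but where is the bounds check? Where is the panic? As we can see, there is none, because Go is smart enough to recognize cases where an out-of-bounds access cannot happen. This is called bounds check elimination (BCE), by the way.

So let's make a small change to the code: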

sum := 0
for i := 0; i < len(intsSlice)-1; i++ {
    sum += intsSlice[i+1]
}

Now we get:

0x48dd38 XORL AX, AX
0x48dd3a XORL CX, CX
0x48dd3c JMP 0x48dd49
######################## loop start ##########################
0x48dd3e MOVQ 0x8(BX)(AX*8), DX
0x48dd43 ADDQ DX, CX
0x48dd46 MOVQ SI, AX
0x48dd49 MOVQ main.intsSlice+8(SB), DX # store slice data size in DX
0x48dd50 MOVQ main.intsSlice(SB), BX
0x48dd57 LEAQ -0x1(DX), SI
0x48dd5b NOPL 0(AX)(AX*1)
0x48dd60 CMPQ SI, AX
0x48dd63 JGE 0x48dd70 # jump out of the loop if finished
0x48dd65 LEAQ 0x1(AX), SI # SI will get AX (index counter) plus one
0x48dd69 CMPQ SI, DX # out of bounds checking
0x48dd6c JA 0x48dd3e # jump back to loop start if no out of bounds detected
######################## loop end ##########################
0x48dd6e JMP 0x48ddc1 # jump to the panic procedure call
...
0x48ddc1 MOVQ SI, AX
0x48ddc4 MOVQ DX, CX
0x48ddc7 CALL runtime.panicIndex(SB)
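One way to sidestep the index arithmetic that triggers this check is to reslice once and range over the tail. The sketch below is only an illustration under that assumption: sumTailIndexed and sumTailRange are hypothetical names, and whether the compiler emits a bounds check for either version depends on the Go version, so the result should be confirmed with go tool objdump as above.

```go
package main

import "fmt"

// sumTailIndexed mirrors the modified loop: it reads s[i+1] for every i.
func sumTailIndexed(s []int) int {
	sum := 0
	for i := 0; i < len(s)-1; i++ {
		sum += s[i+1]
	}
	return sum
}

// sumTailRange computes the same sum by reslicing once and ranging over
// the tail, so no index arithmetic happens inside the loop.
func sumTailRange(s []int) int {
	if len(s) == 0 {
		return 0 // s[1:] would panic on an empty slice
	}
	sum := 0
	for _, v := range s[1:] {
		sum += v
	}
	return sum
}

func main() {
	data := []int{1, 2, 3, 4}
	fmt.Println(sumTailIndexed(data), sumTailRange(data))
}
```

Both functions skip the first element and return the same result; the explicit empty-slice guard in sumTailRange replaces the per-iteration check with a single test before the loop.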

Finally, a comparison with the C compiler gcc:

int64_t sum = 0;
for (int i = 0; i < sizeof(intsArray) / sizeof(intsArray[0]); i++)
{
    sum += intsArray[i];
}

gcc -o main.exe -O3 main.c
objdump -S main.exe > main-c-for-i.asm

100401689: lea 0x990(%rip),%rax # 100402020 <intsArray>; store intsArray address in rax
100401690: pxor %xmm0,%xmm0
100401694: lea 0x320(%rax),%rdx # store rax + 800 (array size is 100 * 8 bytes) in rdx (intsArray after end address)
10040169b: nopl 0x0(%rax,%rax,1)
######################## loop start ##########################
1004016a0: paddq (%rax),%xmm0 # adds 2 qwords from rax to xmm0 (128-bit register)
1004016a4: add $0x10,%rax # increments rax (current intsArray address) by 16 bytes
1004016a8: cmp %rax,%rdx # compare rax (current intsArray address) to (intsArray after end address)
1004016ab: jne 1004016a0 <main+0x20> # jump if current intsArray address not equals to intsArray after end address
######################## loop end ##########################
1004016ad: movdqa %xmm0,%xmm1 # copy accumulated 2 qwords to xmm1
1004016b1: psrldq $0x8,%xmm1 # shift xmm1 by 8 bytes right, so the 1-st qword will be at 2-nd qword place
1004016b6: paddq %xmm1,%xmm0 # add shifted 1-st qword from xmm1 to 2-nd qword of xmm0
1004016ba: movq %xmm0,%rax # copy final 2-nd qword to 64 bit rax, so here will be the final result

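We can see that the loop contains only 4 instructions, and the accumulation handles two elements per iteration, roughly doubling its speed, because of the paddq instruction, which adds the 2 packed qwords of the first operand to the corresponding 2 packed qwords of the second operand.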
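The two-lane accumulation that paddq performs can be imitated by hand in Go with two independent accumulators. This is only a sketch of the idea: sumTwoLanes is a hypothetical function, the Go disassembly shown earlier uses scalar instructions rather than paddq, and this version was not benchmarked here, so any speedup is an open question.

```go
package main

import "fmt"

// sumTwoLanes adds even-indexed elements into lo and odd-indexed elements
// into hi, then combines the two lanes at the end, similar in spirit to
// what the paddq-based loop does with its two packed qwords.
func sumTwoLanes(a []int64) int64 {
	var lo, hi int64
	i := 0
	for ; i+1 < len(a); i += 2 {
		lo += a[i]
		hi += a[i+1]
	}
	if i < len(a) { // leftover element when the length is odd
		lo += a[i]
	}
	return lo + hi
}

func main() {
	data := make([]int64, 100)
	for i := range data {
		data[i] = int64(i + 1) // 1..100, as in the benchmark data
	}
	fmt.Println(sumTwoLanes(data))
}
```

For the values 1 through 100 this prints 5050, the same result as the single-accumulator loops.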