- A+
将彩色位图转为灰度位图,是图像处理的常用算法。本文将介绍 Bgr24彩色位图转为Gray8灰度位图的算法,除了会给出标量算法外,还会给出向量算法。且这些算法是跨平台的,同一份源代码,能在 X86及Arm架构上运行,且均享有SIMD硬件加速。
一、标量算法
1.1 算法实现
对于彩色转灰度,由于人眼对红绿蓝三种颜色的敏感程度不同,在灰度转换时,每个颜色分配的权重也是不同的。有一个很著名的心理学公式:
Gray = R*0.299 + G*0.587 + B*0.114
该公式含有浮点数,而浮点数运算一般比较慢。
于是在具体实现时,需要做一定优化。可以将小数转为定点整数,这样便能将除法转为移位。整数计算比浮点型快,移位运算和加减法比乘除法快,于是取得了比较好的效果。
但是这种方法也会带来一定的精度损失,我们可以根据实际情况选择定点整数的精度位数。
这里我们使用16位精度,源代码如下。
public static unsafe void ScalarDo(BitmapData src, BitmapData dst) { const int cbPixel = 3; // Bgr24 const int shiftPoint = 16; const int mulPoint = 1 << shiftPoint; // 0x10000 const int mulRed = (int)(0.299 * mulPoint + 0.5); // 19595 const int mulGreen = (int)(0.587 * mulPoint + 0.5); // 38470 const int mulBlue = mulPoint - mulRed - mulGreen; // 7471 int width = src.Width; int height = src.Height; int strideSrc = src.Stride; int strideDst = dst.Stride; byte* pRow = (byte*)src.Scan0.ToPointer(); byte* qRow = (byte*)dst.Scan0.ToPointer(); for (int i = 0; i < height; i++) { byte* p = pRow; byte* q = qRow; for (int j = 0; j < width; j++) { *q = (byte)((p[2] * mulRed + p[1] * mulGreen + p[0] * mulBlue) >> shiftPoint); p += cbPixel; // Bgr24 q += 1; // Gray8 } pRow += strideSrc; qRow += strideDst; } }
1.2 基准测试代码
使用 BenchmarkDotNet 进行基准测试。
可以实现分配好数据。源代码如下。
private static readonly Random _random = new Random(1); private BitmapData _sourceBitmapData = null; private BitmapData _destinationBitmapData = null; private BitmapData _expectedBitmapData = null; [Params(1024, 2048, 4096)] public int Width { get; set; } public int Height { get; set; } ~Bgr24ToGrayBgr24Benchmark() { Dispose(false); } public void Dispose() { Dispose(true); GC.SuppressFinalize(this); } private void Dispose(bool disposing) { if (_disposed) return; _disposed = true; if (disposing) { Cleanup(); } } private BitmapData AllocBitmapData(int width, int height, PixelFormat format) { const int strideAlign = 4; if (width <= 0) throw new ArgumentOutOfRangeException($"The width({width}) need > 0!"); if (height <= 0) throw new ArgumentOutOfRangeException($"The width({height}) need > 0!"); int stride = 0; switch (format) { case PixelFormat.Format8bppIndexed: stride = width * 1; break; case PixelFormat.Format24bppRgb: stride = width * 3; break; } if (stride <= 0) throw new ArgumentOutOfRangeException($"Invalid pixel format({format})!"); if (0 != (stride % strideAlign)) { stride = stride - (stride % strideAlign) + strideAlign; } BitmapData bitmapData = new BitmapData(); bitmapData.Width = width; bitmapData.Height = height; bitmapData.PixelFormat = format; bitmapData.Stride = stride; bitmapData.Scan0 = Marshal.AllocHGlobal(stride * height); return bitmapData; } private void FreeBitmapData(BitmapData bitmapData) { if (null == bitmapData) return; if (IntPtr.Zero == bitmapData.Scan0) return; Marshal.FreeHGlobal(bitmapData.Scan0); bitmapData.Scan0 = IntPtr.Zero; } [GlobalCleanup] public void Cleanup() { FreeBitmapData(_sourceBitmapData); _sourceBitmapData = null; FreeBitmapData(_destinationBitmapData); _destinationBitmapData = null; FreeBitmapData(_expectedBitmapData); _expectedBitmapData = null; } [GlobalSetup] public void Setup() { Height = Width; // Create. Cleanup(); _sourceBitmapData = AllocBitmapData(Width, Height, PixelFormat.Format24bppRgb); _destinationBitmapData = AllocBitmapData(Width, Height, PixelFormat.Format8bppIndexed); _expectedBitmapData = AllocBitmapData(Width, Height, PixelFormat.Format8bppIndexed); RandomFillBitmapData(_sourceBitmapData, _random); }
使用这些已分配好的数据,能很容易写出ScalarDo 的基准测试代码。
[Benchmark(Baseline = true)] public void Scalar() { ScalarDo(_sourceBitmapData, _destinationBitmapData); }
二、向量算法
2.1 算法思路
对于24位转8位灰度,可以使用这种办法: 每次从源位图读取3个向量,进行3-元素组的解交织运算,得到 R,G,B 平面数据。随后使用向量化的乘法与加法,来计算灰度值。结果是存储了灰度值的1个向量,可以将它存储到目标位图。
例如 Sse指令集使用的是128位向量,此时1个向量为16字节。每次从源位图读取3个向量,就是读取了48字节,即16个RGB像素。最后将1个向量存储到目标位图,就是写入了16字节,即16个灰度像素。
对于3-元素组的解交织,可以使用 shuffle 类别的指令来实现。例如对于X86架构的 128位向量,可以使用 SSSE3 的 _mm_shuffle_epi8 指令,它对应 NET 中的 Ssse3.Shuffle
方法。源代码如下。
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part0 = Vector128.Create((sbyte)0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part0 = Vector128.Create((sbyte)1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part0 = Vector128.Create((sbyte)2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15).AsByte(); public static Vector128<byte> YGroup3Unzip(Vector128<byte> data0, Vector128<byte> data1, Vector128<byte> data2, out Vector128<byte> y, out Vector128<byte> z) { var f0A = YGroup3Unzip_Shuffle_Byte_X_Part0; var f0B = YGroup3Unzip_Shuffle_Byte_X_Part1; var f0C = YGroup3Unzip_Shuffle_Byte_X_Part2; var f1A = YGroup3Unzip_Shuffle_Byte_Y_Part0; var f1B = YGroup3Unzip_Shuffle_Byte_Y_Part1; var f1C = YGroup3Unzip_Shuffle_Byte_Y_Part2; var f2A = YGroup3Unzip_Shuffle_Byte_Z_Part0; var f2B = YGroup3Unzip_Shuffle_Byte_Z_Part1; var f2C = YGroup3Unzip_Shuffle_Byte_Z_Part2; var rt0 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f0A), Ssse3.Shuffle(data1, f0B)), Ssse3.Shuffle(data2, f0C)); var rt1 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f1A), Ssse3.Shuffle(data1, f1B)), Ssse3.Shuffle(data2, f1C)); var rt2 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f2A), Ssse3.Shuffle(data1, f2B)), Ssse3.Shuffle(data2, f2C)); y = rt1; z = rt2; return rt0; }
为了使跨平台编写向量算法更方便,我开发了 VectorTraits 库,已经集成了上述算法。该库提供了“Vectors.YGroup3Unzip”方法。该方法能够跨平台,它会使用各个平台的shuffle指令。
- X86 256-bit: Use
_mm256_shuffle_epi8
and other instructions. - Arm: Use
vqvtbl1q_u8
instructions. - Wasm: Use
i8x16.swizzle
instructions.
2.2 算法实现
有了“Vectors.YGroup3Unzip”方法后,便能方便的编写24位转8位灰度的算法了。灰度系数有8位精度,于是需要将 8位数据变宽为16位后,再来计算乘法与加法。最后再将 16位数据,变窄为8位。源代码如下。
public static unsafe void UseVectorsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) { const int cbPixel = 3; // Bgr24 const int shiftPoint = 8; const int mulPoint = 1 << shiftPoint; // 0x100 const ushort mulRed = (ushort)(0.299 * mulPoint + 0.5); // 77 const ushort mulGreen = (ushort)(0.587 * mulPoint + 0.5); // 150 const ushort mulBlue = mulPoint - mulRed - mulGreen; // 29 Vector<ushort> vmulRed = new Vector<ushort>(mulRed); Vector<ushort> vmulGreen = new Vector<ushort>(mulGreen); Vector<ushort> vmulBlue = new Vector<ushort>(mulBlue); int vectorWidth = Vector<byte>.Count; int maxX = width - vectorWidth; byte* pRow = pSrc; byte* qRow = pDst; for (int i = 0; i < height; i++) { Vector<byte>* pLast = (Vector<byte>*)(pRow + maxX * cbPixel); Vector<byte>* qLast = (Vector<byte>*)(qRow + maxX * 1); Vector<byte>* p = (Vector<byte>*)pRow; Vector<byte>* q = (Vector<byte>*)qRow; for (; ; ) { Vector<byte> r, g, b, gray; Vector<ushort> wr0, wr1, wg0, wg1, wb0, wb1; // Load. b = Vectors.YGroup3Unzip(p[0], p[1], p[2], out g, out r); // widen(r) * mulRed + widen(g) * mulGreen + widen(b) * mulBlue Vector.Widen(r, out wr0, out wr1); Vector.Widen(g, out wg0, out wg1); Vector.Widen(b, out wb0, out wb1); wr0 = Vectors.Multiply(wr0, vmulRed); wr1 = Vectors.Multiply(wr1, vmulRed); wg0 = Vectors.Multiply(wg0, vmulGreen); wg1 = Vectors.Multiply(wg1, vmulGreen); wb0 = Vectors.Multiply(wb0, vmulBlue); wb1 = Vectors.Multiply(wb1, vmulBlue); wr0 = Vector.Add(wr0, wg0); wr1 = Vector.Add(wr1, wg1); wr0 = Vector.Add(wr0, wb0); wr1 = Vector.Add(wr1, wb1); // Shift right and narrow. wr0 = Vectors.ShiftRightLogical_Const(wr0, shiftPoint); wr1 = Vectors.ShiftRightLogical_Const(wr1, shiftPoint); gray = Vector.Narrow(wr0, wr1); // Store. *q = gray; // Next. if (p >= pLast) break; p += cbPixel; ++q; if (p > pLast) p = pLast; // The last block is also use vector. if (q > qLast) q = qLast; } pRow += strideSrc; qRow += strideDst; } }
上面源代码中的“Vectors.ShiftRightLogical_Const”,是VectorTraits 库提供的方法。它能替代 NET 7.0
新增的 Vector.ShiftRightLogical
方法,使早期版本的 NET 也能使用逻辑右移位。
“Vectors.Multiply”也是VectorTraits 库提供的方法。它能避免无符号类型有时没有硬件加速的问题。
2.3 基准测试代码
[Benchmark] public void UseVectors() { UseVectorsDo(_sourceBitmapData, _destinationBitmapData, false); } [Benchmark] public void UseVectorsParallel() { UseVectorsDo(_sourceBitmapData, _destinationBitmapData, true); } public static unsafe void UseVectorsDo(BitmapData src, BitmapData dst, bool useParallel = false) { int vectorWidth = Vector<byte>.Count; int width = src.Width; int height = src.Height; if (width <= vectorWidth) { ScalarDo(src, dst); return; } int strideSrc = src.Stride; int strideDst = dst.Stride; byte* pSrc = (byte*)src.Scan0.ToPointer(); byte* pDst = (byte*)dst.Scan0.ToPointer(); int processorCount = Environment.ProcessorCount; int batchSize = height / (processorCount * 2); bool allowParallel = useParallel && (batchSize > 0) && (processorCount > 1); if (allowParallel) { int batchCount = (height + batchSize - 1) / batchSize; // ceil((double)length / batchSize) Parallel.For(0, batchCount, i => { int start = batchSize * i; int len = batchSize; if (start + len > height) len = height - start; byte* pSrc2 = pSrc + start * strideSrc; byte* pDst2 = pDst + start * strideDst; UseVectorsDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst); }); } else { UseVectorsDoBatch(pSrc, strideSrc, width, height, pDst, strideDst); } }
完整源码在 Bgr24ToGray8Benchmark.cs
随后为该算法编写基准测试代码。
三、基准测试结果
3.1 X86 架构
X86架构下的基准测试结果如下。
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3) AMD Ryzen 7 7840H w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores .NET SDK 8.0.403 [Host] : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI DefaultJob : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI | Method | Width | Mean | Error | StdDev | Ratio | RatioSD | Code Size | |--------------------- |------ |-------------:|-----------:|-----------:|------:|--------:|----------:| | Scalar | 1024 | 1,028.55 us | 12.545 us | 11.735 us | 1.00 | 0.02 | 152 B | | UseVectors | 1024 | 94.06 us | 0.606 us | 0.537 us | 0.09 | 0.00 | NA | | UseVectorsParallel | 1024 | 24.98 us | 0.390 us | 0.365 us | 0.02 | 0.00 | NA | | PeterParallelScalar | 1024 | 216.47 us | 1.719 us | 1.524 us | 0.21 | 0.00 | NA | | | | | | | | | | | Scalar | 2048 | 4,092.26 us | 21.098 us | 18.703 us | 1.00 | 0.01 | 152 B | | UseVectors | 2048 | 507.70 us | 9.626 us | 11.459 us | 0.12 | 0.00 | NA | | UseVectorsParallel | 2048 | 118.98 us | 1.025 us | 0.959 us | 0.03 | 0.00 | NA | | PeterParallelScalar | 2048 | 803.30 us | 9.226 us | 8.630 us | 0.20 | 0.00 | NA | | | | | | | | | | | Scalar | 4096 | 16,391.12 us | 121.643 us | 113.785 us | 1.00 | 0.01 | 152 B | | UseVectors | 4096 | 2,472.16 us | 32.452 us | 30.356 us | 0.15 | 0.00 | NA | | UseVectorsParallel | 4096 | 2,034.85 us | 33.074 us | 30.937 us | 0.12 | 0.00 | NA | | PeterParallelScalar | 4096 | 3,139.85 us | 32.657 us | 27.270 us | 0.19 | 0.00 | NA |
- Scalar: 标量算法。
- UseVectors: 矢量算法。
- UseVectorsParallel: 并发的矢量算法。
3.2 Arm 架构
同样的源代码可以在 Arm 架构上运行。基准测试结果如下。
BenchmarkDotNet v0.14.0, macOS Sequoia 15.0.1 (24A348) [Darwin 24.0.0] Apple M2, 1 CPU, 8 logical and 8 physical cores .NET SDK 8.0.204 [Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD [AttachedDebugger] DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD | Method | Width | Mean | Error | StdDev | Ratio | RatioSD | |--------------------- |------ |-------------:|----------:|----------:|------:|--------:| | Scalar | 1024 | 635.31 us | 0.537 us | 0.448 us | 1.00 | 0.00 | | UseVectors | 1024 | 127.04 us | 0.567 us | 0.474 us | 0.20 | 0.00 | | UseVectorsParallel | 1024 | 46.37 us | 0.336 us | 0.314 us | 0.07 | 0.00 | | PeterParallelScalar | 1024 | 202.19 us | 1.025 us | 0.959 us | 0.32 | 0.00 | | | | | | | | | | Scalar | 2048 | 2,625.64 us | 1.795 us | 1.402 us | 1.00 | 0.00 | | UseVectors | 2048 | 521.40 us | 0.301 us | 0.282 us | 0.20 | 0.00 | | UseVectorsParallel | 2048 | 152.11 us | 3.548 us | 10.064 us | 0.06 | 0.00 | | PeterParallelScalar | 2048 | 711.00 us | 1.806 us | 1.601 us | 0.27 | 0.00 | | | | | | | | | | Scalar | 4096 | 10,457.09 us | 5.697 us | 5.051 us | 1.00 | 0.00 | | UseVectors | 4096 | 2,058.16 us | 4.110 us | 3.643 us | 0.20 | 0.00 | | UseVectorsParallel | 4096 | 1,152.15 us | 21.134 us | 21.703 us | 0.11 | 0.00 | | PeterParallelScalar | 4096 | 2,897.94 us | 56.893 us | 91.871 us | 0.28 | 0.01 |
附录
- 完整源代码: https://github.com/zyl910/VectorTraits.Sample.Benchmarks/blob/main/VectorTraits.Sample.Benchmarks.Inc/Image/Bgr24ToGray8Benchmark.cs
- YGroup3Unzip 的文档: https://zyl910.github.io/VectorTraits_doc/api/Zyl.VectorTraits.Vectors.YGroup3Unzip.html
- VectorTraits 的NuGet包: https://www.nuget.org/packages/VectorTraits
- VectorTraits 的在线文档: https://zyl910.github.io/VectorTraits_doc/
- VectorTraits 源代码: https://github.com/zyl910/VectorTraits