SSE图像算法优化系列十五:YUV/XYZ和RGB空间相互转化的极速实现(此后老板不用再担心算法转到其他空间通道的耗时了)。


  在颜色空间系列1: RGB和CIEXYZ颜色空间的转换及相关优化和颜色空间系列3: RGB和YUV颜色空间的转换及优化算法两篇文章中我们给出了两种不同的颜色空间的相互转换之间的快速算法的实现代码,但是那个是C#版本的,为了比较方便,我们这里提供C版本的代码,以RGB转到YUV空间的代码为例:

void RGBToYUV(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)
{
    const int Shift = 15;                            
    const int HalfV = 1 << (Shift - 1);
    const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;
    const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);
    const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);
    for (int YY = 0; YY < Height; YY++)
    {
        unsigned char *LinePS = RGB + YY * Stride;
        unsigned char *LinePY = Y + YY * Width;
        unsigned char *LinePU = U + YY * Width;
        unsigned char *LinePV = V + YY * Width;
        for (int XX = 0; XX < Width; XX++, LinePS += 3)    
        {
            int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
            LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;
            LinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;
            LinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128;
        }
    }
}

  上述代码和颜色空间系列3: RGB和YUV颜色空间的转换及优化算法中的有所不同,但是应该说更加合理,注意Y_R_WT/U_R_WT/V_R_WT 的书写方式有所不同,这主要是为了保证定点化后的系数不会放大误差,同时注意计算时每一项还增加了HalfV值,这主要是为了保证对计算结果的四舍五入。

  现代的CPU都很强劲,找了一台普通的I5电脑测试,1080P的图平均转换时间为10ms,大概能得到100fps的转换速率,但是从实际应用来讲,在很多场合这个耗时还是多了点,如果需要处理1080P这样的高清视频,考虑到综合因素,单帧的总处理时间不宜超过30ms,所以还有必要进一步提高这个算法的速度。

  凭直觉,上述代码用GPU实现应该有很高的并行性,不知道最后速度会如何。

  俺不会GPU,只会用SSE进行优化,先试一试把,一种最直接最朴实的写法如下:

void RGBToYUVSSE(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)
{
    const int Shift = 15;
    const int HalfV = 1 << (Shift - 1);
    const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;
    const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);
    const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);

    __m128i Weight_YB = _mm_set1_epi32(Y_B_WT), Weight_YG = _mm_set1_epi32(Y_G_WT), Weight_YR = _mm_set1_epi32(Y_R_WT);
    __m128i Weight_UB = _mm_set1_epi32(U_B_WT), Weight_UG = _mm_set1_epi32(U_G_WT), Weight_UR = _mm_set1_epi32(U_R_WT);
    __m128i Weight_VB = _mm_set1_epi32(V_B_WT), Weight_VG = _mm_set1_epi32(V_G_WT), Weight_VR = _mm_set1_epi32(V_R_WT);
    __m128i C128 = _mm_set1_epi32(128);
    __m128i Half = _mm_set1_epi32(HalfV);
    __m128i Zero = _mm_setzero_si128();

    const int BlockSize = 16, Block = Width / BlockSize;
    for (int YY = 0; YY < Height; YY++)
    {
        unsigned char *LinePS = RGB + YY * Stride;
        unsigned char *LinePY = Y + YY * Width;
        unsigned char *LinePU = U + YY * Width;
        unsigned char *LinePV = V + YY * Width;
        for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3)
        {
            __m128i Src1, Src2, Src3, Blue, Green, Red;
            
            Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0));
            Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));
            Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));

            Blue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
            Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));
            Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));

            Green = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
            Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));
            Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));

            Red = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
            Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));
            Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15)));

            //    以上操作把16个连续像素的像素顺序由 B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R 
            //    更改为适合于SIMD指令处理的连续序列 B B B B B B B B B B B B B B B B G G G G G G G G G G G G G G G G R R R R R R R R R R R R R R R R     
            //    并分别保存于三个SSE变量中,并且没任何的冗余数据。

            __m128i Blue16L = _mm_unpacklo_epi8(Blue, Zero);            
            __m128i Blue16H = _mm_unpackhi_epi8(Blue, Zero);            
            __m128i Blue32LL = _mm_unpacklo_epi16(Blue16L, Zero);        
            __m128i Blue32LH = _mm_unpackhi_epi16(Blue16L, Zero);
            __m128i Blue32HL = _mm_unpacklo_epi16(Blue16H, Zero);
            __m128i Blue32HH = _mm_unpackhi_epi16(Blue16H, Zero);

            __m128i Green16L = _mm_unpacklo_epi8(Green, Zero);
            __m128i Green16H = _mm_unpackhi_epi8(Green, Zero);
            __m128i Green32LL = _mm_unpacklo_epi16(Green16L, Zero);
            __m128i Green32LH = _mm_unpackhi_epi16(Green16L, Zero);
            __m128i Green32HL = _mm_unpacklo_epi16(Green16H, Zero);
            __m128i Green32HH = _mm_unpackhi_epi16(Green16H, Zero);

            __m128i Red16L = _mm_unpacklo_epi8(Red, Zero);
            __m128i Red16H = _mm_unpackhi_epi8(Red, Zero);
            __m128i Red32LL = _mm_unpacklo_epi16(Red16L, Zero);
            __m128i Red32LH = _mm_unpackhi_epi16(Red16L, Zero);
            __m128i Red32HL = _mm_unpacklo_epi16(Red16H, Zero);
            __m128i Red32HH = _mm_unpackhi_epi16(Red16H, Zero);

            //    以上操作把三个SSE变量里的字节数据分别提取到12个包含4个int类型的数据的SSE变量里,以便后续的乘积操作不溢出

            __m128i LL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_YG), _mm_mullo_epi32(Red32LL, Weight_YR))),Half), Shift);
            __m128i LH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_YG), _mm_mullo_epi32(Red32LH, Weight_YR))), Half), Shift);
            __m128i HL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_YG), _mm_mullo_epi32(Red32HL, Weight_YR))), Half), Shift);
            __m128i HH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_YG), _mm_mullo_epi32(Red32HH, Weight_YR))), Half), Shift);
            _mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(LL_Y, LH_Y), _mm_packus_epi32(HL_Y, HH_Y)));    //    4个包含4个int类型的SSE变量重新打包为1个包含16个字节数据的SSE变量
            //    以上操作完成:Y[0 - 15] = (Y_B_WT * Blue[0 - 15]+ Y_G_WT * Green[0 - 15] + Y_R_WT * Red[0 - 15] + HalfV) >> Shift;    

            __m128i LL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_UG), _mm_mullo_epi32(Red32LL, Weight_UR))), Half), Shift), C128);
            __m128i LH_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_UG), _mm_mullo_epi32(Red32LH, Weight_UR))), Half), Shift), C128);
            __m128i HL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_UG), _mm_mullo_epi32(Red32HL, Weight_UR))), Half), Shift), C128);
            __m128i HH_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_UG), _mm_mullo_epi32(Red32HH, Weight_UR))), Half), Shift), C128);
            _mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(LL_U, LH_U), _mm_packus_epi32(HL_U, HH_U)));
            //    以上操作完成:U[0 - 15] = ((U_B_WT * Blue[0 - 15]+ U_G_WT * Green[0 - 15] + U_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128;    

            __m128i LL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_VG), _mm_mullo_epi32(Red32LL, Weight_VR))), Half), Shift), C128);
            __m128i LH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_VG), _mm_mullo_epi32(Red32LH, Weight_VR))), Half), Shift), C128);
            __m128i HL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_VG), _mm_mullo_epi32(Red32HL, Weight_VR))), Half), Shift), C128);
            __m128i HH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_VG), _mm_mullo_epi32(Red32HH, Weight_VR))), Half), Shift), C128);
            _mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(LL_V, LH_V), _mm_packus_epi32(HL_V, HH_V)));
            //    以上操作完成:V[0 - 15] = ((V_B_WT * Blue[0 - 15]+ V_G_WT * Green[0 - 15] + V_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128;    
        }
        for (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3)    //    每行剩余的像素按照正常的方式处理
        {
            int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
            LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;
            LinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;
            LinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128;
        }
    }
}

  (显示器不是宽屏的朋友请下载代码在VS里浏览,不然会看晕的)。

  基本上就是按照普通的C的代码翻译到SSE版本,按F5编译执行,测试速度1080P的平均每帧约4.5ms,大概是普通C语言的2.2倍,为什么没到四倍,可能是我写的不到位,但是我认为主要的原因还是这里面有蛮多的数据拆解和数据类型的转换占用了一定的CPU周期。但总的来说还是有相当大的进步的。

  稍微分析下程序以帮助不太懂SSE的朋友。

  前面的几句加载和shffule语句主要是把BGR格式的图像数据拆解为单独的通道数据,其详细的原理见(真心描述的清楚):

  SSE图像算法优化系列八:自然饱和度(Vibrance)算法的模拟实现及其SSE优化(附源码,可作为SSE图像入门,Vibrance算法也可用于简单的肤色调整)。

  中间一部分unpack代码主要的主要作用是把字节数据扩展到32位的int类型数,因为后面的乘法结果是只能用32位才能表达完整的。

  那么后面的_mm_srai_epi32、_mm_add_epi32、_mm_mullo_epi32等就是直接的C代码的翻译了,只是把普通的晕算法用函数名代替了而已。

  从理论上说,上述代码中的 const int Shift = 15;这个常量最大可以取到23,这个与最后的计算精度有一定的关系,但是也是影响不大。

  2.2倍的提速,就这样放弃了吗,不,这不科学,也不是我Imageshop的风格,继续加油。

       每当这个时候,我总是习惯性的再翻一遍Intel的SSE帮助文档,也许那个里面还藏着其他的葵花宝典。

  当我们定位到_mm_madd_epi16这个函数时,虽然他只比普通的add函数多了一个m,但是当看到他下面对该函数功能的解释时,那真的眼前一亮,看下MSDN的说明:

  Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b.

      __m128i _mm_madd_epi16 (__m128i a, __m128i b);
https://files.cnblogs.com/files/Imageshop/YUV_RGB.rar

  基于本文的优化应用于磨皮的效果见:彩色图工程:https://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar

  

  到年底了,没钱回家过年,望各位路过的大神施舍点。