CAPTCHA Preprocessing

Tech 2025-03-03 91

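Preface

I came across something good today and want to share it, translating it along the way. GitHub source: https://github.com/Vykstorm/CaptchaDL Kaggle notebook: https://www.kaggle.com/vykstorm/extracting-words-from-images-with-opencv-part-2

Specifically, it is about preprocessing CAPTCHAs, and the part that impressed me is the character segmentation. Sample CAPTCHA: [image] CAPTCHAs of this kind cannot be segmented with a few simple tricks, yet the author managed it with OpenCV, and the segmentation results are quite good.

The Jupyter notebook and code can be downloaded from Kaggle. Create a local Python virtual environment (conda is recommended), install the packages listed in requeriments.txt from the GitHub repo, and you can test everything locally. (When we first meet something new, there is no need to rush into the theory. We can use someone else's work to satisfy our curiosity first, for example by running all of the code to see the final result. That gives you more drive when you later analyze it step by step: you know you can reach the finish line, so a tiring journey does not matter.)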

Translation of the Jupyter notebook

Importing the libraries

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import os
    import cv2 as cv
    import pickle
    import warnings
    from itertools import product, repeat, permutations, combinations_with_replacement, chain
    from math import floor, ceil

    warnings.filterwarnings('ignore')
    %matplotlib inline

Loading the dataset

If the Kaggle download is slow, you can get the data here instead: https://download.csdn.net/download/Qwertyuiop2016/12575444 (free, no points required). The notebook loads the data hosted on Kaggle; since I wanted to test locally, I downloaded the CAPTCHAs and load them from disk.

    import numpy as np
    import os
    import random
    from PIL import Image

    img_dir = 'I:/samples/samples'  # path to the CAPTCHA images
    os.chdir(img_dir)
    width, height, img_num = 200, 50, 1000
    # Sampling lets you choose how many images to load and shuffles their order;
    # it makes little difference here because no model is being trained.
    imgs = random.sample(os.listdir(), img_num)
    # This shape only exists to match TensorFlow's image input format.
    X = np.zeros((len(imgs), height, width, 1), dtype=np.uint8)
    for index, img_name in enumerate(imgs):
        img = Image.open(img_name)
        img_gray = img.convert('L')  # convert to grayscale
        pix = np.array(img_gray)
        pix = pix.reshape((height, width, 1))  # (height, width) -> (height, width, 1)
        X[index] = pix

Displaying the grayscale image

    img = X[1][:, :, 0]
    plt.imshow(img, cmap='gray')

Inverting black and white

    inverted = 255 - img
    plt.imshow(inverted, cmap='gray');

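Binarizing the image

    ret, thresholded = cv.threshold(inverted, 140, 255, cv.THRESH_BINARY)
    plt.imshow(thresholded, cmap='gray');

Here inverted is the inverted grayscale image array. 140 is the binarization threshold; it can be obtained with the iterative method or with Otsu's algorithm (see my earlier blog post on CAPTCHA binarization). 255 is the maximum pixel value. cv.THRESH_BINARY sets pixels above the threshold to 255 (the value of the third argument) and all other pixels to 0, which is what is usually meant by binarization.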
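To make the thresholding rule concrete, here is a minimal NumPy sketch of the same idea. The function name binary_threshold is made up for illustration, and the sketch only mirrors the "strictly above the threshold" rule described above; it is not OpenCV's implementation.

```python
import numpy as np

def binary_threshold(gray, thresh, maxval=255):
    """Pixels strictly above `thresh` become `maxval`; all others become 0."""
    gray = np.asarray(gray, dtype=np.uint8)
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)

# Tiny hand-made "image" with values on both sides of the threshold.
sample = np.array([[0, 139, 140],
                   [141, 200, 255]], dtype=np.uint8)
binary = binary_threshold(sample, 140)
```

Note that a pixel exactly equal to the threshold (140 here) ends up as 0, not 255.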

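You can compute the threshold with your own implementation of the algorithm, but OpenCV also has Otsu's algorithm built in. It is used like this:

    ret2, th2 = cv.threshold(inverted, 0, 255, cv.THRESH_BINARY + cv.THRESH_OTSU)
    plt.imshow(th2, cmap='gray')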
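For readers curious about what Otsu's method does, the following is a rough NumPy sketch: it tries every gray level as a threshold and keeps the one that maximizes the between-class variance. The function name otsu_threshold is hypothetical, and the returned value may not match OpenCV's result exactly in every case (tie-breaking, for instance, is an assumption here).

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level t that best separates pixels <= t from pixels > t."""
    hist = np.bincount(np.asarray(gray, dtype=np.uint8).ravel(),
                       minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                  # number of pixels with value <= t
    w1 = hist.sum() - w0                  # number of pixels with value > t
    sum0 = np.cumsum(hist * levels)       # intensity sum of the lower class
    sum_all = sum0[-1]
    mean0 = np.divide(sum0, w0, out=np.zeros(256), where=w0 > 0)
    mean1 = np.divide(sum_all - sum0, w1, out=np.zeros(256), where=w1 > 0)
    valid = (w0 > 0) & (w1 > 0)           # both classes must be non-empty
    between = np.where(valid, w0 * w1 * (mean0 - mean1) ** 2, -1.0)
    return int(np.argmax(between))

# Synthetic bimodal image: half dark pixels (10), half bright pixels (200).
demo = np.array([10] * 50 + [200] * 50, dtype=np.uint8).reshape(10, 10)
t = otsu_threshold(demo)
bright_count = int((demo > t).sum())
```

On this synthetic image any threshold between the two clusters is equally good, so the sketch simply returns the first one it finds.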

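Simple removal of noise and interference lines with a median filter

    blurred = cv.medianBlur(thresholded, 3)
    plt.imshow(blurred, cmap='gray')

The second argument is the size of the filter window; it must be an odd number greater than 1. For CAPTCHA processing it is usually 3 or 5, since larger values tend to wipe out features of the characters. [Images: result with size 3; result with size 5] The author of the Kaggle notebook chose 3, although 5 looked better to me. After the next step, however, the two results differ very little, so this step feels incidental rather than important.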
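As an illustration of why a median filter removes specks but can also eat into character features, here is a small NumPy sketch of a 3x3 median filter. The name median_filter_3x3 is made up, and replicating edge pixels at the border is an assumption of this sketch.

```python
import numpy as np

def median_filter_3x3(img):
    """Replace every pixel with the median of its 3x3 neighbourhood."""
    img = np.asarray(img, dtype=np.uint8)
    padded = np.pad(img, 1, mode='edge')          # replicate border pixels
    h, w = img.shape
    # Nine shifted copies of the image, one per position in the 3x3 window.
    windows = np.stack([padded[r:r + h, c:c + w]
                        for r in range(3) for c in range(3)])
    return np.median(windows, axis=0).astype(np.uint8)

noisy = np.zeros((7, 7), dtype=np.uint8)
noisy[1, 1] = 255        # isolated noise pixel
noisy[3:6, 3:6] = 255    # solid 3x3 block standing in for part of a character
cleaned = median_filter_3x3(noisy)
```

The isolated pixel disappears while the block survives, but the corners of the block are rounded off. That corner loss is the kind of feature erosion that gets worse with larger windows.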

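Removing noise and interference lines with morphological operations

Morphological operations: erosion, dilation, opening, closing, and so on.

First, apply an opening:

    kernel = np.array([
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
    ]).astype(np.uint8)
    ex = cv.morphologyEx(blurred, cv.MORPH_OPEN, kernel)
    plt.imshow(ex, cmap='gray');

cv.MORPH_OPEN denotes opening (erosion followed by dilation). I could not find any material on how this kernel was chosen. When I switched to an all-zero kernel the result was about the same, and an all-one kernel gave the same result too. I also tried 3x3 kernels that were all zeros, all ones, or had only the centre set to 1, and again could not see much difference. I hope someone who understands this can explain. Result: [image] Next, dilate the image produced by the operation above:

    kernel2 = np.array([
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]
    ]).astype(np.uint8)
    ex2 = cv.morphologyEx(ex, cv.MORPH_DILATE, kernel2)
    plt.imshow(ex2, cmap='gray');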
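One plausible reading of the directional kernels above is that a single column of ones makes the opening keep only structures at least 5 pixels tall, which would remove thin, roughly horizontal interference lines while keeping vertical strokes. The NumPy sketch below illustrates that effect on a synthetic image. The helper names are made up, the border handling (255 outside the image for erosion, 0 for dilation) is an assumption, and this is not the notebook's code.

```python
import numpy as np

def _neighbourhood_stack(img, kernel, pad_value):
    """Stack one shifted copy of img per non-zero kernel element."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=pad_value)
    h, w = img.shape
    return np.stack([padded[r:r + h, c:c + w]
                     for r in range(kh) for c in range(kw) if kernel[r, c]])

def erode(img, kernel):
    # A pixel survives only if every pixel under the kernel is set.
    return _neighbourhood_stack(img, kernel, 255).min(axis=0)

def dilate(img, kernel):
    # Reflecting the kernel is a no-op for the symmetric kernels used here.
    return _neighbourhood_stack(img, kernel[::-1, ::-1], 0).max(axis=0)

def opening(img, kernel):
    return dilate(erode(img, kernel), kernel)

vertical = np.zeros((5, 5), dtype=np.uint8)
vertical[:, 2] = 1                      # same shape as the opening kernel above

img = np.zeros((12, 9), dtype=np.uint8)
img[1:8, 4] = 255                       # 7-pixel-tall vertical stroke
img[10, :] = 255                        # 1-pixel-thick horizontal line

opened_vertical = opening(img, vertical)
opened_full = opening(img, np.ones((5, 5), dtype=np.uint8))
```

With the vertical kernel the horizontal line is removed and the stroke is kept intact; with a full 5x5 kernel both disappear because neither is 5 pixels wide in both directions. On a real CAPTCHA with thick strokes the difference between kernels can be much less visible, which may be why the experiments described above looked similar.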

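Bitwise AND of the blurred image and the dilated image

AND operation: each pixel value of the two images (grayscale or colour images both work) is combined with a bitwise AND: 1&1=1, 1&0=0, 0&1=0, 0&0=0.

    mask = ex2
    processed = cv.bitwise_and(mask, blurred)
    plt.imshow(processed, cmap='gray')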
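The effect of the AND step can be seen on a single made-up row of pixels. In this sketch the mask is wider than the original strokes (as a dilated image would be) and does not cover a noise pixel that the opening is assumed to have removed; the arrays are purely illustrative.

```python
import numpy as np

# One row of the median-filtered image: a stroke at columns 1-2, noise at column 5.
blurred_row = np.array([[0, 255, 255, 0, 0, 255, 0]], dtype=np.uint8)
# Hypothetical mask after opening + dilation: wider than the stroke, no noise.
mask_row = np.array([[255, 255, 255, 255, 0, 0, 0]], dtype=np.uint8)

# Keep a pixel only where both the mask and the filtered image are set.
processed_row = np.bitwise_and(mask_row, blurred_row)
```

The AND clips the widened mask back to the pixels that were actually set in the filtered image, and the noise pixel outside the mask is dropped.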

Finding contours

    contours, hierarchy = cv.findContours(processed, cv.RETR_CCOMP, cv.CHAIN_APPROX_SIMPLE)
    # keep only top-level contours (those without a parent)
    contours = [contours[k] for k in range(len(contours)) if hierarchy[0, k, 3] == -1]
    # sort the contours from left to right by the x coordinate of their bounding box
    contours.sort(key=lambda cnt: cv.boundingRect(cnt)[0])
    plt.imshow(cv.drawContours(cv.cvtColor(img, cv.COLOR_GRAY2RGB), contours, -1, (255, 0, 0), 1, cv.LINE_4));
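The hierarchy filter is easier to follow with a small hand-written example. Each row of the hierarchy array has the layout [next, previous, first_child, parent], and a parent of -1 marks an outer contour. The hierarchy and bounding boxes below are invented for illustration.

```python
import numpy as np

# Hypothetical hierarchy for 4 contours, shape (1, N, 4):
# each row is [next, previous, first_child, parent].
hierarchy = np.array([[
    [ 2, -1,  1, -1],   # contour 0: outer contour, has a hole (contour 1)
    [-1, -1, -1,  0],   # contour 1: hole inside contour 0
    [ 3,  0, -1, -1],   # contour 2: outer contour
    [-1,  2, -1, -1],   # contour 3: outer contour
]])

# Keep only contours without a parent, i.e. outer contours.
top_level = [k for k in range(hierarchy.shape[1]) if hierarchy[0, k, 3] == -1]

# Hypothetical bounding boxes (left, top, width, height) for the outer contours.
bboxes = {0: (120, 5, 30, 40), 2: (10, 8, 25, 35), 3: (60, 6, 28, 38)}
left_to_right = sorted(top_level, key=lambda k: bboxes[k][0])
```

Dropping contours that have a parent removes the holes inside characters such as "o" or "8", so only one contour per connected blob remains before sorting.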

Drawing a bounding box for each contour found

    contour_bboxes = [cv.boundingRect(contour) for contour in contours]
    img_bboxes = cv.cvtColor(img, cv.COLOR_GRAY2RGB)
    for bbox in contour_bboxes:
        left, top, width, height = bbox
        img_bboxes = cv.rectangle(img_bboxes, (left, top), (left+width, top+height), (0, 255, 0), 1)
    plt.imshow(img_bboxes, cmap='gray');

Two boxes are drawn because the previous step found two contours. The number of elements in the returned contours list gives the count, so there are len(contours) boxes.

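Training a classifier to predict how many characters each box contains

The features are: bounding-box width, bounding-box height, contour area, contour area / (box height * box width), and contour perimeter.

We want a classifier that predicts how many characters a box contains from the five features above; the author uses an SVC classifier. I could not find the code that trains the classifier, though, only an already trained one. Extracting the features:

    contours_features = pd.DataFrame.from_dict({
        'bbox_width': [bbox[2] for bbox in contour_bboxes],
        'bbox_height': [bbox[3] for bbox in contour_bboxes],
        'area': [cv.contourArea(cnt) for cnt in contours],
        'extent': [cv.contourArea(cnt) / (bbox[2] * bbox[3]) for cnt, bbox in zip(contours, contour_bboxes)],
        'perimeter': [cv.arcLength(cnt, True) for cnt in contours]
    })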
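To show what the five features mean geometrically, here is a NumPy sketch that computes them for a simple polygon. The function polygon_features is hypothetical; it uses the shoelace formula for the area and an inclusive pixel count (max - min + 1) for the box size, which I believe matches how cv.boundingRect counts pixels, but treat that as an assumption.

```python
import numpy as np

def polygon_features(points):
    """Width, height, area, extent and perimeter of a closed polygon."""
    pts = np.asarray(points, dtype=np.float64)
    x, y = pts[:, 0], pts[:, 1]
    next_x, next_y = np.roll(x, -1), np.roll(y, -1)   # following vertex, wrapping around
    area = 0.5 * abs(np.sum(x * next_y - next_x * y))  # shoelace formula
    perimeter = float(np.sum(np.hypot(next_x - x, next_y - y)))
    bbox_width = float(x.max() - x.min() + 1)          # inclusive pixel count (assumed)
    bbox_height = float(y.max() - y.min() + 1)
    return {
        'bbox_width': bbox_width,
        'bbox_height': bbox_height,
        'area': float(area),
        'extent': float(area / (bbox_width * bbox_height)),
        'perimeter': perimeter,
    }

# Axis-aligned rectangle whose corner pixels are (0, 0) and (9, 4).
features = polygon_features([(0, 0), (9, 0), (9, 4), (0, 4)])
```

Extent is the fraction of the bounding box that the contour fills. Intuitively, a box holding several touching characters tends to be wider and to have a different extent than a box holding one character, which is what gives the classifier something to work with.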

Load the trained classifier: https://github.com/Vykstorm/CaptchaDL/blob/master/models/.contour-classifier

    with open('I:/contour-classifier', 'rb') as file:
        contour_classifier = pickle.load(file)

Standardize the data (this reduces the influence of features with particularly large values on the result): https://github.com/Vykstorm/CaptchaDL/blob/master/models/.contour-classifier-preprocessor

    with open('I:/contour-classifier-preprocessor', 'rb') as file:
        contour_features_scaler = pickle.load(file)
    contour_features = contour_features_scaler.transform(contours_features[['bbox_width', 'bbox_height', 'area', 'extent', 'perimeter']])
    # Resulting contour_features:
    # array([[ 2.1661931 ,  1.40786863,  2.87483795,  0.11734141,  1.81692393],
    #        [-0.62741894, -0.37829382, -0.6341117 ,  0.72145767, -0.6275891 ]])
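The type of the pickled preprocessor is not shown here. Assuming it behaves like an ordinary standard scaler, the transformation would look roughly like the NumPy sketch below; the function name and the sample numbers are made up.

```python
import numpy as np

def standardize(features):
    """Scale every column to zero mean and unit standard deviation."""
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0            # avoid dividing by zero for constant columns
    return (features - mean) / std

# Made-up rows of (bbox_width, area): the area column is far larger in scale.
raw = np.array([[20.0, 400.0],
                [40.0, 1200.0],
                [60.0, 2000.0]])
scaled = standardize(raw)
```

One difference to keep in mind: a fitted scaler applies the mean and standard deviation learned from its training data, whereas this sketch computes them from the rows it is given.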

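Prediction:

    contour_num_chars = contour_classifier.predict(contour_features)
    # array([4, 1], dtype=uint8)

This matches what we see by eye: four characters in the first box and one in the second.

I won't explain the remaining steps. They cut each box that contains several characters into equal parts, and then pad every resulting character to the same size.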
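The equal-width split and the padding step could look something like the following sketch. The function names split_bbox and pad_to_size are hypothetical, the padding centres the character on a black canvas, and the character is assumed to fit inside the target size; the notebook's own code may differ.

```python
import numpy as np

def split_bbox(left, top, width, height, num_chars):
    """Cut a box into num_chars side-by-side boxes of (nearly) equal width."""
    edges = np.linspace(left, left + width, num_chars + 1).round().astype(int)
    return [(int(edges[i]), top, int(edges[i + 1] - edges[i]), height)
            for i in range(num_chars)]

def pad_to_size(char_img, out_height, out_width):
    """Centre char_img on a black canvas of the requested size."""
    h, w = char_img.shape
    canvas = np.zeros((out_height, out_width), dtype=char_img.dtype)
    y0 = (out_height - h) // 2
    x0 = (out_width - w) // 2
    canvas[y0:y0 + h, x0:x0 + w] = char_img
    return canvas

# A 100-pixel-wide box predicted to hold 4 characters.
boxes = split_bbox(10, 5, 100, 30, 4)
# A dummy 30x25 character crop padded to 40x40.
padded = pad_to_size(np.ones((30, 25), dtype=np.uint8), 40, 40)
```

Splitting by equal width is only an approximation, since characters differ in width, but it needs nothing beyond the predicted character count.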
