C++ Pitfalls: 19. Filtering English and Chinese punctuation, and converting between string and wstring

Technology 2022-07-13

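Suppose we want to strip the punctuation marks from a piece of text.

English (ASCII) punctuation can be recognized with ispunct from <cctype>. Chinese punctuation needs a conversion step first:

C++ uses std::string to handle strings.

A string is a narrow (byte) string: its bytes hold ASCII or a multi-byte encoding such as the system ANSI code page (for example GBK on a Chinese Windows system). Many Windows API functions, on the other hand, work with wide-character Unicode (UTF-16) strings.

So whenever Chinese characters are involved, we often have to convert between narrow multi-byte strings and wide Unicode strings. std::string does not support this conversion well, so we have to write the conversion code ourselves:

Related post: converting between string and wstring, and the problem of displaying Chinese text with wstring.

The answer comes first so that you can borrow it directly. If anything is unclear, read the explanation that follows: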

#include <windows.h>   // MultiByteToWideChar, WideCharToMultiByte
#include <cwctype>     // iswpunct
#include <iostream>
#include <string>
using namespace std;

string wstring2string(const wstring& wstr);
wstring string2wstring(const string& str);

int main()
{
    // Strip punctuation. The literal is assumed to be stored in the
    // system ANSI code page (CP_ACP), e.g. GBK on a Chinese Windows system.
    string str01("a,2.ch、，。1"), str02;
    // Convert to wide characters.
    // Alternative for UTF-8 input:
    // wstring_convert<codecvt_utf8<wchar_t>> conv;
    // wstring wstr01 = conv.from_bytes(str01);
    wstring wstr01 = string2wstring(str01);
    wstring wstr02;
    // Keep every character that is not punctuation.
    for (wchar_t temp_c : wstr01)
    {
        if (!iswpunct(temp_c))
        {
            wstr02.push_back(temp_c);
        }
    }
    // Convert back.
    // str02 = conv.to_bytes(wstr02);
    str02 = wstring2string(wstr02);
    cout << str02;
    return 0;
}

string wstring2string(const wstring& wstr)
{
    string result;
    if (wstr.empty()) return result;
    const int srcLen = static_cast<int>(wstr.size());
    // First call: ask how many bytes the converted text needs.
    int len = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), srcLen, NULL, 0, NULL, NULL);
    if (len <= 0) return result;
    // Second call: convert straight into the string's own buffer.
    result.resize(len);
    WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), srcLen, &result[0], len, NULL, NULL);
    return result;
}

wstring string2wstring(const string& str)
{
    wstring result;
    if (str.empty()) return result;
    const int srcLen = static_cast<int>(str.size());
    // First call: ask how many wide characters the converted text needs.
    int len = MultiByteToWideChar(CP_ACP, 0, str.c_str(), srcLen, NULL, 0);
    if (len <= 0) return result;
    // Second call: convert straight into the wstring's own buffer.
    result.resize(len);
    MultiByteToWideChar(CP_ACP, 0, str.c_str(), srcLen, &result[0], len);
    return result;
}

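Background:

A character is the general term for all kinds of letters and symbols, including the scripts of different countries, punctuation marks, graphic symbols, digits, and so on.

A character set is a collection of characters. There are many character sets, each containing a different number of characters. Common ones include ASCII, GB2312, BIG5, GB18030, and Unicode.

For a computer to process the text of these character sets accurately, the characters have to be encoded so that the computer can recognize and store them.

Chinese has a very large number of characters, and it is further split into Simplified and Traditional Chinese, which follow different writing rules. Computers, however, were originally designed around single-byte English characters, so encoding Chinese characters is the technical foundation of exchanging information in Chinese.

The commonly used Chinese character encodings are the following: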

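GB2312.

The Chinese national standard character set for Simplified Chinese. The characters it contains cover 99.75% of usage frequency, which basically meets the needs of processing Chinese characters on a computer.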

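BIG5.

This character set is used in Taiwan, China. It was created in 1984 by Taiwan's Institute for Information Industry together with five software companies: 宏碁 (Acer), 神通 (MiTAC), 佳佳, 零壹 (Zero One), and 大众 (FIC), hence the name "Big Five" code.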

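Unicode.

Supports the exchange, processing, and display of written text in the many different languages of today's world. In the UTF-16 encoding used by Windows wide strings, English letters and the common Chinese characters each take two bytes, while characters outside the Basic Multilingual Plane take four. Unicode 12.1.0 was released on May 7, 2019, and newer versions have been published since.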
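To make the two-byte figure concrete, here is a minimal sketch of how a single code point maps to UTF-16 code units. The function name encode_utf16 is made up for this illustration, and the input is assumed to be a valid code point (not a surrogate, not above 0x10FFFF).

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, i.e. two bytes.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: a surrogate pair, i.e. four bytes.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

Calling encode_utf16(U'a') or encode_utf16(0x3002) (the Chinese full stop) yields one code unit, while a code point such as 0x20000 yields two.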

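UTF-8.

One of the ways of using Unicode. UTF stands for Unicode Transformation Format, meaning that Unicode is converted into a particular format. UTF-8 is a variable-length encoding that uses one to four bytes per character. It makes it easy for different computers to exchange text in different languages over a network, and it lets Unicode text pass correctly through existing systems that process single bytes.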
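The byte layout of UTF-8 can be sketched in a few lines. The helpers below (utf8_sequence_length and count_code_points, both names invented for this example) read the length of a sequence from its lead byte and count the characters in a string. They assume the input is well-formed UTF-8 and do no validation.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: most Chinese characters
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count code points by skipping continuation bytes (10xxxxxx).
// Assumption: the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the five-byte string "a,。" (the Chinese full stop is the three bytes E3 80 82), count_code_points reports 3, which is why a byte-by-byte loop over a string cannot treat Chinese punctuation as single characters.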

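GB18030.

Software released on the Chinese market must comply with this standard. It is compatible with Unicode 3.0 and covers the "CJK Unified Ideographs Extension A" block of Unicode's extended repertoire. It is also compatible with the earlier national character encoding standards (GB2312, GB13000.1).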

Filtering English punctuation:

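#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // Strip punctuation.
    string str01, str02;
    getline(cin, str01);
    for (char temp_c : str01)
    {
        // ispunct needs a value representable as unsigned char, so cast first.
        if (!ispunct(static_cast<unsigned char>(temp_c))) // is it an English punctuation mark?
        {
            str02 += temp_c;
        }
    }
    cout << str02;
    return 0;
}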

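To find all punctuation (both Chinese and English) in a UTF-8 string, the following methods can be used as a reference:

Method 1:

1. Use the facilities in <codecvt> to convert the UTF-8 encoded string into a wstring of wide wchar_t characters.
2. Then use the wide-character functions iswpunct and iswspace (if whitespace should also count as punctuation) to recognize them.

A concrete code example follows. It reads the original UTF-8 encoded text stored in in.txt, filters out punctuation and whitespace, and writes the result to out.txt: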

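#include <codecvt>   // codecvt_utf8 (deprecated since C++17)
#include <cwctype>   // iswpunct, iswspace
#include <fstream>
#include <iostream>
#include <locale>    // wstring_convert
#include <string>
using namespace std;

int main()
{
    // The original UTF-8 text is stored in in.txt.
    ifstream infile("in.txt");
    // The text with punctuation removed is written to out.txt.
    ofstream outfile("out.txt");
    // Check that both files were opened.
    if (!infile)
    {
        cout << "Can not open file in.txt" << endl;
        return -1;
    }
    if (!outfile)
    {
        cout << "Can not open file out.txt" << endl;
        return -1;
    }
    // Conversion object between UTF-8 and wchar_t.
    wstring_convert<codecvt_utf8<wchar_t>> conv;
    // Read the file line by line; the loop stops when getline fails.
    string s;
    while (getline(infile, s))
    {
        // Convert to wide characters.
        wstring ws = conv.from_bytes(s);
        wstring nws;
        // Filter the punctuation and whitespace out of each line.
        // Note: outside Windows, iswpunct may only recognize non-ASCII
        // punctuation after a suitable locale has been set.
        for (wchar_t ch : ws)
        {
            if (!iswpunct(ch) && !iswspace(ch))
            {
                nws.push_back(ch);
            }
        }
        // Convert the filtered text back to UTF-8 encoded bytes.
        string ns = conv.to_bytes(nws);
        // Write it to the output file.
        outfile << ns;
    }
    // Close the files.
    infile.close();
    outfile.close();
    return 0;
}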

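Method 2:

1. First convert the UTF-8 string into a wstring of wchar_t characters.

This can be done with the Windows function MultiByteToWideChar, with C++11's codecvt, or with similar facilities.

Implementation of the string-to-wstring and wstring-to-string functions: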

#include <windows.h>   // MultiByteToWideChar, WideCharToMultiByte
#include <string>
using namespace std;

// Note: CP_ACP is the system ANSI code page. For UTF-8 input, pass CP_UTF8 instead.
string wstring2string(const wstring& wstr)
{
    string result;
    if (wstr.empty()) return result;
    const int srcLen = static_cast<int>(wstr.size());
    // First call: ask how many bytes the converted text needs.
    int len = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), srcLen, NULL, 0, NULL, NULL);
    if (len <= 0) return result;
    // Second call: convert straight into the string's own buffer.
    result.resize(len);
    WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), srcLen, &result[0], len, NULL, NULL);
    return result;
}

wstring string2wstring(const string& str)
{
    wstring result;
    if (str.empty()) return result;
    const int srcLen = static_cast<int>(str.size());
    // First call: ask how many wide characters the converted text needs.
    int len = MultiByteToWideChar(CP_ACP, 0, str.c_str(), srcLen, NULL, 0);
    if (len <= 0) return result;
    // Second call: convert straight into the wstring's own buffer.
    result.resize(len);
    MultiByteToWideChar(CP_ACP, 0, str.c_str(), srcLen, &result[0], len);
    return result;
}

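2. Then use the ispunct function template from the <locale> library to recognize punctuation marks.

Its usage is much like the C version of ispunct, except that it takes a character-type template parameter and a locale argument, so it can use the locale to recognize symbols beyond the English ones.

On Windows with Chinese as the system language, it generally works without setting a locale explicitly: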

locale loc;
wchar_t c = L'。';
cout << boolalpha << ispunct(c, loc) << endl; // true

On Linux, recognizing punctuation other than English may require naming a locale explicitly:

locale loc("en_US.UTF-8");
wchar_t c = L'。';
cout << boolalpha << ispunct(c, loc) << endl; // true
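Constructing a named locale throws std::runtime_error when that locale is not installed, so a defensive version of the snippet above might look like the sketch below. The helper names locale_or_classic and is_punct_in are invented for this example, and falling back to the classic "C" locale is only one possible policy: in that locale only ASCII punctuation is likely to be recognized.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Try to construct the named locale. If the system does not provide it,
// fall back to the classic "C" locale instead of letting the exception escape.
std::locale locale_or_classic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Thin wrapper around the <locale> version of ispunct for wide characters.
bool is_punct_in(wchar_t c, const std::locale& loc)
{
    return std::ispunct(c, loc);
}
```

A caller could write is_punct_in(L'。', locale_or_classic("en_US.UTF-8")) and treat a false result as "unknown" when the fallback was taken.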

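Because locale configuration differs from platform to platform (some environments have no usable locale at all), if portability matters a lot, it is better to convert to UTF-16/UTF-32 first and then filter with regular expressions.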
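As a rough, locale-independent illustration of the convert-then-filter idea, the sketch below decodes UTF-8 to UTF-32 by hand, drops punctuation, and encodes the result again. It swaps the regular expression for a small hand-written table of code point ranges. That table, like every function name here, is an assumption made for this example and is deliberately incomplete. The decoder assumes well-formed input and simply skips bytes it cannot interpret.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points.
// Assumption: input is well-formed; invalid bytes are skipped, not reported.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { ++i; continue; }             // stray continuation or invalid byte
        if (i + extra >= in.size()) break;  // truncated sequence at the end
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Encode UTF-32 code points back into UTF-8.
std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical, approximate punctuation table (not a full Unicode category lookup).
bool is_punct_code_point(char32_t cp)
{
    return (cp >= 0x21 && cp <= 0x2F) || (cp >= 0x3A && cp <= 0x40)
        || (cp >= 0x5B && cp <= 0x60) || (cp >= 0x7B && cp <= 0x7E)  // ASCII
        || (cp >= 0x2010 && cp <= 0x2027) || (cp >= 0x2030 && cp <= 0x205E)  // dashes, quotes, ellipsis
        || (cp >= 0x3001 && cp <= 0x3003) || (cp >= 0x3008 && cp <= 0x3011)
        || (cp >= 0x3014 && cp <= 0x301F)                                    // CJK punctuation
        || (cp >= 0xFF01 && cp <= 0xFF0F) || (cp >= 0xFF1A && cp <= 0xFF20)
        || (cp >= 0xFF3B && cp <= 0xFF40) || (cp >= 0xFF5B && cp <= 0xFF65); // fullwidth forms
}

// Remove every code point that the table above classifies as punctuation.
std::string strip_punctuation_utf8(const std::string& utf8)
{
    std::u32string kept;
    for (char32_t cp : utf8_to_utf32(utf8)) {
        if (!is_punct_code_point(cp)) kept.push_back(cp);
    }
    return utf32_to_utf8(kept);
}
```

Passing the UTF-8 bytes of the sample string "a,2.ch、，。1" to strip_punctuation_utf8 gives "a2ch1". Because the classification lives in one function, the range table can later be replaced by a regular expression or a full Unicode property lookup without touching the conversion code.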
