Quickly Generating Trados Studio XML QA Settings Files with Python

Tech 2023-08-06

Background & Purpose

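The QA Checker built into Trados Studio is quite powerful: it catches common problems in English writing such as spacing and punctuation errors, and it also supports custom regular-expression checks and forbidden-word checks.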

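In Trados, however, regex rules and forbidden words have to be entered one at a time. Settings can be imported or exported as a configuration file in a specific format, but there is no built-in way to bulk-import rules or forbidden words from an external source.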

Being able to bulk-import external rules or forbidden words would cut down the manual work.

Tools

Trados Studio (2014 or later) and a Python environment (the scripts in this post use the xlrd package)

Principle & Method

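QA Checker settings files use the .sdlqasettings extension, but they are really just XML files, and a quick read is enough to understand what the elements mean and how they are laid out. In the forbidden-word section, for example, each entry has the following structure: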
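A minimal sketch of one such entry, assembled in Python from the element names and namespaces that appear in the generator script in this post. The helper name `make_pair` and the sample words are illustrative only, and the sketch assumes that `WrongWord` and `_WrongWord` both carry the wrong word:

```python
from xml.sax.saxutils import escape

NS_I = 'http://www.w3.org/2001/XMLSchema-instance'
NS_QA = 'http://schemas.datacontract.org/2004/07/Sdl.Verification.QAChecker'


def make_pair(index, wrong_word, correct_word):
    """Build one forbidden-word entry (hypothetical helper, not a Trados API)."""
    # Escape &, < and > so the words cannot break the XML.
    wrong, correct = escape(wrong_word), escape(correct_word)
    return (
        f'<Setting Id="WrongWordPairs{index}">'
        f'<WrongWordDef xmlns:i="{NS_I}" xmlns="{NS_QA}">'
        f'<CorrectWord>{correct}</CorrectWord>'
        f'<WrongWord>{wrong}</WrongWord>'
        f'<_CorrectWord>{correct}</_CorrectWord>'
        f'<_WrongWord>{wrong}</_WrongWord>'
        '</WrongWordDef></Setting>'
    )


# Sample words are made up for illustration.
print(make_pair(0, 'electronic connection', 'electrical connection'))
```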

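In this structure, the correct word and the wrong word are user-defined, while the index is generated automatically (starting from 0). The code for multiple check items is concatenated in order, and what comes before and after it corresponds to other settings and can be treated as fixed.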

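Translators often keep terminology in Excel. All that is needed is to fill the wrong word and the correct word from each row of the sheet into the structure above, join the results, and add the fixed head and tail.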
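One detail worth watching when filling terms into the structure: a term containing `&` or `<` would make the file invalid XML. The sketch below, using only the standard library, shows the problem and one way to handle it with `xml.sax.saxutils.escape`. The sample term and the helper `is_well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(fragment):
    """Return True if the fragment parses as XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


raw_term = 'R&D <draft>'  # hypothetical term taken from a spreadsheet cell
unsafe = f'<CorrectWord>{raw_term}</CorrectWord>'
safe = f'<CorrectWord>{escape(raw_term)}</CorrectWord>'

print(is_well_formed(unsafe))      # False
print(is_well_formed(safe))        # True
print(ET.fromstring(safe).text)    # R&D <draft>
```

The same parse check can be run on the finished settings file as a quick sanity test before importing it into Trados.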

import os
from xml.sax.saxutils import escape

import xlrd  # note: xlrd 2.0+ reads only .xls files, not .xlsx


def wordlist(file):
    # Output file name, e.g. terms.xls -> terms.wordlist.sdlqasettings
    filename = os.path.splitext(file)[0] + '.wordlist.sdlqasettings'
    data = xlrd.open_workbook(file)  # open the Excel workbook
    table = data.sheets()[0]  # read the first sheet
    rows = table.nrows  # number of rows in the sheet
    pair_index = 0
    word_list = ''
    for mono_row in range(0, rows):
        # Column 0 holds the wrong word, column 1 the correct word.
        # cell_value() returns the bare cell content; escape() protects &, < and >.
        wrong_word = escape(str(table.cell_value(mono_row, 0)))
        correct_word = escape(str(table.cell_value(mono_row, 1)))
        # One entry per word pair
        mono_pair: str = f'''<Setting Id="WrongWordPairs{pair_index}"><WrongWordDef xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Sdl.Verification.QAChecker"><CorrectWord>{correct_word}</CorrectWord><WrongWord>{wrong_word}</WrongWord><_CorrectWord>{correct_word}</_CorrectWord><_WrongWord>{wrong_word}</_WrongWord></WrongWordDef></Setting>'''
        word_list += mono_pair
        pair_index += 1
    # Fixed head and tail copied from an exported settings file; the three
    # RegExRules entries in it are the author's own sample rules.
    word_list = f'''<?xml version="1.0" encoding="utf-8"?><SettingsBundle><SettingsGroup Id="QAVerificationSettings"><Setting Id="ExcludePerfectMatchSegments">False</Setting><Setting Id="ExcludeLocked">False</Setting><Setting Id="ElementContextExclusionValue">True</Setting><Setting Id="ExclusionStringValue">True</Setting><Setting Id="CheckInconsistencies">True</Setting><Setting Id="CheckRepeatedWords">True</Setting><Setting Id="UneditedSegments">True</Setting><Setting Id="UneditedConfirmed">False</Setting><Setting Id="UneditedNotConfirmed">True</Setting><Setting Id="AbsoluteLengthElements">True</Setting><Setting Id="CheckNumbers">True</Setting><Setting Id="CheckPunctuationDifferences">True</Setting><Setting Id="CheckSpanishPunctuation">True</Setting><Setting Id="CheckPunctuationSpace">True</Setting><Setting Id="PunctuationSpaceCharsValue">:!?;,.)/-*</Setting><Setting Id="CheckMultipleSpaces">True</Setting><Setting Id="CheckMultipleDots">True</Setting><Setting Id="ExtraEndSpace">True</Setting><Setting 
Id="CheckMultipleSpaceSeverity">1</Setting><Setting Id="CheckMultipleDotSeverity">1</Setting><Setting Id="CheckRegEx">True</Setting><Setting Id="RegExSeverity">0</Setting><Setting Id="RegExRules">True</Setting><Setting Id="RegExRulesCount">3</Setting><Setting Id="RegExRules0"><RegExRule xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Sdl.Verification.QAChecker.RegEx"><Description>[参见图#，图#...]中的[参见图#，]要省略不翻</Description><IgnoreCase>false</IgnoreCase><RegExSource></RegExSource><RegExTarget>FIG. [0-9], FIG.|FIG. [0-9][0-9], FIG.</RegExTarget><RuleCondition>TargetOnly</RuleCondition></RegExRule></Setting><Setting Id="RegExRules1"><RegExRule xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Sdl.Verification.QAChecker.RegEx"><Description>电连接用electrical</Description><IgnoreCase>false</IgnoreCase><RegExSource>连接</RegExSource><RegExTarget>electronic</RegExTarget><RuleCondition>TargetAndSource</RuleCondition></RegExRule></Setting><Setting Id="RegExRules2"><RegExRule xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Sdl.Verification.QAChecker.RegEx"><Description>based on</Description><IgnoreCase>false</IgnoreCase><RegExSource></RegExSource><RegExTarget>based [^qo]</RegExTarget><RuleCondition>TargetOnly</RuleCondition></RegExRule></Setting><Setting Id="CheckForbiddenChar">True</Setting><Setting Id="ForbiddenCharsValue">，。；：“”、！？（）【】</Setting><Setting Id="CheckIdenticalSegmentsSeverity">2</Setting><Setting Id="CheckTargetShorterSeverity">2</Setting><Setting Id="CheckTrademarks">True</Setting><Setting Id="TrademarksSymbols">True</Setting><Setting Id="TrademarksSymbols0">®</Setting><Setting Id="TrademarksSymbols1">©</Setting><Setting Id="TrademarksSymbols2">™</Setting><Setting Id="TrademarksSymbols3">(c)</Setting><Setting Id="TrademarksSymbols4">(r)</Setting><Setting Id="TrademarksSymbols5">(tm)</Setting><Setting 
Id="CheckWordList">True</Setting><Setting Id="WordListIgnoreCase">True</Setting><Setting Id="WordListWholeWord">True</Setting><Setting Id="WrongWordPairs">True</Setting><Setting Id="WrongWordPairsCount">{rows}</Setting>{word_list}</SettingsGroup></SettingsBundle>'''
    # The with block closes the file on exit, which flushes the output to disk
    with open(filename, 'w', encoding='utf-8') as xls2txt:
        xls2txt.write(word_list)

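That is how the wordlist settings file is generated. Note that the word list tab in Trados has no import option of its own, so the list can only be brought in by importing QA Checker's global settings file. As a result, the settings file generated by the program above also overwrites every other setting besides the word list.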
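One possible way around the overwrite, sketched here under assumptions rather than tested against Trados: export the current QA Checker settings first, then replace only the word-list entries in that exported text. The sketch assumes the exported file uses the same `Setting` Ids as the template in this post (`WrongWordPairs`, `WrongWordPairsCount`, `WrongWordPairs0`, ...) and that word-list entries contain no nested `Setting` element. The function name `replace_wordlist` is hypothetical.

```python
import re

# Matches the word-list on/off flag, the count, and every numbered entry.
_WORDLIST = re.compile(
    r'<Setting Id="WrongWordPairs(?:Count|\d+)?">.*?</Setting>', re.DOTALL
)
_GROUP_OPEN = '<SettingsGroup Id="QAVerificationSettings">'


def replace_wordlist(exported_xml, entries_xml, count):
    """Swap the word-list entries in exported settings text, keeping the rest."""
    # Drop the old word-list settings.
    stripped = _WORDLIST.sub('', exported_xml)
    block = (
        '<Setting Id="WrongWordPairs">True</Setting>'
        f'<Setting Id="WrongWordPairsCount">{count}</Setting>'
        f'{entries_xml}'
    )
    start = stripped.find(_GROUP_OPEN)
    if start == -1:
        raise ValueError('QAVerificationSettings group not found')
    # Insert the new block just before the group's closing tag.
    end = stripped.index('</SettingsGroup>', start)
    return stripped[:end] + block + stripped[end:]
```

Here `entries_xml` would be the concatenated entries built by the loop in the script above, and `count` the number of word pairs.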

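Similarly, the code below generates a Trados regular-expression settings file from an existing Excel term list. Unlike the word list above, regular expressions can be imported from their own settings file without affecting other settings.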

import os
from xml.sax.saxutils import escape

import xlrd  # note: xlrd 2.0+ reads only .xls files, not .xlsx


def regex_rules(file):
    # Output file name, e.g. terms.xls -> terms.sdlqasettings
    filename = os.path.splitext(file)[0] + '.sdlqasettings'
    data = xlrd.open_workbook(file)  # open the Excel workbook
    table = data.sheets()[0]  # read the first sheet
    rows = table.nrows  # number of rows in the sheet
    reg_index = 0
    reg_ex = ''
    for row_word in range(0, rows):
        # Column 0 holds the source term, column 1 its translation.
        # cell_value() returns the bare cell content; escape() protects &, < and >.
        source = escape(str(table.cell_value(row_word, 0)))
        target = escape(str(table.cell_value(row_word, 1)))
        # Rule 1: the term appears in the source but not in the target (untranslated)
        reg_ex_source_not_target: str = f'''<Setting Id="RegExRules{reg_index}"><RegExRule xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Sdl.Verification.QAChecker.RegEx"><Description>Term "{source}" is not translated</Description><IgnoreCase>true</IgnoreCase><RegExSource>{source}</RegExSource><RegExTarget>{target}</RegExTarget><RuleCondition>SourceNotTarget</RuleCondition></RegExRule></Setting>'''
        reg_ex += reg_ex_source_not_target
        reg_index += 1
        # Rule 2: the term is matched a different number of times (under-translated)
        reg_ex_different_count: str = f'''<Setting Id="RegExRules{reg_index}"><RegExRule xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Sdl.Verification.QAChecker.RegEx"><Description>Term "{source}" has a different match count</Description><IgnoreCase>true</IgnoreCase><RegExSource>{source}</RegExSource><RegExTarget>{target}</RegExTarget><RuleCondition>DifferentCount</RuleCondition></RegExRule></Setting>'''
        reg_ex += reg_ex_different_count
        reg_index += 1
    reg_ex: str = f'''<?xml version="1.0" encoding="utf-8"?><SettingsBundle><SettingsGroup Id="QAVerificationSettings"><Setting Id="RegExRules">True</Setting>{reg_ex}</SettingsGroup></SettingsBundle>'''
    # The with block closes the file on exit, which flushes the output to disk
    with open(filename, 'w', encoding='utf-8') as xls2txt:
        xls2txt.write(reg_ex)
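The script above places each term into `RegExSource` and `RegExTarget` as a raw pattern, so a term such as `C++` or `fig. (a)` would be read as a regular expression rather than as literal text. A minimal sketch of escaping the common metacharacters before the term goes into a rule follows. The helper name `literal_pattern` is hypothetical, and the sketch assumes Trados uses a regex flavor in which a backslash before these characters makes them literal, which is the usual convention.

```python
import re

# Characters that carry special meaning in most regex flavors.
_META = re.compile(r'([\\.^$*+?()\[\]{}|])')


def literal_pattern(term):
    """Backslash-escape regex metacharacters so the term matches literally."""
    return _META.sub(r'\\\1', term)


print(literal_pattern('C++'))       # C\+\+
print(literal_pattern('fig. (a)'))  # fig\. \(a\)
```

XML escaping would still be applied afterwards, since the two steps protect against different things.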

That's all.

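My Python is at the "one toe in the door" beginner level. The step in the code that reads the Excel content and saves it to a text file follows a blog post whose author I had forgotten by the time I finished the code, so there is no credit... Some of the comments were also copied from there, so if you recognize them, please let me know right away.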
