Lecture-01-Syntax-Tree-and-Language-Model


    Lesson-01 AI Introduction: Syntax Tree and Probability Model

    Given a grammar, how do we generate sentences from it?

        import random

        sentence = """
        句子 = 主 谓 宾
        主 = 你 | 我 | 他
        """

        two_number = """
        num* => num num* | num
        num => 0 | 1 | 2 | 3 | 4
        """

        def two_num():
            # a fixed-length "sentence": exactly two digits
            return num() + num()

        def num():
            # pick one terminal at random
            return random.choice('0 | 1 | 2 | 3 | 4'.split('|')).strip()

        def numbers():
            # recursion: stop with probability 0.5, otherwise append another digit
            if random.random() < 0.5:
                return num()
            else:
                return num() + numbers()

        for i in range(10):
            print(numbers())

    1. A grammar can be implemented by defining very simple functions.

    2. Using recursion, we can generate more complex, "infinitely" long text.

        simple_grammar = """
        sentence => noun_phrase verb_phrase
        noun_phrase => Article Adj* noun
        Adj* => Adj | Adj Adj*
        verb_phrase => verb noun_phrase
        Article => 一个 | 这个
        noun => 女人 | 篮球 | 桌子 | 小猫
        verb => 看着 | 坐在 | 听着 | 看见
        Adj => 蓝色的 | 好看的 | 小小的
        """

        import random

        def adj():
            return random.choice('蓝色的 | 好看的 | 小小的'.split('|')).split()[0]

        def adj_star():
            # Why lambdas instead of a plain random.choice over values? The lambdas
            # delay evaluation: otherwise adj() + adj_star() would be evaluated on
            # every call and the recursion would never terminate.
            return random.choice([lambda: '', lambda: adj() + adj_star()])()

        for i in range(10):
            print(adj_star())

    But here is the problem:

    If we change the grammar, all of the code we have written has to be rewritten. 😦

        number_ops = """
        expression => expression num_op | num_op
        num_op => num op num
        op => + | - | * | /
        num => 0 | 1 | 2 | 3 | 4
        """

        def generate_grammar(grammar_str: str, split='=>'):
            # parse a grammar string into {symbol: [list of candidate expansions]}
            grammar = {}
            for line in grammar_str.split('\n'):
                if not line.strip():
                    continue
                # e.g. "two => num + num"
                expression, formula = line.split(split)
                formulas = [f.split() for f in formula.split('|')]
                grammar[expression.strip()] = formulas
            return grammar

        choice_a_expr = random.choice

        def generate_by_grammar(grammar: dict, target: str):
            if target not in grammar:
                return target  # target is a terminal symbol, not a rule
            expr = choice_a_expr(grammar[target])
            return ''.join(generate_by_grammar(grammar, t) for t in expr)

        def generate_by_str(grammar_str, split, target):
            grammar = generate_grammar(grammar_str, split)
            return generate_by_grammar(grammar, target)

        generate_by_str(number_ops, split='=>', target='expression')

    two => num + num | num - num
    num => 0 | 1 | 2 | 3 | 4
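
    As a quick check, the generic generator can be run directly on this small grammar (a usage sketch; the name two_grammar is just for illustration):

        two_grammar = """
        two => num + num | num - num
        num => 0 | 1 | 2 | 3 | 4
        """

        # prints random expressions such as "2+4" or "0-3"
        for i in range(5):
            print(generate_by_str(two_grammar, split='=>', target='two'))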

        # In Westworld, a "human"'s language could be defined as:
        human = """
        human = 自己 寻找 活动
        自己 = 我 | 俺 | 我们
        寻找 = 找找 | 想找点
        活动 = 乐子 | 玩的
        """

        假如既然 = """
        句子 = if someone state , then do something
        if = 既然 | 如果 | 假设
        someone = one 和 someone | one
        one = 小红 | 小蓝 | 小绿 | 白白
        state = 饿了 | 醒了 | 醉了 | 癫狂了
        then = 那么 | 就 | 可以
        do = 去
        something = 吃饭 | 玩耍 | 去浪 | 睡觉
        """

        # and a "host"'s (receptionist's) language as:
        host = """
        host = 寒暄 报数 询问 业务相关 结尾
        报数 = 我是 数字 号 ,
        数字 = 单个数字 | 数字 单个数字
        单个数字 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        寒暄 = 称谓 打招呼 | 打招呼
        称谓 = 人称 ,
        人称 = 先生 | 女士 | 小朋友
        打招呼 = 你好 | 您好
        询问 = 请问你要 | 您需要
        业务相关 = 具体业务
        具体业务 = 喝酒 | 打牌 | 打猎 | 赌博
        结尾 = 吗?
        """

        for i in range(10):
            print(generate_by_str(假如既然, split='=', target='句子'))

        for i in range(10):
            print(generate_by_str(host, split='=', target='host'))

    How can we generate the most plausible sentence?

    ELIZA

    Data Driven

    Our goal is to build a program that does not need to be rewritten when the input data changes. Generalization.

    AI? Solving problems automatically: once we have found a method, the method itself should not have to change when the input changes.

        simple_programming = '''
        programming => if_stmt | assign | while_loop
        while_loop => while ( cond ) { change_line stmt change_line }
        if_stmt => if ( cond ) { change_line stmt change_line } | if ( cond ) { change_line stmt change_line } else { change_line stmt change_line }
        change_line => /N
        cond => var op var
        op => == | < | >= | <=
        stmt => assign | if_stmt
        assign => var = var
        var => var _ num | words
        words => words _ word | word
        word => name | info | student | lib | database
        nums => nums num | num
        num => 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
        '''

        print(generate_by_str(simple_programming, target='programming', split='=>'))

        def pretty_print(line):
            # utility: split on the /N line markers and add fake indentation
            lines = line.split('/N')
            code_lines = []
            for i, sen in enumerate(lines):
                if i < len(lines) / 2:
                    code_lines.append(i * " " + sen)
                else:
                    code_lines.append((len(lines) - i) * " " + sen)
            return code_lines

        generated_programming = []
        for i in range(20):
            generated_programming += pretty_print(
                generate_by_str(simple_programming, target='programming', split='=>'))

        for line in generated_programming:
            print(line)

    Language Model

    1. Conditional probability

    2. Independent events

    Review: what is conditional probability?

    Suppose that out of 365 days you were late 30 times: Pr(late) = 30/365.

    Now suppose that during that year you had diarrhea 60 times, and on 20 of those occasions you were late:

    Pr(late | diarrhea) = 20 / 60

    = Pr(late & diarrhea) / Pr(diarrhea) = (20 / 365) / (60 / 365) = 20 / 60

    Pr(you are late | a car accident happens in Illinois) = Pr(you are late)

    = Pr(you are late & Illinois accident) / Pr(Illinois accident)

    because the two events are independent: Pr(you are late & Illinois accident) = Pr(you are late) * Pr(Illinois accident)

    Pr(you are late | stomachache & Illinois accident) = Pr(you are late | stomachache) = Pr(late & stomachache) / Pr(stomachache) ~ Count(late and stomachache) / Count(stomachache)
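
    In practice this count-based estimate is all we need. A minimal sketch with hypothetical toy data (the day-by-day records and the 0.33 / 0.05 rates below are invented for illustration):

        import random

        random.seed(0)
        days = 365
        # hypothetical records: sick[i] == 1 means diarrhea on day i
        sick = [1 if random.random() < 60 / 365 else 0 for _ in range(days)]
        # toy assumption: being sick makes you much more likely to be late
        late = [1 if random.random() < (0.33 if s else 0.05) else 0 for s in sick]

        def pr(event):
            return sum(event) / len(event)

        def pr_cond(a, b):
            # Pr(A | B) ~ Count(A and B) / Count(B)
            count_both = sum(1 for x, y in zip(a, b) if x and y)
            return count_both / sum(b)

        print(pr(late))             # ~ Pr(late)
        print(pr_cond(late, sick))  # ~ Pr(late | diarrhea), close to 0.33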

    Consider Pr(其实就和随机森林原理一样), the probability of the sentence "其实 就和 随机森林 原理 一样" ("it's actually the same idea as a random forest"). Decomposing it word by word:

    -> Pr(其实 & 就和 & 随机森林 & 原理 & 一样)
    -> Pr(其实 | 就和 & 随机森林 & 原理 & 一样) * Pr(就和 & 随机森林 & 原理 & 一样)
    -> Pr(其实 | 就和) * Pr(就和 & 随机森林 & 原理 & 一样)
    -> Pr(其实 | 就和) * Pr(就和 | 随机森林 & 原理 & 一样) * Pr(随机森林 & 原理 & 一样)
    -> Pr(其实 | 就和) * Pr(就和 | 随机森林) * Pr(随机森林 & 原理 & 一样)
    -> Pr(其实 | 就和) * Pr(就和 | 随机森林) * Pr(随机森林 | 原理) * Pr(原理 & 一样)
    -> Pr(其实 | 就和) * Pr(就和 | 随机森林) * Pr(随机森林 | 原理) * Pr(原理 | 一样) * Pr(一样)

    Linguists made a simplification:

    $$Pr(sentence) = Pr(w_1 w_2 w_3 w_4) = \prod_i^{n} \frac{\# W_i W_{i+1}}{\# W_{i+1}} * Pr(w_n)$$

    $$language\_model(String) = Probability(String) \in (0, 1)$$

    $$Pr(w_1 w_2 w_3 w_4) = Pr(w_1 \mid w_2 w_3 w_4) * Pr(w_2 \mid w_3 w_4) * Pr(w_3 \mid w_4) * Pr(w_4)$$

    $$Pr(w_1 w_2 w_3 w_4) \approx Pr(w_1 \mid w_2) * Pr(w_2 \mid w_3) * Pr(w_3 \mid w_4) * Pr(w_4)$$
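
    For instance, with made-up values Pr(w_1 | w_2) = 0.1, Pr(w_2 | w_3) = 0.2, Pr(w_3 | w_4) = 0.3 and Pr(w_4) = 0.05, the 2-gram approximation gives Pr(w_1 w_2 w_3 w_4) ≈ 0.1 * 0.2 * 0.3 * 0.05 = 0.0003.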

        import random
        random.choice(range(100))

        filename = '/Users/gaominquan/Downloads/sqlResult_1558435.csv'

        import pandas as pd
        content = pd.read_csv(filename, encoding='gb18030')
        content.head()

        articles = content['content'].tolist()
        len(articles)

        articles[0]

        import re  # regular expressions

        def token(string):
            # we will learn regular expressions in the next course
            return re.findall(r'\w+', string)

        token(articles[0])

        import jieba
        list(jieba.cut('这个是用来做汉语分词的'))

        from collections import Counter
        with_jieba_cut = Counter(jieba.cut(articles[110]))
        with_jieba_cut.most_common()[:10]

        ''.join(token(articles[110]))

        articles_clean = [''.join(token(str(a))) for a in articles]
        len(articles_clean)

    Suppose you have just spent a long time on data preprocessing.

    In AI problems, about 65% of the work is data preprocessing.

    We should build the habit of saving important intermediate results promptly, by writing them to disk.

        with open('article_9k.txt', 'w') as f:
            for a in articles_clean:
                f.write(a + '\n')

        !ls

        import jieba

        def cut(string):
            return jieba.cut(string)

        ALL_TOKEN = cut(open('article_9k.txt').read())

        TOKEN = []
        for i, t in enumerate(ALL_TOKEN):
            if i > 50000:
                break  # homework: raise this to 200,000
            if i % 1000 == 0:
                print(i)
            TOKEN.append(t)

        len(TOKEN)

        from functools import reduce
        from operator import add, mul
        reduce(add, [1, 2, 3, 4, 5, 8])
        [1, 2, 3] + [3, 43, 5]

        from collections import Counter
        words_count = Counter(TOKEN)
        words_count.most_common(100)

        frequencies = [f for w, f in words_count.most_common(100)]
        x = [i for i in range(100)]

        %matplotlib inline
        import matplotlib.pyplot as plt
        plt.plot(x, frequencies)

    An important empirical regularity in NLP (Zipf's law): in a large text corpus, the second most frequent word appears about 1/2 as often as the most frequent word, and in general the n-th most frequent word appears about 1/n as often as the most frequent one.
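
    A quick way to check this is to plot rank against frequency on log-log axes: if Zipf's law holds, the points fall roughly on a straight line with slope -1. A small sketch reusing words_count from above:

        import matplotlib.pyplot as plt

        top = words_count.most_common(100)
        ranks = list(range(1, len(top) + 1))
        freqs = [f for w, f in top]

        # a Zipf-distributed corpus gives an approximately straight line here
        plt.loglog(ranks, freqs)
        plt.xlabel('rank')
        plt.ylabel('frequency')
        plt.show()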

        import numpy as np

    $$Pr(sentence) = Pr(w_1 w_2 w_3 w_4) = \prod_i^{n} \frac{\# W_i W_{i+1}}{\# W_{i+1}} * Pr(w_n)$$

        words_count['我们']

        def prob_1(word):
            # unigram probability: word count over total token count
            return words_count[word] / len(TOKEN)

        prob_1('我们')

        TOKEN = [str(t) for t in TOKEN]
        TOKEN_2_GRAM = [''.join(TOKEN[i:i+2]) for i in range(len(TOKEN[:-2]))]
        TOKEN_2_GRAM[:10]

        words_count_2 = Counter(TOKEN_2_GRAM)

        def prob_2(word1, word2):
            # Pr(word1 | word2), estimated as #(word1 word2) / #(word2)
            if word1 + word2 in words_count_2:
                return words_count_2[word1 + word2] / words_count[word2]
            else:
                # out-of-vocabulary problem: fall back to a small constant
                return 1 / len(words_count)

        prob_2('我们', '在')
        prob_2('在', '吃饭')
        prob_2('用', '手机')
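
    The 1 / len(words_count) fallback above is an ad-hoc fix for the out-of-vocabulary problem. One standard alternative (not what this lecture uses) is add-one, i.e. Laplace, smoothing; a minimal sketch:

        def prob_2_laplace(word1, word2):
            # add-one (Laplace) smoothing: every bigram count is incremented by 1,
            # and the denominator grows by the vocabulary size V, so unseen
            # bigrams get a small but non-zero probability
            V = len(words_count)
            return (words_count_2[word1 + word2] + 1) / (words_count[word2] + V)

        prob_2_laplace('我们', '在')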

    $$Pr(sentence) = Pr(w_1 w_2 w_3 w_4) = \prod_i^{n} \frac{\# W_i W_{i+1}}{\# W_{i+1}} * Pr(w_n)$$

        def get_probability(sentence):
            # multiply the 2-gram probability of each adjacent word pair,
            # then the unigram probability of the last word
            words = list(cut(sentence))
            sentence_pro = 1
            for i, word in enumerate(words[:-1]):
                next_ = words[i + 1]
                sentence_pro *= prob_2(word, next_)
            sentence_pro *= prob_1(words[-1])
            return sentence_pro

        get_probability('小明今天抽奖抽到一台苹果手机')
        get_probability('小明今天抽奖抽到一架波音飞机')
        get_probability('洋葱奶昔来一杯')
        get_probability('养乐多绿来一杯')

        need_compared = [
            "今天晚上请你吃大餐,我们一起吃日料 明天晚上请你吃大餐,我们一起吃苹果",
            "真事一只好看的小猫 真是一只好看的小猫",
            "今晚我去吃火锅 今晚火锅去吃我",
            "洋葱奶昔来一杯 养乐多绿来一杯",
        ]

        for s in need_compared:
            s1, s2 = s.split()
            p1, p2 = get_probability(s1), get_probability(s2)
            better = s1 if p1 > p2 else s2
            print('{} is more possible'.format(better))
            print('-' * 4 + ' {} with probability {}'.format(s1, p1))
            print('-' * 4 + ' {} with probability {}'.format(s2, p2))

    The principle and code implementation of the 2-gram language model

    Because of machine performance and the time limit of the live session, we only used 50,000 tokens.

    If you use more data, you will find that all four of these cases come out correct.

    You will do this in the homework.

    More data, better results.

    Data Driven
