How Does Natural Language Processing Check for Spelling Errors? With TensorFlow (2)

2017-05-22 Editor:

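I trained my model on a GPU at floydhub.com (I highly recommend their service), which saved me hours of training time. Even so, tuning the model properly still takes 30-60 minutes per iteration, which is why I limit the data so that each run does not take even longer. This of course lowers the model's accuracy, but since this is just a personal project, I don't mind much.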

    max_length = 92
    min_length = 10

    # Keep only sentences whose length falls within [min_length, max_length]
    good_sentences = []
    for sentence in int_sentences:
        if len(sentence) <= max_length and len(sentence) >= min_length:
            good_sentences.append(sentence)

To track the model's performance, I split the data into a training set and a test set. The test set makes up 15% of the data.

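    from sklearn.model_selection import train_test_split

    # Hold out 15% of the sentences for testing; random_state fixes the shuffle
    training, testing = train_test_split(good_sentences,
                                         test_size = 0.15,
                                         random_state = 2)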
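
To make the effect of the split concrete, here is a minimal standard-library sketch of a shuffled 85/15 hold-out. The helper `split_train_test` and the toy data are hypothetical and only illustrate the idea; this is not how `train_test_split` is implemented, and its rounding and shuffling may differ.

```python
import random

def split_train_test(items, test_fraction, seed):
    """Shuffle a copy of `items` and hold out `test_fraction` of it for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled items become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: 100 fake "sentences", each a list of 10 integer ids
toy_sentences = [[i] * 10 for i in range(100)]
toy_training, toy_testing = split_train_test(toy_sentences, 0.15, seed=2)
print(len(toy_training), len(toy_testing))
```

Fixing the seed makes the split reproducible, which matters when comparing runs with different hyperparameters.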

As in some of my recent projects, I sort the data by length. Each batch then contains sentences of similar length, so less padding is needed and the model trains faster.

    training_sorted = []
    testing_sorted = []

    # Collect sentences length by length, shortest first
    for i in range(min_length, max_length+1):
        for sentence in training:
            if len(sentence) == i:
                training_sorted.append(sentence)
        for sentence in testing:
            if len(sentence) == i:
                testing_sorted.append(sentence)
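
As a rough illustration of why sorting helps, the sketch below checks on toy data that the nested loops give the same order as Python's stable `sorted(..., key=len)`, and counts how much padding each ordering needs. The helper `padding_needed` and the toy sentences are hypothetical.

```python
def padding_needed(sentences, batch_size):
    """Count the pad tokens needed to make every batch rectangular."""
    total = 0
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        longest = max(len(s) for s in batch)
        total += sum(longest - len(s) for s in batch)
    return total

toy = [[1, 2, 3], [4], [5, 6], [7, 8, 9], [10, 11]]

# Same nested-loop approach as above, on the toy data
nested = []
for length in range(1, 4):
    for s in toy:
        if len(s) == length:
            nested.append(s)

# sorted() is stable, so sentences of equal length keep their original order
by_key = sorted(toy, key=len)

print(nested == by_key)
print(padding_needed(toy, 2), padding_needed(by_key, 2))
```

A single `sorted` call scans the data once per comparison pass, whereas the nested loops scan it once for every possible length, so the one-liner is usually the faster choice on large datasets.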

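Perhaps the most interesting and important part of this project is the function that turns sentences into sentences containing errors, which are then used as the input data. The function creates each error in one of three ways:

The order of two characters is swapped (hlelo → hello)

An extra letter is added (heljlo → hello)

A character is left out (helo → hello)

The three kinds of error are equally likely, and the chance that a given character receives an error is 5%. On average, therefore, one character in every 20 contains an error.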
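
The three kinds of error can be reproduced deterministically on plain strings. The helper names below are hypothetical and work on characters rather than on the integer ids the model uses; they only show what each corruption does to "hello".

```python
def swap_adjacent(word, i):
    """Swap the characters at positions i and i + 1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def insert_letter(word, i, letter):
    """Insert `letter` before position i."""
    return word[:i] + letter + word[i:]

def drop_letter(word, i):
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]

print(swap_adjacent("hello", 1))       # hlelo
print(insert_letter("hello", 3, "j"))  # heljlo
print(drop_letter("hello", 2))         # helo
```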

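    import numpy as np

    letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
               'n','o','p','q','r','s','t','u','v','w','x','y','z',]

    def noise_maker(sentence, threshold):
        """Return a copy of `sentence` (a list of character ids) with random errors.
        Each character is left unchanged with probability `threshold`."""
        noisy_sentence = []
        i = 0
        while i < len(sentence):
            random = np.random.uniform(0,1,1)
            # Most characters are copied through unchanged
            if random < threshold:
                noisy_sentence.append(sentence[i])
            else:
                new_random = np.random.uniform(0,1,1)
                # ~33% chance: swap this character with the next one
                if new_random > 0.67:
                    if i == (len(sentence) - 1):
                        # The last character has nothing to swap with;
                        # `continue` skips `i += 1`, so it is drawn again
                        continue
                    else:
                        noisy_sentence.append(sentence[i+1])
                        noisy_sentence.append(sentence[i])
                        i += 1
                # ~33% chance: insert a random lowercase letter before this character
                elif new_random < 0.33:
                    random_letter = np.random.choice(letters, 1)[0]
                    noisy_sentence.append(vocab_to_int[random_letter])
                    noisy_sentence.append(sentence[i])
                # ~33% chance: leave this character out
                else:
                    pass
            i += 1
        return noisy_sentence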
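
Note that `threshold` is the probability of keeping a character, so a 5% error rate corresponds to a threshold of 0.95 (an assumed value here, chosen to match the error rate described above). The rough estimate below treats every character independently and ignores that a swap consumes two characters; the helper names are hypothetical.

```python
def expected_errors(n_chars, threshold):
    """Approximate number of corrupted characters in a sentence of n_chars."""
    return n_chars * (1 - threshold)

def prob_untouched(n_chars, threshold):
    """Approximate probability that a sentence of n_chars gets no error at all."""
    return threshold ** n_chars

threshold = 0.95  # assumed: keep 95% of characters, corrupt about 5%

# The shortest and longest sentence lengths kept by the filter above
for n_chars in (10, 92):
    print(n_chars,
          expected_errors(n_chars, threshold),
          prob_untouched(n_chars, threshold))
```

Under this approximation, short sentences often pass through unchanged, while the longest ones almost always contain at least one error.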

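The last thing I want to show you in this article is how the batches are created. Usually, people create their input data before training their model, which means they have a fixed amount of training data. Here, however, we create new input data while training by applying noise_maker to every batch. This means that in each epoch the target (correct) sentences are fed through noise_maker again and should each receive a new input sentence. With this approach we have, to exaggerate slightly, an unlimited amount of training data.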

    def get_batches(sentences, batch_size, threshold):
        """Yield batches of noisy inputs, clean targets, and their lengths.
        New errors are generated on every call, so each epoch sees new inputs."""
        for batch_i in range(0, len(sentences)//batch_size):
            start_i = batch_i * batch_size
            sentences_batch = sentences[start_i:start_i + batch_size]

            # Inputs: a freshly corrupted copy of every sentence
            sentences_batch_noisy = []
            for sentence in sentences_batch:
                sentences_batch_noisy.append(
                    noise_maker(sentence, threshold))

            # Targets: the clean sentence plus <EOS>. Build a new list instead
            # of appending in place, so the source data does not gain an
            # extra <EOS> on every epoch.
            sentences_batch_eos = []
            for sentence in sentences_batch:
                sentences_batch_eos.append(
                    sentence + [vocab_to_int['<EOS>']])

            pad_sentences_batch = np.array(
                pad_sentence_batch(sentences_batch_eos))
            pad_sentences_noisy_batch = np.array(
                pad_sentence_batch(sentences_batch_noisy))

            # Lengths of the padded target sequences
            pad_sentences_lengths = []
            for sentence in pad_sentences_batch:
                pad_sentences_lengths.append(len(sentence))

            # Lengths of the padded noisy input sequences
            pad_sentences_noisy_lengths = []
            for sentence in pad_sentences_noisy_batch:
                pad_sentences_noisy_lengths.append(len(sentence))

            yield (pad_sentences_noisy_batch,
                   pad_sentences_batch,
                   pad_sentences_noisy_lengths,
                   pad_sentences_lengths)
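
The idea of regenerating the inputs on every epoch can be tried out without TensorFlow. The sketch below is a simplified stand-in, not the project's code: `toy_noise` only drops tokens, `pad_batch` is a hypothetical padding helper, and the `PAD` and `EOS` ids are made up for illustration.

```python
import random

PAD, EOS = 0, 1  # hypothetical token ids, for illustration only

def toy_noise(sentence, threshold, rng):
    """Simplified stand-in for noise_maker: keeps each token with probability threshold."""
    return [tok for tok in sentence if rng.random() < threshold]

def pad_batch(batch):
    """Right-pad every sentence to the length of the longest one in the batch."""
    longest = max(len(s) for s in batch)
    return [s + [PAD] * (longest - len(s)) for s in batch]

def toy_batches(sentences, batch_size, threshold, rng):
    """Yield (noisy inputs, clean targets) for every full batch."""
    for start in range(0, len(sentences) - batch_size + 1, batch_size):
        batch = sentences[start:start + batch_size]
        noisy = [toy_noise(s, threshold, rng) for s in batch]
        targets = [s + [EOS] for s in batch]  # new lists; originals stay untouched
        yield pad_batch(noisy), pad_batch(targets)

sentences = [[5, 6, 7, 8], [9, 10, 11, 12, 13],
             [14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]]
rng = random.Random(0)

# Calling the generator once per epoch re-draws the noise every time
epochs = [list(toy_batches(sentences, 2, 0.8, rng)) for _ in range(2)]
for epoch_i, batches in enumerate(epochs):
    for noisy, targets in batches:
        print(epoch_i, noisy, targets)
```

The targets are identical in both epochs, while the noisy inputs are drawn again each time, which is what makes the supply of training pairs effectively unlimited.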

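That's all for this project! Although the results are encouraging, the model still has its limitations. I would really appreciate it if someone could scale up this model or improve its design! If you do, please post about it in the comments. One idea for a new design would be to apply the latest CNN model from Facebook's AI lab, which achieves state-of-the-art translation results.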

Thanks for reading, and I hope you learned something new!

Source: Towards Data Science

Author: Dave Currie
