Build Your Own AI Language Model with Python and NumPy!

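Artificial intelligence is everywhere these days, and language models are a big part of it. You may have wondered how an AI predicts the next word in a sentence, or even a whole paragraph. In this tutorial we will build a very simple language model without relying on frameworks such as TensorFlow or PyTorch: just plain Python and NumPy.

We will create a bigram model, which predicts the next word in a sentence from the current word alone. We will keep it simple and easy to follow, so you can see how things work without getting buried in detail.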

Step 1: Install the Library

Before we start, make sure you have Python and NumPy ready. If NumPy is not installed yet, install it with the following command:

pip install numpy
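To confirm the installation worked, a quick check along these lines can help. It is only a sanity check and assumes nothing beyond NumPy being importable.

```python
import numpy as np

# Print the installed NumPy version to confirm the import works
print(np.__version__)

# A tiny array operation as a sanity check: 0 + 1 + 2
print(np.arange(3).sum())  # 3
```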

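Step 2: Understand the Basics

A language model predicts the next word in a sentence. We will keep things simple and build a bigram model, which means the model uses only the current word to predict the next one.

We will train the model on a short piece of text. Here is the small sample we will use: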
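To make the idea of a bigram concrete, here is a minimal sketch that lists the word pairs in one sentence. The sentence is a made-up example, not part of the tutorial's corpus.

```python
# Hypothetical example sentence, not part of the tutorial's corpus
sentence = "ai is the future"
tokens = sentence.split()

# Pair every word with the word that follows it
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('ai', 'is'), ('is', 'the'), ('the', 'future')]
```

Each pair is one observation of "current word, next word", which is all a bigram model ever sees.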

import numpy as np

# Sample dataset: a small text corpus
corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

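Step 3: Prepare the Text

First we need to break the text into individual words and build a vocabulary (essentially a list of all the unique words). This gives us something to work with.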

# Tokenize the corpus into lowercase, whitespace-separated words
words = corpus.lower().split()

# Create a vocabulary of unique words (set order is arbitrary, so the
# printed order can differ from run to run)
vocab = list(set(words))
vocab_size = len(vocab)

print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {vocab_size}")

Output

Vocabulary: ['electricity.', 'artificial', 'future', 'industries', 'intelligence', 'ai.', 'future.', 'new', 'the', 'machine', 'is', 'learning', 'of', 'transforming', 'ai', 'shaping', 'and']

Vocabulary size: 17

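Here we convert the text to lowercase and split it on whitespace. We then collect the unique words into a list that serves as our vocabulary. Note that punctuation stays attached to the words, so "ai" and "ai." count as two different vocabulary entries.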
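If you would rather treat "ai" and "ai." as the same word, one possible variation is to strip punctuation from each token and sort the vocabulary so its order is the same on every run. This is a sketch of an alternative, not the tokenization the rest of this tutorial uses; the outputs shown in later steps come from the simpler split above.

```python
import string

corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

# Lowercase, split on whitespace, then strip leading/trailing punctuation
words = [w.strip(string.punctuation) for w in corpus.lower().split()]
words = [w for w in words if w]  # drop tokens that were only punctuation

# Sorting makes the vocabulary order identical on every run
vocab = sorted(set(words))
print(len(vocab))  # 15
```

The vocabulary shrinks from 17 to 15 entries because "ai."/"ai" and "future."/"future" merge.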

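Step 4: Map Words to Numbers

Computers work with numbers, not words. So we map each word to an index and also build the reverse mapping (which will let us turn indices back into words later).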

word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Convert the words in the corpus to indices
corpus_indices = [word_to_idx[word] for word in words]

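In short, we are just turning words into numbers the model can work with. Each word gets its own number: "ai" might become 0 and "learning" might become 1, depending on the order of the vocabulary.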
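The round trip from words to indices and back can be sketched as follows. The three-word vocabulary here is hypothetical and fixed by hand so that the indices are predictable.

```python
# Hypothetical three-word vocabulary, fixed here so the indices are predictable
vocab = ["ai", "is", "learning"]
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Encode a short word sequence, then decode it again
encoded = [word_to_idx[w] for w in ["ai", "is", "learning", "ai"]]
decoded = [idx_to_word[i] for i in encoded]

print(encoded)  # [0, 1, 2, 0]
print(decoded)  # ['ai', 'is', 'learning', 'ai']
```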

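Step 5: Build the Model

Now for the core of it: building the bigram model. We want to work out the probability that one word follows another. To do that, we count how often each word pair (bigram) occurs in the dataset.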

# Initialize bigram counts matrix (rows: current word, columns: next word)
bigram_counts = np.zeros((vocab_size, vocab_size))

# Count occurrences of each bigram in the corpus
for i in range(len(corpus_indices) - 1):
    current_word = corpus_indices[i]
    next_word = corpus_indices[i + 1]
    bigram_counts[current_word, next_word] += 1

# Additive smoothing: add a small constant to every count so that no
# bigram has zero probability (classic Laplace smoothing would add 1)
bigram_counts += 0.01

# Normalize each row of counts to get probabilities
bigram_probabilities = bigram_counts / bigram_counts.sum(axis=1, keepdims=True)

print("Bigram probabilities matrix: ", bigram_probabilities)

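Output (the values below correspond to a smoothing constant of 1, for example 2/18 ≈ 0.1111 for a pair seen once; with 0.01, the pairs that actually occur get much higher probabilities)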

Bigram probabilities matrix: [[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.11111111 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.11111111 0.05555556 0.05555556]
[0.05882353 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353
 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353
 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353]
[0.11111111 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05 0.05 0.1 0.05 0.05 0.05
 0.1 0.1 0.05 0.05 0.05 0.05
 0.05 0.05 0.05 0.05 0.05 ]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05 0.05 0.05 0.05 0.05 0.05
 0.05 0.05 0.15 0.05 0.05 0.05
 0.05 0.1 0.05 0.05 0.05 ]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.11111111 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.11111111 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556]]

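Here is what is going on:

We count how often each word follows another word (those pairs are the bigrams). Then we turn the counts into probabilities by normalizing each row so that it sums to 1. Put simply, if "ai" is often followed by "is", that pair ends up with a higher probability.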
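The normalization step can be seen in isolation on a small matrix. The counts below are invented for a hypothetical three-word vocabulary; only the row-wise division mirrors what the tutorial's code does.

```python
import numpy as np

# Hypothetical counts for a 3-word vocabulary:
# rows = current word, columns = next word
counts = np.array([[0.0, 3.0, 1.0],
                   [2.0, 0.0, 2.0],
                   [1.0, 1.0, 2.0]])

# keepdims=True keeps the row sums as a (3, 1) column,
# so each row is divided by its own total
row_sums = counts.sum(axis=1, keepdims=True)
probs = counts / row_sums

print(probs[0])           # first row: 0, 0.75, 0.25
print(probs.sum(axis=1))  # every row sums to 1
```

Without `keepdims=True` the sums would have shape `(3,)` and broadcasting would divide column by column instead of row by row.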

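Note: the line bigram_counts += 0.01 applies additive smoothing, a small adjustment that avoids zero probabilities for word pairs that never appear in the corpus. (Laplace smoothing in the strict sense adds 1 to every count; adding a smaller constant is the same idea with a lighter touch.) This guarantees that every word pair has a small positive probability, however rare it is, and it prevents a division by zero during normalization for a word that is never followed by anything. Using a small value such as 0.01 strikes a balance between avoiding zeros and not inflating the probabilities of unseen word pairs too much.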
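To see how much the smoothing constant matters, the sketch below compares adding 0.01 with adding 1 on a single row of counts. The helper `smooth_and_normalize` is a hypothetical name introduced for this illustration; the row mimics a word in a 17-word vocabulary that was followed by exactly one other word, once.

```python
import numpy as np

def smooth_and_normalize(row_counts, k):
    """Add k to every count, then divide by the new row total."""
    smoothed = row_counts + k
    return smoothed / smoothed.sum()

# One row of counts over a 17-word vocabulary: a single bigram seen once
row = np.zeros(17)
row[4] = 1.0

p_small = smooth_and_normalize(row, 0.01)  # small additive constant
p_one = smooth_and_normalize(row, 1.0)     # classic Laplace (add-one)

print(round(p_small[4], 3), round(p_small[0], 4))  # 0.863 0.0085
print(round(p_one[4], 3), round(p_one[0], 4))      # 0.111 0.0556
```

With a constant of 1 the seen pair is only twice as likely as any unseen pair, so generated text looks close to random; with 0.01 the seen pair dominates the row.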

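Step 6: Predict the Next Word

Now let's test the model by having it predict the next word for any given word. We do this by sampling from the probability distribution over next words.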

def predict_next_word(current_word, bigram_probabilities):
    # Look up the row of next-word probabilities for the current word
    word_idx = word_to_idx[current_word]
    next_word_probs = bigram_probabilities[word_idx]
    # Sample the next word's index according to those probabilities
    next_word_idx = np.random.choice(range(vocab_size), p=next_word_probs)
    return idx_to_word[next_word_idx]

# Test the model with a word
current_word = "ai"
next_word = predict_next_word(current_word, bigram_probabilities)
print(f"Given '{current_word}', the model predicts '{next_word}'.")

Output

Given 'ai', the model predicts 'ai.'.

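This function takes a word, looks up its row of probabilities, and randomly picks the next word according to them. If you pass in "ai" (lowercase, because the corpus was lowercased; "AI" would raise a KeyError), the most likely next word is "is". Because the choice is sampled and smoothing gives every word a nonzero probability, any vocabulary word can come back, as in the run shown above.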
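Two common variations on sampling are sketched below: always picking the most probable next word, and seeding the random generator so sampled results repeat across runs. The three-word vocabulary and probability matrix are hypothetical stand-ins, and `predict_most_likely` is a name introduced only for this illustration.

```python
import numpy as np

# Hypothetical 3-word setup standing in for the tutorial's vocabulary and matrix
vocab = ["ai", "is", "the"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.6, 0.2, 0.2]])

def predict_most_likely(current_word):
    """Return the highest-probability next word instead of sampling."""
    row = probs[word_to_idx[current_word]]
    return vocab[int(np.argmax(row))]

# A seeded generator makes sampled predictions repeatable across runs
rng = np.random.default_rng(0)
sampled = vocab[rng.choice(len(vocab), p=probs[word_to_idx["ai"]])]

print(predict_most_likely("ai"))  # is
```

Greedy selection is deterministic but tends to loop through the same few words; sampling trades that predictability for variety.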

Step 7: Generate a Sentence

Finally, let's generate a whole sentence! We start from one word and then predict the next word several times in a row.

def generate_sentence(start_word, bigram_probabilities, length=5):
    sentence = [start_word]
    current_word = start_word
    # Repeatedly predict a word and feed it back in as the new current word
    for _ in range(length):
        next_word = predict_next_word(current_word, bigram_probabilities)
        sentence.append(next_word)
        current_word = next_word
    return ' '.join(sentence)

# Generate a sentence starting with "artificial"
generated_sentence = generate_sentence("artificial", bigram_probabilities, length=10)
print(f"Generated sentence: {generated_sentence}")

Output

Generated sentence: artificial ai. of electricity. ai. artificial future learning artificial of ai.

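This function takes a starting word and predicts the next word, then uses that word to predict the one after it, and so on. Before you know it, you have a complete sentence!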
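Because punctuation stays attached to the tokens, a word ending in "." can serve as a natural stopping point. The sketch below stops generating once such a word is produced. The toy vocabulary, the deterministic probability matrix, and the function name `generate_until_period` are all assumptions made for this illustration.

```python
import numpy as np

# Hypothetical toy model: each row puts all probability on one next word
vocab = ["ai", "is", "the", "future."]
word_to_idx = {w: i for i, w in enumerate(vocab)}
probs = np.array([[0.0, 1.0, 0.0, 0.0],   # ai -> is
                  [0.0, 0.0, 1.0, 0.0],   # is -> the
                  [0.0, 0.0, 0.0, 1.0],   # the -> future.
                  [1.0, 0.0, 0.0, 0.0]])  # future. -> ai

def generate_until_period(start_word, max_length=10, seed=0):
    """Sample words until one ends with '.' or max_length words were added."""
    rng = np.random.default_rng(seed)
    sentence = [start_word]
    for _ in range(max_length):
        row = probs[word_to_idx[sentence[-1]]]
        next_word = vocab[rng.choice(len(vocab), p=row)]
        sentence.append(next_word)
        if next_word.endswith("."):
            break
    return " ".join(sentence)

print(generate_until_period("ai"))  # ai is the future.
```

The `max_length` cap still matters: with a real probability matrix there is no guarantee that a sentence-ending word is ever sampled.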

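Summary

That's a simple bigram language model, built from scratch with nothing but Python and NumPy. No fancy libraries were needed, and you now have a basic picture of how an AI predicts text. Feel free to experiment, tweak the code, or extend it with more advanced models.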


Published on 2024-10-22 09:57
Category: Python Development