Mastering Natural Language Processing with Python: NLTK and spaCy in Practice, Unlocking New Text-Mining Skills!

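Back when I was still a rookie programmer, my boss suddenly dumped a pile of user review data on me and asked me to analyze its sentiment. I was stunned. Thousands of reviews sat there while I stared blankly at the screen: was I really supposed to read them one by one? Only later did I learn that this is called natural language processing, or NLP for short. At the time I had never even heard the term.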

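When I first picked up NLTK I was completely lost. Installation is easy enough, a single pip install nltk does it, but downloading the corpora is where I got burned.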

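import nltk

# Download the required data packages before first use
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load this resource when tokenizing
nltk.download('stopwords')
nltk.download('vader_lexicon')

The download was so slow it made me question my life choices.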

The most basic operation, tokenization, is actually simple, because NLTK provides a ready-made function:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

text = "I am learning natural language processing, and it is really fun!"

# Tokenize
tokens = word_tokenize(text)
print(tokens)

# Remove stop words and punctuation
stop_words = set(stopwords.words('english'))
filtered_tokens = [word.lower() for word in tokens
                   if word.lower() not in stop_words and word not in string.punctuation]

Chinese text, however, is more trouble.

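NLTK's support for Chinese is limited, so at that point you need to pair it with the jieba segmentation library or switch to spaCy outright.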
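To see why a dedicated segmenter is needed: Chinese text has no spaces between words, so whitespace-based splitting returns the whole sentence as a single token. The toy sketch below uses forward maximum matching with a tiny hand-made vocabulary. It is only an illustration of the idea, not how jieba works internally, and the function name and vocabulary are made up for this example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation against a known vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


sentence = "我在学习自然语言处理"
toy_vocab = {"学习", "自然语言", "处理"}  # hypothetical mini-dictionary

print(sentence.split())                        # whitespace splitting finds no boundaries
print(forward_max_match(sentence, toy_vocab))  # dictionary-based segmentation
```

A real segmenter relies on a far larger dictionary plus statistical models, which is why reaching for jieba is the practical choice.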

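spaCy was a lifesaver!

The first time I processed text with spaCy, its speed and accuracy really impressed me. Installation is slightly more involved, but absolutely worth it: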

# Install spaCy and the small English model first:
#   pip install spacy
#   python -m spacy download en_core_web_sm

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. is looking at buying a startup in San Francisco for $1 billion."
doc = nlp(text)

# Named entity recognition
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

# Part-of-speech tags and lemmas
for token in doc:
    print(f"{token.text} - {token.pos_} - {token.lemma_}")

Run this and it immediately recognizes "Apple Inc." as an organization (ORG), "San Francisco" as a location (GPE), and "$1 billion" as money (MONEY).

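It felt like magic!

I remember nearly jumping out of my chair the first time I saw that output. Work that used to require manual annotation now took just a few lines of code.

Real projects brought plenty of pitfalls, though.

The biggest one was data preprocessing, specifically text cleaning. User input is all over the place: emoji, special characters, HTML tags, you name it.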

import re
import spacy

def preprocess_text(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Drop special characters, keeping only ASCII letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Lowercase and trim surrounding whitespace
    text = text.lower().strip()
    return text

def extract_features(text, nlp):
    # Clean the raw text, then run it through the spaCy pipeline
    clean_text = preprocess_text(text)
    doc = nlp(clean_text)
    # Keep lemmas of tokens that are not stop words or punctuation
    # and are longer than two characters
    keywords = [token.lemma_ for token in doc
                if not token.is_stop and not token.is_punct and len(token.text) > 2]
    return keywords

# Usage example
nlp = spacy.load("en_core_web_sm")
text = "I <b>really</b> love this product!!!"
features = extract_features(text, nlp)
print(features)  # ['love', 'product']

This pipeline has carried me through several projects and has been mostly sufficient.
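One thing to watch with the cleaning step above: the second regex also deletes apostrophes, emoji, and every non-ASCII letter, which can be too aggressive when a later step wants that information. A gentler variant might look like the sketch below, using only the standard library. The name light_clean is hypothetical, and whether to keep punctuation and emoji depends on what your downstream step expects.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def light_clean(text):
    """Remove HTML tags and entities but keep punctuation, emoji, and case."""
    text = TAG_RE.sub(" ", text)        # replace tags with a space so words do not fuse
    text = html.unescape(text)          # turn entities such as &amp; into real characters
    return SPACE_RE.sub(" ", text).strip()  # collapse runs of whitespace


raw = "I <b>really</b> love this product!!! &amp; it's cheap 😍"
print(light_clean(raw))
```

Replacing tags with a space rather than an empty string avoids gluing neighbouring words together when input looks like "good<br>price".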

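Sentiment analysis is the most practical feature of all.

NLTK's VADER sentiment analyzer is especially well suited to social media text, since it can handle emoticons and internet slang: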

from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize the sentiment analyzer (needs the 'vader_lexicon' download)
sia = SentimentIntensityAnalyzer()

texts = [
    "This product is amazing!",
    "I hate this service.",
    "It's okay, nothing special.",
    "OMG!!! Love it so much!",
]

for text in texts:
    scores = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"Positive: {scores['pos']:.2f}, Negative: {scores['neg']:.2f}, Neutral: {scores['neu']:.2f}")
    print(f"Compound: {scores['compound']:.2f}\n")

The compound score is the key number: above 0.05 counts as positive and below -0.05 as negative, with anything in between treated as neutral.
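Turning that threshold rule into labels takes only a few lines. The sketch below is a minimal illustration: the function name label_from_compound is hypothetical, and counting the boundary values of exactly 0.05 and -0.05 as positive and negative is an assumption about edge handling.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative scores, not real analyzer output
for score in (0.64, -0.57, 0.0):
    print(score, label_from_compound(score))
```

In practice you would pass in scores['compound'] from polarity_scores, and you may want to widen the neutral band if too many lukewarm reviews end up labeled.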

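Chinese sentiment analysis needs a different approach, though: you can use SnowNLP or train your own model.

NLTK or spaCy? Here is my advice.

For learning, use NLTK: it is feature-rich, thoroughly documented, and has an active community. For production I lean toward spaCy: it is fast, light on memory, and ships high-quality models.

Recently I used spaCy on an e-commerce project to process several hundred thousand product reviews, extracting keywords and sentiment, and the results were quite good.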

# My production setup
import spacy
from collections import Counter

# Load the model with only the components that are needed
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def analyze_reviews(reviews):
    results = []
    for review in reviews:
        doc = nlp(review)
        # Use adjectives as the key features
        adjectives = [token.lemma_ for token in doc if token.pos_ == "ADJ"]
        results.append(adjectives)
    # Count the most frequent adjectives across all reviews
    all_adjectives = [adj for sublist in results for adj in sublist]
    return Counter(all_adjectives).most_common(10)

This setup ran for half a year and stayed rock solid.
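For review sets of that size, spaCy's nlp.pipe is worth a look, since it processes texts in batches instead of making one call per review. A rough sketch follows. The function name top_adjectives and the batch size of 256 are assumptions for illustration, not values taken from the setup above.

```python
from collections import Counter


def top_adjectives(nlp, reviews, n=10, batch_size=256):
    """Count the most common adjective lemmas across a collection of reviews.

    `nlp` is expected to be a loaded spaCy pipeline whose tagger and
    lemmatizer are still enabled.
    """
    counts = Counter()
    # nlp.pipe yields one Doc per input text and processes the texts in batches
    for doc in nlp.pipe(reviews, batch_size=batch_size):
        counts.update(token.lemma_ for token in doc if token.pos_ == "ADJ")
    return counts.most_common(n)


# Usage (assumes spaCy and en_core_web_sm are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
#   print(top_adjectives(nlp, ["Great phone, terrible battery.", "Great price."]))
```

Updating a single Counter as documents stream through also avoids holding a list of adjectives per review in memory.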

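At the end of the day, tools are only a means to an end; what matters is understanding the business need and choosing the right method. NLP moves fast, so keep learning!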


Posted on 2025-07-10 09:24
Category: Python Development
