Previously:
To classify text content by logistic regression [DaGuan Cup 1]
https://steemit.com/cn/@littlexiannv/to-classify-text-content-by-logistic-regression-1
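As mentioned last time, I'm taking part in the DaGuan Cup text classification competition. With a logistic regression model, my best accuracy was 0.76. This time I want to see whether an SVM model can do better.
Here's the code: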
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
print('start')
# Load the competition data
df_train = pd.read_csv('./train_set.csv')
df_test = pd.read_csv('./test_set.csv')
# Drop the columns that are not used as features
df_train.drop(columns=['article', 'id'], inplace=True)
df_test.drop(columns=['article'], inplace=True)
#vctorizer = CountVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)
print('vectorizer')
# TF-IDF over unigrams and bigrams of the segmented words
vctorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)
vctorizer.fit(df_train['word_seg'])
x_train = vctorizer.transform(df_train['word_seg'])
x_test = vctorizer.transform(df_test['word_seg'])
# Shift the labels down by 1 for training; they are shifted back before writing the result
y_train = df_train['class'] - 1
print('train and predict')
# Instantiate the classifier (note the parentheses), then fit and predict
svm = LinearSVC()
svm.fit(x_train, y_train)
y_test = svm.predict(x_test)
print('output')
# Restore the original label range and keep only the id and class columns
df_test['class'] = y_test.tolist()
df_test['class'] = df_test['class'] + 1
df_result = df_test.loc[:, ['id', 'class']]
df_result.to_csv('./result.csv', index=False)
print('end')
As you can see, compared with the logistic regression code, only a small part actually changed:
svm = LinearSVC()
svm.fit(x_train, y_train)
y_test = svm.predict(x_test)
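The swap is this small because scikit-learn classifiers share the same fit/predict interface. Here is a minimal sketch of that idea on a tiny made-up corpus (the documents and labels below are invented for illustration and are not the competition data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Made-up, space-separated tokens standing in for the 'word_seg' column
docs = ["a1 b2 c3", "a1 b2 d4", "x7 y8 z9", "x7 y8 w6"]
labels = [0, 0, 1, 1]

# Same TF-IDF features for both models
x = TfidfVectorizer().fit_transform(docs)

# Both estimators expose fit() and predict(), so only the constructor changes
predictions = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(x, labels)
    predictions[type(model).__name__] = model.predict(x).tolist()

print(predictions)
```

Predicting on the training documents here only shows that the two models are interchangeable in code; it says nothing about which one scores better on real data.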
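To save time, I started with LinearSVC. It ran for about 30 minutes. So how did it do?
The score actually dropped, by zero point something.
Next, I should try other SVM models.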
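One possible way to try other SVM variants before spending a submission is to compare them with cross-validation. This is only a rough sketch under my own assumptions: the toy corpus, the candidate list, and the C values are all made up for illustration, not tuned settings for this competition. Also note that kernel SVC training time grows much faster with the number of samples than LinearSVC, so it may be slow on the full training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Made-up corpus standing in for the 'word_seg' column
docs = ["a1 b2 c3", "a1 b2 d4", "a1 c3 d4", "b2 c3 d4",
        "x7 y8 z9", "x7 y8 w6", "x7 z9 w6", "y8 z9 w6"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

x = TfidfVectorizer().fit_transform(docs)

# Hypothetical candidates: two regularization strengths and one kernel SVM
candidates = {
    "LinearSVC C=0.1": LinearSVC(C=0.1),
    "LinearSVC C=1": LinearSVC(C=1.0),
    "SVC rbf": SVC(kernel="rbf"),
}

# Mean accuracy over 2 stratified folds for each candidate
scores = {
    name: cross_val_score(model, x, labels, cv=2).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(name, round(score, 3))
```

On the real data, x and labels would be x_train and y_train from the script above, and a larger cv value would give a more stable estimate.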
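That's it for this post. If you have questions about this competition or about machine learning, feel free to leave a comment and chat with me.
Finally, if you liked this post, please give it an upvote to support me. Thanks♪(・ω・)ノ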