1 Training the Language Detection Model
### Importing the Libraries
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import warnings
warnings.simplefilter("ignore")
# Loading the dataset
data = pd.read_csv("Language Detection.csv")
print(data.shape)
data.head(10)
# value count for each language
data["Language"].value_counts()
# separating the independent and dependent features
X = data["Text"]
y = data["Language"]
# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y[0:10]
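LabelEncoder assigns integer codes in alphabetical order of the labels, which is what later lets a hardcoded, alphabetically sorted list of language names map predictions back to strings. A minimal sketch on toy labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = ["English", "Arabic", "English", "Hindi"]
encoded = le.fit_transform(labels)

# classes_ holds the labels in sorted order, so code 0 is "Arabic"
print(list(le.classes_))                    # ['Arabic', 'English', 'Hindi']
print(list(encoded))                        # [1, 0, 1, 2]
print(list(le.inverse_transform(encoded)))  # back to the original strings
```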
### Text preprocessing
data_list = []
for text in X:
    # replace punctuation, symbols, and digits with spaces
    text = re.sub(r'[!@#$(),\n"%^*?\:;~`0-9]', ' ', text)
    # strip square brackets (a character class, so both [ and ] are escaped)
    text = re.sub(r'[\[\]]', ' ', text)
    text = text.lower()
    data_list.append(text)
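The cleanup is easy to verify on a single string (using `[\[\]]` as the character class that strips square brackets):

```python
import re

sample = 'Hello, World!! 123 [test]'
cleaned = re.sub(r'[!@#$(),\n"%^*?\:;~`0-9]', ' ', sample)  # punctuation, symbols, digits -> spaces
cleaned = re.sub(r'[\[\]]', ' ', cleaned)                   # square brackets -> spaces
cleaned = cleaned.lower()
print(cleaned)
```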
### Bag of Words
# creating a bag of words using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
print(X.shape)
print(X[0:10])
### Train Test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
### Model creation and Prediction
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
# prediction
y_pred = model.predict(x_test)
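MultinomialNB works directly on these count vectors. A tiny synthetic example (hypothetical counts, not from the dataset) shows the mechanics:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# rows are documents, columns are word counts; class 0 uses the first
# two words heavily, class 1 the last two (synthetic data)
X_toy = np.array([[3, 2, 0, 0],
                  [4, 1, 0, 0],
                  [0, 0, 2, 3],
                  [0, 0, 1, 4]])
y_toy = np.array([0, 0, 1, 1])

clf = MultinomialNB().fit(X_toy, y_toy)
print(clf.predict([[5, 3, 0, 0]]))  # -> [0]
print(clf.predict([[0, 0, 4, 2]]))  # -> [1]
```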
### Evaluating the model
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
print("Accuracy is:", ac)
# classification report
print(cr)
# visualising the confusion matrix
plt.figure(figsize=(15,10))
sns.heatmap(cm, annot = True)
plt.show()
### Model Saving
# saving both cv and model
pickle.dump(cv, open("count_vectorizer.pkl", "wb"))
pickle.dump(model, open("MultinomialNB_model.pkl", "wb"))
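pickle round-trips fitted scikit-learn objects faithfully. A quick sanity check of the idea, using `dumps`/`loads` in memory instead of files:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer

cv_orig = CountVectorizer().fit(["hello world", "bonjour le monde"])
cv_back = pickle.loads(pickle.dumps(cv_orig))  # same idea as dump/load with files

same = (cv_back.transform(["hello world"]).toarray()
        == cv_orig.transform(["hello world"]).toarray()).all()
print(same)  # True
```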
2 Creating a FastAPI Project in PyCharm
Copy the count_vectorizer.pkl and MultinomialNB_model.pkl files saved above into the project directory.
Install the required packages:
fastapi, pydantic, uvicorn (pickle and re are part of the Python standard library and need no installation)
main.py
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import uvicorn
import re
import warnings
warnings.simplefilter("ignore")
# create the FastAPI app
app = FastAPI()
# label order must match LabelEncoder's alphabetical encoding from training
classes = [
    "Arabic",
    "Danish",
    "Dutch",
    "English",
    "French",
    "German",
    "Greek",
    "Hindi",
    "Italian",
    "Kannada",
    "Malayalam",
    "Portuguese",
    "Russian",
    "Spanish",
    "Swedish",
    "Tamil",
    "Turkish"]
# define the prediction function
def predict_pipeline(text):
    # load the model
    with open("MultinomialNB_model.pkl", "rb") as f:
        model = pickle.load(f)
    # load the fitted CountVectorizer
    with open("count_vectorizer.pkl", "rb") as f:
        count_vectorizer = pickle.load(f)
    # preprocessing must match training: replace with spaces, not empty strings
    text = re.sub(r'[!@#$(),\n"%^*?\:;~`0-9]', ' ', text)
    text = re.sub(r'[\[\]]', ' ', text)
    text = text.lower()
    vect = count_vectorizer.transform([text]).toarray()
    pred = model.predict(vect)
    return classes[pred[0]]
class TextIn(BaseModel):
    text: str

class PredictOut(BaseModel):
    language: str

@app.get("/")
def home():
    return {"health_check": "OK", "model_version": "1.0.0"}

@app.post("/predict", response_model=PredictOut)
def predict(payload: TextIn):
    print("payload: ", payload.text)
    language = predict_pipeline(payload.text)
    output = {"language": language}
    return output
if __name__ == '__main__':
    uvicorn.run(app="main:app", host="127.0.0.1", port=5000, reload=True)
3 Testing
Start the project and test it with Postman.
Test URL:
http://127.0.0.1:5000/predict
Test data:
# Test data 1
{
    "text": "hello, my name is Ken."
}
# Test data 2
{
    "text": "Солнце восходит на востоке."
}
# Test data 3
{
    "text": "Το ήλιο ανατέλλει από την ανατολή"
}
# AI-generated; may be inaccurate
English: The sun rises in the east.
Malayalam: കിഴക്കേനാണ് സൂര്യൻ ഉദിച്ചുകൊണ്ടിരിക്കുന്നത്.
Hindi: सूर्य पूर्व में उगता है।
Tamil: சூரியன் கிழக்கில் உதிக்கிறது.
Kannada: ಸೂರ್ಯ ಪೂರ್ವದಲ್ಲಿ ಉದಯಿಸುತ್ತಾನೆ.
French: Le soleil se lève à l'est.
Spanish: El sol sale en el este.
Portuguese: O sol nasce no leste.
Italian: Il sole sorge a est.
Russian: Солнце восходит на востоке.
Swedish: Solen går upp i öster.
Dutch: De zon komt op in het oosten.
Arabic: الشمس تشرق من الشرق.
Turkish: Güneş doğudan doğar.
German: Die Sonne geht auf im Osten.
Danish: Solen går op i øst.
Greek: Ο ήλιος ανατέλλει από την ανατολή.
References:
【1】Deploying Machine Learning Models with FastAPI, Docker, and Heroku - AssemblyAI
【2】Language-Detector
【3】Dataset: Language Detection