Dengesiz bir diyabet veri setinde makine öğrenmesi yöntemlerini kullanarak diyabet hastalığının teşhisi

Bölükbaşı, İsmail Buğra

Please use this identifier to cite or link to this item: http://hdl.handle.net/11452/31034

Title:	Dengesiz bir diyabet veri setinde makine öğrenmesi yöntemlerini kullanarak diyabet hastalığının teşhisi
Other Titles:	Diagnosis of diabetes disease using machine learning methods in an imbalanced diabetes dataset
Authors:	Yağmahan, Betül; Bölükbaşı, İsmail Buğra; Bursa Uludağ Üniversitesi/Fen Bilimleri Enstitüsü/Endüstri Mühendisliği Anabilim Dalı; 0000-0002-9405-0900
Keywords:	Diabetes diagnosis; Type-2 diabetes; Machine learning; Classification; Imbalanced dataset; Resampling methods
Issue Date:	2023
Publisher:	Bursa Uludağ Üniversitesi
Citation:	Bölükbaşı, İ. B. (2023). Dengesiz bir diyabet veri setinde makine öğrenmesi yöntemlerini kullanarak diyabet hastalığının teşhisi. Unpublished master's thesis. Bursa Uludağ Üniversitesi Fen Bilimleri Enstitüsü.
Abstract:	According to World Health Organization (WHO) data, the number of people with diabetes has increased significantly in recent years. If the necessary precautions are not taken, diabetes can cause permanent damage to the body and may even be fatal. For these reasons, medical research on the early and accurate diagnosis of diabetes is growing rapidly. In this study, classification, a machine learning approach, was applied to a real-world dataset to diagnose type-2 diabetes. The aim of the study is to classify diabetes diagnoses as accurately as possible using two data-splitting techniques, three resampling techniques, and six classification methods. The classification models were built in KNIME. The dataset was split into training and test data using a percentage split (70%-30%) and k-fold cross-validation (k=5). Random undersampling (RUS), random oversampling (ROS), and the synthetic minority oversampling technique (SMOTE) were used to address the class imbalance in the diabetes dataset. The classification methods used are logistic regression (LR), naive Bayes (NB), k-nearest neighbor (k-NN), the C4.5 algorithm, random forest (RF), and multilayer perceptron (MLP). Combining the data-splitting techniques, resampling techniques, and classification methods yielded 48 scenarios. All scenarios were compared on accuracy, precision, recall, average F-measure, the kappa statistic, and AUC. Among the percentage-split scenarios, RUS-RF gave the best result with an accuracy of 99.26%, and SMOTE-k-NN gave the worst with an accuracy of 80.74%. Among the k-fold cross-validation scenarios, RUS-C4.5, ROS-RF, and SMOTE-RF gave the best result with an accuracy of 97.55%, and RUS-k-NN gave the worst with an accuracy of 78.62%.
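The resampling and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis workflow, which was built in KNIME; the function names, the seed parameter, and the toy labels are assumptions made for illustration. SMOTE, AUC, and the classifiers themselves are omitted, and the F-measure shown is the positive-class F1 rather than the averaged F-measure reported in the study.

```python
import random


def _group_by_class(rows, labels):
    """Collect rows under their class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """RUS: randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(g) for g in groups.values())
    pairs = [(r, lab) for lab, g in groups.items() for r in rng.sample(g, target)]
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [lab for _, lab in pairs]


def random_oversample(rows, labels, seed=0):
    """ROS: randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(g) for g in groups.values())
    pairs = []
    for lab, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        pairs.extend((r, lab) for r in g + extra)
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [lab for _, lab in pairs]


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1, and Cohen's kappa for two classes."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Agreement expected by chance, from the row and column totals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}
```

In a setup like the one described, resampling would normally be applied to the training portion only, so that the test data keeps the original class distribution; whether the thesis does this is not stated in the abstract.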
URI:	http://hdl.handle.net/11452/31034
Appears in Collections:	Fen Bilimleri Yüksek Lisans Tezleri / Master Degree

Files in This Item:

File	Description	Size	Format
İsmail_Buğra_Bölükbaşı.pdf		4.15 MB	Adobe PDF

This item is licensed under a Creative Commons License
