PDF (Russian)

Keywords

machine learning, bioinformatics, mutation pathogenicity, Apache Spark, Random Forest, genetic variant classification, ClinVar, personalized medicine.

How to Cite

Machine learning system for mutation pathogenicity prediction. (2025). SMART TECHNOLOGIES JOURNAL, 1(8). https://doi.org/10.62687/STJ.8.1.2025.6

Abstract

The article presents the design of a distributed machine learning system for automatic classification of genetic variant pathogenicity based on ClinVar clinical data. The work is motivated by the need to speed up the interpretation of next-generation sequencing results in clinical practice, where manual analysis of hundreds of thousands of variants takes geneticists weeks.

Architectural solutions for processing large volumes of genetic data with Apache Spark MLlib and ensemble learning methods are investigated. The methods applied were system analysis of biomedical databases, feature engineering for categorical genetic features, cross-validation, and comparative analysis of classification algorithms.

A three-stage methodology was developed: data preparation with normalization and categorization of clinical significance; feature engineering using StringIndexer and OneHotEncoder; and training of three models (Logistic Regression, Random Forest, Gradient Boosted Trees) with hyperparameter optimization through grid search. A recommendation system with five-level variant prioritization (CRITICAL/HIGH/MEDIUM/LOW/MINIMAL) based on pathogenicity probabilities was also designed.
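As an illustration of the five-level prioritization, the following minimal sketch maps a predicted pathogenicity probability to one of the five levels. The abstract does not state the cut-off values, so the thresholds below (0.2, 0.4, 0.6, 0.8) and the function name `priority_level` are assumptions chosen only for the example, not the values used in the article.

```python
from bisect import bisect_right

# Hypothetical cut-offs: the abstract does not give the actual thresholds.
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]
# Levels ordered from lowest to highest priority, as named in the abstract.
LEVELS = ["MINIMAL", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def priority_level(p_pathogenic: float) -> str:
    """Map a pathogenicity probability in [0, 1] to a priority level."""
    if not 0.0 <= p_pathogenic <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_right counts how many thresholds are <= p, which is
    # exactly the index of the matching level.
    return LEVELS[bisect_right(THRESHOLDS, p_pathogenic)]


if __name__ == "__main__":
    for p in (0.05, 0.35, 0.5, 0.75, 0.95):
        print(p, priority_level(p))
```

In a Spark pipeline, a function of this kind would typically be applied to the positive-class probability produced by the trained classifier; how the article wires this step is not described in the abstract.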

The results include a scalable architecture for processing more than 1 million records and a module for automated generation of clinical recommendations.