International Journal of Data Science

OptionMC: A Python Package for Monte Carlo Pricing of European Options

2025-04-26T22:30:16+00:00

This article presents OptionMC, a Python package designed for educational purposes that implements Monte Carlo methods for European option pricing. We describe the package's architecture and demonstrate its application through systematic testing against established Black-Scholes analytical solutions. The implementation supports both standard Monte Carlo estimation and variance reduction via antithetic variates, allowing examination of convergence patterns and computational efficiency. Our results suggest that Monte Carlo estimates converge toward analytical solutions as the number of iterations increases, with convergence behavior generally consistent with theoretical expectations. Analysis of parameter sensitivity indicates the package appropriately captures fundamental pricing relationships, including volatility effects, time decay, and moneyness considerations. The distributional characteristics of simulated stock prices and option payoffs align reasonably well with theoretical predictions. While OptionMC primarily serves pedagogical objectives rather than high-performance applications, it offers a transparent framework that may benefit students and researchers seeking to understand the practical implementation of option pricing algorithms through Monte Carlo techniques.
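
The comparison described in this abstract can be illustrated with a minimal sketch. It is not OptionMC's actual interface, which the abstract does not show: the function names `black_scholes_call` and `mc_call`, their parameters, and the sample inputs are all hypothetical. The sketch assumes a non-dividend-paying stock following geometric Brownian motion under the risk-neutral measure.

```python
import math
import random


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(x):
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


def mc_call(s0, k, r, sigma, t, n_paths, antithetic=False, seed=0):
    """Monte Carlo estimate of a European call price (hypothetical helper).

    With antithetic=True, each normal draw z is paired with -z and the two
    discounted payoffs are averaged, which usually lowers the variance.
    """
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        payoff = max(s0 * math.exp(drift + vol * z) - k, 0.0)
        if antithetic:
            mirrored = max(s0 * math.exp(drift - vol * z) - k, 0.0)
            payoff = 0.5 * (payoff + mirrored)
        total += payoff
    return math.exp(-r * t) * total / n_paths


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the article.
    args = (100.0, 100.0, 0.05, 0.2, 1.0)
    print("Black-Scholes:", black_scholes_call(*args))
    for n in (1_000, 10_000, 50_000):
        print(n, mc_call(*args, n, seed=1), mc_call(*args, n, antithetic=True, seed=1))
```

Running the script prints the analytical price next to the plain and antithetic estimates for increasing path counts, which is one simple way to observe the convergence behavior the abstract refers to.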

Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering

2025-12-05T11:46:03+00:00

Sarcasm is common on social media, yet difficult for machines to interpret. Its meaning often relies on conversational tone, speaker intent, or situational contrast, signals that are not directly visible in plain text. This study investigates how far one can go in sarcasm detection using only classical machine learning techniques and hand-crafted feature engineering, without relying on neural architectures or contextual information. Using a 100,000-comment stratified subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF representations with simple stylistic features such as length, punctuation use, and uppercase ratios. Four classical classifiers are evaluated: logistic regression, linear support vector machines, multinomial Naive Bayes, and random forests. Despite the context-free design, logistic regression and Naive Bayes reach F1-scores of approximately 0.57 on sarcastic comments, demonstrating that classical approaches capture part of the underlying signal. The full code is included for reproducibility.
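
The stylistic features mentioned in this abstract (length, punctuation use, uppercase ratios) might be computed along the following lines. This is a sketch under assumptions, not the study's published code: the function name `stylistic_features`, the feature names, and the exact definitions (for example, measuring the uppercase ratio over letters only) are illustrative choices.

```python
import string


def stylistic_features(comment):
    """Return simple surface-level features for one comment (hypothetical helper)."""
    n_chars = len(comment)
    letters = [c for c in comment if c.isalpha()]
    n_upper = sum(1 for c in letters if c.isupper())
    n_punct = sum(1 for c in comment if c in string.punctuation)
    return {
        "n_chars": n_chars,                      # comment length in characters
        "n_words": len(comment.split()),         # whitespace-delimited tokens
        "punct_ratio": n_punct / n_chars if n_chars else 0.0,
        "upper_ratio": n_upper / len(letters) if letters else 0.0,
        "n_exclaim": comment.count("!"),         # exclamation marks
    }


if __name__ == "__main__":
    # Made-up example comment, not drawn from SARC 2.0.
    print(stylistic_features("Oh GREAT, another Monday!!"))
```

In a pipeline like the one described, such features would typically be concatenated with the TF-IDF vectors before being passed to a classifier.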

Sentiment Analysis of Instagram User Comments related to the Inauguration of Mr. Prabowo Subianto as President of the Republic of Indonesia Using Natural Language Processing

2026-01-07T12:20:50+00:00

This study applies Natural Language Processing (NLP) techniques to analyze sentiment expressed in Instagram comments about Prabowo Subianto's inauguration as president of Indonesia. The dataset consists of user-generated comments, preprocessed with the Sastrawi stemmer for Indonesian. Preprocessing includes text cleaning, stemming, and stopword removal, so that the analysis rests on the most relevant linguistic elements. A logistic regression model is trained to classify each comment as positive or negative, using TF-IDF (Term Frequency-Inverse Document Frequency) for feature extraction. The results are promising, particularly in identifying positive remarks that celebrate the new president's inauguration, and the model demonstrates robust accuracy in interpreting sentiment in online discourse around significant political events. The study underscores the role of natural language processing in understanding public sentiment surrounding major political events. Its findings shed light on public opinion and provide a basis for future sentiment analysis of Indonesian social media.
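
To make the TF-IDF feature extraction step concrete, the following sketch computes TF-IDF weights by hand for a toy corpus. It uses one common smoothed variant of the inverse document frequency with L2 normalization. The abstract does not state which TF-IDF settings the study used, so this choice, the function name `tfidf`, and the two sample comments are assumptions for illustration. Stemming and stopword removal are omitted.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Assumed weighting: raw term count * (ln((1 + n) / (1 + df)) + 1),
    followed by L2 normalization of each document vector.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {term: freq * idf[term] for term, freq in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


if __name__ == "__main__":
    # Toy comments invented for this example, not taken from the dataset.
    for vec in tfidf(["selamat pak presiden", "selamat bertugas"]):
        print(vec)
```

Terms that occur in fewer documents receive larger weights than terms shared across the corpus, which is the property a downstream classifier such as logistic regression relies on.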

Hybrid Deep Learning for Spatiotemporal Traffic Forecasting: Integrating LSTM, Transformer, and Graph Convolutional Networks on the METR-LA Dataset

2026-01-07T23:30:44+00:00

Accurate traffic prediction in large cities such as Los Angeles is increasingly necessary as cities expand and more vehicles are added to the roads. Using the METR-LA dataset, this study proposes a hybrid deep learning architecture that combines temporal and spatial modeling techniques to improve the accuracy and scalability of traffic flow predictions. The dataset consists of multivariate time series data from 207 loop detectors that record traffic speeds at five-minute intervals. This study evaluates five model configurations: Long Short-Term Memory (LSTM), Transformer-based TSFormer, a combination of LSTM and TSFormer, Spatio-Temporal Graph Convolutional Network (STGCN), and a model combining STGCN and TSFormer. Three performance metrics, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE), were used to assess how well each model captures complex temporal and spatial relationships. Our results show that the LSTM+TSFormer hybrid model consistently outperforms all other models across all three metrics. This model has the lowest MAE (0.0624) and RMSE (0.1204), suggesting that it is better at learning both patterns that develop over time and patterns that change rapidly. STGCN-based models capture spatial dependencies well, and their performance improves further when combined with attention-based TSFormer modules. The hybrid models introduced in this study overcome major limitations, including the narrow receptive range of recurrent networks and the inflexible spatial structures assumed in graph-based methods. This work offers important perspectives for developing forecasting models that are not only accurate and scalable but also transparent and adaptable. Future work may explore dynamic graph construction and multimodal input integration to further enhance adaptability in real-world applications.
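
For reference, the three metrics named in this abstract can be computed as in the sketch below. The function name `regression_errors` and the sample values are hypothetical, and the abstract does not say how the study aggregated errors across sensors and time steps, so a flat list of paired values is assumed.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MAE, MSE, and RMSE for paired sequences (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


if __name__ == "__main__":
    # Made-up values: one large miss among otherwise exact predictions.
    print(regression_errors([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```

Because RMSE squares each error before averaging, a few large misses raise it more than they raise MAE, so RMSE is never smaller than MAE on the same data. If the reported RMSE of 0.1204 is the square root of the MSE on the same evaluation set, the corresponding MSE would be about 0.0145.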

Machine Learning-Based Non-Communicable Disease Prediction Evaluating the Impact of Hypertension, Diabetes, and Lifestyle Factors on Stroke Risk

2025-04-11T13:53:55+00:00

Chronic diseases such as diabetes, stroke, and heart disease are major challenges for the global health system. Data-driven risk prediction for these diseases is important for supporting more precise and effective medical decisions. This study aims to evaluate the main factors contributing to the incidence of diabetes, stroke, and heart disease using logistic regression analysis. The data come from health sources and include demographic variables, lifestyle factors, and health indicators. Logistic regression was used to identify variables significantly associated with each health condition studied. The significance of each risk factor was assessed using p-values, regression coefficients, and confidence intervals. The results showed that age, high blood pressure, cholesterol levels, and body mass index (BMI) contributed significantly to the risk of diabetes, stroke, and heart disease. Physical activity and alcohol consumption were negatively associated with risk, while smoking did not show strong significance in the model. These findings confirm that certain lifestyle factors and health conditions significantly affect the risk of chronic disease. The implications of this research can inform data-driven prevention and early intervention strategies in the health sector.
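
The way a logistic regression coefficient is usually read alongside its p-value and confidence interval can be sketched as follows. The function name `odds_ratio_summary` and the example coefficient and standard error are hypothetical, not values from the study. The sketch assumes a Wald test with a normal approximation.

```python
import math


def odds_ratio_summary(coef, std_err, z_crit=1.96):
    """Summarize one logistic regression coefficient (hypothetical helper).

    Assumes a Wald z-test. z_crit=1.96 gives an approximate 95% interval.
    """
    z = coef / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "odds_ratio": math.exp(coef),
        "ci_low": math.exp(coef - z_crit * std_err),
        "ci_high": math.exp(coef + z_crit * std_err),
        "z": z,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a coefficient of ln(2) doubles the odds.
    print(odds_ratio_summary(math.log(2.0), 0.1))
```

An odds ratio above 1 indicates higher odds of the outcome as the predictor increases, and a ratio below 1 indicates lower odds. A confidence interval that contains 1 corresponds to a coefficient that is not significant at the chosen level.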