Using a hybrid model to detect earnings management for Polish public companies

This paper analyses the role of non-financial variables in the detection of earnings management in Poland. Previous research on earnings management in Poland concentrated on the use of the Beneish and Roxas models. The sample comprises 63 non-financial Polish companies listed on the Warsaw Stock Exchange for the years 2010-2021. The author uses the hybrid model with elements of decision trees and logistic regression as a proxy for earnings management detection. The results indicate that using a hybrid model increases the accuracy more than standard methods such as decision trees and logistic regression do. Accordingly, inclusion of non-financial variables related to the shareholding structure and the audit increases model accuracy and has a significant impact on the construction of the hybrid model. The findings suggest that using only financial variables worsens model accuracy. The author makes a significant contribution to accounting literature by providing new empirical evidence on the importance of non-financial variables in earnings management detection and their impact on model construction.


INTRODUCTION
Based on research conducted by the Association of Certified Fraud Examiners (ACFE, 2020), the majority of fraud schemes involve asset misappropriation (86%), corruption (43%), and, least commonly, financial statement fraud (10%), although the last one is the most harmful and costliest category of occupational fraud. Financial statement fraud or earnings management is a serious challenge to market participants' confidence in financial information; it is estimated to cost firms a significant amount of money and is viewed as unacceptable, illegitimate, and illegal corporate conduct (Rezaee, 2005). In general, financial statement fraud techniques work by improperly recognising revenue and overstating assets or understating expenses and liabilities (Beasley et al., 2010).
In the Polish legal system, no legal act refers to the definition of financial statement fraud. In such cases, serious objections from auditors or processes initiated by various regulators resulting in the imposition of penalties may be the only clear evidence that the financial statements have been manipulated. The Polish Financial Supervision Authority (UKNF Board) is one of the bodies ensuring proper functioning, stability, security, transparency, and confidence in the financial market and that the interests of market participants are protected. The UKNF Board also imposes financial or legal sanctions in connection with noncompliance with the International Financial Reporting Standards (IFRS) guidelines. In previous research on earnings management in Poland, the authors used the Beneish (1999) or Roxas (2011) models in empirical analyses (Golec, 2019; Comporek, 2020; Hołda, 2020). However, previous studies also had some limitations: some excluded the control group selection from the analysis (Comporek, 2020), others included only eight companies in the analysis (Hołda, 2020), while yet others classified companies that received an adverse or disclaimer opinion by the auditors as manipulators (Golec, 2019).
Many authors apply traditional methods to detect earnings management, such as logistic regression. In recent years, many researchers have attempted to use data mining because of its superiority in terms of forecasting after inputting large amounts of data for machine learning. Data mining is an analytical tool used to handle complicated data analysis and can solve the main shortcomings of the traditional statistical analysis methods by overcoming the limitations of data sets and avoiding a high classification error rate (Yao et al., 2019). In my study, I use a hybrid of the decision tree model and logistic regression. Using a hybrid model approach provides a higher predictive accuracy than traditional methods (Steinberg & Cardell, 1998; Brezigar-Masten & Masten, 2012; Łapczyński, 2014). In the first step, a decision tree is built on the independent variables using 10-fold cross-validation, so that each leaf captures an interaction between the ratios. In the second step, a logistic regression is estimated on the set of statistically significant independent variables obtained from stepwise regression with backward selection and 10-fold cross-validation, complemented by an artificial variable that encodes the classification from the root node of the tree. I use stepwise regression with backward selection because adding too many variables to the logistic regression may cause overfitting of sample data, model instability, or difficulties in applying the model to an external data set. Several studies have focused on identifying significant indicators in fraud detection; the number of statistically significant variables in their models ranges from 4 to 35.
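The two-step procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the study was carried out in R, whereas the sketch uses Python with scikit-learn; the data are synthetic; the column names `om`, `rt`, and `shares_one` are placeholders; and the stepwise selection step is omitted. A depth-one tree supplies the root-node split, which is turned into a binary variable and passed to a logistic regression together with the remaining variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def make_synthetic_sample(n_pairs=63, seed=0):
    """Synthetic stand-in for a matched sample (NOT the paper's data)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_pairs)  # 0 = control firm, 1 = alleged manipulator
    # Assumption for illustration: manipulators have a lower operating margin.
    om = rng.normal(0.05, 0.10, y.size) - 0.15 * y
    rt = rng.normal(5.0, 2.0, y.size)
    shares_one = rng.uniform(0.05, 0.80, y.size)
    return np.column_stack([om, rt, shares_one]), y


def fit_hybrid(X, y):
    """Step 1: root split from a tree. Step 2: logit on other variables plus the split dummy."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
    feature = int(stump.tree_.feature[0])        # index of the root-node variable
    threshold = float(stump.tree_.threshold[0])  # its critical value

    below = X[:, feature] <= threshold
    # Code the dummy as 1 on the side of the split with the higher share of manipulators.
    leaf = below if y[below].mean() >= y[~below].mean() else ~below

    others = np.delete(X, feature, axis=1)       # drop the raw root-node variable
    design = np.column_stack([others, leaf.astype(float)])
    # Note: scikit-learn's LogisticRegression applies L2 regularisation by default,
    # so coefficients differ from an unpenalised logit.
    logit = LogisticRegression(max_iter=1000).fit(design, y)
    return feature, threshold, design, logit


if __name__ == "__main__":
    X, y = make_synthetic_sample()
    feature, threshold, design, logit = fit_hybrid(X, y)
    print(f"root split: column {feature} at {threshold:.3f}")
    print(f"in-sample accuracy: {logit.score(design, y):.3f}")
```

Replacing the raw root-node variable with the dummy is only one of the specifications the paper reports; the dummy can equally be appended to a separately selected set of regressors.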
In recent years, several empirical studies have revealed a significant relationship between non-financial indicators and financial statement fraud (Beasley, 1996; Skousen et al., 2008; Yuan et al., 2008; Brazel et al., 2009; Johnson et al., 2009; Amara et al., 2013; Jan, 2018; Nindito, 2018; Yao et al., 2019; Subair et al., 2020). For instance, Brazel et al. (2009) found that substantial differences between financial statement data and non-financial indicators should serve as a red flag to auditors and a tipping point for assigning forensic experts to the engagement. Skousen et al. (2008) also discovered that non-financial variables improve the prediction of financial statement fraud models.
My analysis is based on a sample of 63 public companies listed on the Warsaw Stock Exchange (WSE) that were involved, according to the UKNF Board, in alleged instances of earnings management over the period 2010-2021. Each fraudulent company was matched with a control firm based on firm size, financial year, and industry. The classifiers used in the study were logistic regression and a decision tree. I use R-squared as a measure of goodness of fit and accuracy as the metric for evaluating the classification performance of each classifier.
This study contributes to the literature on the detection of financial statement fraud in several ways. First, evidence suggests that a hybrid model improves model accuracy and goodness of fit more than standard models. Therefore, I can determine that combining the elements of the models will give better results than using standard fraud detection methods. Secondly, the results show that to detect financial statement fraud it is necessary to include non-financial indicators. The inclusion of variables representing the company's shares being held by the Management and Supervisory Board, the shareholding ratio of the largest shareholder, and the use of unqualified audit opinion increase the likelihood of identifying earnings management. Thirdly, when authors build predictive models to detect financial fraud, they should include non-financial variables in the first step, not as additional model parameters. I thus contribute to the literature by providing evidence that more directly explains the impact of non-financial variables under a constructed financial fraud model. Finally, I am the first, to the best of my knowledge, to use methods other than the Beneish or Roxas model to detect financial statement fraud in Polish public companies listed on the WSE.
The rest of this paper is organized as follows. Section 2 presents the literature review. Section 3 describes the methodology. Section 4 presents the empirical results. Finally, Section 5 concludes.

LITERATURE REVIEW
The research on earnings management prediction contributes to understanding factors that can be used to predict fraud. In prior research studies, authors most often used nonlinear regression such as logit and probit models to detect earnings management (Dechow et al., 1996; Beasley, 1996; Beneish, 1999; Spathis et al., 2002; Skousen et al., 2008; Johnson et al., 2009; Dechow et al., 2011; Amara et al., 2013; Kanapickiene & Grundiene, 2015; Ozcan, 2016; Ozdagoglu et al., 2017; Pazarskis et al., 2017; Nindito, 2018; Mohammadi et al., 2020). In the logit and probit regressions, the coefficients of the explanatory variables do not influence the different values of the indices in the fraudulent and control companies. Logistic regression is most often used if the researchers' goal is only to identify the variables that are important in detecting fraudulent financial statements. Research that uses nonlinear regression more often uses ratio analysis than nonfinancial variables as the method of determining financial statement fraud.
Table 1 presents studies that use non-linear regression to detect financial fraud. Most of the research concerns the financial markets of the United States and European countries and uses a small number of independent variables in the regression. Moreover, in most analyzed research, the model accuracy reaches between 85 and 93%. In contrast to nonlinear regressions, the data mining methods enable the analysis of large and complex data sets. Data mining methods allow for analyses of a larger number of independent variables, which increases the probability of detecting fraudulent financial statements. Moreover, data mining methods are frequently implemented for financial forecasting to identify market trends.
Decision trees, next to logistic regression, are one of the most popular methods of detecting financial statement fraud. Decision trees are a type of data mining tool and can handle continuous data or nonparametric data for classification. The choice of dividing conditions is based on the quantity and attributes of the data as well as the Gini index (Chen et al., 2014). In decision trees, each node represents a test of an attribute and each branch represents an outcome of the test. In this way, the tree attempts to divide observations into mutually exclusive subgroups. The goodness of a split is based on the selection of the attribute that best separates the sample (Kirkos et al., 2007). The attributes are chosen in terms of the goodness of a split and the sample is divided into subsets until all the training data are correctly classified. The biggest advantage of decision trees is the interpretability of the rules generated from the model (Hajek & Henriques, 2017). Decision trees provide a hierarchical decision model and are easy to interpret. However, the decision tree model generated may be complex, which may be due to overfitting and memorizing the training data, which reduces the generalizability of the resulting model (Ata & Seyrek, 2009).
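To make the splitting criterion concrete, the sketch below computes the Gini impurity of a binary-labelled sample and searches a single variable for the threshold that minimises the weighted impurity of the two resulting subsets. It is written in Python purely for illustration; the function names and the toy values in the usage example are assumptions of this sketch and do not come from the studies cited.

```python
def gini_impurity(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p0^2 - p1^2."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one variable."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    # Start from the impurity of the unsplit sample; a split must improve on it.
    best_threshold, best_score = None, gini_impurity(list(labels))
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
        if score < best_score:
            # Place the threshold midway between the two neighbouring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


if __name__ == "__main__":
    # Hypothetical ratio values; 1 = fraudulent firm, 0 = control firm.
    ratio = [-0.20, -0.10, -0.08, 0.02, 0.05, 0.10]
    fraud = [1, 1, 1, 0, 0, 0]
    print(best_split(ratio, fraud))
```

A full tree applies this search to every candidate variable at every node and recurses on the resulting subsets, which is also where the overfitting risk mentioned above arises if growth is not restricted.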
Table 2 describes studies that use decision trees to detect earnings management. Most of the research concerns the financial markets of the United States and Asian countries and uses fewer independent nonfinancial variables and more financial variables in the analysis than logistic regression analysis. Moreover, in most analyzed research, the model accuracy reaches between 80 and 95%. On the other hand, the number of observations included in the analysis is similar to that of logistic regression research.

METHODOLOGY
In the Polish legal system, there are no legal acts that refer to the definition of financial statement fraud. In such a case, the only clear evidence that financial statements have been manipulated may be serious reservations of auditors or proceedings initiated by various regulators resulting in the imposition of penalties. The UKNF Board is one of the bodies ensuring proper functioning, stability, security, transparency, and confidence in the financial market and that the interests of market participants are protected. The UKNF Board also imposes financial or legal sanctions in connection with non-compliance with the IFRS guidelines.
I identified instances of alleged earnings management by companies listed on the WSE that received a monetary fine from the UKNF Board in the context of compliance with International Accounting Standards (IAS) or IFRS principles. My sample includes 63 public companies involved in alleged instances of earnings management during the period 2010-2021. The 63 fraudulent companies are matched with 63 control firms. I use a matched pair of samples whereby each company is matched with a corresponding control firm based on:
• Firm size, where a non-fraudulent firm was considered similar if total assets were within +/-30% of total assets for the fraudulent firm in the fraud year,
• Financial year, where annual reports for non-fraudulent firms were available for the same time period as the fraudulent firm,
• Industry, where firms were reviewed to identify a non-fraudulent firm within the same three-digit Standard Industrial Classification (SIC) as the fraudulent firm. If no match was found, two-digit codes were used.
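The three matching criteria above can be expressed as a small filter. The sketch below is a hypothetical Python illustration, not the procedure actually used to build the sample: the `Firm` record, the function name, and the tie-breaking rule (choosing the eligible firm closest in total assets) are assumptions, since the text does not say how a control firm was chosen when several candidates qualified.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Firm:
    name: str
    year: int            # financial year of the annual report
    sic: str             # SIC code stored as a string, e.g. "2834"
    total_assets: float


def find_control(fraud: Firm, candidates: Sequence[Firm],
                 tolerance: float = 0.30) -> Optional[Firm]:
    """Pick a control firm for `fraud`, or None if no candidate qualifies."""
    lower = fraud.total_assets * (1 - tolerance)
    upper = fraud.total_assets * (1 + tolerance)
    # Size and financial-year criteria.
    eligible = [c for c in candidates
                if c.year == fraud.year and lower <= c.total_assets <= upper]
    # Industry criterion: try three-digit SIC first, then fall back to two digits.
    for digits in (3, 2):
        same_industry = [c for c in eligible if c.sic[:digits] == fraud.sic[:digits]]
        if same_industry:
            # Assumed tie-break: the candidate closest in total assets.
            return min(same_industry,
                       key=lambda c: abs(c.total_assets - fraud.total_assets))
    return None


if __name__ == "__main__":
    # Entirely hypothetical firms.
    fraud = Firm("F", 2015, "2834", 100.0)
    pool = [
        Firm("A", 2015, "2836", 125.0),  # same 3-digit SIC, within 30%
        Firm("B", 2015, "2899", 105.0),  # same 2-digit SIC only
        Firm("C", 2014, "2834", 100.0),  # wrong year
        Firm("D", 2015, "2831", 140.0),  # too large
    ]
    print(find_control(fraud, pool))
```

In a full matched-pair design each selected control firm would also be removed from the pool so that it is not matched twice; that bookkeeping is left out here.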
The choice of variables used as candidates for participation in the input vector was based on previous research related to earnings management topics. I collected data from the annual reports of the companies. To effectively detect financial statement fraud, researchers use not only financial variables but also nonfinancial variables that are known to have some predictive ability in detecting financial statement fraud (Johnson et al., 2008; Skousen et al., 2008; Pai et al., 2011; Amara et al., 2013; Chen et al., 2014; Chen, 2016; Jan, 2018; Nindito, 2018; Yao et al., 2019). I therefore divide the non-financial variables into three groups related to the company's board of directors, the shareholding structure, and the audit. Table 3 includes definitions of variables used in this paper.

Descriptive statistics
The data mining tool used in this paper is R. In Table 4, I report the descriptive statistics of the continuous variables for fraudulent and control firms. All financial variables are winsorised at 5%. Fraudulent companies perform worse than non-fraudulent ones: they have weaker profitability indicators, higher leverage, lower liquidity, and lower asset rotation ratios.
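Winsorising, as applied to the financial variables here, replaces extreme observations with the nearest retained value instead of dropping them. The following Python sketch shows one common convention and rests on an assumption: it treats "at 5%" as 5% in each tail, which the text does not specify (the alternative reading is 5% in total, split across the two tails). The function name is illustrative, and the study itself was carried out in R.

```python
def winsorize(values, lower=0.05, upper=0.05):
    """Cap the lowest `lower` and highest `upper` share of observations.

    Assumption: the number of capped observations in each tail is the
    floor of share * n, and capped values are set to the nearest
    observation that is kept.
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    k_low = int(lower * n)    # observations capped in the lower tail
    k_high = int(upper * n)   # observations capped in the upper tail
    low_cap = ordered[k_low]
    high_cap = ordered[n - 1 - k_high]
    # Keep the original order of the observations.
    return [min(max(v, low_cap), high_cap) for v in values]


if __name__ == "__main__":
    sample = list(range(1, 21))  # 20 hypothetical observations
    print(winsorize(sample))
```

With 20 observations and 5% per tail, exactly one value is capped at each end, so the sample size and the ordering of the remaining observations are unchanged.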
In Figure 1, I present a correlation matrix for the continuous variables and show only correlations that are significant at the 1% level. Overall, most correlation coefficients are either insignificant or weak. However, RT is negatively and significantly associated with margin ratios. CR and QR are highly and positively correlated, as are OM, GM, and NM.

Figure 1. Correlation matrix with statistically significant levels
Blue circles indicate a positive correlation coefficient and red circles indicate a negative correlation that is significantly different from zero at the 1% level. Source: Authors' results

Regression results for financial variables
I chose to follow a 10-fold cross-validation approach. Each subset is tested sequentially by adopting the classifier trained on the remaining nine subsets. Cross-validation accuracy is the percentage of data that is correctly classified. I define a Type I error as classifying a fraudulent firm as a non-fraudulent one and a Type II error as classifying a non-fraudulent firm as a fraudulent one. Type I errors may result in unacceptable audits that damage reputation and lead to huge economic losses. Type II errors may lead to additional investigation. I set the cost of a Type I error to 2 and the cost of a Type II error to 1. Figure 2 shows the statistically significant financial variables, with their critical values, used to construct the decision tree under these rules. The decision tree analysis selected five variables: OM, DR, IT, RT, and WC_A. OM is the first split point. This means that the relationship between operating profit and sales is critical in predicting financial statement fraud. Twenty-nine percent of the total sample (37 observations) had an OM value below the critical value of -0.059 and were classified as manipulators; all 37 were fraudulent firms. The model correctly classified 92.1% of the total sample, 84.1% of the fraudulent firms, and 100.0% of the non-fraudulent firms. Table 5 reports the results from logistic regressions with only financial variables. Columns (1) and (3) show the results for the variables from the stepwise regression, and columns (2) and (4) for the variables from the decision tree. Column (3) adds Leaf, a binary variable derived from the root node that equals 1 if OM is smaller than -0.059 and 0 otherwise. In column (4), I keep only the significant variables from the decision tree and replace OM with Leaf. The results show that all the variables from the stepwise regression in column (1) are statistically significant.
By contrast, for the decision tree model only the coefficients for OM, WC_A, and RT in column (2) are statistically significant at the 1 or 5% level. Moreover, the logistic regression constructed from the decision tree indicators has lower accuracy and goodness of fit, measured by R-squared, than the regression on the stepwise ratios. Adding Leaf to the logistic regression (column (3)) improves the model's R-squared without changing its accuracy, although the variable is not statistically significant. Likewise, removing the statistically insignificant decision tree variables and replacing OM with Leaf (column (4)) increases the specificity, sensitivity, and R-squared of the model, although Leaf is again not statistically significant. As for the controls, RT is positively and significantly related to Fraud and the other controls are insignificant; however, this is likely attributable to the use of variables from the decision tree. Table 6 reports the results from logistic regressions including non-financial variables. Columns (4-6) show the results for the significant variables from the stepwise regression together with the artificial variable Leaf, and columns (7-9) include the statistically significant variables from the decision tree, with OM replaced by Leaf. The analysis includes Leaf even though it was statistically insignificant in the regressions with financial variables, because it improved the model's goodness of fit and accuracy. Adding variables related to the company's board of directors (CEO and Board) to the base models does not change the accuracy of the stepwise regression model but slightly improves the accuracy of the decision tree model. Neither variable is statistically significant; CEO is positively correlated and Board is negatively correlated with the dependent variable.
Including ratios related to the shareholding structure (Shares_B, Shares_I, and Shares_One) in the base models improved the accuracy of the stepwise regression model and the decision tree model. All shareholder ratios are positively correlated with the dependent variable, and Shares_B and Shares_One are statistically significant. This means that the likelihood of earnings management increases if the Management and Supervisory Board holds the company's shares or if the number of shares held by the largest shareholder increases. In terms of marginal effects, shareholding by the Management and Supervisory Board and a 1 percent increase in the number of shares held by the largest shareholder increase the probability of financial fraud by 1.3 to 1.8 percentage points and by 0.5 to 1.3 percentage points, respectively. In addition, including ratios related to the audit (BIG4 and Audit) in the base models improved the accuracy of both the stepwise regression model and the decision tree model. BIG4 is negatively correlated with the dependent variable, Audit is positively correlated, and Audit is statistically significant. This means that an unqualified audit opinion increases the likelihood of earnings management. The marginal effect of receiving an unqualified audit opinion is an increase in the probability of financial fraud of 0.01 to 0.4 percentage points.
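The evaluation setup used in this subsection (10-fold cross-validation, a Type I error costed at 2 and a Type II error at 1) can be sketched as follows. This is an illustration under assumptions rather than the configuration actually used: the study was run in R, while the sketch uses Python with scikit-learn; the unequal costs are approximated with class weights; the tree depth is an arbitrary choice; and the data in the usage example are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

COST_TYPE_1 = 2.0  # fraudulent firm classified as non-fraudulent
COST_TYPE_2 = 1.0  # non-fraudulent firm classified as fraudulent


def summarize_errors(tn, fp, fn, tp):
    """Accuracy, sensitivity, specificity and weighted error cost from counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of fraudulent firms detected
        "specificity": tn / (tn + fp),   # share of control firms cleared
        "total_cost": COST_TYPE_1 * fn + COST_TYPE_2 * fp,
    }


def cross_validated_tree(X, y, seed=0):
    """Out-of-fold evaluation of a cost-weighted tree (label 1 = fraudulent)."""
    tree = DecisionTreeClassifier(
        # Assumed approximation: weight each class by the cost of misclassifying it.
        class_weight={0: COST_TYPE_2, 1: COST_TYPE_1},
        max_depth=3,  # arbitrary cap for the sketch
        random_state=seed,
    )
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    # Each firm is predicted by a tree trained on the other nine folds.
    predicted = cross_val_predict(tree, X, y, cv=folds)
    tn, fp, fn, tp = confusion_matrix(y, predicted, labels=[0, 1]).ravel()
    return summarize_errors(int(tn), int(fp), int(fn), int(tp))


if __name__ == "__main__":
    # Synthetic data with the same number of observations as the sample (126).
    X, y = make_classification(n_samples=126, n_features=5, n_informative=3,
                               n_redundant=0, random_state=0)
    print(cross_validated_tree(X, y))
    # Counts inferred from the percentages reported for the tree with
    # financial variables only (not stated as counts in the text).
    print(summarize_errors(tn=63, fp=0, fn=10, tp=53))
```

With counts consistent with the reported percentages for the tree with financial variables (53 of 63 fraudulent firms and all 63 control firms classified correctly), accuracy is 116/126, about 92.1%, and the weighted cost is 2 × 10 + 1 × 0 = 20. The counts are inferred from the percentages, not taken from the text.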

Regression results with non-financial variables
These results show that including variables related to the shareholding structure (Shares_B, Shares_I, and Shares_One) or ratios related to the audit (BIG4 and Audit) in the hybrid model increases model accuracy compared to the models with only financial ratios.
Furthermore, the regressions performed without Leaf showed that, for the stepwise regression, the R-squared decreased by 7.0 to 13.2 percentage points and the accuracy by 2.4 to 4.0 percentage points. For the decision tree variables, excluding Leaf reduced the R-squared by 7.9 to 19.2 percentage points and the accuracy by 1.5 to 4.7 percentage points. The regressions without Leaf showed two main changes in the analysis results. First, CEO was statistically significant and positively correlated with the dependent variable for logistic regression. Second, Shares_B turned out to be statistically insignificant. Figure 3 shows the statistically significant financial and non-financial variables, with their critical values, used to construct the decision tree that follows the rules previously described in Section 5.2. The decision tree analysis selected six variables: OM, Shares_I, Audit, F_A, RT, and QR. Ratios DR, IT, and WC_A from the decision tree with only financial variables have been replaced by Shares_I, Audit, F_A, and QR. OM is again the first split point in the decision tree. The model correctly classified 96.0% of the total sample: 96.8% of the fraudulent firms and 95.2% of the non-fraudulent companies.

Models with non-financial variables
The stepwise regression that follows the rules previously described and includes non-financial variables showed that the indicators NM, CR, QR, C_A, IT, RT, TAT, CEO, Shares_B, Shares_One, and Audit are statistically significant. Table 7 reports the results from logistic regressions in which non-financial variables are included in the construction of the model. Columns (10) and (12) show the results for the variables from the stepwise regression with non-financial variables, and columns (11) and (13) for the variables from the decision tree with non-financial variables. In column (12), I add Leaf, a binary variable derived from the root node that equals 1 if OM is lower than -0.059 and 0 otherwise. In column (13), I keep only the significant variables from the decision tree and replace OM with Leaf. Including non-financial variables in building the hybrid model increases model accuracy and goodness of fit compared to the base model. Also, adding Leaf to the model increases the accuracy and R-squared of the stepwise regression model and the decision tree model. The variable Leaf is positively correlated with the dependent variable and statistically significant. This suggests that if a company has an operating margin lower than -0.059, the likelihood of financial statement fraud is increased.
The above results suggest that building a model using only financial variables, or adding non-financial variables later rather than in the first step of model building, decreases the model's goodness of fit and accuracy.

CONCLUSION
Overall, my findings suggest that including an artificial variable derived from the decision tree increases the accuracy of the hybrid model. I show that including non-financial variables improves the accuracy of the hybrid models both when the variables are added to existing models and when they are used during the construction of the models. This empirical evidence means that using only financial variables and logistic regression in the study of financial statement fraud reduces the accuracy of the model and affects the construction of decision trees. Therefore, my findings may be relevant for other researchers who analyze earnings management. However, the models should include or exclude variables over time to effectively identify the companies likely to report earnings management. Moreover, my study is the first attempt to apply this type of analysis to Polish public companies, and I hope that future research will explore other variables and models to improve predictions.