In February 2023, the Xiamen University Accounting Development Research Center, in collaboration with Professor Li Sheng and team members Zheng Tianyu and Dr. Teng Chuanhao, published the paper "Research on Fraud of Listed Companies in China Based on Machine Learning Method" in the Journal of Xiamen University (Philosophy and Social Sciences Edition).

Introduction
At the fourth session of the 13th National People's Congress in 2021, it was proposed that, in order to prevent and resolve financial risks, maintain order in the financial market, and improve the capital market, a "zero tolerance" attitude must be adopted toward illegal and criminal acts such as financial fraud by listed companies. Fraud by listed companies not only harms the interests of investors but also reduces the effectiveness of the capital market. With the advent of the digital age, financial fraud is showing new characteristics and trends. As computing technology and hardware continue to advance, identifying financial fraud with machine learning has become an active research topic. Researchers abroad have used machine learning to identify and predict fraud with good results, and domestic research has borrowed from these methods. Compared with the capital markets of developed countries such as the United States, however, China's capital market has a large proportion of retail investors and a high daily trading volume. These characteristics suggest that studying fraud in China's capital market using only foreign methods and data would introduce substantial bias. How to design a machine learning model for identifying financial fraud that fits China's institutional background is therefore a research question worth exploring.
Model construction
There are many machine learning methods. This paper uses decision tree, random forest, AdaBoost-decision tree, and support vector machine (SVM) models to identify fraud, and compares the strengths and weaknesses of the different models in China's capital market as well as the identification performance obtained with different input variables.
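As a rough illustration of how these four model families can be set up side by side, the sketch below uses scikit-learn on synthetic data. The library choice, the synthetic data set, and every hyperparameter are assumptions made for illustration; the paper's summary does not state its implementation or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for firm-year data (assumption): roughly 10% of rows are fraud (label 1).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The four model families named in the text; all settings here are illustrative.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # AdaBoost's default weak learner in scikit-learn is a depth-1 decision tree.
    "adaboost_tree": AdaBoostClassifier(n_estimators=50, random_state=0),
    "svm": SVC(kernel="rbf", random_state=0),
}

# Fit every model on the same training split and keep its test-set predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Training all models on one shared split keeps the comparison between them like for like.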
Evaluation indicators
To assess the strengths and weaknesses of the models, relevant evaluation indicators are needed. Common indicators include accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC). This paper takes recall as its main measure and selects the model best suited to China's capital market while weighing costs and benefits. At the same time, following the cost-benefit principle, the model parameters are manually tuned so that the accuracy of each model is at the same level, and the models and data are then evaluated jointly on recall and accuracy.
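The four indicators can be computed from labels and model scores as in the minimal sketch below. The toy labels, the scores, and the 0.5 decision threshold are assumptions chosen for illustration, with 1 standing for a fraud case; none of these values come from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall: share of actual fraud cases that the model flags.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def auc(y_true, scores):
    """AUC as the share of (fraud, non-fraud) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (assumption): 3 fraud cases among 10 firms, with model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05, 0.15, 0.25]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is illustrative

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

In this toy case one fraud firm is missed and one clean firm is flagged, so accuracy is 0.8 while recall is only 2/3. This gap is why a recall-focused evaluation can rank models differently from an accuracy-focused one when fraud cases are rare.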
Data selection
The sample data are selected in three steps. First, following the existing literature on identifying financial fraud with machine learning, the paper selects raw financial data that can be obtained directly from financial reports and then adds related financial indicators to supplement them, making financial data more comparable across companies. Second, drawing on research into the factors that influence financial fraud, corporate governance indicators and audit indicators are added in turn to improve the accuracy of the models' identification and prediction. Finally, because the paper studies China's capital market, whose listed companies differ from those in American samples in features such as goodwill and equity pledges, indicators specific to China's capital market and listed companies are added to the sample.
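The step-by-step widening of the input set can be pictured as a sequence of cumulative feature sets, each adding one indicator group to the previous one. In the sketch below the group order follows the description above, but every column name is a hypothetical placeholder; the summary names only goodwill and equity pledges as examples and does not list the paper's actual variables.

```python
# Indicator groups in the order they are introduced.
# All column names are hypothetical placeholders, not the paper's variables.
FEATURE_GROUPS = [
    ("raw_financial", ["total_assets", "revenue", "net_income"]),
    ("financial_indicators", ["return_on_assets", "leverage"]),
    ("corporate_governance", ["board_size", "independent_director_ratio"]),
    ("audit", ["audit_opinion", "big4_auditor"]),
    ("china_specific", ["goodwill", "equity_pledge_ratio"]),
]


def cumulative_feature_sets(groups):
    """Return one (name, columns) pair per step, each containing all earlier groups."""
    steps = []
    columns = []
    for name, cols in groups:
        columns = columns + cols  # build a new list so earlier steps stay unchanged
        steps.append((name, columns))
    return steps


feature_sets = cumulative_feature_sets(FEATURE_GROUPS)
for name, cols in feature_sets:
    print(f"up to {name}: {len(cols)} columns")
```

Training the same model on each successive set is one way to see whether a newly added group improves identification.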
Research conclusion
Starting from the raw financial data, this paper gradually introduces financial ratios, corporate governance indicators, audit indicators, and indicators specific to China's capital market. It uses logistic models (M-Score, F-Score, and C-Score) as the evaluation benchmark and applies decision tree, random forest, AdaBoost-decision tree, and support vector machine (SVM) models for the machine learning analysis. Oversampling is used to reduce sample imbalance. Each model is evaluated primarily on recall, and, with the cost-benefit principle in mind, accuracy, recall, and AUC are used together to judge the quality of the models and data. The results show that adding financial ratios, audit indicators, and China-specific indicators to the raw financial data improves identification, whereas corporate governance indicators do not improve the models' ability to identify fraud. Compared with the other models, the random forest and AdaBoost-decision tree models identify fraud better, with accuracy of 62% and 64% and recall of 64% and 62%, respectively. Out-of-sample identification results also indicate that the models adopted in this paper are highly robust.
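The summary does not say which oversampling variant the paper uses. As one possible reading, the sketch below shows plain random oversampling, which duplicates minority-class rows with replacement until the classes are balanced. The function name, the seed, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate rows of smaller classes (with replacement) until all classes match the largest."""
    rng = random.Random(seed)  # fixed seed so the resampling is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Toy training set (assumption): 8 non-fraud rows and 2 fraud rows.
X_train = [[float(i), float(i % 3)] for i in range(10)]
y_train = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_bal))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class balance and the reported recall and accuracy are not inflated by duplicated rows.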
Contribution
On the one hand, this paper enriches the literature on corporate fraud identification. Existing research is based mainly on traditional financial data and overlooks the predictive role of indicators specific to China's capital market. Building on financial data, this paper gradually introduces corporate governance indicators, audit indicators, and China-specific indicators, so that the fraud identification model better fits the background of China's stock market and the characteristics of its enterprises and has greater practical value. On the other hand, the proposed model helps improve the effectiveness of the capital market and safeguard investors' interests. With the registration-based stock issuance system now fully implemented, regulators and investors rely more heavily on information disclosure to monitor enterprises. By using the multidimensional information that enterprises disclose to identify fraud, this paper provides regulators and investors with a more effective identification tool.