
This project was proposed and carried out as my thesis work and was inspired by technology and autonomous vehicles. It uses several machine learning and deep learning (ML and DL) models, together with scientific methods and data analysis. Its main contributions are a system that generates a database for training machine learning models, and a system that recognizes a pedestrian's crossing intention in real time, with metrics at the level of the state of the art and a response time within the average human reaction time.

Project justification

Although artificial intelligence is currently driving rapid change, there is still a long way to go. In autonomous driving in particular, work remains before autonomous cars are completely reliable, and pedestrians are still dying in vehicle accidents (Figure 1).

Figure 1. World Health Organisation, “Road traffic injuries,” 2018, https://extranet.who.int/roadsafety/death-on-the-roads/#ticker/pedestrians, last accessed on 2021-11-05.

Background

Many articles on this subject use different characteristics of the pedestrian, the environment, or the vehicle to predict or recognize the pedestrian's intention to cross a street in real time, aiming to be as fast as or faster than the human reaction time, which is between 1.30 and 1.5 seconds [2]. Features used for this task include traffic lights, pedestrian crossings, vehicle speed, head orientation, and pedestrian gaze, among others. The problem is that there is no standardized set of characteristics known to be necessary for predicting or recognizing the pedestrian's intention to cross.

[2] Droździel, P., Tarkowski, S., Rybicka, I., & Wrona, R. (2020). Drivers' reaction time research in the conditions in the real traffic. Open Engineering, 10(1), 35–47. https://doi.org/10.1515/eng-2020-0004.

[3] B. Yang and R. Ni, “Vision-based recognition of pedestrian crossing intention in an urban environment,” 9th IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems, CYBER 2019, pp. 992–995, 2019.

[4] M. Raza, Z. Chen, S. U. Rehman, P. Wang, and P. Bao, “Appearance based pedestrians’ head pose and body orientation estimation using deep learning,” Neurocomputing, vol. 272, pp. 647–659, 2018. [Online].

[5] R. Q. Minguez, I. P. Alonso, D. Fernandez-Llorca, and M. A. Sotelo, “Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process Dynamical Models and Pedestrian Activity Recognition,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 5, pp. 1803–1814, 2019.

Hypothesis

The hypothesis of this work is that extracting pedestrian characteristics (virtual skeleton and knee angles) and characteristics of their environment (pedestrian crossings, stop signs, traffic lights) from videos makes it possible to recognize the pedestrian's intention to cross more efficiently and in less time through the use of artificial intelligence.
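
As an illustration of the knee-angle attribute mentioned in the hypothesis, the following minimal Python sketch computes the angle at a joint from three 2D skeleton keypoints. The function name `joint_angle` and the pixel coordinates are hypothetical. The source does not state how the angles were obtained, so this is only one plausible formulation: the angle between the knee-to-hip and knee-to-ankle vectors.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at point b, between segments b->a and b->c.

    Each point is an (x, y) tuple, e.g. hip, knee, ankle keypoints.
    Returns None if two keypoints coincide (degenerate skeleton).
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm1 = math.hypot(v1[0], v1[1])
    norm2 = math.hypot(v2[0], v2[1])
    if norm1 == 0 or norm2 == 0:
        return None
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (norm1 * norm2)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical pixel coordinates for (hip, knee, ankle).
straight_leg = joint_angle((100, 200), (100, 300), (100, 400))
bent_leg = joint_angle((100, 200), (100, 300), (200, 300))
```

A fully extended leg gives an angle close to 180 degrees, and the angle decreases as the knee bends, which is what makes it a candidate signal for walking motion.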

General Objective

Develop a system that uses AI techniques to recognize whether a pedestrian is going to cross the street, based on head orientation and on body characteristics obtained from the pedestrian's skeleton, in order to prevent or reduce road accidents involving pedestrians.

Specific Objectives

  • Select and use methods for extracting information from videos.

  • Develop a system and apply AI methods that make it possible to recognize certain characteristics of the pedestrian's body and variables in their environment.

  • Propose a strategy to combine these characteristics: knee angles, head orientation, crosswalks, traffic lights, and stop signs.

  • Carry out tests using the proposed strategy, and apply metrics to it.

Theoretical foundation (metrics)

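Precision = TP / (TP + FP)

Accuracy  = (TP + TN) / (TP + FP + FN + TN)

Recall    = TP / (TP + FN)

F1-Score  = (2 × Precision × Recall) / (Precision + Recall)

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives.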

These metrics [6] are the most common for this type of binary classification problem. Accuracy is the least informative of them, because on its own it does not give a good understanding of how a machine learning or deep learning model behaves. For this project, a high precision value implies few false positives (FP), and a high recall value implies few false negatives (FN), which are very dangerous because they correspond to a crossing pedestrian that goes unrecognized.

[6] Zhang, S., Abdel-Aty, M., Wu, Y., & Zheng, O. (2021). Pedestrian Crossing Intention Prediction at Red-Light Using Pose Estimation. IEEE Transactions on Intelligent Transportation Systems, 1–9. https://doi.org/10.1109/TITS.2021.3074829
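
The four metrics can be computed directly from the confusion-matrix counts. The sketch below is a minimal illustration. The function name, the convention of returning 0.0 when a denominator is zero, and the counts in the example are assumptions, not values from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, accuracy, recall and F1 from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives,
    false negatives. Undefined ratios (zero denominator) are reported
    as 0.0, which is an assumed convention.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "accuracy": accuracy,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a preventive classifier: few FN, more FP.
metrics = classification_metrics(tp=80, fp=40, tn=70, fn=10)
```

With these hypothetical counts, recall is higher than precision, which is the pattern expected from a model that prefers to raise a false alarm over missing a crossing pedestrian.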

Methodology

The methodology used for this work consists of three main steps: data generation, training of the ML and DL models, and the final system.

In step 1, database generation, two models were used: the ResNet50 model pre-trained on the public COCO database, and the YOLOv8 model trained on a database created for this work and based on the JAAD database. The ResNet50 model detects pedestrians and traffic signals. The YOLOv8 model detects pedestrian crossings and the orientation of the pedestrian's head. In step 2, ML model training, the database generated in the previous step is processed and cleaned, and several variants of the original database were made to observe the behavior of the ML models. In step 3, the final system, the structure of step 1 is combined with the already trained ML models. This system is in charge of recognizing the pedestrian's intention to cross.

JAAD database

The database used for this work is JAAD, which consists of 346 short videos, each 5 to 10 seconds long and recorded at 30 FPS. It contains a large number of annotations, ranging from the weather to the status of the pedestrian (crossing or not crossing) [7].

[7] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 206–213.

Results - Data generation

The generated database needed to be processed and cleaned with data analysis methods such as principal component analysis (PCA), Pearson's correlation matrix, data imputation, and synthetic data generation. These and other methods were used to transform the original database into a better-quality database for training the ML models.
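
To make some of these cleaning steps concrete, here is a minimal pure-Python sketch of mean imputation, min-max normalization, and Pearson's correlation on two attribute columns. The column names and values are hypothetical, and mean imputation is an assumption: the source does not say which imputation strategy was used.

```python
import math


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / (sx * sy)


# Hypothetical columns; None marks a frame where an attribute was not detected.
left_knee_angle = [170.0, None, 150.0, 130.0]
right_knee_angle = [10.0, 20.0, None, 40.0]

left_filled = impute_mean(left_knee_angle)
right_filled = impute_mean(right_knee_angle)
left_scaled = min_max_scale(left_filled)
correlation = pearson(left_filled, right_filled)
```

Pairs of attributes with a correlation close to 1 or -1 are candidates for removal, which is one way a correlation matrix can support attribute reduction.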

Results - ML Training

In this step, three ML models were trained: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF). Each model was trained on 9 processed variations of the original database presented above, giving 27 models to analyze. They were analyzed using the K-fold method (K=5), confusion matrices, the metrics already mentioned, and the precision versus recall (PR) graph. In the end, only four models were selected, ranked first by high F1-Score, then by recall, and finally by precision. The four were trained with different variations of the database but share two things: all use the SVM model and data imputation. Recall that a high recall means few false negatives (FN), which are the dangerous cases, and a high precision means few false positives (FP).
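
The experiment grid and the K-fold split can be sketched as follows. This is a simplified illustration in plain Python: the variant names are placeholders, the folds are contiguous and unshuffled, and no model is actually trained here. In practice a library implementation with shuffling or stratification would normally be used.

```python
from itertools import product

MODELS = ["KNN", "SVM", "RF"]
# Placeholder names for the 9 database variations (hypothetical).
VARIANTS = [f"variant_{i}" for i in range(1, 10)]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 3 models x 9 variants = 27 (model, variant) combinations to evaluate.
experiments = list(product(MODELS, VARIANTS))

# Each combination would be trained and scored once per fold.
folds = list(k_fold_indices(12, k=5))
```

Every sample is used for testing exactly once across the five folds, so the spread of a metric over the folds indicates how stable a model is.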

Each of the following tables corresponds to the SVM model trained on one of the four database variations. The metrics have high values, which is good, but what really matters is that they remain stable across the five folds (K=5), especially the F1-score.

Table 1. Imputed database (SVM).

Table 2. Imputed database, normalized with the Maxmin method (SVM).

Table 3. Imputed database, normalized with the Maxmin method and reduced (S,PP,OC,AD) (SVM).

Table 4. Imputed database, normalized with the Maxmin method and reduced (S,PP,OC,AD) (SVM).

As the confusion matrices below show, the number of false negatives (FN) is very low, which is good: the models correctly classify the pedestrians who are crossing. On the other hand, the models tend to be preventive, since many false positives (FP) are generated; that is, the models predict that the pedestrian is crossing when they really are not. There is therefore room to improve the precision metric, which reflects this behavior.

Confusion matrix for the imputed database (SVM).

Confusion matrix for the imputed database, normalized with the Maxmin method and reduced (S,PP,OC,AD) (SVM).

Confusion matrix for the imputed database, normalized with the Maxmin method (SVM).

Confusion matrix for the imputed database, normalized with the Maxmin method and reduced (S,PP,OC,AD) (SVM).

The PR graphs are presented next. The area under the curve (AUC) is greater than 0.7 for all of them, and the model trained with the imputed database, normalized with the Standard Scale method and reduced, behaves more stably than the others.

PR curve for the imputed database, AUC = 0.73 (SVM).

PR curve for the imputed database, normalized with the Maxmin method and reduced (S,PP,OC,AD), AUC = 0.74 (SVM).

PR curve for the imputed database, AUC = 0.78 (SVM).

PR curve for the imputed database, normalized with the Maxmin method and reduced (S,PP,OC,AD), AUC = 0.75 (SVM).
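
For reference, a PR curve and its area can be obtained by sweeping the decision threshold over the classifier scores. The sketch below is a simplified pure-Python version that assumes distinct scores and uses the trapezoidal rule. The labels and scores are hypothetical, and the AUC values reported above were not necessarily computed this way.

```python
def pr_curve(y_true, scores):
    """Return (recalls, precisions) from sweeping the threshold downwards.

    y_true: list of 0/1 labels (1 = crossing).
    scores: classifier scores, higher means more likely crossing.
    Assumes distinct scores; ties are not grouped.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("PR curve is undefined without positive samples")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]  # conventional starting point
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / total_pos)
        precisions.append(tp / (tp + fp))
    return recalls, precisions


def auc_trapezoid(xs, ys):
    """Area under the piecewise-linear curve through the points (xs, ys)."""
    return sum(
        (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs))
    )


# Hypothetical labels and scores.
y_true = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
recalls, precisions = pr_curve(y_true, scores)
pr_auc = auc_trapezoid(recalls, precisions)
```

A curve that keeps precision high as recall grows yields an area close to 1, while a curve that drops early yields a smaller area.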

Below is a comparative table of works that used the same JAAD database. As can be seen, this work is within the state of the art and achieves the best recall.

Table 5. Comparison table of the best ML model against other works.

[8] J. Gesnouin, S. Pechberti, G. Bresson, B. Stanciulescu, and F. Moutarde, “Predicting intentions of pedestrians from 2d skeletal pose sequences with a representation-focused multi-branch deep learning network,” Algorithms, vol. 13, no. 12, pp. 1–23, 2020.

[9] Z. Fang and A. M. López, “Is the Pedestrian going to Cross? Answering by 2D Pose Estimation,” IEEE Intelligent Vehicles Symposium, Proceedings, vol. 2018-June, pp. 1271–1276, 2018.

[10] Gesnouin, J., Pechberti, S., Stanciulescu, B., & Moutarde, F. (2021). TrouSPI-Net: Spatio-temporal attention on parallel atrous convolutions and U-GRUs for skeletal pedestrian crossing prediction. Proceedings - 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2021. https://doi.org/10.1109/FG52635.2021.9666989.

[11] Yang, D., Zhang, H., Yurtsever, E., Redmill, K., & Ozguner, U. (2022). Predicting Pedestrian Crossing Intention with Feature Fusion and Spatio-Temporal Attention. IEEE Transactions on Intelligent Vehicles, 14(8), 1–9. https://doi.org/10.1109/TIV.2022.3162719.

[12] Yao, Y., Atkins, E., Johnson-Roberson, M., Vasudevan, R., & Du, X. (2021). Coupling Intent and Action for Pedestrian Crossing Behavior Prediction. IJCAI International Joint Conference on Artificial Intelligence, 1238–1244. https://doi.org/10.24963/ijcai.2021/171.

[13] J. A. Abbasi, N. M. Imran, and M. Won, “WatchPed: Pedestrian Crossing Intention Prediction Using Embedded Sensors of Smartwatch,” 2022. [Online]. Available: http://arxiv.org/abs/2208.07441.

[14] J. Lorenzo, I. Parra, F. Wirth, C. Stiller, D. F. Llorca, and M. A. Sotelo, “RNN-based Pedestrian Crossing Prediction using Activity and Pose related Features,” IEEE Intelligent Vehicles Symposium, Proceedings, pp. 1801–1806, 2020.

Results - Final System
No pedestrian and pedestrian not crossing

To test the final system, videos that were never used to train the machine learning and deep learning models were used. There were a total of 69 videos (277-346).
For videos 285, 292, 296, 323, 343, and 346, all metrics were zero, either because there was no pedestrian or because the pedestrian detection model could not detect one. The latter is the case for video 346, which is backlit, making the pedestrian difficult to detect. For videos 284, 288, 289, 300, 304, 308, 309, 318, 329, 335, 337, 342, and 344, poor metrics were obtained because the system tends to be preventive: in these videos the pedestrian was detected and never crossed, but the system predicted crossing. This behavior was already visible for the non-crossing class when the machine learning models were trained with the k-fold method.

Videos with no pedestrian, or with no pedestrian detected:

  • 285

  • 292

  • 296

  • 323

  • 343

  • 346

Videos where pedestrians were not crossing:

  • 284

  • 288

  • 289

  • 300

  • 304

  • 308

  • 309

  • 318

  • 329

  • 335

  • 337

  • 342

  • 344
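
The following sketch illustrates why these two groups of videos score poorly on precision, recall, and F1. It derives the metrics from per-frame labels, under the assumption (not stated in the source) that an undefined ratio is reported as 0. The label sequences are hypothetical.

```python
def video_metrics(y_true, y_pred):
    """Compute metrics for one video from per-frame labels (1 = crossing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Assumed convention: an undefined ratio counts as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# No pedestrian detected: there are no samples, so every metric is 0.
no_pedestrian = video_metrics([], [])

# Pedestrian never crosses, but the preventive system often predicts crossing.
not_crossing = video_metrics([0, 0, 0, 0], [1, 1, 0, 1])
```

In the second case there are no true positives at all, so precision, recall, and F1 are zero regardless of how many frames were classified correctly as not crossing. Only accuracy reflects those frames.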

Results - Final System
Pedestrian crossing

For videos where the pedestrian crosses the street or avenue, the metrics were good, as was seen with the machine learning models. The following four bar charts show accuracy, precision, recall, and F1-Score, one chart for each final system, where each system's ML model was trained with one of the four final database variations. This makes it possible to compare the systems and see how the database variations affect performance. The three charts on the right perform better than the chart on the left for almost all metrics. Since the remaining three systems can no longer be separated by their metrics, the charts in the center and on the right were selected because they reach the same metrics using fewer attributes. Finally, the best system from the author's point of view is the one on the right (BD_imp_SS_Red), because of the results of its ML model, whose precision versus recall graph behaves more stably.
In summary, the system whose machine learning model was trained with the imputed database, normalized with the Standard Scale method and with reduced attributes, is the best of the four finalists because of its metrics, number of attributes, and stability.

A comparative table of works that use the same JAAD database is presented below. The four systems developed are compared to check whether they are within the state of the art. As can be seen, all four are, and their recall values stand out as the best among the compared works.

Table 6. Comparative table of the 4 systems against other works that use the same JAAD database.

[15] Razali, H., Mordan, T., & Alahi, A. (2021). Pedestrian intention prediction: A convolutional bottom-up multi-task approach. Transportation Research Part C: Emerging Technologies, 130(June), 103259. https://doi.org/10.1016/j.trc.2021.103259

Conclusions and Contributions

Conclusions:

  • The objective of creating a system that recognizes the pedestrian's intention frame by frame, in times within the average human reaction time (1.30 - 1.5 s in unexpected situations [16]), was achieved.

  • The hypothesis was confirmed: the pedestrian's intention can be recognized with the necessary characteristics and with good performance.

  • The system is within the state of the art, and dimensionality reduction techniques showed experimentally which attributes of the pedestrian and the environment are really needed.

  • Adequate time and metric values were reached for recognizing the intention, using fewer attributes than other works.

  • A system was built that generates a database for training any ML model, which is very valuable.

  • It was shown that for this system the best ML model was SVM.

  • Based on the confusion matrices, the system is good in that it tends to be preventive, but it is not the best possible and its performance should be improved.

Contributions:

  • Database generator system.

  • Identification of the attributes really necessary to recognize the pedestrian's intention.

  • Pedestrian intention recognition system that responds in less than the average human reaction time.

[16] Droździel, P., Tarkowski, S., Rybicka, I., & Wrona, R. (2020). Drivers' reaction time research in the conditions in the real traffic. Open Engineering, 10(1), 35–47. https://doi.org/10.1515/eng-2020-0004
