Key Stats

The following metrics describe this XGBoost binary classifier (yes/no) model.

AUC (ROC)
0.9897
A high ROC AUC means the model almost always ranks real collisions above non-collisions.
PR AUC
0.9222
A high PR AUC means the model stays precise when finding positives even though positives are rare.
Log Loss
0.0330
Probability guesses are very close to the truth (lower is better)
Brier Score
0.00788
How close probability guesses are to reality (0 is perfect).
BSS
0.7920
Brier Skill Score: the Brier score is ~79% lower than that of a baseline that assigns every case the overall collision rate (~3.9%) as its probability.
Accuracy
0.9906
~99% of predictions are correct (does not account for rarity).
Precision
0.9284
When model flags a collision, it’s right ~93% of the time.
Recall
0.8242
Model finds ~82% of all real collisions.
F1
0.8732
Single score showing a good balance between catching collisions and avoiding false alarms
Raw metric values (JSON)
{
  "AUC": 0.9896967068152173,
  "PRAUC": 0.9221800275998946,
  "LogLoss": 0.03295311660934501,
  "Brier": 0.007877794374411067,
  "BSS": 0.7920132663556981,
  "Prevalence": 0.03943125072909336
}
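
As a rough cross-check, the BSS above can be recovered from the Brier score and the prevalence in the JSON. This is a minimal sketch that assumes the reference forecast is a constant equal to the prevalence, as the BSS description states; the function name `brier_skill_score` is illustrative, not part of the original pipeline.

```python
def brier_skill_score(brier, prevalence):
    """BSS relative to a constant forecast equal to the base rate (assumed reference)."""
    # A constant forecast p for every case has Brier score p * (1 - p).
    reference = prevalence * (1.0 - prevalence)
    return 1.0 - brier / reference


# Values copied from the raw metric JSON.
brier = 0.007877794374411067
prevalence = 0.03943125072909336

bss = brier_skill_score(brier, prevalence)
print(round(bss, 4))
```

A result close to the reported 0.7920 suggests the three numbers are mutually consistent.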

Curves

ROC curve for the XGBoost collision classifier
ROC shows how your model’s catch-rate (TPR) trades off against false alarms (FPR) as you vary the cutoff; AUC = 0.990 means it almost perfectly ranks positives above negatives (~99% of random positive–negative pairs are ordered correctly).
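
The pairwise reading of AUC can be made concrete with a small sketch. The scores below are invented toy values, not model output, and `pairwise_auc` is a hypothetical helper written for illustration.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs ranked correctly (ties count as half)."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores, made up for illustration only.
pos = [0.9, 0.8, 0.35]      # predicted probabilities for real collisions
neg = [0.4, 0.3, 0.2, 0.1]  # predicted probabilities for non-collisions

auc = pairwise_auc(pos, neg)
print(auc)
```

Here 11 of the 12 pairs are ordered correctly, so the toy AUC is 11/12; an AUC of 0.990 corresponds to about 99 of every 100 such pairs.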
Picking a threshold:
  • Lower threshold → higher recall, more false positives.
  • Higher threshold → higher precision, more misses.
  • Choose a threshold that minimises expected cost: cost = FN_rate × Cost(FN) + FP_rate × Cost(FP).
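
The cost rule above can be sketched as a small search over candidate thresholds. This is an illustration under assumptions: FN_rate and FP_rate are taken to be the FN and FP counts divided by the total number of cases, and the labels, probabilities, and costs are invented toy values.

```python
def expected_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """cost = FN_rate * Cost(FN) + FP_rate * Cost(FP), with rates per total case (assumed)."""
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    n = len(y_true)
    return (fn / n) * cost_fn + (fp / n) * cost_fp


def best_threshold(y_true, y_prob, cost_fn, cost_fp, candidates):
    """Return the candidate threshold with the lowest expected cost."""
    return min(
        candidates,
        key=lambda t: expected_cost(y_true, y_prob, t, cost_fn, cost_fp),
    )


# Toy data, made up for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.45, 0.40, 0.60, 0.70, 0.90]
candidates = [0.1, 0.3, 0.5, 0.7]

# Misses are expensive: the search favours a lower threshold.
t_miss_costly = best_threshold(y_true, y_prob, 10.0, 1.0, candidates)
# False alarms are expensive: the search favours a higher threshold.
t_alarm_costly = best_threshold(y_true, y_prob, 1.0, 10.0, candidates)
print(t_miss_costly, t_alarm_costly)
```

The same search could be run on held-out predictions with real cost estimates; the toy costs here only show the direction of the trade-off.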

Confusion Matrix

Confusion matrix at the chosen threshold
A confusion matrix is a 2×2 tally of right vs wrong calls. In your case, the model correctly said “no collision” 509,192 times, falsely flagged a collision 1,333 times, missed 3,684 real collisions, and correctly caught 17,273. That works out to about 92.8% precision, 82.4% recall, and 99.1% accuracy on a heavily imbalanced set.
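
The headline figures follow directly from the four counts. A minimal sketch of the arithmetic, using only the counts quoted above:

```python
# Counts from the confusion matrix above.
tn, fp, fn, tp = 509_192, 1_333, 3_684, 17_273

precision = tp / (tp + fp)                  # share of collision alerts that are real
recall = tp / (tp + fn)                     # share of real collisions that are caught
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all calls that are correct
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")
```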

Classification Report

Class               Precision   Recall   F1       Support
0 (no collision)    0.9928      0.9974   0.9951   510,525
1 (collision)       0.9284      0.8242   0.8732   20,957
accuracy                                 0.9906   531,482
macro avg           0.9606      0.9108   0.9341
weighted avg        0.9903      0.9906   0.9903
How to read this:
  • Class 0 (no collision): precision 0.9928, recall 0.9974, F1 0.9951, support 510,525. About 99.7% of non-collisions are correctly identified, so false alarms are rare.
  • Class 1 (collision): precision 0.9284, recall 0.8242, F1 0.8732, support 20,957. Most alerts are real and it finds about 82% of true collisions.
  • Accuracy: 0.9906 over 531,482 cases. Very high, but boosted by many non-collision cases.
  • Macro average: precision 0.9606, recall 0.9108, F1 0.9341. Treats both classes equally and shows class 1 is harder.
  • Weighted average: precision 0.9903, recall 0.9906, F1 0.9903. Weighted by class sizes, so it tracks overall accuracy.
  • Support: the number of examples per class used to compute the metrics.
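
The difference between the macro and weighted averages can be checked with a short calculation. This sketch uses the rounded per-class precision values from the report, so the results agree with the table only to about four decimals.

```python
# Per-class precision and support, copied from the classification report.
precisions = {0: 0.9928, 1: 0.9284}
supports = {0: 510_525, 1: 20_957}

# Macro average: every class counts equally.
macro = sum(precisions.values()) / len(precisions)

# Weighted average: every class counts in proportion to its support.
total = sum(supports.values())
weighted = sum(precisions[c] * supports[c] for c in precisions) / total

print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

Because class 0 makes up about 96% of the cases, the weighted figure sits close to the class 0 value, while the macro figure is pulled down by class 1.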

Feature Glossary

All predictors used by the model, grouped for clarity. Most come from OpenStreetMap; the rest are time and weather.

Time and Cyclical: Temporal exposure and periodicity
  • dt_year, dt_month, dt_day, dt_hour — calendar parts.
  • dt_is_weekend, dt_is_weekday — weekend vs weekday patterns.
  • hour_sin, hour_cos — hour as a circle (23 close to 0).
  • dow_sin, dow_cos — day-of-week cyclical.
  • dom_sin, dom_cos — day-of-month cyclical.
  • month_sin, month_cos — month cyclical.
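
The sin/cos pairs place each value on a circle so that the end of a cycle sits next to its start. A minimal sketch of the idea follows; the period of 24 for the hour is an assumption, since the exact formula behind hour_sin and hour_cos is not given here, and `cyclical` is a hypothetical helper.

```python
import math


def cyclical(value, period):
    """Map a periodic value to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Assumed period of 24 for hour-of-day.
h0, h1, h12, h23 = (cyclical(h, 24) for h in (0, 1, 12, 23))

print(distance(h0, h23))  # small: 23:00 is next to 00:00
print(distance(h0, h12))  # largest possible: noon is opposite midnight
```

With the raw hour, 23 and 0 would be 23 units apart; on the circle they are as close as 0 and 1.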
Weather: Visibility, grip, and flow conditions
  • temp, dwpt — temperature, dew point.
  • rhum — relative humidity.
  • prcp, snow — precipitation, snowfall.
  • wdir, wspd, wpgt — wind direction, speed, gusts.
  • pres — pressure; sometimes aligns with fronts/storms.
  • tsun — sunshine duration.
  • coco — weather condition code.
Geometry and Limits: Road design and speed environment
  • lanes_num_avg, lanes_num_max — lane counts.
  • width_m_avg, width_m_max — carriageway widths.
  • smoothness_score_avg — mapped surface quality.
  • maxspeed_mph_avg, maxspeed_mph_max — posted limits.
Highway Class Counts: Mix of road functions in the cell
  • cnt_is_primary, cnt_is_secondary, cnt_is_tertiary
  • cnt_is_residential, cnt_is_service, cnt_is_track_or_path
  • Counts of segments by OSM highway class.
Flow and Side Provision: Directionality, footways, cycle access
  • cnt_oneway_forward, cnt_oneway_bidirectional, cnt_oneway_reverse
  • cnt_sidewalk_both, cnt_sidewalk_left, cnt_sidewalk_right, cnt_sidewalk_none
  • cnt_bicycle_yes, cnt_bicycle_designated, cnt_bicycle_permissive, cnt_bicycle_no
Access and Structures: Restrictions and control points
  • cnt_access_permissive, cnt_access_destination, cnt_access_private, cnt_access_no
  • cnt_is_bridge, cnt_is_tunnel — bridges and tunnels.
  • cnt_has_barrier, cnt_has_amenity — barriers and amenities.
  • cnt_has_bus_stop, cnt_has_mini_roundabout, cnt_has_speed_camera
Class Shares (within cell): Composition by length or count
  • share_is_motorway, share_is_trunk (sometimes shown as “truck”)
  • share_is_primary, share_is_secondary, share_is_tertiary
  • share_is_residential, share_is_service, share_is_track_or_path, share_is_foot_or_ped
  • These often sum to near 1; interpretations should consider their correlation.
Neighbourhood Context (r11_*): Windowed means and shares
  • r11_mean_maxspeed_mph, r11_mean_lanes, r11_mean_width_m, r11_mean_smoothness
  • r11_mean_cnt_has_signals, r11_mean_cnt_has_crossing, r11_mean_cnt_speed_camera, r11_mean_cnt_bus_stop, r11_mean_cnt_amenity
  • r11_mean_cnt_is_primary, _secondary, _tertiary, _residential, _service, _track_or_path
  • r11_share_is_primary, _secondary, _tertiary, _residential, _service, _track_or_path
  • r11_share_oneway_forward, _bidirectional, _reverse
  • r11_mean_cnt_sidewalk_none, r11_mean_cnt_sidewalk_both
  • r11_mean_cnt_cycle_infra, r11_mean_cnt_cycle_lane, r11_mean_cnt_cycle_track
Nodes and Admin: Intersections and areas
  • junction_degree — number of approaches at the junction.
  • borough — administrative area (encoded for the model).
  • is_junction — location is at a junction node.
  • is_turn — whether there is a turn at the location.

Feature Importance

Top-30 XGBoost feature importances
These are the 30 inputs the model relies on most to separate collisions from non-collisions.

Model Notes
