Armando Mateus – Report June 5, 2026 (Week 33)

Dataset Balancing & Generalisation: Quantity vs. Quality, Left-Right Symmetry, and Town02 Surprises

June 5, 2026

After the success of oversampling + 35% cropping in Week 32, this week focused on improving left-turn performance by adding a new dataset of left turns that start from the right lane. A systematic exploration of 9 dataset compositions (varying total samples and turn/recovery ratios) revealed a counterintuitive finding: larger datasets do not guarantee better driving. The best models used the smallest datasets (2,000 samples), and the best of all also had a balanced left/right turn ratio. Evaluation on Town 02 highlights the need for diverse environment representation beyond Town 01.

0. Work Plan & New Data Sources

🎯 Primary objectives for Week 33:

Improve left-turn performance (identified as a weak point in Week 32)
Test the hypothesis that more samples automatically improve results
Validate the best models on a new town (Town 02)

New data collections (previous week's directories + new recording):

📁 2026-05-24-bubble_forward
📁 2026-05-24-Bubble_turn_left / Bubble_turn_right
📁 2026-05-25_Recuperations_far_left / recuperations_from_lane
📁 2026-05-26-Recuperation_right
📁 2026-05-26_Drunk-DAgger-Hard / Medium / Soft
📁 2026-05-26_First_balanced
📁 2026-05-25_turn_left_by_mirror
📁 2026-06-02_Turn_left_from_right (new: left turns starting from the right lane)

The new left-turn-from-right dataset was created to explicitly teach the network how to position itself correctly during left maneuvers, avoiding the "wrong lane" issue seen in Week 32.

1. First Attempt: Maintaining Proportions Failed

⚠️ Initial hypothesis: Increase total samples while keeping 35% forward, 50% turns, 15% recoveries.

Result: The vehicle oscillated severely and failed to take curves correctly.

Analysis: Changing the total number of samples altered the effective distribution of steering angles, even though high-level category percentages were kept constant. The model became overconfident on straight-line features and lost the ability to generalise to turns. This motivated a systematic exploration of different dataset sizes and turn/recovery splits.
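
The effect described above can be made measurable. The sketch below compares the steering-angle histograms of two datasets using the total variation distance. It is an illustration, not the analysis actually run this week: it assumes steering labels normalised to [-1, 1], and the function names, the bin count, and the synthetic angles are all hypothetical.

```python
import numpy as np


def steering_histogram(angles, n_bins=21, max_angle=1.0):
    """Normalised histogram of steering angles over [-max_angle, max_angle]."""
    counts, _ = np.histogram(angles, bins=n_bins, range=(-max_angle, max_angle))
    return counts / counts.sum()


def distribution_shift(angles_a, angles_b, n_bins=21, max_angle=1.0):
    """Total variation distance between two steering histograms.

    0.0 means identical histograms, 1.0 means they share no bin.
    """
    p = steering_histogram(angles_a, n_bins, max_angle)
    q = steering_histogram(angles_b, n_bins, max_angle)
    return 0.5 * float(np.abs(p - q).sum())


# Synthetic example (made-up data): both sets are 50% "straight" and 50% "turn",
# but the turn samples cover different steering magnitudes.
rng = np.random.default_rng(0)
small_set = np.concatenate([np.zeros(500), rng.uniform(0.2, 0.6, 500)])
large_set = np.concatenate([np.zeros(2000), rng.uniform(0.05, 0.3, 2000)])
print(distribution_shift(small_set, large_set))
```

A value well above zero for two datasets with the same category percentages would support the explanation given in the analysis.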

2. Systematic Dataset Construction (9 variants)

Nine datasets were created with different total sizes and category distributions. All variants use top 35% cropping + oversampling (based on Week 32 best configuration) to isolate the effect of dataset composition and size.

Dataset ID	Total samples	Forward / Recoveries / Left turns / Right turns	Key result	Video
A (00)	15,931	35% / 35% / 34% / 16%	Takes left lane, does not recover to right lane; right turns exit road	—
B (01)	12,420	35% / 15% / 50% / —	Low oscillations, correct lane keeping, but left turns too sharp, right turns exit road	—
C (02)	10,349	35% / 15% / 20% / 30%	Same as B: left turns too sharp, right turns exit road	—
D (03)	2,000	35% / 15% / 20% / 30%	Left turns correct, but right turns still exit road; Candidate for Town02	watch
E (04)	5,000	35% / 15% / 20% / 30%	High oscillations, lane departures, left turns OK, right turn soft but ends in left lane	—
F (05)	8,000	35% / 15% / 20% / 30%	Low oscillations, left turn fails (exits via right lane), right turn soft but stays left	—
G (06)	2,000	35% / 15% / 25% / 25% (balanced L/R turns)	No oscillations, no departures, lane keeping OK, left turns OK (fails in one specific scenario), right turns smooth ⭐	watch
H (07)	5,000	35% / 15% / 25% / 25%	No oscillations, lane keeping fails, left/right turns end in opposite lane	—
I (08)	8,000	35% / 15% / 25% / 25%	No oscillations, lane keeping fails, left/right turns fail completely	—

🔍 Critical observation – Quantity vs. Quality:

Datasets with 2,000 samples (D and G) consistently outperformed the larger datasets (5k to 15k) on almost every observed behaviour (oscillations, lane keeping, turns).
Increasing sample size appears to have introduced overfitting: the behaviour is consistent with the model memorising specific road sections rather than learning general steering behaviour.
Balanced left/right turn representation (Dataset G) eliminated the sharp-left-turn problem and produced smooth right turns.
Dataset G (2000 samples, 35% forward, 15% recoveries, 25% left turns, 25% right turns) became the primary candidate for cross-town evaluation.
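
To make the composition of Dataset G concrete, the sketch below turns a total size and a set of category fractions into per-category sample counts and draws them from per-category pools. It is a minimal illustration under stated assumptions: the function names and pool layout are hypothetical, and drawing with replacement when a pool is too small is only one possible reading of "oversampling", which this report does not specify.

```python
import random


def category_counts(total, fractions):
    """Split `total` into integer per-category counts that sum exactly to `total`."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    counts = {name: int(total * frac) for name, frac in fractions.items()}
    # Hand out samples lost to truncation, largest fractional part first.
    by_remainder = sorted(
        fractions, key=lambda n: total * fractions[n] - counts[n], reverse=True
    )
    for name in by_remainder[: total - sum(counts.values())]:
        counts[name] += 1
    return counts


def build_dataset(pools, fractions, total, seed=0):
    """Draw a shuffled dataset of `total` samples from per-category lists."""
    rng = random.Random(seed)
    dataset = []
    for name, n in category_counts(total, fractions).items():
        pool = pools[name]
        if n <= len(pool):
            dataset += rng.sample(pool, n)     # subsample without replacement
        else:
            dataset += rng.choices(pool, k=n)  # oversample with replacement (assumption)
    rng.shuffle(dataset)
    return dataset


# Dataset G split as stated in the report.
dataset_g = {"forward": 0.35, "recovery": 0.15, "left": 0.25, "right": 0.25}
print(category_counts(2000, dataset_g))
# {'forward': 700, 'recovery': 300, 'left': 500, 'right': 500}
```

For 2,000 samples this gives 700 forward, 300 recovery, 500 left-turn, and 500 right-turn frames.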

3. Cross-Town Evaluation: Town 02

The two candidate models (Dataset D (03) and Dataset G (06) from the table above) were evaluated on the completely unseen Town 02 environment.

Model	Town 02 Performance	Video	Failure mode
Dataset 03 (2000 samples, 35/15/20/30)	No turns performed, oscillations present	watch	Overfitted to Town 01 turning patterns
Dataset 06 (2000 samples, 35/15/25/25 with balanced left/right)	Left turns correct, but confused by plazas, gardens, and wide open spaces (not present in Town 01)	watch	Lack of diverse environmental examples (wide spaces, plazas)

📌 Key generalisation insight: A model that performs nearly perfectly on Town 01 can still fail on Town 02 if the training data lacks environmental diversity. Dataset 06 handled left turns correctly but was confused by open plazas, gardens, and wide intersections, structural features that are rare or absent in Town 01. This suggests that environmental diversity is as important as steering distribution balance.

4. Interpretation – The Curse of Quantity

🧠 Why did datasets with 2,000 samples outperform larger ones?

Overfitting to repetitive patterns: Larger datasets contain many visually similar frames from the same road sections. The model learns to memorise these specific views rather than extracting general lane-following features.
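Implicit regularisation (hypothesis): With fewer and less redundant frames, no single road section dominates the gradient updates, which may push the network toward the most salient features (lane boundaries, curvature cues). This is not a capacity argument, since a small dataset is easier to memorise than a large one, so the effect still needs to be verified.
Balanced steering distribution: Dataset G was stratified to equal left/right turn proportions (25%/25%). The larger datasets may have carried imbalances in the effective steering-angle distribution even when their category percentages matched (as in Section 1), which would amplify certain steering behaviours.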

Conclusion for imitation learning: Variety and balance matter more than raw count. A well-curated dataset of 2,000 diverse, balanced samples can outperform a 15,000-sample dataset with hidden biases.
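
One way to test the "repetitive patterns" explanation is to estimate how many frames in a dataset are near-duplicates of an earlier frame. The sketch below is a rough heuristic, not a method used in this report: it assumes frames are stored in recording order as arrays with values in [0, 1], and the function name and the 0.02 threshold are hypothetical.

```python
import numpy as np


def near_duplicate_fraction(frames, threshold=0.02):
    """Fraction of frames that barely differ from the last distinct frame.

    A frame counts as a near-duplicate when its mean absolute pixel
    difference from the most recent distinct frame is below `threshold`.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if len(frames) == 0:
        return 0.0
    reference = frames[0]
    duplicates = 0
    for frame in frames[1:]:
        if np.mean(np.abs(frame - reference)) < threshold:
            duplicates += 1
        else:
            reference = frame  # a new distinct view becomes the reference
    return duplicates / len(frames)
```

If the larger datasets show a clearly higher near-duplicate fraction than the 2,000-sample ones, that would be consistent with the overfitting explanation above.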

5. Failure Mode Analysis

🔴 Consistent failure patterns:

Right-turn exit failures: Many models (A, B, C, D) exhibited a tendency to exit the road on the right side during right turns. This is likely due to an insufficient number of right-turn examples where the lane marking curves away.
Left-turn sharpness: Without balanced left/right representation, left turns become excessively sharp (oversteering). Dataset G (25% left, 25% right) solved this.
Lane keeping failures in larger datasets: Models with 5,000+ samples (E, F, H, I) either stayed in the left lane permanently or failed to centre after turns, indicating learned positional bias.
Wide space confusion (Town 02): The model trained only on Town 01 could not generalise to plazas and gardens because those visual patterns were entirely absent from the training set.
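
A common way to obtain symmetric left/right data in end-to-end driving is horizontal mirroring: flip the frame and negate the steering label. The directory name 2026-05-25_turn_left_by_mirror suggests something similar was used here, but the report does not describe it, so the sketch below is an assumption. It expects frames as (H, W, C) arrays and a signed steering value; the function name is hypothetical.

```python
import numpy as np


def mirror_sample(image, steering):
    """Flip an (H, W, C) frame left-to-right and negate its steering angle."""
    return np.fliplr(image), -steering


# Usage on a dummy frame: a right turn (+0.3) becomes a left turn (-0.3).
frame = np.zeros((120, 160, 3), dtype=np.uint8)
mirrored_frame, mirrored_steering = mirror_sample(frame, 0.3)
print(mirrored_frame.shape, mirrored_steering)
```

Note that a mirrored frame also swaps the lane the car appears to drive in, so mirrored turns are not equivalent to turns recorded from the correct lane.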

6. Conclusions & Week 34 Plan

📌 Week 33 key takeaways:

More data is not always better. The best results came from 2,000 carefully balanced samples (Dataset G). Larger datasets (5k–15k) drove worse and generalised worse, which we attribute to overfitting.
Balanced left/right turn representation (25%/25%) is critical for symmetric steering behaviour. The new Turn_left_from_right dataset was essential to achieve this balance.
Environmental diversity matters. A model that works well on Town 01 may fail on Town 02 due to unseen features like plazas and wide intersections.
Dataset G (2000 samples, 35% forward, 15% recoveries, 25% left, 25% right + 35% cropping + oversampling) is the best candidate for further testing.

🚀 Week 34 action items:

Augment training data with diverse environments: Collect 500–1000 samples from Town 02, Town 04, and Town 12 (focusing on plazas, wide intersections, and gardens) and add them to Dataset G's balanced distribution.
Test dataset size upper bound: Systematically vary dataset size from 1k to 4k in 500-sample increments to find the optimal trade-off between variety and overfitting.
Dynamic evaluation: Test the best Town 01+Town 02 hybrid model on Town 05 (dynamic traffic) to measure robustness under more realistic conditions.
Introduce a validation metric for environmental diversity: Track performance drop when moving from training town to unseen towns.
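
The size sweep and the diversity metric from the action items above could be sketched as follows. This is one possible definition, not a decided design: it assumes each model gets a single scalar score per town where higher is better (for example, a route-completion fraction), and the function name and the example scores are hypothetical.

```python
def generalisation_drop(train_score, unseen_scores):
    """Relative drop from the training-town score to the mean unseen-town score.

    0.0 means no loss on unseen towns, 1.0 means the model scores zero there.
    """
    if train_score <= 0:
        raise ValueError("train_score must be positive")
    mean_unseen = sum(unseen_scores.values()) / len(unseen_scores)
    return (train_score - mean_unseen) / train_score


# Dataset sizes for the planned sweep: 1k to 4k in 500-sample steps.
sweep_sizes = list(range(1000, 4001, 500))
print(sweep_sizes)  # [1000, 1500, 2000, 2500, 3000, 3500, 4000]

# Made-up scores, only to show the call.
print(generalisation_drop(0.9, {"Town02": 0.45}))
```

Reporting this drop for every size in the sweep would show whether the best-performing size on Town 01 is also the one that generalises best.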

All code, datasets, and trained models will be released after Week 34 validation. A complete technical note on “The Curse of Quantity in Imitation Learning” is in preparation.

7. References & Team Notes

[1] Mateus, A. (2026): Week 32 report – Oversampling + 35% cropping success.
[2] Rodríguez, J. (2025): Dataset balancing strategies for end-to-end driving.
[3] Bojarski, M., et al. (2016): End to end learning for self‑driving cars (PilotNet). arXiv:1604.07316
[4] Codevilla, F., et al. (2019): On the interaction between dataset size and generalisation in imitation learning – Inspiration for our quantity vs. quality analysis.

Special thanks to the Robotics Lab team for the detailed failure analysis sessions and to Jorge Rodríguez for the new left-turn-from-right dataset collection. The systematic 9-dataset training campaign was run on the lab’s GPU cluster over 48 hours.

— Armando Mateus, Robotics Lab URJC

📌 WEEK 33 SUMMARY – JUNE 5, 2026

📊 Built 9 dataset variants (2k–15k samples) with different turn/recovery splits, all using the best Week 32 configuration (35% cropping + oversampling).

🔍 Discovered the "Curse of Quantity": 2,000-sample datasets consistently outperformed larger datasets. In this setup, overfitting rather than underfitting appears to be the main challenge.

⚖️ Balanced left/right turns (25% each) resolved sharp-left-turn issues. The new Turn_left_from_right dataset was critical.

🌍 Generalisation to Town 02 revealed environmental gaps: plazas, gardens, and wide intersections cause failures even for models that drive well on Town 01.

✅ Best model: Dataset G (2000 samples, 35/15/25/25 split) – no oscillations, correct turns on Town 01 except one left-turn scenario, promising basis for Week 34’s multi-town augmentation.

🔜 Week 34: Add diverse environment samples (Town 02/04/12), optimise dataset size between 1k–4k, and evaluate on dynamic Town 05.