Week 19 – Dataset Balance Analysis and Latency Issue Investigation

February 3, 2026

Comprehensive dataset balancing strategies and identification of steering latency problems in the data acquisition pipeline

Week 19 focused on advanced dataset analysis and correction of critical issues identified in the data collection process. Following Enrique Shinohara's Master's work methodology, a detailed examination of dataset distribution and steering latency problems was conducted.

1. Dataset Balancing Strategies Implementation

Inspired by Enrique Shinohara's approach, I performed a detailed analysis of data distribution across 48 steering categories ranging from -0.7 (sharp left) to 0.7 (sharp right). The categories were defined with intervals of 0.025: (-0.7, -0.675), (-0.675, -0.65), etc.

Three balancing strategies were implemented and compared:

1. Original Dataset Distribution

Undersampling distribution

Figure 1: Dataset distribution after applying undersampling strategy

The initial dataset showed significant imbalance with overrepresentation of certain steering values and underrepresentation of others, leading to model biases.

2. Undersampling Strategy

Undersampling distribution

Figure 2: Dataset from undersampling strategy.

This approach reduces the number of samples in overrepresented categories to create a more balanced distribution.

3. Oversampling Strategy

Undersampling distribution

Figure 3: Dataset from oversampling strategy.

This method increases samples in underrepresented categories through duplication or synthetic data generation.

4. Weighted Strategy

A weighted approach where samples from underrepresented categories receive higher weights during training.

Balance strategies comparison

Figure 2: Comparison of three balancing strategies: Undersampling, Oversampling, and Weighted approach

Initial Training Results: All three balancing strategies were implemented and corresponding training sessions were conducted. However, three persistent adverse effects remained across all approaches:

1. Zigzagging Behavior

Model exhibits oscillatory steering corrections even in straight road segments.

2. Lane Departures

Unexpected exits from the lane, particularly in straight sections.

3. Incorrect Turns

Mistimed or improperly executed turns at curves.

2. Critical Dataset Construction Issues Identified

Research into PilotNet documentation revealed that these symptoms commonly indicate specific dataset problems:

Further investigation of the dataset uncovered a significant technical issue:

Steering Latency Discovery

Analysis of consecutive frames revealed a systematic latency between steering commands and vehicle response:

2025-12-04T18_36_12-111464.jpg 0.1005
2025-12-04T18_36_12-161632.jpg 0.1053
2025-12-04T18_36_12-211651.jpg 0.1101
2025-12-04T18_36_12-261667.jpg 0.1142
2025-12-04T18_36_12-311761.jpg 0.1171
2025-12-04T18_36_12-361465.jpg 0.1189
2025-12-04T18_36_12-411711.jpg 0.1202
2025-12-04T18_36_12-461495.jpg
2025-12-04T18_36_12-511498.jpg 0.1217
2025-12-04T18_36_12-561665.jpg 0.1225
          

Consecutive images with incrementally increasing steering values despite essentially identical visual content

Key Finding: All these images are essentially identical visually, but the steering value increases progressively. This indicates a system latency of approximately 0.5 seconds between command issuance and vehicle response.

Implication: Since we aim to develop a reactive system, this latency is causing significant problems in model training and performance.

Potential Causes Identified:

1. Pygame Steering Acquisition Filtering

The Pygame function used for steering data acquisition may include filtering that introduces latency.

2. Warm-up Time Requirements

The system may require additional warm-up time before stable steering measurements are obtained.

3. Corrective Actions and Implementation Plan

Based on the identified issues, the following corrective strategies have been developed:

1. Consecutive Frame Reduction

Remove consecutive images that don't represent appreciable changes, keeping only one representative frame and averaging corresponding steering values.

2. Specialized Dataset Construction

Create specific datasets for different driving scenarios:

  • Straight-line driving dataset
  • Turning maneuvers dataset
  • Correction maneuvers dataset

This approach allows more efficient addressing of inertia and warm-up effects produced by vehicle dynamics.

3. Dataset-Specific Processing Pipeline

For each specialized dataset, implement three specific processing steps:

Step 1: Warm-up Period Removal

Eliminate the first 1 second of driving (20 frames) to avoid warm-up effects.

Step 2: Redundant Frame Elimination

Remove consecutive images without significant changes. Since driving logs contain vehicle position and orientation data, we compare entries with closely matching positional data rather than directly analyzing images. This addresses effects due to inertia.

Step 3: Anomaly Filtering

Filter values inconsistent with the driving type. For example, straight-line driving should have steering values between -0.04 and 0.04. Values outside these limits are considered anomalous and removed.

Current Progress

A new dataset is currently being constructed incorporating all these corrective actions. Initial tests with steering = 0 (straight-line driving) are being conducted to validate the approach.

Immediate Next Steps:

• Complete the new dataset construction with corrected data

• Launch training sessions with the cleaned dataset

• Compare performance between different balancing strategies on the corrected data

• Further investigate and potentially modify the Pygame steering acquisition pipeline

KEY FINDINGS THIS WEEK:

• Implemented three dataset balancing strategies following Shinohara's methodology

• Identified persistent model issues: zigzagging, lane departures, incorrect turns

• Discovered critical 0.5-second steering latency in data acquisition

• Developed comprehensive corrective action plan for dataset reconstruction

• New specialized datasets under construction with anomaly filtering

• Initial straight-line driving tests initiated to validate corrections

Conclusion:
Week 19 revealed fundamental issues in our data acquisition pipeline, particularly regarding steering latency. While balancing strategies alone didn't resolve model performance problems, the identification of systematic latency provides a clear path forward. The current dataset reconstruction effort, incorporating specialized scenario datasets and rigorous filtering, aims to create a cleaner, more reliable training foundation. Success in this correction phase is crucial for achieving stable, predictable autonomous driving behavior.