
DevSecOps : Identify More Accurate Batch Size Limit #17003

emvaldes opened this issue Jan 7, 2025 · 16 comments

emvaldes commented Jan 7, 2025

Objective:

Determine the optimal batch size for processing workloads efficiently without compromising system performance or reliability. The current estimate is 2.5k, and this testing aims to validate or refine this value.


Deliverables

  1. Test Results: Detailed performance metrics for each batch size.
  2. Visualizations: Charts comparing batch size vs. latency, error rates, and resource utilization.
  3. Batch Size Recommendation: Documented optimal batch size with supporting data.
  4. Pilot Report: Insights from testing the recommended batch size in production.

Integration With Existing Sections

  • This section integrates closely with:
    1. Reproducing the Production Environment: Ensures the test environment matches production for accurate results.
    2. Load Testing: Uses tools like K6 or JMeter for simulating batch processing.
    3. Monitoring: Leverages Azure Monitor and Application Insights to capture metrics during testing.

emvaldes commented Jan 7, 2025

Analyze Current Batch Processing Behavior

Goal: Understand how the system behaves when processing batches of different sizes under various conditions.


Tasks:

  1. Review Historical Performance Data

    • Sub-Tasks:
      1. Extract batch processing metrics from logs (e.g., Azure Log Analytics); a sample query is sketched after this list.
      2. Identify common batch sizes processed in production and their associated performance metrics (e.g., latency, errors).
  2. Document Key Constraints

    • Sub-Tasks:
      1. Determine system bottlenecks for large batches (e.g., memory limits, database transaction timeouts).
      2. Identify downstream service limits (e.g., message queue throughput).
  3. Identify Baseline Metrics

    • Sub-Tasks:
      1. Define acceptable performance thresholds for batch processing:
        • Maximum latency (e.g., <500ms per batch).
        • Error rate (e.g., <1% failures).
        • Resource usage (e.g., <80% CPU/memory utilization).
      2. Document current performance metrics for the 2.5k batch size.
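
A minimal Log Analytics sketch for sub-task 1.1 above, assuming batch completions are logged as a custom event named BatchProcessed with duration and batchSize custom dimensions (hypothetical names; adjust to the actual logging schema):

    customEvents
    | where timestamp > ago(30d) and name == "BatchProcessed"
    | extend BatchSize = toint(customDimensions['batchSize'])
    | summarize AvgLatency = avg(todouble(customDimensions['duration'])),
                P95Latency = percentile(todouble(customDimensions['duration']), 95),
                Batches = count()
              by BatchSize
    | order by BatchSize asc

The same output doubles as the baseline record for task 3: the row for BatchSize == 2500 documents current performance against the thresholds defined above.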


emvaldes commented Jan 7, 2025

Design and Execute Batch Size Testing

Goal: Test batch processing performance across a range of batch sizes to determine the optimal size.


Tasks:

  1. Prepare the Test Environment

    • Sub-Tasks:
      1. Deploy a high-fidelity staging environment that replicates production conditions.
      2. Use mock data or anonymized production data for testing.
  2. Define Test Scenarios

    • Sub-Tasks:
      1. Test with varying batch sizes: 500, 1000, 1500, 2000, 2500, 3000, and 5000.
      2. Include edge-case scenarios, such as extremely small (<100) or large (>10,000) batch sizes.
  3. Execute Batch Size Tests

    • Sub-Tasks:
      1. Use K6 or JMeter to simulate batch processing with different sizes (a parameterized sketch follows this list).
      2. Measure and log performance metrics:
        • Processing time per batch.
        • Resource utilization (CPU, memory, disk I/O).
        • Error rates (e.g., failed transactions, retries).
      3. Test under both normal and peak load conditions.
  4. Monitor Performance During Tests

    • Sub-Tasks:
      1. Use Azure Monitor to track resource utilization (CPU, memory, disk IOPS).
      2. Use Application Insights to monitor API call latency and dependency performance.
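
A minimal k6 sketch for driving the same test at each batch size, assuming the size is passed in via an environment variable; the endpoint URL and payload shape are placeholders:

    // Run as: k6 run -e BATCH_SIZE=2500 batch-test.js
    import http from 'k6/http';
    import { sleep } from 'k6';

    const BATCH_SIZE = parseInt(__ENV.BATCH_SIZE || '2500', 10);

    export default function () {
      // Each iteration submits one synthetic batch of BATCH_SIZE records.
      const payload = JSON.stringify({
        batchId: `batch-${__VU}-${__ITER}`,
        data: Array.from({ length: BATCH_SIZE }, (_, i) => i + 1),
      });
      http.post('https://your-api-endpoint.com/process-batch', payload, {
        headers: { 'Content-Type': 'application/json' },
      });
      sleep(1);
    }

Re-running with -e BATCH_SIZE=500, 1000, 1500, and so on keeps the scenario identical while only the batch size varies.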


emvaldes commented Jan 7, 2025

Analyze Results and Identify Optimal Batch Size

Goal: Evaluate the performance metrics to determine the batch size that offers the best balance of throughput, latency, and resource usage.


Tasks:

  1. Aggregate Test Results

    • Sub-Tasks:
      1. Consolidate metrics for each batch size tested (e.g., latency, throughput, resource usage, error rates); a consolidation query is sketched after this list.
      2. Visualize results using tools like Power BI, Excel, or Tableau.
  2. Identify Performance Trends

    • Sub-Tasks:
      1. Plot batch size vs. latency to identify trends (e.g., linear increase, sudden spikes).
      2. Plot batch size vs. error rate to determine thresholds where errors increase significantly.
  3. Determine Optimal Batch Size

    • Sub-Tasks:
      1. Select the batch size that provides:
        • Maximum throughput with minimal latency.
        • Error rates within acceptable thresholds (<1%).
        • Resource usage below critical limits (<80% CPU/memory utilization).
      2. Validate the selected batch size against production constraints (e.g., SLAs, compliance).
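
A sketch of the consolidation step in KQL, assuming BatchProcessed and BatchError custom events that carry a batchSize dimension (hypothetical names, consistent with the monitoring queries later in this thread):

    let processed = customEvents
        | where name == "BatchProcessed"
        | extend BatchSize = toint(customDimensions['batchSize'])
        | summarize Processed = count() by BatchSize;
    let failed = customEvents
        | where name == "BatchError"
        | extend BatchSize = toint(customDimensions['batchSize'])
        | summarize Failed = count() by BatchSize;
    processed
    | join kind=leftouter (failed) on BatchSize
    | extend ErrorRatePct = 100.0 * coalesce(Failed, 0) / Processed
    | order by BatchSize asc

The resulting table feeds directly into the batch size vs. error rate plot, making the <1% threshold easy to check per size.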


emvaldes commented Jan 7, 2025

Validate Findings in Production

Goal: Test the identified batch size in a controlled production environment to confirm its feasibility.


Tasks:

  1. Pilot the Optimal Batch Size

    • Sub-Tasks:
      1. Use a subset of production workloads to process batches at the new size.
      2. Monitor performance metrics closely (latency, resource usage, errors).
  2. Compare with Current Batch Size

    • Sub-Tasks:
      1. Measure performance differences between the current (2.5k) and optimal batch sizes; a side-by-side query is sketched after this list.
      2. Document any additional insights or challenges observed during the pilot.
  3. Finalize Batch Size Recommendation

    • Sub-Tasks:
      1. Prepare a detailed report summarizing findings and the recommended batch size.
      2. Present results to stakeholders for validation and approval.
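
A side-by-side comparison sketch in KQL, assuming the pilot runs at a hypothetical size of 3000 alongside the current 2500 (substitute the actual pilot size):

    customEvents
    | where name == "BatchProcessed"
    | extend BatchSize = toint(customDimensions['batchSize'])
    | where BatchSize in (2500, 3000)  // 3000 is a placeholder for the pilot size
    | summarize P95Latency = percentile(todouble(customDimensions['duration']), 95)
              by BatchSize, bin(timestamp, 5m)
    | render timechart

Rendering one series per batch size makes regressions during the pilot visible at a glance.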


emvaldes commented Jan 7, 2025

Scalability Testing Implementation Details

Scalability testing for batch size focuses on understanding how the system performs as the batch size is adjusted under varying loads. A detailed guide follows in the comments below.


emvaldes commented Jan 7, 2025

Define the Test Goals

  1. Determine the maximum batch size the system can handle under normal and peak load conditions.
  2. Ensure batch processing maintains acceptable performance metrics (these can be encoded as pass/fail thresholds, as sketched after this list):
    • Latency: Time taken to process each batch.
    • Throughput: Number of batches processed per second.
    • Error Rate: Percentage of failed batches.
    • Resource Utilization: CPU, memory, and I/O usage.
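
A minimal sketch of these goals as k6 thresholds, so a test run fails automatically when a target is breached; the numbers assume the 500ms latency and 1% error-rate targets named earlier in this issue:

    export const options = {
      thresholds: {
        http_req_duration: ['p(95)<500'], // latency: 95th percentile under 500ms
        http_req_failed: ['rate<0.01'],   // error rate: under 1% failed requests
      },
    };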


emvaldes commented Jan 7, 2025

Prepare the Test Environment

  1. High-Fidelity Staging Environment:

    • Reproduce the production environment, including all critical components (e.g., VMs, containers, databases).
    • Use Infrastructure as Code (Terraform or ARM templates) to ensure consistency.
  2. Data Setup:

    • Use production-like test data for batch processing. For example:
      • Mock datasets with similar size, structure, and complexity as production data.
      • Vary data types and sizes within the batches to include edge cases.
  3. Monitoring and Metrics Collection:

    • Enable Azure Monitor and Application Insights:
      • Track infrastructure-level metrics (CPU, memory, disk IOPS).
      • Monitor application-level metrics (latency, throughput, errors).


emvaldes commented Jan 7, 2025

Define Test Scenarios

  1. Test with a range of batch sizes:
    • Start with small sizes (e.g., 500, 1000) and gradually increase to larger sizes (e.g., 3000, 5000, 10,000).
  2. Simulate different load patterns (a scenario sketch follows this list):
    • Normal Load: Average number of batches processed per second.
    • Peak Load: High-concurrency scenarios with multiple batches submitted simultaneously.
  3. Test under failure scenarios:
    • Simulate a batch with corrupted or missing data.
    • Test system behavior under partial database or network failure.
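
A k6 sketch of the normal and peak load patterns using scenarios; the arrival rates and durations below are placeholders to be tuned against observed production traffic:

    export const options = {
      scenarios: {
        normal_load: {
          executor: 'constant-arrival-rate',
          rate: 10,            // ~10 batch submissions per second
          timeUnit: '1s',
          duration: '10m',
          preAllocatedVUs: 50,
        },
        peak_load: {
          executor: 'ramping-arrival-rate',
          startTime: '10m',    // starts after the normal-load scenario ends
          startRate: 10,
          timeUnit: '1s',
          stages: [{ target: 100, duration: '5m' }],
          preAllocatedVUs: 200,
        },
      },
    };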


emvaldes commented Jan 7, 2025

Execute the Tests

  1. Load Testing:

    • Use K6 or Apache JMeter to simulate batch submissions.
    • Example with K6:
      import http from 'k6/http';
      import { sleep } from 'k6';
      
      export let options = {
        stages: [
          { duration: '1m', target: 20 }, // Ramp up to 20 users
          { duration: '5m', target: 100 }, // Sustain load
          { duration: '1m', target: 0 }, // Ramp down
        ],
      };
      
      export default function () {
        // Each iteration submits one synthetic batch of 2,500 records (the current estimate).
        const payload = JSON.stringify({
          batchId: `batch-${__VU}-${__ITER}`,
          data: Array.from({ length: 2500 }, (_, i) => i + 1),
        });
      
        const params = {
          headers: {
            'Content-Type': 'application/json',
          },
        };
      
        http.post('https://your-api-endpoint.com/process-batch', payload, params);
        sleep(1);
      }
  2. Stress Testing:

    • Push the system beyond its expected capacity to identify bottlenecks.
    • Gradually increase batch size or load until the system becomes unstable.
  3. Soak Testing:

    • Submit batches continuously over an extended period (e.g., 6–12 hours) to test for resource leaks or performance degradation.
  4. Chaos Testing:

    • Use Azure Chaos Studio to simulate failures during batch processing.
    • Example: Introduce network latency or shut down a database node while processing large batches.


emvaldes commented Jan 7, 2025

Analyze Test Results

  1. Aggregate metrics for each batch size tested:
    • Latency (average, 95th percentile).
    • Error rates (e.g., failed or retried batches).
    • Resource utilization (CPU, memory, disk IOPS).
  2. Identify trends and thresholds:
    • Plot batch size vs. latency, error rates, and resource usage.
    • Determine the size at which performance degradation becomes unacceptable.


emvaldes commented Jan 7, 2025

Document Findings

  1. Create a scalability report:
    • Include test scenarios, metrics, and visualizations.
    • Recommend the optimal batch size based on test results.
  2. Present findings to stakeholders for validation.


emvaldes commented Jan 7, 2025

Monitoring Batch Size Metrics

Monitoring batch size metrics is critical for identifying the optimal size and ensuring consistent performance. Below are the key metrics to monitor and how to capture them.


emvaldes commented Jan 7, 2025

Metrics to Monitor

  1. Latency:
    • Average time taken to process a batch.
    • Tail latency (e.g., 95th and 99th percentiles).
  2. Throughput:
    • Number of batches processed per second.
  3. Error Rate:
    • Percentage of failed or retried batches.
  4. Resource Utilization:
    • CPU, memory, and disk I/O usage during batch processing.
  5. Queue Depth (if applicable):
    • Number of unprocessed batches in the message queue.


emvaldes commented Jan 7, 2025

Tools for Monitoring

  1. Azure Monitor:
    • Use metrics for VM or container-level performance:
      • CPU utilization.
      • Disk IOPS and network bandwidth.
  2. Application Insights:
    • Capture application-level metrics:
      • Custom events for batch processing start and end times (an instrumentation sketch follows this list).
      • Dependency metrics for database or API calls.
  3. Log Analytics:
    • Query logs for batch processing errors or performance metrics.
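
A minimal instrumentation sketch, assuming a Node.js worker and the applicationinsights SDK; the event and dimension names match the KQL queries in the next comment, and handleBatch is a stand-in for the real batch handler:

    const appInsights = require('applicationinsights');
    appInsights.setup('<connection-string>').start();
    const client = appInsights.defaultClient;

    async function processBatch(batch) {
      const start = Date.now();
      try {
        await handleBatch(batch); // stand-in for the actual batch processing logic
        client.trackEvent({
          name: 'BatchProcessed',
          properties: {
            batchId: batch.id,
            batchSize: String(batch.data.length),
            duration: String(Date.now() - start), // milliseconds, queried via todouble()
          },
        });
      } catch (err) {
        client.trackEvent({
          name: 'BatchError',
          properties: { batchId: batch.id, error: err.message },
        });
        throw err;
      }
    }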


emvaldes commented Jan 7, 2025

KQL Queries for Monitoring

  1. Batch Processing Latency:

    customEvents
    | where name == "BatchProcessed"
    | summarize AvgLatency = avg(todouble(customDimensions['duration'])), P95Latency = percentile(todouble(customDimensions['duration']), 95) by bin(timestamp, 1m)
  2. Batch Processing Error Rate:

    customEvents
    | where name == "BatchError"
    | summarize ErrorCount = count() by bin(timestamp, 1m), BatchId = tostring(customDimensions['batchId'])
  3. CPU and Memory Utilization:

    Perf
    | where (ObjectName == "Processor" and CounterName == "% Processor Time")
        or (ObjectName == "Memory" and CounterName == "% Committed Bytes In Use")
    | summarize AvgValue = avg(CounterValue) by CounterName, bin(TimeGenerated, 1m)
  4. Throughput (Batches Processed Per Second):

    customEvents
    | where name == "BatchProcessed"
    | summarize Throughput = count() by bin(timestamp, 1s)


emvaldes commented Jan 7, 2025

Visualizing Metrics in Dashboards

  1. Azure Workbooks:
    • Create charts for latency, error rate, and resource utilization (a chart-ready query is sketched below).
    • Use line charts for trends over time and bar charts for batch size comparisons.
  2. Grafana (Optional):
    • Integrate Azure Monitor with Grafana for advanced visualizations.
    • Display real-time dashboards for throughput and latency.
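
A chart-ready sketch for a workbook tile, assuming the same BatchProcessed event and batchSize dimension used in the queries above:

    customEvents
    | where name == "BatchProcessed"
    | extend BatchSize = toint(customDimensions['batchSize'])
    | summarize P95Latency = percentile(todouble(customDimensions['duration']), 95) by BatchSize
    | order by BatchSize asc
    | render columnchart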
