Wally consists of parts that should be split out and unified with other tools:

  • Create ceph-lib and move it into a separate project. It should support both Python 2.7 and 3.5 and have no hard external binary dependencies. Move into it:

    • Cluster detector
    • Cluster info collector
    • Monitoring
    • FSStorage
  • OpenStack VM spawn

  • Load generators

  • Load results visualizer

  • Cluster load visualization

  • Bottleneck detection

  • Resource consumption calculation

  • Glue code

  • Storages should be easy to plug in

  • Make the resource consumption calculation configurable: specify which ratios to compute (what is measured against what)

  • Set the storage plugin in the config

Resources:

Sensor output values are named as

NODE_OR_ROLE.DEVICE.SENSOR

Create a namespace with all nodes/roles as objects whose getattr method is overloaded to resolve the device first and then the sensor. Run eval on the resulting expression, e.g.:

(CLUSTER.DISK.riops + CLUSTER.DISK.wiops) / (VM.DISK.riops + VM.DISK.wiops)
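
A minimal sketch of this idea in Python, assuming sensor data is already aggregated into a nested dict of the form {NODE_OR_ROLE: {DEVICE: {SENSOR: value}}}. The class names, the evaluate() helper and the sample numbers are hypothetical and are not part of the Wally code. Note that eval is only acceptable here if the expressions come from a trusted config.

```python
class _Device:
    """Second level of the namespace: resolves SENSOR for one device."""

    def __init__(self, sensors):
        self._sensors = sensors  # dict: sensor name -> value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_") or name not in self._sensors:
            raise AttributeError(name)
        return self._sensors[name]


class _Node:
    """First level of the namespace: resolves DEVICE for one node or role."""

    def __init__(self, devices):
        self._devices = devices  # dict: device name -> {sensor: value}

    def __getattr__(self, name):
        if name.startswith("_") or name not in self._devices:
            raise AttributeError(name)
        return _Device(self._devices[name])


def evaluate(expr, data):
    """Evaluate expr with every node/role in data exposed as a name."""
    namespace = {name: _Node(devices) for name, devices in data.items()}
    # Builtins are removed so the expression can only use the sensor names.
    return eval(expr, {"__builtins__": {}}, namespace)


# Made-up sample values, for illustration only.
data = {
    "CLUSTER": {"DISK": {"riops": 300, "wiops": 100}},
    "VM": {"DISK": {"riops": 150, "wiops": 50}},
}
expr = "(CLUSTER.DISK.riops + CLUSTER.DISK.wiops) / (VM.DISK.riops + VM.DISK.wiops)"
ratio = evaluate(expr, data)  # (300 + 100) / (150 + 50)
```

With this layout the ratios to compute could be stored as plain strings in the config and evaluated once sensor data is available.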

Remarks:

  • With the current code it is impossible to run a VM count scan test

TODO next

  • Remove DBStorage, merge FSStorage and serializer into ObjStorage, separate TSStorage.
  • Build WallyStorage on top of it, use only WallyStorage in code
  • Check that the OS key matches what is stored on disk
  • unit tests for math functions
  • CEPH PERFORMANCE COUNTERS
  • Sync storage_structure
  • fix fio job summary
  • Use disk max QD as qd limit?
  • Cumulative statistic table for all jobs
  • Add a column for job params that shows how much of the cluster's resources is consumed
  • show extra outliers with arrows
  • More X = func(QD) plots, e.g. kurtosis/skew
  • Hide cluster load if no nodes available
  • Show latency skew and kurtosis
  • Sort engineering report by result tuple
  • Name engineering reports by long summary
  • Latency heatmap and violin aren't consistent
  • Profile violin plot
  • Fix plot layout: there is too much unused space around a typical plot
  • IOPS boxplot as a function of QD
  • collect device types mapping from nodes - device should be block/net/...
  • Optimize sensor communication with Ceph: the first OSD request, used for data validation, can run only on start.
  • Update Storage test, add tests for stat and plot module
  • Aggregated sensors boxplot
  • Heatmap for aggregated sensors
  • automatically find what to plot from storage data (but also allow to select via config)

Have to think:

  • Send data to external storage

  • Each sensor should collect only one portion of data. On start it should scan all available sources and tell the calling code to create separate functions for them.

  • Store statistical results in storage

  • Check I/O on the file during prefill

  • Store percentile levels in TS: split 1D TS and 2D TS into different classes, store levels in 2D TS

  • Weighted average and deviation

  • C++/Go disk stat sensors to measure IOPS/latency at millisecond resolution
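
For the weighted average and deviation item above, one possible formulation is the weighted mean together with the population (biased) weighted standard deviation. The sketch below assumes non-negative weights, for example the number of samples behind each value. The function name is hypothetical and is not taken from the Wally code.

```python
import math


def weighted_mean_and_std(values, weights):
    """Return (weighted mean, population weighted standard deviation)."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("sum of weights must be positive")
    # mean = sum(w_i * x_i) / sum(w_i)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    # var = sum(w_i * (x_i - mean)^2) / sum(w_i), no small-sample correction
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)
```

If an unbiased estimate is needed, the denominator of the variance has to be corrected, and the right correction depends on whether the weights are frequencies or reliabilities.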

TODO large


  • Force-kill running fio on Ctrl+C and clean up correctly, or clean up all previous runs with 'wally cleanup PATH'

  • Code:


  • RW mixed report

  • RPC reconnect in case of errors

  • Store more information per node: OSD settings, FS on test nodes, target block device settings on test nodes

  • Sensors

    • Revise sensor code: prepack data on the node side, support different sensor data types
    • perf
    • bcc
    • ceph sensors
  • Config validation

  • Add sync 4k write with a small set of thcount values

  • Flexible SSH connection creds - use agent, default ssh settings or part of config

  • Remove created temporary files: create all tempfiles via a function from .utils that tracks them

  • Use ceph-monitoring from wally

  • Use warm-up detection to select real test time.

  • Report code:

    • Compatible report types set up by config and load??
  • Calculate statistics for the previous iteration in the background

  • UT


  • UT that runs a test against a cluster predefined in YAML (cluster and config are created separately, not by the tests) and checks that the result storage works as expected. Declare the DB schema in a separate YAML file; the UT should check against it.

  • White-box event logs for UT

  • Result-to-yaml for UT

  • Infra:


  • Add script to download fio from git and build it

  • Public Docker/LXD container as the default distribution method

  • Update setup.py to provide CLI entry points

  • Statistical result check and report:



  • Add sensor collection time to them
  • Make the collection interval configurable per sensor type, track collection time separately for each sensor
  • DB <-> files conversion, or just always store everything in files as well
  • Automatically scale QD till saturation
  • Runtime visualization
  • Integrate vdbench/spc/TPC/TPB
  • Add aio rpc client
  • Add integration tests with nbd
  • fix existing folder detection
  • Simple REST API for external in-browser UI