# Raw Data Pipeline

*(Diagram: raw data pipeline)*

The diagram shows how raw data is collected from external sources, processed, and stored.

The Pipeline can collect raw data in any text format, convert it into JSON, and automatically determine the DB schemas used to store the data.

The Pipeline consists of four elements:

  1. Log Collector – an HTTP receiver. It validates the API key and the received data model, wraps the data in the following envelope: (`_id`, `_aggregatedAt`, `_connector`, `_sourceType`, `_source`), and sends it to the Preprocessor (see the Log Collector sketch after this list).

  2. Preprocessor. Processes the received raw data (see the preprocessor sketch after this list). It does the following:

    • Parses the text and converts the data into JSON.
    • Processes bulk events and extracts the individual data elements.
    • Modifies the input model, e.g. adds computed attributes or changes an attribute's type.
    • Adds system labels to the model.
  3. DB Scheme Validator. Creates a DB schema from the JSON data model defined in the connector, adding extra attributes when needed (see the schema sketch after this list). It also performs stream processing with ML algorithms based on a trained model and adds the following labels:

    • `_labels.ml.cluster.id` – the ID of the cluster the event belongs to.
    • `_labels.ml.cluster.modelVersion` – the ML model version (iteration).
    • `_labels.ml.cluster.position` – the event's position relative to the cluster's center; a fractional number, where a smaller value means the event is closer to the center.
  4. Data Buffer. Stores all raw data in the ClickHouse DB according to the data stream's schema (see the ClickHouse sketch after this list).
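
The following minimal Python sketch illustrates the Log Collector's envelope model described in step 1. The API-key store, payload, and function name are hypothetical stand-ins, not the actual implementation:

```python
import json
import uuid
from datetime import datetime, timezone

VALID_API_KEYS = {"example-key"}  # hypothetical key store

def wrap_raw_event(raw_text: str, api_key: str, connector: str, source_type: str) -> dict:
    """Validate the API key and wrap raw data in the pipeline envelope."""
    if api_key not in VALID_API_KEYS:
        raise PermissionError("invalid API key")
    return {
        "_id": str(uuid.uuid4()),
        "_aggregatedAt": datetime.now(timezone.utc).isoformat(),
        "_connector": connector,
        "_sourceType": source_type,
        "_source": raw_text,
    }

print(json.dumps(wrap_raw_event("user=alice action=login", "example-key", "syslog", "text"), indent=2))
```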
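The preprocessor steps from step 2 can be sketched as follows, assuming the raw payload is a newline-delimited list of `key=value` records; the parser, the computed `_fieldCount` attribute, and the system label are illustrative placeholders:

```python
def preprocess(envelope: dict) -> list[dict]:
    """Parse raw text, split bulk events, and enrich the model."""
    events = []
    # A bulk payload is split into single events, one per line.
    for line in envelope["_source"].splitlines():
        # Parse the text into a JSON-compatible dict (key=value pairs here).
        parsed = dict(pair.split("=", 1) for pair in line.split() if "=" in pair)
        # Modify the input model: add a computed attribute...
        parsed["_fieldCount"] = len(parsed)
        # ...and attach a system label.
        parsed["_labels"] = {"system": {"stage": "preprocessed"}}
        events.append({**envelope, "_source": parsed})
    return events
```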
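Schema derivation and ML labeling from step 3 might look roughly like the sketch below. The Python-type-to-ClickHouse-type mapping, the centroid-based clustering, and the model version are assumptions standing in for the real trained model:

```python
import math

# Assumed mapping from Python value types to ClickHouse column types.
TYPE_MAP = {bool: "UInt8", int: "Int64", float: "Float64", str: "String"}

def infer_schema(event: dict) -> dict[str, str]:
    """Derive a column-name -> ClickHouse-type schema from a flat JSON model."""
    return {key: TYPE_MAP.get(type(value), "String") for key, value in event.items()}

# Stand-ins for the trained ML model: cluster centers and a version number.
CENTROIDS = [(0.0, 0.0), (5.0, 5.0)]
MODEL_VERSION = 3

def add_cluster_labels(event: dict, features: tuple[float, float]) -> dict:
    """Assign the event to the nearest cluster and attach the ML labels."""
    distances = [math.dist(features, c) for c in CENTROIDS]
    position = min(distances)  # smaller = closer to the cluster's center
    event.setdefault("_labels", {})["ml"] = {
        "cluster": {
            "id": distances.index(position),
            "modelVersion": MODEL_VERSION,
            "position": position,
        }
    }
    return event
```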
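Finally, a minimal Data Buffer sketch for step 4, using the clickhouse-driver package (one common Python ClickHouse client). It assumes flat events and a table already created by the DB Scheme Validator; the table and column handling are illustrative:

```python
from clickhouse_driver import Client

def buffer_events(events: list[dict], table: str = "raw_events") -> None:
    """Batch-insert preprocessed events into ClickHouse."""
    client = Client(host="localhost")
    columns = list(events[0].keys())
    rows = [tuple(event[col] for col in columns) for event in events]
    # clickhouse-driver sends the whole batch in one round trip.
    client.execute(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES",
        rows,
    )
```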