Iterative.ai is a machine learning operations (MLOps) company focused on streamlining the workflow of data scientists. The company builds developer tools for machine learning that are designed to reduce the complexity of managing datasets, ML infrastructure, and the ML model lifecycle. Iterative.ai's products have been developed by over 200 open-source contributors, engaged with by more than 4,000 community members, used by over 400 companies, and awarded more than 7,000 GitHub stars.
Data Version Control (DVC) captures versions of data and models in Git commits while storing the files themselves on-premises or in cloud storage. It also provides a mechanism to switch between different versions of the data. The result is a single traversable history for data, code, and ML models.
DVC enables data versioning through codification, wherein simple metafiles are produced once by the user, describing which datasets, ML artifacts, and other items should be tracked. This metadata can be put in Git in lieu of large files. DVC can then be used to create snapshots of the data, restore previous versions, reproduce experiments, record evolving metrics, and more.
As DVC is used, unique versions of the user's data files and directories are cached in a systematic way to prevent file duplication. Although the working datastore is separated from the workspace to minimize the project's size, it stays connected via file links handled automatically by DVC.
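The caching idea described above can be illustrated with a minimal sketch. It is not DVC's implementation: the function names `content_hash` and `track`, the `.meta` suffix, the use of SHA-256, and the cache layout are all assumptions chosen for illustration. The sketch stores each file under a name derived from its content, so identical files occupy one cache entry, links the file back into the workspace, and writes a small metafile that could be committed to Git in place of the data.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path: Path, cache_dir: Path) -> Path:
    """Store a file in a content-addressed cache and link it back (hypothetical helper)."""
    checksum = content_hash(path)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        path.unlink()  # identical content is already cached, drop the duplicate
    else:
        os.replace(path, cached)  # first time this content is seen
    try:
        os.link(cached, path)  # hard link: the workspace file costs no extra space
    except OSError:
        shutil.copy2(cached, path)  # fall back to a copy where links are unsupported
    # Small text metafile that stands in for the large file under version control.
    metafile = path.with_name(path.name + ".meta")
    metafile.write_text(f"sha256: {checksum}\npath: {path.name}\n")
    return cached


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / ".cache"
    (workspace / "train.csv").write_text("x,y\n1,2\n")
    (workspace / "copy.csv").write_text("x,y\n1,2\n")
    first = track(workspace / "train.csv", cache)
    second = track(workspace / "copy.csv", cache)
    cached_files = [p for p in cache.rglob("*") if p.is_file()]
    print(first == second, len(cached_files))  # prints: True 1
```

Because the cache key depends only on content, two files with the same bytes share one cache entry regardless of their names.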
The DVC platform offers the following features:
- Compatibility with Git: DVC is compatible with any standard Git server or provider (GitHub, GitLab, etc.) and can be integrated with any Git repository. Data file contents can be shared through network-accessible storage or any supported cloud solution. DVC offers advantages similar to those of a distributed version control system, such as lock-freedom, local branching, and versioning.
- Support for various kinds of storage systems: DVC can use Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or local disks to store data. The platform is continuously updated with new remote storage options.
- Reproducibility: DVC records the input data, configuration, and code used to run an experiment, so that experiments can be reproduced and failures traced.
- Git branching: DVC supports instantaneous Git branching without duplicating data, regardless of file size, allowing a single file to be reused across multiple experiments.
- Metric tracking: DVC includes a command to list all branches, along with metric values, to track progress or aid the user in selecting the desired version of code.
- ML pipelines framework: DVC has a built-in system that assembles ML steps into a DAG (Directed Acyclic Graph) and runs the pipeline end-to-end. DVC handles caching of intermediate results and does not repeat a step if its input data and code are unchanged.
- Language- and framework-agnostic operation: DVC is independent of the programming language in use, library types, and code structure; reproducibility and pipelines are based on input and output files or directories. Python, R, Julia, Scala Spark, custom binaries, notebooks, flat files, TensorFlow, PyTorch, and others are all supported.
- HDFS, Hive & Apache Spark: Spark and Hive jobs can be included in the DVC data versioning cycle alongside local ML modeling steps, and can be managed with DVC. Decomposing a large cluster task into smaller DVC pipeline steps shortens feedback loops and allows each step to be iterated on independently, subject to its dependencies.
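The pipeline behavior listed above, where steps form a DAG and a step is skipped when its inputs are unchanged, can be sketched as follows. This is an illustration only and does not reflect DVC's internals: `run_pipeline`, the stage names, and the in-memory `cache` dictionary are hypothetical, and the sketch detects changes in parameters and upstream outputs but, for brevity, not changes in the stage code itself.

```python
import hashlib
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Optional, Tuple

# A stage is (names of upstream stages, function of upstream outputs and a parameter).
Stage = Tuple[List[str], Callable[[Dict[str, str], Optional[object]], str]]


def run_pipeline(
    stages: Dict[str, Stage],
    params: Dict[str, object],
    cache: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Run stages in dependency order, reusing cached results for unchanged inputs."""
    graph = {name: set(deps) for name, (deps, _) in stages.items()}
    outputs: Dict[str, str] = {}
    executed: List[str] = []
    for name in TopologicalSorter(graph).static_order():
        deps, func = stages[name]
        inputs = {dep: outputs[dep] for dep in deps}
        # The cache key covers the stage name, its parameter, and upstream outputs.
        key_source = repr((name, params.get(name), sorted(inputs.items())))
        key = hashlib.sha256(key_source.encode()).hexdigest()
        if key not in cache:
            cache[key] = func(inputs, params.get(name))
            executed.append(name)
        outputs[name] = cache[key]
    return outputs, executed


# Hypothetical three-step pipeline: prepare -> train -> evaluate.
stages: Dict[str, Stage] = {
    "prepare": ([], lambda inputs, p: f"rows={p}"),
    "train": (["prepare"], lambda inputs, p: f"model({inputs['prepare']}, lr={p})"),
    "evaluate": (["train"], lambda inputs, p: f"score({inputs['train']})"),
}

cache: Dict[str, str] = {}
_, first_run = run_pipeline(stages, {"prepare": 100, "train": 0.1}, cache)
_, second_run = run_pipeline(stages, {"prepare": 100, "train": 0.1}, cache)
_, third_run = run_pipeline(stages, {"prepare": 100, "train": 0.2}, cache)
print(first_run)   # ['prepare', 'train', 'evaluate']
print(second_run)  # []
print(third_run)   # ['train', 'evaluate']
```

In the third run only the training parameter changes, so the preparation step is served from the cache while training and everything downstream of it are rerun.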
DVC is useful wherever data files are stored and processed to produce other data or machine learning models. DVC also enables the user to perform the following:
- Track and save data and machine learning models in the same way that code is captured
- Create and switch between versions of data and ML models
- Understand how datasets and ML artifacts were originally built
- Compare model metrics among experiments
- Adopt engineering tools and best practices in data science projects
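Comparing model metrics among experiments amounts to lining up the same metric names across runs. The sketch below shows one way to do that in plain Python; it is not DVC's command or output format, and the function names, experiment names, and metric values are made up for illustration.

```python
from typing import Dict


def metrics_table(experiments: Dict[str, Dict[str, float]]) -> str:
    """Format per-experiment metrics as an aligned plain-text table."""
    names = sorted({metric for metrics in experiments.values() for metric in metrics})
    header = ["experiment", *names]
    rows = [
        [exp, *(f"{metrics[m]:.4f}" if m in metrics else "-" for m in names)]
        for exp, metrics in experiments.items()
    ]
    table = [header, *rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


def best_experiment(
    experiments: Dict[str, Dict[str, float]],
    metric: str,
    higher_is_better: bool = True,
) -> str:
    """Return the name of the experiment with the best value for one metric."""
    candidates = {exp: m[metric] for exp, m in experiments.items() if metric in m}
    pick = max if higher_is_better else min
    return pick(candidates, key=candidates.get)


# Hypothetical experiments, e.g. one per Git branch.
runs = {
    "baseline": {"accuracy": 0.91, "loss": 0.31},
    "bigger-model": {"accuracy": 0.94, "loss": 0.22},
    "more-data": {"accuracy": 0.93},
}
print(metrics_table(runs))
print(best_experiment(runs, "accuracy"))  # bigger-model
```

Experiments that did not record a given metric are shown with a dash and are ignored when selecting the best run for that metric.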
Advantages of the DVC tool include those below:
- No cost: DVC is a free, open-source, command-line tool and does not require databases, servers, or any other special services to operate.
- Project readability: File names stay stable even as the data they refer to changes, which keeps projects readable.
- Data management functionality: DVC lets users keep data and models in a familiar storage solution (e.g. SFTP, S3, or HDFS) that is free from Git hosting constraints. The platform also optimizes the storage and transfer of large files.
- Collaboration capacities: DVC aids collaboration by making it straightforward to distribute project development, share project data internally and remotely, and reuse that data elsewhere.
- Data compliance: Data modification attempts can be reviewed as Git pull requests. The user can audit the project's history to learn when and why datasets or models were approved.
- GitOps: Data science projects can be connected with the Git ecosystem. Git workflows open the door to tools such as continuous integration and delivery (e.g. CML), specialized patterns (e.g. data registries), and other best practices.
Continuous Machine Learning (CML) is an open-source library for implementing CI/CD (continuous integration/delivery) in machine learning projects. It can be used to automate parts of the user's development workflow, including model training and evaluation, comparing ML experiments across the user's project history, and monitoring changing datasets.
CML was developed to enable the use of GitLab or GitHub to manage ML experiments, to track who trained ML models or modified data and when, and to automatically generate reports for ML experiments, with metrics and plots in every pull request. CML allows users to build their own ML platform using GitHub or GitLab and cloud services such as AWS, Azure, or GCP. Like DVC, CML works without additional databases or services.
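The kind of report that a CI job might attach to a pull request can be sketched in a few lines. The example below is not CML's API: `render_report` is a hypothetical helper and the metric values are invented. It compares the metrics of a baseline branch with those of a candidate branch and renders the result as a Markdown table, the format that GitHub and GitLab comments display.

```python
from typing import Dict, List


def render_report(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    title: str = "Experiment report",
) -> str:
    """Render a Markdown comparison of two sets of metrics (hypothetical helper)."""
    lines: List[str] = [
        f"## {title}",
        "",
        "| metric | baseline | candidate | change |",
        "| --- | --- | --- | --- |",
    ]
    for name in sorted(set(baseline) | set(candidate)):
        old = baseline.get(name)
        new = candidate.get(name)
        old_text = f"{old:.4f}" if old is not None else "n/a"
        new_text = f"{new:.4f}" if new is not None else "n/a"
        # A change is only defined when both branches report the metric.
        if old is not None and new is not None:
            change = f"{new - old:+.4f}"
        else:
            change = "n/a"
        lines.append(f"| {name} | {old_text} | {new_text} | {change} |")
    return "\n".join(lines)


# Hypothetical metrics from the main branch and from a pull request branch.
report = render_report(
    {"accuracy": 0.90, "loss": 0.35},
    {"accuracy": 0.92, "loss": 0.30, "f1": 0.88},
)
print(report)
```

In a real workflow the two metric dictionaries would be read from files produced by the training job on each branch, and the rendered text would be posted by the CI system as a comment.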
Iterative.ai's Studio is a collaboration tool for machine learning, offering data and model management, experiment tracking, visualization, and automation. Studio is offered for teams and for individual users and works with other Iterative.ai software products.
MLEM is an open-source tool offered by Iterative.ai that is intended to simplify machine learning model deployment. MLEM allows users to save an ML model with a Python call, automatically captures the model's metadata in a human-readable YAML format, lets users deploy models where they want, and enables them to switch deployment platforms. MLEM was designed for Git-native ML models and helps users build a model registry in Git.