Processing large CSV files in Alumio

This guide shows how to make working with large files in Alumio less resource-intensive by using the file splitter feature.

The file splitter lets you set up a cronjob that periodically checks a filesystem for large files and splits them into smaller files.

How it works

The file splitter takes a filesystem, a splitter and a maximum file size. It checks each file in the root directory of the filesystem to see whether it is a valid candidate for splitting.

The command passes a stream of the file to the splitter. The splitter then creates individual chunks of grouped data and sends them back to the command/endpoint. There, the new files are created and the original file is discarded.

The whole process uses streams, which keeps the server from running out of memory while large files are being split.
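The idea of stream-based splitting can be illustrated with a minimal Python sketch. This is not Alumio's implementation: the function names split_csv_stream and encode_row are hypothetical, and the rule that every chunk repeats the header and holds at least one data row is an assumption based on the test results described later in this guide.

```python
import csv
import io
from typing import Iterator, List, TextIO


def encode_row(row: List[str]) -> str:
    """Render one CSV row as text, terminated by a newline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


def split_csv_stream(source: TextIO, max_bytes: int) -> Iterator[str]:
    """Yield CSV chunks read row by row from a stream.

    Assumption: each chunk starts with the header row, and a chunk is
    closed as soon as adding the next row would exceed max_bytes.
    A chunk always holds at least one data row, so a very small limit
    results in one row per chunk.
    """
    reader = csv.reader(source)
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to split

    header_line = encode_row(header)
    header_size = len(header_line.encode("utf-8"))

    rows: List[str] = []
    size = header_size
    for row in reader:
        line = encode_row(row)
        line_size = len(line.encode("utf-8"))
        # Close the current chunk before it would grow past the limit.
        if rows and size + line_size > max_bytes:
            yield header_line + "".join(rows)
            rows, size = [], header_size
        rows.append(line)
        size += line_size

    if rows:
        yield header_line + "".join(rows)
```

Because the source is read row by row and only the current chunk is held in memory, memory use depends on the chunk size rather than on the size of the original file.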

Setting up a cronjob

Currently, the following splitters are available:

  • csv

The invocation of the command looks like this:

vendor/bin/magement filesystem:split my-filesystem-queue csv 100000

In this command, my-filesystem-queue is the identifier of the created filesystem, csv is the identifier of the requested splitter, and 100000 is the maximum file size in bytes. The API call variant would be:

GET: https://environment.magement.com/filesystem/split?filesystem=my-filesystem-queue&splitter=csv&maximum_size=100000
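The query string of this call mirrors the three command arguments. As a minimal sketch, the URL can be assembled in Python as follows. The helper name build_split_url is hypothetical, and the sketch only builds the URL; it does not send the request or handle any authentication the environment may require.

```python
from urllib.parse import urlencode


def build_split_url(base_url: str, filesystem: str, splitter: str, maximum_size: int) -> str:
    """Build the split endpoint URL from the same values the command takes."""
    query = urlencode(
        {
            "filesystem": filesystem,      # identifier of the filesystem
            "splitter": splitter,          # identifier of the splitter, e.g. csv
            "maximum_size": maximum_size,  # maximum file size in bytes
        }
    )
    return f"{base_url}/filesystem/split?{query}"


url = build_split_url(
    "https://environment.magement.com", "my-filesystem-queue", "csv", 100000
)
print(url)
```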

It is recommended to run the split command shortly after an import has taken place (e.g. after a CSV file has been uploaded to a filesystem).

For example, if a CSV file is uploaded each night at 00:00 and is certain to be there by 00:05, the cronjob can be set up like this:

5 0 * * * $MAGEMENT_ROOT/vendor/bin/magement filesystem:split my-filesystem-queue csv 100000
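To make the schedule in this line explicit, the following Python sketch splits a crontab line into its five standard time fields and the command. The helper parse_cron_line is hypothetical and only illustrates how the line is read; it does not validate the field values.

```python
from typing import Dict, Tuple

# The five time fields of a standard crontab entry, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron_line(line: str) -> Tuple[Dict[str, str], str]:
    """Split a crontab line into its schedule fields and the command."""
    parts = line.split(None, 5)  # five schedule fields, then the rest
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


schedule, command = parse_cron_line(
    "5 0 * * * $MAGEMENT_ROOT/vendor/bin/magement filesystem:split my-filesystem-queue csv 100000"
)
print(schedule)
print(command)
```

Here minute is 5 and hour is 0, while the remaining fields are wildcards, so the command runs every day at 00:05.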

Creating the file queue

A file queue that handles split files needs two filesystems. One is the main filesystem of the file queue, and the other points to the queue directory of that file queue.

The main filesystem

The main filesystem can be created in the dashboard by going to Storages > Filesystems. Then click on the button to create a new filesystem.

Choose a name for the filesystem. The identifier will be generated automatically. Then choose the Local filesystem option under Prototype. Finally, choose a Root directory. It is recommended to use the same value for this path as the identifier.

Main filesystem configuration

Then press save.

The queue filesystem

The queue filesystem is necessary for the split command, so that it knows which path to use. As before, go to Storages > Filesystems and click on the button to create a new filesystem.

Use the same name as for the main filesystem, but suffix it with queue. The suffix will also appear in the identifier (for example, my-filesystem-queue). Choose the same Prototype as for the main filesystem. The Root directory should point to the same directory as the main filesystem, but with the suffix /queue.

Queue filesystem configuration

Then press save.

Creating the incoming configuration

The incoming configuration of the file queue can now be created. To create this configuration, go to Connection > Incoming and press the button to create a new configuration.

Choose a name for the incoming configuration. The identifier will be generated automatically. Add a description that explains the purpose of the configuration. Then, under Subscriber, select the File Queue Subscriber. The paths can be modified, but this example keeps the prefilled values. Under Filesystem, select the main filesystem created earlier. For the parser, select the CSV parser. The delimiter, enclosure and escape characters can be changed to match the characters used in the CSV file. Finally, select the Default entity in the dropdown menu for the Entity Type field.

File Queue Subscriber configuration

Then press save.

Testing the file queue

In order to test the file queue and splitter with the previous configuration, take the following steps.

  • Go to the filesystem of your Alumio installation.
  • From the root of the Alumio installation create a file at var/local/my-filesystem/queue/test.csv
  • Open the file and add some data, e.g.:
    firstname,lastname,age,job
    john,doe,25,developer
    jane,doe,33,tester
    foo,bar,44,placeholder
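The steps above can also be scripted. The following Python sketch writes the same sample data to the path used in this example. The function name write_sample_csv and the install_root parameter are hypothetical; install_root stands for the root of the Alumio installation.

```python
import csv
from pathlib import Path

# Sample data from this guide: one header row and three entities.
SAMPLE_ROWS = [
    ["firstname", "lastname", "age", "job"],
    ["john", "doe", "25", "developer"],
    ["jane", "doe", "33", "tester"],
    ["foo", "bar", "44", "placeholder"],
]


def write_sample_csv(install_root: str) -> Path:
    """Create var/local/my-filesystem/queue/test.csv below install_root."""
    target = Path(install_root) / "var" / "local" / "my-filesystem" / "queue" / "test.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, lineterminator="\n").writerows(SAMPLE_ROWS)
    return target
```

Calling write_sample_csv(".") from the root of the installation creates the file in the queue directory.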

Then test the splitter by running the following command:

vendor/bin/magement filesystem:split my-filesystem-queue csv 20

If everything is set up correctly, the original test.csv should be replaced with 3 new files:

test_1.csv
test_2.csv
test_3.csv
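Before the subscriber is run, the split result can be checked with a small script. The Python sketch below is one way to do that, assuming the split files match the pattern test_*.csv and that each one is expected to hold the header row plus exactly one data row. The function name check_split_files is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List

HEADER = ["firstname", "lastname", "age", "job"]


def check_split_files(queue_dir: str, expected_header: List[str]) -> Dict[str, bool]:
    """Map each test_*.csv file to True if it holds the header plus one row."""
    results: Dict[str, bool] = {}
    for path in sorted(Path(queue_dir).glob("test_*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        results[path.name] = len(rows) == 2 and rows[0] == expected_header
    return results
```

For the example in this guide, check_split_files("var/local/my-filesystem/queue", HEADER) should report True for all three files when the split worked as described.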

Each file should contain the first line of the CSV file (the header) and an additional line with the data of one entity. The subscriber can now be executed by running the command:

vendor/bin/magement consume:subscriber my-file-queue-subscriber

The files will now be moved from the queue directory to the processing directory. After the file queue subscriber has interpreted the files and created tasks, the files will be moved to the finished directory on success and to the failed directory if an error occurred.
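To follow the files through these stages, it can help to list the contents of each directory. The Python sketch below assumes that the queue, processing, finished and failed directories sit directly under the root of the main filesystem (for example var/local/my-filesystem) and use exactly these names. The actual paths depend on the subscriber configuration, and the function name files_per_stage is hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Assumed directory names, one per stage of the file queue.
STAGES = ("queue", "processing", "finished", "failed")


def files_per_stage(filesystem_root: str) -> Dict[str, List[str]]:
    """Return the file names found in each stage directory."""
    root = Path(filesystem_root)
    summary: Dict[str, List[str]] = {}
    for stage in STAGES:
        stage_dir = root / stage
        if stage_dir.is_dir():
            summary[stage] = sorted(p.name for p in stage_dir.iterdir() if p.is_file())
        else:
            summary[stage] = []  # directory not created yet
    return summary
```

After a successful run of the subscriber, the three split files would be expected under finished, with the other stages empty.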