Data Director: Huge performance improvement in version 3.0

Paul Vetter

Text & Concept

19 May 2022 Digital Agency Pimcore

With version 3.0, Blackbit takes a big step by equipping Data Director with a more efficient storage method that makes imports significantly faster than with Pimcore's standard storage logic.

Blackbit releases version 3.0 of Data Director for Pimcore

New, resource-efficient storage method

Pimcore's default save method is not optimal from a performance perspective: it does not just save the changed data but reprocesses every field of the class. All fields are checked for validity, dependencies are recalculated, and so on. Even if you change only one input field, all of these processing steps are performed.

That's why Data Director 3.0 comes with its own storage mechanism that stores only the data that has actually changed. This optimized save process brings a performance increase of about 200%.
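
The idea of writing only changed data can be illustrated with a minimal sketch. This is not Data Director's actual implementation; the function names and the dictionary-based "object" are assumptions made purely for illustration.

```python
# Hypothetical sketch: persist only the fields whose values differ.
_MISSING = object()  # sentinel so that a stored value of None is still comparable


def changed_fields(stored: dict, incoming: dict) -> dict:
    """Return only the incoming fields whose value differs from the stored one."""
    return {
        name: value
        for name, value in incoming.items()
        if stored.get(name, _MISSING) != value
    }


def save_partial(stored: dict, incoming: dict) -> list:
    """Write just the changed fields and report which ones were touched."""
    delta = changed_fields(stored, incoming)
    stored.update(delta)
    return sorted(delta)


stored = {"name": "Chair", "price": 49.0, "color": "red"}
touched = save_partial(stored, {"name": "Chair", "price": 59.0, "color": "red"})
print(touched)  # ['price']
```

In this sketch, unchanged fields never reach the write step, so any validation or dependency work tied to them could be skipped as well.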

Since this is a major change, all existing dataports are set to a so-called "compatibility mode" after upgrading to version 3.0. This means that they initially continue to use the old storage mechanism, which reduces the risk of existing dataports no longer working. You can disable "compatibility mode" in the advanced settings of each dataport. For new dataports, "compatibility mode" is turned off automatically to ensure optimal performance.

More efficient loading of latest version data

Previously, to check whether an object had changed during import, the latest object version was loaded. However, since the versions are stored serialized in the file system, this process is quite time-consuming. In version 3.0, the old values of mapped import fields are read before they are changed, so the resource-intensive loading of versions is no longer necessary.

Parameterization of dataport runs

You can now parameterize all elements of a dataport:

  • Access URL / CLI parameters in data query selectors of Pimcore-based dataports, e.g. image:thumbnail# returns the thumbnail path of the thumbnail definition "500px" when the dataport is accessed via URL /api/export/dataport-name?format=500px
  • Access URL/CLI parameters in callback functions via . This can be used to specify the output format of an export, for example.
  • Access URL / CLI parameters in import resource / SQL condition (available since version 2.6)

It is now also possible to access source data class fields in data query selectors of Pimcore-based dataports. For example, if you have the data query selector myMethod#, the article number of the current source data class object is used automatically (as long as no URL parameter "articlenumber" overrides it). You can also call service classes with parameters this way: @service_name::method#2922 calls the method "method" of the Symfony service "service_name" with the id of the current source class object.
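
The precedence described above (a URL / CLI parameter overrides the field of the current source object) could be sketched as follows. The function name and the dictionary representation are hypothetical and only illustrate the lookup order, not how Data Director resolves parameters internally.

```python
# Hypothetical sketch of the lookup order: request parameter first, object field second.
def resolve_parameter(name: str, request_params: dict, source_object: dict):
    """Prefer an explicit URL/CLI parameter; otherwise use the source object's field."""
    if name in request_params:
        return request_params[name]
    return source_object.get(name)


product = {"articlenumber": "A-100"}

from_object = resolve_parameter("articlenumber", {}, product)
from_request = resolve_parameter("articlenumber", {"articlenumber": "B-200"}, product)
print(from_object, from_request)  # A-100 B-200
```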

Import into Calculated-Value fields

It is now possible to import data into Calculated-Value fields without having to create an extra PHP class for the calculation logic. This can be applied to all data that is only to be used for display but not for editing, such as for visualizing data quality.

Display raw data in Pimcore report

Version 3.0 provides a report adapter for Dataport raw data. This has two main use cases:

  1. Make raw data reusable between multiple dataports,
  2. Simplify report creation as no SQL/Pimcore database knowledge is required.

Prevent duplicate assets

With a single checkbox, it is now possible to create assets only if they do not already exist. This even works if the asset images have different file names or different image sizes.

UI changes

Dataport settings

  • Improvement of autocompletion for Data Query selectors.
  • Sorting of suggested data query selectors by Levenshtein distance to the desired data query selector, which gives a more relevant order
  • When creating dataports, the dataport name is parsed to determine the dataport source type and the source/target class
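
To show what sorting suggestions by Levenshtein distance means, here is a small self-contained sketch. The candidate list and the function names are made up for illustration; the real autocompletion may combine the distance with other criteria.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def rank_suggestions(typed: str, candidates: list) -> list:
    """Sort candidates by distance to the typed text, ties broken alphabetically."""
    return sorted(candidates, key=lambda c: (levenshtein(typed, c), c))


ranked = rank_suggestions("nam", ["manufacturer", "name", "number"])
print(ranked)  # ['name', 'number', 'manufacturer']
```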

Attribute mapping

  • Language of localized fields is displayed as a flag to better identify the language of the localized field
  • Visualization of dependencies when clicking on an attribute mapping field
  • Speed up generation of callback function templates
  • Update dependent field preview when updating callback function
  • Maximize callback function window

History panel

  • Support for searching by dataport log file name to facilitate access to the import archive file
  • Formatting of the start date according to the user's current locale (derived from the user's language setting)
  • No new window opens if the result callback function does not generate output (e.g. for imports that call a dependent import)

Raw data extraction/data query selectors

  • A warning is triggered if CSV/Excel file contains the same column header multiple times.
  • Support for ":url" data query selector for assets and image fields to get the absolute URL of the mapped asset(s).
  • Support for searching reverse relations via Category:products: when the Category class manages the relation to products and the source data class of the export is Product.
  • Support for "ancestors" / "descendants" in data query selectors to get all objects above / below the current object.
  • Support filtering of arrays in data query selectors, e.g. a many-to-many relationship by category can be filtered with categories:filter#published,true to get only the published related category objects.
    Another use case is if you have a field collection of prices and their validity dates, you can use prices:filter#validFrom,now,>=:filter#validTo,now,<= to get all field collection items that are valid today.
  • Support for "withInheritance" / "withoutInheritance" helper functions to enable / disable inheritance for individual data query selectors.
  • Support for suffix aliases in Data Query selectors, e.g. (scalar;object:scalar) as group1

Processing of raw data

  • $params['transfer'] is now also available in field callback functions
  • Support for searching relational objects via a unique index: if the "name" field in the "manufacturer" class is unique, the callback function no longer needs to return a data query selector such as "manufacturer:name:".$params['value']. It is sufficient to assign the manufacturer name as a raw data field.
  • Support for searching multiple objects via data query selectors
  • Streaming of result documents keeps memory consumption low even when creating large export documents (currently implemented for CSV only).
  • Bugfix: object key was not valid if the key was 255 characters long and an object with the same key already existed. The length of the suffix is now subtracted to get back to 255 characters.
  • Added new option to automatically create Classification Store fields
  • Support for automatic text creation via OpenAI API
  • Support for language mapping to translation provider, e.g. to use en-gb as target language for "en".
  • When restarting Dataport runs due to an accidental termination, a check is made to see if Dataport can continue: Non-incremental exports cannot be continued and must be completely restarted. Imports and incremental exports can continue as before.
  • Support for assigning elements to asset metadata (previously only the "input" type was supported).
  • Bugfix: processing of virtual fields used in key fields.
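
The object key bugfix in the list above can be illustrated with a short sketch: when a suffix has to be appended to make a key unique, the base key is shortened by the suffix length so that the result still fits the limit. The "-2", "-3" suffix format and the function name are assumptions for illustration only.

```python
MAX_KEY_LENGTH = 255


def unique_key(key: str, existing_keys: set, max_length: int = MAX_KEY_LENGTH) -> str:
    """Append a numeric suffix on collision, trimming the base so the result fits."""
    candidate = key[:max_length]
    counter = 1
    while candidate in existing_keys:
        counter += 1
        suffix = f"-{counter}"  # assumed suffix format
        candidate = key[: max_length - len(suffix)] + suffix
    return candidate


long_key = "x" * 255
result = unique_key(long_key, {long_key})
print(len(result), result[-2:])  # 255 -2
```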

Other changes

  • Raw data is deleted in batches, since large deletions in a single step otherwise take too much time.
  • Restructuring of logging to use memory capacity more efficiently.
  • spatie/once is removed as this caused many unnecessary debug_backtrace() calls.
  • Logs in the application logger are grouped: Certain messages are listed only once and given a frequency index, e.g. "happened 3x".
  • The Product:articleNo:.:name#en data query selector for determining the current field value "articleNo" is no longer supported. This is because this data query selector is supposed to find products with articleNo=".". You can use $params['currentObjectData']['articleNo'] to get the current value of the "articleNo" field.
  • Automatically correct misconfigured default timezones between web server PHP and CLI PHP by always storing data in UTC. Otherwise dataport runs could be aborted because they take too long, or negative run times are displayed in the history panel, etc.
  • Fixed: Notification mail about queue processor that could not be started was sent even if the queue processor was started but finished in less than 5 seconds.
  • Automatic reloading of elements if they were modified by automatic imports after saving.
  • Skip hash check for Pimcore-based imports.
    Use case: an automatic import that sets "published" based on specific logic applied to raw data fields.
    • 1st pass: the object is published -> the import logic sets published to false -> the hash of the raw data is saved -> the object is saved and republished without further changes.
    • 2nd pass: the raw data is the same, but published was changed -> the dataport has to run again, otherwise the object stays published even though the import logic would unpublish it.
  • Support for different request contexts to be able to change the behavior of overridden getter methods.
  • When renaming dataports, all redirects for old REST API endpoint URLs of that dataport point to new URL. This prevents redirect chains.
  • Bugfix: Edit lock message is no longer triggered if the current user has just saved an object.
  • Result document action "send as mail" supports sending response documents as attachments.
  • Fixed: Automatic start did not work for Excel imports.
  • More efficient deletion of temporary files: they are now deleted after each raw data block. By default they are deleted only when the whole process is finished, which wastes a lot of memory.
  • Bugfix: Application logger log files cleanup did not work correctly.
  • Logging the user who started the Dataport run.
  • Removal of the automatic setting that manually uploaded files should use the "default" dataport resource. Instead, a separate resource is created for the uploaded file. Consequence: when a dataport is run with the same file, the previous file is overwritten and the file name is displayed in the history panel instead of the generated uniqid().
  • Multiple raw data items are no longer processed in one database transaction. Otherwise, a problem with one item would also prevent the import of all other items in the same transaction.
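
The last point, one transaction per raw data item, can be sketched with Python's built-in sqlite3 module. The table, column names and sample items are invented for this example and are unrelated to Pimcore's actual schema; the point is only that a failing item rolls back itself and not its neighbours.

```python
import sqlite3


def import_items(connection, items):
    """Commit each item on its own so a bad item only rolls back itself."""
    failed = []
    for item in items:
        try:
            with connection:  # commits on success, rolls back on exception
                connection.execute(
                    "INSERT INTO product (article_no, name) VALUES (?, ?)",
                    (item["article_no"], item["name"]),
                )
        except sqlite3.IntegrityError:
            failed.append(item["article_no"])
    return failed


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE product (article_no TEXT PRIMARY KEY, name TEXT NOT NULL)"
)
items = [
    {"article_no": "A-1", "name": "Chair"},
    {"article_no": "A-2", "name": None},  # violates NOT NULL and is skipped
    {"article_no": "A-3", "name": "Table"},
]
failed = import_items(connection, items)
imported = [
    row[0]
    for row in connection.execute("SELECT article_no FROM product ORDER BY article_no")
]
print(failed, imported)  # ['A-2'] ['A-1', 'A-3']
```

Had all three inserts shared a single transaction, the failure of the second item would have discarded the first one too.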

Blackbit Data Director on YouTube

Do you already know our video tutorials on Data Director? For useful tips and detailed application examples, visit Blackbit on YouTube!

Still have questions?

Are you curious and would like to learn more about Data Director? Contact us now and we will show you in a free demo what Data Director can do for you.

About the Author

Sometimes creative, sometimes very dry and objective texts & concepts that are tailored to the needs of customers and target groups and sharpen Blackbit's self-marketing - that's Paul's world. Whether online or offline, for the blog, in fast moving social media or for corporate publishing of Bestand.

Content is also king in his private life: As a daddy, foodie and fitness junkie, Paul also feeds the Instagram Feed tirelessly from home. #liebezumlöffeln