How to import large CSV files in Power Platform (or Dynamics 365)

There are several ways to import data into Power Platform (or Dynamics 365). When it comes to CSV files, you have the following out-of-the-box options, each with its own pros and cons.

  • Import feature in Power Apps – Allows you to import Excel (.xlsx), comma-separated values (.csv), and XML Spreadsheet 2003 (.xml) formats. The data can be compressed, but the maximum size is only 32 MB for .zip files.
    • PRO: Easy-to-use UI that guides you through the process.
    • CON: Limited file size.
    • CON: No transformation / advanced mapping logic allowed.
    • CON: Very unforgiving about the format of the data.
    • CON: Not very fast.
    • CON: Cannot be automated.
  • Power Platform dataflows – Can be used for both data migration and continuous data integration. Dataflows are a new and exciting way of importing data into Power Platform, based on Power Query. They support tons of different data sources, including external Web APIs, Parquet files, and even another Dataverse!
    • PRO: Based on Power Query – I consider this a pro, because you can learn Power Query in minutes and its UI is unparalleled by any tool I have seen in my career.
    • PRO: Can handle large files (although I haven’t tested at the gigabyte level yet).
    • PRO: Advanced data transformations can be applied easily on the fly.
    • PRO: Supports a wide range of data sources. We are concerned with CSV here, but you never know; maybe that CSV comes from another source that is supported 😉.
    • CON: Absolutely no way to manage failures in the dataflow. You will have access to a report at the end that helps you track down which records failed, but if you have several failures it will not be practical to read the report and filter those manually!
    • CON: Performance is usually better than the import feature, but very unreliable. I used it in a project where the scope of tables was limited, and every time I tested I got totally different performance.
    • CON/PRO: Can be automated, but only on a schedule.
  • Custom cloud flow (or Power App) – The sky is the limit when it comes to app development, right? But what about the effort? What if you need to do it in different environments? Are you also going to implement CI/CD? How reusable will your flow / app be?
  • Azure Synapse Analytics – is, as the name suggests, an analytics service on Azure. It is not a free service like the previous options, but provisioning it for a full data migration project, when the number of tables is more than a handful, can be extremely beneficial. The fact that you can just focus on the transformation logic while Synapse takes care of managing the data storage will save you a lot of time. It supports many data sources and targets out of the box, and if you already have some background in a database technology (e.g. SQL Server), the learning curve won’t be steep.
  • SQL Server’s SSIS – is a well-known ETL tool that has been used for data integration and data migration for ages.
    • PRO: Familiar tool (if you know your way around SQL Server family of products) and an easy-to-use UI.
    • PRO: Part of SQL Server
    • PRO: Several out-of-the-box and 3rd-party components. For Power Platform I recommend KingswaySoft’s components, which I have used successfully several times in many projects. They also have a great support team for those edge cases 😉.
    • PRO: Reliable tool with good performance
    • CON: Implementing complex logic and debugging can be difficult.
    • CON: In many cases you will have to rely on the Script Component to run pieces of C# code that end up hidden in multiple components and are not reusable.
    • CON: Although you produce code behind the scenes, the storage format makes it practically impossible to trace or revert code changes.
    • CON: Since the introduction of Synapse, I find it really hard to recommend SSIS for any but a very narrow range of use cases (e.g. you already have it and use it for another similar ETL process).

As you can see, there are several options, but apart from Azure Synapse Analytics, all of them are limited in size or performance, or are fragile when it comes to handling faults. Synapse, in turn, might not be justifiable in terms of cost or learning curve for the scope of your project. Today I’m going to show you another, very easy way to load large amounts of data into Power Platform, one that aims to address the shortcomings of the options above and can be another tool in your toolbox.

Using PSDataverse to load large CSV files

Chances are you have landed here because you have a large CSV file, and you might be asking how large we are talking. The answer is that I have not hit any limit yet. When done right, you should not face any practical limit. I have personally tested with large CSV files of several gigabytes without any issues.

I have posted about PSDataverse before on this blog, and in fact I previously showed an example where I load a CSV file into the Account table. Let’s start from there. The only prerequisite is to install PSDataverse, a PowerShell module with no other dependencies. You can either install it from the official PowerShell Gallery using the following command:

Install-Module -Name PSDataverse

Or you can get it from its official GitHub repository and install it manually.

Connect-Dataverse "authority=https://login.microsoftonline.com/<tenant-id>/oauth2/authorize;clientid=<client-id>;clientsecret=<client-secret>;resource=https://<environment>.crm4.dynamics.com/"
Import-Csv "Account.csv" | ForEach-Object {
    @{ Uri = "account"; Method = "PATCH"; Value = @{
        name                = $_.AccountName
        overriddencreatedon = $_.CreationDate
        ...}}} | ConvertTo-Json | Send-DataverseOperation

The first line uses a client ID and a client secret to authenticate to Power Platform. There are different ways you can connect to Power Platform: using a client ID and a client secret, using a client ID and a client certificate, or simply using a client ID and an automatically generated device code. In all cases you will need to create an application registration and then create an application user from it. This can be done in literally a minute through the UI. Read more here if you haven’t done it before.

The second line does all the magic! It imports a CSV file, applies a simple transformation (a basic mapping) to each record, converts the result to JSON, and sends it to Power Platform using the Send-DataverseOperation command.
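To make the mapping concrete, here is the same per-record transformation applied to one hypothetical record (the column names AccountName and CreationDate are assumptions based on the example above), stopping before the send so you can inspect the JSON that would go over the wire:

```powershell
# One hypothetical row, shaped like a record produced by Import-Csv "Account.csv"
$record = [pscustomobject]@{ AccountName = "Contoso"; CreationDate = "2021-05-01" }

# The same mapping the pipeline applies to every record
$operation = @{
    Uri    = "account"
    Method = "PATCH"
    Value  = @{
        name                = $record.AccountName
        overriddencreatedon = $record.CreationDate
    }
}

# Inspect the payload that would be piped to Send-DataverseOperation
$json = $operation | ConvertTo-Json -Depth 3
$json
```

Running the mapping on one record like this is a quick way to sanity-check column names before pushing a whole file.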

Tuning the performance

At this point, if you run the script, it will start pushing data to Power Platform one record at a time. It waits for the response and returns it to you in a simple PowerShell object. The problem with this approach is that while it is reliable and you will get back the response, the performance won’t be great. If you are a veteran Dynamics developer, you might already know two important features: parallelism and batching.

By parallelism, I mean that Dataverse allows you to push data concurrently, using up to 20 streams in parallel, which can increase throughput by almost 20x. This limit can be raised by Microsoft during your data migration by opening a support request.

Batching allows you to send multiple operations in one batch, which saves on the amount of data transferred over the wire and can increase performance. The hard limit is 1,000 operations, but you should try different numbers: depending on the table, its relations, and the logic around it, the number of operations you put in a batch can also worsen performance. My suggestion is to start with 500 and fine-tune per table to get the best performance.

Let’s rewrite our script to benefit from parallelism and batching.

Connect-Dataverse "authority=https://login.microsoftonline.com/<tenant-id>/oauth2/authorize;clientid=<client-id>;clientsecret=<client-secret>;resource=https://<environment>.crm4.dynamics.com/"
Import-Csv "Account.csv" | ForEach-Object {
    @{ Uri = "account"; Method = "PATCH"; Value = @{
        name                = $_.AccountName
        overriddencreatedon = $_.CreationDate
        ...}}} | ConvertTo-Json | Send-DataverseOperation -MaxDop 20 -BatchCapacity 500

As you can see, just by adding two parameters the performance increases by an order of magnitude. I will not give you numbers to get your hopes up (or down), because every environment is different, but I can guarantee that you will be impressed.

Advanced scenarios

While the example above is very basic, you can extend it to add complex data transformations, or even pull data from other sources, or from several sources at once. Send-DataverseOperation will try its best to succeed: in case of transient errors (e.g. a brief network outage) it will wait and retry multiple times, waiting longer each time. There is a lot more to it that you can learn from the GitHub repository.
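As a sketch of such an on-the-fly transformation (the cleanup rules and column names here are illustrative assumptions, not part of PSDataverse), you can filter and normalize records before they become operations. The sample writes a small CSV first so the pipeline can be tried without a real file:

```powershell
# Create a small sample CSV so the transformation can be tried standalone
@"
AccountName,CreationDate
 Contoso ,2021-05-01
,2021-06-01
"@ | Set-Content "sample.csv"

# Skip rows without a name, trim whitespace, and normalize the date,
# then shape each row into a Dataverse operation as in the basic example
$operations = Import-Csv "sample.csv" |
    Where-Object { $_.AccountName -and $_.AccountName.Trim() } |
    ForEach-Object {
        @{
            Uri    = "account"
            Method = "PATCH"
            Value  = @{
                name                = $_.AccountName.Trim()
                overriddencreatedon = ([datetime]$_.CreationDate).ToString("yyyy-MM-dd")
            }
        }
    }

# When connected, pipe the result on: $operations | ConvertTo-Json | Send-DataverseOperation
$operations | ConvertTo-Json
```

The second row is dropped by Where-Object because it has no name, and the first is trimmed and reformatted before being shaped into an operation.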

I also suggest that you add proper fault handling to isolate and categorize failed records, so you can analyze how many records failed and why (e.g. a name is one character too long 😄). That way you can mitigate the problems and retry only the records that failed.
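As a rough sketch of that idea, the failing records can be collected together with the reason they failed. The sender below is simulated so the pattern runs standalone; with a real connection you would call Send-DataverseOperation -ErrorAction Stop inside the try block instead:

```powershell
# Simulated sender: throws the way a failed Dataverse operation would (assumption for the demo)
function Invoke-DemoOperation($op) {
    if ($op.Value.name.Length -gt 10) { throw "name is too long" }
}

$records = @(
    [pscustomobject]@{ AccountName = "Contoso" }
    [pscustomobject]@{ AccountName = "An unreasonably long account name" }
)

# Collect each failed record together with the reason, so failures can be analyzed later
$failed = foreach ($record in $records) {
    try {
        Invoke-DemoOperation @{ Uri = "account"; Method = "PATCH"; Value = @{ name = $record.AccountName } }
    }
    catch {
        [pscustomobject]@{ AccountName = $record.AccountName; Reason = $_.Exception.Message }
    }
}

# Persist the failures for analysis; fix the data and retry only these records
$failed | Export-Csv "failed-accounts.csv" -NoTypeInformation
```

The point is that the happy path keeps flowing while every failure lands in one place, ready to be categorized and replayed.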

For scenarios where you have multiple tables with different transformations, I recommend a staging database. That way you can run a multi-stage migration: first you import data into the staging DB and apply some cleanup, then you transform and load the data into Power Platform. It might sound like a lot of work, but the time you will save is not even comparable to the effort. Perhaps in another blog post I will show how you can do it in a matter of minutes.

Introducing PSDataverse, PowerShell’s new super power

PowerShell is the solution for task automation. It has everything you need to iterate fast: a command-line shell with auto-completion and an object-oriented scripting language, and it runs on Windows, Linux, and macOS. PowerShell’s command-line shell is best in class, with tab completion, command prediction, aliases, in-console help (Unix man pages, anyone?) and my favorite feature, pipelines, which let you chain commands together. You can even debug directly in the shell! If you are new to PowerShell, start here.
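As a tiny, throwaway illustration of pipelines (nothing Dataverse-specific here), this chains three commands to list the three largest files in the current directory:

```powershell
# Chain commands with | : list files, sort by size, keep the top three
$largest = Get-ChildItem -File |
    Sort-Object Length -Descending |
    Select-Object -First 3 Name, Length
$largest
```

Each command streams its objects into the next one, which is exactly the shape the CSV-loading one-liners in this post rely on.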

Dataverse, on the other hand, is a managed relational database and a lot more. It hosts Dynamics 365’s data, provides rich metadata, and can even host logic and validation, with a range of its own productivity tools. PowerShell can be handy for automating tasks and orchestrating one-off solutions in Power Platform. From another perspective, Dataverse can be useful as a managed database. You don’t need a license if the user is not a human, by the way.

[Image: a PowerShell hero, plus the Microsoft Dataverse logo, equals a PowerShell hero with the Dataverse logo hovering over her palm.]
PowerShell + Dataverse = PSDataverse

While working with Power Platform, you might often need to quickly try something to see if it works, or to just check some data real quick. Other times you might need to automate some tasks or integrate processes that are not exposed through the UI. Take data migration as an example, or simply loading a couple of huge CSV files with some data transformation. Are you going to build a fully custom .NET solution just for that? Are you going to use a full-fledged SSIS solution, with its notoriously troublesome Script Components and non-Git-friendly projects? Your other option is learning a 3rd-party, one-size-fits-all, cloud-based service with its edge cases. Well, now you have another option: PSDataverse, a PowerShell module that brings Dataverse to PowerShell.

Show me an example

Let’s check how easy it is to load a CSV file called Account.csv into Microsoft Dynamics 365.

Connect-Dataverse "authority=https://login.microsoftonline.com/<tenant-id>/oauth2/authorize;clientid=<client-id>;thumbprint=<client-cert-thumbprint>;resource=https://<environment>.crm4.dynamics.com/"
Import-Csv "Account.csv" | ForEach-Object {
    @{ Uri = "account"; Method = "PATCH"; Value = @{
        name                = $_.AccountName
        overriddencreatedon = $_.CreationDate
        ...}}} | ConvertTo-Json | Send-DataverseOperation

You know, that is technically two lines of code in PowerShell. Sure, I’m cheating here: I have piped four commands together, but that makes it even easier to read. Let’s go over what I’m doing here.

  • The first line obviously uses the Connect-Dataverse command to connect to Dataverse using a client certificate.
  • The second line (which I have broken into multiple lines for clarity):
    • Imports a CSV file called Account.csv.
    • Transforms each line into a Dataverse operation using ForEach-Object.
    • Converts the operations to JSON using the ConvertTo-Json command.
    • Sends the operations to Dataverse using Send-DataverseOperation.

A Power Platform professional might say that sending several operations one by one will not be ideal in terms of performance. I agree. Send-DataverseOperation has two parameters for just that.

Improving performance

Dataverse provides two features to improve performance over its API when needed. The first one is parallelism: each user can send data in parallel over up to a configurable (by Microsoft) number of lanes. By default the number is 20, but you can ask Microsoft to increase it when needed (e.g. for a data migration). The other one is called batching. To use it, you put several operations into a single HTTP request with a special syntax and structure and send it to the API, where it is taken apart and the operations are executed one by one. All the operations in a batch run in a single transaction, however; in other words, they succeed or fail (and are rolled back) together. To use these two features, Send-DataverseOperation provides two parameters.

  • -BatchCapacity: a number between 0 (default) and 1000 that defines the maximum number of operations that can be put in a single batch.
  • -MaxDop: a number between 1 (default) and 1000 that indicates how many data lanes will be used to send data in parallel.

As you can see, the default values mean that batching is off and only one lane is used, which makes experimentation easier and safer by default. For example, running the following command:

@{Uri="WhoAmI"} | Send-DataverseOperation

will give you the BusinessUnitId, UserId, and OrganizationId of the current user:

@odata.context : https://helloworld.crm4.dynamics.com/api/data/v9.2/$metadata#Microsoft.Dynamics.CRM.WhoAmIResponse
BusinessUnitId : 6f202e6c-e471-ec11-8941-000d3adf0002
UserId         : 88057198-a9b1-ec11-9840-00567ab5c181
OrganizationId : e34c95a5-f34c-430c-a05e-a23437e5b9fa
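For the curious, when batching is on, the operations are wrapped in an OData $batch request. Illustratively, such a request body has roughly the following shape per the OData batch convention (the boundary markers, environment URL, and record key are placeholders, and the exact payload PSDataverse produces may differ):

```
POST https://<environment>.crm4.dynamics.com/api/data/v9.2/$batch HTTP/1.1
Content-Type: multipart/mixed; boundary=batch_AAA

--batch_AAA
Content-Type: multipart/mixed; boundary=changeset_BBB

--changeset_BBB
Content-Type: application/http
Content-Transfer-Encoding: binary
Content-ID: 1

PATCH https://<environment>.crm4.dynamics.com/api/data/v9.2/accounts(<id>) HTTP/1.1
Content-Type: application/json

{ "name": "Contoso" }
--changeset_BBB--
--batch_AAA--
```

One round trip carries many operations, which is where the bandwidth saving described above comes from.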

What’s more

There is a lot more to PSDataverse than what has been covered above. The best way to get to know it is to download PSDataverse from its GitHub repository and try it. There are several features in its core, like automatic transient-fault detection and retry, automatic pacing based on hints provided by Dataverse, and more.

Keep in mind that PSDataverse is a free, open-source PowerShell module, and it can benefit from your suggestions, bug reports, and contributions. The module is currently in alpha, but I have already used it in a data migration project for a major bank in Europe, and I try to use it for different tasks as frequently as I can to make it fit more and more use cases. I put developer productivity above all.