Select Page

photo by Bluesnap

With the rise of Digital era more and more organisations want to exchange the data. I’m not talking about Facebook, but mostly about the government organisations I usually work with. One could think that it would be implemented as some sort of API allowing machines to talk to each other and exchange portions of data. However, most of these solutions are implemented with the file exchange via good old secure FTP. In my job I have to deal with file import and export all the time.

If we talk about regular file exchange process, an organisation probably develops some sort of ETL solution. Usually such ETL development starts with just a handful of files and very uninformative documentation. The first thing I do in this case, upload the file to my working tool, MS SQL Server, to be able to profile the data and prepare the solution.

OPENROWSET

The first option allows initial data profiling without actually loading anything on server. OPENROWSET method has been designed to query any data via ODBC provider. The only complication with this method is a need for a format file for SQL Server to be able to format the data it reads into the columns. It shouldn’t be a problem if this is not some random file you want to read, but an example of the actual file you are planning to upload into your database. Most likely, you have file specification. You can compose format file relying on that, it is not hard at all. Or if you have a table ready, you can generate this file from its metadata. Execute the following query to generate a format file (first check if TCP/IP is enabled on your server):

DECLARE @SQLCommand VARCHAR(1000)
SET @SQLCommand = 
'bcp ODS.files.Offers ' + --Database.Schema.Table_name
'format nul ' + --Type of generated file - format file
'-c ' + --Show everything as character string. I found it the best option for profiling
'-S SERVER01 -T ' + --Server name and type of connection to server. -T stands for "Trusted", i.e. for Windows authentication. Alternatively use -U and -P for username and password connection
'-t , -r \n ' + --Column and row delimeters. Should match the ones in your file
'-f D:\FormatFiles\OffersFormat.fmt' --File location on disk
EXEC master..xp_cmdshell @SQLCommand
GO

Now you can use the file data in your queries, including the INSERT statements:

SELECT * 
FROM OPENROWSET ( BULK 'D:\Files\TestFile.csv',
 FORMATFILE = 'D:\FormatFiles\OffersFormat.fmt' ) AS Table1
WHERE Offer_Type_Code = 'HT01'
GO

BULK INSERT

This method of insert require table to exist before insert happens.

BULK INSERT ODS.files.Offers
FROM 'D:\Files\TestFile.csv'
--You can specify column and row terminators
WITH (FIELDTERMINATOR =','
, ROWTERMINATOR = '0x0A') --0A is a hex code for a line feed, 0D is a carriage return; you don't need this at all if your row terminator is a combination of both
--Alternatively you can use the same format file as above
--WITH (FORMATFILE = 'D:\FormatFiles\OffersFormat.fmt')
GO

BCP Utility

Bulk Copy Program Utility is a handy tool for copying data both ways from and to SQL Server. It could be executed from command line or Power Shell, but personally I find it very handy to use it in my TSQL code. However, as I have mentioned above, the TCP/IP port should be opened and server configured to run an external utility (xp_cmdshell option enabled).

Import data into table script:

DECLARE @SQLCommand varchar(1000)
SET @SQLCommand = 'bcp ODS.files.Offers in ' + --Destination table and data flow direction
'D:\Files\TestFile.csv ' + --Source file full path
'-S SERVER01 -T ' + --Server and connection. Use -U and -P for SQL credentials
'-c  -t , -r 0x0A ' --Parameter for non-Unicode characters, column terminator and row terminator in hex code. No need in row terminator if it's a combination of a carriage return and line feed
--The alternative for the last string is using the same format file as before
--'-f D:\FormatFiles\OffersFormat.fmt'
EXEC   master..xp_cmdshell @SQLCommand
GO

I have already used BCP utility above to make a format file based on the table metadata. It could also help with the data export. A piece of code below would write first 100 records of each table from the database into individual files.

DECLARE @TableName varchar(128)
DECLARE @SchemaName varchar(128)
DECLARE db_cursor CURSOR FOR 

SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES

OPEN db_cursor  
FETCH NEXT FROM db_cursor INTO @SchemaName, @TableName

WHILE @@FETCH_STATUS = 0  
BEGIN  
	DECLARE @SQLCommand varchar(1000)
	SET @SQLCommand = 'bcp "SELECT TOP 100 * FROM ODS.' + @SchemaName + '.' + @TableName + 
        '" queryout D:\Files\' + @SchemaName + '_' + @TableName + '.csv -S SERVER01 -T -c  -t , ' 
	EXEC master..xp_cmdshell @sqlcommand

    FETCH NEXT FROM db_cursor INTO @SchemaName, @TableName
END 

CLOSE db_cursor  
DEALLOCATE db_cursor 
GO

Import and Export Wizard

Enough of code! Microsoft has spoiled us with Wizards, let’s use one. Right click on the database in SQL Server Management Studio Object Explorer, under “Tasks” you have two options, “Import Flat File” and “Import Data”. The first option works faster, but it only allows data upload into new table. Import Flat File Wizard is smart enough to define the best field types for file data to be loaded into. The resulting process could not be reused or scripted, it only loads data once.

Import Data Wizard is more generic, so it require more clicks. Choose a Flat file source. Traverse to the file on disk. Don’t forget to open “Advanced” tab and click “Suggest Types…” button, otherwise it will treat every field as a 50 characters string. You can choose an existing table or create new table as a target for data load. Wizard can save generated SSIS package to SQL Server or on disk. Either way, you can reuse it later: execute from SQL Agent job or add this package to your SSIS project. Maybe it will save you a couple of clicks when developing an ETL solution for the regular file upload.

Export wizard is not different from the Import wizard, you just need to swap the source and the target.

Side note on files insert

As I work in Data warehousing, this is a rule anyway, but I thought I should mention it. All the methods of insert described above are using very primitive insert in batches, called Bulk Insert. It means that the speed is taking greater priority than quality. It is not a great idea to load the data from files straight into the target table. The data from files needs to be cleansed, checked for consistency and duplicates first. Only after that data is ready to be loaded into the final destination table.

Kate
Data masseuse

 

Kate blogs about the details that make the Data Warehouses work.
Read more from Kate.

%d bloggers like this: