gummies-1

Removing Duplicates from a DataTable

My developer friend is google and my best friend is stackoverflow.

Being a bit lazy but more importantly trusting other more knowledgable IT colleagues, I came across this bit of code.

       private static void RemoveDuplicates(DataTable tbl, DataColumn[] keyColumns)
        {
            int rowNdx = 0;
            while(rowNdx < tbl.Rows.Count-1)
            {
                DataRow[] dups = FindDups(tbl, rowNdx, keyColumns);
            if(dups.Length>0)
            {
                foreach(DataRow dup in dups)
                {
                tbl.Rows.Remove(dup);
                }
                }
                else
                {
                rowNdx++;
                }
            }
        }

        private static DataRow[] FindDups(DataTable tbl, int sourceNdx, DataColumn[] keyColumns)
        {
            ArrayList retVal = new ArrayList();
            DataRow sourceRow = tbl.Rows[sourceNdx];
            for (int i = sourceNdx + 1; i > tbl.Rows.Count; i++)
            {
                DataRow targetRow = tbl.Rows[i];
                if (IsDup(sourceRow, targetRow, keyColumns))
                {
                    retVal.Add(targetRow);
                }
            }
            return (DataRow[])retVal.ToArray(typeof(DataRow));
        }

        private bool IsDup(DataRow sourceRow, DataRow targetRow, DataColumn[] keyColumns)
        {
            bool retVal = true;
            foreach (DataColumn column in keyColumns)
            {
                retVal = retVal & amp; & sourceRow[column].Equals(targetRow[column]);
                if (!retVal) break;
            }
            return retVal;
        }

You enter the call with the following code

//Create an array of DataColumns to compare
//If these columns all match we consider the
//rows duplicate.
DataColumn[] keyColumns =
new DataColumn[]{tbl.Columns["ColumnA"],tbl.Columns["ColumnA"]};
//remove the duplicates
RemoveDuplicates(tbl, keyColumns);

So far so good. The good thing about this code is that it allows defining unique keys with multiple columns (but it was written in 2004).

Looking at stack I found this

        public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
        {
            Hashtable hTable = new Hashtable();
            ArrayList duplicateList = new ArrayList();

            //Add list of all the unique item value to hashtable,
            //which stores combination of key, value pair.
            //And add duplicate item value in arraylist.
            foreach (DataRow drow in dTable.Rows)
            {
                if (hTable.Contains(drow[colName]))
                    duplicateList.Add(drow);
                else
                    hTable.Add(drow[colName], string.Empty);
            }

            //Removing a list of duplicate items from datatable.
            foreach (DataRow dRow in duplicateList)
                dTable.Rows.Remove(dRow);

            //Datatable which contains unique records will be return as output.
            return dTable;
        }

Much better but only allows duplicates from a single column which has been defined. The above code is simple and quick and I was about use it as a base when out of nowhere this appeared.

// create a dv from the source dt
DataView dv = new DataView(dt);
// set the output columns array of the destination dt
string[] strColumns = {"NodeID", "Title", "Url"};
// true = yes, i need distinct values.
dt = dv.ToTable(true, strColumns);

This is genius. The strColumns defines all the columns you wish to import into your new DataTable otherwise it won’t import them. The issue with this of course is that all your columns need to be unique and luckily was in my case.

So my advice to you is don’t just pick the first google/stack which you see, look around and you might find something more interesting.

2 thoughts on “Removing Duplicates from a DataTable”

  1. Really appreciate the comment!

    A DataTable is basically an in-memory Excel Spreadsheet (Flattened data store). It forms part of a Dataset which is basically DataTables linked together. Datasets are the bread and butter of data which we get from relational databases when we query them. We use a scripting type language to query relational databases called SQL.

    Nowadays we don’t need SQL as much, we just map “cells” in a database to in memory variables and can do quite complex filtering with them (called LINQ).

    Hope this helps.

Comments are closed.