Using CSV files in Baby X

The Baby X resource compiler comes with a powerful CSV parser and tools for integrating CSV data into Baby X programs. You can also use the Baby X resource compiler to create data which can be included in other programs.

What is CSV

CSV stand for "comma-separated values" files. At its heart it is the simplest of formats. A CSV file is simply a write out of data, with one record per line, and the fileds separated by commas.

Name, Salary, Payroll ID
Tom, 32000.00, 123
Dick, 24000.00, 124
Harry, 50000.00, 125

That's it. The simplicity and the human-readbility of this format has made it a favourite for storing high value data

CSV files represent what computer scientists call a "dataframe". A dataframe is simply an array of records, with all the rows in the same format and the columns representing fields. So in this case we have three records (for the three employees),and three fields, name, salary, and payroll id. Dataframes are naturally represented in C programs as arrays of structs.

The simplicity of CSV is a bit of an illusion. Real datasets often have missing data. And string data can get quite long and contain embedded quotation marks or newlines. There are also undesirable variations on the basic format, such as the use of tabs insead of commas as the separator, or the use of commas instead of decimal points. There is also the problem that some CSVs have a header line and some do not.

The CSV file parser

The CSV file parser which comes with the Baby X resource compiler is powerful and easy to use. It will load most CSV files.


#define CSV_NULL 0       /* null data */
#define CSV_REAL 1       /* floating-point data */
#define CSV_STRING 2     /* string data */
#define CSV_BOOL 3       /* boolean data */

typedef union
{
  double x;        /* real value   */
  char *str;       /* string value */
} CSV_VALUE;

typedef struct
{
  char **names;     /* column names (can be NULL) */
  int *types;       /* type of column CSV_REAL, CSV_STRING etc */
  int width;        /* number of columns */
  int height;       /* number of row (excl header) */
  CSV_VALUE *data;  /* data in row-major format */
} CSV;

CSV *loadcsv(const char *fname);
void killcsv(CSV *csv);
void csv_getsize(CSV *csv, int *width, int *height);
int csv_hasdata(CSV *csv, int col, int row);
double csv_get(CSV *csv, int col, int row);
const char *csv_getstr(CSV *csv, int col, int row);
int csv_hasheader(CSV *csv);
const char *csv_column(CSV *csv, int col, int *type);

The structures are supposed to be opaque but exposed to calling code to make troubleshooting easier. However the CVS_xxxx type identifiers are needed for querying column types.

Sometimes when using CSV files you know what data you are expecting, for example for a payroll program,and sometimes you do not, for example when implementing a statistical package. After the CSV file is loaded, you can quickly query the width (the number of columns, or fields), and the column names and types, to ensure that you have data in the format you were expecting. Or you can use that information to buld dynamic structures at run time, which is more difficult.

Example programs


   int main(int argc, char **argv)
   {
      CSV *csv;
      int width, height;
      int i, ii;
      int type;
      const char *fieldname;
      double total = 0.0;
      int N = 0;

      csv = loadcsv(argv[1]);
      if (!csv)
      {
         printf("Can't load CSV file %s\n", argv[1]);
         exit(EXIT_FAILURE);
      }
      if (!csv_hasheader(csv))
      {
         printf("CSV file must have a header\n");
         exit(EXIT_FAILURE);
      }
      csv_getsize(csv, &width, &height);

      for (i = 0; i < width; i++)
      {
         fieldname = csv_column(csv, i, &type);
         if (strcmp(fieldname, argv[2]))
         {
             if (type != CSV_REAL)
             {
                printf("Can only take mean of numerical data\n);
                exit(EXIT_FAILURE);
             }
             for (ii =0; ii < height; ii++)
             {
                if (csv_hasdata(csv, i, ii))
                {
                   total += csv_get(csv, i, ii);
                   N++;
                }
             }
             printf("The mean of %s is %f\n", argv[2], total / N);
             break;
         }
      }
      if (i == width)
      {
          printf("Can't find field %s\n", argv[2]);
          exit(EXIT_FAILURE);
      }

      killcsv(csv);
      return 0;
   }

Here's a program to take a CSV file and a field name, and calculate the mean of that field.