Commit 4a2f9f1

sep="" is now a "whitespace" separator for readCSV()
1 parent 08fd0b7 commit 4a2f9f1

5 files changed

Lines changed: 45 additions & 15 deletions

EidosScribe/EidosHelpFunctions.rtf

Lines changed: 6 additions & 4 deletions
@@ -4689,7 +4689,11 @@ The separator between values is supplied by
 \f1\fs18 sep
 \f3\fs20 ; it is a comma by default, but a tab can be used instead by supplying tab (
 \f1\fs18 "\\t"
-\f3\fs20 in Eidos), or another character may also be used.\
+\f3\fs20 in Eidos), or another character may also be used. If
+\f1\fs18 sep
+\f3\fs20 is the empty string
+\f1\fs18 ""
+\f3\fs20 , the separator between values is \'93whitespace\'94, meaning one or more spaces or tabs. When the separator is whitespace, whitespace at the beginning or the end of a line will be ignored.\
 Similarly, the character used to quote string values is a double quote (
 \f1\fs18 '"'
 \f3\fs20 in Eidos), by default, but another character may be supplied in
@@ -4910,7 +4914,6 @@ See
 \f3\fs20 will be returned; if not,
 \f1\fs18 F
 \f3\fs20 will be returned (but at present, an error will result instead).\cf0 \
-\pard\pardeftab543\li547\ri720\sb60\sa60\partightenfactor0
 \cf2 If
 \f1\fs18 compress
 \f3\fs20 is
@@ -5752,9 +5755,8 @@ Named
 \f1\fs18 c()
 \f3\fs20 function (including the possibility of type promotion).\
 Since this function can be hard to understand at first, here is an example:\
-\pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
 
-\f1\fs18 \cf2 sapply(1:10, "if (applyValue % 2) applyValue ^ 2; else NULL;");\
+\f1\fs18 sapply(1:10, "if (applyValue % 2) applyValue ^ 2; else NULL;");\
 \pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
 
 \f3\fs20 \cf2 \kerning1\expnd0\expndtw0 This produces the output

QtSLiM/help/EidosHelpFunctions.html

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
 <meta http-equiv="Content-Style-Type" content="text/css">
 <title></title>
 <meta name="Generator" content="Cocoa HTML Writer">
-<meta name="CocoaVersion" content="1894.6">
+<meta name="CocoaVersion" content="1894.7">
 <style type="text/css">
 p.p1 {margin: 18.0px 0.0px 3.0px 0.0px; font: 11.0px Optima}
 p.p2 {margin: 9.0px 0.0px 3.0px 36.0px; text-indent: -22.3px; font: 9.0px Menlo}
@@ -384,7 +384,7 @@
<p class="p5"><b>Reads data from a CSV or other delimited file</b> specified by <span class="s2">filePath</span> and returns a <span class="s2">DataFrame</span> object containing the data in a tabular form.<span class="Apple-converted-space">  </span>CSV (comma-separated value) files use a somewhat standard file format in which a table of data is provided, with values within a row separated by commas, while rows in the table are separated by newlines.<span class="Apple-converted-space">  </span>Software from R to Excel (and Eidos; see the <span class="s2">serialize()</span> method of <span class="s2">Dictionary</span>) can export data in CSV format.<span class="Apple-converted-space">  </span>This function can actually also read files that use a delimiter other than commas; TSV (tab-separated value) files are a popular alternative.<span class="Apple-converted-space">  </span>Since there is substantial variation in the exact file format for CSV files, this documentation will try to specify the precise format expected by this function.<span class="Apple-converted-space">  </span>Note that CSV files represent values differently that Eidos usually does, and some of the format options allowed by <span class="s2">readCSV()</span>, such as decimal commas, are not otherwise available in Eidos.</p>
<p class="p5">If <span class="s2">colNames</span> is <span class="s2">T</span> (the default), the first row of data is taken to be a header, containing the string names of the columns in the data table; those names will be used by the resulting <span class="s2">DataFrame</span>.<span class="Apple-converted-space">  </span>If <span class="s2">colNames</span> is <span class="s2">F</span>, a header row is not expected and column names are auto-generated as <span class="s2">X1</span>, <span class="s2">X2</span>, etc.<span class="Apple-converted-space">  </span>If <span class="s2">colNames</span> is a <span class="s2">string</span> vector, a header row is not expected and <span class="s2">colNames</span> will be used as the column names; if additional columns exist beyond the length of <span class="s2">colNames</span> their names will be auto-generated.<span class="Apple-converted-space">  </span>Duplicate column names will generate a warning and be made unique.</p>
<p class="p5">If <span class="s2">colTypes</span> is <span class="s2">NULL</span> (the default), the value type for each column will be guessed from the values it contains, as described below.<span class="Apple-converted-space">  </span>If <span class="s2">colTypes</span> is a singleton <span class="s2">string</span>, it should contain single-letter codes indicating the desired type for each column, from left to right.<span class="Apple-converted-space">  </span>The letters <span class="s2">lifs</span> have the same meaning as in Eidos signatures (<span class="s2">logical</span>, <span class="s2">integer</span>, <span class="s2">float</span>, and <span class="s2">string</span>); in addition, <span class="s2">?</span> may be used to indicate that the type for that column should be guessed as by default, and <span class="s2">_</span> or <span class="s2">-</span> may be used to indicate that that column should be skipped – omitted from the returned <span class="s2">DataFrame</span>.<span class="Apple-converted-space">  </span>Other characters in <span class="s2">colTypes</span> will result in an error.<span class="Apple-converted-space">  </span>If additional columns exist beyond the end of the <span class="s2">colTypes</span> string their types will be guessed as by default.</p>
-<p class="p5">The separator between values is supplied by <span class="s2">sep</span>; it is a comma by default, but a tab can be used instead by supplying tab (<span class="s2">"\t"</span> in Eidos), or another character may also be used.</p>
+<p class="p5">The separator between values is supplied by <span class="s2">sep</span>; it is a comma by default, but a tab can be used instead by supplying tab (<span class="s2">"\t"</span> in Eidos), or another character may also be used.<span class="Apple-converted-space">  </span>If <span class="s2">sep</span> is the empty string <span class="s2">""</span>, the separator between values is “whitespace”, meaning one or more spaces or tabs.<span class="Apple-converted-space">  </span>When the separator is whitespace, whitespace at the beginning or the end of a line will be ignored.</p>
<p class="p5">Similarly, the character used to quote string values is a double quote (<span class="s2">'"'</span> in Eidos), by default, but another character may be supplied in <span class="s2">quote</span>.<span class="Apple-converted-space">  </span>When the string delimiter is encountered, <i>all</i> following characters are considered to be part of the string until another string delimiter is encountered, terminating the string; this includes spaces, comment characters, newlines, and everything else.<span class="Apple-converted-space">  </span>Within a string value, the string delimiter itself is used twice in a row to indicate that the delimiter itself is present within the string; for example, if the string value (shown without the usual surrounding quotes to try to avoid confusion) is <span class="s2">she said "hello"</span>, and the string delimiter is the double quote as it is by default, then in the CSV file the value would be given as <span class="s2">"she said ""hello"""</span>.<span class="Apple-converted-space">  </span>The usual Eidos style of escaping characters using a backslash is <i>not</i> part of the CSV standard followed here.<span class="Apple-converted-space">  </span>(When a string value is provided <i>without</i> using the string delimiter, all following characters are considered part of the string except a newline, the value separator <span class="s2">sep</span>, the quote separator <span class="s2">quote</span>, and the comment separator <span class="s2">comment</span>; if none of those characters are present in the string value, the quote delimiter may be omitted.)</p>
<p class="p5">The character used to indicate a decimal delimiter in numbers may be supplied with <span class="s2">dec</span>; by default this is <span class="s2">"."</span> (and so <span class="s2">10.0</span> would be ten, written with a decimal point), but <span class="s2">","</span> is common in European data files (and so <span class="s2">10,0</span> would be ten, written with a decimal comma).<span class="Apple-converted-space">  </span>Note that <span class="s2">dec</span> and <span class="s2">sep</span> may not be the same, so that it is unambiguous whether <span class="s2">10,0</span> is two numbers (<span class="s2">10</span> and <span class="s2">0</span>) or one number (<span class="s2">10.0</span>).<span class="Apple-converted-space">  </span>For this reason, European CSV files that use a decimal comma typically use a semicolon as the value separator, which may be supplied with <span class="s2">sep=";"</span> to <span class="s2">readCSV()</span>.</p>
<p class="p5">Finally, the remainder of a line following a comment character will be ignored when the file is read; by default <span class="s2">comment</span> is the empty string, <span class="s2">""</span>, indicating that comments do not exist at all, but <span class="s2">"#"</span> is a popular comment prefix.</p>

VERSIONS

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ development head (in the master branch):
 fix the error position reported for assignment into a non-existent property; this fixes a bug in SLiMgui's autofix feature, as a side effect (with, e.g., "sim.generation = 5")
 revise recipe 6.1.2 (reading a recombination map from a file) to use readCSV() instead of readFile()
 extend the subset() method of DataFrame to accept NULL for rows/cols, to take entire columns or entire rows respectively, for usability
+extend readCSV() to allow sep="", meaning that the separator is "whitespace", as in R
 
 
 version 4.0 (Eidos version 3.0):

eidos/eidos_class_DataFrame.cpp

Lines changed: 33 additions & 9 deletions
@@ -575,21 +575,22 @@ static EidosValue_SP Eidos_ExecuteFunction_readCSV(const std::vector<EidosValue_
 	std::string dec_string = dec_value->StringAtIndex(0, nullptr);
 	std::string comment_string = comment_value->StringAtIndex(0, nullptr);
 
-	if (sep_string.length() != 1)
-		EIDOS_TERMINATION << "ERROR (Eidos_ExecuteFunction_readCSV): readCSV() requires that sep be a string of exactly one character." << EidosTerminate(nullptr);
+	if (sep_string.length() > 1)
+		EIDOS_TERMINATION << "ERROR (Eidos_ExecuteFunction_readCSV): readCSV() requires that sep be a string of exactly one character, or the empty string \"\"." << EidosTerminate(nullptr);
 	if (quote_string.length() != 1)
 		EIDOS_TERMINATION << "ERROR (Eidos_ExecuteFunction_readCSV): readCSV() requires that quote be a string of exactly one character." << EidosTerminate(nullptr);
 	if (dec_string.length() != 1)
 		EIDOS_TERMINATION << "ERROR (Eidos_ExecuteFunction_readCSV): readCSV() requires that dec be a string of exactly one character." << EidosTerminate(nullptr);
 	if (comment_string.length() > 1)
 		EIDOS_TERMINATION << "ERROR (Eidos_ExecuteFunction_readCSV): readCSV() requires that comment be a string of exactly one character, or the empty string." << EidosTerminate(nullptr);
 
-	char sep = sep_string[0];
+	char sep = (sep_string.length() ? sep_string[0] : 0);				// 0 indicates "whitespace separator", a special case
 	char quote = quote_string[0];
 	char dec = dec_string[0];
 	char comment = (comment_string.length() ? comment_string[0] : 0);	// 0 indicates "no comments"
 
-	if ((sep == quote) || (sep == dec) || (sep == comment) || (quote == dec) || (quote == comment) || (dec == comment))
+	if ((sep && ((sep == quote) || (sep == dec) || (sep == comment))) ||
+		((quote == dec) || (quote == comment) || (dec == comment)))
 		EIDOS_TERMINATION << "ERROR (Eidos_ExecuteFunction_readCSV): readCSV() requires sep, quote, dec, and comment to be different from each other." << EidosTerminate(nullptr);
 	if (!std::isprint(dec) || std::isalnum(dec) || (dec == '+') || (dec == '-'))
 		EIDOS_TERMINATION << "ERROR (Eidos_ExecuteFunction_readCSV): readCSV() requires that dec be a printable, non-alphanumeric character that is not '+' or '-' (typically '.' or ',')." << EidosTerminate(nullptr);
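Note on the uniqueness check in this hunk: comment already uses 0 to mean "no comments", so once sep can also be 0, an unguarded sep == comment comparison would reject a call that passes sep="" together with the default comment="". The sketch below restates the new condition as a standalone predicate. The names delimiters_conflict and example_usage are hypothetical and exist only for illustration; they are not functions in Eidos.

```cpp
#include <cassert>

// Hypothetical standalone version of the uniqueness check shown in the hunk above.
// sep == 0 means "whitespace separator"; comment == 0 means "no comments".
// quote and dec are assumed to be real (non-zero) characters, as the length checks require.
bool delimiters_conflict(char sep, char quote, char dec, char comment)
{
    // sep takes part in the comparison only when it is a real character
    bool sep_conflict = sep && ((sep == quote) || (sep == dec) || (sep == comment));
    bool other_conflict = (quote == dec) || (quote == comment) || (dec == comment);
    return sep_conflict || other_conflict;
}

void example_usage()
{
    assert(!delimiters_conflict(',', '"', '.', 0));   // the documented defaults
    assert(!delimiters_conflict(0, '"', '.', 0));     // sep="" with no comment character
    assert(delimiters_conflict(';', '"', ';', 0));    // sep equals dec
    assert(delimiters_conflict(0, '"', '.', '.'));    // dec equals comment
}
```

Without the leading `sep &&` guard, the second case would be reported as a conflict, because both sep and comment would hold the sentinel value 0.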
@@ -613,22 +614,31 @@ static EidosValue_SP Eidos_ExecuteFunction_readCSV(const std::vector<EidosValue_
 	if ((ch == 0) || (comment && (ch == comment)))
 		continue;
 
+	// if the separator is "whitespace" the line can begin with whitespace, which we eat here
+	if (!sep)
+		while ((ch == ' ') || (ch == '\t'))
+			ch = *(++line_ptr);
+
 	do
 	{
 		// ch should always be equal to *line_ptr here already, no need to fetch it again
 		bool line_ended_without_separator = false;
 
 		// at the top of the loop, we expect a new element; a comment or a null means we have an empty string and then end
 		// this might look like: foo,bar,baz,#comment: the last element is an empty string
+		// if the separator is "whitespace" then an empty string is not implied here; we just end the line
 		if ((ch == 0) || (comment && (ch == comment)))
 		{
-			// empty element and then end the line
-			row.emplace_back();
+			// empty element (if the separator is not whitespace), and then end the line
+			if (sep)
+				row.emplace_back();
 			break;
 		}
 
-		// similarly, a separator character here means we have am empty string and then expect another element
+		// similarly, a separator character here means we have an empty string and then expect another element
 		// we make the empty element, eat the separator, and loop back for the next element
+		// note this does not occur for a "whitespace" separator; any whitespace would already be eaten at this point,
+		// because two consecutive "whitespace" separators cannot occur, whereas ",," can occur implying an empty string
 		if (ch == sep)
 		{
 			row.emplace_back();
@@ -678,11 +688,18 @@ static EidosValue_SP Eidos_ExecuteFunction_readCSV(const std::vector<EidosValue_
 	{
 		// not a doubled quote; the element is terminated and ch is already the character after the end quote
 		// at this point, we expect only a separator, a comment, or a line end; the element is done
-		if (ch == sep)
+		if (sep && (ch == sep))
 		{
 			ch = *(++line_ptr);
 			break;
 		}
+		else if (!sep && ((ch == ' ') || (ch == '\t')))
+		{
+			// eat a "whitespace" separator, similar to above
+			while ((ch == ' ') || (ch == '\t'))
+				ch = *(++line_ptr);
+			break;
+		}
 		else if ((ch == 0) || (comment && (ch == comment)))
 		{
 			line_ended_without_separator = true;
@@ -724,13 +741,20 @@ static EidosValue_SP Eidos_ExecuteFunction_readCSV(const std::vector<EidosValue_
 		line_ended_without_separator = true;
 		break;
 	}
-	else if (ch == sep)
+	else if (sep && (ch == sep))
 	{
 		// we hit a separator, which terminates the element but expects another
 		// eat the separator so we're at the start of the next element
 		ch = *(++line_ptr);
 		break;
 	}
+	else if (!sep && ((ch == ' ') || (ch == '\t')))
+	{
+		// eat a "whitespace" separator, similar to above
+		while ((ch == ' ') || (ch == '\t'))
+			ch = *(++line_ptr);
+		break;
+	}
 	else if (comment && (ch == comment))
 	{
 		// we hit a comment character, which terminates the element
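Taken together, the parsing hunks in this file give the two separator modes different rules for empty elements. With a one-character separator, a trailing separator or a doubled separator implies an empty string. With the "whitespace" separator, runs of spaces and tabs collapse and leading or trailing whitespace is ignored. The following is a minimal sketch of that difference only; it deliberately ignores quoting and comment handling, and split_fields and example_usage are hypothetical names, not part of Eidos.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified field splitter: no quoting, no comments.
// sep == 0 selects the "whitespace" separator (one or more spaces or tabs).
std::vector<std::string> split_fields(const std::string &line, char sep)
{
    std::vector<std::string> row;
    const std::size_t n = line.size();
    std::size_t i = 0;
    auto is_ws = [](char c) { return (c == ' ') || (c == '\t'); };

    if (n == 0)
        return row;                              // empty lines are skipped entirely

    if (!sep)
        while ((i < n) && is_ws(line[i])) ++i;   // eat leading whitespace

    while (true)
    {
        if (i >= n)
        {
            // line end where an element was expected: an empty element,
            // but only when the separator is a real character
            if (sep)
                row.emplace_back();
            break;
        }

        // scan one element up to the next separator or the line end
        std::size_t start = i;
        while ((i < n) && (sep ? (line[i] != sep) : !is_ws(line[i]))) ++i;
        row.emplace_back(line, start, i - start);

        if (i >= n)
            break;                               // line ended without a separator

        if (sep)
            ++i;                                 // eat exactly one separator
        else
            while ((i < n) && is_ws(line[i])) ++i;   // eat a whole whitespace run
    }
    return row;
}

void example_usage()
{
    assert(split_fields("foo,bar,baz,", ',').size() == 4);   // trailing empty element
    assert(split_fields("a,,b", ',')[1].empty());            // ",," implies an empty string
    assert((split_fields(" 1 2  3 ", 0) == std::vector<std::string>{"1", "2", "3"}));
}
```

One consequence of the whitespace rules is that an empty value cannot be written as a bare gap in whitespace mode; this sketch has no way to express one, since it omits quoting.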

eidos/eidos_test_functions_other.cpp

Lines changed: 3 additions & 0 deletions
@@ -1299,6 +1299,9 @@ void _RunClassTests(std::string temp_path)
 		EidosAssertScriptSuccess_L("x = Dictionary('a', 3:6, 'b', c(121,131,141,141141)); file = writeTempFile('eidos_test_', '.csv', x.serialize('csv')); y = readCSV(file, quote='1'); Dictionary('\"a\"', 3:6, '\"b\"', c(2:4, 414)).identicalContents(y);", true);
 		EidosAssertScriptSuccess_L("x = Dictionary('b', c('10$25', '10$0', '10$')); file = writeTempFile('eidos_test_', '.csv', x.serialize('csv')); y = readCSV(file, dec='$'); Dictionary('b', c(10.25, 10, 10)).identicalContents(y);", true);
 		EidosAssertScriptSuccess_L("x = Dictionary('a', c('foo', 'bar'), 'b', c(10.5, 10.25)); file = writeTempFile('eidos_test_', '.csv', x.serialize('csv')); y = readCSV(file, dec='$', comment='.'); Dictionary('a', c('foo', 'bar'), 'b', c(10, 10)).identicalContents(y);", true);
+
+		// test sep="" whitespace separator)
+		EidosAssertScriptSuccess_L("file = writeTempFile('eidos_test_', '.csv', c(' a b c d e', ' 1 2 3 4 5 ', ' 10 20 30 40 50', '100 200 300 400 500')); y = readCSV(file, sep=''); Dictionary('a', c(1,10,100), 'b', c(2,20,200), 'c', c(3,30,300), 'd', c(4,40,400), 'e', c(5,50,500)).identicalContents(y);", true);
 	}
 }
 
