How to drop observations with inter-row difference being less than a specific value



I have a data.table that consists of several groups (a hierarchical panel/longitudinal dataset, to be more specific), and one group within it looks like this:



z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
                t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
# that is:
# x t
# 1: 10.0 1970-01-28
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 5: 14.2 1970-02-06 # to be removed since 14.2 - 14.0 = 0.2 < 0.5
# 6: 14.4 1970-02-07 # to be removed since 14.4 - 14.2 = 0.2 < 0.5 and 14.4 - 14.0 = 0.4 < 0.5
# 7: 14.6 1970-02-08 # shall NOT be removed because 14.6 - 14.0 = 0.6 > 0.5
# 8: 17.0 1970-02-09
# 9: 17.4 1970-02-10 # to be removed
# 10: 30.0 1970-02-11


For simplicity, the groups are excluded here, so assume there are only two variables (columns) in the data.



I need to drop the observations whose difference from the nearby rows is less than 0.5, so the result I need looks like this:



# x t
# 1: 10.0 1970-01-28
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 7: 14.6 1970-02-08
# 8: 17.0 1970-02-09
# 10: 30.0 1970-02-11


In the end, any two neighboring values (in the order of the variable t) should differ by no less than 0.5.
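In code, the kept rows should pass a check like this (z_kept is a placeholder name for the filtered table; x happens to be non-decreasing here, as in the example):

all(diff(z_kept[order(t)]$x) >= 0.5)
# TRUE for the desired output above (the consecutive gaps are 0.5, 0.6, 2.9, 0.6, 2.4, 13.0)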



Is this feasible for a data.table like this one, but much larger, with several groups and nearly 100 million observations?



Thank you in advance!










Tags: r, data.table






asked Mar 8 at 11:03 by Caleb












  • Error in as.Date.numeric(c(27, 32:34, 36:41)) : 'origin' must be supplied

    – NelsonGon
    Mar 8 at 11:36






  • @NelsonGon try: as.Date.numeric(c(27, 32:34, 36:41), origin = "1970-01-01")

    – Soren
    Mar 8 at 12:19











  • Thanks @Soren for that.

    – NelsonGon
    Mar 8 at 12:20

















3 Answers






Answer by arg0naut91 (score 2; answered Mar 8 at 11:53, edited Mar 8 at 15:29)














If I understood correctly, you could do:



library(data.table)

z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
  , filt := ifelse(x == filt,
                   shift(x, fill = x[1]),
                   filt)][
  x - filt >= 0.5 | x == filt, ][, filt := NULL]


Explanation:



  • First we calculate the minimum of x within each group;

  • The group is created by cumsum(c(1, +(x >= shift(x) + 0.5)[-1])). For each row we check whether x >= shift(x) + 0.5, i.e. whether the difference from the previous row is at least 0.5. This evaluates to TRUE or FALSE, which the unary + turns into 1 and 0. The first element is always NA (there is no previous row), so we drop it with [-1] and prepend a 1 in its place. Applying cumsum then starts a new group number every time a row is at least 0.5 larger than the previous one, while rows in between keep the last number; since the vector begins with 1, the counter starts at 1 and increases by 1 whenever a row satisfies the condition for non-exclusion (these intermediate vectors are traced in the sketch below);

  • Some of the groups created this way contain only one row; for those we need to cross-check the difference against the immediately preceding row. In all other cases we cross-check against the first row of the group (i.e. the last row that should not be deleted, since it was at least 0.5 larger than its predecessor);

  • Finally we remove the rows that don't satisfy the condition, keep the row that equals its own filter value (always the first one), and drop the filtering variable at the end.
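To make the grouping step concrete, here is a small trace of the intermediate vectors on the example data (worked out by hand from the definitions above; run with data.table loaded):

library(data.table)
x <- c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30)

shift(x)                          # NA 10.0 10.5 11.1 14.0 14.2 14.4 14.6 17.0 17.4
+(x >= shift(x) + 0.5)            # NA  1    1    1    0    0    0    1    0    1
c(1, +(x >= shift(x) + 0.5)[-1])  #  1  1    1    1    0    0    0    1    0    1
cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))
# 1 2 3 4 4 4 4 5 5 6  -> rows 5:7 join row 4's group; rows 8 and 9 share a group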

Output:



      x          t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
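Since the question mentions several groups, the same chain should extend to a grouped table by computing the run id per group first and folding it into by. A rough, untested sketch, assuming a hypothetical grouping column g:

z[, run := cumsum(c(1, +(x >= shift(x) + 0.5)[-1])), by = g]  # run id restarts per group
z[, filt := min(x), by = .(g, run)]                           # minimum of x per (group, run)
z[, filt := ifelse(x == filt, shift(x, fill = x[1]), filt), by = g]
z <- z[x - filt >= 0.5 | x == filt][, c("run", "filt") := NULL]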





  • Thank you! That's genius and also hard to digest. Could you please tell me how to interpret +(x > shift(x) + 0.5)[-1]? I don't understand the uses of +( ...) and [-1].
    – Caleb
    Mar 8 at 15:00











  • You're welcome! Of course, will add a short description.

    – arg0naut91
    Mar 8 at 15:10






  • I really appreciate that. I thought by could only be used on specific group variables, but didn't know it is so flexible for use in nested conditions. 😂

    – Caleb
    Mar 8 at 16:14












  • Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on and can smoothly continue with the rest.

    – arg0naut91
    Mar 8 at 16:16






  • I've seen it when using paste() with collapse. Thx, bro!

    – Caleb
    Mar 8 at 16:25


















Answer by Soren (score 1; answered Mar 8 at 12:05, edited Mar 8 at 15:06)














As the gap depends on the sequential removal of rows, the solution below uses an iterative approach to identify and re-calculate the subsequent gap after each row is removed.



library(data.table)

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
                t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
setkeyv(z, "t")

find_gaps <- function(dt) {
  dt[, last_x := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = "x"]
  gaps <- dt[, abs(x - last_x) < 0.5]
  gap <- which(gaps == TRUE)[1]
  # print(paste0("Removing row: ", gap))
  return(gap)
}

while (!is.na(gap <- find_gaps(z))) z <- z[-gap]

z


Results:



[1] "removing row: 5"
[1] "removing row: 5"
[1] "removing row: 7"
> z
x t last_x gap
1: 10.0 1970-01-28 NA FALSE
2: 10.5 1970-02-02 10.0 FALSE
3: 11.1 1970-02-03 10.5 FALSE
4: 14.0 1970-02-04 11.1 FALSE
5: 14.6 1970-02-08 14.0 FALSE
6: 17.0 1970-02-09 14.6 FALSE
7: 30.0 1970-02-11 17.0 FALSE


Alternate



Noting the 8 GB file and with an eye for efficiency: proposing a good old for loop as the most efficient option.



z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                 t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
setkeyv(z1, "t")
x <- z1$x
last_x <- x[1]
gaps <- c()

for (i in 2:length(x)) {
  if (abs(x[i] - last_x) < 0.5) gaps <- c(gaps, i)
  else last_x <- x[i]
}

z1 <- z1[-(gaps)]


Benchmarking



microbenchmark::microbenchmark(times = 100,
  forway = {
    z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
    setkeyv(z1, "t")
    x <- z1$x; last_x <- x[1]; gaps <- c()
    for (i in 2:length(x)) {
      if (abs(x[i] - last_x) < 0.5) gaps <- c(gaps, i)
      else last_x <- x[i]
    }
    z1 <- z1[-(gaps)]
  },
  datatableway = {
    z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
    setkeyv(z2, "t")
    z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
      , filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][
      x - filt >= 0.5 | x == filt, ][, filt := NULL]
  },
  whileway = {
    z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
    setkeyv(z3, "t")
    find_gaps <- function(dt) {
      dt[, last_x := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = "x"]
      gaps <- dt[, abs(x - last_x) < 0.5]
      which(gaps == TRUE)[1]
    }
    while (!is.na(gap <- find_gaps(z3))) z3 <- z3[-gap]
  }
)

(z1 == z2) & (z2 == z3[, .(x, t)])


Results:



Unit: milliseconds
         expr       min        lq      mean    median        uq      max neval
       forway  2.741609  3.607341  4.067566  4.069382  4.556219  5.61997   100
 datatableway  7.552005  8.915333  9.839475  9.606205 10.762764 15.46430   100
     whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271   100

> (z1 == z2) & (z2 == z3[, .(x, t)])
        x    t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE
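For the grouped case raised in the comments below, one option would be to turn the loop into a keep-flag helper and apply it per group through data.table's by (a sketch only, assuming a hypothetical group column g):

keep_rows <- function(x, gap = 0.5) {
  keep <- rep(TRUE, length(x))  # start by keeping every row
  last_x <- x[1]                # last retained value
  for (i in seq_along(x)[-1]) {
    if (abs(x[i] - last_x) < gap) keep[i] <- FALSE  # too close: drop
    else last_x <- x[i]                             # far enough: keep and advance
  }
  keep
}

# z[order(t), .SD[keep_rows(x)], by = g]  # hypothetical grouped usage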






  • Thank you! That's very intuitive.

    – Caleb
    Mar 8 at 15:05











  • Made an update for the fastest approach -- a simple for() loop, it seems!

    – Soren
    Mar 8 at 15:08











  • Thx!😊 It's indeed fast, just not that convenient to use a for/while loop inside a data.table, especially with groups?

    – Caleb
    Mar 8 at 16:17



















Answer by Sonny (score 0; answered Mar 8 at 11:13)














You can use dplyr::mutate and filter:



library(dplyr)

z %>%
  mutate(diff = lead(x, 1) - x) %>%
  filter(diff >= 0.5 | is.na(diff)) %>%
  select(-diff)


I kept the diff field for ease of understanding. You can also do this in a single filter statement, as sketched below.
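A possible single-filter version (same logic, with the helper column folded into the condition):

z %>% filter(lead(x, 1) - x >= 0.5 | is.na(lead(x, 1)))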







  • It's not giving the desired results.

    – tmfmnk
    Mar 8 at 11:15











  • Why should row 7 not be removed?

    – Sonny
    Mar 8 at 11:17











  • I think the OP is thinking about a solution that removes a row and then compares the next subsequent row with the last non-removed row.

    – tmfmnk
    Mar 8 at 11:18











  • This does not work because row #7 would be removed, but I need to keep it. I've tried calculating the 1st- to N-th-order differences and generating a tag to label rows qualified for removal, but that is very tedious and inefficient for a huge dataset (about 8 GB in size).

    – Caleb
    Mar 8 at 11:20











  • You said between any two rows nearby, so should it apply only to +/- 2 rows?

    – Sonny
    Mar 8 at 11:21










