How to drop observations with inter-row difference being less than a specific value

I have a data.table that consists of several groups (a hierarchical panel/longitudinal dataset, to be more specific), and the data within one group look like this:



z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
                t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
# 'origin' is needed when converting a numeric to Date (see the comments below)
# that is:
#        x          t
#  1: 10.0 1970-01-28
#  2: 10.5 1970-02-02
#  3: 11.1 1970-02-03
#  4: 14.0 1970-02-04
#  5: 14.2 1970-02-06 # to be removed since 14.2 - 14.0 = 0.2 < 0.5
#  6: 14.4 1970-02-07 # to be removed since 14.4 - 14.2 = 0.2 < 0.5 and 14.4 - 14.0 = 0.4 < 0.5
#  7: 14.6 1970-02-08 # shall NOT be removed because 14.6 - 14.0 = 0.6 > 0.5
#  8: 17.0 1970-02-09
#  9: 17.4 1970-02-10 # to be removed since 17.4 - 17.0 = 0.4 < 0.5
# 10: 30.0 1970-02-11


For simplicity, the groups are excluded here, so just assume there are only two variables (columns) in the data.



I need to drop the observations whose value of x differs by less than 0.5 from the last row that was kept, so what I need would look like this:



#        x          t
#  1: 10.0 1970-01-28
#  2: 10.5 1970-02-02
#  3: 11.1 1970-02-03
#  4: 14.0 1970-02-04
#  7: 14.6 1970-02-08
#  8: 17.0 1970-02-09
# 10: 30.0 1970-02-11


In the end, any two neighbouring values of x differ by no less than 0.5, in the order of the variable t.



Is this possible for a data.table like this one, but much larger, with several groups and nearly 100 million observations?



Thank you in advance!










r data.table

asked Mar 8 at 11:03 by Caleb

  • Error in as.Date.numeric(c(27, 32:34, 36:41)) : 'origin' must be supplied

    – NelsonGon
    Mar 8 at 11:36

  • @NelsonGon try: as.Date.numeric(c(27, 32:34, 36:41), origin = "1970-01-01")

    – Soren
    Mar 8 at 12:19

  • Thanks @Soren for that.

    – NelsonGon
    Mar 8 at 12:20


3 Answers

If I understood correctly, you could do:



library(data.table)

z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
  , filt := ifelse(x == filt,
                   shift(x, fill = x[1]),
                   filt)][
  x - filt >= 0.5 | x == filt, ][, filt := NULL]


Explanation:



  • First we calculate the minimum of x within each group;

  • The group is created by cumsum(c(1, +(x >= shift(x) + 0.5)[-1])). Therein, we check for each row whether x >= shift(x) + 0.5 (i.e. whether the difference between x and the previous row is at least 0.5). This evaluates to TRUE or FALSE, which the unary + turns into 1 and 0. As the first element is always NA (there is no previous row), we drop it with [-1] after the expression; since the vector is then one element short, we prepend a 1 with c(1, ...). Afterwards we apply cumsum: it starts a new group number each time a row is at least 0.5 larger than the previous one, and otherwise keeps assigning the last number (see the quick sketch after the output below);

  • Some groups created this way contain only one row; in that case we need to check the difference against the exact previous row. In all other cases we check the difference against the first row of the group (i.e. the last row that should not be deleted, as it was at least 0.5 larger than its predecessor);

  • After that we keep only the rows that satisfy the condition, plus the row that equals its own filter value (always the first one); we remove the filtering variable at the end.

Output:



      x          t
 1: 10.0 1970-01-28
 2: 10.5 1970-02-02
 3: 11.1 1970-02-03
 4: 14.0 1970-02-04
 5: 14.6 1970-02-08
 6: 17.0 1970-02-09
 7: 30.0 1970-02-11
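
To make the grouping step concrete, here is a quick sketch that evaluates just the by= expression on the example x (group values worked out by hand; run it to verify):

x <- c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30)
# unary + turns the logical comparison into 0/1; [-1] drops the leading NA
grp <- cumsum(c(1, +(x >= data.table::shift(x) + 0.5)[-1]))
grp
# [1] 1 2 3 4 4 4 4 5 5 6  -- rows 5:7 fall into group 4, whose min(x) is 14.0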





answered Mar 8 at 11:53 by arg0naut91 (edited Mar 8 at 15:29)

  • Thank you! That's genius and also hard to digest. Could you please tell me how to interpret +(x > shift(x) + 0.5)[-1]? I don't understand the uses of +(...) and [-1].

    – Caleb
    Mar 8 at 15:00

  • You're welcome! Of course, I will add a short description.

    – arg0naut91
    Mar 8 at 15:10

  • I really appreciate that. I thought by could only be used to control for specific group variables, but did not know it is so flexible for use in nested conditions. 😂

    – Caleb
    Mar 8 at 16:14

  • Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on, and can smoothly continue with the rest.

    – arg0naut91
    Mar 8 at 16:16

  • I see it when using paste() with collapse. Thx, bro!

    – Caleb
    Mar 8 at 16:25


As the gap depends on the sequential removal of rows, the solution below uses an iterative approach: it identifies a row to remove and re-calculates the subsequent gaps after each removal.



z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
                t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
setkeyv(z, "t")

find_gaps <- function(dt) {
  dt[, last_x := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = "x"]
  gaps <- dt[, abs(x - last_x) < 0.5]
  gap <- which(gaps == TRUE)[1]
  # print(paste0("removing row: ", gap))
  return(gap)
}

while (!is.na(gap <- find_gaps(z))) z <- z[-gap]

z


Results:



[1] "removing row: 5"
[1] "removing row: 5"
[1] "removing row: 7"
> z
       x          t last_x
 1: 10.0 1970-01-28     NA
 2: 10.5 1970-02-02   10.0
 3: 11.1 1970-02-03   10.5
 4: 14.0 1970-02-04   11.1
 5: 14.6 1970-02-08   14.0
 6: 17.0 1970-02-09   14.6
 7: 30.0 1970-02-11   17.0


Alternate



Noting the 8 GB file and with an eye on efficiency: proposing a good old for() loop as the most efficient option.



z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                 t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
setkeyv(z1, "t")
x <- z1$x
last_x <- x[1]
gaps <- c()

for (i in 2:length(x)) {
  if (abs(x[i] - last_x) < 0.5) gaps <- c(gaps, i)
  else last_x <- x[i]
}

z1 <- z1[-(gaps)]
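
The loop above works on a single series; for several groups, one way (a sketch, assuming a hypothetical group column grp) is to wrap the same logic in a helper that returns a logical keep-mask, so data.table can apply it per group:

keep_rows <- function(x, gap = 0.5) {
  # TRUE marks rows to keep; each value is compared against the last kept one
  keep <- rep(TRUE, length(x))
  last_x <- x[1]
  for (i in seq_along(x)[-1]) {
    if (abs(x[i] - last_x) < gap) keep[i] <- FALSE
    else last_x <- x[i]
  }
  keep
}

# hypothetical usage with a group column grp:
# z[, .SD[keep_rows(x)], by = grp]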


Benchmarking



microbenchmark::microbenchmark(times = 100,
  forway = {
    z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01")); setkeyv(z1, "t")
    x <- z1$x; last_x <- x[1]; gaps <- c()
    for (i in 2:length(x)) {
      if (abs(x[i] - last_x) < 0.5) gaps <- c(gaps, i)
      else last_x <- x[i]
    }
    z1 <- z1[-(gaps)]
  },
  datatableway = {
    z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01")); setkeyv(z2, "t")
    z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
      , filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][
        x - filt >= 0.5 | x == filt, ][, filt := NULL]
  },
  whileway = {
    z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01")); setkeyv(z3, "t")
    find_gaps <- function(dt) {
      dt[, last_x := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = "x"]
      gaps <- dt[, abs(x - last_x) < 0.5]
      which(gaps == TRUE)[1]
    }
    while (!is.na(gap <- find_gaps(z3))) z3 <- z3[-gap]
  }
)

(z1 == z2) & (z2 == z3[, .(x, t)])


Results:



Unit: milliseconds
         expr       min        lq      mean    median        uq      max neval
       forway  2.741609  3.607341  4.067566  4.069382  4.556219  5.61997   100
 datatableway  7.552005  8.915333  9.839475  9.606205 10.762764 15.46430   100
     whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271   100
>
> (z1 == z2) & (z2 == z3[, .(x, t)])
        x    t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE





answered Mar 8 at 12:05 by Soren (edited Mar 8 at 15:06)

  • Thank you! That's very intuitive.

    – Caleb
    Mar 8 at 15:05

  • Made an update for the fastest approach -- a simple for() loop, it seems!

    – Soren
    Mar 8 at 15:08

  • Thx!😊 It's indeed fast, just not that convenient to use a for/while loop inside a data.table, especially with groups?

    – Caleb
    Mar 8 at 16:17


You can use dplyr::mutate and filter:



library(dplyr)

z %>%
  mutate(diff = lead(x, 1) - x) %>%
  filter(diff >= 0.5 | is.na(diff)) %>%
  select(-diff)


I kept the diff field for easier understanding; you can do this in a single filter statement as well.
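
For reference, that single-filter version could look like this (a sketch using the same lead()-based logic as the pipeline above):

z %>% filter(is.na(lead(x, 1)) | lead(x, 1) - x >= 0.5)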






answered Mar 8 at 11:13 by Sonny

  • It's not giving the desired results.

    – tmfmnk
    Mar 8 at 11:15

  • Why should row 7 not be removed?

    – Sonny
    Mar 8 at 11:17

  • I think the OP is thinking about a solution that removes a row and then compares the next subsequent row with the last non-removed row.

    – tmfmnk
    Mar 8 at 11:18

  • This does not work because row #7 would be removed, but I need to keep it. I've tried calculating the 1st to N-th differences and generating a tag to label rows qualified for removal, but that is very tedious and inefficient for a huge dataset (about 8 GB in size).

    – Caleb
    Mar 8 at 11:20

  • You said "between any two rows nearby", so should it apply only to +/- 2 rows?

    – Sonny
    Mar 8 at 11:21