How to drop observations with inter-row difference being less than a specific value
I have a data.table that consists of several groups (a hierarchical panel/longitudinal dataset, to be more specific), and one cell within a group looks like this:
library(data.table)
z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
# that is:
# x t
# 1: 10.0 1970-01-28
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 5: 14.2 1970-02-06 # to be removed since 14.2-14.0 = 0.2 <0.5
# 6: 14.4 1970-02-07 # to be removed since 14.4-14.2 = 0.2 <0.5 and 14.4-14.0 = 0.4 <0.5
# 7: 14.6 1970-02-08 # shall NOT be removed because 14.6-14.0 = 0.6 > 0.5
# 8: 17.0 1970-02-09
# 9: 17.4 1970-02-10 # to be removed since 17.4-17.0 = 0.4 <0.5
# 10: 30.0 1970-02-11
For simplicity, the groups are excluded, so just assume there are only two variables (columns) in the data.
I need to drop the observations whose difference from the nearest preceding retained row is less than 0.5, so what I need would look like this:
# x t
# 1: 10.0 1970-01-28
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 7: 14.6 1970-02-08
# 8: 17.0 1970-02-09
# 10: 30.0 1970-02-11
In the end, any two neighboring values (in the order of the variable t) should differ by no less than 0.5.
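Restated, the rule is a sequential scan: a row is kept only if it differs by at least 0.5 from the most recently kept row. A minimal base-R sketch of that rule (keep_rows is a hypothetical helper name, not an existing function):

```r
# Hypothetical helper: sequentially keep values that differ by at
# least `gap` from the last kept value.
keep_rows <- function(x, gap = 0.5) {
  keep <- logical(length(x))
  keep[1] <- TRUE          # the first observation is always kept
  last_kept <- x[1]
  for (i in seq_along(x)[-1]) {
    if (abs(x[i] - last_kept) >= gap) {
      keep[i] <- TRUE
      last_kept <- x[i]
    }
  }
  keep
}

x <- c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30)
which(keep_rows(x))  # rows retained: 1 2 3 4 7 8 10
```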
Is this possible for a data.table like this, but much larger, with several groups and nearly 100 million observations?
Thank you in advance!
r data.table
Error in as.Date.numeric(c(27, 32:34, 36:41)) : 'origin' must be supplied
– NelsonGon Mar 8 at 11:36
@NelsonGon try: as.Date.numeric(c(27, 32:34, 36:41), origin = "1970-01-01")
– Soren Mar 8 at 12:19
Thanks @Soren for that.
– NelsonGon Mar 8 at 12:20
asked Mar 8 at 11:03 by Caleb
3 Answers
If I understood correctly, you could do:
library(data.table)
z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
, filt := ifelse(x == filt,
shift(x, fill = x[1]),
filt)][
x - filt >= 0.5 | x == filt, ][, filt := NULL]
Explanation:
- First we calculate the minimum of x within each group.
- The group is created by cumsum(c(1, +(x >= shift(x) + 0.5)[-1])). Therein, we check for each row whether x >= shift(x) + 0.5 (i.e. whether the difference between x and the previous row is at least 0.5). This evaluates to TRUE or FALSE, which we turn into 1 and 0 with the + sign; as the first element will always be NA (there is no previous row), we remove it with [-1] after the expression. Since this leaves the first value missing from the vector, we construct another one that begins with 1, followed by what we computed before. Afterwards we apply cumsum: it increments each time a row is larger than or equal to the previous one + 0.5; if there is no such row in between, it keeps assigning the last number (as we inserted 1 at the beginning of the vector, it starts at 1 and increases by 1 every time it encounters a row that satisfies the condition for non-exclusion).
- Some of the groups created above will contain only one row; in that case we need to cross-check the difference with the exact previous row. In all other cases we cross-check the difference with the first row of the group (i.e. the last row that should not be deleted according to the criteria, as it was larger than the previous one + 0.5).
- After that we simply remove the rows that don't satisfy the condition, keep the row that is equal to itself (always the first one), and drop the filtering variable at the end.
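To see the grouping in action, the by = expression can be evaluated on its own against the sample x (a sketch run outside the chain; shift() comes from data.table):

```r
library(data.table)

x <- c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30)

# TRUE where a row is at least 0.5 above the previous one; [-1] drops the NA.
step <- +(x >= shift(x) + 0.5)[-1]
grp  <- cumsum(c(1, step))
grp  # 1 2 3 4 4 4 4 5 5 6
```

Rows 4 to 7 (14.0, 14.2, 14.4, 14.6) share group 4, so filt = min(x) = 14.0 there, and row 7 survives because 14.6 - 14.0 >= 0.5.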
Output:
x t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
answered Mar 8 at 11:53, edited Mar 8 at 15:29 by arg0naut91
Thank you! That's genius and also hard to digest. Could you please tell me how to interpret +(x >= shift(x) + 0.5)[-1]? I don't understand the uses of +(...) and [-1].
– Caleb Mar 8 at 15:00
You're welcome! Of course, I will add a short description.
– arg0naut91 Mar 8 at 15:10
I really appreciate that. I thought by could only be used to control specific group variables; I did not know it is so flexible for use in nested conditions. 😂
– Caleb Mar 8 at 16:14
Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on and can smoothly continue with the rest.
– arg0naut91 Mar 8 at 16:16
I see it when using paste() with collapse. Thanks!
– Caleb Mar 8 at 16:25
As the gap is dependent on the sequential removal of rows, the solution below uses an iterative approach to identify and re-calculate the subsequent gap after a row is removed.
library(data.table)

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
setkeyv(z, "t")

find_gaps <- function(dt) {
  dt[, last_x := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = "x"]
  gaps <- dt[, abs(x - last_x) < 0.5]
  gap <- which(gaps == TRUE)[1]
  # print(paste0("Removing row: ", gap))
  return(gap)
}

while (!is.na(gap <- find_gaps(z))) z <- z[-gap]
z
Results (with the print line uncommented):
[1] "Removing row: 5"
[1] "Removing row: 5"
[1] "Removing row: 7"
> z
x t last_x
1: 10.0 1970-01-28 NA
2: 10.5 1970-02-02 10.0
3: 11.1 1970-02-03 10.5
4: 14.0 1970-02-04 11.1
5: 14.6 1970-02-08 14.0
6: 17.0 1970-02-09 14.6
7: 30.0 1970-02-11 17.0
Alternative
Noting the 8 GB file and an eye for efficiency: proposing a good old for() loop as the most efficient.
z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                 t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
setkeyv(z1, "t")

x <- z1$x
last_x <- x[1]
gaps <- c()
for (i in 2:length(x)) {
  if (abs(x[i] - last_x) < 0.5) gaps <- c(gaps, i)
  else last_x <- x[i]
}
z1 <- z1[-gaps]
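Since the loop only needs x, it can also be wrapped in a helper and applied per group via by =, which covers the grouped case raised in the comments (keep_idx and the toy zg table below are illustrative, not from the answer):

```r
library(data.table)

# Illustrative helper: indices of rows to keep within one group,
# using the same sequential last-kept-value rule as the loop above.
keep_idx <- function(x, gap = 0.5) {
  last_x <- x[1]
  keep <- rep(TRUE, length(x))
  for (i in seq_along(x)[-1]) {
    if (abs(x[i] - last_x) < gap) keep[i] <- FALSE
    else last_x <- x[i]
  }
  which(keep)
}

zg <- data.table(g = rep(c("a", "b"), each = 5),
                 x = c(10, 10.2, 11, 11.3, 12, 20, 20.1, 21, 21.2, 22))
zg[, .SD[keep_idx(x)], by = g]
# keeps 10, 11, 12 in group "a" and 20, 21, 22 in group "b"
```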
Benchmarking
microbenchmark::microbenchmark(times = 100,
  forway = {
    z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
    setkeyv(z1, "t")
    x <- z1$x; last_x <- x[1]; gaps <- c()
    for (i in 2:length(x)) {
      if (abs(x[i] - last_x) < 0.5) gaps <- c(gaps, i)
      else last_x <- x[i]
    }
    z1 <- z1[-gaps]
  },
  datatableway = {
    z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
    setkeyv(z2, "t")
    z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
      , filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][
      x - filt >= 0.5 | x == filt, ][, filt := NULL]
  },
  whileway = {
    z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                     t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))
    setkeyv(z3, "t")
    find_gaps <- function(dt) {
      dt[, last_x := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = "x"]
      gaps <- dt[, abs(x - last_x) < 0.5]
      which(gaps == TRUE)[1]
    }
    while (!is.na(gap <- find_gaps(z3))) z3 <- z3[-gap]
  }
)
Results:
Unit: milliseconds
expr min lq mean median uq max neval
forway 2.741609 3.607341 4.067566 4.069382 4.556219 5.61997 100
datatableway 7.552005 8.915333 9.839475 9.606205 10.762764 15.46430 100
whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271 100
> (z1==z2) & (z2==z3[,.(x,t)])
x t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE
Thank you! That's very intuitive.
– Caleb Mar 8 at 15:05
Made an update for the fastest approach -- simply a for() loop, it seems!
– Soren Mar 8 at 15:08
Thanks! 😊 It's indeed fast, just not that convenient to use a for/while loop inside a data.table, especially with groups?
– Caleb Mar 8 at 16:17
You can use dplyr::mutate and filter:
library(dplyr)

z %>%
  mutate(diff = lead(x, 1) - x) %>%
  filter(diff >= 0.5 | is.na(diff)) %>%
  select(-diff)
I kept the diff field for ease of understanding. You can also do this in a single filter statement.
It's not giving the desired results.
– tmfmnk Mar 8 at 11:15
Why should row 7 not be removed?
– Sonny Mar 8 at 11:17
I think the OP is thinking about a solution that removes a row and then compares the next subsequent row with the last non-removed row.
– tmfmnk Mar 8 at 11:18
This does not work because row #7 would be removed, but I need to keep it. I've tried calculating the 1st to N-th differences and generating a tag to label rows qualified for removal, but that is very tedious and inefficient for a huge dataset (about 8 GB in size).
– Caleb Mar 8 at 11:20
You said "between any two rows nearby", so should it apply only to +/- 2 rows?
– Sonny Mar 8 at 11:21
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
If I understood correctly, you could do:
library(data.table)
z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
, filt := ifelse(x == filt,
shift(x, fill = x[1]),
filt)][
x - filt >= 0.5 | x == filt, ][, filt := NULL]
Explanation:
- First we calculate the minimum of
x
by each group; - Group is created by
cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))
. Therein, we check for each row whetherx >= shift(x) + 0.5
(difference betweenx
and previous row is larger or equal to 0.5). This evaluates toTRUE
orFALSE
which we turn to 1 and 0 with the+
sign; as the first row will always beNA
(as there is no previous one), we remove it with[-1]
after the expression. As this means the first value will be missing from the vector, we construct another one which begins with 1 and is followed by what we have computed before. Afterwards we apply thecumsum
- the latter assigns a value each time when there is a new row larger or equal than previous one + 0.5; if there is no such row in-between, it continues assigning the last number (as we have inserted 1 as the beginning of vector, it will start at 1 and increase by +1 every time it'll encounter the row which satisfied the condition for non-exclusion); - There will be rows with only 1 row per previously created groups; in this case, we need to cross-check for difference with the exact previous row. In all other cases we cross-check for difference with the first row of the group (i.e. last row which should not be deleted according to criteria as it was larger than previous one + 0.5);
- After that we just remove those rows which don't satisfy the condition plus we keep the row which is equal to itself (will always be the first one); we remove the filtering variable at the end.
Output:
x t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
1
Thank you! That's genius and also hard to digest. Could you please tell me how to interpret <code>+(x > shift(x) + 0.5)[-1]</code>? I dont understand the uses of <code>+( ...)</code> and <code>[-1]</code>.
– Caleb
Mar 8 at 15:00
You're welcome! Of course, will add a short description.
– arg0naut91
Mar 8 at 15:10
1
I really appreciate that. I thought <code>by </code> can be only used to control on specific group variables, but did not know it is so flexible for using in nested conditions. 😂
– Caleb
Mar 8 at 16:14
Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on, can smoothly continue with the rest.
– arg0naut91
Mar 8 at 16:16
1
I see it when using paste( ) with collapse. Thx, bro!
– Caleb
Mar 8 at 16:25
add a comment |
If I understood correctly, you could do:
library(data.table)
z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
, filt := ifelse(x == filt,
shift(x, fill = x[1]),
filt)][
x - filt >= 0.5 | x == filt, ][, filt := NULL]
Explanation:
- First we calculate the minimum of
x
by each group; - Group is created by
cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))
. Therein, we check for each row whetherx >= shift(x) + 0.5
(difference betweenx
and previous row is larger or equal to 0.5). This evaluates toTRUE
orFALSE
which we turn to 1 and 0 with the+
sign; as the first row will always beNA
(as there is no previous one), we remove it with[-1]
after the expression. As this means the first value will be missing from the vector, we construct another one which begins with 1 and is followed by what we have computed before. Afterwards we apply thecumsum
- the latter assigns a value each time when there is a new row larger or equal than previous one + 0.5; if there is no such row in-between, it continues assigning the last number (as we have inserted 1 as the beginning of vector, it will start at 1 and increase by +1 every time it'll encounter the row which satisfied the condition for non-exclusion); - There will be rows with only 1 row per previously created groups; in this case, we need to cross-check for difference with the exact previous row. In all other cases we cross-check for difference with the first row of the group (i.e. last row which should not be deleted according to criteria as it was larger than previous one + 0.5);
- After that we just remove those rows which don't satisfy the condition plus we keep the row which is equal to itself (will always be the first one); we remove the filtering variable at the end.
Output:
x t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
1
Thank you! That's genius and also hard to digest. Could you please tell me how to interpret <code>+(x > shift(x) + 0.5)[-1]</code>? I dont understand the uses of <code>+( ...)</code> and <code>[-1]</code>.
– Caleb
Mar 8 at 15:00
You're welcome! Of course, will add a short description.
– arg0naut91
Mar 8 at 15:10
1
I really appreciate that. I thought <code>by </code> can be only used to control on specific group variables, but did not know it is so flexible for using in nested conditions. 😂
– Caleb
Mar 8 at 16:14
Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on, can smoothly continue with the rest.
– arg0naut91
Mar 8 at 16:16
1
I see it when using paste( ) with collapse. Thx, bro!
– Caleb
Mar 8 at 16:25
add a comment |
If I understood correctly, you could do:
library(data.table)
z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
, filt := ifelse(x == filt,
shift(x, fill = x[1]),
filt)][
x - filt >= 0.5 | x == filt, ][, filt := NULL]
Explanation:
- First we calculate the minimum of
x
by each group; - Group is created by
cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))
. Therein, we check for each row whetherx >= shift(x) + 0.5
(difference betweenx
and previous row is larger or equal to 0.5). This evaluates toTRUE
orFALSE
which we turn to 1 and 0 with the+
sign; as the first row will always beNA
(as there is no previous one), we remove it with[-1]
after the expression. As this means the first value will be missing from the vector, we construct another one which begins with 1 and is followed by what we have computed before. Afterwards we apply thecumsum
- the latter assigns a value each time when there is a new row larger or equal than previous one + 0.5; if there is no such row in-between, it continues assigning the last number (as we have inserted 1 as the beginning of vector, it will start at 1 and increase by +1 every time it'll encounter the row which satisfied the condition for non-exclusion); - There will be rows with only 1 row per previously created groups; in this case, we need to cross-check for difference with the exact previous row. In all other cases we cross-check for difference with the first row of the group (i.e. last row which should not be deleted according to criteria as it was larger than previous one + 0.5);
- After that we just remove those rows which don't satisfy the condition plus we keep the row which is equal to itself (will always be the first one); we remove the filtering variable at the end.
Output:
x t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
If I understood correctly, you could do:
library(data.table)
z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
, filt := ifelse(x == filt,
shift(x, fill = x[1]),
filt)][
x - filt >= 0.5 | x == filt, ][, filt := NULL]
Explanation:
- First we calculate the minimum of
x
by each group; - Group is created by
cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))
. Therein, we check for each row whetherx >= shift(x) + 0.5
(difference betweenx
and previous row is larger or equal to 0.5). This evaluates toTRUE
orFALSE
which we turn to 1 and 0 with the+
sign; as the first row will always beNA
(as there is no previous one), we remove it with[-1]
after the expression. As this means the first value will be missing from the vector, we construct another one which begins with 1 and is followed by what we have computed before. Afterwards we apply thecumsum
- the latter assigns a value each time when there is a new row larger or equal than previous one + 0.5; if there is no such row in-between, it continues assigning the last number (as we have inserted 1 as the beginning of vector, it will start at 1 and increase by +1 every time it'll encounter the row which satisfied the condition for non-exclusion); - There will be rows with only 1 row per previously created groups; in this case, we need to cross-check for difference with the exact previous row. In all other cases we cross-check for difference with the first row of the group (i.e. last row which should not be deleted according to criteria as it was larger than previous one + 0.5);
- After that we just remove those rows which don't satisfy the condition plus we keep the row which is equal to itself (will always be the first one); we remove the filtering variable at the end.
Output:
x t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
edited Mar 8 at 15:29
answered Mar 8 at 11:53
arg0naut91arg0naut91
6,0291421
6,0291421
1
Thank you! That's genius and also hard to digest. Could you please tell me how to interpret <code>+(x > shift(x) + 0.5)[-1]</code>? I dont understand the uses of <code>+( ...)</code> and <code>[-1]</code>.
– Caleb
Mar 8 at 15:00
You're welcome! Of course, will add a short description.
– arg0naut91
Mar 8 at 15:10
1
I really appreciate that. I thought <code>by </code> can be only used to control on specific group variables, but did not know it is so flexible for using in nested conditions. 😂
– Caleb
Mar 8 at 16:14
Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on, can smoothly continue with the rest.
– arg0naut91
Mar 8 at 16:16
1
I see it when using paste( ) with collapse. Thx, bro!
– Caleb
Mar 8 at 16:25
add a comment |
1
Thank you! That's genius and also hard to digest. Could you please tell me how to interpret <code>+(x > shift(x) + 0.5)[-1]</code>? I dont understand the uses of <code>+( ...)</code> and <code>[-1]</code>.
– Caleb
Mar 8 at 15:00
You're welcome! Of course, will add a short description.
– arg0naut91
Mar 8 at 15:10
1
I really appreciate that. I thought <code>by </code> can be only used to control on specific group variables, but did not know it is so flexible for using in nested conditions. 😂
– Caleb
Mar 8 at 16:14
Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on, can smoothly continue with the rest.
– arg0naut91
Mar 8 at 16:16
1
I see it when using paste( ) with collapse. Thx, bro!
– Caleb
Mar 8 at 16:25
1
1
Thank you! That's genius and also hard to digest. Could you please tell me how to interpret <code>+(x > shift(x) + 0.5)[-1]</code>? I dont understand the uses of <code>+( ...)</code> and <code>[-1]</code>.
– Caleb
Mar 8 at 15:00
Thank you! That's genius and also hard to digest. Could you please tell me how to interpret <code>+(x > shift(x) + 0.5)[-1]</code>? I dont understand the uses of <code>+( ...)</code> and <code>[-1]</code>.
– Caleb
Mar 8 at 15:00
You're welcome! Of course, will add a short description.
– arg0naut91
Mar 8 at 15:10
You're welcome! Of course, will add a short description.
– arg0naut91
Mar 8 at 15:10
1
1
I really appreciate that. I thought <code>by </code> can be only used to control on specific group variables, but did not know it is so flexible for using in nested conditions. 😂
– Caleb
Mar 8 at 16:14
I really appreciate that. I thought <code>by </code> can be only used to control on specific group variables, but did not know it is so flexible for using in nested conditions. 😂
– Caleb
Mar 8 at 16:14
Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on, can smoothly continue with the rest.
– arg0naut91
Mar 8 at 16:16
Indeed - it's very flexible & invisible at the same time - you don't need to remove it later on, can smoothly continue with the rest.
– arg0naut91
Mar 8 at 16:16
1
1
I see it when using paste( ) with collapse. Thx, bro!
– Caleb
Mar 8 at 16:25
I see it when using paste( ) with collapse. Thx, bro!
– Caleb
Mar 8 at 16:25
add a comment |
As the gap is dependent on the sequential removal of the rows, the solution below uses an interative approach to identify and re-calculate the subsequent gap after a row is removed.
z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
t = as.Date(c(27, 32:34, 36:41)))
setkeyv(z,"t")
find_gaps <- function(dt)
dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
gaps <- dt[,abs(x-last_x) < 0.5,]
gap <- which(gaps==TRUE)[1]
#print(paste0("Removing row: ",gap))
return (gap)
while(!is.na(gap<-find_gaps(z))) z <- z[-gap]
z
Results:
[1] "removing row: 5"
[1] "removing row: 5"
[1] "removing row: 7"
> z
x t last_x gap
1: 10.0 1970-01-28 NA FALSE
2: 10.5 1970-02-02 10.0 FALSE
3: 11.1 1970-02-03 10.5 FALSE
4: 14.0 1970-02-04 11.1 FALSE
5: 14.6 1970-02-08 14.0 FALSE
6: 17.0 1970-02-09 14.6 FALSE
7: 30.0 1970-02-11 17.0 FALSE
Alternate
Noting the 8gb file and an eye for efficiency: proposing a good old for loop() as the most efficient
z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
x <- z1$x
last_x <- x[1]
gaps <- c()
for (i in 2:length(x))
if (abs(x[i]-last_x) < 0.5) gaps <- c(gaps,i)
else last_x <- x[i]
z1 <- z1[-(gaps)]
Benchmarking
microbenchmark::microbenchmark(times = 100,
  forway = {
    z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01")); setkeyv(z1, "t")
    x <- z1$x; last_x <- x[1]; gaps <- c()
    for (i in 2:length(x)) if (abs(x[i] - last_x) < 0.5) gaps <- c(gaps, i) else last_x <- x[i]
    z1 <- z1[-gaps]
  },
  datatableway = {
    z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01")); setkeyv(z2, "t")
    z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][, filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][x - filt >= 0.5 ,
  },
  whileway = {
    z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01")); setkeyv(z3, "t")
    find_gaps <- function(dt) {
      dt[, last_x := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = "x"]
      gaps <- dt[, abs(x - last_x) < 0.5]
      which(gaps)[1]
    }
    while (!is.na(gap <- find_gaps(z3))) z3 <- z3[-gap]
  }
)
(z1 == z2) & (z2 == z3[, .(x, t)])
Results:
Unit: milliseconds
expr min lq mean median uq max neval
forway 2.741609 3.607341 4.067566 4.069382 4.556219 5.61997 100
datatableway 7.552005 8.915333 9.839475 9.606205 10.762764 15.46430 100
whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271 100
>
> (z1==z2) & (z2==z3[,.(x,t)])
x t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE
Thank you! That's very intuitive.
– Caleb
Mar 8 at 15:05
Made an update for the fastest approach -- a simple for() loop, it seems!
– Soren
Mar 8 at 15:08
Thx! 😊 It's indeed fast, just not that convenient to use a for/while loop inside a data.table, especially with groups?
– Caleb
Mar 8 at 16:17
edited Mar 8 at 15:06
answered Mar 8 at 12:05
– Soren
You can use dplyr::mutate and filter:
z %>%
  mutate(diff = lead(x, 1) - x) %>%
  filter(diff >= 0.5 | is.na(diff)) %>%
  select(-diff)
I kept the diff field for ease of understanding; you can also do this in a single filter statement.
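The single-filter version mentioned above could look like this (a sketch with the same fixed-lag logic; z is rebuilt here so the snippet is self-contained):

```r
library(dplyr)
library(data.table)

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                t = as.Date(c(27, 32:34, 36:41), origin = "1970-01-01"))

# One filter(): keep a row when the next row is at least 0.5 above it,
# or when there is no next row.
res <- z %>% filter(lead(x) - x >= 0.5 | is.na(lead(x)))
res
```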
It's not giving the desired results.
– tmfmnk
Mar 8 at 11:15
Why should row 7 not be removed?
– Sonny
Mar 8 at 11:17
I think the OP is thinking about a solution that removes a row and then compares the next subsequent row with the last non-removed row.
– tmfmnk
Mar 8 at 11:18
This does not work because row #7 would be removed, but I need to keep it. I've tried to calculate the 1st-to-Nth differences and generate a tag labelling the rows qualified for removal, but that is very tedious and inefficient for a huge dataset (about 8 GB).
– Caleb
Mar 8 at 11:20
You said "between any two rows nearby", so should it only apply for +/- 2 rows?
– Sonny
Mar 8 at 11:21
answered Mar 8 at 11:13
– Sonny
Error in as.Date.numeric(c(27, 32:34, 36:41)) : 'origin' must be supplied
– NelsonGon
Mar 8 at 11:36
@NelsonGon try : as.Date.numeric(c(27, 32:34, 36:41),origin="1970-01-01")
– Soren
Mar 8 at 12:19
Thanks @Soren for that.
– NelsonGon
Mar 8 at 12:20
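For reference, the fix from the comment in runnable form: as.Date() on a numeric vector needs an origin, and the values are then interpreted as day offsets from it.

```r
# Day offsets from the origin: 27 days after 1970-01-01 is 1970-01-28,
# matching the first t value in the results tables above.
d <- as.Date(c(27, 32:34, 36:41), origin = "1970-01-01")
d[1]
```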